Request delay 200ms because of MTU setting in IPVS load balancer with IPIP forwarding method (2024-12-31)

Background

IPVS can be used as a load balancer to provide virtual services, and it is common to use the IPIP protocol to preserve client IPs. The typical traffic flow is as follows:

The client pod has an Envoy sidecar injected.

Recently, some customers have reported that their requests have a high timeout rate after moving to this network flow.

  1. The client creates a new connection to the VIP 10.0.0.2:443 with the client IP 10.0.0.1 and port 12345 (using 12345 as an example). The connection is 10.0.0.1:12345 → 10.0.0.2:443.

  2. Iptables rules intercept the request and redirect the traffic to Envoy's outbound port 15001, changing the connection's destination IP and port to 127.0.0.1 and 15001. The connection is now 10.0.0.1:12345 → 127.0.0.1:15001.

  3. After the TCP handshake succeeds, the client sends a TLS ClientHello on the established connection. The Envoy sidecar receives the request from the loopback (lo) interface. Since the outbound traffic is passthrough, Envoy does not perform any checks at the HTTP/HTTPS layer and creates a new connection to the original destination 10.0.0.2. The request is forwarded to the VIP with the connection 10.0.0.1:54321 → 10.0.0.2:443.

  4. The load balancer uses IPVS with the IPIP packet forwarding method and has an IPVS rule like the one illustrated below.
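
     An illustrative ipvsadm configuration of such a rule (the scheduler and weights here are assumptions; only the VIP 10.0.0.2:443, the real server 10.0.0.6, and the tunnel forwarding method come from the flow above):

     ipvsadm -A -t 10.0.0.2:443 -s rr
     ipvsadm -a -t 10.0.0.2:443 -r 10.0.0.6:443 -i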

  5. The TCP SYN packet from 10.0.0.1:54321 to 10.0.0.2:443 is forwarded to a real server with an IPIP tunnel header: the outer IP header has the source IP 10.0.0.5 (the load balancer's IP) and the destination IP 10.0.0.6 (the real server's IP), while the inner IP/TCP header remains unchanged.

  6. After receiving the TCP SYN packet for 10.0.0.1:54321 → 10.0.0.2:443, the real server strips the outer IP header, processes the request, and sends the TCP SYN-ACK directly to the client, bypassing the load balancer.

The Envoy sidecar on the client side and the load balancer only handle the request at the TCP layer, so the TLS and HTTP interaction happens directly between the client and the server.

Tcpdump Analysis

We used tcpdump to check why the request was delayed; here is the capture result:

This is the tcpdump file captured from the 'lo' and 'eth0' interfaces on the client side.

  • Packets numbered 257571 to 257797 are TCP and TLS handshake packets.
  • Packet number 257858 is sent from the client and received on the ‘lo’ interface.
  • Instead of this packet being forwarded directly to 10.0.0.2, four ICMP packets with source IP 10.0.0.1 and destination IP 10.0.0.1 were captured. The details of these ICMP packets are examined below.

  • Packet number 259044 is the first packet sent from 10.0.0.1 to 10.0.0.2. The timestamp is 09:05:53.638466, indicating a delay of approximately 207.76 ms compared to the timestamp 09:05:53.430710 of packet number 257858.

Strangely, another request made at almost the same time exhibited different behavior.

  • Packets numbered 257563 to 257776 are the TCP and TLS handshake packets.
  • Packet number 257809 is the packet sent from the client and received on the loopback (lo) interface.
  • Then, packet number 257812 is the packet that forwards the contents of packet number 257809 to 10.0.0.2.
  • Then, a destination unreachable ICMP packet was received from source IP 10.0.0.2 to client 10.0.0.1 (packet number 257815).
  • Packet number 257819 is the retransmission of packet number 257812, with only a 0.1 ms time difference between 09:05:53.430290 and 09:05:53.430181.

This request (referred to as request A below) occurred slightly earlier than the previously described request (referred to as request B below), which is the one that encountered the delay. It seems that packet number 257815 triggered something in the TCP stack that ultimately caused the 200 ms delay for request B.

ICMP destination unreachable packet

This error message occurs when the size of a packet exceeds the Maximum Transmission Unit (MTU) of the interface that needs to route it. Since IPVS adds an extra outer IP header for IPIP tunneling, it increases the packet size by 20 bytes. If the packet is marked with the Don't Fragment (DF) flag and the resulting length would exceed the MTU, IPVS sends an ICMP Destination Unreachable (Fragmentation Needed) message back to the client.

ip_vs_xmit.c

static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
					  int rt_mode,
					  struct ip_vs_iphdr *ipvsh,
					  struct sk_buff *skb, int mtu)
{
#ifdef CONFIG_IP_VS_IPV6
	if (skb_af == AF_INET6) {
		struct net *net = ipvs->net;

		if (unlikely(__mtu_check_toobig_v6(skb, mtu))) {
			if (!skb->dev)
				skb->dev = net->loopback_dev;
			/* only send ICMP too big on first fragment */
			if (!ipvsh->fragoffs && !ip_vs_iph_icmp(ipvsh))
				icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
			IP_VS_DBG(1, "frag needed for %pI6c\n",
				  &ipv6_hdr(skb)->saddr);
			return false;
		}
	} else
#endif
	{
		/* If we're going to tunnel the packet and pmtu discovery
		 * is disabled, we'll just fragment it anyway
		 */
		if ((rt_mode & IP_VS_RT_MODE_TUNNEL) && !sysctl_pmtu_disc(ipvs))
			return true;

		if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
			     skb->len > mtu && !skb_is_gso(skb) &&
			     !ip_vs_iph_icmp(ipvsh))) {
			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
				  htonl(mtu));
			IP_VS_DBG(1, "frag needed for %pI4\n",
				  &ip_hdr(skb)->saddr);
			return false;
		}
	}

	return true;
}

The details of this ICMP packet also include the MTU information, which is 1480.

The MTU is 1500 at each hop in the entire traffic flow. However, the IPIP tunnel requires an additional 20-byte IP header, which is why IPVS asks the client to send packets that fit an MTU of 1480.

If we change the MTU of the interface or the route to 1480 on either the client or the server side, this issue can be resolved; example commands are shown below. But a few things were still not clear, and they are examined in the questions that follow.
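For reference, the workaround can be applied with standard iproute2 commands along these lines (the interface name and next hop are placeholders for this environment):

ip link set dev eth0 mtu 1480
# or lower the MTU only on the route towards the VIP
ip route replace 10.0.0.2/32 via 10.x.x.x dev eth0 mtu 1480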

Question 1: How does the ICMP destination unreachable packet impact the client behavior?

Let’s examine the source code to understand how Linux handles ICMP Destination Unreachable (FRAG NEEDED) packets.

You can find the relevant code here:

tcp_ipv4.c icmp.c

Upon receiving an ICMP Unreachable packet, the system begins to adjust the MTU through the icmp_socket_deliver() function, which calls tcp_v4_err().

	if (code == ICMP_FRAG_NEEDED) { /* PMTU discovery (RFC1191) */
		/* We are not interested in TCP_LISTEN and open_requests
		 * (SYN-ACKs send out by Linux are always <576bytes so
		 * they should go through unfragmented).
		 */
		if (sk->sk_state == TCP_LISTEN)
			goto out;

		WRITE_ONCE(tp->mtu_info, info);
		if (!sock_owned_by_user(sk)) {
			tcp_v4_mtu_reduced(sk);
		} else {
			if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &sk->sk_tsq_flags))
				sock_hold(sk);
		}
		goto out;
	}

If the socket is not held by the user, the MTU is reduced immediately via tcp_v4_mtu_reduced(). If the socket is held by the user, the TCP small-queue flag TCP_MTU_REDUCED_DEFERRED is set instead. Once the socket is released, tcp_release_cb() processes the deferred work and tcp_v4_mtu_reduced() is invoked.

You can find more details here:

tcp_output.c

So after the ICMP packet has been handled, the route MTU is changed to 1480, and the Linux kernel keeps this cached MTU for 600 seconds:

ip route get 10.0.0.2

10.0.0.2 via 10.x.x.x dev eth0 src 10.0.0.1 uid 0 
    cache 421
net.ipv4.route.mtu_expires = 600
net.ipv6.route.mtu_expires = 600
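If you do not want to wait for the cached entry to expire, it can usually be dropped manually with a standard iproute2 command:

ip route flush cache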

Question 2: Why are there multiple ICMP Destination Unreachable (FRAG NEEDED) packets with the source IP equal to the destination IP?

Let’s trace the kernel stack of the function icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)). The icmp_send() function in the Linux kernel is responsible for sending this ICMP packet. By tracing it, we can understand how this is triggered:

        __icmp_send
        __ip_finish_output
        ip_finish_output
        ip_output
        ip_local_out
        __ip_queue_xmit
        ip_queue_xmit
        __tcp_transmit_skb
        tcp_write_xmit
        __tcp_push_pending_frames
        tcp_push
        tcp_sendmsg_locked
        tcp_sendmsg
        inet_sendmsg
        sock_sendmsg
        sock_write_iter
        do_iter_readv_writev
        do_iter_write
        vfs_writev
        do_writev
        __x64_sys_writev
        do_syscall_64
        entry_SYSCALL_64_after_hwframe

From the call stack, we can see that it checks the route MTU and attempts to perform IP fragmentation. If the DF (Don’t Fragment) bit is set in the IP header, the kernel will send an ICMP packet with type 3 (ICMP_DEST_UNREACH) and code 4 (ICMP_FRAG_NEEDED).

There were two requests occurring almost simultaneously. The first request triggered the ICMP Destination Unreachable packet from the load balancer, which changed the route MTU to 1480. The second request had negotiated its MSS based on an MTU of 1500, but the route MTU had already been changed to 1480 by then. Thus, when the packet length exceeds the new route MTU of 1480, an ICMP packet with type 3 (ICMP_DEST_UNREACH) and code 4 (ICMP_FRAG_NEEDED) is generated locally by the kernel, with the source IP equal to the destination IP.

If we examine the statistics in the Linux kernel using netstat, we can see some relevant data:

Icmp:
    6957 ICMP messages received
    22 input ICMP message failed
    ICMP input histogram:
        destination unreachable: 6940
        timeout in transit: 17
    3601 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 65
        time exceeded: 3536

The “destination unreachable: 6940” in the ICMP input histogram counts the received ICMP type 3 (ICMP_DEST_UNREACH), code 4 (ICMP_FRAG_NEEDED) messages, typically for cases like request A. The “destination unreachable: 65” in the ICMP output histogram counts the sent ICMP type 3 (ICMP_DEST_UNREACH), code 4 (ICMP_FRAG_NEEDED) messages, typically for cases like request B.

Question 3: Why does the ICMP packet not trigger TCP retransmission?

In Question 1, we saw how the Linux kernel handles the ICMP packet with type 3 (ICMP_DEST_UNREACH) and code 4 (ICMP_FRAG_NEEDED): it is handled in tcp_v4_mtu_reduced(struct sock *sk), either directly (tcp_ipv4.c) or deferred until the socket is released (tcp.c, sock.c, tcp_output.c), depending on whether the socket is held by the user (tcp_ipv4.c).

From the tcpdump, request A triggers TCP retransmission, which helps reduce delay duration. However, request B does not trigger TCP retransmission, resulting in a delay of more than 200ms.

To understand how the ICMP packet triggered by request B is handled and why it experiences a delay of over 200ms, we need to examine how this ICMP packet is processed in the Linux kernel:

        b'tcp_v4_err+0x1'
        b'icmp_unreach+0x91'
        b'icmp_rcv+0x19f'
        b'ip_protocol_deliver_rcu+0x1da'
        b'ip_local_deliver_finish+0x48'
        b'ip_local_deliver+0xf3'
        b'ip_rcv+0x16b'
        b'__netif_receive_skb_one_core+0x86'
        b'__netif_receive_skb+0x15'
        b'process_backlog+0x9e'
        b'__napi_poll+0x33'
        b'net_rx_action+0x126'
        b'__do_softirq+0xd9'
        b'do_softirq+0x75'
        b'__local_bh_enable_ip+0x50'
        b'__icmp_send+0x55a'
        b'ip_fragment.constprop.0+0x7a'
        b'__ip_finish_output+0x13d'
        b'ip_finish_output+0x2e'
        b'ip_output+0x78'
        b'ip_local_out+0x5a'
        b'__ip_queue_xmit+0x180'
        b'ip_queue_xmit+0x15'
        b'__tcp_transmit_skb+0x8d9'
        b'tcp_write_xmit+0x3a7'
        b'__tcp_push_pending_frames+0x37'
        b'tcp_push+0xd2'
        b'tcp_sendmsg_locked+0x87f'
        b'tcp_sendmsg+0x2d'
        b'inet_sendmsg+0x43'
        b'sock_sendmsg+0x5e'
        b'sock_write_iter+0x93'
        b'do_iter_readv_writev+0x14d'
        b'do_iter_write+0x88'
        b'vfs_writev+0xaa'
        b'do_writev+0xe5'
        b'__x64_sys_writev+0x1c'
        b'do_syscall_64+0x5c'
        b'entry_SYSCALL_64_after_hwframe+0x44'

Upon further tracing, in the function tcp_v4_err() the packet should be handled by tcp_v4_mtu_reduced(). However, this ICMP arrives while we are still in the send context, meaning the socket is still held by userspace, so tcp_v4_mtu_reduced() is not called directly.

tcp_v4_mtu_reduced() is invoked when the socket is released by release_sock(). In tcp_v4_mtu_reduced(), tcp_simple_retransmit() is called (tcp_input.c), but the packet is not sent out, so TCP retransmission is not triggered in this scenario, unlike for request A.

Upon further tracing, in this condition the packet is only sent again when the TCP probe timer fires, resulting in a delay of more than 200 ms (the probe timer is driven by the retransmission timeout, whose floor TCP_RTO_MIN is 200 ms, which lines up with the observed ~207 ms delay):

        b'tcp_v4_send_check+0x1'
        b'tcp_write_wakeup+0x120'
        b'tcp_send_probe0+0x1d'
        b'tcp_probe_timer.constprop.0+0x17e'
        b'tcp_write_timer_handler+0x79'
        b'tcp_write_timer+0x9e'
        b'call_timer_fn+0x2b'
        b'__run_timers.part.0+0x1dd'
        b'run_timer_softirq+0x2a'
        b'__do_softirq+0xd9'
        b'irq_exit_rcu+0x8c'
        b'sysvec_apic_timer_interrupt+0x7c'
        b'asm_sysvec_apic_timer_interrupt+0x12'
        b'cpuidle_enter_state+0xd9'
        b'cpuidle_enter+0x2e'
        b'cpuidle_idle_call+0x13e'
        b'do_idle+0x83'
        b'cpu_startup_entry+0x20'
        b'start_secondary+0x12a'
        b'secondary_startup_64_no_verify+0xc2'

Question 4: If the route MTU cache changes on the fly, does it impact connections that are already established?

From the tcpdump files for request A and request B, we can observe that this impacts new connections. But does it affect connections that are already established and whose MSS was negotiated with an MTU of 1480?

Unfortunately, existing traffic will be impacted even if the MSS is negotiated with an MTU of 1480.

The MSS is periodically checked during message sending in the TCP stack: tcp_output.c

If the route MTU expires (the route MTU is cached for 600 seconds by default), the MSS will change back from 1440 (MTU 1480) to 1460 (MTU 1500). Consequently, subsequent requests may also experience the 200ms delay issue.

curl errors (2024-02-12)

connection reset
curl -Svv -k https://xxx.yyy.com:443/mysql/MysqlStatus


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Host https://xxx.yyy.com:443 was resolved.
* IPv6: (none)
* IPv4: 10.10.10.205
*   Trying 10.10.10.205:443...
* Connected to xxx.yyy.com (10.10.10.205) port 443
* ALPN: curl offers h2,http/1.1
} [5 bytes data]
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
} [512 bytes data]
  0     0    0     0    0     0      0      0 --:--:--  0:00:09 --:--:--     0* Recv failure: Connection reset by peer
* OpenSSL SSL_connect: Connection reset by peer in connection to 10.10.10.205:443
  0     0    0     0    0     0      0      0 --:--:--  0:00:10 --:--:--     0
* Closing connection
curl: (35) Recv failure: Connection reset by peer

connection refused

curl -ivf https://xxx.yyy.com:443/mysql/MysqlStatus
*   Trying 10.10.10.205...
* TCP_NODELAY set
* Connection failed
* connect to 10.10.10.205 port 443 failed: Connection refused
* Failed to connect to xxx.yyy.com port 443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to xxx.yyy.com port 443: Connection refused
curl -ivf http://xxx.yyy.com
*   Trying 10.10.10.205...
* TCP_NODELAY set
* Connection failed
* connect to 10.10.10.205 port 80 failed: Connection refused
* Failed to connect to xxx.yyy.com port 80: Connection refused
* Closing connection 0
curl: (7) Failed to connect to xxx.yyy.com port 80: Connection refused

connection timeout

curl -skv --connect-timeout 10 --max-time 30  http://10.10.10.205:9091/metrics ; date
Fri Oct 21 02:52:11 UTC 2022
*   Trying 10.10.10.205:9091...
* After 10000ms connect time, move on!
* connect to 10.10.10.205 port 9091 failed: Connection timed out
* Connection timeout after 10001 ms
* Closing connection 0

connection can’t assign requested address

curl -vvv http://10.10.10.205:80/metrics

*   Trying 10.10.10.205:80...
* TCP_NODELAY set
* Immediate connect fail for 10.10.10.205: Cannot assign requested address
* Closing connection 0
curl: (7) Couldn't connect to server
command terminated with exit code 7

cilium pwru implementation (2022-11-26)

Introduction

Project link and README: https://github.com/cilium/pwru/

Implementation

eBPF progs

pwru defines five eBPF programs, one for each argument position in which the struct sk_buff argument can appear:

https://github.com/cilium/pwru/blob/v0.0.6/bpf/kprobe_pwru.c#L383-L416

The other functions are:

  • metadata related

    https://github.com/cilium/pwru/blob/v0.0.6/bpf/kprobe_pwru.c#L24-L32

  • filter related

    https://github.com/cilium/pwru/blob/v0.0.6/bpf/kprobe_pwru.c#L266

  • output related

    https://github.com/cilium/pwru/blob/v0.0.6/bpf/kprobe_pwru.c#L344

Filter functions and args

  • Get the functions that can be kprobed from ‘/sys/kernel/debug/tracing/available_filter_functions’ (only the function names, no argument info); a simplified sketch of this step is shown after this list

  • Get the functions of kernel modules from /sys/kernel/btf/<modules>

  • Get the functions from vmlinux

  • Get the kprobe-able functions that take a struct sk_buff argument, together with the argument’s index

  • Attach the BPF programs to these functions with kprobes
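As a rough illustration of the first step, here is a minimal Go sketch that reads available_filter_functions and filters by name (a simplification for illustration; pwru's real implementation also resolves argument types via BTF):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// List kprobe-able kernel functions whose name contains a given substring.
// Only function names are available here; no argument information.
func main() {
	f, err := os.Open("/sys/kernel/debug/tracing/available_filter_functions")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Each line is "func_name [module]"; keep only the symbol name.
		fields := strings.Fields(scanner.Text())
		if len(fields) == 0 {
			continue
		}
		if strings.Contains(fields[0], "ip_rcv") {
			fmt.Println(fields[0])
		}
	}
}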

pwru usage

pwru uses kprobes to hook more than 500 functions on the system if no function filter is specified, which can be shown with bpftool perf list.

It has a performance impact and should not be used in production environments with high traffic.

cilium hubble implementation (2022-11-26)

Hubble event generation

DataPlane

The datapath calls the APIs send_drop_notify_error(), send_trace(), send_drop_notify(), and cilium_dbg() to send events to the perf buffer cilium_event. cilium_event is a perf buffer of 64 pages per CPU. For the difference between perf buffers and ring buffers, see https://nakryiko.com/posts/bpf-ringbuf/

static __always_inline void
send_trace_notify(struct __ctx_buff *ctx, enum trace_point obs_point,
		  __u32 src, __u32 dst, __u16 dst_id, __u32 ifindex,
		  enum trace_reason reason, __u32 monitor)
{
	__u64 ctx_len = ctx_full_len(ctx);
	__u64 cap_len = min_t(__u64, monitor ? : TRACE_PAYLOAD_LEN,
			      ctx_len);
	struct trace_notify msg __align_stack_8;

	update_trace_metrics(ctx, obs_point, reason);

	if (!emit_trace_notify(obs_point, monitor))
		return;

	msg = (typeof(msg)) {
		__notify_common_hdr(CILIUM_NOTIFY_TRACE, obs_point),
		__notify_pktcap_hdr(ctx_len, (__u16)cap_len),
		.src_label	= src,
		.dst_label	= dst,
		.dst_id		= dst_id,
		.reason		= reason,
		.ifindex	= ifindex,
	};
	memset(&msg.orig_ip6, 0, sizeof(union v6addr));

	ctx_event_output(ctx, &EVENTS_MAP,
			 (cap_len << 32) | BPF_F_CURRENT_CPU,
			 &msg, sizeof(msg));
}

ControlPlane

The Cilium agent calls the API daemon.SendNotification on endpoint add/delete and policy add/delete.

// SendNotification sends an agent notification to the monitor
func (d *Daemon) SendNotification(notification monitorAPI.AgentNotifyMessage) error {
	if option.Config.DryMode {
		return nil
	}
	return d.monitorAgent.SendEvent(monitorAPI.MessageTypeAgent, notification)
}

hubble agent

The hubble agent registers listeners and consumers of the events, and then starts a goroutine to handle the events. This goroutine only handles events from the data plane; control plane events are sent to the listeners and consumers directly.

// startPerfReaderLocked starts the perf reader. This should only be
// called if there are no other readers already running.
// The goroutine is spawned with a context derived from m.Context() and the
// cancelFunc is assigned to perfReaderCancel. Note that cancelling m.Context()
// (e.g. on program shutdown) will also cancel the derived context.
// Note: it is critical to hold the lock for this operation.
func (a *Agent) startPerfReaderLocked() {
	if a.events == nil {
		return // not attached to events map yet
	}

	a.perfReaderCancel() // don't leak any old readers, just in case.
	perfEventReaderCtx, cancel := context.WithCancel(a.ctx)
	a.perfReaderCancel = cancel
	go a.handleEvents(perfEventReaderCtx)
}

It gets the events from the perf buffer cilium_events and then starts to handle them.

These events include lost events and normal events. The agent sends the events to the listeners and consumers.

In this step, events are still in raw data format.

hubble consumer

hubble observer

This consumer is only enabled when Hubble is enabled.

func (d *Daemon) launchHubble() {
    ...
	d.hubbleObserver, err = observer.NewLocalServer(payloadParser, logger,
		observerOpts...,
	)
	if err != nil {
		logger.WithError(err).Error("Failed to initialize Hubble")
		return
	}
	go d.hubbleObserver.Start()
	d.monitorAgent.RegisterNewConsumer(monitor.NewConsumer(d.hubbleObserver))
    ...
}

hubble recorder

This consumer is enabled by config option.Config.EnableRecorder && option.Config.EnableHubbleRecorderAPI

func (d *Daemon) launchHubble() {
    ...
	if option.Config.EnableRecorder && option.Config.EnableHubbleRecorderAPI {
		dispatch, err := sink.NewDispatch(option.Config.HubbleRecorderSinkQueueSize)
		if err != nil {
			logger.WithError(err).Error("Failed to initialize Hubble recorder sink dispatch")
			return
		}
		d.monitorAgent.RegisterNewConsumer(dispatch)
		svc, err := recorder.NewService(d.rec, dispatch,
			recorderoption.WithStoragePath(option.Config.HubbleRecorderStoragePath))
		if err != nil {
			logger.WithError(err).Error("Failed to initialize Hubble recorder service")
			return
		}
		localSrvOpts = append(localSrvOpts, serveroption.WithRecorderService(svc))
	}
    ...
}

hubble listener

	// We can only attach the monitor agent once cilium_event has been set up.
	if option.Config.RunMonitorAgent {
		err = d.monitorAgent.AttachToEventsMap(defaults.MonitorBufferPages)
		if err != nil {
			log.WithError(err).Error("encountered error configuring run monitor agent")
			return nil, nil, fmt.Errorf("encountered error configuring run monitor agent: %w", err)
		}

		if option.Config.EnableMonitor {
			err = monitoragent.ServeMonitorAPI(d.monitorAgent)
			if err != nil {
				log.WithError(err).Error("encountered error configuring run monitor agent")
				return nil, nil, fmt.Errorf("encountered error configuring run monitor agent: %w", err)
			}
		}
	}
// ServeMonitorAPI serves the Cilium 1.2 monitor API on a unix domain socket.
// This method starts the server in the background. The server is stopped when
// monitor.Context() is cancelled. Each incoming connection registers a new
// listener on monitor.
func ServeMonitorAPI(monitor *Agent) error {
	listener, err := buildServer(defaults.MonitorSockPath1_2)
	if err != nil {
		return err
	}

	s := &server{
		listener: listener,
		monitor:  monitor,
	}

	log.Infof("Serving cilium node monitor v1.2 API at unix://%s", defaults.MonitorSockPath1_2)

	go s.connectionHandler1_2(monitor.Context())

	return nil
}

// connectionHandler1_2 handles all the incoming connections and sets up the
// listener objects. It will block until ctx is cancelled.
func (s *server) connectionHandler1_2(ctx context.Context) {
	go func() {
		<-ctx.Done()
		s.listener.Close()
	}()

	for !isCtxDone(ctx) {
		conn, err := s.listener.Accept()
		switch {
		case isCtxDone(ctx):
			if conn != nil {
				conn.Close()
			}
			return
		case err != nil:
			log.WithError(err).Warn("Error accepting connection")
			continue
		}

		newListener := newListenerv1_2(conn, option.Config.MonitorQueueSize, s.monitor.RemoveListener)
		s.monitor.RegisterNewListener(newListener)
	}
}

events handling by consumer

hubble observer

  • Start a goroutine to handle the events

      go d.hubbleObserver.Start()
    
  • Get the events from the channel; the events are sent from the hubble agent
  • Call OnMonitorEvent to run the hooks before decoding the events. For this consumer there is actually nothing to handle here.

            for _, f := range s.opts.OnMonitorEvent {
                stop, err := f.OnMonitorEvent(ctx, monitorEvent)
                if err != nil {
                    s.log.WithError(err).WithField("event", monitorEvent).Info("failed in OnMonitorEvent")
                }
                if stop {
                    continue nextEvent
                }
            }
    
  • Decode the events

    • perf event

    • dbg events: add endpoint info by IP

    • l34 events: add L3/L4 metadata to the events. The metadata includes, but is not limited to, endpoint info, pod info, and the 5-tuple.

    • agent event
      - Handle the message from L7, then decode the L7 event and add metadata
      - Handle the message from the monitor agent

    • lost event

    • decoded flows: run the OnDecodedFlow hooks to add metrics.

    ```
    
        if flow, ok := ev.Event.(*flowpb.Flow); ok {
            for _, f := range s.opts.OnDecodedFlow {
                stop, err := f.OnDecodedFlow(ctx, flow)
                if err != nil {
                    s.log.WithError(err).WithField("event", monitorEvent).Info("failed in OnDecodedFlow")
                }
                if stop {
                    continue nextEvent
                }
            }
    
            atomic.AddUint64(&s.numObservedFlows, 1)
        }
    
    ```
    
  • Call OnDecodedEvent() to execute the hooks after the event has been decoded

           for _, f := range s.opts.OnDecodedEvent {
               stop, err := f.OnDecodedEvent(ctx, ev)
               if err != nil {
                   s.log.WithError(err).WithField("event", ev).Info("failed in OnDecodedEvent")
               }
               if stop {
                   continue nextEvent
               }
           }
    

hubble recorder

  • Get the request from the client, and then start to record

             startRecording := req.GetStart()
             if startRecording == nil {
                 return fmt.Errorf("received invalid request %q, expected start request", req)
             }
    
             // The startRecording helper spawns a clean up go routine to remove all
             // state associated with this recording when the context ctx is cancelled.
             recording, filePath, err = s.startRecording(ctx, startRecording)
             if err != nil {
                 return err
             }
    
  • Create the pcap file, get the events from queue and then send response

     ```
     func (s *Service) startRecording(
     ctx context.Context,
     req *recorderpb.StartRecording,
     ) (handle *sink.Handle, filePath string, err error) {
     ---
     filters, err := parseFilters(req.GetInclude())
     if err != nil {
         return nil, "", err
     }
     ---
     var f *os.File
     f, filePath, err = createPcapFile(s.opts.StoragePath, prefix)
     if err != nil {
         return nil, "", err
     }
     ---
     handle, err = s.dispatch.StartSink(ctx, config)
     if err != nil {
         return nil, "", err
     }
     ---
     }
    
     ```
    
     ``` 
     func startSink(ctx context.Context, p PcapSink, queueSize int) *sink {
     ---
     for {
     select {
     // s.queue will be closed when the sink is unregistered
     case rec := <-s.queue:
     pcapRecord := pcap.Record{
     Timestamp:      rec.timestamp,
     CaptureLength:  rec.inclLen,
     OriginalLength: rec.origLen,
     }
    
     if err = p.Writer.WriteRecord(pcapRecord, rec.data); err != nil {
                     return
                 }
    
                 stats := s.addToStatistics(Statistics{
                     PacketsWritten: 1,
                     BytesWritten:   uint64(rec.inclLen),
                 })
                 if (stop.PacketsCaptured > 0 && stats.PacketsWritten >= stop.PacketsCaptured) ||
                     (stop.BytesCaptured > 0 && stats.BytesWritten >= stop.BytesCaptured) {
                     return
                 }
             case <-s.shutdown:
                 return
             case <-stopAfter:
                 // duration of stop condition has been reached
                 return
             case <-ctx.Done():
                 err = ctx.Err()
                 return
             }
         }
     ---
     }
     ```
    

events handling by listener

  • ServeMonitorAPI() accepts the monitor request and creates a listener for each incoming connection (see ServeMonitorAPI() and connectionHandler1_2() quoted above under “hubble listener”).

  • drain the queue
func newListenerv1_2(c net.Conn, queueSize int, cleanupFn func(listener.MonitorListener)) *listenerv1_2 {
	ml := &listenerv1_2{
		conn:      c,
		queue:     make(chan *payload.Payload, queueSize),
		cleanupFn: cleanupFn,
	}

	go ml.drainQueue()

	return ml
}

In drainQueue(), the events are encoded and sent to the listener’s connection; if encoding fails or the listener disconnects, the listener is removed.

// drainQueue encodes and sends monitor payloads to the listener. It is
// intended to be a goroutine.
func (ml *listenerv1_2) drainQueue() {
	defer func() {
		ml.cleanupFn(ml)
	}()

	enc := gob.NewEncoder(ml.conn)
	for pl := range ml.queue {
		if err := pl.EncodeBinary(enc); err != nil {
			switch {
			case listener.IsDisconnected(err):
				log.Debug("Listener disconnected")
				return

			default:
				log.WithError(err).Warn("Removing listener due to write failure")
				return
			}
		}
	}
}

hubble observer vs cilium monitor

The hubble observer has metadata such as endpoint-related info.

client

  • The hubble observe client sends requests to the hubble observer gRPC server in the cilium agent. Filters are applied on the gRPC server side.

  • cilium monitor gets the events from the agent listener, then adds the field names to the events on the client side.

Hubble functions

kretprobe poor performance (2022-11-18)

Issue

Recently I hit a kretprobe performance issue in our production environment.

We wanted to use a kretprobe to hook the kernel function ipt_do_table() and get its return value, which is a verdict indicating whether the packet is ACCEPTed, DROPped, or STOLEN by iptables.

But after the eBPF program was deployed, some of the nodes had high softirq (si) CPU usage.

%Cpu(s): 11.8 us,  4.7 sy,  0.0 ni, 49.3 id,  0.0 wa,  0.0 hi, 34.2 si,  0.0 st
MiB Mem : 385361.6 total, 212019.8 free, 122891.7 used,  50450.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 260467.6 avail Mem 

Ping latency could sometimes go up to 100 ms.

Check what the CPUs are busy with

From the flame graph, we can see that the CPU is busy handling the kretprobe hooks.
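A flame graph like this can be produced with standard perf plus the FlameGraph scripts (illustrative commands; not necessarily how the original graph was generated):

perf record -F 99 -a -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > kretprobe.svg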

We are using Ubuntu 20.04, which runs a 5.4.0 kernel.

In pre_handler_kretprobe(), the kernel will:

  • get a free kretprobe instance and bind it to the current task
  • execute the entry handler; if the handler rejects the probe, the instance is put back on the free list
  • call arch_prepare_kretprobe() to replace the return address in the registers
/*
 * This kprobe pre_handler is registered with every kretprobe. When probe
 * hits it will set up the return probe.
 */
static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
{
	struct kretprobe *rp = container_of(p, struct kretprobe, kp);
	unsigned long hash, flags = 0;
	struct kretprobe_instance *ri;

	/*
	 * To avoid deadlocks, prohibit return probing in NMI contexts,
	 * just skip the probe and increase the (inexact) 'nmissed'
	 * statistical counter, so that the user is informed that
	 * something happened:
	 */
	if (unlikely(in_nmi())) {
		rp->nmissed++;
		return 0;
	}

	/* TODO: consider to only swap the RA after the last pre_handler fired */
	hash = hash_ptr(current, KPROBE_HASH_BITS);
	raw_spin_lock_irqsave(&rp->lock, flags);
	if (!hlist_empty(&rp->free_instances)) {
		ri = hlist_entry(rp->free_instances.first,
				struct kretprobe_instance, hlist);
		hlist_del(&ri->hlist);
		raw_spin_unlock_irqrestore(&rp->lock, flags);

		ri->rp = rp;
		ri->task = current;

		if (rp->entry_handler && rp->entry_handler(ri, regs)) {
			raw_spin_lock_irqsave(&rp->lock, flags);
			hlist_add_head(&ri->hlist, &rp->free_instances);
			raw_spin_unlock_irqrestore(&rp->lock, flags);
			return 0;
		}

		arch_prepare_kretprobe(ri, regs);

		/* XXX(hch): why is there no hlist_move_head? */
		INIT_HLIST_NODE(&ri->hlist);
		kretprobe_table_lock(hash, &flags);
		hlist_add_head(&ri->hlist, &kretprobe_inst_table[hash]);
		kretprobe_table_unlock(hash, &flags);
	} else {
		rp->nmissed++;
		raw_spin_unlock_irqrestore(&rp->lock, flags);
	}
	return 0;
}
NOKPROBE_SYMBOL(pre_handler_kretprobe);

raw_spin_lock_irqsave() is called to acquire rp->lock, which is a global lock for that kretprobe.

That should be the reason why it performs poorly: every CPU hitting the probe contends on the same lock.

Improvement

Checking the 5.15 kernel, the implementation of the instance list has been changed to be lockless.

/*
 * This kprobe pre_handler is registered with every kretprobe. When probe
 * hits it will set up the return probe.
 */
static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
{
	struct kretprobe *rp = container_of(p, struct kretprobe, kp);
	struct kretprobe_instance *ri;
	struct freelist_node *fn;

	fn = freelist_try_get(&rp->freelist);
	if (!fn) {
		rp->nmissed++;
		return 0;
	}

	ri = container_of(fn, struct kretprobe_instance, freelist);

	if (rp->entry_handler && rp->entry_handler(ri, regs)) {
		freelist_add(&ri->freelist, &rp->freelist);
		return 0;
	}

	arch_prepare_kretprobe(ri, regs);

	__llist_add(&ri->llist, &current->kretprobe_instances);

	return 0;
}
NOKPROBE_SYMBOL(pre_handler_kretprobe);

Solution

  • Upgrade the kernel so that fexit is supported, and use fexit instead of kretprobe
  • Add a tracepoint to the ipt_do_table() function, then use a raw tracepoint instead of a kretprobe. After load testing, compared with kprobe/kretprobe, raw_tp and tp performance is much better.
tcp-shaker (2022-10-11)

Issues

Recently we met an issue related to TCP health checks. Our software load balancer needs to periodically health-check the backend VIP via TCP (DSR mode, so it has to check the VIP itself). Since it needs a specific source IP for the connection, it performs a bind() operation, but sometimes it hits the error "bind: address already in use". This is caused by source-port exhaustion: lots of health-check connections are stuck in TIME_WAIT and do not release their source ports. A sketch of this dialing pattern is shown below.
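For illustration, a minimal Go sketch of a health check that binds a specific source IP before connecting (the addresses are placeholders, not the real checker's code):

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Bind the health-check connection to a specific source IP; the kernel
	// assigns an ephemeral source port, which stays reserved while the
	// closed connection sits in TIME_WAIT.
	d := net.Dialer{
		LocalAddr: &net.TCPAddr{IP: net.ParseIP("10.0.0.1")}, // placeholder source IP
		Timeout:   2 * time.Second,
	}
	conn, err := d.Dial("tcp", "10.0.0.2:443") // placeholder VIP
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	conn.Close() // normal close -> FIN -> our side enters TIME_WAIT
	fmt.Println("health check ok")
}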

tcp-shaker

We need the source port to be released quickly.

One solution is:

  • Send the TCP SYN to a port the server does not listen on; the server then replies with an RST packet.

Another solution is to use tcp-shaker.

Implementation of tcp-shaker

  • Readme
In most cases when you establish a TCP connection(e.g. via net.Dial), these are the first three packets between the client and server(TCP three-way handshake):

Client -> Server: SYN
Server -> Client: SYN-ACK
Client -> Server: ACK
This package tries to avoid the last ACK when doing handshakes.

By sending the last ACK, the connection is considered established.

However, as for TCP health checking the server could be considered alive right after it sends back SYN-ACK,

that renders the last ACK unnecessary or even harmful in some cases.

  • tcp-shaker achieves this by setting the following socket options:

  • SO_LINGER with timeout=0

  • https://github.com/tevino/tcp-shaker/blob/master/socket_linux.go#L60

  • Disable TCP_QUICKACK

    https://github.com/tevino/tcp-shaker/blob/master/socket_linux.go#L53

Disabling TCP_QUICKACK makes the last ACK of the TCP handshake be held and not sent immediately. The maximum hold time is 200 ms (HZ/5 in the code) in Linux.

SO_LINGER with timeout=0 makes the close() (https://github.com/tevino/tcp-shaker/blob/master/checker_linux.go#L154) send out an RST to terminate the connection instead of a FIN.

Ref: https://stackoverflow.com/questions/3757289/when-is-tcp-option-so-linger-0-required
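A minimal Go sketch of these two socket options (a simplification for illustration, not tcp-shaker's actual code):

package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Create a TCP socket the way a checker would.
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, syscall.IPPROTO_TCP)
	if err != nil {
		panic(err)
	}
	defer syscall.Close(fd)

	// SO_LINGER with l_onoff=1 and l_linger=0: close() aborts the connection
	// with an RST instead of going through FIN/TIME_WAIT.
	if err := syscall.SetsockoptLinger(fd, syscall.SOL_SOCKET, syscall.SO_LINGER,
		&syscall.Linger{Onoff: 1, Linger: 0}); err != nil {
		panic(err)
	}

	// Disable TCP_QUICKACK so the kernel holds back the third ACK of the
	// handshake (up to ~200 ms), leaving time to close with RST before it is sent.
	if err := syscall.SetsockoptInt(fd, syscall.IPPROTO_TCP, syscall.TCP_QUICKACK, 0); err != nil {
		panic(err)
	}

	fmt.Println("socket configured for tcp-shaker style checking")
}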

This is a smart solution for handshake.

  • On the client side, the socket can be closed quickly and does not need to be kept in TIME_WAIT, so the source port can be released quickly.

  • On the server side, the socket is not in ESTABLISHED state since the TCP handshake has not finished, so it does not impact the application, which is blocked in accept().

  • If the RST is not received by the server, the server's socket retries sending the SYN-ACK to the client. What will the client do then? From the source code, no socket will be found for this packet, so the client sends an RST to the server again.

  • On the client, the source port can be reused, so it may use this source port to send a new SYN to the server. But if the server is still in TCP_SYN_RECV state because the RST was lost, what happens to this request? From the tcpdump, the server replies with an ACK for the old connection, and then the client sends a reset.

08:53:41.175810 IP 10.aa.bb.63.48528 > 10.9.yy.xx.10250: Flags [S], seq 3212740630, win 64240, options [mss 1460,sackOK,TS val 4150202543 ecr 0,nop,wscale 7], length 0
08:53:41.183790 IP 10.9.yy.xx.10250 > 10.aa.bb.63.48528: Flags [.], ack 1, win 509, options [nop,nop,TS val 4102712572 ecr 4150189740], length 0
08:53:41.183824 IP 10.aa.bb.63.48528 > 10.9.yy.xx.10250: Flags [R], seq 3012700204, win 0, length 0

So disabling TCP_QUICKACK is a key step, which can handle the situation where the RST packet is lost.

pid id in container (2022-09-02)

Question

I am working on making eBPF tools that can be triggered by users. The features provided to the user are:

  • The user can profile all the processes/tasks of a specific container
  • The user can profile any single process of a specific container

For the second feature, the user provides the container namespace/name and a process id inside the container. Since there is PID namespace isolation, we need to map the process id provided by the user to the corresponding id in the host PID namespace.

bpf helper function

There is a bpf helper function for pid mapping.

Context info:

https://man7.org/linux/man-pages/man7/bpf-helpers.7.html

https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#12-bpf_get_ns_current_pid_tgid

https://lore.kernel.org/bpf/20191017150032.14359-3-cneirabustos@gmail.com/

https://lwn.net/Articles/807741/

The API is:

       long bpf_get_ns_current_pid_tgid(u64 dev, u64 ino, struct
       bpf_pidns_info *nsdata, u32 size)
              Description
                     Returns 0 on success, values for pid and tgid as
                     seen from the current namespace will be returned in
                     nsdata.

              Return 0 on success, or one of the following in case of
                     failure:

                     -EINVAL if dev and inum supplied don't match dev_t
                     and inode number with nsfs of current task, or if
                     dev conversion to dev_t lost high bits.

                     -ENOENT if pidns does not exists for the current
                     task.

Test

  • How to get the dev and ino for a pid?
stat -L /proc/190579/ns/pid
  File: /proc/190579/ns/pid
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: 4h/4d	Inode: 4026533397  Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-09-01 19:54:44.448346607 -0700
Modify: 2022-09-01 19:54:44.448346607 -0700
Change: 2022-09-01 19:54:44.448346607 -0700
 Birth: -
  • Test code
#!/usr/bin/python
from bcc import BPF
from bcc.utils import printb
import sys, os
from stat import *

# define BPF program
prog = """
#include <linux/sched.h>
// define output data structure in C
struct data_t {
    u32 pid;
    u32 tgid;
    u64 ts;
    char comm[TASK_COMM_LEN];
};
BPF_PERF_OUTPUT(events);
int hello(struct pt_regs *ctx) {
    struct data_t data = {};
    struct bpf_pidns_info ns = {};
    if(bpf_get_ns_current_pid_tgid(DEV, INO, &ns, sizeof(struct bpf_pidns_info)))
        return 0;
    data.pid = ns.pid;
    data.tgid = ns.tgid;
    data.ts = bpf_ktime_get_ns();
    bpf_get_current_comm(&data.comm, sizeof(data.comm));
    events.perf_submit(ctx, &data, sizeof(data));
    return 0;
}
"""

devinfo = os.stat("/proc/71088/ns/pid")

print(devinfo.st_dev,devinfo.st_ino)

for r in (("DEV", str(devinfo.st_dev)), ("INO", str(devinfo.st_ino))):
    prog = prog.replace(*r)

# load BPF program
b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("write"), fn_name="hello")

# header
print("%-18s %-16s %-6s %-6s %s" % ("TIME(s)", "COMM", "PID", "TGID", "MESSAGE"))

# process event
start = 0


def print_event(cpu, data, size):
    global start
    event = b["events"].event(data)
    if start == 0:
        start = event.ts
    time_s = (float(event.ts - start)) / 1000000000
    printb(
        b"%-18.9f %-16s %-6d %-6d %s"
        % (time_s, event.comm, event.pid, event.tgid, b"Hello, perf_output!")
    )


# loop with callback to print_event
b["events"].open_perf_buffer(print_event)
while 1:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()
  • Test Result
root@xxxxx:~# python2 test.py 
(4, 4026536448)
TIME(s)            COMM             PID    TGID   MESSAGE
0.000000000        etcd             34     1      Hello, perf_output!
0.003570885        etcd             34     1      Hello, perf_output!
0.009293641        etcd             142    1      Hello, perf_output!
0.009399447        etcd             142    1      Hello, perf_output!
0.009768254        etcd             34     1      Hello, perf_output!
0.009982884        etcd             142    1      Hello, perf_output!
0.154406187        etcd             142    1      Hello, perf_output!
0.156417123        etcd             144    1      Hello, perf_output!
0.156780100        etcd             144    1      Hello, perf_output!
0.157019835        etcd             142    1      Hello, perf_output!
  • Login to the container and check for the process id
root@xxxxxx:~# crictl exec -it bae49309d79fc sh  
/ # ps -ef
PID   USER     TIME  COMMAND
    1 root      1h23 /usr/local/bin/etcd --data-dir=/var/etcd/data --name=etcd-events-0037 --initial-advertise-peer-urls=https://xxxxxx:2380 --listen-peer-urls=https://0.0.0.0:2380 --listen-clie
50615 root      0:00 sh
50621 root      0:00 ps -ef

Supported kernels

Linux 5.6+

cpu sys usage high (2022-08-29)

Phenomenon

When I logged into a node to check a network latency issue, I found a weird problem on one of the nodes.

  • When running top, two CPU cores always show high CPU usage

The sys usage keeps high.

Debug

  • Use perf to check what these two CPUs are busy with

We can see that they are busy doing something related to USB.

  • Check for dmesg
root@xxxxxxx:~# journalctl -k > kernel
root@xxxxxxx:~# cat kernel |grep -i usb
Jul 12 23:00:37 localhost kernel: ACPI: bus type USB registered
Jul 12 23:00:37 localhost kernel: usbcore: registered new interface driver usbfs
Jul 12 23:00:37 localhost kernel: usbcore: registered new interface driver hub
Jul 12 23:00:37 localhost kernel: usbcore: registered new device driver usb
Jul 12 23:00:37 localhost kernel: ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
Jul 12 23:00:37 localhost kernel: ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
Jul 12 23:00:37 localhost kernel: uhci_hcd: USB Universal Host Controller Interface driver
Jul 12 23:00:37 localhost kernel: xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 1
Jul 12 23:00:37 localhost kernel: xhci_hcd 0000:00:14.0: new USB bus registered, assigned bus number 2
Jul 12 23:00:37 localhost kernel: xhci_hcd 0000:00:14.0: Host supports USB 3.0 SuperSpeed
Jul 12 23:00:37 localhost kernel: usb usb1: New USB device found, idVendor=1d6b, idProduct=0002, bcdDevice= 5.04
Jul 12 23:00:37 localhost kernel: usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1
Jul 12 23:00:37 localhost kernel: usb usb1: Product: xHCI Host Controller
Jul 12 23:00:37 localhost kernel: usb usb1: Manufacturer: Linux 5.4.0-96-generic xhci-hcd
Jul 12 23:00:37 localhost kernel: usb usb1: SerialNumber: 0000:00:14.0
Jul 12 23:00:37 localhost kernel: hub 1-0:1.0: USB hub found
Jul 12 23:00:37 localhost kernel: usb usb2: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.04
Jul 12 23:00:37 localhost kernel: usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1
Jul 12 23:00:37 localhost kernel: usb usb2: Product: xHCI Host Controller
Jul 12 23:00:37 localhost kernel: usb usb2: Manufacturer: Linux 5.4.0-96-generic xhci-hcd
Jul 12 23:00:37 localhost kernel: usb usb2: SerialNumber: 0000:00:14.0
Jul 12 23:00:37 localhost kernel: hub 2-0:1.0: USB hub found
Jul 12 23:00:37 localhost kernel: usb: port power management may be unreliable
Jul 12 23:00:37 localhost kernel: usb usb1-port3: over-current condition

There are logs related with USB

  • Fix the issue

Actually there is no USB device on that node, so we tried to unbind the USB device to make it recover.

root@xxxxxxx:/sys/bus/pci/drivers# cd xhci_hcd/
root@xxxxxxx:/sys/bus/pci/drivers/xhci_hcd# ls
0000:00:14.0  bind  new_id  remove_id  uevent  unbind
root@xxxxxxx:/sys/bus/pci/drivers/xhci_hcd# ls -lah
total 0
drwxr-xr-x  2 root root    0 Jul 12 23:00 .
drwxr-xr-x 30 root root    0 Jul 12 23:00 ..
lrwxrwxrwx  1 root root    0 Aug 10 23:16 0000:00:14.0 -> ../../../../devices/pci0000:00/0000:00:14.0
--w-------  1 root root 4.0K Aug 10 23:16 bind
--w-------  1 root root 4.0K Aug 10 23:16 new_id
--w-------  1 root root 4.0K Aug 10 23:16 remove_id
--w-------  1 root root 4.0K Jul 12 23:00 uevent
--w-------  1 root root 4.0K Aug 10 23:15 unbind
root@xxxxxxx:/sys/bus/pci/drivers/xhci_hcd# echo "0000:00:14.0" > unbind

  • CPU usage after unbinding the USB device

Issues left

  • What does the error log really mean for the USB driver? What caused this issue? Is it related to the hardware of the USB bus?
  • Why did it cause high sys CPU usage?
mtr issues (2022-08-23)

Purpose
  • Use MTR to trace the packet path from pod to pod

Issues

  • When trying to run mtr inside the source pod's network namespace, it shows the following error:
mtr -T xxxxx.com

My traceroute  [v0.93]
yyyyy.com (10.xxx.xxx.xxx)                                                                                                                                           2022-08-17T09:28:24+0000
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                                            Packets               Pings
 Host                                                                     Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. (no route to host)

But if we use UDP/ICMP to probe the path instead of TCP, then mtr can show the full path as expected.

ICMP
root@test:~/mtr# ./mtr xxxxx.com
Start: 2022-08-18T09:12:07+0000
HOST: test                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- xxxxxx                    0.0%    10    0.1   0.1   0.1   0.2   0.0
  2.|-- xxxxxx                    0.0%    10   19.6  18.4  11.2  22.8   4.3
  3.|-- xxxxxx                    0.0%    10    0.9   1.1   0.8   1.7   0.3
  4.|-- xxxxxx                    0.0%    10    0.3   0.3   0.3   0.3   0.0
  5.|-- xxxxxx                    0.0%    10   14.7  15.0  11.8  22.4   2.9
  6.|-- xxxxxx                    0.0%    10    3.2   1.8   0.9   4.9   1.4
  7.|-- xxxxxx                    0.0%    10    0.8   3.1   0.4  14.2   4.5
  8.|-- xxxxxx                    0.0%    10   15.4  16.4  15.0  21.1   2.1
  9.|-- xxxxxx                    0.0%    10   12.1  15.1  12.0  30.5   6.5
 10.|-- xxxxxx                    0.0%    10   17.3  17.2  17.2  17.3   0.0
 11.|-- xxxxxx                    0.0%    10   15.1  15.1  15.1  15.2   0.0
 12.|-- xxxxxx                    0.0%    10   14.5  14.4  14.3  14.5   0.1
 13.|-- xxxxxx                    0.0%    10   12.0  15.7  12.0  19.7   2.7
 14.|-- xxxxxx                    0.0%    10   14.3  14.3  14.2  14.4   0.1

root@test:~/mtr# ./mtr xxxxxxx.com  -u
Start: 2022-08-18T09:14:50+0000
HOST: test                        Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- xxxx                      0%         10    0.2   0.2   0.1   0.2   0.0
  2.|-- xxxx                      0%         10   99.1  35.7   3.9 103.2  36.5
  3.|-- xxxx                      0%         10    1.0   1.0   0.9   1.2   0.1
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
  4.|-- xxxx                      0%         10    0.3   0.3   0.3   0.4   0.0
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%            
        xxxx                      0%         
        xxxx                      0%         
  5.|-- xxxx                      0%         10   23.1  67.2   4.9 325.7 101.3
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
  6.|-- xxxx                      0%         10    1.2   5.0   1.0  20.9   6.4
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
  7.|-- xxxx                      0%         10    5.6   3.4   0.4  15.2   4.7
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
  8.|-- xxxx                      0%         10   11.7  16.1  11.1  29.2   5.8
        xxxx                      0%         
        xxxx                      0%         
  9.|-- xxxx                      0%         10   11.5  15.7  11.4  25.0   4.2
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
 10.|-- xxxx                      0%         10   14.4  14.9   7.9  19.4   3.3
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
 11.|-- xxxx                      0%         10   15.1  13.8  10.8  16.6   2.2
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
 12.|-- xxxx                      0%         10   11.5  12.9  10.8  17.3   2.1
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
        xxxx                      0%         
 13.|-- xxxx                      0%         10   12.3  27.6  11.0 117.4  32.0
        xxxx                      0%         
        xxxx                      0%         
 14.|-- xxxx                      0%         10   16.6  14.2  10.8  16.6   1.6

Something happened when using TCP

Use tcpdump to capture what happened when using TCP:

03:09:02.112033 IP 10.xxx.xxx.xxx.35564 > xxxxxxxxxxx.com.http: Flags [S], seq 1656127004, win 64240, options [mss 1460,sackOK,TS val 784970555 ecr 0,nop,wscale 8], length 0

03:09:02.135481 IP yyyyyyyyyy.com > 10.18.196.21: ICMP xxxxxxxxxxx.com unreachable - need to frag (mtu 9000), length 36

The hop yyyyyyyyyy.com replied with the ICMP error "unreachable - need to frag (mtu 9000), length 36" when it received an IP packet with TTL=1.

The normal reply should be: ICMP time exceeded in-transit, length 72

So if we set --first-ttl to skip that hop (e.g. mtr -T -f 2 xxxxx.com), the TCP probe works fine.

It is obvious that mtr treats such received packets as indicating the IP is unreachable.

MTR change to handle ‘no route to host’

I thought that mtr would not continue probing when it receives this kind of ICMP error message, so I asked whether mtr can continue probing further rounds when it hits this kind of issue. Maintainer's reply: https://github.com/traviscross/mtr/issues/434#issuecomment-1220502725

But after reading the code and using tcpdump to trace the probe packets, I found that mtr does continue to probe; it just doesn't show the result when it hits the 'no route to host' issue. So if you want to show all the paths, you can change the code as in: https://github.com/wenlxie/mtr/commit/8a9baf561943f6229cbf936e376bced87e543d79

Why the ICMP probe can't show the ECMP paths

From tcpdump, we can see that only one device replied to the ICMP echo for a given TTL, which means:

  • This is not an mtr code error in handling multipath when using ICMP
  • The switches' ECMP hash configuration for ICMP does not use the ID/sequence fields in the ICMP header but only the source/destination IPs in our DC, which is why running traceroute/mtr many times discovers the same path.

--report option does not work

When using the --report option to collect reports, the ECMP path info is lost; mtr needs to be upgraded to a newer version such as 0.95.

auditbeat deadlock (2021-06-03)

Phenomenon

Users can't run sudo in their containers.

Debug

  • There are lots of cron and sudo processes stuck in D status

  • Check the processes' stacks. The sudo and cron processes send audit logs through netlink to auditbeat, but they get stuck and go into D state.
    [<0>] audit_receive+0x28/0xc0
    [<0>] netlink_unicast+0x197/0x220
    [<0>] netlink_sendmsg+0x227/0x3d0
    [<0>] sock_sendmsg+0x63/0x70
    [<0>] __sys_sendto+0x114/0x1a0
    [<0>] __x64_sys_sendto+0x28/0x30
    [<0>] do_syscall_64+0x57/0x190
    [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

From the disassembled code of audit_receive(), we can see that it is stuck trying to acquire a lock in audit_receive().

  • Try to find out which process is holding this lock; we can see that it is held by auditbeat.
cat /proc/248085/stat;cat /proc/248085/stack;cat /proc/248085/stat;cat /proc/248085/stack;cat /proc/248085/net/netlink

248085 (auditbeat) S 247759 247832 247832 0 -1 4194624 656934 0 5080 0 47087 8935 0 0 20 0 77 0 48038941 4420079616 53081 18446744073709551615 4194304 38844573 140728827925408 0 0 0 0 0 2143420159 0 0 0 -1 55 0 0 0 0 0 65563728 67515520 74981376 140728827930392 140728827930481 140728827930481 140728827932639 0
       [<0>] netlink_attachskb+0x1ab/0x1d0
       [<0>] netlink_unicast+0xab/0x220
       [<0>] audit_receive_msg+0x54c/0xeb0
       [<0>] audit_receive+0x57/0xc0
       [<0>] netlink_unicast+0x197/0x220
       [<0>] netlink_sendmsg+0x227/0x3d0
       [<0>] sock_sendmsg+0x63/0x70
       [<0>] __sys_sendto+0x114/0x1a0
       [<0>] __x64_sys_sendto+0x28/0x30
       [<0>] do_syscall_64+0x57/0x190
       [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

248085 (auditbeat) S 247759 247832 247832 0 -1 4194624 656934 0 5080 0 47087 8935 0 0 20 0 77 0 48038941 4420079616 53081 18446744073709551615 4194304 38844573 140728827925408 0 0 0 0 0 2143420159 0 0 0 -1 55 0 0 0 0 0 65563728 67515520 74981376 140728827930392 140728827930481 140728827930481 140728827932639 0
       [<0>] netlink_attachskb+0x1ab/0x1d0
       [<0>] netlink_unicast+0xab/0x220
       [<0>] audit_receive_msg+0x54c/0xeb0
       [<0>] audit_receive+0x57/0xc0
       [<0>] netlink_unicast+0x197/0x220
       [<0>] netlink_sendmsg+0x227/0x3d0
       [<0>] sock_sendmsg+0x63/0x70
       [<0>] __sys_sendto+0x114/0x1a0
       [<0>] __x64_sys_sendto+0x28/0x30
       [<0>] do_syscall_64+0x57/0x190
       [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

From the disassembled code, we can see where the process goes to sleep and is scheduled out.

The reason is sk->sk_rmem_alloc > sk->sk_rcvbuf; we can see that auditbeat's netlink receive memory is full:

ss -f netlink -e | grep -i auditbeat
    214272 0           audit:auditbeat/247832       *       sk=0 cb=0 groups=0x00000000
  • Use pprof to get auditbeat’s stack, and the related stack for this lock is:
            syscall.Syscall6
            syscall.sendto
            syscall.Sendto
            github.com/elastic/go-libaudit/v2.(*NetlinkClient).Send
            github.com/elastic/go-libaudit/v2.(*AuditClient).set
            github.com/elastic/go-libaudit/v2.(*AuditClient).Close.func1
            sync.(*Once).doSlow
            sync.(*Once).Do (inline)
            github.com/elastic/go-libaudit/v2.(*AuditClient).Close
            github.com/elastic/beats/v7/auditbeat/module/auditd.(*MetricSet).Run
            github.com/elastic/beats/v7/metricbeat/mb/module.(*metricSetWrapper).run
            github.com/elastic/beats/v7/metricbeat/mb/module.(*Wrapper).Start.func1

There is a goroutine that wants to close the audit client, and it is blocked at the setPID() operation.

  • Check auditbeat's code: it performs a setPID() operation before closing the netlink socket. setPID() makes the kernel send audit info back to auditbeat, but since auditbeat's receive buffer is full, the call sleeps while holding the lock. Other processes such as sudo and cron that want to send messages to the audit subsystem then also get stuck and go into D state.

  • Since the auditbeat process is in sleep (S) state, we can kill the auditbeat process to make the system recover.

  • How to fix

An upstream issue was filed for this at almost the same time: https://github.com/elastic/beats/issues/26031

Fix for this: https://github.com/elastic/beats/pull/26032
