Detective Work and Dropped TCP Connections
I had problems with TCP connections (mostly long-lasting ssh sessions) getting dropped on my ADSL line. In the end, I found that the problem had two different roots. The detective work behind establishing them is, I believe, interesting. It also shows how accessible source code, and the will to use it, can be a tremendous help in solving difficult system administration problems.
The Idle ssh Connections
The first problem involved idle ssh connections getting disconnected after some time. I knew that the cause was my router clearing the NAT entries of idle connections, so I made sure that the sshd KeepAlive option was set. This did not solve the problem. However, I noticed that not all the hosts I used were dropping their idle connections, so I started by tracing packets to one host that dropped connections and one that didn't:
windump -i 6 host freefall.freebsd.org or host istlab.dmst.aueb.gr
windump: listening on \Device\NPF_{688215B7-A2BE-4953-BC81-114456AEE710}
18:20:52.424148 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 3065637357 win 58400
18:20:52.424168 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
18:30:52.742158 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 1 win 58400
18:30:52.742188 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
18:40:52.992644 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 1 win 58400
18:40:52.992669 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
18:50:53.247049 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 1 win 58400
18:50:53.247066 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 62972 (DF)
74742 packets received by filter
0 packets dropped by kernel

As you can see, freefall was sending keep-alive packets, but istlab wasn't.
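To put a number on the spacing of freefall's probes, a small C sketch can subtract two timestamps taken from the trace. The function names `stamp_to_seconds` and `probe_gap` are made up for illustration, and the code assumes that both timestamps fall on the same day.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a windump timestamp of the form "HH:MM:SS.uuuuuu" to seconds
 * since midnight. Returns a negative value if the string does not parse. */
double stamp_to_seconds(const char *stamp)
{
    int h, m;
    double s;

    if (sscanf(stamp, "%d:%d:%lf", &h, &m, &s) != 3)
        return -1.0;
    return h * 3600.0 + m * 60.0 + s;
}

/* Gap in seconds between two timestamps (assumption: same day, in order). */
double probe_gap(const char *earlier, const char *later)
{
    double a = stamp_to_seconds(earlier);
    double b = stamp_to_seconds(later);

    assert(a >= 0.0 && b >= a);   /* reject unparsable or reversed input */
    return b - a;
}
```

Feeding it the first two freefall timestamps, 18:20:52.424148 and 18:30:52.742158, gives a gap of roughly 600.3 seconds, that is, about ten minutes between probes.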
Next step: examine the sshd source to see how the KeepAlive option is implemented:
$ grep KeepAlive *.c
readconf.c: { "keepalive", oKeepAlives },
case oKeepAlives:
intptr = &options->keepalives;
goto parse_flag;
$ grep keepalives *.c
sshd.c: if (options.keepalives &&
if (options.keepalives &&
setsockopt(sock_in, SOL_SOCKET, SO_KEEPALIVE, &on,
sizeof(on)) < 0)
error("setsockopt SO_KEEPALIVE: %.100s", strerror(errno));
$ grep SO_KEEPALIVE */*.c
netinet/tcp_timer.c: tp->t_inpcb->inp_socket->so_options & SO_KEEPALIVE) &&
if ((always_keepalive ||
tp->t_inpcb->inp_socket->so_options & SO_KEEPALIVE) &&
tp->t_state <= TCPS_CLOSING) {
if ((ticks - tp->t_rcvtime) >= tcp_keepidle + tcp_maxidle)
goto dropit;
int tcp_keepidle;
SYSCTL_PROC(_net_inet_tcp, TCPCTL_KEEPIDLE, keepidle, CTLTYPE_INT|CTLFLAG_RW,
&tcp_keepidle, 0, sysctl_msec_to_ticks, "I", "");
istlab$ sysctl net.inet.tcp.keepidle
net.inet.tcp.keepidle: 7200000
freefall$ sysctl net.inet.tcp.keepidle
net.inet.tcp.keepidle: 600000
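Stripped of sshd's option handling, the socket call in the excerpt above comes down to a few lines. The following is a minimal sketch, not sshd's actual code: the helper names `enable_keepalive` and `keepalive_enabled` are hypothetical, and only the standard `setsockopt` and `getsockopt` calls are assumed. Note that the option merely switches keep-alives on; in the kernel excerpt above, the wait before the first probe comes from the `tcp_keepidle` variable, not from the socket option.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Turn on SO_KEEPALIVE for a socket, mirroring the call in the sshd excerpt.
 * Returns 0 on success and -1 on failure (errno is set by setsockopt). */
int enable_keepalive(int sock)
{
    int on = 1;

    return setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Read the flag back: 1 if enabled, 0 if disabled, -1 on error. */
int keepalive_enabled(int sock)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &value, &len) < 0)
        return -1;
    assert(len == sizeof(value));   /* the option is reported as an int */
    return value != 0;
}
```

Calling `keepalive_enabled` before and after `enable_keepalive` on a freshly created TCP socket is a quick way to confirm that the flag was actually set, independently of when the kernel decides to send the first probe.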
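The difference is the idle time before the first keep-alive probe: two hours (7,200,000 ms) on istlab, but only ten minutes (600,000 ms) on freefall, which matches the ten-minute spacing of freefall's packets in the trace above. Coding ain't done till all tests run, so I traced both connections again: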
windump -i 6 host freefall.freebsd.org or host istlab.dmst.aueb.gr
windump: listening on \Device\NPF_{688215B7-A2BE-4953-BC81-114456AEE710}
21:09:11.331698 IP freefall.freebsd.org.22 > eagle.spinellis.gr.4316: . ack 3065651769 win 58400
21:09:11.331708 IP eagle.spinellis.gr.4316 > freefall.freebsd.org.22: . ack 1 win 63080 (DF)
21:10:00.749281 IP istlab.dmst.aueb.gr.22 > eagle.spinellis.gr.4527: . ack 3784816799 win 58400
21:10:00.749298 IP eagle.spinellis.gr.4527 > istlab.dmst.aueb.gr.22: . ack 1 win 63780 (DF)

QED.
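The reasoning behind the result can be written down as a one-line check: an idle connection keeps its NAT entry only if a keep-alive probe goes out before the router's idle timeout expires. The sketch below is illustrative only. The function names are made up, and the 30-minute NAT timeout used in the note after it is a hypothetical value, since the router's real timeout is not given here.

```c
#include <assert.h>
#include <stdbool.h>

/* Convert a keepidle value in milliseconds to whole minutes. */
long ms_to_minutes(long ms)
{
    return ms / 60000L;
}

/* True if the first keep-alive probe is sent before the NAT entry of an
 * idle connection would expire. Both arguments are in milliseconds. */
bool survives_nat(long keepidle_ms, long nat_timeout_ms)
{
    assert(keepidle_ms > 0 && nat_timeout_ms > 0);
    return keepidle_ms < nat_timeout_ms;
}
```

With an assumed NAT timeout of 1,800,000 ms, `survives_nat(600000, 1800000)` is true while `survives_nat(7200000, 1800000)` is false, which is the pattern seen with a ten-minute and a two-hour keepidle respectively.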
Moral
- Observe.
- Use the Source, Luke!
The Dropped ppp Link
The other problem involved the ADSL PPP connection dropping every six hours. The suggestion offered by my ISP's helpdesk (on three separate occurrences of the problem) was to reboot the SpeedTouch 530 router, because it was getting stuck (again, and again). They claimed that small routers tend to crash and often require rebooting. The persistent nature of the problem, and the fact that the connection was dropped after approximately six hours, convinced me that the problem was more complicated.
I let the router operate without a reboot for about a day, and observed the log. The router was picking up a new global IP address after the link went down, which happened one hour after the router got a new internal DHCP address (the same one). See the following three groups: