Is the Linux TCP keepalive timer wrong?

With the -o option, ss(1) prints the timer information of TCP connections, for example:

# ss -ntpo 'sport = 80'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 172.16.0.11:80 183.14.31.93:25144 users:(("socat",pid=10624,fd=6)) timer:(keepalive,1min32sec,0)

This connection was created with socat and has a keepalive idle time of 120s. ss(1) prints the timer information as expected: when the command ran, 92s were left before the timer expired.
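For reference, here is a minimal sketch of how a program can request the same 120s idle time on its own socket. The helper name enable_keepalive is made up for illustration, and socket.TCP_KEEPIDLE is only defined on platforms that support it (Linux does), so the code guards for it.

```python
import socket


def enable_keepalive(sock, idle_seconds=120):
    """Turn on TCP keepalive and, where supported, set the idle time.

    Returns the idle time read back from the socket, or None when the
    platform does not expose TCP_KEEPIDLE.
    """
    # SO_KEEPALIVE switches the keepalive mechanism on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        # TCP_KEEPIDLE: seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
        return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)
    return None
```

A connection configured this way should show a keepalive timer in the ss -o output, like the socat connection above.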

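The problem: after one end of the connection sends data, we would expect the timer to be reset to 120s and start counting down again. That is not what happens. After the client sends data, look at the timer again: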

# ss -ntpo 'sport = 80'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 172.16.0.11:80 183.14.31.93:25144 users:(("socat",pid=10624,fd=6)) timer:(keepalive,1min28sec,0)

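ss(1) keeps counting down the original idle timeout. A packet capture, however, shows that the probe really is sent 120s after the last data was received. So the keepalive mechanism itself is correct; what misleads is the displayed timer. The likely reason is that the kernel handles this lazily to save work: it does not re-arm the timer on every packet. Instead, when the timer fires it checks how long the connection has actually been idle and, if that is less than the idle time, re-arms the timer for the remainder. For details see: https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_timer.c#L662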
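The lazy behavior can be modeled in a few lines. This is a simplified simulation, not the kernel code: the function names are invented for illustration, times are plain seconds counted from when the connection was established, and only the first probe is modeled.

```python
def on_keepalive_timer(now, last_activity, idle):
    """Model of what happens when the keepalive timer fires at `now`.

    Returns None when a probe should be sent, otherwise the number of
    seconds to re-arm the timer for.
    """
    elapsed = now - last_activity
    if elapsed >= idle:
        return None          # idle long enough: send a probe
    return idle - elapsed    # not yet: sleep for the remainder


def first_probe_time(idle, activity_times):
    """Return the time at which the first keepalive probe is sent.

    Traffic does NOT touch the timer; it is only looked at when the
    timer fires, which is why ss(1) keeps counting down.
    """
    expiry = idle  # timer armed once when the connection is set up
    while True:
        seen = [t for t in activity_times if t <= expiry]
        last_activity = max(seen, default=0)
        remaining = on_keepalive_timer(expiry, last_activity, idle)
        if remaining is None:
            return expiry
        expiry += remaining
```

With an idle time of 120 and data arriving at second 32, first_probe_time(120, [32]) evaluates to 152: the timer still fires at second 120 (what ss shows), finds only 88s of idleness, and re-arms for another 32s, so the probe goes out 120s after the data, as the packet capture shows.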

Is the scheduling priority printed by ps wrong?

On Linux we usually use ps(1) to get a snapshot of all running processes. It prints UID, PPID, PID, TIME, CMD and so on, like this:

# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Aug30 ? 00:03:46 /usr/lib/systemd/systemd --system --deserialize 17
root 2 0 0 Aug30 ? 00:00:00 [kthreadd]
root 4 2 0 Aug30 ? 00:00:00 [kworker/0:0H]
root 6 2 0 Aug30 ? 00:00:00 [mm_percpu_wq]
root 7 2 0 Aug30 ? 00:00:44 [ksoftirqd/0]
root 8 2 0 Aug30 ? 00:01:14 [rcu_sched]
root 9 2 0 Aug30 ? 00:00:00 [rcu_bh]
root 10 2 0 Aug30 ? 00:00:00 [migration/0]
...

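Sometimes this is not enough for troubleshooting: neither the nice value nor the scheduling priority of a process is shown. What now? According to ps(1), the -l option adds the nice value, PRI and other fields: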

# ps -el
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 0 1 0 0 80 0 - 13547 ep_pol ? 00:03:46 systemd
1 S 0 2 0 0 80 0 - 0 kthrea ? 00:00:00 kthreadd
1 I 0 4 2 0 60 -20 - 0 worker ? 00:00:00 kworker/0:0H
1 I 0 6 2 0 60 -20 - 0 rescue ? 00:00:00 mm_percpu_wq
1 S 0 7 2 0 80 0 - 0 smpboo ? 00:00:44 ksoftirqd/0
1 I 0 8 2 0 80 0 - 0 rcu_gp ? 00:01:14 rcu_sched
1 I 0 9 2 0 80 0 - 0 rcu_gp ? 00:00:00 rcu_bh
1 S 0 10 2 0 -40 - - 0 smpboo ? 00:00:00 migration/0
...

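Now we seem to have both the scheduling priority (PRI) and the nice value, but something is odd: the two are related by PRI = NI + 80. OK, the man page also says the -c option prints scheduler-related information, so let's try it: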

# ps -elc
F S UID PID PPID CLS PRI ADDR SZ WCHAN TTY TIME CMD
4 S 0 1 0 TS 19 - 13547 ep_pol ? 00:03:46 systemd
1 S 0 2 0 TS 19 - 0 kthrea ? 00:00:00 kthreadd
1 I 0 4 2 TS 39 - 0 worker ? 00:00:00 kworker/0:0H
1 I 0 6 2 TS 39 - 0 rescue ? 00:00:00 mm_percpu_wq
1 S 0 7 2 TS 19 - 0 smpboo ? 00:00:44 ksoftirqd/0
1 I 0 8 2 TS 19 - 0 rcu_gp ? 00:01:14 rcu_sched
1 I 0 9 2 TS 19 - 0 rcu_gp ? 00:00:00 rcu_bh
1 S 0 10 2 FF 139 - 0 smpboo ? 00:00:00 migration/0

This prints the scheduling policy (CLS) each process uses, but PRI has changed again! The priority of pid 1 (systemd) is no longer 80; it is now 19. Why?

If that already feels confusing, look at what top(1) prints:

# top -b -o -PID
top - 17:20:13 up 20 days, 3:39, 3 users, load average: 0.00, 0.00, 0.00
Tasks: 87 total, 1 running, 46 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 6.2 sy, 0.0 ni, 93.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 876288 total, 69040 free, 96548 used, 710700 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 606368 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 54188 5212 3700 S 0.0 0.6 3:46.98 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.14 kthreadd
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
7 root 20 0 0 0 0 S 0.0 0.0 0:44.81 ksoftirqd/0
8 root 20 0 0 0 0 I 0.0 0.0 1:14.80 rcu_sched
9 root 20 0 0 0 0 I 0.0 0.0 0:00.00 rcu_bh
10 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0

In the top(1) output the very same process, systemd, now has a priority of 20. Why is that?

The top(1) man page confirms that what it shows really is the scheduling priority:

   16. PR  --  Priority
       The scheduling priority of the task.  If you see `rt' in this field, it means the task is running under real time scheduling priority.

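In fact, on most Linux distributions both ps(1) and top(1) come from the same package, procps-ng (https://gitlab.com/procps-ng/procps). So why do they display the scheduling priority so differently, to the point that even ps(1) disagrees with itself depending on the options?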

Let's go straight to the ps(1) source code, which contains the following comment:

// "PRI" is created by "opri", or by "pri" when -c is used.
//
// Unix98 only specifies that a high "PRI" is low priority.
// Sun and SCO add the -c behavior. Sun defines "pri" and "opri".
// Linux may use "priority" for historical purposes.
//
// According to the kernel's fs/proc/array.c and kernel/sched.c source,
// the kernel reports it in /proc via this:
//        p->prio - MAX_RT_PRIO
// such that "RT tasks are offset by -200. Normal tasks are centered
// around 0, value goes from -16 to +15" but who knows if that is
// before or after the conversion...
//
// <linux/sched.h> says:
// MAX_RT_PRIO is currently 100.       (so we see 0 in /proc)
// RT tasks have a p->prio of 0 to 99. (so we see -100 to -1)
// non-RT tasks are from 100 to 139.   (so we see 0 to 39)
// Lower values have higher priority, as in the UNIX standard.
//
// In any case, pp->priority+100 should get us back to what the kernel
// has for p->prio.
//
// Test results with the "yes" program on a 2.6.x kernel:
//
// # ps -C19,_20 -o pri,opri,intpri,priority,ni,pcpu,pid,comm
// PRI PRI PRI PRI  NI %CPU  PID COMMAND
//   0  99  99  39  19 10.6 8686 19
//  34  65  65   5 -20 94.7 8687 _20
//
// Grrr. So the UNIX standard "PRI" must NOT be from "pri".
// Either of the others will do. We use "opri" for this.
// (and use "pri" when the "-c" option is used)
// Probably we should have Linux-specific "pri_for_l" and "pri_for_lc"
//
// sched_get_priority_min.2 says the Linux static priority is
// 1..99 for RT and 0 for other... maybe 100 is kernel-only?
//
// A nice range would be -99..0 for RT and 1..40 for normal,
// which is pp->priority+1. (3-digit max, positive is normal,
// negative or 0 is RT, and meets the standard for PRI)
//


https://gitlab.com/procps-ng/procps/-/blob/master/ps/output.c#L590

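To stay compatible with the various UNIX standards and implementations, ps(1) carries several ways of deriving the scheduling priority (pri, opri, intpri, priority). By default PRI is taken from opri; with the -c option it is taken from pri. On Linux, the better choice is arguably priority, that is, the raw priority value in field 18 of /proc/[pid]/stat. It matches what scheduling priority actually means on Linux and agrees with top(1):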

# ps -e -o 'priority,nice,cmd'
PRI NI CMD
20 0 /usr/lib/systemd/systemd --system --deserialize 17
20 0 [kthreadd]
0 -20 [kworker/0:0H]
0 -20 [mm_percpu_wq]
20 0 [ksoftirqd/0]
20 0 [rcu_sched]
20 0 [rcu_bh]
-100 - [migration/0]
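To tie the three displays together, here is a small sketch that reads the raw value from a /proc/[pid]/stat line and derives what each tool would show. The function names are made up for illustration. The formulas 60 + priority (opri, used by ps -l) and 39 - priority (pri, used by ps -lc) are inferred from the outputs above and the test results quoted in the source comment, and the "rt" rule for top is an assumption based on the migration/0 row; treat this as a sketch rather than a restatement of the procps-ng code.

```python
def parse_stat_priority(stat_line):
    """Return (priority, nice) from one /proc/[pid]/stat line.

    The comm field (field 2) is wrapped in parentheses and may itself
    contain spaces or ')', so everything up to the LAST ')' is skipped.
    """
    rest = stat_line.rpartition(")")[2].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    priority = int(rest[18 - 3])
    nice = int(rest[19 - 3])
    return priority, nice


def priority_views(priority):
    """Map the raw /proc priority to the values each tool displays."""
    return {
        "ps -l": 60 + priority,    # opri
        "ps -lc": 39 - priority,   # pri
        # Assumption: top prints "rt" for the highest real-time priority.
        "top": "rt" if priority < -99 else priority,
    }


if __name__ == "__main__":
    with open("/proc/self/stat") as f:
        prio, nice = parse_stat_priority(f.read())
    print(prio, nice, priority_views(prio))
```

For systemd (raw priority 20) this gives 80, 19 and 20, and for migration/0 (raw priority -100) it gives -40, 139 and rt, which matches the ps -el, ps -elc and top outputs shown earlier.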