On Linux we usually use ps(1) to view a snapshot of all currently running processes. It prints fields such as UID, PPID, PID, TIME, and CMD, for example:
[root@VM-0-11-tlinux ~]# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Aug30 ? 00:03:46 /usr/lib/systemd/systemd --system --deserialize 17
root 2 0 0 Aug30 ? 00:00:00 [kthreadd]
root 4 2 0 Aug30 ? 00:00:00 [kworker/0:0H]
root 6 2 0 Aug30 ? 00:00:00 [mm_percpu_wq]
root 7 2 0 Aug30 ? 00:00:44 [ksoftirqd/0]
root 8 2 0 Aug30 ? 00:01:14 [rcu_sched]
root 9 2 0 Aug30 ? 00:00:00 [rcu_bh]
root 10 2 0 Aug30 ? 00:00:00 [migration/0]
...
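This is sometimes not enough for troubleshooting: neither the nice value nor the scheduling priority is shown. According to ps(1), the -l option adds the nice value (NI), the priority (PRI), and other fields: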
[root@VM-0-11-tlinux ~]# ps -el
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 0 1 0 0 80 0 - 13547 ep_pol ? 00:03:46 systemd
1 S 0 2 0 0 80 0 - 0 kthrea ? 00:00:00 kthreadd
1 I 0 4 2 0 60 -20 - 0 worker ? 00:00:00 kworker/0:0H
1 I 0 6 2 0 60 -20 - 0 rescue ? 00:00:00 mm_percpu_wq
1 S 0 7 2 0 80 0 - 0 smpboo ? 00:00:44 ksoftirqd/0
1 I 0 8 2 0 80 0 - 0 rcu_gp ? 00:01:14 rcu_sched
1 I 0 9 2 0 80 0 - 0 rcu_gp ? 00:00:00 rcu_bh
1 S 0 10 2 0 -40 - - 0 smpboo ? 00:00:00 migration/0
...
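Now both the scheduling priority (PRI) and the nice value are shown, but something looks odd: for the ordinary (non-real-time) tasks above, the two are related by PRI = NI + 80. The manual page says the -c option prints scheduler-related information, so let's try it: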
[root@VM-0-11-tlinux ~]# ps -elc
F S UID PID PPID CLS PRI ADDR SZ WCHAN TTY TIME CMD
4 S 0 1 0 TS 19 - 13547 ep_pol ? 00:03:46 systemd
1 S 0 2 0 TS 19 - 0 kthrea ? 00:00:00 kthreadd
1 I 0 4 2 TS 39 - 0 worker ? 00:00:00 kworker/0:0H
1 I 0 6 2 TS 39 - 0 rescue ? 00:00:00 mm_percpu_wq
1 S 0 7 2 TS 19 - 0 smpboo ? 00:00:44 ksoftirqd/0
1 I 0 8 2 TS 19 - 0 rcu_gp ? 00:01:14 rcu_sched
1 I 0 9 2 TS 19 - 0 rcu_gp ? 00:00:00 rcu_bh
1 S 0 10 2 FF 139 - 0 smpboo ? 00:00:00 migration/0
This prints the scheduling policy (CLS) of each process, but PRI has changed yet again! This time PID 1 (systemd) shows a priority of 19 instead of 80. Why?
If that is already confusing, look at what top(1) prints:
[root@VM-0-11-tlinux ~]# top -b -o -PID
top - 17:20:13 up 20 days, 3:39, 3 users, load average: 0.00, 0.00, 0.00
Tasks: 87 total, 1 running, 46 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 6.2 sy, 0.0 ni, 93.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 876288 total, 69040 free, 96548 used, 710700 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 606368 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 20 0 54188 5212 3700 S 0.0 0.6 3:46.98 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.14 kthreadd
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
7 root 20 0 0 0 0 S 0.0 0.0 0:44.81 ksoftirqd/0
8 root 20 0 0 0 0 I 0.0 0.0 1:14.80 rcu_sched
9 root 20 0 0 0 0 I 0.0 0.0 0:00.00 rcu_bh
10 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
For the same systemd process, top(1) shows yet another priority: 20. Why is that?
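The three readings seen so far can be put side by side. The following is a minimal sketch that only replays the sample rows above. The linear relations it checks (NI + 80, 19 - NI, 20 + NI) are inferred from this output for ordinary time-sharing (TS) tasks, not taken from any documentation, and the tuple layout and function name are made up for illustration.

```python
# Sample rows copied from the ps -el, ps -elc and top output above:
# (command, nice, PRI from `ps -el`, PRI from `ps -elc`, PR from `top`)
samples = [
    ("systemd", 0, 80, 19, 20),
    ("kthreadd", 0, 80, 19, 20),
    ("kworker/0:0H", -20, 60, 39, 0),
    ("mm_percpu_wq", -20, 60, 39, 0),
    ("ksoftirqd/0", 0, 80, 19, 20),
]


def observed_relations_hold(nice, ps_l, ps_lc, top_pr):
    """Check the relations inferred from the sample output (TS tasks only)."""
    return ps_l == nice + 80 and ps_lc == 19 - nice and top_pr == nice + 20


results = {cmd: observed_relations_hold(ni, a, b, c) for cmd, ni, a, b, c in samples}
for cmd, ok in results.items():
    print(f"{cmd:14s} consistent: {ok}")
```

The real-time task migration/0 is left out because it has no nice value to compare against (ps prints "-" in its NI column).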
The top(1) manual page states that what it displays is indeed the scheduling priority:
16. PR -- Priority
The scheduling priority of the task. If you see `rt' in this field, it means the task is running under real time scheduling priority.
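In fact, on most Linux distributions both ps(1) and top(1) come from the same package, procps-ng (https://gitlab.com/procps-ng/procps). So why do they display the scheduling priority so differently, to the point that ps(1) itself shows different values depending on its options?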
Let's look directly at the ps(1) source code, which contains the following comment:
// "PRI" is created by "opri", or by "pri" when -c is used.
//
// Unix98 only specifies that a high "PRI" is low priority.
// Sun and SCO add the -c behavior. Sun defines "pri" and "opri".
// Linux may use "priority" for historical purposes.
//
// According to the kernel's fs/proc/array.c and kernel/sched.c source,
// the kernel reports it in /proc via this:
//        p->prio - MAX_RT_PRIO
// such that "RT tasks are offset by -200. Normal tasks are centered
// around 0, value goes from -16 to +15" but who knows if that is
// before or after the conversion...
//
// <linux/sched.h> says:
// MAX_RT_PRIO is currently 100.       (so we see 0 in /proc)
// RT tasks have a p->prio of 0 to 99. (so we see -100 to -1)
// non-RT tasks are from 100 to 139.   (so we see 0 to 39)
// Lower values have higher priority, as in the UNIX standard.
//
// In any case, pp->priority+100 should get us back to what the kernel
// has for p->prio.
//
// Test results with the "yes" program on a 2.6.x kernel:
//
// # ps -C19,_20 -o pri,opri,intpri,priority,ni,pcpu,pid,comm
// PRI PRI PRI PRI  NI %CPU  PID COMMAND
//   0  99  99  39  19 10.6 8686 19
//  34  65  65   5 -20 94.7 8687 _20
//
// Grrr. So the UNIX standard "PRI" must NOT be from "pri".
// Either of the others will do. We use "opri" for this.
// (and use "pri" when the "-c" option is used)
// Probably we should have Linux-specific "pri_for_l" and "pri_for_lc"
//
// sched_get_priority_min.2 says the Linux static priority is
// 1..99 for RT and 0 for other... maybe 100 is kernel-only?
//
// A nice range would be -99..0 for RT and 1..40 for normal,
// which is pp->priority+1. (3-digit max, positive is normal,
// negative or 0 is RT, and meets the standard for PRI)
//
https://gitlab.com/procps-ng/procps/-/blob/master/ps/output.c#L590
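The arithmetic in that comment can be written out as a small sketch. MAX_RT_PRIO = 100 and the 0..99 / 100..139 ranges come from the comment itself; the nice-to-prio mapping (120 + nice) is an assumption based on the kernel's NICE_TO_PRIO macro, and the function names are invented here for illustration.

```python
MAX_RT_PRIO = 100  # from <linux/sched.h>, as quoted in the comment


def proc_priority(kernel_prio):
    """Value exposed in /proc: p->prio - MAX_RT_PRIO."""
    if not 0 <= kernel_prio <= 139:
        raise ValueError("p->prio is expected to be in 0..139")
    return kernel_prio - MAX_RT_PRIO


def kernel_prio_from_proc(priority):
    """Inverse mapping: pp->priority + 100 recovers p->prio."""
    return priority + MAX_RT_PRIO


def normal_prio_from_nice(nice):
    """Assumed mapping for non-RT tasks: nice -20..19 -> p->prio 100..139."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in -20..19")
    return MAX_RT_PRIO + 20 + nice


print(proc_priority(0), proc_priority(99))      # RT range as seen in /proc
print(proc_priority(100), proc_priority(139))   # non-RT range as seen in /proc
print(proc_priority(normal_prio_from_nice(0)))  # a nice-0 task
```

Under these assumptions a nice-0 task has p->prio 120, which /proc reports as 20.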
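To stay compatible with the various UNIX standards and their implementations, ps(1) offers several ways of reporting the scheduling priority (pri, opri, intpri, priority). By default PRI is taken from opri; with the -c option it is taken from pri. On Linux, the priority field, that is, the raw value in column 18 of /proc/[pid]/stat, is arguably the better choice: it matches what Linux itself means by scheduling priority and agrees with top(1):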
[root@VM-0-11-tlinux ~]# ps -e -o 'priority,nice,cmd'
PRI NI CMD
20 0 /usr/lib/systemd/systemd --system --deserialize 17
20 0 [kthreadd]
0 -20 [kworker/0:0H]
0 -20 [mm_percpu_wq]
20 0 [ksoftirqd/0]
20 0 [rcu_sched]
20 0 [rcu_bh]
-100 - [migration/0]
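To cross-check these numbers without ps(1), the priority and nice fields can be read from /proc/[pid]/stat directly. The sketch below is an illustration only: parse_stat and ps_views are hypothetical helper names, and the opri and pri formulas (60 + priority, 39 - priority) are inferred from the outputs shown in this article rather than quoted from the procps-ng source.

```python
def parse_stat(stat_line):
    """Return (priority, nice) from one /proc/[pid]/stat line.

    The comm field (2nd) is wrapped in parentheses and may itself contain
    spaces or ')', so split after the last ')' before counting fields.
    """
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    return int(rest[18 - 3]), int(rest[19 - 3])


def ps_views(priority):
    """Views of the same raw value, as inferred from the sample output."""
    return {
        "priority": priority,   # ps -o priority
        "opri": 60 + priority,  # PRI in ps -l
        "pri": 39 - priority,   # PRI in ps -lc
    }


if __name__ == "__main__":
    try:
        with open("/proc/self/stat") as f:
            prio, nice = parse_stat(f.read())
        print(nice, ps_views(prio))
    except OSError:
        print("/proc/self/stat is not available on this system")
```

Feeding it the raw values from the listing above gives back the earlier readings: 20 maps to 80 and 19 (systemd), and -100 maps to -40 and 139 (migration/0).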