Out-of-memory victim selection with BPF

In its default configuration, the Linux kernel will allow processes to allocate more memory than the system can actually provide; this policy enables better utilization of physical memory and works just fine most of the time. On occasion, though, the kernel may find itself unable to provide memory that processes think already belongs to them. If the situation gets bad enough, the only solution (short of rebooting) is to declare a sort of memory bankruptcy and write off some of the kernel's debts by killing one or more processes. Over the years, a great deal of effort has gone into heuristics to select the processes that the user is least likely to miss. This problem is still clearly not solved to everybody's satisfaction, though, so it was only a matter of time before somebody introduced a way to select the out-of-memory (OOM) victim using BPF.

There are numerous ways to go hunting for a process to sacrifice when memory runs out. The process using the most memory is an obvious choice, but that process is often something important: a window-system server or a database manager, for example. So developers have naturally tried, over the years, to enable the kernel to make a better choice; the LWN kernel index shows how things have evolved over time. In current kernels, this decision comes down to a function called oom_badness() which, after exempting processes that cannot be killed for one reason or another, makes a simple calculation. A process's “OOM score” is essentially the amount of memory it uses, adjusted by that process's oom_score_adj value. By tweaking those knobs, user space can shelter some processes from the OOM-killer's depredations while directing its attention toward others.
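
The calculation described above can be approximated in a few lines of user-space C. This is only a rough model, loosely following the shape of the mainline calculation: the structure, its fields, and the model_ names are hypothetical stand-ins rather than kernel definitions, and the real function handles many cases that are ignored here.

```c
#include <limits.h>

/* Hypothetical stand-in for the fields the kernel reads from a task. */
struct model_task {
    long rss_pages;       /* resident memory, in pages */
    long swap_pages;      /* swapped-out memory, in pages */
    long pgtable_pages;   /* memory used by page tables, in pages */
    int oom_score_adj;    /* -1000 (never kill) .. 1000 (kill first) */
    int unkillable;       /* nonzero: exempt for some other reason */
};

#define MODEL_OOM_SCORE_ADJ_MIN (-1000)

/*
 * Simplified model of the badness calculation: the memory footprint plus
 * an adjustment that scales oom_score_adj by a thousandth of total memory.
 * Returns LONG_MIN for tasks that must not be selected.
 */
long model_oom_badness(const struct model_task *t, long totalpages)
{
    long points;

    if (t->unkillable || t->oom_score_adj == MODEL_OOM_SCORE_ADJ_MIN)
        return LONG_MIN;

    points = t->rss_pages + t->swap_pages + t->pgtable_pages;
    points += (long)t->oom_score_adj * (totalpages / 1000);
    return points;
}
```

In this model, a task using 1,000 pages on a system with 100,000 pages scores 1,000 with an oom_score_adj of zero and 51,000 with an oom_score_adj of 500, which illustrates how much weight the adjustment carries relative to actual memory use.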

That, evidently, is not enough control for some users. The BPF patch series from Chuyi Zhou is the latest in a series of attempts to improve that control.

In current kernels, the OOM killer will iterate through all of the possible target processes, call oom_badness() on each, then target the process that is given the highest score. Zhou's patch set allows the oom_badness() check to be replaced with a call to a BPF function, which should be defined as an “fmod_ret” tracing function (meaning it is invoked on return from an internal kernel function and can change that function's return value) with this name and prototype:


int bpf_oom_evaluate_task(struct task_struct *task, struct oom_control *oc);

This function will be called at the beginning of the evaluation of each potential victim and, if it makes a decision on the given task, will cause the normal evaluation to be skipped. The oom_control structure describes the context in which the OOM kill is taking place; the BPF function has access to it but probably (the rules are not actually documented anywhere) should not make changes to it. That function can also look at the task under consideration and make a decision regarding its fate, as reflected in its return value:

NO_BPF_POLICY: no policy is in effect, so the normal oom_badness() method should be used.

BPF_EVAL_ABORT: abort the selection process entirely with no process chosen to kill.

BPF_EVAL_NEXT: move on to the next process, passing over this one.

BPF_EVAL_SELECT: select this process as the one to kill.

Returning BPF_EVAL_SELECT does not bring an end to the iteration through the list of processes; there will be further calls to bpf_oom_evaluate_task() if there are more processes to examine. As a result, the function can change its mind and return BPF_EVAL_SELECT again if a more appealing victim comes along later in the sequence.
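
To make the effect of the four return values concrete, here is a small user-space model of the selection loop. It is a sketch and not the patch's code: the model_ types and functions are hypothetical, the numeric values of the enum are assumptions, the badness field stands in for the result of oom_badness(), and treating BPF_EVAL_ABORT as "return no victim at all" is this sketch's reading of the description above.

```c
#include <stddef.h>
#include <limits.h>

/* Return values named in the patch set; the numeric values are assumptions. */
enum model_verdict {
    NO_BPF_POLICY,
    BPF_EVAL_ABORT,
    BPF_EVAL_NEXT,
    BPF_EVAL_SELECT
};

struct model_task {
    const char *name;
    long badness;     /* stands in for the result of oom_badness() */
};

struct model_oc {
    const struct model_task *chosen;
    long chosen_points;
};

typedef enum model_verdict (*model_policy_fn)(const struct model_task *,
                                              const struct model_oc *);

/* Walk every candidate; the policy (if any) gets the first say on each. */
const struct model_task *model_select_victim(const struct model_task *tasks,
                                             size_t n, model_policy_fn policy)
{
    struct model_oc oc = { NULL, LONG_MIN };

    for (size_t i = 0; i < n; i++) {
        enum model_verdict v = policy ? policy(&tasks[i], &oc) : NO_BPF_POLICY;

        switch (v) {
        case BPF_EVAL_ABORT:
            return NULL;               /* give up: nothing gets killed */
        case BPF_EVAL_NEXT:
            continue;                  /* pass over this task */
        case BPF_EVAL_SELECT:
            oc.chosen = &tasks[i];     /* a later SELECT overrides this one */
            continue;
        case NO_BPF_POLICY:
            /* Usual path: the highest badness score wins. */
            if (tasks[i].badness > oc.chosen_points) {
                oc.chosen = &tasks[i];
                oc.chosen_points = tasks[i].badness;
            }
            break;
        }
    }
    return oc.chosen;
}

/* Example policy: select tasks whose name starts with 'b', skip the rest. */
enum model_verdict model_prefer_batch(const struct model_task *t,
                                      const struct model_oc *oc)
{
    (void)oc;   /* this policy only looks at the task */
    return t->name[0] == 'b' ? BPF_EVAL_SELECT : BPF_EVAL_NEXT;
}
```

Given the tasks "db" (900), "batch1" (100), "web" (500), and "batch2" (50), in that order, the model picks "db" when no policy is supplied. With model_prefer_batch() it ends up with "batch2": the earlier selection of "batch1" is simply overwritten when the second BPF_EVAL_SELECT arrives.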

It is possible to use BPF_EVAL_NEXT for some processes while using NO_BPF_POLICY for others. The end result will be to shield some processes from the OOM killer while letting the kernel make a decision in the usual way by looking at the rest. Mixing BPF_EVAL_SELECT and NO_BPF_POLICY looks like it could create surprising results, though; this combination does not appear to be intended and should, unless something changes in a future version, be avoided.

Specifically, the oom_control structure contains a pointer called chosen identifying the currently selected victim, and an integer chosen_points holding its badness score. In the absence of a BPF program, the kernel compares each process's score against chosen_points, and updates both if the new process has a higher score. Returning BPF_EVAL_SELECT sets chosen without setting chosen_points to anything. If NO_BPF_POLICY is returned for a later process, its score will be compared against a chosen_points that has no connection to the process selected earlier.
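
The hazard can be reproduced with a two-step user-space model. The model_ names below are hypothetical, and the sketch assumes that chosen_points starts out at the lowest possible value before any task has been scored; the patch itself may differ in its details.

```c
#include <stddef.h>
#include <limits.h>

struct model_task {
    const char *name;
    long badness;     /* stands in for the result of oom_badness() */
};

struct model_oc {
    const struct model_task *chosen;
    long chosen_points;
};

/* BPF_EVAL_SELECT in this model: sets chosen, leaves chosen_points alone. */
void model_apply_select(struct model_oc *oc, const struct model_task *t)
{
    oc->chosen = t;
}

/* NO_BPF_POLICY falls back to the usual highest-score-wins comparison. */
void model_apply_default(struct model_oc *oc, const struct model_task *t)
{
    if (t->badness > oc->chosen_points) {
        oc->chosen = t;
        oc->chosen_points = t->badness;
    }
}
```

Starting from an empty model_oc (chosen set to NULL, chosen_points set to LONG_MIN), selecting a task with a badness of 900 via model_apply_select() and then running model_apply_default() on a task with a badness of 5 leaves the second, much smaller task as the victim. The policy's explicit choice is displaced by a comparison against a score that never described it.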

There are two related hooks provided by the patch set as well. One of them allows the name of the current victim-selection policy to be stored in the kernel; that name will be propagated through to the log when an actual kill is done. To do so, the program should define an fmod_ret function:


void bpf_set_policy_name(struct oom_control *oc);

That function, which will be called at the beginning of the OOM-kill procedure, can then turn around and call:


void set_oom_policy_name(struct oom_control *oc, const char *name, size_t sz);

Where oc is the oom_control structure passed to bpf_set_policy_name(), name is the policy name to use, and sz is the length of that name. Names are limited to 16 bytes, including the terminating NUL byte.
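
The 16-byte limit is easy to get wrong by one, so a user-space sketch of the bookkeeping may help. Everything here is hypothetical: the model_ names, the int return value (the function in the patch returns void), the reading of sz as the number of characters without the terminating NUL, and the choice to reject an overlong name rather than truncate it.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_POLICY_NAME_LEN 16   /* limit stated in the text, NUL included */

struct model_oc {
    char policy_name[MODEL_POLICY_NAME_LEN];
};

/*
 * Store a policy name. Returns 0 on success, or -1 if the name plus its
 * terminating NUL does not fit. Rejecting (rather than truncating) is an
 * assumption of this sketch.
 */
int model_set_policy_name(struct model_oc *oc, const char *name, size_t sz)
{
    if (!oc || !name || sz >= MODEL_POLICY_NAME_LEN)
        return -1;

    memcpy(oc->policy_name, name, sz);
    oc->policy_name[sz] = '\0';
    return 0;
}
```

Under these assumptions, the longest name that can be stored has 15 visible characters; a 16-character name is refused because there would be no room left for the NUL byte.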

There is also a new tracepoint, select_bad_process_end, that fires if the OOM-kill procedure fails to find a process to kill. It is intended to be a helper for developers who are trying to develop a new OOM-kill policy.

This series is currently in its second revision. In response to the first posting, memory-management developer Michal Hocko suggested simplifying the interface somewhat. Roman Gushchin, instead, argued for taking a more general approach, where the BPF program is called once at OOM time and is expected to figure out a way to free some memory somewhere. Hocko responded that it would be better to start with “something that is good enough” and add complexity later if it seems warranted. In response to the second revision, Alexei Starovoitov also supported a more general callback, though, and Zhou has started considering the implications of such a change.

Both Hocko and Gushchin expressed worries that introducing BPF into this code, which runs when the system is in a distressed state, could further reduce the stability of an out-of-memory situation. An attempt by a BPF program to allocate a lot of memory in this situation seems likely to end in tears, for example. That is true of any code that hooks into the OOM-killer, though, and is not a problem specific to BPF.

The conversation has shown that there is some interest in the use of BPF to select victims for the out-of-memory killer. Thus far, though, there is not a clear consensus on the approach that this work should take. It would not be surprising, at this point, to see this feature go through some significant changes before it gets closer to the mainline; until then, the kernel will just have to continue choosing processes to sacrifice the old-fashioned way.
