BPF meets io_uring

Over the last couple of years, a lot of development effort has gone into two kernel subsystems: BPF and io_uring. The BPF virtual machine allows programs from user space to be safely run within the context of the kernel, while io_uring addresses the longstanding problem of running system calls asynchronously. As the two subsystems expanded, it was inevitable that they would eventually meet; the first encounter happened in mid-February with a patch set from Pavel Begunkov adding the ability to run BPF programs from within io_uring.

The patch set itself is relatively straightforward, adding fewer than 300 lines of new code. It creates a new BPF program type (BPF_PROG_TYPE_IOURING) for programs that are meant to be run in the io_uring context. Any such program must first be created with the bpf() system call, then registered with the ring in which it is intended to run using the new IORING_ATTACH_BPF command. Once that has been done, the IORING_OP_BPF operation will cause a program to be run within the ring. The final step in the patch series adds a helper function that BPF programs can use to submit new operations into the ring.
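
The register-then-run flow described above can be illustrated with a small user-space model. This is only a sketch: it does not call into io_uring or bpf(), and every name in it (ToyRing, register_program, submit, OP_NOP, OP_BPF) is hypothetical, chosen to mirror the roles of the attach command, the IORING_OP_BPF operation, and the submission helper rather than their real interfaces.

```python
from collections import deque

# Hypothetical opcodes for the toy model; they only mirror the roles of an
# ordinary operation and of IORING_OP_BPF as described in the article.
OP_NOP = "nop"
OP_BPF = "bpf"


class ToyRing:
    """A user-space stand-in for a ring; not the kernel interface."""

    def __init__(self):
        self.programs = {}    # index -> callable, filled by register_program()
        self.queue = deque()  # pending requests as (opcode, arg) tuples
        self.completed = []   # opcodes in the order they finished

    def register_program(self, index, prog):
        # Stand-in for attaching an already-created program to this ring.
        self.programs[index] = prog

    def submit(self, opcode, arg=0):
        # Queue a request; programs use this as their "submission helper".
        self.queue.append((opcode, arg))

    def run(self):
        # Drain the queue, running the registered program for OP_BPF requests.
        while self.queue:
            opcode, arg = self.queue.popleft()
            if opcode == OP_BPF:
                self.programs[arg](self)  # the program may submit more work
            self.completed.append(opcode)


def submit_one_nop(ring):
    # The "program": all it does is submit one further operation.
    ring.submit(OP_NOP)


ring = ToyRing()
ring.register_program(0, submit_one_nop)  # step 2: register with the ring
ring.submit(OP_BPF, arg=0)                # step 3: ask the ring to run it
ring.run()
print(ring.completed)
```

In this model the program request completes first and the operation it submitted completes afterward, which is all the posted series makes possible: a program that queues more work.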

As a proof of concept, the patch series does a good job of showing how BPF programs might be run from an io_uring. This work does not, though, really enable any new capabilities in its current form, which may be part of why there have been no responses to it on the list. There is little value in running a BPF program asynchronously to submit another operation; one could simply submit that operation directly instead. As is acknowledged in the patch set, more infrastructure will be needed before this capability becomes useful to users.

The obvious place where BPF can add value is making decisions based on the outcome of previous operations in the ring. Currently, these decisions must be made in user space, which involves potential delays as the relevant process is scheduled and run. Instead, when an operation completes, a BPF program might be able to decide what to do next without ever leaving the kernel. “What to do next” could include submitting more I/O operations, moving on to the next in a series of files to process, or aborting a series of commands if something unexpected happens.

Making that kind of decision requires the ability to run BPF programs in response to other events in the ring. The sequencing mechanisms already built into io_uring would suffice to run a program once a specific operation completes, but that program would not have access to much useful information about how the operation completed. Fixing that will, as Begunkov noted, require a way to pass the results of an operation into a BPF program when it runs. An alternative would be to tie programs directly to submitted operations (rather than making them separate operations, as is done in the patch set); such programs would simply run at completion time.
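
To make the idea of a completion-time decision concrete, here is a minimal user-space sketch. It assumes a design in which a handler is handed the result of each operation as it completes, which is the missing piece discussed above and not something the posted patches provide. The names run_chain and stop_on_error_or_eof are hypothetical; the only real convention borrowed is that io_uring completions report failure as a negative errno value.

```python
import errno


def run_chain(results, on_complete):
    """Walk through simulated completion results in order.

    `results` stands in for the outcomes of successive operations: byte
    counts on success, negative errno values on failure. `on_complete`
    plays the role of a program tied to each operation; it sees the result
    and returns True to continue or False to abandon the rest of the chain.
    """
    log = []
    for index, res in enumerate(results):
        log.append((index, res))
        if not on_complete(res):
            break  # the handler decided to abort the remaining operations
    return log


def stop_on_error_or_eof(res):
    # Continue only while the previous operation transferred some data.
    return res > 0


log = run_chain([4096, 4096, -errno.EIO, 4096], stop_on_error_or_eof)
print(log)
```

The fourth simulated operation never runs, because the handler saw the error from the third and stopped the chain without any round trip to a separate decision-maker. That is the kind of choice that must currently be made in user space.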

With that piece in place, and with the increasing number of system calls supported within io_uring, it will become possible to create complex, I/O-related programs that can run in kernel space for extended periods. Running BPF programs may look like an enhancement to io_uring, but it can also be seen as giving BPF the ability to perform I/O and run a wide range of system calls. It looks like a combination that people might do some surprising things with.

That said, this is not a feature that is likely to be widely used. On its own, io_uring brings a level of complexity that is only justified for workloads that will see a significant performance improvement from asynchronous processing. Adding BPF into the mix will increase the level of complexity significantly, and long sequences of operations and BPF programs could prove challenging to debug. Finally, loading io_uring BPF programs requires either the CAP_BPF or the CAP_SYS_ADMIN capability, which means “root” in most configurations. As long as the current hostility toward unprivileged BPF programs remains, that is unlikely to change; as a result, relatively few programs are likely to use this feature.
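
As a rough illustration of that privilege requirement, the sketch below checks an effective-capability mask for either capability. It mirrors only the rule as stated in this article, not the kernel's full permission logic. The bit numbers (21 for CAP_SYS_ADMIN, 39 for CAP_BPF) come from the kernel's capability header; the helper names has_cap, may_load_bpf, and effective_caps are hypothetical, and the sample text imitates the CapEff line found in /proc/<pid>/status.

```python
CAP_SYS_ADMIN = 21  # bit number from <linux/capability.h>
CAP_BPF = 39        # bit number from <linux/capability.h> (Linux 5.8+)


def has_cap(cap_eff_hex, cap):
    # CapEff is a hexadecimal bit mask; test the bit for one capability.
    return bool((int(cap_eff_hex, 16) >> cap) & 1)


def may_load_bpf(cap_eff_hex):
    # The rule as stated in the article: either capability is enough.
    return has_cap(cap_eff_hex, CAP_BPF) or has_cap(cap_eff_hex, CAP_SYS_ADMIN)


def effective_caps(status_text):
    # Pull the CapEff value out of text shaped like /proc/<pid>/status.
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("no CapEff line found")


# Sample text for an unprivileged process (all capability bits clear).
sample = "Name:\tdemo\nCapEff:\t0000000000000000\n"
print(may_load_bpf(effective_caps(sample)))
```

On a Linux system, the same functions could be fed the contents of /proc/self/status instead of the sample string.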

Still, the combination of these two subsystems provides an interesting look at where Linux may go in the future. Linux will (probably) never be a unikernel system, but the line between user space and the kernel does appear to be getting increasingly blurry.
