Kexec handover and the live update orchestrator


Rebooting a computer ordinarily brings an abrupt end to any state built up by the old system; the new kernel starts from scratch. There are, however, people who would like to be able to reboot their systems without disrupting the workloads running therein. Various developers are currently partway through the project of adding this capability, in the form of “kexec handover” and the “live update orchestrator”, to the kernel.

Normally, rebooting a computer is done out of the desire to start fresh, but sometimes the real objective is to refresh only some layers of the system. Consider a large machine running deep within some cloud provider's data center. A serious security or performance issue may bring about a need to update the kernel on that machine, but the kernel is not the only thing running there. The user-space layers are busily generating LLM hallucinations and deep-fake videos, and the owner of the machine would much rather avoid interrupting that flow of valuable content. If the kernel could be rebooted without disturbing the workload, there would be great rejoicing.

Preserving a workload across a reboot requires somehow saving all of its state, from user-space memory to device-level information within the kernel. Simply identifying all of that state can be a challenge, preserving it even more so, as a look at the long effort behind the Checkpoint/Restore in Userspace project will make clear. All of that state must then be properly restored after the kernel is swapped out from underneath the workload. All told, it is a daunting challenge.

The problem becomes a little easier, though, in the case of a system running virtualized guests. The state of the guests themselves is well encapsulated within the virtual machines, and there is relatively little hardware state to preserve. So it is not surprising that this is the type of workload that is being targeted for the planned kernel-switcheroo functionality.


Preserving state across a reboot

The first piece of the solution, kexec handover (KHO), was posted by Mike Rapoport earlier this year and merged for the 6.16 kernel release. Rapoport discussed this work at the 2025 Linux Storage, Filesystem, Memory Management, and BPF Summit. KHO offers a deceptively simple API to any subsystem that needs to save data across a reboot; for example, a call to kho_preserve_folio() will save the contents of a folio. After the new kernel boots, that folio can be restored with kho_restore_folio(). A subsystem can use these primitives to ensure that the data it needs will survive a reboot and be available to the new kernel.
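As a sketch of how a subsystem might use these primitives (kernel-side code, so it cannot run stand-alone; the my_* helpers are hypothetical, while kho_preserve_folio() and kho_restore_folio() are from the KHO API as merged):

```c
/* Kernel-side sketch; my_record_phys() and the surrounding structure
 * are hypothetical, invented for illustration. */
static struct folio *my_state_folio;

/* Before the kexec: ask KHO to carry the folio across the reboot. */
static int my_subsystem_preserve(void)
{
	int err = kho_preserve_folio(my_state_folio);

	if (err)
		return err;
	/*
	 * The folio's physical address must itself reach the new
	 * kernel, typically as a property in the KHO devicetree.
	 */
	my_record_phys(PFN_PHYS(folio_pfn(my_state_folio)));
	return 0;
}

/* In the new kernel: reclaim the folio at the recorded address. */
static int my_subsystem_restore(phys_addr_t phys)
{
	my_state_folio = kho_restore_folio(phys);
	return my_state_folio ? 0 : -ENOENT;
}
```

Note that the restore side is driven by the subsystem itself; KHO only guarantees that the memory survives the transition and can be handed back as a folio.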

Underneath the hood, KHO prepares the memory for preservation by coalescing it into specific regions. A data structure describing all of the preserved memory is created as a type of flattened devicetree that is passed through to the new kernel. Also described in that devicetree are the “scratch areas” of memory — the portions of memory that do not contain preserved data and which, consequently, are available for the new kernel to use during the initialization process. Once the bootstrap is complete and kernel subsystems have reclaimed the memory that was preserved, the system operates as usual, with the workload not even noticing that the foundation of the system was changed out from underneath it.

Every subsystem that will participate in KHO must necessarily be supplemented with the code that identifies the state to preserve and manages the transition. For the virtualization use case, much of that work can be done inside KVM, which contains most of the information about the virtual machines that are running. With support added to a few device drivers, it should be possible to save (and restore) everything that is needed. What is missing in current kernels, though, is the overall logic that tells each subsystem when it should prepare for the change and when to recover.

The live update orchestrator

The live update orchestrator (LUO) patches are the work of Pasha Tatashin; the series is currently in its second version. LUO is the control layer that makes the whole live-update process work as expected. To that end, it handles transitions between four defined system states:

Normal: the ordinary operating state of the system.

Prepared: once the decision has been made to perform a reboot, all LUO-aware subsystems are informed of a LIVEUPDATE_PREPARE event by way of a callback (described below), instructing them to serialize and preserve their state for a reboot. If this preparation is successful across the system, it will enter the prepared state, ready for the final acts of the outgoing kernel. The workload is still running at this time, so subsystems have to be prepared for their preserved state to change.

Frozen: brought about by a LIVEUPDATE_FREEZE event just prior to the reboot. At this point, the workload is suspended, and subsystems should finalize the data to be preserved.

Updated: the new kernel is booted and running; a LIVEUPDATE_FINISH event will be sent, instructing each subsystem to restore its preserved state and return to normal operation.


To handle these events, every subsystem that will participate in the live-update process must create a set of callbacks to implement the transition between system states:



struct liveupdate_subsystem_ops {
    int (*prepare)(void *arg, u64 *data);  /* normal → prepared */
    int (*freeze)(void *arg, u64 *data);   /* prepared → frozen */
    void (*cancel)(void *arg, u64 data);   /* back to normal w/o reboot */
    void (*finish)(void *arg, u64 data);   /* updated → normal */
};

Those callbacks are then registered with the LUO core:



struct liveupdate_subsystem {
    const struct liveupdate_subsystem_ops *ops;
    const char *name;
    void *arg;
    struct list_head list;
    u64 private_data;
};
 
int liveupdate_register_subsystem(struct liveupdate_subsystem *subsys);

The arg value in this structure reappears as the arg parameter to each of the registered callbacks (though this behavior seems likely to change in future versions of the series). The prepare() callback can store a data handle in the space pointed to by data; that handle will then be passed to the other callbacks. Each callback returns the usual “zero or negative error code” value indicating whether it was successful.
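Putting the two structures together, a hypothetical subsystem might register itself along the following lines ("myfs" and its helpers are invented for illustration; only the liveupdate_* names come from the patch series):

```c
/* Kernel-side sketch; the myfs_* state helpers are hypothetical. */
static int myfs_prepare(void *arg, u64 *data)
{
	/* Serialize state now; store a handle for the later callbacks. */
	return myfs_serialize_state(arg, data);
}

static int myfs_freeze(void *arg, u64 *data)
{
	/* The workload is now suspended; write out any final changes. */
	return myfs_finalize_state(arg, data);
}

static void myfs_cancel(void *arg, u64 data)
{
	myfs_discard_state(data);	/* reboot abandoned; clean up */
}

static void myfs_finish(void *arg, u64 data)
{
	myfs_restore_state(data);	/* running in the new kernel */
}

static const struct liveupdate_subsystem_ops myfs_luo_ops = {
	.prepare = myfs_prepare,
	.freeze  = myfs_freeze,
	.cancel  = myfs_cancel,
	.finish  = myfs_finish,
};

static struct liveupdate_subsystem myfs_luo_subsys = {
	.ops  = &myfs_luo_ops,
	.name = "myfs",
};

static int __init myfs_luo_init(void)
{
	return liveupdate_register_subsystem(&myfs_luo_subsys);
}
```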


There is a separate in-kernel infrastructure for the preservation of file descriptors across a reboot; the set of callbacks (defined in this patch) looks similar to those above with a couple of additions. For example, the can_preserve() callback returns an indication of whether a given file can be preserved at all. Support will need to be added to every filesystem that will host files that may be preserved across a reboot.


LUO provides an interface to user space, both to control the update process and to enable the preservation of data across an update. For the control side, there is a new device file (/dev/liveupdate) supporting a set of ioctl() operations to initiate state transitions; the LIVEUPDATE_IOCTL_PREPARE command, for example, will attempt to move the system into the “prepared” state. The current state can be queried at any time, and the whole process aborted before the reboot if need be. The patch series includes a program called luoctl that can be used to initiate transitions from the command line.
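From user space, initiating the transition could look something like the sketch below. The ioctl request number used here is invented (real code must take the actual value from the LUO UAPI header), so this is illustrative only:

```c
#include <fcntl.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Illustrative request number only, not the value from the patches. */
#define LIVEUPDATE_IOCTL_PREPARE _IO('u', 1)

/*
 * Ask the kernel to move into the "prepared" state.
 * Returns 0 on success, -1 if the device is absent or the ioctl fails.
 */
static int live_update_prepare(const char *devpath)
{
	int fd = open(devpath, O_RDWR);
	int ret;

	if (fd < 0)
		return -1;
	ret = ioctl(fd, LIVEUPDATE_IOCTL_PREPARE);
	close(fd);
	return ret < 0 ? -1 : 0;
}
```

A failure anywhere leaves the system in the normal state; as described above, the transition can also be queried and aborted explicitly before the reboot.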


The preservation of specific files across a reboot can be requested with the LIVEUPDATE_IOCTL_FD_PRESERVE ioctl() command. The most common anticipated use of this functionality would appear to be preserving the contents of memfd files, which are often used to provide the backing memory for virtual machines. There is a separate document describing how memfd preservation works that gives some insights into the limitations of file preservation. For example, the close-on-exec and sealed status of a memfd will not be preserved, but its contents will. In the prepared phase, reading from and writing to the memfd are still supported, but it is not possible to grow or shrink the memfd. So reboot-aware code probably needs to be prepared for certain operations to be unavailable during the (presumably short) prepared phase.
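As a hedged sketch of that use case, a VMM might create and preserve a guest-memory memfd as below. Both the request number and the argument convention (passing the file descriptor directly as the ioctl argument) are assumptions for illustration, not taken from the actual patches:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Illustrative request number; real code must use the LUO UAPI header. */
#define LIVEUPDATE_IOCTL_FD_PRESERVE _IOW('u', 2, int)

/*
 * Create a memfd to back guest memory and ask LUO to preserve it across
 * the reboot. Returns the memfd on success, or -1 on any failure
 * (including running on a kernel without /dev/liveupdate).
 */
static int create_preserved_memfd(const char *name, size_t size)
{
	int memfd = memfd_create(name, 0);
	int luo;

	if (memfd < 0)
		return -1;
	if (ftruncate(memfd, size) < 0)
		goto err_memfd;

	luo = open("/dev/liveupdate", O_RDWR);
	if (luo < 0)
		goto err_memfd;
	if (ioctl(luo, LIVEUPDATE_IOCTL_FD_PRESERVE, memfd) < 0) {
		close(luo);
		goto err_memfd;
	}
	close(luo);
	return memfd;

err_memfd:
	close(memfd);
	return -1;
}
```

Given the restrictions above, the memfd would have to be sized (with ftruncate(), as here) before the prepared phase begins, since it cannot be grown or shrunk afterward.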


This series has received a number of review comments and seems likely to go through a number of changes before it is deemed ready for inclusion. There does not, however, seem to be any opposition to the objective or core design of this work. Once the details are taken care of, LUO seems likely to join KHO in the kernel and make kernel updates easier for certain classes of Linux users.
