Introduction
This blog post introduces the concept of CPU hotplugging (and hot-unplugging) in a virtualized environment, and includes high-level coverage of some of the key mechanisms involved. The focus is on QEMU and KVM, primarily with a Linux guest, but the general concepts are applicable to most other virtualization environments.
Use Cases
A common use case for CPU hotplugging is scaling virtual machines up and down. As demand or workload increases on a VM, more vCPUs can be added without a reboot. Similarly, when workload is reduced, vCPUs can be easily removed. This scaling typically takes on the order of seconds.
Public cloud providers typically charge their customers an amount proportional to the CPU count. Moreover, some software licensing models charge based on the number of CPUs the software is running on. Thus, a popular usage model is for users to scale up their vCPU count only when needed and scale it back down when demand is low, in order to reduce costs.
Another use case for CPU hotplug is capacity management. Consider a simple example where a user has 10 hosts, with 40 total VMs running across those hosts. There is a finite total CPU count, 200 CPUs, across the fleet of hosts. If the user now needs to run 10 additional VMs (50 total), but the full count of CPUs is already in use, how can the user launch these additional VMs?
One option is to simply buy more hosts. This increases the available CPU count, but there is additional cost associated with this option.
Another option is to reduce the number of vCPUs allocated to each VM, keeping the host count unchanged. In this example, each VM has been allocated 5 vCPUs. If we reduce each VM’s CPU count to 4, we can now run 50 total VMs. The cost remains the same: we’ve increased the total VM count by 25% while reducing each VM’s CPU count by only 20%.
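Spelled out, the arithmetic is: 200 CPUs ÷ 5 vCPUs per VM = 40 VMs, whereas 200 CPUs ÷ 4 vCPUs per VM = 50 VMs.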
What is “hotplugging”?
Plugging in a device means the user intends a new device to become visible and usable to the system. “Hot” plugging indicates that the system is already running while the device is being added.
Many different device types can be hotplugged, such as memory, vCPUs, or PCIe devices. PCIe, VirtIO, and ACPI provide hotplugging interfaces with different capabilities regarding what sort of devices can be hotplugged, and under what circumstances.
Hot-unplugging is also possible, where a device is removed while the system continues to run. In the case of vCPUs, this requires all processes running on a given vCPU to migrate to other vCPUs so it can be brought offline.
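Within a Linux guest, the unplug side of this can be exercised directly through sysfs. As a simple illustration (the CPU number here is arbitrary), writing 0 to a CPU's online file causes the kernel to migrate tasks off that CPU and take it offline:
echo 0 > /sys/devices/system/cpu/cpu3/online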
Hotplug Interfaces
Various hotplug interfaces exist, but they are often not monolithic. That is, an interface at one level of abstraction may use another, lower-level interface in its implementation.
PCIe
The original PCI specification lacked the concept of adding and removing devices at runtime, though support for hotplug was added later. PCIe has included hotplug support from its inception. In recent years, the most common use cases for PCIe hotplug have shifted from the canonical adding and removing of peripheral cards in a server to hot-swapping NVMe drives and connecting Thunderbolt devices.
PCIe hotplugging is often correlated with a device being physically connected or disconnected. In a virtualization context, using PCIe hotplug directly may be less common, as exposing a virtual machine to events in the physical world runs contrary to the goals of virtualization.
VirtIO
VirtIO is a popular interface for exposing simplified devices to a virtual machine. Implementations commonly use PCIe as the underlying transport, so hotplug support is not purely native. Many VirtIO devices can be hotplugged, such as virtio-net and virtio-scsi. There is even support for memory hotplugging in virtio-mem. However, there is no VirtIO CPU device, so VirtIO isn’t helpful for CPU hotplugging.
ACPI
ACPI can be considered a lower-level interface, and provides mechanisms for the OS and firmware to work together to manage the lifecycle of a device. ACPI allows the firmware to describe the hardware and its capabilities in a way the OS can understand, and provides multiple event and device model alternatives. Handlers for different events can be defined in firmware, and the OS hands off control to the appropriate event handler when an interrupt is received.
ACPI allows devices, including CPUs, to be hotplugged and has multiple alternatives for the choice of event model. It is the primary focus of this blog entry.
CPUs
The hypervisor will present a virtual CPU to the guest that is an abstraction of the physical CPUs present on the host. These can be distinct threads of a CPU core, or physical cores themselves. When viewed from this perspective, we will refer to the physical CPU(s) as ‘CPU’ and the virtualized CPU being presented to the guest as ‘vCPU’.
Within the guest, we just refer to a vCPU as a CPU. The guest may be aware that it’s working with a virtual CPU and not a ‘real’ CPU, but for the purposes of hotplug there is no need to make a distinction.
Guest CPU States
From the guest kernel’s perspective, a CPU can be in one of many states. These states are maintained as separate bitmasks.
Possible – The entire set of CPUs that could be given to the guest. The size of this set is the max CPU count specified to QEMU.
Present – The set of CPUs that are currently plugged in and visible to the guest. Subset of possible CPUs.
Online/Offline – There is only an online bitmask, so a present CPU can be online or offline, but not both. An online CPU is available to be used by the kernel scheduler. Online is a subset of present.
Active – Active CPUs are a subset of online CPUs. The scheduler makes an online CPU active, which allows kernel tasks to be migrated to the given CPU.
Some of these CPU bitmasks are exposed read-only through sysfs. An example with maxcpus=68 in QEMU:
[root@localhost ~]# cat /sys/devices/system/cpu/possible
0-67
[root@localhost ~]# cat /sys/devices/system/cpu/present
0-23
[root@localhost ~]# cat /sys/devices/system/cpu/offline
24-67
[root@localhost ~]# cat /sys/devices/system/cpu/online
0-23
Hotplug support within the guest requires a kernel with the CONFIG_HOTPLUG_CPU option enabled. If hotplug support is not present, the Possible and Present sets will both be the set of CPUs reported by ACPI at boot, and Online and Active are likewise equivalent in that case.
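One quick way to confirm this in a guest, assuming the distribution installs the kernel configuration under /boot (as most do), is:
[root@localhost ~]# grep CONFIG_HOTPLUG_CPU= /boot/config-$(uname -r)
CONFIG_HOTPLUG_CPU=y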
Hotplugging in QEMU
In QEMU, the maximum number of vCPUs must be specified at VM launch time. A guest can start with a smaller number of vCPUs, and additional vCPUs can be hotplugged after launch, but only up to the maximum CPU count initially specified (QEMU determines a max CPU count if the user doesn’t explicitly specify it). ACPI information generated by QEMU will contain structures for the entire set of possible vCPUs, but unused vCPUs will be disabled.
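As a rough sketch, a guest could be launched with 2 of a possible 8 vCPUs and have a third vCPU hotplugged later from the QEMU monitor. The CPU model, device type, and topology IDs below are illustrative; the QMP command query-hotpluggable-cpus (or info hotpluggable-cpus in the monitor) reports the exact device type and socket/core/thread IDs a given configuration expects:
qemu-system-x86_64 -enable-kvm -machine q35 -cpu host -smp 2,maxcpus=8 ...
(qemu) device_add host-x86_64-cpu,id=vcpu2,socket-id=0,core-id=2,thread-id=0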
In QEMU, CPU 0 is enabled by default and has unique duties. Normally it cannot be hot-unplugged.
Hotplug Workflow
The hotplug workflow can be divided into two parts. Part 1 can be viewed as synchronous, in that the hypervisor waits for the guest to report back on the CPU state change. Part 2, however, is mostly opaque to the hypervisor, and involves the guest and the actual hardware. Part 2 can be thought of as asynchronous, since the hypervisor is no longer waiting for the CPU state change and can perform other work.
Part 1
Specific CPU hotplug steps may differ, depending on CPU count or whether SecureBoot is enabled, but they share a common overall workflow. ‘Hypervisor’ is used here as a blanket term that can include the host kernel, KVM, QEMU, OVMF, etc.
The user informs the hypervisor to add a vCPU to the guest
The hypervisor creates its own internal representation of the new vCPU
The hypervisor updates the ACPI event status
The hypervisor sends a System Control Interrupt (SCI) to inform the guest OS that there is a change in virtualized hardware
The guest OS executes the ACPI method for scanning CPUs (called CSCN) that it was provided at boot
The ACPI CSCN method (running in the context of the guest OS) may retrieve additional information from the hypervisor about the new CPU
ACPI will notify the guest OS (guest effectively notifies itself) that there is a new CPU
The ACPI “notify” event is handled by the ACPI drivers in the guest kernel
ACPI driver code will enable the new CPU
The guest notifies the hypervisor that the vCPU hotplug completed with ACPI OST success status
The guest kernel generates a udev event for the new CPU device
Until this point, the hypervisor was blocking on the vCPU hotplug operation. From the guest’s perspective, the new vCPU is “present” but not “online”, meaning it can’t yet perform any useful work.
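Continuing the earlier example, if a 25th vCPU were hotplugged into that guest, the sysfs masks at this point would look something like the following (values illustrative) until the new CPU is brought online:
[root@localhost ~]# cat /sys/devices/system/cpu/present
0-24
[root@localhost ~]# cat /sys/devices/system/cpu/online
0-23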
Part 2
As previously stated, the hypervisor is mostly hands-off during Part 2. The guest kernel does the majority of the work here, and will finally initialize the CPU itself.
A process in the guest (e.g., systemd-udevd in Oracle Linux) will handle the udev event generated above by registering a handler for the “online” and “offline” actions
The udev handler brings the new CPU online by writing to sysfs:
echo 1 > /sys/devices/system/cpu/cpu<number>/online
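In practice this write is usually performed by a udev rule shipped with the distribution; a minimal sketch of such a rule (illustrative, not the exact rule any particular distribution installs) looks something like:
SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}="1"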
The write to sysfs invokes kernel code that will bring the CPU online. Some notable kernel functions:
wakeup_secondary_cpu_via_init() → initializes the CPU via the INIT and SIPI interrupts
cpuhp_kick_ap() → wakes the (possibly parked) CPU
cpuset_wait_for_hotplug() → makes the CPU available as a resource to cgroups
sched_cpu_activate() → the final step, which activates the CPU and allows tasks to be migrated to it
The hypervisor is not notified when the CPU online action completes. This state change is purely within the hardware and the guest.
ACPI Hotplug/Unplug Interfaces
The ACPI hotplug interface is the mechanism by which the hypervisor works with the guest OS to add or remove a vCPU. There are a few variants of the hotplug interface, but they all accomplish the same goal.
The ACPI General Purpose Event (GPE) mechanism has supported CPU hotplug since ACPI 2.0 (released in 2000). QEMU’s “legacy” hotplug interface supports < 256 vCPUs total, and doesn’t support hot-unplug. QEMU’s “modern” hotplug interface is required for > 255 vCPUs, and for hot-unplug support. Both the legacy and modern interfaces use GPE interrupts to signal the vCPU change event to the guest.
ACPI also supports a more generic hotplug interface, through the Generic Event Device (GED), introduced in ACPI version 6.1. QEMU uses GED for memory and NVDIMM hotplug. As of QEMU version 9.1, code for vCPU hotplug using GED is in place, but is not used by either the x86-64 or arm64 architectures.
SecureBoot guests use an additional mechanism on top of GPE, which moves critical vCPU hotplug functionality into System Management Mode (SMM) on x86-64 CPUs.
Legacy Hotplug
QEMU maps vCPU hotplug to GPE.2, which is exposed to the guest as part of the FADT. QEMU will set bits in a bitmask that represents the active CPUs, set the GPE.2 event, and then raise an SCI interrupt in the guest.
The SCI interrupt handler will run in the guest, see that the GPE0 status is set, and invoke the GPE.2 handler (defined by QEMU and provided to the guest through the ACPI tables). The handler will perform MMIO reads to find the CPU bits set in the bitmask, and notify the guest OS with corresponding ACPI CPU hot-add events.
“Modern” Hotplug
QEMU’s legacy hotplug interface is inflexible because of its use of bitmasks for vCPUs (limiting the vCPU count), and it is not extensible to functionality like hot-unplug. Among other improvements, the modern interface has a dedicated register for the vCPU’s APIC ID. The interface also includes a device status field to distinguish between insert and eject (hotplug and hot-unplug) requests.
QEMU writes the vCPU’s APIC ID to a register designated the “CPU selector”, rather than a bitmap, and sets the appropriate bit in the CPU device status to indicate hotplug/hot-unplug. The guest is informed about the requested command and the CPU device status by reading distinct MMIO offsets.
SecureBoot Guests
SecureBoot guests are protected such that they only execute trusted firmware during the boot process. x86_64 System Management Mode (SMM) helps provide this isolation because it is a completely distinct operating mode of the CPU. SMM code has access to a separate address space that is inaccessible to non-SMM code. The emulated chipset is configured to allow write access to certain persistent variables only while in SMM.
The GPE.2 handler in QEMU will generate an SMI (System Management Interrupt) that puts all CPUs into SMM. OVMF (from the EDK II project) is an open-source implementation of UEFI firmware that supports SecureBoot. The SMI handler is defined in OVMF, and runs when a CPU transitions into SMM.