Simpler management of the huge zero folio

One might imagine that managing a page full of zeroes would be a relatively straightforward task; there is, after all, no data of note that must be preserved there. The management of the huge zero folio in the kernel, though, shows that life is often not as simple as it seems. Tradeoffs between conflicting objectives have driven the design of this core functionality in different directions over the years, but much of the associated complexity may be about to go away.

There are many uses for a page full of zeroes. For example, any time that a process faults in a previously unused anonymous page, the result is a newly allocated page initialized to all zeroes. Experience has shown that, often, those zero-filled pages are never overwritten with any other data, so there is efficiency to be gained by having a single zero-filled page that is mapped into a process's virtual address space whenever a new page is faulted in. The zero page is mapped copy-on-write, so if the process ever writes to that page, it will take a page fault that will cause a separate page to be allocated in place of the shared zero page. Other uses of the zero page include writing blocks of zeroes to a storage device that does not provide that functionality itself, and assembling large blocks to be written to storage when data exists for only part of those blocks.
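
The behavior a process can observe here is easy to check from user space. The sketch below maps one anonymous page, confirms that it reads as zeroes, then writes to it. The function name `anon_page_reads_zero()` is made up for this illustration, and the code cannot show whether the kernel actually backed the read with the shared zero page; it only demonstrates the visible semantics on a Linux-like system.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS with glibc */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Illustrative helper (not a kernel or libc API): returns true if a fresh
 * anonymous page reads as all zeroes and a later write stays private.
 */
static bool anon_page_reads_zero(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    assert(sz > 0);              /* a positive page size is expected */
    size_t len = (size_t)sz;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return false;

    bool ok = true;
    for (size_t i = 0; i < len; i++)   /* reads may be served by a zero page */
        if (p[i] != 0)
            ok = false;

    p[0] = 0x5a;                       /* the write needs a page of its own */
    ok = ok && p[0] == 0x5a && p[1] == 0;

    munmap(p, len);
    return ok;
}
```

Calling `anon_page_reads_zero()` from a small test program should return true on any system where anonymous mappings behave as described above.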

The advent of transparent huge pages added a new complication; now processes could fault in a PMD-sized (typically 2MB) huge page with a single operation, and the kernel had to provide a zero-filled page of that size. In response, for the 3.8 kernel release in 2012, Kirill Shutemov added a huge zero page that could be used in such situations. Now huge-page-size page faults could be handled efficiently by just mapping in the huge zero page. The only problem with this solution was that not all systems use transparent huge pages, and some only use them occasionally. When there are no huge-page users, there is no need for a zero-filled huge page; keeping one around just wastes memory.

To avoid this problem, Shutemov added lazy allocation of the huge zero page; that page would not exist in the system until an actual need for it was encountered. On top of that, he added reference counting that would keep track of just how many users of the huge zero page existed, and a new shrinker callback that would be invoked when the system is under memory pressure and looking to free memory. If that callback found that there were no actual users of the huge zero page, it would release it back to the system.
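
The three pieces described here (lazy allocation, a count of users, and a shrinker that frees the page only when nobody holds it) can be modeled in a few lines of user-space C. This is a single-threaded sketch under stated assumptions, not the kernel's code: all of the names (`huge_zero_get()`, `huge_zero_put()`, `huge_zero_shrink()`) are invented for illustration, `calloc()` stands in for the page allocator, and the real counter would have to be atomic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define HUGE_SIZE ((size_t)2 << 20)    /* 2MB, a typical PMD-sized page */

static unsigned char *huge_zero;       /* NULL until somebody needs it */
static unsigned long huge_zero_refs;   /* number of current users */

/* Take a reference, allocating the zero-filled buffer on first use. */
static unsigned char *huge_zero_get(void)
{
    if (!huge_zero) {
        huge_zero = calloc(1, HUGE_SIZE);
        if (!huge_zero)
            return NULL;               /* caller must fall back */
    }
    huge_zero_refs++;
    return huge_zero;
}

/* Drop a reference; the buffer itself stays until the shrinker runs. */
static void huge_zero_put(void)
{
    assert(huge_zero_refs > 0);
    huge_zero_refs--;
}

/*
 * Stand-in for the shrinker callback: invoked under memory pressure,
 * returns the number of bytes it managed to free.
 */
static size_t huge_zero_shrink(void)
{
    if (!huge_zero || huge_zero_refs > 0)
        return 0;                      /* absent, or still in use */
    free(huge_zero);
    huge_zero = NULL;
    return HUGE_SIZE;
}
```

In this model, a call to `huge_zero_shrink()` while a user holds a reference frees nothing; once the last `huge_zero_put()` has run, the next call releases the full 2MB.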

That seemed like a good solution; the cost of maintaining the huge zero page would only be paid when there were actual users to make that cost worthwhile. But, naturally, there was a problem. The reference count on that page is shared globally, so changes to it would bounce its cache line around the system. If a workload that created a lot of huge-page faults was running, that cache-line bouncing would measurably hurt performance. Such workloads were becoming increasingly common. As so often turns out to be the case, there was a need to eliminate that global sharing of frequently written data.

The solution to that problem was contributed to the 4.9 kernel by Aaron Lu in 2016. With this change, a process needing to take its first reference to the huge zero page would increment the reference count as usual, but it would also set a special flag (MMF_USED_HUGE_ZERO_PAGE) in its mm_struct structure. The next time that process needed the huge zero page, it would see that flag set, and simply use the page without consulting the reference count. The existence of the flag means that the process already has a reference, so there is no need to take another one.
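
A rough model of that fast path follows. It is a sketch rather than the kernel's implementation: `struct mm_model` is a hypothetical stand-in for mm_struct, its boolean field plays the role of MMF_USED_HUGE_ZERO_PAGE, and the flag is treated as private to one thread, whereas a real implementation would have to test and set it atomically.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for mm_struct; only the flag is modeled. */
struct mm_model {
    bool used_huge_zero;         /* role of MMF_USED_HUGE_ZERO_PAGE */
};

static atomic_ulong zero_refs;   /* globally shared, costly to write */

/* Called on every huge-zero-page fault in the given address space. */
static void mm_get_huge_zero(struct mm_model *mm)
{
    if (mm->used_huge_zero)
        return;                          /* fast path: no shared write */
    atomic_fetch_add(&zero_refs, 1);     /* first use by this mm only */
    mm->used_huge_zero = true;
}

/* Called once, when the address space is torn down. */
static void mm_put_huge_zero(struct mm_model *mm)
{
    if (!mm->used_huge_zero)
        return;                          /* this mm never took a reference */
    assert(atomic_load(&zero_refs) > 0);
    atomic_fetch_sub(&zero_refs, 1);
    mm->used_huge_zero = false;
}
```

However many times `mm_get_huge_zero()` runs for one address space, the shared counter is written once; it drops again only when `mm_put_huge_zero()` is called at teardown.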

This change eliminated most of the activity on the global reference count. It also meant, though, that the kernel no longer knew exactly how many references to the huge zero page exist; the reference count now only tracks how many mm_struct structures contained at least one reference at some point during their existence. The only opportunity to decrease the reference count is when one of those mm_struct structures goes away — when the process exits, in other words. So the huge zero page may be kept around when it is not actually in use; all of the processes that needed it may have dropped their references, but the kernel cannot know that all of the references have been dropped as long as the processes themselves continue to exist.

That problem can be lived with; chances are that, as long as the processes that have used the huge zero page exist, at least one of them still has it mapped somewhere. But Lu's solution inherently ties the life cycle of the huge zero page to that of the mm_struct structures that used it. As a result, the huge zero page cannot be used for operations that are not somehow tied to an mm_struct. Filesystems are one example of a place where it would be useful to have a huge zero page; they often have to zero out large ranges of blocks on an underlying storage device. But buffered I/O operations happen independently of any process's life cycle; they cannot use the huge zero page without running the risk that it might be deallocated and reused before an operation completes.

That limitation may be about to go away. As Pankaj Raghav pointed out in a patch series, the lack of a huge zero page that is usable in the filesystem context makes the addition of large block size support to filesystems like XFS less efficient than it could be. To get around this problem, a way needs to be found to give the huge zero page an even more complex sort of life cycle, one that is not tied to the life cycle of any process on the system, without reintroducing the reference-counting overhead that Lu's patch fixed.

Or, perhaps, the right solution is, instead, to do something much simpler. After renaming the huge zero page to the “huge zero folio” (reflecting how it has come to be used in any case), the patch series adds an option to just allocate the huge zero folio at boot time and keep it for the life of the system. The reference counting and the marking of mm_struct structures are unnecessary in this case, so they are not performed at all, and the kernel can count on the huge zero folio simply being there whenever it is needed. This mode is controlled by the new PERSISTENT_HUGE_ZERO_FOLIO configuration option which, following standard practice, is disabled by default.
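
The difference between the two modes can be sketched as follows. This is a user-space model under loud assumptions: the `persistent_mode` variable stands in for the PERSISTENT_HUGE_ZERO_FOLIO build option, every function name is invented for illustration, and the dynamic branch is the same simplified single-threaded counting used for contrast, not the kernel's actual logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define HUGE_SIZE ((size_t)2 << 20)     /* 2MB, a typical PMD-sized folio */

static unsigned char *zero_folio;
static unsigned long zero_folio_refs;   /* touched only in dynamic mode */
static bool persistent_mode;            /* models the config option */

/* Boot-time setup: in persistent mode, allocate once and never free. */
static bool zero_folio_init(bool persistent)
{
    persistent_mode = persistent;
    if (!persistent)
        return true;                    /* dynamic mode allocates lazily */
    zero_folio = calloc(1, HUGE_SIZE);
    return zero_folio != NULL;
}

static unsigned char *zero_folio_get(void)
{
    if (persistent_mode)
        return zero_folio;              /* no counting, no per-mm flag */
    if (!zero_folio) {
        zero_folio = calloc(1, HUGE_SIZE);
        if (!zero_folio)
            return NULL;
    }
    zero_folio_refs++;
    return zero_folio;
}

static void zero_folio_put(void)
{
    if (persistent_mode)
        return;                         /* nothing to undo */
    assert(zero_folio_refs > 0);
    zero_folio_refs--;
}
```

After `zero_folio_init(true)`, any caller, whether or not it is associated with a process, can use the pointer returned by `zero_folio_get()` without the risk that it will be freed underneath it.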

The acceptance of this series in the near future seems nearly certain. It simplifies a bit of complex logic, reduces reference-counting overhead even further, and makes the huge zero folio available in contexts where it could not be used before. The only cost is the inability to free the huge zero folio but, in current systems, chances are that this folio will be in constant use anyway. The evolution of hardware has, as a general rule, forced a lot of complexity into the software that drives it. Sometimes, though, newer hardware (and especially much larger memory capacity) also allows the removal of complexity that was driven by the constraints felt a decade or more ago.
