MMU Virtualization via Intel EPT: Technical Details

Overview

This article marks the first of five articles covering the virtualization of the memory management unit (MMU) using Intel EPT. This technology provides additional support for the virtualization of physical memory and allows hypervisors to monitor memory activity. This article will address the motivation for extended page tables, the performance concerns associated with them, and the architectural components involved in MMU virtualization. The components are covered in some detail, but most information about them is in the Intel SDM. We will not be discussing anything OS-specific in this article – just the architectural details necessary to understand for a proper implementation.

Disclaimer

Readers must have a foundational knowledge of virtual memory, paging, address translation, and page tables. This information is available in §4.1.0 of the Intel SDM, Volume 3A.

Memory and the MMU

In this rundown, we will cover some important concepts related to the memory management unit and paging. This is by no means a full discourse on paging and virtual memory for the Intel architecture, but more of an abstract overview to help the reader connect the dots a little better.

— Physical and Virtual Memory

Physical memory exists on hardware like DIMM modules. Assuming familiarity with the fundamentals of computer science, recall that an executable must be mapped into physical memory before its process can execute. Now, on modern systems, there is a secondary memory storage space called virtual memory. In a perfect world, the data required to run programs would be mapped directly into RAM, where it can be accessed quickly by the processor. Sadly, we do not live in a perfect world, and the system’s main memory can become full. Enter, stage right: virtual memory. This secondary form of memory utilizes a storage device like a hard drive to free up space in physical memory. Nevertheless, we are not concerned with virtual memory for the time being. When setting up EPT, we need to know some critical details about physical memory, first and foremost.

When a computer begins its boot sequence, the code executing on the bootstrap processor can access physical memory directly. This is because the processor is operating in real-address mode – aptly named, since addresses in real mode correspond directly to physical memory addresses. Several physical memory ranges are available for use by the OS/bootloader at this point. If we were to breakpoint a system and dump the physical memory ranges present, we would see what is called the system memory map. Below is an image of the physical memory ranges when a breakpoint was applied prior to MmInitSystem.

The image shows the physical memory ranges and their sizes. The first range, 1000h-9FFFFh, is available as general DRAM for OS consumption. This range of memory is also called low memory – sometimes DOS compatibility memory. So, what is the purpose of this drivel? During the boot sequence, the BIOS does many things, but the most relevant to this series is applying the different caching behaviors to physical memory ranges. The BIOS programs a set of memory-type range registers (MTRRs) to achieve this. These are control registers that give the system control over how specific memory ranges are cached. The details of the caching requirements vary from system to system. For the sake of example, the physical memory range 1000h-9FFFFh is write-back, whereas the range A0000h-BFFFFh is write-combined or uncached.

If you’re wondering how MTRRs are relevant, do not worry. We will get to that…

𝛿 Memory Type Range Register (MTRR)

Physical memory has ranges, and each range has a cache-control policy applied during system initialization. Why is this important? For starters, applying the proper caching policies to memory regions is vital to ensure that system performance does not degrade. If a frequently accessed region of memory is uncached, the resulting data fetches will significantly degrade system performance, because applications typically access data with high measures of locality. If data is not present in a cache, then the CPU will have to reach out to main memory to acquire it – and reaching out to main memory is slow! This matters because, when allocating memory and initializing EPT, we will have to build what’s called an MTRR map. Fortunately for us, there is already an MTRR map of the current physical memory regions that we can use as a reference.

Figure 0. MTRR encoding table (Intel SDM)

Figure 1. MTRR map on physical machine.

From the image, you might notice the ranges are quite specific – this is due to Windows using fixed-range MTRRs and some variable-range MTRRs. Armed with this information, it’s clear that applying the appropriate caching policy to our extended page tables during initialization is imperative to preserving system performance. No need to worry either, modifying and creating an MTRR map for our VM is straightforward. We will go into more detail in the next article when we build our MTRR map. See the recommended reading if you’re eager to get ahead. With this addressed, let’s talk about the purpose of the MMU and page tables.

Page Attribute Table

In addition to MTRRs, another cache control called the Page Attribute Table (PAT) allows the OS to control caching policies at a finer granularity (the page level). This cache control is detailed further in the next article.

— The MMU

Most modern processors come with a memory management unit (MMU) that provides access protection and virtual-to-physical address translation. A virtual address is, simply put, an address that software uses; a physical address is an address that hardware outputs on the address lines of the data bus. Intel architectures divide virtual memory into 4KB pages (with support for other sizes) and physical memory into 4KB frames. An MMU will typically contain a translation lookaside buffer (TLB) and will perform operations on the page table, such as hardware table walks. Some MMU architectures do not perform those operations, giving the OS the freedom to implement its page tables in whatever manner it desires. The MMU architecture also specifies caching policies for the instruction and data caches – identifying code as cacheable or non-cacheable, for example, or selecting write-back versus write-through data caching. These policies may also cover caching access rights.

MMU Split

In certain processors, the MMU is split into an Instruction Memory Management Unit (IMMU) and a Data Memory Management Unit (DMMU). The former is activated by instruction fetches and the latter by data memory operations.

The Intel 64 architecture defines a virtual address space covering 16 EiB (2^64 bytes). However, only 2^57 bytes are addressable on current processors with the new 5-level page table structure – still ~128 PiB of address space. The short and “simple” of how an MMU works is this: the MMU receives a virtual address and uses it to index into a table (the TLB or the page tables). The entries in the table provide a physical address plus some control signals that may include the caching policy and whether the entry is valid, invalid, protected, and so on. It may also receive signals indicating whether the memory referenced by the entry was accessed or modified. If the entry is valid, the virtual address is translated into a physical address; the MMU then uses the control signals to determine what type of memory transaction is occurring. The tables mentioned are similar to a directory structure, and the MMU traverses them to translate the virtual address to a physical address. On the x86-64 architecture, the MMU maps memory through a series of tables – 4 or 5, depending on software requirements.

 

Figure 2. Simplified diagram of address translation.

 

We will cover a bit about TLBs and their role in a virtualization context later. Now that we know the purpose of the MMU, let’s start talking about Intel’s EPT.

Extended Page Tables (EPT)

Intel’s Extended Page Table (EPT) technology, a form of Second Level Address Translation (SLAT), allows a VMM to configure a mapping between physical memory as it is perceived by the guest and the real physical memory. It’s similar to the standard page tables in that EPT enables the hypervisor to specify access rights for the guest’s physical pages. This allows the hypervisor to generate an event called an EPT violation when a guest attempts to access a page that is either invalid or does not have the appropriate access rights. The EPT violation is one of the events we will be taking advantage of throughout this series since it triggers a VM-exit.

Important Note

Virtualization of the IOMMU is performed by a complementary technology to EPT called VT-d. This will not be covered in this series.

This technology is extraordinarily useful. For instance, one can utilize EPT to protect the hypervisor’s code and data from malicious code attempting to modify it. This would be done by setting the access rights on the VMM’s code and data to read-only. In addition to that, if a VMM were used to whitelist certain applications, it could strip the execute permission from the remaining physical address space. This would force a VM-exit on any execution attempt, allowing the hypervisor to validate the faulting page. Just a fun thought experiment.

Enough about the potential, let’s get into the motivations for EPT and address the other various components associated…

— Motivation

One of the main motivations for extending Intel’s virtualization technology was to reduce performance loss on all VM-exits. This was achieved by adding virtual-processor identifiers (VPID) in 2008 to the Nehalem processor. It’s known to many researchers in the field that the first generation of the technology forced a flush of the translation lookaside buffer on each VMX transition. This resulted in significant performance loss when going from VMX non-root to root operation. Now, if you’re wondering what the TLB is or does do not worry – we cover it briefly in a subsection below. This performance loss also extended to VM-entries if the VMM was emulating stores to specific control registers or utilizing the invlpg instruction.

This TLB entry invalidation occurs on moves to CR3 and CR4, with some exceptions related to process-context identifiers, which we will address later. If you’re not familiar with what TLBs are, then I’d strongly suggest revisiting the address translation section in the Intel SDM. However, the next section briefly reviews them as they relate to EPT.

— Translation Lookaside Buffer (TLB)

The translation lookaside buffer (TLB) is a cache that houses mappings for virtual to physical addresses – it follows the principle of locality to reduce the number of traversals of the paging structures that the CPU needs to make when translating a virtual address. For the sake of simplification let’s look at an example of what happens during a TLB fill, hit, and miss. This will make the later explanation easier to understand. Let’s say we are performing a virtual address lookup on virtual address 0x00001ABC. This is a simplified look at what would happen in the three scenarios.

𝛿 TLB Fill

When a lookup is required for a specific virtual address, the TLB is the first stop in any address translation. However, if the TLB is empty, a sequence of steps is required to fill it and ensure faster lookups in future translations. In this case, we’re looking up the virtual address 0x00001ABC.

 

The first step (1) is that the translation unit checks the TLB to determine whether a mapping for the virtual address is available. The translation unit determines that the PTE is not in the TLB and proceeds to step two (2), loading the PTE from the page table in main memory. It uses the virtual page number, 0x00001, to index into the page table and locate the PTE. You can see that at index 1 in the page table we have the value 0xA. This value represents the physical page number (PPN), which will be used to fill the PPN field of the first TLB entry. Since the TLB is a cache of virtual-to-physical address mappings, we use the virtual page number as the tag. This achieves the mapping requirement that VPN 0x1 -> PPN 0xA. Once the TLB entry is filled, we use the physical page number, 0xA, to complete the translation, giving us the physical address 0x0000AABC. This is a simplified example of the process for a TLB fill/TLB miss. The end result is below.

𝛿 TLB Miss + Eviction

Now, what happens when our TLB is full and our virtual address does not have a mapping cached in the TLB? This is called a TLB miss + eviction, or just TLB eviction. Using the same virtual address as before, but with a filled TLB, let’s take a look at the sequence of operations to complete the address translation.

 

The first step is the same as before – the translation unit goes to the TLB to see if a mapping is available for virtual page number 1 (1). However, the TLB is full and no entry corresponds to the virtual to physical mapping for the virtual address. This means the TLB will have to evict the oldest entry (2). Let’s assume that the address translation prior to this used virtual page number 3, so the eviction will occur on the second entry with tag 0x4.

Following the eviction, the translation will continue by traversing the page table in main memory and loading the PTE corresponding to the virtual page number 1 (3). After locating the PTE for VPN 1, the evicted TLB entry is replaced with the mapping for our current virtual address (4). The physical page number would be 0xA and the tag 0x1.

And finally, the address translation uses the physical page number to complete the translation, yielding the physical address 0x0000AABC. This does not seem like a difficult or cumbersome process, but remember that real page table traversals are not this simple, and reaching out to main memory is slow! What happens if the virtual page number, in this example, is 0? If you guessed that a page-fault would occur, you’d be correct – and page-faults are horrifically slow. If you take this diagram and add all the levels of tables required for address translation, you will see that TLB misses increase overhead substantially. Below is an image of address translation using a two-entry TLB taken from this book.

So what does this have to do with EPT? Well, if you’re in a virtualized environment utilizing EPT, then the cost of TLB miss processing increases. This is because the number of operations required to translate a guest virtual address to a host physical address dramatically increases. In the worst case, the number of memory references performed by the hardware translation unit can increase six-fold over native execution. Because of this, it has become imperative for the virtualization community to reduce the frequency and cost of TLB misses as they pertain to Intel VT-x and EPT. There have been numerous research articles on reducing the length of two-dimensional page table walks, page sharing, and so on – but that’s a discussion for another time. Lucky for us, the technology has made leaps, and new mechanisms have been introduced. One of these is the virtual-processor identifier (VPID).

— Virtual Processor Identifier (VPID)

As we learned previously, flushing the TLB is a knockout for performance. Intel engineers were aware of this issue, and in 2008 introduced virtual-processor identifiers with the Nehalem architecture. The virtual-processor identifier is used as a tag for each cached linear address translation (similar to the tags in the diagrams above). This provides a way for the processor to identify (tag) the address spaces of different virtual processors. Not to mention, when VPIDs are in use, no TLB flushes occur on VM-entries or VM-exits. This has significant performance implications: when a processor attempts to access a mapping whose VPID does not match the TLB entry tag, a TLB miss occurs – whether an eviction takes place depends on the replacement policy.

When EPT and VPID are active, the logical processor may cache the physical page number that a VPID-tagged entry translates to, as well as access rights and memory type information. The same applies to VPID-tagged paging-structure entries, except the physical address points to the relevant paging structure instead of the physical page frame. It’s important to note that each guest CPU obtains a unique VPID, while the host uses VPID 0x0000. We will also come across an instruction, invvpid, that is necessary when migrating a virtual CPU to a new physical CPU. This instruction can also be used for shadowing – such as when the guest page tables are modified by the VMM or guest control registers are altered.

There is plenty of information in the Intel SDM on what may be cached when VPID/EPT is in use, as well as more detail on VPIDs. The section numbers for this information are provided in the recommended reading section. These subsections are intended to briefly introduce the terminology and features you will encounter throughout this series.

— Oh, How the (Extended Page) Tables Turn

Understanding the paging structures and address translation without virtualization in the mix can be confusing enough. Once we introduce some form of SLAT – in this case, EPT – the complexity of the caching and translation process increases. This is why, at the beginning of the article, it was recommended you have some background with the translation process. Noting that, let’s look at what the typical translation process looks like on a system using 4-level paging.

In this image, you see the usual process for translating a virtual address to a physical address without any form of SLAT. A virtual address is given to the MMU, and the TLB performs a look-up to determine if there is a VA→PA mapping. If a mapping exists, we get a TLB hit, which results in the process detailed in the TLB section above. Otherwise, we have a TLB miss and are required to walk the paging structures to get the VA→PA translation. If we introduce EPT, our memory management constructs become more complicated.

As we can see from the above picture, processors with hardware support for MMU virtualization must have extended paging caches. These caches are part of a “master TLB”, so to speak, that caches both the GVA→GPA and GPA→HPA. The hardware is able to track both of these mappings using the TLB tagging we addressed earlier, the VPID. The result of using the VPID, as mentioned earlier, is that a VM-transition does not flush the TLB. This means that the entries of various VMs can coexist without conflict in the TLB. This master TLB eliminates the need for updating any sort of shadow page tables constantly. There is a downside to this, however. Using EPT makes the virtual to host physical address translation significantly more complex – most notably if we incur a miss in the TLB. Now, this diagram does not cover the complexity very well so let’s talk about how the translation works with EPT.

Concerning the Master TLB

This “master” TLB contains both the guest virtual to guest physical mapping and guest physical to host physical mapping. It also uses a virtual-processor identifier (VPID) to determine which TLB entry belongs to what VM.

For each step in the guest translation, we have to perform a full walk of the EPT structures. When EPT is in use, the addresses in the guest’s paging structures are not used as physical addresses to reference memory – they are treated as guest-physical addresses and are pushed through the set of EPT paging structures for translation to real physical addresses. This means that when we do not have the proper TLB entry, each of the 4 guest paging-structure lookups requires its own 4-level EPT walk – 16 EPT lookups, and up to 24 total memory references, as opposed to 4 lookups in a non-virtualized environment – yikes. This is why I wanted to drive the point home that TLB misses… are bad! It’s also worth mentioning that the TLBs on modern CPUs are much larger than in previous generations. There’s more to this, as with everything, but I want this process introduced prior to implementation so you’re not wildly confused in the next article. Before concluding this article, we need to address one more topic that is vital to increasing translation speed when emulating the MMU.

— Virtual TLBs (vTLB)

We know now that virtual-to-physical address translation in a virtualized environment is costly. One way to curb this performance hit is to emulate the TLB hardware in software – this is what implementing a virtual TLB (vTLB) entails. We will not be implementing a virtual TLB in this series, but it’s worth knowing that it is a possible solution. The virtual TLB is typically a complete linear lookup table. On Intel processors, we enable the virtual TLB scheme by modifying some VMCS fields: we have to trap on #PF exceptions, VM-exit on all CR3 writes, and enable invlpg exiting. However, it’s noted in the Intel SDM that this combination may not yield the best performance. You can read more on utilizing the virtual TLB scheme in the Intel SDM §32.3.4, Volume 3C.

If you’ve made it this far, you’re prepared to begin your journey through implementing EPT in your hypervisor.

Conclusion

In this article, we covered information regarding the caching policies applied to memory and how they will be utilized when allocating our paging structures. We addressed the MMU and its purpose along with some of its components that are vital to performant address translation. We also discussed the motivations for EPT and went into more detail than I anticipated on the hardware TLB. I wanted to introduce these topics in this article so that the breadth of the future articles in this series was not overwhelming. It’s important that you understand the technical details and purpose underlying these sub-topics. Particularly important is the virtual-processor identifier and TLB operations. The abstract overview in this article should be sufficient but be prepared for more details in the coming parts.

In the next article, we will dive right into building the MTRR map and cover the page attribute table (PAT). I will provide prefabricated structures and explain the initialization of EPT in your hypervisor. We will cover identity mapping, setting up the PML4/PML5 entries for our EPTP, allocating our various page directories, and how to implement 4KB pages versus 2MB pages. In addition to that, detail will be provided on EPT violations/EPT misconfigurations and how to implement their VM-exit handlers. The easiest part will be inserting our EPTP into our VMCS. Unfortunately, the next article will only cover configuration and initialization; the following article will provide different methods of monitoring memory activity.

Aside: The IOMMU virtualization using VT-d may be attached to this series at the end, or a brief implementation in a separate article.

I apologize in advance for the potentially erratic structuring of this article. As I was writing it I realized there was a lot that might’ve been missing and started trying to find a way to naturally cover the topic. It’s been a little bit so I have to stretch my writing muscle again. As always, thanks for reading – please feel free to reach out to me on Twitter or leave a comment on the article below.

Do the recommended reading!

Recommended Reading
