Day 1: Introduction to Virtualization, Type Definitions, and Support Testing

Overview

In this article we’re going to introduce virtualization, its various forms, the terminology, and a high-level view of the abstraction that virtualization provides. We’ll also define structures to represent various architectural registers and components, and build a test function for support of virtual machine instructions. It’s common for people to represent these things with preprocessor macros; however, I find the abuse of preprocessor macros to be vague, ugly, and bad practice overall. Preprocessor macros have their place, and we will use them in our project, but sparingly. All types, flags, bits, etc., will be defined using structures or unions. In the type definition section I’ll discuss a common problem seen among driver developers and encountered by other hypervisor developers. After all type definitions are created we’ll write a quick test to determine if our machine supports VMX, and then we’ll close with recommended reading before the next big article, which is heavy on implementation details.

The recommended reading list is provided at the bottom and I strongly encourage you to read each entry in the list in its entirety, take notes, and connect the dots from the article if something was misunderstood.

Notice: At the time of writing, all information has been checked and verified against the sources provided. Any changes or corrections that come up at a later date should be forwarded to the author. However, always check the sources should the information in the article be dated.

All development took place on Windows 10 x64 (Version 1803). If you’re on a different version, higher or lower, you may experience issues/conflicts during testing. This is not a guarantee, but a warning that you should – for the sake of correctness – be on the same version of Windows and developing for the same target version.

Introduction to Intel VT-x

Virtualization is a concept that is decades old, believe it or not. There are countless examples dating back to the early 1960s, and in the late 1990s VMware made its appearance on the virtualization scene with its x86 virtualization software. The timeline is incredible, and while we aren’t going to go into it in this article, I’d recommend looking through a timeline of virtualization development to get a better grasp of its progression over the decades. We’re at the pinnacle of virtualization technology, and today we begin to take advantage of the decades of hard work and research of others.

To start let’s discuss the two principal classes of virtual machine software.

 

— Virtual-Machine Monitors (VMM or Hypervisor)

The virtual machine monitor is what virtualization software refers to as the host (also known as the VMM, or hypervisor). It controls the processors and other platform hardware, and it is an abstraction between the guest environment (guest OS) and the logical processor. A hypervisor is in charge of managing processor resources, system memory, interrupts, and I/O. It is the piece of software that gives the guest environment the impression that it is operating on physical hardware, and it is the technology that allows multiple operating systems to share a single host platform and its hardware and resources.

There are two types of hypervisors, and as stated in the overview post we’re going to write a Type 2 Hypervisor (also known as a hosted hypervisor). The other type, which won’t be covered in this series unless doing comparisons, is a Type 1 Hypervisor, otherwise referred to as a native or bare-metal hypervisor. These are incredibly complex and time consuming to write, not to mention the challenge of achieving stability on more than one machine. I’ll provide some information on the two different types below.

Type 1 Hypervisor (Baremetal/Native)

  • Operates directly on hardware of the host, and can monitor operating systems that run above the VMM.
  • Can modify boot structures, and CPU features.
  • Independent of an operating system.
  • Small, compact, with the main task of sharing and resource management for all systems operating above it.
  • Used in virtualization products like VMware ESXi Server, Microsoft Hyper-V and Xen Server.

Type 2 Hypervisor (Hosted)

  • Installed on an operating system, usually running at the most privileged level (ring 0, kernel mode), and supports guest operating systems above it.
  • Dependent on host operating system for operations.
  • Issues on base operating system affect entire system.
  • To run in kernel mode it is usually written as a device driver that virtualizes each processor using kernel-provided APIs.
  • This is primarily seen in products such as VMware Workstation, Microsoft Virtual PC, VirtualBox, and KVM.

Visual of the Type 2 Hypervisor software stack. Credit: LinuxHub

 

— Guest Environment (Guest OS or Guest)

Every virtual machine is a guest software environment (Guest OS). It consists of an operating system and application software executing above a VMM. Whenever someone refers to something as a VM or virtual machine, they’re talking about the guest environment operating independently of other virtual machines, but using the same interfaces to system resources that a physical platform provides. Those resources include memory, graphics, processor interfaces, and so on. The guest environment, in a well-written hypervisor, will be none the wiser about its virtualization status. That is, it should operate the same whether or not it’s running on a VMM.

The caveat with the guest environment is that it executes at a virtually reduced privilege level so that the VMM has control of system resources. Certain operations will trap into the VMM giving it full control of the responses, management of resources, and platform overall.

Note: VMM, and hypervisor will be used interchangeably in this article. They are synonymous.

 

— Forms Of Virtualization

Paravirtualization is the lighter-weight of the two forms of virtualization. It presents a machine-like software interface to the guest and makes no attempt to hide the fact that the guest is running in a virtualized environment. It could, for example, offer a set of hypercalls that allow the guest to send requests to the hypervisor (similar to the system call mechanism in modern operating systems). The guest would then use the hypercalls to perform privileged operations such as modifying page tables, model-specific registers, and so on. This form of virtualization is simpler and faster due to the direct communication with the VMM. It’s worth mentioning that this is not a new technology; it has been present since the early 1970s as part of IBM’s VM operating system. However, because this virtualization series is meant for security researchers and hobbyists who may not want to expose the VM API to the guest, we’ll be using the other form of virtualization: Full Virtualization.

Full Virtualization is the form of virtualization that attempts to trick the guest into believing it is running on physical hardware and that it has control of the entire system. It is known as complete abstraction, and the guest operating systems don’t know about the presence of a hypervisor. There is no exposed virtual machine API or set of hypercalls to communicate with the hypervisor. It is independent, and no modifications to the guest environment are needed. This form of virtualization allows the VMM full control over the behavior of the guest environment. For instance, if the guest wanted to write to a control register, the write would cause a VM exit and trap into the hypervisor for validation and completion of the operation. This also allows the VMM to discard modifications it deems malicious or disruptive to VMM operation.

It’s important to note that a performance penalty is incurred in full virtualization because of the VMM’s responsibility to transition between root and non-root operation, as well as managing both physical and virtual resources such as the CPU, memory, and I/O.

 

— VMX Operation

On Intel processors, support for virtualization is provided as a form of processor operation called VMX operation. During VMX operation there are two operation states: root operation and non-root operation. One can venture to guess that the hypervisor will run in root operation, and the guest will run in non-root operation. Much like the transitions between user and kernel mode when a VMM is not present, the transitions between these two states are called VMX transitions. There are two kinds of VMX transitions, and both are covered in this introduction: VM entries and VM exits. To enter non-root operation (the guest) the VMM performs a VM entry, and a VM exit leaves non-root operation and returns to root operation (the VMM).

It’s important to note that VM entries are performed by the VMM using instructions (VMLAUNCH and VMRESUME) that are only available while in VMX operation. If a guest attempts to perform a VM entry it will cause a VM exit and transition into VMX root operation so the VMM can handle the use of the privileged instruction. If software attempts to execute the VMX instructions outside of VMX operation it’ll be met with a #UD (invalid opcode exception); VMXON is the exception, since it is the instruction that enters VMX operation once its prerequisites are met. The VMX instructions are VMXON, VMXOFF, VMLAUNCH, VMRESUME, VMPTRLD, VMPTRST, VMCLEAR, VMREAD, VMWRITE, VMCALL, VMFUNC, INVEPT, and INVVPID. We won’t go into extreme detail on them until the 3rd article in this series, when we enter VMX operation ourselves.

By the end of this series you will know the details of each of these instructions and their uses, as well as what happens when you don’t use them properly. To provide a little more detail on the implementation of hypercalls in paravirtualization, one could use the vmcall instruction to trap into the hypervisor. The hypervisor would then be given an exit reason indicating that vmcall was used, and would extract data from the guest’s registers to determine how to handle the request. This is one of the primary ways hypercalls are implemented, and it’s very similar to the syscall mechanism. Don’t worry if this doesn’t make sense just yet; once you understand how VM exits are handled by the VMM it’ll all come together.
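To make the vmcall flow described above a little more concrete, here is a minimal sketch of what a hypercall dispatcher might look like once the exit handler has determined that the guest executed vmcall. Everything in it is an assumption made for illustration: the guest_registers structure, the hypercall numbers, the status values, and the convention of passing the hypercall number in RCX and arguments in RDX and R8 are hypothetical, and are not the interface this series will define.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hypercall numbers, chosen only for this sketch. */
enum hypercall_id
{
  HYPERCALL_PING = 1,
  HYPERCALL_ADD = 2
};

/* Hypothetical status codes handed back to the guest in RAX. */
enum hypercall_status
{
  HYPERCALL_OK = 0,
  HYPERCALL_UNKNOWN = 1
};

/* Hypothetical snapshot of the guest registers saved on VM exit. */
struct guest_registers
{
  uint64_t rax; /* status returned to the guest */
  uint64_t rcx; /* hypercall number (assumed convention) */
  uint64_t rdx; /* first argument, also holds the result */
  uint64_t r8;  /* second argument */
};

/* Called once the exit reason identifies vmcall. It edits the saved
   registers so the guest sees the result when it resumes. */
void handle_vmcall( struct guest_registers *regs )
{
  assert( regs != NULL );

  switch ( regs->rcx )
  {
  case HYPERCALL_PING:
    regs->rdx = 1;                 /* tell the guest a VMM answered */
    regs->rax = HYPERCALL_OK;
    break;
  case HYPERCALL_ADD:
    regs->rdx = regs->rdx + regs->r8;
    regs->rax = HYPERCALL_OK;
    break;
  default:
    regs->rax = HYPERCALL_UNKNOWN; /* reject anything unrecognized */
    break;
  }
}
```

A real handler would also have to advance the guest instruction pointer past the vmcall before resuming the guest, otherwise the guest would execute the same instruction again.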

To continue our introduction to VMX operation, let’s discuss the state transitions of a VMM and when they occur. Below is a diagram representing the interactions between a VMM and the Guest OS. We’re only showing one guest in this diagram to reduce complexity and the possibility of confusion; there may be more than one on any given platform.

 

Figure 1. Interactions between VMM and Guest OS

The diagram above depicts the initialization and exiting of VMX operation as well as the transitions from non-root to root operation (VM exit), and root to non-root operation (VM entry). Going from left to right, you’ll see that the VMXON instruction is executed first. This instruction puts the current logical processor into VMX operation. While the software is in VMX operation the hypervisor can use certain instructions to perform a VM entry and begin non-root operation. A guest can directly or indirectly cause a VM exit, which transfers control to the hypervisor’s entry point (the exit handler, detailed later on) so the hypervisor can handle the cause of the VM exit and transfer control back to the guest. During VMX operation, if the VMM decides to shut itself down and exit VMX operation it will execute VMXOFF. This is the flow of operation in a hypervisor, and it is used to control resources, responses, and guest behavior, all while the guest believes it is executing on real hardware.
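The flow in Figure 1 can be summarized as a small state machine. The sketch below is a toy model under simplifying assumptions, not a description of processor internals: the state and event names are hypothetical, and an event that does not fit the current state is simply rejected, whereas real hardware would respond with a fault or a VM exit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three situations shown in Figure 1. */
enum vmx_state
{
  VMX_STATE_OFF,      /* outside VMX operation */
  VMX_STATE_ROOT,     /* VMX root operation: the VMM runs here */
  VMX_STATE_NON_ROOT  /* VMX non-root operation: the guest runs here */
};

enum vmx_event
{
  VMX_EVENT_VMXON,
  VMX_EVENT_VM_ENTRY,
  VMX_EVENT_VM_EXIT,
  VMX_EVENT_VMXOFF
};

/* Returns 1 and updates *state when the event is part of the flow
   for the current state; returns 0 and leaves *state alone otherwise. */
int vmx_step( enum vmx_state *state, enum vmx_event event )
{
  assert( state != NULL );

  switch ( *state )
  {
  case VMX_STATE_OFF:
    if ( event == VMX_EVENT_VMXON ) { *state = VMX_STATE_ROOT; return 1; }
    break;
  case VMX_STATE_ROOT:
    if ( event == VMX_EVENT_VM_ENTRY ) { *state = VMX_STATE_NON_ROOT; return 1; }
    if ( event == VMX_EVENT_VMXOFF ) { *state = VMX_STATE_OFF; return 1; }
    break;
  case VMX_STATE_NON_ROOT:
    if ( event == VMX_EVENT_VM_EXIT ) { *state = VMX_STATE_ROOT; return 1; }
    break;
  }

  return 0;
}
```

Walking the model through VMXON, VM entry, VM exit, and VMXOFF returns it to the starting state, which mirrors reading the diagram from left to right.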

There are a variety of ways to perform a VM entry, and countless conditions that can result in a VM exit. The conditions for VM exits are primarily controlled by a group of VM-execution controls that live in the virtual machine control structure (VMCS). We won’t get into those details until later in the series. If you’re interested in learning more ahead of time, see the recommended reading section at the bottom. This was meant to be an introduction to the different types of virtualization, how VMX operation is entered and exited, and how the hypervisor controls system resources through transitional instructions and conditions. More detail will be provided in subsequent articles. For now, let’s move on to putting together type definitions to make our lives easier later in the development process.

 

Type Definitions

When I refer to type definitions I’m talking about the structures that represent various processor-specific components like model-specific registers, control registers, rflags, cpuid feature information, segment descriptors, selectors, etc. This is a long section and all the structure definitions are provided below; however, I recommend that you use the references on keywords in this section to learn about the different components our structures represent and their uses in a standard machine as well as a virtual machine. Writing these is an all-day task, so I’ve provided the definitions to save readers the time of writing their own and sifting through the Intel SDM to verify correct alignment and bit field size.

 

— Bit fields

Before we dive into defining our structures I want to link a few resources on bit fields for those who aren’t familiar with them. If you’re comfortable with bit fields and proficient with using them, feel free to ignore these resources.

 

— Off-by-one errors

If you’re familiar with bit fields you know that incorrectly constructing a bit field with a flag/value that has the incorrect number of bits can cause serious instability if not completely break the operation of the device or software using the bit field. Make sure your definitions use the same number of bits as mine, otherwise you’re likely going to have problems with initialization or later on when reading specific flags/values from these structures.
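One way to guard against this kind of mistake is to check a definition against a known bit position. The sketch below is an illustration under assumptions: it copies the IA32_EFER layout defined later in this article into a hypothetical union named efer_sketch, swaps unsigned __int64 for uint64_t so it builds outside of MSVC, and relies on the compiler allocating bit fields from the least significant bit upward, which is what MSVC, GCC, and Clang do on x86-64 but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical copy of the IA32_EFER layout used for the check. */
union efer_sketch
{
  uint64_t control;
  struct
  {
    uint64_t syscall_enable : 1;
    uint64_t reserved_0 : 7;
    uint64_t long_mode_enable : 1;
    uint64_t reserved_1 : 1;
    uint64_t long_mode_active : 1;
    uint64_t execute_disable : 1;
    uint64_t reserved_2 : 52;
  } bits;
};

/* Catches widths that add up to more than 64 bits, because the
   struct would then spill into a second 64-bit unit. */
static_assert( sizeof( union efer_sketch ) == sizeof( uint64_t ),
               "bit field widths exceed 64 bits" );

/* Catches a field that sits at the wrong position: set one raw bit
   and confirm that only the matching named field reads back as 1. */
int efer_layout_is_sane( void )
{
  union efer_sketch efer = { 0 };

  efer.control = 1ull << 11; /* execute-disable enable is bit 11 */

  return efer.bits.execute_disable == 1 &&
         efer.bits.long_mode_active == 0 &&
         efer.bits.long_mode_enable == 0;
}
```

Note that the size check alone is not enough. Widths that add up to fewer than 64 bits still fit in a single 64-bit unit, so a field that is one bit too narrow only shows up in a position check like the second one.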

 

— The Definitions

For a hypervisor you need to have structures for representing the following:

  • Model-specific registers (VMX related, and some architectural)
  • Control Registers (cr0, cr2, cr3, cr4, cr8)
  • RFLAGS
  • CPUID (the feature flags, particularly)
  • Debug Registers (dr0-dr3, dr6, and dr7)
  • VM Execution Controls

The below structures will be defined later on…

  • VMM Context
  • VCPU Context (Virtual CPU, representing the current logical processor)
  • VMX Fields
  • Special Registers (GDTR, LDTR, IDTR)
  • Segment Information

We’re going to start with definitions of the various MSRs and briefly discuss each one and its importance. You should be creating your own project structure that separates these definitions into their respective headers for use throughout the project. I seriously recommend against putting them all in one header.

IA32_EFER_MSR (0xC0000080)

union __ia32_efer_t
{
  unsigned __int64 control;
  struct
  {
    unsigned __int64 syscall_enable : 1;
    unsigned __int64 reserved_0 : 7;
    unsigned __int64 long_mode_enable : 1;
    unsigned __int64 reserved_1 : 1;
    unsigned __int64 long_mode_active : 1;
    unsigned __int64 execute_disable : 1;
    unsigned __int64 reserved_2 : 52;
  } bits;
};

This MSR is useful for determining processor operation mode and whether the execute-disable bits for paging structures are available.

IA32_FEATURE_CONTROL_MSR (0x3A)

union __ia32_feature_control_msr_t
{
  unsigned __int64 control;
  struct
  {
    unsigned __int64 lock : 1;
    unsigned __int64 vmxon_inside_smx : 1;
    unsigned __int64 vmxon_outside_smx : 1;
    unsigned __int64 reserved_0 : 5;
    unsigned __int64 senter_local : 7;
    unsigned __int64 senter_global : 1;
    unsigned __int64 reserved_1 : 1;
    unsigned __int64 sgx_launch_control_enable : 1;
    unsigned __int64 sgx_global_enable : 1;
    unsigned __int64 reserved_2 : 1;
    unsigned __int64 lmce : 1;
    unsigned __int64 system_reserved : 43;
  } bits;
};

This is used prior to entering VMX operation, and without the bit fields the function enabling VMX operation would be littered with preprocessor defines. The three fields of interest are lock, vmxon_inside_smx, and vmxon_outside_smx.
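As a rough sketch of how those three fields might be consulted, the function below classifies a raw IA32_FEATURE_CONTROL value. The union is a trimmed-down, hypothetical stand-in for the full definition above, the enum and function names are made up for this example, and the raw value is assumed to have been read elsewhere (in a Windows driver that would typically be done with the __readmsr intrinsic).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical trimmed copy that keeps only the three fields of interest. */
union feature_control_sketch
{
  uint64_t control;
  struct
  {
    uint64_t lock : 1;
    uint64_t vmxon_inside_smx : 1;
    uint64_t vmxon_outside_smx : 1;
    uint64_t remaining_bits : 61;
  } bits;
};

static_assert( sizeof( union feature_control_sketch ) == sizeof( uint64_t ),
               "bit field widths exceed 64 bits" );

enum vmxon_readiness
{
  VMXON_READY,        /* locked with VMXON outside SMX enabled */
  VMXON_NEEDS_SETUP,  /* not locked yet: the MSR can still be written */
  VMXON_LOCKED_OFF    /* locked with VMXON outside SMX disabled */
};

/* Classifies a raw IA32_FEATURE_CONTROL value. */
enum vmxon_readiness classify_feature_control( uint64_t raw )
{
  union feature_control_sketch fc;

  fc.control = raw;

  if ( fc.bits.lock == 0 )
    return VMXON_NEEDS_SETUP;

  return fc.bits.vmxon_outside_smx ? VMXON_READY : VMXON_LOCKED_OFF;
}
```

The locked-off case usually means the firmware disabled virtualization, and software cannot change it without a reset because the MSR can no longer be written once the lock bit is set.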

IA32_VMX_MISC_MSR (0x485)

union __vmx_misc_msr_t
{
  unsigned __int64 control;
  struct
  {
    unsigned __int64 vmx_preemption_tsc_rate : 5;
    unsigned __int64 store_lma_in_vmentry_control : 1;
    unsigned __int64 activate_state_bitmap : 3;
    unsigned __int64 reserved_0 : 5;
    unsigned __int64 pt_in_vmx : 1;
    unsigned __int64 rdmsr_in_smm : 1;
    unsigned __int64 cr3_target_value_count : 9;
    unsigned __int64 max_msr_vmexit : 3;
    unsigned __int64 allow_smi_blocking : 1;
    unsigned __int64 vmwrite_to_any : 1;
    unsigned __int64 interrupt_mod : 1; 
    unsigned __int64 reserved_1 : 1;
    unsigned __int64 mseg_revision_identifier : 32;
  } bits;
};

IA32_VMX_BASIC_MSR (0x480)

union __vmx_basic_msr_t
{
  unsigned __int64 control;
  struct
  {
    unsigned __int64 vmcs_revision_identifier : 31;
    unsigned __int64 always_0 : 1;
    unsigned __int64 vmxon_region_size : 13;
    unsigned __int64 reserved_1 : 3;
    unsigned __int64 vmxon_physical_address_width : 1;
    unsigned __int64 dual_monitor_smi : 1;
    unsigned __int64 memory_type : 4;
    unsigned __int64 io_instruction_reporting : 1;
    unsigned __int64 true_controls : 1;
  } bits;
};

This MSR is used when we begin initializing our VMXON and VMCS region. All of this will be detailed in Day 3 and Day 4 of the series.
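For a feel of how this definition gets used, here is a small sketch that pulls the two fields most relevant to region setup out of a raw IA32_VMX_BASIC value. It is an assumption-laden illustration: the union is a shortened, hypothetical copy that lumps everything above bit 44 into one field, the function names are invented, and the sample value in the note afterwards is made up rather than read from a real processor.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shortened copy of the IA32_VMX_BASIC layout. */
union vmx_basic_sketch
{
  uint64_t control;
  struct
  {
    uint64_t vmcs_revision_identifier : 31;
    uint64_t always_0 : 1;
    uint64_t vmxon_region_size : 13;
    uint64_t remaining_bits : 19;
  } bits;
};

static_assert( sizeof( union vmx_basic_sketch ) == sizeof( uint64_t ),
               "bit field widths exceed 64 bits" );

/* Revision identifier reported by the processor (bits 30:0). */
uint32_t vmx_basic_revision( uint64_t raw )
{
  union vmx_basic_sketch basic;

  basic.control = raw;
  return ( uint32_t )basic.bits.vmcs_revision_identifier;
}

/* Number of bytes to allocate for the VMXON and VMCS regions (bits 44:32). */
uint32_t vmx_basic_region_size( uint64_t raw )
{
  union vmx_basic_sketch basic;

  basic.control = raw;
  return ( uint32_t )basic.bits.vmxon_region_size;
}
```

With a made-up raw value of 0x0000100000000012, the first function yields 0x12 and the second yields 0x1000 (4096 bytes).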

IA32_VMX_PINBASED_CTL_MSR (0x481)

union __vmx_pinbased_control_msr_t
{
  unsigned __int64 control;
  struct
  {
    unsigned __int64 external_interrupt_exiting : 1;
    unsigned __int64 reserved_0 : 2;
    unsigned __int64 nmi_exiting : 1;
    unsigned __int64 reserved_1 : 1;
    unsigned __int64 virtual_nmis : 1;
    unsigned __int64 vmx_preemption_timer : 1;
    unsigned __int64 process_posted_interrupts : 1;
  } bits;
};

IA32_VMX_PRIMARY_PROCESSOR_BASED_CTL_MSR (0x482)

union __vmx_primary_processor_based_control_t
{
  unsigned __int64 control;
  struct
  {
    unsigned __int64 reserved_0 : 2;
    unsigned __int64 interrupt_window_exiting : 1;
    unsigned __int64 use_tsc_offsetting : 1;
    unsigned __int64 reserved_1 : 3;
    unsigned __int64 hlt_exiting : 1;
    unsigned __int64 reserved_2 : 1;
    unsigned __int64 invldpg_exiting : 1;
    unsigned __int64 mwait_exiting : 1;
    unsigned __int64 rdpmc_exiting : 1;
    unsigned __int64 rdtsc_exiting : 1;
    unsigned __int64 reserved_3 : 2;
    unsigned __int64 cr3_load_exiting : 1;
    unsigned __int64 cr3_store_exiting : 1;
    unsigned __int64 reserved_4 : 2;
    unsigned __int64 cr8_load_exiting : 1;
    unsigned __int64 cr8_store_exiting : 1;
    unsigned __int64 use_tpr_shadow : 1;
    unsigned __int64 nmi_window_exiting : 1;
    unsigned __int64 mov_dr_exiting : 1;
    unsigned __int64 unconditional_io_exiting : 1;
    unsigned __int64 use_io_bitmaps : 1;
    unsigned __int64 reserved_5 : 1;
    unsigned __int64 monitor_trap_flag : 1;
    unsigned __int64 use_msr_bitmaps : 1;
    unsigned __int64 monitor_exiting : 1;
    unsigned __int64 pause_exiting : 1;
    unsigned __int64 active_secondary_controls : 1;
  } bits;
};

IA32_VMX_SECONDARY_PROCESSOR_BASED_CTL_MSR (0x48B)

union __vmx_secondary_processor_based_control_t
{
  unsigned __int64 control;
  struct
  {
    unsigned __int64 virtualize_apic_accesses : 1;
    unsigned __int64 enable_ept : 1;
    unsigned __int64 descriptor_table_exiting : 1;
    unsigned __int64 enable_rdtscp : 1;
    unsigned __int64 virtualize_x2apic : 1;
    unsigned __int64 enable_vpid : 1;
    unsigned __int64 wbinvd_exiting : 1;
    unsigned __int64 unrestricted_guest : 1;
    unsigned __int64 apic_register_virtualization : 1;
    unsigned __int64 virtual_interrupt_delivery : 1;
    unsigned __int64 pause_loop_exiting : 1;
    unsigned __int64 rdrand_exiting : 1;
    unsigned __int64 enable_invpcid : 1;
    unsigned __int64 enable_vmfunc : 1;
    unsigned __int64 vmcs_shadowing : 1;
    unsigned __int64 enable_encls_exiting : 1;
    unsigned __int64 rdseed_exiting : 1;
    unsigned __int64 enable_pml : 1;
    unsigned __int64 use_virtualization_exception : 1;
    unsigned __int64 conceal_vmx_from_pt : 1;
    unsigned __int64 enable_xsave_xrstor : 1;
    unsigned __int64 reserved_0 : 1;
    unsigned __int64 mode_based_execute_control_ept : 1;
    unsigned __int64 reserved_1 : 2;
    unsigned __int64 use_tsc_scaling : 1;
  } bits;
};

IA32_VMX_EXIT_CTL_MSR (0x483)

union __vmx_exit_control_t
{
  unsigned __int64 control;
  struct
  {
    unsigned __int64 reserved_0 : 2;
    unsigned __int64 save_dbg_controls : 1;
    unsigned __int64 reserved_1 : 6;
    unsigned __int64 host_address_space_size : 1;
    unsigned __int64 reserved_2 : 2;
    unsigned __int64 load_ia32_perf_global_control : 1;
    unsigned __int64 reserved_3 : 2;
    unsigned __int64 ack_interrupt_on_exit : 1;
    unsigned __int64 reserved_4 : 2;
    unsigned __int64 save_ia32_pat : 1;
    unsigned __int64 load_ia32_pat : 1;
    unsigned __int64 save_ia32_efer : 1;
    unsigned __int64 load_ia32_efer : 1;
    unsigned __int64 save_vmx_preemption_timer_value : 1;
    unsigned __int64 clear_ia32_bndcfgs : 1;
    unsigned __int64 conceal_vmx_from_pt : 1;
  } bits;
};

IA32_VMX_ENTRY_CTL_MSR (0x484)

union __vmx_entry_control_t
{
  unsigned __int64 control;
  struct
  {
    unsigned __int64 reserved_0 : 2;
    unsigned __int64 load_dbg_controls : 1;
    unsigned __int64 reserved_1 : 6;
    unsigned __int64 ia32e_mode_guest : 1;
    unsigned __int64 entry_to_smm : 1;
    unsigned __int64 deactivate_dual_monitor_treament : 1;
    unsigned __int64 reserved_3 : 1;
    unsigned __int64 load_ia32_perf_global_control : 1;
    unsigned __int64 load_ia32_pat : 1;
    unsigned __int64 load_ia32_efer : 1;
    unsigned __int64 load_ia32_bndcfgs : 1;
    unsigned __int64 conceal_vmx_from_pt : 1;
  } bits;
};

All of the definitions above, starting from IA32_VMX_PINBASED_CTL_MSR, describe controls that determine which conditions trap into the VMM, how certain processor features behave (such as TSC offsetting and scaling), and which operations the processor performs upon entering root or non-root operation. All of these will be detailed as they are used throughout the series. For now it’s enough to know that the pin-based and processor-based controls are referred to as VM-execution controls, while the exit and entry controls govern VM exits and VM entries respectively.

Note: The following definitions are for architectural registers or helper structures, the definition labels will be linked to a gist with their explanation below the link to conserve space.

CR0 Definition

CR0 is primarily used to disable caching or paging, as well as enable protected mode for a processor. We won’t be touching this control register, however, it is useful to have defined should you decide to modify CPU operation.

CR4 Definition

CR4 is used in a variety of places when writing a hypervisor, most notably when we intend to enter VMX operation we must set the vmx_enable bit and write the new control value back to CR4 for vmxon to work properly.
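The read-modify-write described above might look roughly like the following sketch. It is a stand-in under assumptions: the union is hypothetical and only names the one bit needed here (the VMX-enable bit is bit 13 of CR4), and the function works on a plain value so it can be tried anywhere. In a driver, the value would come from and go back to the register through intrinsics such as __readcr4 and __writecr4.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical minimal view of CR4 that only names the VMX-enable bit. */
union cr4_sketch
{
  uint64_t control;
  struct
  {
    uint64_t lower_bits : 13;  /* bits 0-12, untouched here */
    uint64_t vmx_enable : 1;   /* bit 13 */
    uint64_t upper_bits : 50;  /* bits 14-63, untouched here */
  } bits;
};

static_assert( sizeof( union cr4_sketch ) == sizeof( uint64_t ),
               "bit field widths exceed 64 bits" );

/* Returns the CR4 value with the VMX-enable bit set and every
   other bit left exactly as it was. */
uint64_t cr4_with_vmx_enabled( uint64_t raw_cr4 )
{
  union cr4_sketch cr4;

  cr4.control = raw_cr4;
  cr4.bits.vmx_enable = 1;

  return cr4.control;
}
```

Because only the named field is assigned, the rest of the register value passes through unchanged, which is the main convenience of the union over hand-written masks.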

CR3 Definition

CR3 isn’t going to be used in our basic hypervisor, however, if you decide to implement your own memory management facilities having a CR3 definition will simplify a lot of bit manipulation that would be required otherwise.

CR8

union __cr8_t
{
  unsigned __int64 control;
  struct
  {
    unsigned __int64 task_priority_level : 4;
    unsigned __int64 reserved : 60;
  } bits;
};

While somewhat unnecessary, for the sake of convention I wanted to provide definitions for CR8. They won’t be used in our hypervisor, but can be useful if you decide to implement other features available.

Debug Register Definitions

Debug register definitions are important when dealing with exception bitmaps, particularly if you want to build a hypervisor level debugger. I won’t go into much detail on the debug registers since there is an entire section dedicated to them in the Intel SDM. There are also comments in this definition that were meant to help simplify the information in the manual. You can find information on these registers in the Intel SDM Chapter 17, Section 2 of Volume 3A.

CPUID Definition

The CPUID structure provided is to make it easy on the developer when attempting to read specific cpu identification information. An example you’ll see in the driver initialization test will use the cpuid intrinsic to obtain information about supported features that will be reported back through the feature_ecx union and allow the author to check whether the virtual_machine_extensions bit is set indicating support on the CPU.

RFLAGS Definition

Having the RFLAGS register represented by a union is incredibly useful since some VMX instructions will set flags in the RFLAGS register if an error is encountered. This will be detailed more once we get past entering VMX operation and begin writing our VM exit handler and implement good error handling.
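As a preview of what that error handling could look like, the sketch below classifies the outcome of a VMX instruction from a saved RFLAGS value. VMX instructions report failure by setting the carry flag when no current VMCS is available to hold an error code, and by setting the zero flag when an error code has been recorded in the current VMCS. The union is a hypothetical cut-down view of RFLAGS that only names those two flags, and the enum and function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cut-down view of RFLAGS with only CF and ZF named. */
union rflags_sketch
{
  uint64_t control;
  struct
  {
    uint64_t carry_flag : 1;   /* bit 0 */
    uint64_t bits_1_to_5 : 5;  /* not needed for this check */
    uint64_t zero_flag : 1;    /* bit 6 */
    uint64_t upper_bits : 57;  /* bits 7-63 */
  } bits;
};

static_assert( sizeof( union rflags_sketch ) == sizeof( uint64_t ),
               "bit field widths exceed 64 bits" );

enum vmx_outcome
{
  VMX_OUTCOME_SUCCESS,       /* CF = 0 and ZF = 0 */
  VMX_OUTCOME_FAIL_INVALID,  /* CF = 1: failed, no error code available */
  VMX_OUTCOME_FAIL_VALID     /* ZF = 1: failed, error code in the VMCS */
};

/* Classifies the result of a VMX instruction from the RFLAGS value
   captured immediately after it executed. */
enum vmx_outcome classify_vmx_rflags( uint64_t raw_rflags )
{
  union rflags_sketch rflags;

  rflags.control = raw_rflags;

  if ( rflags.bits.carry_flag )
    return VMX_OUTCOME_FAIL_INVALID;
  if ( rflags.bits.zero_flag )
    return VMX_OUTCOME_FAIL_VALID;

  return VMX_OUTCOME_SUCCESS;
}
```

The sample values to try are 0x2 (success, since bit 1 of RFLAGS always reads as 1), 0x3 (carry set), and 0x42 (zero set).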

Support Test

In this section we’re going to use our definitions to write a function that will be called in the driver entry point to determine if virtual machine extensions are supported on the current CPU. This will employ our latest structure definition for CPUID, and give you a general idea of how and when testing for support should be done. Note that testing for support of virtual machine extensions is only required on a single physical processor.

This is the implementation of our test function to determine if virtual machine extensions are supported.

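#include <intrin.h>

// Returns non-zero if CPUID reports VMX support (CPUID.1:ECX.VMX, bit 5).
int VmHasCpuidSupport( void )
{
  union __cpuid_t cpuid = { 0 };

  // Leaf 1 returns processor feature information in EAX, EBX, ECX, and EDX.
  __cpuid( cpuid.cpu_info, 1 );

  return cpuid.feature_ecx.virtual_machine_extensions;
}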

This function executes cpuid with function ID 1 to get processor feature information. Using our cpuid definition from earlier we can easily test if virtual machine extensions are supported and report back to our driver that we are all systems go for initializing the VM. This function will be used in our driver entry point, and in the next article we’ll build out what happens when VMX is supported and when it is not.
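Since the CPUID definition itself is not reproduced in this article, the sketch below shows one way such a union could be laid out so that the check above works. It is a guess made for illustration and not the author’s actual definition: the names cpuid_sketch, regs, low_features, and high_features are hypothetical, only the VMX bit (bit 5 of ECX for function ID 1) is named, and int is assumed to be 32 bits wide. The function takes the four register values as input so the logic can be exercised without executing cpuid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the raw array matches what the __cpuid
   intrinsic fills in (EAX, EBX, ECX, EDX), and the struct gives
   names to the same 16 bytes. */
union cpuid_sketch
{
  int cpu_info[4];
  struct
  {
    uint32_t eax;
    uint32_t ebx;
    struct
    {
      uint32_t low_features : 5;                /* ECX bits 0-4 */
      uint32_t virtual_machine_extensions : 1;  /* ECX bit 5 */
      uint32_t high_features : 26;              /* ECX bits 6-31 */
    } feature_ecx;
    uint32_t edx;
  } regs;
};

static_assert( sizeof( union cpuid_sketch ) == 4 * sizeof( uint32_t ),
               "cpuid view must cover exactly four 32-bit registers" );

/* Returns 1 if the supplied function ID 1 results report VMX support. */
int leaf1_reports_vmx( const int cpu_info[4] )
{
  union cpuid_sketch cpuid;
  int i;

  assert( cpu_info != NULL );

  for ( i = 0; i < 4; i++ )
    cpuid.cpu_info[i] = cpu_info[i];

  return cpuid.regs.feature_ecx.virtual_machine_extensions;
}
```

Keeping the raw array and the named view in one union is what lets the result of the intrinsic be read as a named flag without any shifting or masking in the caller.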

Conclusion

In this article you were introduced to virtualization of the processor using Intel’s virtualization technology, the different forms of virtualization, terminology, and the flow of execution when running under VMX operation. You also defined important structures to represent various architectural registers and VMX controls to make development much simpler and cleaner. In the next article we will be doing more coding and building upon our existing project. We’ll start with allocating important VMM and VCPU contexts, then discuss the requirements that must be met before entering VMX operation, allocate the VMCS and VMXON regions, walk through the process of enabling and entering VMX operation, and learn the ins and outs of virtual machine control structures.

If you had trouble understanding certain parts of this article, please let me know so I can clarify or elaborate more. And as always feel free to leave a comment, provide feedback, or share this if you enjoyed it.

Recommended Reading

The items in this section are strongly recommended to be read and understood. It’s critical to your success in hypervisor development. There are no shortcuts, and my articles only scratch the surface of a very large topic. The goal of these articles is not to give every single detail of every nuance, but to provide a foundation for learning how to write your own hypervisor and understand the fundamentals of VMX.

daax

Independent security researcher. Focus in hypervisor development, Windows internals, and device driver development. Feel free to reach out to me on Twitter: @daax_r
