Errata Or Nah?
Over the last 2-3 years, Microsoft has inserted various methods of virtualization introspection detection (big brain words) into the workings of patchguard. It shouldn’t come as a surprise that this has happened, as subverting kernel patch protection is a breeze when the attacker code is running at a higher privilege level. While Windows obviously runs just fine under a hypervisor, and has an open paravirtualization interface, patchguard is looking for signs that the vmm is tampering with state in ways that aren’t necessary for a functional virtual machine. For instance, attempting to hook system calls by hiding the true value of the MSRs that control their branch targets, or exploiting nested paging to gain execution at critical control paths.
While patchguard contains more mechanisms to detect these types of introspection than are presented in this post, the author has chosen his favorites because they are of a peculiar nature. It can be an exercise for the reader to find more 😉 It is the intention of this article to aid in software interoperability between security, anti-virus and introspection tools and kernel patch protection.
First on our list is KiErrata704Present. Upon first glance, the naming convention of these functions seems innocent, and to the untrained eye, might actually look like it’s legitimately checking for some kind of meme errata. Let’s break this function down:
A little background: certain ancient forms of privilege transitioning, like SYSENTER and call gates, allowed the caller to essentially single step over the opcode. This wasn’t quite optimal because the single step #DB would be delivered after the branch is complete. The kernel would then need to keep note of this so it could IRET to the caller, to continue the single step operation after handling the system call. The introduction of SYSCALL/SYSRET addressed this problem with the FMASK MSR. This MSR lets OS developers have finer control over how SYSCALL handles RFLAGS when it’s executed. Any sane OS is going to ensure that IF and TF are masked off with this MSR. In addition, SYSRET was crafted specially so that if it loads an RFLAGS image with TF set, it will raise the #DB on the following instruction boundary, as opposed to how IRET applies it to the boundary after its branch target. This allows for a smooth user-mode debugging experience when single stepping over the SYSCALL instruction. Now that we hopefully have a better understanding, we can see that the first thing KiErrata704Present does is save off the FMASK MSR contents and then set the MSR value such that TF will not be modified by the SYSCALL operation.
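To make the FMASK mechanics concrete, here is a minimal C model of how SYSCALL derives the new RFLAGS value. It is only a sketch: the helper names are made up for illustration, and it models the masking rule rather than anything patchguard actually contains.

```c
#include <stdint.h>
#include <stdbool.h>

#define RFLAGS_TF (1ULL << 8)   /* trap flag */
#define RFLAGS_IF (1ULL << 9)   /* interrupt flag */

/* SYSCALL clears every RFLAGS bit whose corresponding bit is set in
 * IA32_FMASK: new_rflags = old_rflags & ~fmask. */
uint64_t rflags_after_syscall(uint64_t rflags, uint64_t fmask)
{
    return rflags & ~fmask;
}

/* Hypothetical helper: the FMASK value a check like this would load so
 * that TF survives SYSCALL while every other masked bit stays masked. */
uint64_t fmask_preserving_tf(uint64_t saved_fmask)
{
    return saved_fmask & ~RFLAGS_TF;
}

/* True if TF is still set once SYSCALL completes, meaning a single-step
 * #DB is delivered at the SYSCALL branch target. */
bool syscall_traps_at_target(uint64_t rflags, uint64_t fmask)
{
    return (rflags_after_syscall(rflags, fmask) & RFLAGS_TF) != 0;
}
```

With a mask that covers TF, `syscall_traps_at_target` returns false for any RFLAGS image; once the TF bit is dropped from the mask, an RFLAGS image with TF set makes it return true, which is the state the check deliberately sets up.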
Next we see a sequence of PUSHFQ/POPFQ setting the trap flag and loading it back into the RFLAGS register. This, as you are likely aware, will cause the following instruction to execute with TF set and, on its boundary, fire a #DB, unless of course the instruction is of the software exception, software interrupt, or privileged software exception class, or the instruction generates a hardware exception.
You probably realize by now that once SYSCALL has finished its execution, a #DB will fire, just as it would if we stepped over any other branch instruction. Thus if the LSTAR target looked like the code sequence below:
0x40000: SWAPGS
0x40001: MOV GS:[0x8], RSP
0x40002: MOV RSP, GS:[0x10]
The #DB handler interrupt stack would contain 0x40000, because that is the syscall operation branch target, which hasn’t executed yet.
As you have probably already realized, patchguard can indirectly discover the true contents of the LSTAR MSR by inspecting the #DB generating IP in its interrupt handler. This serves as a way to discover if a malicious virtual machine might be exiting on RDMSR/WRMSR and giving the OS expected values.
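The comparison at the heart of this check can be sketched in a few lines of C. The struct and function names below are hypothetical and only model the logic described above; the real routine gathers these two values with privileged instructions and its own #DB handler.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical record of what the probe observed. */
struct lstar_probe {
    uint64_t rdmsr_lstar;   /* value returned by RDMSR(IA32_LSTAR) */
    uint64_t db_frame_rip;  /* RIP saved in the #DB interrupt frame */
};

/* If the RIP at which the single-step #DB fired differs from what
 * RDMSR reported, something is hiding the true SYSCALL target. */
bool lstar_is_concealed(const struct lstar_probe *p)
{
    return p->rdmsr_lstar != p->db_frame_rip;
}
```

On an untampered machine both fields hold the same address (0x40000 in the example above), so the function returns false. A vmm that intercepts RDMSR and hands back a fake value while SYSCALL really branches elsewhere makes the two diverge.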
Next up is my personal favorite, KiErrataSkx55Present, as it serves as a throwback to CVE-2018-8897 and was added to patchguard not long after this vulnerability was mitigated. In order to have a solid understanding of how this detection works under the hood, you should read the POP SS/MOV SS vulnerability whitepaper.
If you read the paper, then this almost speaks for itself. Thus given the example SYSCALL handler above, this #DB will also have 0x40000 on its interrupt stack.
What’s a young hypervisor to do in this situation, since the guest code can now have wisdom beyond RDMSR/WRMSR? Simple really: set the exception bitmap so that #DB exceptions cause an exit, and check the guest state IP to handle both of the possible instruction boundary #DBs above. If the IP does not match the concealed handler, it is appropriate to reflect the #DB back to the guest via vectored event injection. It would be wise to check the exit qualification instead of just the TF bit in the guest state.
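As a rough sketch of that triage, the following C models the decision a #DB exit handler has to make. The types, names, and the `DB_CONCEAL` action are assumptions for illustration; what a vmm actually does once it recognizes the probe is up to its design, and this only shows the classification step using the exit qualification bits (B0-B3 in bits 3:0, BS in bit 14).

```c
#include <stdint.h>

#define DB_QUAL_BP_MASK 0xFULL        /* B0-B3: debug register breakpoint hit */
#define DB_QUAL_BS      (1ULL << 14)  /* single-step trap */

enum db_action {
    DB_REFLECT,  /* unrelated #DB: inject it back into the guest unchanged */
    DB_CONCEAL   /* #DB landed on the hidden SYSCALL target: handle specially */
};

/* Hypothetical snapshot of the fields read from the VMCS on a #DB exit. */
struct db_exit {
    uint64_t guest_rip;
    uint64_t exit_qualification;
};

/* Classify using the exit qualification rather than RFLAGS.TF alone, so
 * both the single-step case and the MOV SS/POP SS breakpoint case are
 * recognized when they land on the concealed handler. */
enum db_action classify_db_exit(const struct db_exit *e, uint64_t hidden_lstar)
{
    uint64_t cause = e->exit_qualification & (DB_QUAL_BS | DB_QUAL_BP_MASK);

    if (e->guest_rip == hidden_lstar && cause != 0)
        return DB_CONCEAL;
    return DB_REFLECT;
}
```

Anything classified as `DB_REFLECT` goes straight back to the guest through event injection. The `DB_CONCEAL` path is where the vmm has to make sure the guest never sees the address of its own handler in the interrupt frame.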
Let me tell you a story about a popular anti-virus hypervisor that failed to do this, and thus when it injected the #DB back into the guest at the RIP of its secret syscall handler, the KiDebugTraps mitigation was none the wiser, and this hypervisor made your system vulnerable to CVE-2018-8897 all over again.
Finally, what wouldn’t be the icing on the cake, but a solid check that can only blow your hypervisor up if you’re exiting on #DB exceptions, since, you kinda gotta amiright? Enter KiErrata361Present.
There’s a bit going on here so let me explain. Under normal circumstances, loading RFLAGS with TF via a POPF variant, followed by an SS load, will cause the single step to be seen after the instruction boundary of the instruction following the SS load. This is the same for #DBs that fire for hitting armed debug registers, when temporarily blocked by a load of SS. In the case above, an INTn (a software interrupt) or the dedicated INT3 opcode (a software exception) doesn’t care about the previously pending #DB via TF, and it’s discarded no matter what.
This is the same natural behavior from ICEBP, which, albeit undocumented, is the privileged software exception you see in your Intel manuals. In this case, the #DB won’t have DR6.BS set: even though it was pending, it was discarded due to the nature of how these opcodes operate natively. ICEBP actually carries this caveat with it when it induces a #DB VMEXIT. Under normal architectural circumstances the BS bit would be set in the pending debug exceptions field in the VMCS, because that is the true state here; however, when the exit is induced by the privileged software exception, the bit is cleared.
As such the state of VMCS is not naturally resume-able and will cause VMRESUME to fail, causing most hypervisors to shit themselves watery logs on the spot. The architecture requires that if the virtual cpu is in an interrupt shadow such that blocking by MOV SS/POP SS is enabled AND the TF bit is set, that a pending BS based #DB must exist because there is no other way to acquire this machine state. The fix for this is also relatively simple: Check for privileged software exception on qualifying exits, and if blocking by MOV SS is indicated alongside TF==1, then make sure BS is set in pending debug exceptions.
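A minimal sketch of that fix in C, assuming the values have already been read out of the VMCS. The function and macro names are made up for illustration; the bit positions follow the VMCS field layouts (interruption type 5 is privileged software exception, bit 1 of the interruptibility state is blocking by MOV SS, bit 14 of pending debug exceptions is BS).

```c
#include <stdint.h>
#include <stdbool.h>

#define RFLAGS_TF              (1ULL << 8)
#define INTR_STATE_MOV_SS      (1U << 1)    /* blocking by MOV SS/POP SS */
#define PENDING_DBG_BS         (1ULL << 14)
#define INTR_INFO_VECTOR_MASK  0xFFU
#define INTR_INFO_TYPE_MASK    (7U << 8)
#define INTR_INFO_TYPE_PRIV_SW (5U << 8)    /* privileged software exception */
#define INTR_INFO_VALID        (1U << 31)
#define DB_VECTOR              1U

/* True if the VM-exit interruption information describes a #DB raised
 * by ICEBP (a privileged software exception). */
bool exit_is_icebp(uint32_t intr_info)
{
    return (intr_info & INTR_INFO_VALID) &&
           (intr_info & INTR_INFO_VECTOR_MASK) == DB_VECTOR &&
           (intr_info & INTR_INFO_TYPE_MASK) == INTR_INFO_TYPE_PRIV_SW;
}

/* Returns the pending debug exceptions value to write back before
 * resuming: BS is forced on when the exit came from ICEBP while MOV SS
 * blocking and RFLAGS.TF are both in effect. */
uint64_t fixup_pending_dbg(uint32_t intr_info, uint32_t interruptibility,
                           uint64_t rflags, uint64_t pending_dbg)
{
    if (exit_is_icebp(intr_info) &&
        (interruptibility & INTR_STATE_MOV_SS) &&
        (rflags & RFLAGS_TF))
        pending_dbg |= PENDING_DBG_BS;
    return pending_dbg;
}
```

A handler would call `fixup_pending_dbg` with the freshly read VMCS values and write the result back to the pending debug exceptions field before VMRESUME. This sketch only covers the MOV SS case discussed here; the VM-entry checks in the SDM list further conditions on this field that a complete implementation should also respect.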
The idea for KiErrata361Present was actually taken from the CVE-2018-1087 vulnerability, before it was publicly known that privileged software exception was indeed ICEBP, and showed up in patchguard not long after the vulnerability had been mitigated in KVM. The Intel SDM has since been updated to indicate what privileged software exception actually is, but still leaves out this edge case.
If this wasn’t too boring, continue onto Part 2 where we talk about another Patchguard detection and use some critical thinking to come up with our own neat tricks!
10 thoughts on “Patchguard: Detection of Hypervisor Based Introspection [P1]”
I have a question. My question may be a bit naive, but I am a bit confused. Since it is a single-step exception, it breaks at the boundary. How can it continue to execute the following instructions? It cannot find your hypervisor without executing the following instructions.
Good question, possibly I made it difficult to understand.
It doesn’t continue to execute the instructions. Single stepping over syscall is no different than if you single stepped over a call instruction. The IP shown in the interrupt stack is that of the branch destination, because single step #DBs fire after instruction retirement, when the new IP has already been calculated.
It’s no different for syscall, and is a secondary way of exposing where syscall will branch to, besides reading the MSR values.
Does that make sense?