• On August 15, 2019

Applied Reverse Engineering: The Stack

Overview

This article is written for new reverse engineers who have a hard time understanding the stack, its layout, how it works, and the various requirements for proper function. It can be a confusing concept to wrap your head around at first, but after reading this article you should have a very deep understanding of how stacks work and their usage in a 64-bit architecture. Knowing how the stack works is a topic fundamental to reverse engineering. Various types of obfuscation are stack-based and can be daunting to deal with if the operator doesn’t understand it. It’s also useful for circumventing checks that malware may perform such as return address checks (to validate that a call came from a trusted source). Overall, you’ll find learning about the stack invaluable – even if only reviewing your understanding.

This article is written to cover stacks in 64-bit Intel architecture, and the calling conventions used in the x64 architecture. The calling convention explored is the Microsoft ABI. If you’re not sure what an ABI is we’ll cover it in this article. There are a variety of different conventions depending on platform, so be sure to validate. All examples were created and analyzed on Windows 10 x64 Version 1903 Build 18362.

Note: If you're unfamiliar with memory or how memory is organized I'd suggest consulting the recommended reading for more information. Understanding memory will help in understanding this article.

The Stack

— What is the stack?

In general, a stack is a contiguous array of memory. It’s also sometimes referred to as a structure based on the last-in-first-out principle (LIFO). A contiguous array is simply a sequence of objects in a linear structure format, accessible one after the other. This stack structure is bounded at the bottom meaning that all the operations performed are performed on the top. There’s a simple analogy to remember this – a weapon magazine. While the analogy has limitations we’ll discuss, it goes like this:

  1. Bullets are inserted from the top of the magazine. (LIFO)
  2. Only the top bullet is accessible to the operator. (top of stack)
  3. To load a new bullet you have to push it into the magazine; the new bullet becomes the new top. (push)
  4. To remove the top element you have to shoot the weapon. (pop)
  5. You can check if the magazine is empty. (check if stack is empty)
  6. You can use a new magazine, or reuse the same one. (creating, adding elements back)

This analogy is rather interesting since the usual one is a stack of plates. You can visualize how the stack is laid out; however, there are some issues with this analogy. The first is that in modern systems you can access certain stack locations (in memory) if you know the offset from the current stack pointer. There are a few others, but we haven’t covered the concepts behind them yet, so I’m not going to confuse the example. It’s a good representation, regardless.
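If it helps to see the magazine analogy as code, here’s a minimal sketch in C. The names (magazine, mag_push, mag_pop, mag_is_empty, mag_peek) and the capacity of 8 are made up for illustration. The last function is the part a real magazine can’t do: reading an item below the top without removing anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAG_CAPACITY 8 /* hypothetical capacity, chosen for illustration */

/* A bounded LIFO container; `count` is the number of stored items. */
typedef struct {
    uint64_t items[MAG_CAPACITY];
    size_t count;
} magazine;

/* Push: the new item becomes the top. */
static void mag_push(magazine *m, uint64_t value) {
    assert(m->count < MAG_CAPACITY);
    m->items[m->count++] = value;
}

/* Pop: remove and return the top item. */
static uint64_t mag_pop(magazine *m) {
    assert(m->count > 0);
    return m->items[--m->count];
}

/* Check whether anything is left. */
static int mag_is_empty(const magazine *m) {
    return m->count == 0;
}

/* Where the analogy breaks down: read the item `depth` positions
   below the top without popping anything. */
static uint64_t mag_peek(const magazine *m, size_t depth) {
    assert(depth < m->count);
    return m->items[m->count - 1 - depth];
}
```

After pushing 4 and then 12, mag_peek(&m, 0) gives 12 and mag_peek(&m, 1) gives 4, while two calls to mag_pop return 12 and then 4.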

You have a general idea of how the stack works, let’s get into the dirty details regarding its layout and the various registers that control the structure.

— Stack Layout

Before we begin building a view of the stack it’s important to know how it’s managed. If you recall from the previous article when you learned about general-purpose registers there was a register named RSP. This register, called the stack pointer, manages the current stack in 64-bit architectures. The stack is also managed by a segment register called the stack segment, or SS for short. The processor will reference SS for all stack operations (which will be discussed in just a bit). The stack is an awkward structure to think about because it grows down in memory when items are added, and shrinks up when items are removed. The stack pointer will always point to the top of the stack, unless by some annoying trickery a tool (maybe a form of obfuscation) uses it as a counter or for some other generic purpose. If that’s confusing, don’t worry – the diagrams below will help you get a better idea of what is going on.

Below is a diagram that we’ll build upon as we discuss different topics that are related to the stack, for now we just know it’s a contiguous array of memory where RSP points to the top and it’s always referenced through the SS register. (If you’re not familiar with segment registers, check recommended reading – worthwhile to know.)

Alright, so we have this graphic. It’s just an empty stack, and if you recall I said the stack grows down in memory when items are added. Take a look at the image, you’ll see that our RSP points to the top of the stack at the highest address where the stack was allocated (it can be located anywhere in a process’s address space, just know that it’s at the upper boundary of that allocation). To add to this illustration, let’s talk about the two instructions that affect the stack: push and pop. When software needs to place an item on the stack it performs a push, so let’s adjust our graphic after placing two values onto the stack with two consecutive push operations.

As you can see our stack has two new values. If you remember from the analogy earlier, the stack is a LIFO structure. This means the last element pushed onto the stack is the first to be removed by its opposite operation. This also means that the 4 was pushed first and 12 next. The corresponding assembly would have looked like this:

push 4
push 12
...

You also see that the RSP register has been adjusted. This is because when items are pushed onto the stack the processor decrements the RSP register and writes the data to the new top of stack. This is an example of the stack growing down as items are added. This also means that to access either of those values you could do one of two things: offset from RSP, or pop the items off the stack. Let me illustrate both operations and give some details on the pop instruction.

To access the stack elements by offsetting from RSP you have to know how the stack works. Those two elements are at higher addresses (adding elements makes the stack grow down), meaning we’ll have to add an offset to our stack pointer to acquire that information. Let me adjust the diagram to show how we can do this.

To access the last element pushed onto the stack, the value 12, we’d have to offset 0 bytes from RSP, because RSP always points at the top of the stack. The value 4 sits one slot higher: we’re on a 64-bit architecture, so the push and pop instructions decrement/increment the stack pointer by 8 bytes (64 bits) at a time. For example, if I want to store the value of 12 in a general-purpose register like RBX by offsetting from RSP I’d write something like this in assembly:

mov rbx, qword ptr ss:[rsp]

That’s a very specific line. I mentioned earlier that all stack references are done through the stack segment register (SS). That’s exactly what this code is doing: performing a load into RBX from the stack at RSP+0h – which is the value 12. We use the ss:[...] to tell the processor “hey, this is a stack reference.” The same operation would apply for retrieving the value 4, which sits one slot below the top of the stack, just using a different offset – the offset would be 8. We’ll cover why this is important to understand when we get into function frames and usage of the base pointer register.

That’s one way we could retrieve the values of the stack, however, the simpler and faster way is to use the pop instruction. There are different scenarios where usage of one method is preferred over another, but for the sake of this example it is simpler and faster to use pop to get these values off the stack. I’ll adjust the graphic to demonstrate how acquiring the value 4 from the stack could be done using pop. We want to store the value in a general-purpose register as well, so for the sake of consistency we’ll reuse RBX. To do this we’ll have to execute pop twice, but the view of the stack will be quite different.

pop rbx		; rbx = 12
pop rbx		; rbx = 4

We have to pop twice since the stack is LIFO and the value 4 was the first element pushed onto the stack. On the first pop we specify a register that the value on the top of the stack will be placed in – RBX. After executing the first pop instruction the RBX register holds the value 12. After the second, RBX equals 4. That’s what we wanted! When we perform a pop the topmost element of the stack is removed, so what does the stack look like now?

It’s empty! And the stack pointer now points to the top of the stack, as it did in the very beginning. This is because when items are popped from the stack the stack shrinks up – toward higher addresses. On a pop the processor reads the item off the top of the stack, places it in the location specified, and then increments the stack pointer. What’s interesting to note about the instructions that operate on the stack is that they’re not limited to immediate values like 4 or 12; they can use general-purpose and segment registers, or simple memory operands. We won’t demonstrate those right now, but we’ll come in contact with them throughout the series.
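To tie the push, offset, and pop behavior together, here’s a rough model in C. This is a toy, not how the processor is implemented: the toy_stack type, its function names, and the 64-byte region size are all assumptions made for illustration, and rsp is just a byte offset into a small array rather than a real address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_BYTES 64 /* hypothetical size of the toy stack region */

/* `mem` is the stack region and `rsp` is a byte offset into it.
   An empty stack has rsp == STACK_BYTES (the upper boundary). */
typedef struct {
    uint8_t mem[STACK_BYTES];
    uint64_t rsp;
} toy_stack;

static void toy_init(toy_stack *s) {
    memset(s->mem, 0, sizeof s->mem);
    s->rsp = STACK_BYTES;
}

/* push: decrement RSP by 8, then store the value at the new top. */
static void toy_push(toy_stack *s, uint64_t value) {
    assert(s->rsp >= 8);
    s->rsp -= 8;
    memcpy(&s->mem[s->rsp], &value, sizeof value);
}

/* Models `mov reg, qword ptr [rsp + offset]`: read without moving RSP. */
static uint64_t toy_load(const toy_stack *s, uint64_t offset) {
    uint64_t value;
    assert(s->rsp + offset + 8 <= STACK_BYTES);
    memcpy(&value, &s->mem[s->rsp + offset], sizeof value);
    return value;
}

/* pop: read the value at the top, then increment RSP by 8. */
static uint64_t toy_pop(toy_stack *s) {
    uint64_t value = toy_load(s, 0);
    s->rsp += 8;
    return value;
}
```

Pushing 4 and then 12 leaves rsp 16 bytes below where it started; toy_load(&s, 0) reads 12 and toy_load(&s, 8) reads 4. Two calls to toy_pop return 12 and then 4 and bring rsp back to its starting value.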

Before we move on I have to clarify that a program or operating system can set up many stacks. The limit is based on the maximum number of segments and available physical memory, though you likely won’t encounter more than one stack per task in the system. This means every process and thread can have more than one stack, but most usually only have one. However, only one stack is available at a time regardless of how many exist, and the current stack is the stack referenced by the SS register. In addition to push and pop there are a few other instructions that operate on the stack that we’re going to cover next. Those instructions are call and ret. We’ll cover these instructions, compare and contrast high-level/low-level examples, and then move on to discuss calling conventions. These next few sections are detail heavy, but required knowledge for reverse engineering.

Calling and Returning

We’ve encountered the call instruction before, in the first article, as part of the assembly excerpt generated by our example program. We saw some operations before it, and some operations after it. We also ran into the ret instruction as well at the very end of the function excerpt. These two instructions will be encountered innumerable times when reverse engineering, and they operate on the stack, but how does each interact with the stack? Both of them operate differently, but use some of the same components such as the stack pointer. We’re going to address everything you need to know about both of these starting with the call instruction.

— The Call Instruction

If you’ve taken a look at the Intel Instruction Manual, and attempted to decipher the meaning of the instructions in the previous article’s excerpt then you’ve likely run into this instruction in the manual. Its description and various opcodes and nuances about prefetching instructions, etc. may have been confusing. I’m only going to cover what’s relevant to the 64-bit architecture regarding the call instruction, with some details about its operation in the x86 architecture that were carried over.

Before we do that I need to make something known. I mentioned segment registers a section above, and if you’re familiar with segmentation, it doesn’t operate the same for processors in 64-bit mode. All segment registers are zero based except for the GS and FS segment registers. The FS and GS segment registers can still have non-zero base addresses because they may be used for critical operating system structures, and on Windows 10 they are. On x64 Windows, the GS base points to the Thread Environment Block for user-mode code. On other platforms (Linux, for example) the FS segment is used for thread-local storage and canary-based stack protection; it could also be configured to point to other data. We’ll encounter the GS segment register quite often in x64 projects since a lot of information can be extracted for usage in anti-debugging or integrity checks. Anyways, in 64-bit mode segmentation is effectively disabled. I say this because when describing how the call instruction works I will make references to segment registers. These segment registers are based from 0 and the limit is ignored (in 64-bit architectures). The reason for this is the memory model used for the 64-bit architecture. If you’re interested in learning that (and I recommend it) be sure to read the reference in the recommended reading section.

The call instruction has 4 type classifications:

  • Near Call
  • Far Call
  • Inter-Privilege-Level Far Call
  • Task Switch

However, in 64-bit mode, we’re mostly going to be concerned with the near call. It has two opcodes associated, E8 and FF. E8 is described as a call to a target relative to the next instruction, while FF takes its target from a register or memory location. The difference between a near call and a far call is pretty straightforward. A near call doesn’t modify the code segment register (CS), but a far call changes the value of CS. We aren’t going to be concerned with far calls since 64-bit operating systems use what’s called the flat memory model where all memory can be accessed from the same segment. This means there’s no reason to change the value of CS. You remember when I stated that all segment registers are based at 0 (save for FS and GS)? This is how a flat memory model is implemented.

Knowing this actually simplifies the call types we have to learn about. The two sub-types of near calls are near relative calls (E8) and near absolute (FF). Near relative calls are pretty simple to think about. Relative means that the call target address will be relative to the address of the next instruction. To demonstrate this I picked apart a call instruction I found in ntoskrnl.

We know this is a relative near call since its opcode is E8. The following 4 bytes are the call target relative to the next instruction. Let’s suss out how this calculation is performed. We’ll have to extract the relative target from the instruction, which turns out to be FF A4 62 96. If you’re wondering why we went backwards, it’s because Intel stores information in little endian. Little endian simply means storing the “little end” first. To rebuild the actual value in big endian, or “big end” first, which is the normal way to think about a number, we just start from the last byte and work our way forward. Anyways, we should be able to add that relative address to the address of the next instruction and arrive at MiProcessLoaderEntry. What happens is we get a number that isn’t in our address space – what the hell happened? Take a look at that call target again, it starts with FF – it’s negative. To successfully extract this target we’ll take our relative address and sign extend it – meaning we use the sign of the value to extend it to the maximum width (64 bits). The actual relative address is FF FF FF FF FF A4 62 96. If we take that and add it to 1406FB38E (remember, it’s relative to the instruction after the call) we get 140141624. And take a look at this:

We wind up at the entry point for MiProcessLoaderEntry. That’s how near relative calls work! You can extract the target of any call instruction, and that’ll become very useful to us in the future.
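If you’d rather not do that arithmetic by hand, the same calculation can be sketched in C. The function names here are mine, not from any tool; the sketch assumes you’ve already copied the five instruction bytes out of the disassembler and know the address of the instruction after the call.

```c
#include <assert.h>
#include <stdint.h>

/* Rebuild the displacement from the four little-endian bytes that follow
   the E8 opcode and sign extend it to 64 bits. */
static int64_t sign_extend_rel32(const uint8_t bytes[4]) {
    uint32_t raw = (uint32_t)bytes[0]
                 | ((uint32_t)bytes[1] << 8)
                 | ((uint32_t)bytes[2] << 16)
                 | ((uint32_t)bytes[3] << 24);
    /* Bit 31 set means the displacement is negative. */
    return (raw & 0x80000000u) ? (int64_t)raw - 0x100000000LL : (int64_t)raw;
}

/* Target = address of the next instruction + sign-extended displacement.
   `insn` holds the E8 opcode followed by the 4 displacement bytes. */
static uint64_t near_call_target(uint64_t next_instruction, const uint8_t insn[5]) {
    assert(insn[0] == 0xE8);
    return next_instruction + (uint64_t)sign_extend_rel32(&insn[1]);
}
```

With the bytes E8 96 62 A4 FF and a next-instruction address of 0x1406FB38E, near_call_target returns 0x140141624, matching the result we worked out above.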

The simplest way to identify a near relative call is by looking at the first opcode and its mnemonic. It will always look like this: call some_function. For near absolute calls, where a target address is specified indirectly, we’d see something like call [rbx]. An indirect call specifies the call target in a register or some memory store. A direct call will have the call target specified as part of the instruction. This means that near relative calls, as given above, are direct calls and near absolute calls are indirect! It’s a simple way to remember them and identify them, and also how to pull their targets out. That was a lot, I’m sure. Let’s take a break from overloading with disassembly and get back to how the call instruction utilizes the stack.
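For reference, here’s how that direct/indirect distinction could be checked against raw bytes. One detail comes from the Intel manual rather than from this article: the FF opcode is shared by several instructions, and the near absolute call is the form whose ModRM reg field equals 2 (written FF /2). The enum and function names are hypothetical, and the sketch assumes no prefix bytes (REX, for example) come before the opcode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    CALL_NOT_NEAR,       /* not a near call this sketch recognizes */
    CALL_NEAR_RELATIVE,  /* E8 rel32: direct */
    CALL_NEAR_ABSOLUTE   /* FF /2: indirect */
} call_kind;

/* Classify the instruction starting at `code`. Assumes no prefix bytes
   precede the opcode. */
static call_kind classify_call(const uint8_t *code, size_t len) {
    assert(code != NULL);
    if (len >= 5 && code[0] == 0xE8)
        return CALL_NEAR_RELATIVE;
    /* Bits 5..3 of the ModRM byte are the reg field; 2 selects near call. */
    if (len >= 2 && code[0] == 0xFF && ((code[1] >> 3) & 7) == 2)
        return CALL_NEAR_ABSOLUTE;
    return CALL_NOT_NEAR;
}
```

The bytes FF 13 encode call [rbx] and classify as near absolute, while FF E0 (jmp rax) shares the FF opcode but has a different reg field and is rejected.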

— Call Stack Operations

Up to this point you learned a lot more than you might’ve been willing to about the call instruction… Good. In this subsection we’re going to get detailed with how the call instruction utilizes a few registers, and the stack to effectively put a bookmark at its location. When referring to a call we are always talking about near calls – we won’t be using far calls at all.

Continuing, when we execute a near relative call the processor does a few things for us. First, it pushes the value of the instruction pointer (RIP) onto the stack. It does this because RIP contains the offset of the instruction following the call. If you need a refresher on what RIP holds check the previous article. This new stack value is used as the return-instruction pointer (not to be confused with RIP). The processor then branches to the call target address specified by the operand, and if we use our example that operand value was FF FF FF FF FF A4 62 96. This relative offset is encoded as a signed 32-bit immediate value (lots of terms, don’t worry we’ll cover new ones) that is sign extended to 64 bits and added to the RIP register. This sign extension to 64 bits only occurs in 64-bit mode; if you’re operating in a 16- or 32-bit environment the relative offset is encoded as a signed 16- or 32-bit immediate. The target operands are always 64 bits in 64-bit mode. Remember that, otherwise if you calculate targets by hand you may wind up with wonky numbers.

Similarly, with near absolute calls most everything is the same except that the absolute offset is specified indirectly in a general-purpose register or memory location. That absolute offset is loaded directly into the RIP register, no addition to RIP necessary. Using our old stack graphic let’s illustrate what the stack looks like after executing a call instruction.

As we can see, the instruction pointer was pushed onto the stack by the processor. RIP, in this example, would point to test eax, eax. This is because RIP always points to the next instruction to be executed.

Note: If you find that I'm repeating myself it's because I want to make sure this sticks with you. It's easy to get confused, so the more you read it the better you remember.

RSP is decremented because when we push items on the stack we grow down in memory (towards lower addresses). Not too difficult right? Well, how do we get back to the function that just called func? Remember, we’re executing in func after the processor branches to the target. That’s done by executing the ret instruction. I briefly mentioned that the RIP value pushed onto the stack would be later used as the return-instruction pointer, so let’s dig in to the return instruction and go full circle.

— The Return Instruction

Return from procedure, also called the return instruction or simply ret, is the instruction that transfers program control to a return address located on the top of the stack. That return address was pushed onto the stack by the call instruction, and the return brings us to the instruction following the call in our caller function. A note on terminology: the function that executes the call instruction is often referred to as the caller, and the target function is referred to as the callee.

The return instruction has a few different opcodes; the majority of the time in 64-bit targets we’ll just see the C3 opcode should we look at the instruction bytes. However, it’s very possible that you’ll encounter C2 as well. C3 is a plain near return, while C2 takes a 16-bit immediate and releases that many additional bytes from the stack after popping the return address. Let’s talk about the generic return instruction, ret. This instruction performs a near return (to pair with our near call) to a calling procedure. The near return instruction, when executed, pops the return instruction pointer off the top of the stack and into RIP, and resumes execution at the instruction pointer. It’s really that simple. Remember, we can’t directly modify the instruction pointer, but the processor can. As with the call instructions in 64-bit mode, the operation size (meaning the width of memory) for this instruction is 64 bits, or 8 bytes. We’ll talk about what happens if there are issues with the top of the stack when we get to the stack faults section. For now let’s use our stack diagrams of the call and see what happens following a return.

The diagram above shows the execution of a call instruction and then the state of the stack just before the return instruction. Let’s walk through from left to right. First, we land on the call func instruction. This is a near relative call, and if you recall the RIP prior to being pushed on the stack points to the next instruction (where the address is highlighted in blue.) RIP is pushed onto the stack for use later when a ret instruction is encountered. The current RIP on stack holds the value 5 (the address of test eax, eax). Then the processor branches to the callee (func) where we execute 2 instructions of no interest, and land on our ret instruction. Notice the RIP value, it’s the address of the return instruction. Upon executing this return instruction the processor pops the old RIP value from the stack (labeled as the return address) into the RIP register. Below is what the stack looks like after executing the return instruction.

When the return instruction is executed the top of the stack is popped off and put into the RIP register. We can see that by looking at the RIP value highlighted in blue. It points to the instruction following the call func instruction, as it should. The ret transfers control back to the caller at the specified RIP value, and resumes execution. We can see that the stack is clear again, RSP was incremented since pops shrink the stack up toward higher addresses, and the RIP after control transfer points to the next instruction to be executed.
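The round trip described above can be modeled in a few lines of C. Again, this is a toy under stated assumptions: toy_cpu and its functions are hypothetical names, the stack is an array of 8-byte slots, and rip holds the address of the instruction about to execute, so the return address is computed as that address plus the length of the call.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS 8 /* hypothetical number of 8-byte stack slots */

typedef struct {
    uint64_t stack[SLOTS];
    int top;      /* slot index playing the role of RSP; SLOTS means empty */
    uint64_t rip; /* address of the instruction about to execute */
} toy_cpu;

/* call: push the address of the following instruction, then branch. */
static void toy_call(toy_cpu *cpu, uint64_t insn_length, uint64_t target) {
    assert(cpu->top > 0);
    uint64_t return_address = cpu->rip + insn_length;
    cpu->stack[--cpu->top] = return_address; /* stack grows toward lower slots */
    cpu->rip = target;
}

/* ret: pop the return address from the top of the stack into RIP. */
static void toy_ret(toy_cpu *cpu) {
    assert(cpu->top < SLOTS);
    cpu->rip = cpu->stack[cpu->top++];
}
```

If we assume call func sits at address 0 and is 5 bytes long, and place func at a made-up address such as 0x20, then toy_call(&cpu, 5, 0x20) stores 5 on the stack and sets rip to 0x20. A later toy_ret sets rip back to 5, the address of test eax, eax in the diagram, and leaves the stack empty.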

Wow, that’s a lot of information for just two instructions! It turns out this is just the tip of the iceberg. These examples were very trivialized to help you understand what happens when the two branching instructions execute. I put together a high-level example, so you can see what similar code would look like in C. This is somewhat reduced, but the logic still holds.

int func()
{
              // xor eax, eax
    return 0; // ret
}

int main(int argc, char **argv, char **envp)
{
    int res = 1;

    res = func(); // call func
    if (res == 0) // test eax, eax
                  // jz over_there
                  // ...
                  // .over_there:
        printf("return was 0.");

    return 0;
}

This is sort of what the code used above would look like translated to a high-level language. There are definitely some instructions and terms that you encountered that may not be clear, but remember to consult the Intel Instruction Manual when in doubt, and keep reading. If I’m not explaining it just yet it’s not vital to know right this second. I just wanted to provide a look at how calls and returns work from the low level. At this point you’ve encountered a variety of assembly instructions that correspond to high-level operations, and I’m hoping that the overall thought process of breaking high-level code down is beginning to set in. Remember to read all the recommended reading sections to maximize your level of understanding.

That being said, we have to continue on and cover calling conventions. We’ll explain from a high-level and then get low. If the above was confusing take a moment to read the recommended reading sections relevant to the topics we’ve covered and then continue on. If you’re feeling good, and understanding everything then read on.

— Calling Conventions and the Microsoft ABI

A calling convention is a specific method that a compiler uses to set up access to a subroutine. It specifies how arguments are passed to a function, and how return values are – well – returned from the function. It also determines how that function is invoked and the way it creates and manages its stack and stack frames. It’s the way the function call in the compiled language is converted into assembly and we’re going to look at how the most prominent calling convention – fastcall – does these things. Originally, there were three calling conventions that could be used with C on 32-bit x86 processors – stdcall, cdecl, and fastcall – and C++ added thiscall to support member function invocation. On x64 processors with a 64-bit operating system, notably Windows in this series, the compiler simply uses fastcall for all 64-bit code. If you run a process in compatibility mode under WoW64 you’ll encounter the predecessor calling conventions mentioned above. We’re only focused on fastcall since we’re going to be operating with 64-bit targets.

If you’ve been programming for a while you know how a function is declared and defined, and how arguments are passed. Let me create a function that does some simple arithmetic and then we’ll get a little more technical with the calling convention.

This function returns the difference between the two arguments. You’ll notice the __fastcall keyword used, this is to explicitly declare the calling convention. When compiling a 64-bit program with MSVC, it’s implied and always used. I just put it there to be explicit. It’s important to note that this calling convention is not standardized across all compilers, some may use different methods of passing arguments to the function or managing the stack and frames. This brings us to our next discussion, the Microsoft ABI.

An application binary interface, ABI, is the interface between a program and the OS/platform. It provides a set of conventions and details such as data types, their size, alignment requirements; calling conventions, object file format, etc. The ABI is platform dependent, meaning it can vary to some degree from compiler to compiler. The ABI is a primary component in how the generated assembly operates, meaning that the code generation (part of the compilation process) must know the standards of the ABI. What we’re going to be considering from the ABI today is the layout of the stack frame for a function call, how arguments are passed, and how stack cleanup is performed. This is all implemented by the assembly instructions that reserve space, store certain registers to create a “frame”, and copy values into that reserved space. If this is new to you, don’t sweat it. When you’re writing in a high-level language such as C or C++ you don’t really need to know about the ABI. However, when you begin to work with and analyze assembly it’s important to use the correct ABI or be able to identify the ABI for the components of interest.

Programs compiled for a 64-bit Windows OS will use the Microsoft x64 ABI. This ABI uses a four-register __fastcall calling convention by default. We’re going to break the entire convention down and determine how it affects our program’s stack during function calls. We’ll be using our sub example above.

— Fast-call Calling Convention

The __fastcall convention uses four general-purpose registers to pass integer arguments to the callee. The registers are rcx, rdx, r8, and r9; in order. If you need a refresher on the general-purpose registers go to the previous article and save the diagram of registers. Using our subtraction example above this means that when sub is called the generated assembly instructions will place the a value into rcx, and the b value into rdx. The other two, in this instance, are unused. Let’s take our small program and translate it to assembly to help tie this idea together.

int __fastcall sub(int a, int b)
{
    return a - b;
}

int main()
{
    sub(8, 4);

    return 0;
}

The first thing we do in main is call sub. The two arguments are 8, and 4. Let’s get the generated assembly and take a look at it.

//
// Assembly listing for main()
//
mov qword ptr[rsp + 24], r8
mov qword ptr[rsp + 16], rdx
mov dword ptr[rsp + 8], ecx
sub rsp, 40
mov edx, 4          ; second argument
mov ecx, 8          ; first argument
call sub            ; function call
xor eax, eax
add rsp, 40
ret

//
// Assembly listing for sub(a,b)
//
mov	eax, edx
sub	ecx, eax
mov	eax, ecx
ret

I’ve reduced the disassembly to be a little simpler, but not by much. Let’s ignore the first 4 lines of the main listing and start analysis at mov edx, 4. I mentioned before that arguments are passed into rcx, rdx, r8, and r9. They’re passed from left to right, as well, meaning that the first parameter will always be in rcx, and so on. Prior to our function call where sub is executed we see that our values from the C program, 8 and 4, are placed in their respective registers as part of the calling convention. 8 is placed in ecx and 4 in edx – their 32-bit counterparts, because the full 64 bits aren’t required when the data type is 32 bits in width (an int). If the type were to be unsigned long long then the values would’ve been placed in the register partition that matches the width, so rcx and rdx.

Following the storage of our arguments in the registers used by the convention we execute the call instruction. The call instruction pushes the instruction pointer – which is pointing to the instruction after the call – onto the stack and sets RIP to the call target. Jump down to the assembly listing for the sub function, you’ll immediately see a mov performed to store the value of edx in eax. The next instruction performs the subtraction (this is not the same as calling our function sub, this executes the sub instruction, part of the ISA.) The sub instruction subtracts the second operand from the first and stores the result in the first operand. Thus, we see sub ecx, eax which translates to ecx = ecx - eax. If you’re wondering, we could’ve removed the store of edx into eax and used edx in place of eax.

For the x64 ABI, and most others, the return result is passed back to the caller through the rax (eax, in this instance) register. Remember the result is stored in ecx upon completion of the sub instruction therefore to return the result back to the caller, as our C program specifies, we store ecx in eax and return. The return instruction pops the return-instruction pointer (the address of instruction following the call) into the current RIP register, and transfers control back to the calling function at xor eax, eax – because that was the instruction following our call sub. The instruction afterwards performs a stack clean-up, which we’ll cover in just a minute, and then returns.

How was that? Simple enough, right? We’re gonna add a little more complexity now by further describing the convention.


When passing integer arguments we go through the four registers previously specified. Though, we don’t always pass integers to functions. Sometimes we pass floating point arguments, structures, and so on. Any argument that doesn’t fit into a supported size of 1, 2, 4, or 8 bytes has to be passed by reference; this is because (unlike our mailbox analogy) an argument is never split across multiple registers. All floating point arguments and operations are done in the XMM registers. We didn’t talk about those, but they’re just like the general-purpose registers – there are 16 XMM registers and computations are performed on them using instructions defined in the ISA. The XMM registers are named with their index as XMM0-XMM15. We will cover these when necessary, as we won’t encounter them much until we get to our game reversing project.
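As a quick reference for where an argument ends up, here’s a small lookup sketched in C. It leans on one detail from Microsoft’s x64 calling convention documentation that isn’t spelled out above: the first four floating point arguments use XMM0 through XMM3, and the register is chosen by the argument’s position, so a floating point value in the second position goes in XMM1, not XMM0. The type and function names are hypothetical, and variadic functions are ignored.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ARG_INTEGER, ARG_FLOAT } arg_class;

/* Where the argument at zero-based position `index` is passed.
   Positions 4 and up are passed on the stack. */
static const char *arg_location(size_t index, arg_class cls) {
    static const char *const int_regs[4] = { "rcx", "rdx", "r8", "r9" };
    static const char *const float_regs[4] = { "xmm0", "xmm1", "xmm2", "xmm3" };
    assert(cls == ARG_INTEGER || cls == ARG_FLOAT);
    if (index >= 4)
        return "stack";
    return cls == ARG_FLOAT ? float_regs[index] : int_regs[index];
}

/* An argument is passed by value only if its size is 1, 2, 4, or 8 bytes;
   anything else is passed by reference. */
static int passed_by_value(size_t size) {
    return size == 1 || size == 2 || size == 4 || size == 8;
}
```

For our sub(8, 4) example, arg_location(0, ARG_INTEGER) is "rcx" and arg_location(1, ARG_INTEGER) is "rdx", which lines up with the mov ecx, 8 and mov edx, 4 in the listing.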

So what does any of this have to do with the stack? Well, prior to executing a call instruction and as part of the convention, it is the job of the caller to allocate space on the stack for the callee to save the registers used to pass arguments. This space that’s allocated by the caller is known as the shadow store, spill space, home space, or shadow space. To be rigorous with our terminology we will always refer to it as the shadow store. The space allocated is strictly the maximum size supported (8 bytes) times the number of registers used to pass arguments (4).

If you look at our main assembly listing above you’ll notice the instruction sub rsp, 40. This is the allocation of our shadow store, plus a little something else, since 8 * 4 = 32; so, what gives? Well, the stack must be aligned on a 16-byte boundary whenever a call instruction is executed. This means that the address of the top of the stack must be a multiple of 16. You might be thinking, 32 is a multiple of 16, but remember that the call into main already pushed an 8-byte return address onto the stack, so on entry to main RSP sits 8 bytes off a 16-byte boundary. If we used sub rsp, 32, RSP would have moved a total of 40 bytes from its last aligned position. 40 is not a multiple of 16, so to combat this we allocate an additional 8 bytes, giving us sub rsp, 40 and a total of 48.

To simplify: prior to a function call the stack must always be aligned on a 16-byte boundary.

Let me reuse the main assembly listing above to illustrate what I just addressed since it can be somewhat confusing.

//
// Assembly listing for main()
//
sub rsp, 40         ; 32 (shadow store) + 8 (alignment pad) = 40; with the 8-byte return address already on the stack, rsp is 16-byte aligned again
mov edx, 4          ; second argument
mov ecx, 8          ; first argument
call sub            ; function call
xor eax, eax        ; return value of main = 0
add rsp, 40         ; release the 40 bytes allocated above
ret 0

To reiterate, we allocate space on the stack for the registers used (32 bytes) plus an extra 8 bytes of padding. The padding is there because the call that entered main already pushed an 8-byte return address, so 32 + 8 + 8 = 48 keeps the stack 16-byte aligned when our own call instruction executes. Also, if you’re wondering why we use sub rsp, X to allocate space on the stack, remember that the stack grows down in memory, toward lower addresses. To reclaim this allocation when the function finishes execution we use add rsp, X to shrink the stack back to its state prior to the allocation. The reclamation of stack space must be the same size as the allocation; otherwise ret pops the wrong value as the return address and you invariably wind up with a crashing program. If this is still confusing for you, I made a graphic to illustrate this process.

This shows what happens when we only allocate space for our shadow store prior to a function call. We wind up with a misaligned stack. The solution is to add 8 bytes to our allocation as alignment padding to ensure that the stack is 16-byte aligned prior to execution transfer.
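If you’d rather see the arithmetic than a picture, here’s a rough sketch in C. It assumes RSP was 16-byte aligned right before the call that entered the current function, so the return address has already knocked it 8 bytes off. The function name frame_allocation and the constants are my own, for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_ALIGNMENT 16u  /* required alignment when a call executes */
#define SLOT_SIZE        8u  /* one stack slot: return address or a push */
#define SHADOW_STORE    32u  /* 4 register arguments * 8 bytes */

/*
 * Hypothetical helper: smallest N for `sub rsp, N` that covers `needed`
 * bytes and leaves RSP 16-byte aligned, given that the return address
 * and `pushes` register pushes (8 bytes each) are already on the stack.
 */
size_t frame_allocation(size_t needed, size_t pushes)
{
    size_t already_used = SLOT_SIZE + pushes * SLOT_SIZE;
    size_t total = already_used + needed;
    size_t aligned_total =
        (total + STACK_ALIGNMENT - 1) / STACK_ALIGNMENT * STACK_ALIGNMENT;
    size_t allocation = aligned_total - already_used;

    /* Everything placed since the last aligned point must add up to a multiple of 16. */
    assert((already_used + allocation) % STACK_ALIGNMENT == 0);
    return allocation;
}
```

With no pushes and only the shadow store to reserve, frame_allocation(SHADOW_STORE, 0) comes out to 40, which matches the sub rsp, 40 in main. If the function had pushed one register first, the same request would only need 32.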

I’m sure you’re tired of my diagrams at this point, but unfortunately there’s a little bit more to cover. We have yet to cover stack frames, and how data larger than 8 bytes is passed. If you’ve made it this far, keep going. You’ll have a better understanding of the stack than most just starting out, and that’s what I’m going for.

— Stack Frames

So far we’ve seen how the calling convention passes arguments, how it maintains stack alignment across function calls, and how it allocates space for register storage for the callee. Now, we’re going to break down how a stack frame is created and used. A stack frame is simply a block of data that gets placed on the stack. In our example, we’re talking about a call stack frame which represents a function call and its argument data. An important distinction is that the shadow store a caller allocates is not part of the callee’s call stack frame. The call stack frame starts with the return address being pushed onto the stack, followed by the saved base pointer and then the space allocated for local variables. In some instances when a function is small enough and no locals are used we wind up not needing a stack frame, and instead opt to use registers to perform a quick calculation such as in the sub function. A good majority of the time you’ll encounter a stack frame, but it’s good to know that they’re not always required.

I’ve constructed a more in-depth example that generates a listing that creates a stack frame and uses it to address local variables and perform some modifications. It’s a bit more involved, but I’m sure you’ll be able to catch on.

void do_math(void)
{
    int x = 10;
    int y = 44;
    int z = 36;
    int w = 109;
    int a[4] = { 1, 2, 3, 4 };

    a[0] = x * a[0];
    a[1] = y * x;
    a[2] = a[1] * z;
    a[3] = w * a[2];

    printf("%d\n", a[3]);
}

It’s just an arbitrary amount of math on an array, and some locals. No significance. I just needed to massage the compiler into giving me the assembly listing I wanted. Speaking of which, it’s a bit of a mess, but we’ll work through it.

//
// Assembly listing of main()
//
mov qword ptr [rsp+24], r8
mov qword ptr [rsp+16], rdx
mov dword ptr [rsp+8], ecx
sub rsp, 40
call do_math
xor eax, eax
add rsp, 40
ret 0

//
// Assembly listing of do_math() (debugger output: offsets and sizes are hexadecimal)
//
push rbp
mov rbp, rsp
sub rsp, 60
mov rax, qword ptr ss:[rbp+30]
mov qword ptr ss:[rbp-40], rax
mov qword ptr ss:[rbp+18], r9
mov qword ptr ss:[rbp+28], r8
mov qword ptr ss:[rbp+10], rdx
mov qword ptr ss:[rbp+20], rcx
test rdx, rdx
jne 7FF691A34607
call 7FF691A36294
mov dword ptr ds:[rax], 16
call 7FF691A36174
or eax, FFFFFFFF
jmp 7FF691A34651
test r8, r8
je 7FF691A345F2
lea rax, qword ptr ss:[rbp+10]
mov qword ptr ss:[rbp-38], rdx
mov qword ptr ss:[rbp-28], rax
lea r9, qword ptr ss:[rbp-38]
lea rax, qword ptr ss:[rbp+18]
mov qword ptr ss:[rbp-30], rdx
mov qword ptr ss:[rbp-20], rax
lea r8, qword ptr ss:[rbp-28]
lea rax, qword ptr ss:[rbp+20]
mov qword ptr ss:[rbp-18], rax
lea rdx, qword ptr ss:[rbp-30]
lea rax, qword ptr ss:[rbp+28]
mov qword ptr ss:[rbp-10], rax
lea rcx, qword ptr ss:[rbp+30]
lea rax, qword ptr ss:[rbp-40]
mov qword ptr ss:[rbp-8], rax 
call printf
add rsp, 60
pop rbp
ret

The listing for main is easy enough. It’s actually storing the arguments for main in its shadow store. You can identify storage in the shadow store by looking for calling convention registers being stored in [rsp+8] or higher. You won’t see it at [rsp] since that’s where the return address (what ret pops into RIP) is stored. Modifying that can cause a lot of issues. Alright, we already covered what happens before we call a function, and right after we transfer control to that function; so now we’re going to look at how the compiler builds stack frames to allow for local storage in functions. Let’s look at the assembly listing of do_math.

The first line pushes a general-purpose register, rbp, onto the stack. This register is referred to as the base pointer, and its purpose is normally to anchor stack frames and address local variables in a function. The push preserves the register’s value; most pushes you find preceding actual function code are there to preserve register values. We’ll talk about why this is important soon. The next line stores the value of the stack pointer in the base pointer register. This means that both rbp and rsp point to the top of the stack.

Setting mov rbp, rsp aside for a moment, the two instructions at the start of the function that move the stack pointer are:

push rbp
sub rsp, 60h

Before these instructions, RSP was assumed to be 16-byte aligned as required by the x86-64 ABI before a call. However, since a call instruction implicitly pushes the return address, the stack is no longer 16-byte aligned upon entry to the do_math function. This is fine as the stack is only required to be aligned prior to a call instruction (and some other instructions out of scope here). Anyway, the push rbp instruction saves the previous frame pointer onto the stack. This decrements the stack pointer (RSP) by 8 bytes, so RSP is once again aligned on a 16-byte boundary. The sub rsp, 60h instruction then subtracts 96 (0x60) bytes from RSP. Together with the push, this sets up the function’s stack frame, which includes:

  1. 8 bytes for the saved RBP value, which was just pushed in the prologue.
  2. 32 bytes for local variables, as determined by the function’s requirements. In this case, there are 8 4-byte integer variables (32 bytes).
  3. 32 bytes for shadow store (spill space).

However, the question arises: why not allocate 80 bytes (0x50) instead of 96 (0x60)? 80 bytes would still be a multiple of 16 and sufficient to hold the saved RBP, local variables, and shadow store listed above (72 bytes). The reason for this extra 16 bytes is not immediately clear from the assembly given. It could be additional padding (likely related to the array), or the result of some compiler strategy (though the code looks very unoptimized).
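To check the alignment claims yourself, here’s a toy model of the prolog and epilog in C. It’s only a sketch: the machine struct, its tiny 512-byte stack, and the function names are all invented for illustration, and RSP is modeled as a byte offset into an array instead of a real address.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_BYTES 512u

/* Toy machine state: two registers plus a small block of stack memory. */
struct machine {
    uint64_t rsp;                     /* byte offset into `stack`, grows down */
    uint64_t rbp;
    uint64_t stack[STACK_BYTES / 8];
};

void stack_push(struct machine *m, uint64_t value)
{
    assert(m->rsp >= 8 && m->rsp % 8 == 0);
    m->rsp -= 8;                      /* the stack grows toward lower addresses */
    m->stack[m->rsp / 8] = value;
}

uint64_t stack_pop(struct machine *m)
{
    assert(m->rsp + 8 <= STACK_BYTES);
    uint64_t value = m->stack[m->rsp / 8];
    m->rsp += 8;
    return value;
}

/* Models: push rbp / mov rbp, rsp / sub rsp, frame_size */
void prologue(struct machine *m, uint64_t frame_size)
{
    stack_push(m, m->rbp);
    m->rbp = m->rsp;
    assert(m->rsp >= frame_size);
    m->rsp -= frame_size;
}

/* Models: add rsp, frame_size / pop rbp */
void epilogue(struct machine *m, uint64_t frame_size)
{
    m->rsp += frame_size;
    m->rbp = stack_pop(m);
}
```

Start with an aligned RSP, push a fake return address to stand in for the call, then run prologue(&m, 0x60). RSP ends up on a 16-byte boundary again, exactly 0x60 bytes below RBP. Running epilogue(&m, 0x60) puts both registers back where they were on entry, which is why the allocation and the reclamation have to match.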

This sequence of instructions actually has a name: it’s called the function prolog. Any function that allocates stack space, calls other functions, or preserves registers will have a prolog. The epilog is the sequence of instructions that cleans up any stack allocations and restores preserved registers prior to returning. Anyway, the reason for storing rsp in the base pointer register is so that the base pointer can be used to address the stack storage designated for local variables. A visual really helps solidify this concept. You already have a good idea of how the stack looks up to this point, so here’s the stack’s state following the function prolog.

That’s quite an allocation. I switched the padding location because in reality it doesn’t really matter; it’s just part of the allocation for the shadow store and it can sit anywhere in that region. I put a label for where rbp points after the mov rbp, rsp instruction. And then we perform the sub rsp, 60h which allocates space for 12 8-byte stack slots. The brackets around the labels for those cells indicate that a dereference of rbp minus that offset will access that slot. It makes sense: rbp holds the value rsp had before the allocation, the stack grows down so the allocation takes rsp toward lower addresses, and to access those lower addresses we have to take rbp and subtract. We’re gonna take a look again at our assembly listing for the do_math function except I trimmed the fat so we can just make a point.

push rbp
mov rbp, rsp
sub rsp, 60
mov rax, qword ptr ss:[rbp+30]
mov qword ptr ss:[rbp-40], rax
mov qword ptr ss:[rbp+18], r9
mov qword ptr ss:[rbp+28], r8
mov qword ptr ss:[rbp+10], rdx
mov qword ptr ss:[rbp+20], rcx

......

call printf
add rsp, 60
pop rbp
ret

At this point you know what rbp is used for. It’s the frame base pointer, meaning we use it to index into the stack to store local variables. The line following our stack allocation is mov rax, qword ptr ss:[rbp+30]. That’s a mouthful, but we can immediately identify a few things. It’s referencing the stack, ss:; it’s using rbp to index into a location; and storing that dereferenced value in rax. Unfortunately, the value it’s dereferencing isn’t shown in our diagram, it’s actually at a higher address than is shown. But we can identify where the next thing is stored: mov qword ptr ss:[rbp-40], rax. If you look at the diagram above we store the value of rax in the local stack space at [rbp-40].

Note: Positive offsets from RBP access arguments passed on the stack. Negative offsets from RBP access local variables.
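The note can be refined a little. Here’s a sketch in C that labels what an rbp-relative offset usually refers to, assuming the function opens with exactly push rbp followed by mov rbp, rsp and follows the Microsoft x64 convention. The enum and function names are hypothetical, and the sketch only handles positive offsets that land on whole 8-byte slots.

```c
#include <assert.h>
#include <stdint.h>

enum slot_kind {
    SLOT_LOCAL,          /* [rbp-N]: locals and scratch space             */
    SLOT_SAVED_RBP,      /* [rbp+0]: caller's rbp, pushed by the prolog   */
    SLOT_RETURN_ADDRESS, /* [rbp+8]: pushed by the call instruction       */
    SLOT_SHADOW_STORE,   /* [rbp+10h]..[rbp+28h]: homes for rcx/rdx/r8/r9 */
    SLOT_STACK_ARGUMENT  /* [rbp+30h] and up: fifth argument onward       */
};

/* Hypothetical helper: classify a byte offset from rbp. */
enum slot_kind classify_rbp_offset(int64_t offset)
{
    if (offset < 0)
        return SLOT_LOCAL;

    /* Saved values and arguments occupy whole 8-byte slots in this sketch. */
    assert(offset % 8 == 0);

    if (offset == 0)
        return SLOT_SAVED_RBP;
    if (offset == 8)
        return SLOT_RETURN_ADDRESS;
    if (offset < 0x30)
        return SLOT_SHADOW_STORE;
    return SLOT_STACK_ARGUMENT;
}
```

Applied to the trimmed listing, [rbp-40] is a local slot, [rbp+10] through [rbp+28] are the shadow store, and the read from [rbp+30] picks up a fifth argument that was passed on the stack.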

The above note applies to normal accesses using RBP while executing a function. This brings me to something new: if a function takes more than 4 arguments, the 5th argument and every one after it are passed on the stack. An example is provided below.

fnc(int a1, int a2, int a3, int a4, int a5, int a6);

// x64 calling convention passes args as such:
rcx = a1
rdx = a2
r8 = a3
r9 = a4
a5 and a6 are placed on the stack, just above the shadow store
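Where exactly do those stack arguments sit? Here’s a small sketch in C of the slot layout, assuming the Microsoft x64 convention where every argument owns one 8-byte slot and the first four slots are the shadow store. The function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOT_SIZE 8u

/*
 * Offset from RSP of the slot that belongs to argument `index`
 * (zero-based), measured in the caller just before `call` executes.
 * Slots 0-3 are the shadow store; slots 4 and up hold real arguments.
 */
size_t caller_slot_offset(size_t index)
{
    assert(index <= (SIZE_MAX - SLOT_SIZE) / SLOT_SIZE);  /* guard against overflow */
    return index * SLOT_SIZE;
}

/* The same slot as seen by the callee on entry, after `call` pushed the return address. */
size_t callee_slot_offset(size_t index)
{
    return SLOT_SIZE + caller_slot_offset(index);
}

/* Bytes the caller reserves for a call with `argc` arguments, before alignment padding. */
size_t argument_area_size(size_t argc)
{
    size_t slots = argc < 4 ? 4 : argc;   /* the shadow store is always reserved */
    return slots * SLOT_SIZE;
}
```

For fnc, a5 lands at [rsp+20h] and a6 at [rsp+28h] in the caller, so the call needs 48 bytes of argument area. The callee-side numbers also explain the earlier main listing: on entry, the home slot for the first argument is at [rsp+8], the second at [rsp+16], and the third at [rsp+24].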

You’ll be introduced to the various tricks and optimizations that are applied throughout this series. Once complete with the basics you should be able to identify stack uses, and prologues that don’t necessarily follow convention. For the time being though, they will follow convention. To start wrapping things up we’re going to quickly talk about how arguments that are larger than the maximum supported element size are passed.

— Passing Large Arguments

Large arguments don’t necessarily have to be an abstract data structure. In fact, most of the time they’re just strings. Before pulling the example from the first article of the series, where you had a brief look at an assembly listing, let’s recall a rule enforced by the ABI: arguments not of size 1, 2, 4, or 8 bytes are passed by reference. It works about how you’d expect. Take printf for example: the string could be larger than 8 bytes since each character is a single byte. When we call printf with a format string and some value, the format string is passed to the callee by reference through rcx, and the value is passed through rdx. Let’s break down a simple example.

printf("Elapsed Time = %u\n", ElapsedTime);

The string clearly holds more than 8 characters, so it is definitely larger than 8 bytes and we’ll have to pass it by reference. ElapsedTime is just some unsigned integer value, so we’ll pass it normally through rdx. What this winds up breaking down to in assembly is this:

mov rdx, ElapsedTime              ; second argument, passed by value
lea rcx, offset elapsed_string    ; first argument, the address of the format string
call printf

You’ve seen the mov instruction before, and call, but lea is new. The lea mnemonic stands for load effective address: it computes the effective address of the source operand and stores it in the destination operand (rcx here). The destination is always a general-purpose register. To think about this in high-level terms it’s similar to constructing a string, and passing the string by reference to a function. The reference to this string will point to the address of the first character in its character array, and printf has code to parse that string and perform whatever operations are needed to fill in the formatting components. It’s really that simple. If you see lea feeding an argument register you’re most likely seeing a reference to some data larger than the supported size for stack elements, though compilers also use lea for plain arithmetic. Most of the time it’s strings, but you’ll learn as you progress that sometimes it’s data structures.
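Here’s the same idea from the C side, as a minimal sketch. The struct vec3 type and both functions are invented for this example. I’ve written the pointer out explicitly; for a structure this size passed by value, the compiler arranges roughly the same thing behind the scenes by handing the callee the address of a copy.

```c
#include <assert.h>
#include <stddef.h>

/* 24 bytes: not 1, 2, 4, or 8, so it cannot travel in a single register. */
struct vec3 {
    double x, y, z;
};

/* The callee receives only an 8-byte address in rcx and reads through it. */
double vec3_sum(const struct vec3 *v)
{
    assert(v != NULL);
    return v->x + v->y + v->z;
}

/* A string works the same way: rcx holds the address of its first character. */
size_t count_chars(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

Calling vec3_sum(&v) or count_chars("Elapsed Time = %u\n") is what typically produces a lea into rcx at the call site: the data stays where it is and only its address is handed over.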

Conclusion

In this article, you learned a lot about the stack, its purpose, how certain instructions affect it, and how certain interfaces utilize it to generate code in assembly that matches the semantics of your high-level program. We covered quite a bit of material, but there’s still so much more. If you’re interested in reading ahead and learning more about the stack, the calling convention, volatile and nonvolatile registers (what that even means), and so on, then check the recommended reading section. The next article will cover exceptions and I plan to batch publish it with the accelerated assembly section. We’ll address the basics of exceptions, how software and hardware generated exceptions occur, the most common exceptions you’ll encounter, structured exception handling, vectored exception handling, and the role the OS plays. The accelerated assembly article will use a hands-on approach to teach you a good portion of the x86 instruction set. You’ll encounter conditional jumps, compares, bit shifting, and more stack-based operations.

All that being said, this concludes the introduction to the stack. As always feedback, questions, and comments are welcome.

Recommended Reading


25 thoughts on “Applied Reverse Engineering: The Stack”

  • Hi,
    Great article, but i think your first two diagrams of stack are incorrect.
    When you do push 12 and push 4, in both the diagrams the 4 should be below 12 not above it as the stack moves from higher to lower memory addresses and has a LIFO structure.

    1. Yep, you’re right. I meant to put the push 4 first and push 12. What’s funny is I wrote that the first time, and then changed it. Since I was thinking about the stack view from my usual perspective in a debugger.

      Thanks for pointing that out. It’s been fixed.

  • Thanks for this article. I have a question though; you said the stack is required to be 16 bytes so the stack pointer had to be incremented by 64(40h) `sub rsp, 40`. we have need 4*8 bytes plus 8 bytes for the return address. why didn’t we increment the stack by by 48(30h).

    1. Hey, so the values are in hex. When doing the calculations I was using the wrong tab in the calculator and performing the ops on the hex value and made a mistake! It’s been fixed and I hope it makes sense now.

      Very sorry for the confusion, but thank you for bringing it to my attention.

      The function starts with two instructions:
      push rbp
      sub rsp, 60h

      Before this instruction, RSP was assumed to be 16-byte aligned as required by the x86-64 ABI before a call. However, since a call instruction implicitly pushes the return address the stack is no longer 16-byte aligned upon entry to the do_math function. This is fine as the stack is only required to be aligned prior to a call instruction (and some other instructions out of scope here). Anyways, the push rbp instruction saves the previous frame pointer onto the stack. This decrements the stack pointer (RSP) by 8 bytes. RSP is once again aligned on a 16-byte boundary. The sub rsp, 60h instruction then subtracts 96 (0x60) bytes from RSP. This allocates space for the function’s stack frame, which includes:

      8 bytes for the return address, which was pushed by the call instruction that invoked this function.
      8 bytes for the saved RBP value, which was just pushed in the prologue.
      36 bytes for local variables, as determined by the function’s requirements. In this case, there are 8 integer variables (32 bytes) and a 4-byte character array.
      44 bytes of padding to ensure the total stack frame size is a multiple of 16 bytes, maintaining alignment.

      However, the question arises: why not allocate 64 bytes (0x40) or 80 bytes (0x50) instead of 96 (0x60)? 80 bytes would still be a multiple of 16 and sufficient to hold the return address, saved RBP, and local variables. The likely reason is the need for scratch space for the printf call in the middle of the function. The x86-64 ABI requires certain registers (RCX, RDX, R8, R9) to be used for passing the first 4 arguments to functions. If the values in these registers need to be preserved or passed as additional arguments to nested calls, they must be saved somewhere. The extra 32 bytes (96 – 64 (0x40)) in the stack frame provide the necessary scratch space to save these registers before the printf call and restore them afterwards, without needing to further modify RSP.

  • you showed that, because of 16 byte alignment ABI requirement of the stack, the stack would look like 4 64-bit shadow values followed by a 64-bit padding followed by the 64-bit return value as the last thing on the stack because before the function is invoked. the code for assembly listing for main looks like such.

    mov qword ptr[rsp + 24], r8
    mov qword ptr[rsp + 16], rdx
    mov dword ptr[rsp + 8], ecx

    the ecx register (which has the first parameter of main, right?) is being placed inside the stack pointer + 8 bytes offset. this is the location where the padding is. does this cause any probably or is it a non-issue as long as there is enough shadow store to keep four parameters?

  • Hey Daax. Thanks for taking the time to providing these awesome articles.

    Just want some clarification on the 16 byte boundary stuff. Does this only matter just before you call a function? I know you made a note in red that it does but I just wanted to confirm. So for example, if in the middle of a function I push a register onto the stack (which would be 8 bytes) it is fine even though the stack is no longer aligned by a 16 byte boundary after this push?

    Also i’m a little confused as to why in your example you changed the sub instruction from “sub rsp, 40” to “sub rsp, 48”. Shouldn’t it be “sub rsp, 40” because as soon as you make the call to the function, it adds the return address to the top of the stack which is another 8 bytes making it 40 + 8 = 48 bytes allocated which is a multiple of 16. If you use “sub rsp 48” and then at the call instruction add another 8 bytes for the return value it would be 56 which is not a multiple of 16.

    Thanks.

  • Hey Daax. Thanks for taking the time to providing these awesome articles for us. you have helped me understand more into my pc to help me develop into kernel drivers

    i was just wondering when you will be doing articles on Driver Development i have searched around for it but the series directory doesn’t bring me to any articles on driver development.

    1. That’s coming in the future. It’s just there so I don’t forget what I plan to write 🙂

      For the time being, check out Windows Kernel Programming and Developing Windows NT Device Drivers. Great resources for driver development.

  • Great article!! Even though, I have read above stack workings in multiple places, books, your article was a great refresher and helped me discover few new ideas. For instance, I do not understand opcode, and you told about significance of near call(relative startins with E8 and absolute starts have call func_name).

    Also a good reminder on the shadow-space and padding along with 6-multiple rule for caller allocating shadow-space.

    1. Thank you! I’m super glad you enjoyed it, and I greatly appreciate the feedback! I apologize for the length of some of them – they get quite thorough heh.

  • Great article but something that I just not get fully understand is why exactly 16 byte alignment was chosen and why only before calls?

  • Hey Dx. Thanks for taking the time to providing these awesome articles.
    But I had a question.
    I wrote the following code on the system and checked the assembly code:
    int main(int a)
    {
    std::cout << "Hello World!\n";
    }

    In the generated output, I do not understand something.
    stack is not aligned on a 16-byte boundary.
    disassembly output:

    00007FF7F6071040 mov dword ptr [rsp+8],ecx
    00007FF7F6071044 sub rsp,38h
    Can you guide me?

    1. The stack is aligned at the function body, not at the beginning. The call instruction implicitly pushes the return address onto the stack which is 8 bytes, this allocates 56 bytes on the stack + the 8 bytes for return address which is 64 bytes. 64 % 16 = 0. The requirement for the stack pointer to be 16-byte aligned is due to the potential use of SIMD instructions (more info if you research on the x86_64 ABI).
