Applied Reverse Engineering: Accelerated Assembly [P1]

Overview

In this article you’ll be guided through a course on the x86 instruction set. It serves as a quick fix to the problem of not knowing where to start when learning assembly. We’ll cover instruction format briefly and then jump right into the instructions. This is like learning another language, and it may not make sense immediately, but rest assured that if you do this enough, reading assembly listings will become second nature. You’ll be able to decipher the functionality of a code block from a brief excerpt. This page will also serve as a reference in later articles, as all the instructions here are encountered often while reverse engineering software. If you forget what an instruction does, or the types of operands it’s compatible with, you can refer back here or to the Intel SDM Volume 2.

As always, it is assumed you, the reader, have some sort of experience with a compiled programming language. Any language that has functional constructs will count too (loops, comparisons, etc.) The instruction set to be analyzed is one of the most popular ones, the x86 ISA, and all examples will be written for execution on Intel or AMD processors. Let’s not waste any time, there’s a lot to cover…

Introduction

Before continuing, it would be wise for those of you who may have forgotten about general purpose registers and their use to review the article on Basic Architecture. General purpose registers are used quite frequently in load/store operations and will be encountered all throughout our various examples. It’s important you know them off hand. Take a second to go back and read the section on general purpose registers, and then come back.

— Microcode versus Assembly

A common problem when reading reference material for assembly and low-level development is the misuse of terms, particularly microcode and machine code. Microcode is a layer of abstraction beneath machine code. For the sake of understanding, the machine code we’ll be looking at is the x86 instruction set, and assembly is its human-readable form. What I mean by a layer beneath machine code is that the CPU actively converts machine code instructions into microcode for the CPU to execute. There are many reasons this is done; the main one is that it is easier to create a complex processing unit with backwards compatibility. The x86 instruction set contains thousands of instructions for many different operations, some of them for loading and storing strings or floating point values. Rather than explicitly defining an execution path in hardware for each of these instructions, they’re converted into microcode and executed on the CPU. This preserves backwards compatibility and gives way to faster, smaller processors.

It’s important to distinguish between these two for technical accuracy as well as understanding. In addition to that, microcode and machine code do not always have a 1:1 mapping. However, there is no published documentation about Intel or AMD’s microcode, so it’s hard to infer the internal architecture and mapping of microcode:machine code.

As an example, take the instruction popf. This instruction pops the top word on the stack into the EFLAGS register. Prior to doing that, though, it performs checks on certain bits in the EFLAGS register, the current privilege level, and the I/O privilege level. All of that work is unlikely to be carried out by a single micro-operation. You could be looking at a number of micro-operations that are executed when this instruction is converted.

Note: Microcode is the layer of abstraction beneath machine code. Machine code is the higher level representation of these micro-operations.

— Instruction Simplification

We aren’t going to break down the entire format of an x86 instruction in this subsection since there is an entire chapter dedicated to that in the Intel SDM Volume 2. However, we need to address the general format.

Assembly instructions come in all different sizes (quite literally), but adhere to a similar shape. The format is typically an instruction prefix, the opcode, and the operand(s). There may not always be an instruction prefix (we’ll cover those in the future), but there will always be an opcode so long as the instruction is valid and supported. These opcodes map to a specific instruction in the instruction set. Some instructions have several opcodes, and the one used depends on the operands they’re acting upon. For example, the logical AND instruction has a dedicated opcode for the form that takes al, the lower byte of the rax register, and performs a logical AND against an 8-bit immediate value. Recall that an immediate is just a numerical value encoded directly in the instruction. Below is a simple summary with the opcode and instruction mnemonic.

Logical AND [Opcode: 24h | Instruction: AND AL, imm8]

That’s a new term as well, mnemonic. In assembly, a mnemonic is a simple way to identify an instruction. It beats the alternative of reading a hex dump and determining instruction boundaries and then translating the opcodes by hand to a human readable form. These mnemonics are devices that allow system programmers, hardware engineers, and reverse engineers like us to read and understand what some sequence of instructions is doing with relative ease. In the above example the mnemonic for the logical AND operation is AND followed by op1, and op2 – the operands.

Note: It's pronounced like nehmonik, not memnomic. Maybe I'm just an idiot and am the only one who struggled to say it right.
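To make the opcode-plus-operand shape concrete, here is a minimal sketch in C that recognises the one encoding mentioned above, 24h followed by an 8-bit immediate, and applies it to a value standing in for al. The helper name exec_and_al_imm8 is made up for this illustration, and this is nowhere near a real decoder (no prefixes, no other opcodes).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only: understands just the two-byte
 * encoding 24h ib (AND AL, imm8). Returns the number of bytes consumed,
 * or 0 if the bytes are not that instruction. */
static size_t exec_and_al_imm8(const uint8_t *code, size_t len, uint8_t *al)
{
    if (len < 2 || code[0] != 0x24)
        return 0;           /* not AND AL, imm8 */

    *al &= code[1];         /* op1 is AL, op2 is the immediate byte */
    return 2;               /* one opcode byte + one immediate byte */
}
```

Feeding it the bytes 24h 0Fh with al set to FFh would leave al at 0Fh and report that two bytes were consumed.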

All instructions follow this general format. If you want the nitty, gritty technical details then you’ll need to consult the Intel SDM. Otherwise, you know enough to begin learning and digesting the instructions you’ll encounter throughout this journey. We’re going to start off basic and gradually increase in difficulty with the instructions. If you struggle with understanding any portion of this text please drop me a line on twitter or leave a comment and I’ll be sure to answer to the best of my ability.

Arithmetic Operations

In this section, we’ll cover simple arithmetic operations like addition, subtraction, division, multiplication, and modulus. Following that we’ll step it up a little bit and cover pointer arithmetic and how pointers are modified in assembly.

— Simple Math

When a mathematical expression is executed it usually breaks down into logically equivalent blocks. Take ((2 + 4) * 6) – this expression adds 2 to 4 and then multiplies the result by 6. The expression can be done in a single line in C, but in Assembly it will be broken down into a few loads and stores, then an addition instruction, and then a multiplication. Like I mentioned, logically equivalent blocks. I’ve constructed a few examples with progressively more complex expressions and provided their C and Assembly listings.

static __declspec( noinline ) uint32_t simple_math( void )
{
    volatile uint32_t x = 0;
    volatile uint32_t y = 0;
    volatile uint32_t z = 0;

    x = 4;
    y = 12;
    z = 7;

    return ( x + y ) * z;
}

This function is pretty trivial. I’ve told the compiler with the __declspec( noinline ) modifier to never inline this particular function. I did this primarily so that I can grab the assembly as it relates to the function and not have other instructions polluting the example. We also use volatile to prevent the local storage from being optimized out, and then we set our variables to arbitrary values. So what would this look like in assembly?

sub     rsp, 38h
xor     eax, eax
mov     [rsp+20h], eax
mov     [rsp+24h], eax
mov     [rsp+28h], eax
mov     dword ptr [rsp+20h], 4
mov     dword ptr [rsp+24h], 0Ch
mov     dword ptr [rsp+28h], 7
mov     eax, [rsp+20h]
mov     edx, [rsp+24h]
add     eax, edx
mov     ecx, [rsp+28h]
imul    eax, ecx
add     rsp, 38h
retn

The function starts out by allocating space on the stack for our spill space (previously called shadow store) and our local storage. The spill space requires 32 bytes, and our 3 local variables require another 12 bytes, for 44 bytes in total.

Why is the stack allocating 56 bytes of storage instead of 44 bytes?

By definition of the Microsoft x64 calling convention, the stack must be 16-byte aligned at the point of any call. The call instruction that brought us here pushed an 8-byte return address, so on entry rsp modulo 16 = 8, and the size N we subtract must also satisfy N modulo 16 = 8 to restore alignment. 44 modulo 16 is 12, so 44 bytes leaves the stack misaligned. Rounding up to the next 16-byte boundary by adding 4 bytes gives 48, but 48 modulo 16 is 0, which still doesn’t cancel out the return address. Adding an additional 8 bytes gives 56 (38h), and 56 modulo 16 = 8, satisfying the rule. If the allocation weren’t sized this way, any WinAPI call made from this function would be invoked with a misaligned stack and would most likely break execution.
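The rounding described above can be sketched as a one-line helper. This is only the arithmetic, under the assumption that the function pushes no registers and uses no frame pointer, so the sub rsp is the only adjustment; aligned_frame_size is a name I made up for the example.

```c
#include <stdint.h>

/* Smallest allocation >= `needed` bytes whose size modulo 16 is 8.
 * That remainder cancels out the 8-byte return address already on the
 * stack, leaving rsp 16-byte aligned after `sub rsp, N`. */
static uint32_t aligned_frame_size(uint32_t needed)
{
    return ((needed + 7u) / 16u) * 16u + 8u;
}
```

With 32 bytes of spill space and 12 bytes of locals, aligned_frame_size(44) comes out to 56, which is the 38h seen in the listing.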

After the stack space is allocated we notice a xor instruction whose operands are both the same 32-bit register, eax. This is a simple method of zeroing out a register since any number xor’d against itself is 0. Now comes the part where remembering the information from the stack article will come in handy. We see three instructions that are 1:1 with the source. There are some details to mention before moving on though. The mov instruction is considered a load/store instruction where the first operand is the target and the second operand, in this case eax, is the value to store. The square brackets you see wrapping [rsp+offset] indicate memory access. You can think of [eax] as meaning access the memory contents at address eax. The simplest way to think of it is as the assembly equivalent of dereferencing a pointer.

*(cast*)(rsp+0x20) = eax

You might be wondering as well what the offset 20h means. The 20h is the offset from the top of the stack to the address where this variable’s storage is located. If we were to look at the stack of this application it would look like the diagram below.

The first thing pushed onto our stack prior to the stack space allocation is the caller’s return address, then space for our local storage and our alignment padding elements is allocated. Remember that the padding is needed because of the 16-byte alignment rule, so the padding elements are given stack space to make the total allocation satisfy it. But what about 18h (24)? 24 modulo 16 is 8, thereby following the rule, however, that wouldn’t leave any storage for our spill space. After allocating storage for our spill space we are no longer aligned and need to add padding elements. You may also notice that x and y are in the same stack element. That’s because these elements are 8 bytes in size and our variables are 4 bytes in size, which means we can fit our x and y variables into one storage spot on the stack. The same goes for our z variable. You’ll notice it goes padding then z storage, and that’s just the way I wanted to show it: the lower 32 bits of the element at [rsp+28h] hold the value of z, and the upper 32 bits are never written by this function.
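If a diagram doesn’t stick, it can help to picture the frame as a C struct whose offset 0 is rsp after the sub rsp, 38h. The sketch below is my own model of the offsets read off the listing; the struct and field names are invented, and a compiler is of course free to lay out a real frame differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative model of the frame after `sub rsp, 38h`; offset 0 is rsp.
 * The caller's return address would sit right after it, at rsp+38h. */
struct simple_math_frame {
    uint8_t  spill[0x20];    /* 32 bytes of spill (shadow) space */
    uint32_t x;              /* [rsp+20h]                        */
    uint32_t y;              /* [rsp+24h], same 8-byte element   */
    uint32_t z;              /* [rsp+28h]                        */
    uint8_t  padding[0x0C];  /* unused bytes up to 38h           */
};

_Static_assert(sizeof(struct simple_math_frame) == 0x38,
               "model should cover exactly the 38h bytes allocated");
```

Checking offsetof on x, y, and z gives 20h, 24h, and 28h, matching the memory operands in the disassembly.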

Strong understanding is important!

If you’re wondering why all the detail for this particular example it’s because I want to cover it in the most detail so that in future examples you are well equipped to read them and understand them. This will likely be the longest section because there is a lot to cover initially about assembly. Once we move forward the other examples will just be a matter of understanding the nuances of the instruction.

Let’s continue and bring the assembly example back into view.

sub     rsp, 38h
xor     eax, eax
mov     [rsp+20h], eax
mov     [rsp+24h], eax
mov     [rsp+28h], eax
mov     dword ptr [rsp+20h], 4
mov     dword ptr [rsp+24h], 0Ch
mov     dword ptr [rsp+28h], 7
mov     eax, [rsp+20h]
mov     edx, [rsp+24h]
add     eax, edx
mov     ecx, [rsp+28h]
imul    eax, ecx
add     rsp, 38h
retn

We now know that the mov [rsp+20h], eax instruction is zeroing the storage where x is allocated. The same goes for y and z; they just have different offsets from rsp. We can see that y at [rsp+24h] and z at [rsp+28h] are being set to 0. The lines after that store the values we had preset in the source. You probably noticed that this mov is slightly different from the last, with some sort of specifier being used: dword ptr. The dword ptr specifier simply means that the target operand is 32 bits in size, the size of a doubleword. The store will then only write 32 bits: the lower half of an 8-byte stack element when the address is the start of that element. This is also what allows us to share a stack element between two 32-bit variables. The next two instructions are simple to understand now.
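Here is a small sketch of why two dword stores can share one 8-byte stack element. It models the stack as a plain byte array and writes bytes in little-endian order by hand, so it behaves the same on any host; store_dword and load_qword are names I picked for the illustration.

```c
#include <stdint.h>

/* Emulates `mov dword ptr [base+off], value`: touches exactly four bytes,
 * least significant byte first (x86 is little-endian). */
static void store_dword(uint8_t *base, uint32_t off, uint32_t value)
{
    for (int i = 0; i < 4; i++)
        base[off + i] = (uint8_t)(value >> (8 * i));
}

/* Reads a whole 8-byte stack element back as one little-endian value. */
static uint64_t load_qword(const uint8_t *base, uint32_t off)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | base[off + i];
    return v;
}
```

Storing 4 at offset 20h and 0Ch at offset 24h and then reading the 8-byte element at 20h gives 0000000C00000004h: x in the lower half, y in the upper half.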

After storing our values to the appropriate stack elements we load those elements into registers to be used for computation.

Registers vs. Memory Accesses

Memory accesses by the processor are slow to execute because the instructions generate virtual addresses that must be translated by the MMU to physical memory addresses, then the processor must reach out to main memory with this translated address to access the memory. This is why having a hierarchy of caches associated with the CPU is beneficial, however, using CPU registers that are part of the die is orders of magnitude faster than reaching out to main memory. Compilers typically will prefer to use registers when performing computations to favor speed of execution.

We know now that x is loaded into eax and y into edx, then immediately after, an add instruction is encountered with the operands eax and edx, respectively. The add instruction takes the second operand and adds it to the first, storing the result in the first. In this case, it would be performing this:

x += y;

Simple enough. On the next line we see z being loaded into ecx, followed by imul with eax as the first operand and ecx as the second. This instruction multiplies the first operand by the second and stores the result in the first operand. This would translate to:

x *= z;

The original source performs all of this in the return statement. There’s something peculiar about this because we know it returns an integer, but how? Through the use of eax. The general-purpose register rax is the return value register. This means that if an integer value is to be returned to the caller under the Microsoft x64 calling convention, it will be stored in rax (the System V AMD64 ABI uses rax for this as well). It is subject to change with different architectures, but for 64-bit code on Intel and AMD it is rax. The instruction add rsp, 38h is how we reclaim the stack space allocated for our local storage. This leaves the return address of the calling function at the top of the stack, which means that when the last instruction, retn, executes, rip will be set to that address and the processor will jump to that location and continue executing.
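To tie the whole listing together, here is the same computation written in C with one local variable per register, each line annotated with the instruction it mirrors. It’s a reading aid under the assumption that plain locals stand in for registers and stack slots, not something a compiler would produce.

```c
#include <stdint.h>

/* Register-by-register walk through the arithmetic part of the listing. */
static uint32_t simple_math_steps(void)
{
    uint32_t x = 4, y = 12, z = 7;  /* values stored at [rsp+20h..28h] */
    uint32_t eax, ecx, edx;

    eax = x;        /* mov  eax, [rsp+20h] */
    edx = y;        /* mov  edx, [rsp+24h] */
    eax += edx;     /* add  eax, edx       */
    ecx = z;        /* mov  ecx, [rsp+28h] */
    eax *= ecx;     /* imul eax, ecx       */

    return eax;     /* the caller reads the result from eax */
}
```

Calling simple_math_steps() yields 112, the same as ( 4 + 12 ) * 7 in the original source.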

That’s all there is to this function. As we continue on with the next forty-five million instructions I’ll only address details that can’t be deduced easily and explain new behavior. We’ve covered a lot for this first example, but it will make life so much easier as we move forward. The next sections will go by quickly, but be sure to take note of the quirks and additional information dialogs. It’s important to understand this content fully.

Order of Operations

When evaluating mathematical expressions there is a set of rules that is followed in order to obtain the correct result. If you’ve taken a math class you’ve encountered information about order of operations. In this case, we have parentheses surrounding the first expression we wanted solved, which means that it gets evaluated first. The compiler takes that into consideration, otherwise you would get an incorrect result. If you remove the parentheses from the source provided, the imul would take place before the add instruction. PEMDAS. Remember that.
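A quick sketch of the difference, using the same values as the example. The two function names are mine and exist only to put the expressions side by side.

```c
#include <stdint.h>

/* Parentheses force the addition first: add, then imul. */
static uint32_t with_parens(uint32_t x, uint32_t y, uint32_t z)
{
    return (x + y) * z;
}

/* Without them, multiplication binds tighter: imul, then add. */
static uint32_t without_parens(uint32_t x, uint32_t y, uint32_t z)
{
    return x + y * z;
}
```

For x = 4, y = 12, z = 7 the first returns 112 and the second returns 88, so the order in which the add and imul are emitted really does change the answer.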

— Pointer Arithmetic

If you’ve written in C or C++ you’ve probably done some pointer arithmetic yourself. It’s confusing at a high-level sometimes and it certainly gets confusing when ripping away the abstractions of a high-level language. In this sub-section, we’re going to look at two examples of pointer arithmetic performed on two different data structures: an array, and a linked list. As mentioned previously, only important or new information will be addressed in this sub-section and the others so if you’re having trouble remembering certain things please refer to the above section. If it’s not mentioned now I’ve mentioned it before. We’re going to start off with another example in C which is just how array accesses can look in assembly.

static __declspec( noinline ) uint32_t pointers( void )
{
    uint64_t a[10];
	
    // looped access
    for ( volatile uint32_t it = 0; it < 10; it++ )
        a[ it ] = it + 2;

    // direct access
    a[ 0 ] = 1337;
    a[ 4 ] = 1995;
    
    // quik maffs
    *( uint64_t* ) ( a + 6 ) = 49;

    for ( volatile uint32_t it = 0; it < 10; it++ )
        printf( "%d\n", a[ it ] );

    return 0;
}

This example is pretty straightforward. The assembly? Not so much.

                sub     rsp, 78h
                pxor    xmm0, xmm0
                movdqu  xmmword ptr [rsp+20h], xmm0
                movdqu  xmmword ptr [rsp+30h], xmm0
                movdqu  xmmword ptr [rsp+40h], xmm0
                movdqu  xmmword ptr [rsp+50h], xmm0
                movdqu  xmmword ptr [rsp+60h], xmm0
                mov     dword ptr [rsp+70h], 0
                mov     eax, [rsp+70h]
                cmp     eax, 0Ah
                jnb     short loc_140001084

loc_140001067:                          
                mov     eax, [rsp+70h]
                mov     edx, [rsp+70h]
                add     eax, 2
                mov     [rsp+rdx*8+20h], rax
                inc     dword ptr [rsp+70h]
                mov     ecx, [rsp+70h]
                cmp     ecx, 0Ah
                jb      short loc_140001067

loc_140001084:                          
                mov     qword ptr [rsp+20h], 539h
                mov     qword ptr [rsp+40h], 7CBh
                mov     dword ptr [rsp+74h], 0
                mov     eax, [rsp+74h]
                mov     qword ptr [rsp+50h], 31h
                cmp     eax, 0Ah
                jnb     short loc_1400010D2

loc_1400010B0:                         
                mov     eax, [rsp+74h]
                lea     rcx, aD         ; "%d\n"
                mov     rdx, [rsp+rax*8+20h]
                call    sub_1400010E0
                inc     dword ptr [rsp+74h]
                mov     eax, [rsp+74h]
                cmp     eax, 0Ah
                jb      short loc_1400010B0

loc_1400010D2:                       
                xor     eax, eax
                add     rsp, 78h
                retn

Immediately we notice a significant difference in complexity from the last example. We want to get the hard stuff out of the way first, so why the hell not? You can probably guess what the first instruction does based off prior experience. If you do the math to determine the proper size of the stack allocation the value makes sense. Spill space is four 8-byte elements (32 bytes), our array is 10 elements so 10 * 8 = 80 bytes, and the two 4-byte loop counters at [rsp+70h] and [rsp+74h] take another 8 bytes. 32 + 80 + 8 = 120 bytes, or 78h, and 120 modulo 16 = 8, so no extra alignment padding is needed. No problem.

The best way to approach complex or unknown disassembly is literally one line at a time, grouping together similar operations. Looking at the next instruction we see a pxor. This instruction is a logical exclusive OR for SIMD types like __m128i. It acts the same as the xor we saw previously but zeroes the 16-byte register xmm0. XMM registers are additional CPU registers that were added with the advent of SSE instructions. They are 128-bit (16-byte) SIMD registers that can hold packed integer or floating-point data, and in 64-bit mode they are named XMM0 to XMM15. You can read more about them in the recommended reading section. You might be wondering why these are even used when we haven’t performed any floating-point operations or used SSE anywhere. These registers are used because the compiler wanted to yield the most performant code and optimized our function. You’ll notice the movdqu instruction which, you guessed it, stores the value of xmm0 into that stack location. The xmmword ptr specifier is used similarly to dword ptr in the previous example and tells us the instruction writes 16 bytes of data at [rsp+20h]. The sequence of these 5 instructions is a fast way to initialize our allocated stack space to 0. Think about this: 70h - 20h is 50h, which is 80 bytes in decimal, and our array is 10 elements each 8 bytes in size, thus this sequence is the shortcut to zero our memory. If you’re confused because you see 60h and not 70h, remember that the last store writes zeros from [rsp+60h] up to, but not including, [rsp+(60h + 10h)], where 10h is 16 bytes because that’s the size of the xmm register. This means that everything up to 70h is zero!
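The five-store zeroing sequence can be sketched in C by treating the frame as a byte array and copying a 16-byte block of zeros once per movdqu. The constants come from the listing; zero_array_region and the enum names are made up for this example, and memcpy is standing in for the unaligned 16-byte store.

```c
#include <stdint.h>
#include <string.h>

enum { ARRAY_BASE = 0x20, ARRAY_END = 0x70, XMM_BYTES = 16 };

/* Emulates the run of `movdqu xmmword ptr [rsp+off], xmm0` stores with
 * xmm0 zeroed. Returns how many 16-byte stores were issued. */
static int zero_array_region(uint8_t *frame)
{
    const uint8_t xmm0[XMM_BYTES] = { 0 };      /* pxor xmm0, xmm0 */
    int stores = 0;

    for (uint32_t off = ARRAY_BASE; off < ARRAY_END; off += XMM_BYTES) {
        memcpy(frame + off, xmm0, XMM_BYTES);   /* one movdqu      */
        stores++;
    }
    return stores;
}
```

Run against a 78h-byte buffer it issues five stores and clears offsets 20h through 6Fh, leaving the bytes at 70h and beyond (where the loop counters live) untouched.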

Moving on we notice memory access to [rsp+70h] and initializing it to 0, followed by a mov of [rsp+70h] to eax. What do we know about this sequence of instructions and its relation to our example? The first thing we should note is that it is using eax instead of rax (the 64-bit counterpart of eax.) Where are we using a 32-bit variable? In our first for-loop as the iterator! If we look right after that we notice that there is a cmp instruction. The cmp instruction is the comparison instruction which compares the first operand to the second. It sets certain bits of the RFLAGS register to indicate the result. We’ll cover that in more detail in the next section. For now, just know it is comparing against 0Ah. This feels familiar… our for-loop construct does the same thing! A high-level view of the analysis we’ve done so far would look like this:

void func()
{
    uint128_t xmm_array[5] = { 0 };

    for(uint32_t rsp_70 = 0; rsp_70 < 10;) {}
}

Notice how I’m only taking the assumptions I’ve made from analysis of the disassembly so far. I’m doing this so you, the reader, start to see how to build the pseudo-code from straight disassembly. Now, the instruction following our comparison is a JCC instruction, otherwise known as a jump if condition is met. The jnb instruction means jump to the target address if the result of the comparison indicates the first operand is not below the second. Like the cmp instruction, the details on these instructions will come later. This will jump to the address 140001084 if our counter is greater than or equal to 10. So in terms of our reconstruction how do we interpret this? Well, we know that a for-loop runs until its condition is no longer met, at which point it breaks out of the loop and continues executing the code that follows. This means that our jnb will go to the address where code continues after our loop, so what follows the jnb if the jump isn’t taken is what is happening inside the loop! We can also assume that the address the jnb would jump to marks the end of our loop. Let me bring into view the code that is between the jnb and 140001084.

mov     eax, [rsp+70h]
mov     edx, [rsp+70h]
add     eax, 2
mov     [rsp+rdx*8+20h], rax
inc     dword ptr [rsp+70h]
mov     ecx, [rsp+70h]
cmp     ecx, 0Ah
jb      short loc_140001067

This doesn’t look too daunting. We know some of these instructions. The first two load eax and edx with the value of our counter, and then adds 2 to eax. Now, the next access is a little bit confusing but you might be able to figure it out on your own at this point – give it a try! If you weren’t able to let’s break it down. We see mov, so it’s storing rax into this memory location that is calculated by some obscene combination of things. Jot down what you know from previous instructions.

eax = counter
edx = counter
eax += 2
rsp = top of stack (what's at top of stack?)
[] means we're writing to the memory at location inside brackets

mov [rsp + counter * 8 + 20h], rax
8 bytes is the size of a 64-bit integer
20h is offset from stack where our xmm array starts

This is what we know. From here we can begin to understand what’s happening. The easiest thing to do is break down all of the details and make educated guesses about the information. Using what we know we can make sense of the expression in the brackets: [rsp + (counter * sizeof(uint64_t)) + base_of_array] = rax. This is where previous experience in languages like C or C++ comes in handy. We know that you can index an array in C in a messier manner like *(cast*)(array + index), and knowing that this is using the base of our array we know it’s writing somewhere in this array. If we were to reorder this and write it like an array access in C we’d come up with something like this:

// tos = top of stack
*(uint64_t*)(tos + array_offset + (counter * sizeof(uint64_t))) = rax;
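The address calculation in [rsp+rdx*8+20h] is just base + index * scale + displacement. A minimal sketch of that arithmetic, using plain integers in place of real addresses (effective_address is a name I made up):

```c
#include <stdint.h>

/* Computes base + index*scale + disp, the shape of [rsp+rdx*8+20h]:
 * base = rsp, index = rdx, scale = 8, disp = 20h. */
static uint64_t effective_address(uint64_t base, uint64_t index,
                                  uint64_t scale, uint64_t disp)
{
    return base + index * scale + disp;
}
```

If rsp happened to be 1000h, element 0 would land at 1020h and element 9 at 1068h, each one sizeof(uint64_t) bytes after the last.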

It’s beginning to become more understandable. At this point, we can make the assumption that since we have a loop the counter is used to index into the array. Let’s take this low-level representation and combine it with our assumptions to add to our reconstruction.

func()
{
    uint128_t xmm_array[5] = { 0 };

    // counter is the value stored at [rsp+70h]
    for(uint32_t counter = 0; counter < 10;)
    {
        // eax = counter + 2, stored at index edx = counter
        xmm_array[counter] = counter + 2;
    }
}

This looks a lot cleaner. But it doesn’t make much sense, since we’d wind up with an index out of bounds bug: the counter loops to 10 but we only have 5 16-byte elements in our array. If you look at the instructions again, particularly the mov, we saw that it was indexing in by the size of unsigned __int64. This means that our initial assumption that it was an array of 128-bit elements is wrong; it’s an array of 64-bit elements.

func()
{
    uint64_t u64_array[10] = { 0 };

    // counter is the value stored at [rsp+70h]
    for(uint32_t counter = 0; counter < 10;)
    {
        // eax = counter + 2, stored at index edx = counter
        u64_array[counter] = counter + 2;
    }
}

This is much better. It makes sense with all the assembly we’ve read so far. Continuing our loop excerpt we’ll see that the instruction after the write to the array is inc.

mov     eax, [rsp+70h]
mov     edx, [rsp+70h]
add     eax, 2
mov     [rsp+rdx*8+20h], rax
inc     dword ptr [rsp+70h]    <---
mov     ecx, [rsp+70h]
cmp     ecx, 0Ah
jb      short loc_140001067

The inc instruction is the unary increment instruction that takes the operand and adds 1 to it. Now we know that our loop is incrementing our counter! Skim the rest of the sequence and you’ll notice our comparison again and then a JCC instruction, jb. If you go back and look at the original disassembly listing you’ll see where loc_140001067 is.

                jnb     short loc_140001084

loc_140001067:                          
                mov     eax, [rsp+70h]
                mov     edx, [rsp+70h]
                add     eax, 2
                mov     [rsp+rdx*8+20h], rax
                inc     dword ptr [rsp+70h]
                mov     ecx, [rsp+70h]
                cmp     ecx, 0Ah
                jb      short loc_140001067

That’s it, that’s our first loop! If we add to our reconstruction we will now have this:

func()
{
    uint64_t u64_array[10] = { 0 };

    // counter is the value stored at [rsp+70h]
    for(uint32_t counter = 0; counter < 10; counter++)
    {
        // eax = counter + 2, stored at index edx = counter
        u64_array[counter] = counter + 2;
    }
}
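The compiler didn’t emit the loop in the shape we write it. It checks the condition once up front (cmp/jnb) and then again at the bottom of the body (cmp/jb). Below is a sketch of that shape in C using goto, with each line annotated with the instruction it mirrors. The function name and labels are mine, and the array is passed in rather than living on the stack so the result can be inspected.

```c
#include <stdint.h>

/* Mirrors the guard-then-loop structure of the first loop in the listing. */
static void run_first_loop(uint64_t a[10])
{
    volatile uint32_t it = 0;       /* mov dword ptr [rsp+70h], 0   */

    if (!(it < 10))                 /* cmp eax, 0Ah / jnb end       */
        goto end_first_loop;

loop_body:
    {
        uint32_t eax = it;          /* mov eax, [rsp+70h]           */
        uint32_t edx = it;          /* mov edx, [rsp+70h]           */
        eax += 2;                   /* add eax, 2                   */
        a[edx] = eax;               /* mov [rsp+rdx*8+20h], rax     */
    }
    it++;                           /* inc dword ptr [rsp+70h]      */
    if (it < 10)                    /* cmp ecx, 0Ah / jb loop_body  */
        goto loop_body;

end_first_loop:
    return;
}
```

After it runs, every a[i] holds i + 2, which is exactly what the for-loop in the reconstruction does.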

Awesome. Now it’s your turn. Review the rest of the disassembly and rebuild based on assumptions you make then compare with the original source code. Try to refrain from using the original source as a reference.

                sub     rsp, 78h
                pxor    xmm0, xmm0
                movdqu  xmmword ptr [rsp+20h], xmm0
                movdqu  xmmword ptr [rsp+30h], xmm0
                movdqu  xmmword ptr [rsp+40h], xmm0
                movdqu  xmmword ptr [rsp+50h], xmm0
                movdqu  xmmword ptr [rsp+60h], xmm0
                mov     dword ptr [rsp+70h], 0
                mov     eax, [rsp+70h]
                cmp     eax, 0Ah
                jnb     short end_first_loop

first_loop:                          
                mov     eax, [rsp+70h]
                mov     edx, [rsp+70h]
                add     eax, 2
                mov     [rsp+rdx*8+20h], rax
                inc     dword ptr [rsp+70h]
                mov     ecx, [rsp+70h]
                cmp     ecx, 0Ah
                jb      short first_loop

end_first_loop:                          
                mov     qword ptr [rsp+20h], 539h
                mov     qword ptr [rsp+40h], 7CBh
                mov     dword ptr [rsp+74h], 0
                mov     eax, [rsp+74h]
                mov     qword ptr [rsp+50h], 31h
                cmp     eax, 0Ah
                jnb     short loc_1400010D2

loc_1400010B0:                         
                mov     eax, [rsp+74h]
                lea     rcx, aD         ; "%d\n"
                mov     rdx, [rsp+rax*8+20h]
                call    sub_1400010E0
                inc     dword ptr [rsp+74h]
                mov     eax, [rsp+74h]
                cmp     eax, 0Ah
                jb      short loc_1400010B0

loc_1400010D2:                       
                xor     eax, eax
                add     rsp, 78h
                retn

Disassembly Tips

Certain size specifiers like dword ptr, qword ptr, and xmmword ptr are great hints as to the size of an operation and sometimes the size of the operand. Also remember the sizes of different types and widths of registers (e.g. eax = 32-bits, rax = 64-bits, uint64_t = 64-bits).
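If you want something to keep next to you while reading listings, a tiny lookup like the one below captures the sizes. It’s a sketch only: specifier_bytes is a hypothetical helper, and besides the three specifiers seen in this article I’ve included byte ptr and word ptr, which follow the same naming pattern.

```c
#include <string.h>

/* Maps a size specifier as printed by the disassembler to its width in
 * bytes. Returns -1 for anything it doesn't know about. */
static int specifier_bytes(const char *spec)
{
    static const struct { const char *name; int bytes; } table[] = {
        { "byte ptr",     1 },
        { "word ptr",     2 },
        { "dword ptr",    4 },
        { "qword ptr",    8 },
        { "xmmword ptr", 16 },
    };

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(spec, table[i].name) == 0)
            return table[i].bytes;
    return -1;
}
```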

Conditional Operations and Comparisons

This section covers conditional branching instructions and operations. There are a ton of flavors of similar instructions, and we won’t be able to hit them all, but you’ll get a general idea and know where to look to learn more. We’ll also cover checking for error conditions, validating input, if something is about to ruin your life, etc. These are not necessarily the easiest instructions, however, we’ll cover as many of the subtleties as we can. If you’ve made it to this section and successfully completed the challenge at the end of the last then the majority of these examples will be straightforward.

— Comparing Two Operands

The comparison instruction, cmp, was encountered in the previous section. Its operation is to compare the first operand with the second operand, however, the result is not stored in either of the operands. Instead, the comparison instruction sets status flags in the RFLAGS register indicating the result of the comparison. If we take a look at the RFLAGS register diagram from the Intel SDM Vol. 1 we’ll be able to discern which flags are typically affected.

The specific flags (bits in EFLAGS) we’re concerned with in comparisons or conditional operations are as follows:

  • Overflow Flag (OF)
  • Sign Flag (SF)
  • Zero Flag (ZF)
  • Auxiliary Carry Flag (AF)
  • Parity Flag (PF)
  • Carry Flag (CF)

These are known as the status flags and are also identified in the diagram above. The Direction Flag (DF) sits among them in the register but is a control flag used by string instructions, not a status flag. The compare instruction updates all six status flags, and we’ll look at how they are set and used as we move forward. We first need to understand how the comparison is actually performed. With cmp the comparison of the two operands is done by subtracting the second from the first, much like the sub instruction we’ve encountered often, except the result is discarded. The flag most often checked after a comparison is the zero flag (ZF), which is set when the result of the subtraction is 0. Let’s pull from our earlier examples: cmp eax, 0Ah. In this instance, if eax has a value of 6 the result of the subtraction operation would be -4. Since -4 is not 0 the zero flag (ZF) stays clear. Once the result of the subtraction is 0 our zero flag will be set (e.g. eax is 10).

Comparison and jnb

The comparison instruction we encountered earlier prior to the jnb (which, if you recall, jumped if the first operand was not below the second) uses a different flag than ZF. The jnb instruction uses the carry flag (CF) to determine if the jump should be taken: it jumps when CF is clear. The CF is set when an operation generates a carry out of, or a borrow into, the most significant bit of the result. For the comparisons we’ve been performing on unsigned integers, a borrow happens exactly when the first operand is smaller than the second, that is, when the subtraction would go below zero. With cmp eax, 0Ah and eax equal to 6, the result wraps around to FFFFFFFCh, which sets multiple flags: the sign flag, the carry flag, and the auxiliary carry flag.
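Here is a minimal model of what cmp does to the flags we care about, plus the two conditions used by the loops earlier. It’s a sketch for 32-bit operands only and leaves out AF and PF; the struct and function names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

struct cmp_flags { bool zf, cf, sf, of; };

/* Models `cmp a, b`: computes a - b, keeps only the flags. */
static struct cmp_flags cmp32(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;                 /* result is discarded afterwards */
    struct cmp_flags f;

    f.zf = (r == 0);                    /* operands were equal            */
    f.cf = (a < b);                     /* unsigned borrow                */
    f.sf = ((r >> 31) & 1) != 0;        /* top bit of the result          */
    f.of = ((((a ^ b) & (a ^ r)) >> 31) & 1) != 0;  /* signed overflow    */
    return f;
}

static bool jb_taken(struct cmp_flags f)  { return f.cf;  }  /* below     */
static bool jnb_taken(struct cmp_flags f) { return !f.cf; }  /* not below */
```

With a = 6 and b = 10 this gives ZF clear, CF set, and SF set, so jb is taken and jnb is not. With a = 10 and b = 10, ZF is set, CF is clear, and jnb is taken, which is how the loop exits.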

This is why understanding the RFLAGS register is extraordinarily important as well as the conditions that are used to determine if a branch will occur. We’ll cover the JCC instructions soon, for now, that tidbit should just be in the back of your mind.

— Testing Two Operands

It’s not unusual to encounter the test instruction instead of cmp. The test instruction performs a bitwise AND on the two operands, sets the SF, ZF, and PF status flags according to the result, and then completely discards the result. It also always clears CF and OF, and leaves AF undefined. You’ll typically see test used when the branching instruction that may follow is decided from the result of SF, ZF, or PF. The main difference between cmp and test is the method of evaluation: cmp subtracts, so CF, OF, and AF carry meaning afterwards, while test ANDs, so only SF, ZF, and PF do.

Setting the sign flag

The sign flag is set to the most significant bit of the result. In signed integer arithmetic that bit is known as the sign bit and indicates whether a value is positive or negative. In unsigned integers, it is just the most significant bit.

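The test and cmp instructions are not interchangeable in general, since one performs an AND and the other a subtraction. The case where they do agree is checking a register against zero: test rax, rax and cmp rax, 0 leave ZF and SF in the same state. Compilers prefer the test form there because it has no immediate operand to encode and is therefore shorter. If you know of other reasons test is preferred in place of cmp, feel free to leave me a note!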
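For comparison with cmp, a rough C model of test on 8-bit operands might look like the following. The test_flags struct and emulate_test8 are hypothetical names for this sketch, and AF is left out because the instruction leaves it undefined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the flags that test defines. */
typedef struct {
    bool zf, sf, pf, cf, of;
} test_flags;

/* Model of "test a, b" on byte operands: AND them, keep only the flags. */
test_flags emulate_test8(uint8_t a, uint8_t b)
{
    uint8_t r = a & b;
    uint8_t p = r;
    test_flags f;

    f.zf = (r == 0);
    f.sf = (r & 0x80) != 0;

    /* Fold the byte down to one bit; that bit is 1 for an odd
       number of set bits. PF is set for an even number. */
    p ^= p >> 4;
    p ^= p >> 2;
    p ^= p >> 1;
    f.pf = (p & 1) == 0;

    f.cf = false; /* test always clears CF */
    f.of = false; /* test always clears OF */
    return f;
}
```

Calling emulate_test8(x, x) sets ZF only when x is 0, which is the same answer cmp x, 0 would give for ZF. That zero check is the common situation where the two instructions can stand in for each other.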

— Conditional Branching (JCC Instructions)

It’s time to cover some of the JCC instructions. These instructions are branching instructions and only take a branch when a condition is met. What do I mean by branch? If you’ve ever used a goto statement in C you’ve written what’s called an unconditional jump. The unconditional jump has the mnemonic jmp and is used to branch directly to an address. When a branch is taken the instruction pointer’s value is modified to the address of the target of our branching instruction. This allows execution to continue at that targeted code block. JCC instructions are the conditional counterpart of the unconditional jump: they still branch to a target, but only when a condition on the status flags is met. I put together a simple test function with a lot of branches and tried to use different conditions, but there are a lot of conditional branching instructions. If you want to learn more about them after this section, check the recommended reading. We’re only going to cover a few to give you an idea of how they work and what to look for when analyzing branches.
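Before reading the example it can help to see the conditions as plain predicates. The sketch below derives ZF and CF for an unsigned cmp a, b and then expresses a handful of JCC instructions as functions of those two flags. The jcc_flags type and the takes_* helpers are invented for illustration; the flag conditions themselves are the ones the Intel SDM lists for each mnemonic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair of flags, enough for the unsigned conditions. */
typedef struct {
    bool zf;
    bool cf;
} jcc_flags;

/* ZF and CF as "cmp a, b" would leave them. */
jcc_flags flags_after_cmp(uint64_t a, uint64_t b)
{
    jcc_flags f;
    f.zf = (a == b); /* a - b is zero */
    f.cf = (a < b);  /* a - b needs a borrow */
    return f;
}

bool takes_jz(jcc_flags f)  { return f.zf; }          /* jump if zero / equal      */
bool takes_jnz(jcc_flags f) { return !f.zf; }         /* jump if not zero          */
bool takes_jb(jcc_flags f)  { return f.cf; }          /* jump if below             */
bool takes_jnb(jcc_flags f) { return !f.cf; }         /* jump if not below         */
bool takes_ja(jcc_flags f)  { return !f.cf && !f.zf; } /* jump if above            */
bool takes_jbe(jcc_flags f) { return f.cf || f.zf; }  /* jump if below or equal    */
```

For instance, takes_jnb(flags_after_cmp(3444, 3666)) is false, so a jnb placed after that comparison would fall through to the next instruction.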

Here’s the example C application:

#include <stdint.h>
#include <stdio.h>

static __declspec( noinline ) uint32_t branching( uint64_t v1, uint64_t v2 )
{
    volatile uint64_t v3 = 916;
    volatile uint64_t v4 = 0xFFFFFFFFFFFFFFDD;

    volatile uint64_t r1 = 0;

    if ( v1 < v2 )
    {
        r1 = 1;

        if ( v3 != v2 )
        {
            r1 = 2;

            if ( v1 + v2 >= v3 )
            {
                r1 = 10;
                if ( v4 + v1 <= 1000 )
                {
                    r1 = 15;
                }
                else
                {
                    r1 = 9;
                }
            }
            else
            {
                r1 = 1;
            }
        }
        else
        {
            r1 = 0;
        }
    }
    else
    {
        r1 = 0;
    }

    return r1;
}

int main()
{
    printf( "ret = %u\n", branching( 3444, 3666 ) );

    return 0;
}

A little bit of a headache to follow, but it’s not uncommon to encounter nested conditions. Below is the disassembly listing of the function:

                sub     rsp, 38h
                mov     qword ptr [rsp+20h], 394h
                mov     qword ptr [rsp+28h], 0FFFFFFFFFFFFFFDDh
                mov     qword ptr [rsp+30h], 0
                cmp     rcx, rdx
                jnb     short loc_1400010DD
                mov     qword ptr [rsp+30h], 1
                mov     rax, [rsp+20h]
                cmp     rax, rdx
                jz      short loc_1400010DD
                mov     qword ptr [rsp+30h], 2
                add     rdx, rcx
                mov     rax, [rsp+20h]
                cmp     rdx, rax
                jb      short loc_1400010D2
                mov     qword ptr [rsp+30h], 0Ah
                mov     rax, [rsp+28h]
                add     rcx, rax
                cmp     rcx, 3E8h
                ja      short loc_1400010F0
                mov     qword ptr [rsp+30h], 0Fh
                jmp     short loc_1400010E6

loc_1400010D2:                          
                mov     qword ptr [rsp+30h], 1
                jmp     short loc_1400010E6

loc_1400010DD:                          
                                        
                mov     qword ptr [rsp+30h], 0

loc_1400010E6:                         
                                        
                mov     rax, [rsp+30h]
                add     rsp, 38h
                retn

loc_1400010F0:                          
                mov     qword ptr [rsp+30h], 9
                jmp     short loc_1400010E6

Right off the bat, we are already in familiar territory thanks to our earlier examples. Now that you’re probably more comfortable with some instructions and reading the listings you can quickly skim the dead-listing, looking for patterns of instructions. You’ll notice there are 4 comparison instructions within the first code block. However, knowing that doesn’t immediately tell us these are nested blocks. We’ll have to walk through the code and look at the targets to generate a high-level view of what’s going on. At this point, you should be able to read the first four instructions and know what they’re doing. At the end of the local storage initialization, we see a comparison of rcx and rdx, but we don’t see them assigned anywhere in the function. This is because of the calling convention: the Microsoft x64 convention, which behaves like fastcall. If you remember reading the first and second articles there were details about the calling convention and how information is passed to functions. When invoking a procedure that follows this convention the first 4 integer or pointer arguments are passed through registers. These registers are rcx, rdx, r8, and r9, respectively.

Our function doesn’t have any mention of r8 or r9, so it’s safe to assume that it only takes two arguments through rcx and rdx. And just like that we already know what a basic function prototype of this may look like: <ret type> unk_fnc(uint64_t a1, uint64_t a2). Moving back to the comparison, we see it’s comparing the two arguments and then jumping to 1400010DD if the condition is met. The condition is jump if rcx is not below rdx, which means the instructions right after the jump only run when rcx is below rdx. This can be directly translated to an if statement like so:

if(a1 < a2) { ... }

This is a good time to talk about how to determine which is the if block and which is the else block (if there is one). When a comparison is performed like the one above, the target of the conditional jump is typically the else block, while the instructions immediately following the JCC instruction are the if block, since they only execute when the jump is not taken. We’ll see as we move forward. The three instructions that follow our first branching instruction follow a similar pattern: mov, mov, cmp. This time a 1 was placed in some local storage area at [rsp+30h], then rax was assigned the value of the contents in [rsp+20h]. If we look at the start of the function the value 394h was placed in [rsp+20h], so we know that one of our local variables has a value of 394h. Then rax is compared against rdx (our second argument) followed by a jz instruction.

The jz instruction is read as jump if zero meaning the result of the comparison was 0 and therefore will jump when the zero flag is 1. Interestingly enough the jump target for the first two conditional branches points to the same place: 1400010DD. This comparison is attempting to determine if our two registers are equal, and will take the jump if they’re equal. This means that the condition allowing continued execution is that rax and rdx are not equal. It doesn’t look like any of the other branch targets are the same here so let’s put together a reconstruction of what we know so far.

<ret type> unk_func(uint64_t a1, uint64_t a2)
{
    uint64_t rsp_20 = 0x394;
    uint64_t rsp_28 = 0xFFFFFFFFFFFFFFDD;
    uint64_t rsp_30 = 0;
    
    if(a1 < a2)
    {
        rsp_30 = 1;
        rax = rsp_20;
        if(rax != a2)
        {
            
        }
        else
        {
            goto loc_10dd;
        }
    }
    else
    {
.loc_10dd:
        rsp_30 = 0;
    }
}

This is just a rough sketch of what you can assume without looking at the original source. There are three local variables, one of those locals is used in an if-statement as shown in the disassembly, and one of the local variables is set to 1 if the if block is executed. If this feels slow, don’t worry – you’ll get much faster as you gain experience. Let’s bring the dead-listing back into view and read from the jz branch we just analyzed.

                jz      short loc_1400010DD
                mov     qword ptr [rsp+30h], 2
                add     rdx, rcx
                mov     rax, [rsp+20h]
                cmp     rdx, rax
                jb      short loc_1400010D2
                mov     qword ptr [rsp+30h], 0Ah
                mov     rax, [rsp+28h]
                add     rcx, rax
                cmp     rcx, 3E8h
                ja      short loc_1400010F0
                mov     qword ptr [rsp+30h], 0Fh
                jmp     short loc_1400010E6

loc_1400010D2:                          
                mov     qword ptr [rsp+30h], 1
                jmp     short loc_1400010E6

loc_1400010DD:                          
                                        
                mov     qword ptr [rsp+30h], 0

loc_1400010E6:                         
                                        
                mov     rax, [rsp+30h]
                add     rsp, 38h
                retn

loc_1400010F0:                          
                mov     qword ptr [rsp+30h], 9
                jmp     short loc_1400010E6

Interesting, there are 3 more branching instructions, and one of them is unconditional. It’s becoming clear that these are nested conditions and that the if/else blocks store some number into [rsp+30h]. We need to determine what happens inside of the if-statement for our rax/rdx not equal condition. It stores 2 into [rsp+30h], adds arg1 to arg2, stores [rsp+20h] in rax, then compares rdx to rax, and jumps if rdx is below rax. The if block of our previous condition has a nested condition, and we can add to our pseudocode.

<ret type> unk_func(uint64_t a1, uint64_t a2)
{
    uint64_t rsp_20 = 0x394;
    uint64_t rsp_28 = 0xFFFFFFFFFFFFFFDD;
    uint64_t rsp_30 = 0;
    
    if(a1 < a2)
    {
        rsp_30 = 1;
        rax = rsp_20;
        if(rax != a2)
        {
            rsp_30 = 2;
            uint64_t temp = a2 + a1;
            rdx = temp;
            if(rdx >= rax)
            {
                // rdx is above or equal to rax (jb not taken)
            }
            else
            {
                // rdx is below rax (jb taken)
            }
        }
        else
        {
            goto loc_10dd;
        }
    }
    else
    {
.loc_10dd:
        rsp_30 = 0;
    }
}

We’re beginning to see a pattern here of nested conditions based on the two arguments and two of the local variables. If we look at the jb target 1400010D2 we see that [rsp+30h] is being set to 1. Now, look at the code as if the branch was not taken. You’ll see that [rsp+30h] is referenced again, however, a value other than 0 or 1 is stored. We now know this isn’t a traditional true or false result.

Deducing Return Type/Value

If you want to deduce what type is being returned or the value that is returned then locating the nearest return instruction in the function may provide information. There may be multiple return instructions within a function body, but the return type will match for all of them.

Understanding disassembly is important to your success!

The above tip is what I typically do to validate anything a disassembler is telling me. Some disassemblers like IDA Pro have a decompiler that generates a pseudo-C output, but it’s important that you’re able to read and validate that the output is correct. Sometimes it gets return types right, other times it’s wrong. Sometimes the calling convention is completely trashed and you have to modify it. This is why we’re going through the disassembly slowly and together – so that you get a good grasp of what to look for and what can go wrong. Also, it’s fun to see how your pseudo-C stacks up against original or commercial decompilation.

There will not be any more pseudo-C until we’ve completed the analysis of this target, so be sure to write your own in your text editor as we go and then compare it to mine. Make the changes we noted in the branch blocks so far including the above, and let’s move on. We’re going to move faster to save page space.

A similar pattern is noticed: value store, local to register for quick execution, an addition, and then a comparison against 3E8h (1000 in decimal). The comparison checks the result of our addition statement: rcx += rax. Then, a ja instruction is executed. The ja instruction means jump if above, which is the unsigned version of jump if greater (jg), so these two instructions can be interpreted as jump if rcx, treated as an unsigned value, is greater than 1000. At this point, you should know where to look for the if/else portions of the condition. Build out your pseudocode, and continue reading.
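The distinction between above and greater matters for values with the top bit set, like the 0FFFFFFFFFFFFFFDDh local in this function. A small sketch, with helper names that are mine and not from any library:

```c
#include <stdbool.h>
#include <stdint.h>

/* What "cmp a, b" followed by ja decides: unsigned a > b. */
bool above(uint64_t a, uint64_t b)
{
    return a > b;
}

/* What "cmp a, b" followed by jg decides: signed a > b.
   The casts assume the usual two's complement representation. */
bool greater(uint64_t a, uint64_t b)
{
    return (int64_t)a > (int64_t)b;
}
```

Here above(0xFFFFFFFFFFFFFFDD, 1000) is true while greater(0xFFFFFFFFFFFFFFDD, 1000) is false, because the signed view of that value is -35. Seeing ja or jb in a listing is therefore a strong hint that the original variables were unsigned.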

mov     qword ptr [rsp+30h], 0Fh
jmp     short loc_1400010E6

The last two instructions are super easy, and also what is inside of our if block of the last condition. We store 0Fh (15) in [rsp+30h], and then unconditionally jump to 1400010E6. The location of our jump turns out to be our return sequence. The easiest way to recognize this is the retn instruction preceded by add rsp, N (to clean up the stack). Note that prior to the stack clean up one of our locals is placed into the rax register which is our return value register. We know that the value at rsp+30h is 64 bits in size since it is loaded into rax rather than eax, ax, ah, or al. Now we can insert all of this information into our pseudo-C implementation and compare yours to mine and the actual source.

uint64_t unk_func(uint64_t a1, uint64_t a2)
{
    uint64_t rsp_20 = 0x394;
    uint64_t rsp_28 = 0xFFFFFFFFFFFFFFDD;
    uint64_t rsp_30 = 0;
    
    if(a1 < a2)
    {
        rsp_30 = 1;
        rax = rsp_20;
        if(rax != a2)
        {
            rsp_30 = 2;
            uint64_t temp = a2 + a1;
            rdx = temp;
            if(rdx >= rax)
            {
                rsp_30 = 10;
                uint64_t temp1 = rsp_28 + a1;
                if(temp1 <= 0x3E8)
                {
                    rsp_30 = 15;
                }
                else
                {
                    rsp_30 = 9;
                }
            }
            else
            {
                rsp_30 = 1;
            }
        }
        else
        {
            goto loc_10dd;
        }
    }
    else
    {
.loc_10dd:
        rsp_30 = 0;
    }
    
    return rsp_30;
}

Does your pseudo implementation stack up to mine? Or is it better? How about the original source? There are many other conditional jump instructions, and we’ve covered 4 in this example (jnb, jz, jb, and ja) alongside the unconditional jmp. You’ll need to consult the Intel SDM Vol. 2 or AMD APM Vol. 3 to read more about the other JCC instructions and the conditions that must be met for execution. As we progress through this article you’ll probably start jumping ahead of what I’m detailing, and that’s perfectly fine. For those that are still beginning to grasp the concepts and understand how to analyze program flow, be sure to refer back to earlier sections for any details I assume you already know.
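If you want something you can actually compile and test against the original, one option is to transcribe the listing almost instruction for instruction using goto. This is only a sketch of that idea: the function name branching_lowered is made up, the labels are shortened versions of the disassembler’s, and a1 and a2 are reused in place of rcx and rdx.

```c
#include <stdint.h>

/* Goto-style transcription of the disassembly listing. */
uint64_t branching_lowered(uint64_t a1, uint64_t a2)
{
    volatile uint64_t v3 = 0x394;                 /* [rsp+20h] */
    volatile uint64_t v4 = 0xFFFFFFFFFFFFFFDDull; /* [rsp+28h] */
    volatile uint64_t r  = 0;                     /* [rsp+30h] */

    if (!(a1 < a2)) goto loc_10DD;  /* cmp rcx, rdx  / jnb */
    r = 1;
    if (v3 == a2) goto loc_10DD;    /* cmp rax, rdx  / jz  */
    r = 2;
    a2 += a1;                       /* add rdx, rcx        */
    if (a2 < v3) goto loc_10D2;     /* cmp rdx, rax  / jb  */
    r = 10;
    a1 += v4;                       /* add rcx, rax        */
    if (a1 > 0x3E8) goto loc_10F0;  /* cmp rcx, 3E8h / ja  */
    r = 15;
    goto loc_10E6;

loc_10D2:
    r = 1;
    goto loc_10E6;

loc_10DD:
    r = 0;

loc_10E6:
    return r;                       /* mov rax, [rsp+30h] / retn */

loc_10F0:
    r = 9;
    goto loc_10E6;
}
```

Calling branching_lowered(3444, 3666) takes the ja branch and returns 9. Once a transcription like this behaves correctly, folding the gotos back into nested if/else blocks is mostly mechanical.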

Section Challenge

Write a simple application with lots of conditional branches, have Visual Studio generate an assembly output, and build a pseudo-C implementation without referencing the source; then compare.

Load/Store Instructions

If you made it this far then the rest of this will be cake. Loading and storing data is a requirement of every application. Whether it’s storing data in a buffer to write to a file, or simply assigning a value to a variable the code underneath is performing a number of load and store operations. This is one of the most important sections since there are tons of ways to load and store data, and a lot of those ways will vary based on the type of data. Simple assignments will use a move instruction while the storage of a pointer to a string would use the load effective address instruction. You may not know what those are now, but you’ll never forget them after this section.

— Move Zero Extend

We know what a move is in assembly, but what is zero extension? Zero extension is pretty straightforward. When a smaller value is copied into a larger storage area (register, stack location, etc.), the bits of the destination above the copied value are set to 0. Take a look at these brief examples.

Standard Move

mov rax, 0xFFFFFFFFFFFFFFFF
mov ax, 0xDDDD
mov rcx, rax				; rcx = 0xFFFFFFFFFFFFDDDD

This should make sense if you remember that ax is the lower 16-bit region of rax, and can be assigned individually. We set the whole 64-bits of rax to FFFFFFFF`FFFFFFFF, then we set the lower 16-bits of rax to DDDD. We store the value of rax in rcx, and then if we were to look at what was in rcx we’d see what is shown in the assembly comment above: the upper 48 bits were left alone. One caveat: this only holds for 8-bit and 16-bit destinations. In 64-bit mode a write to a 32-bit register such as eax always zeroes the upper 32 bits of the full register, so mov eax, 0xDDDDDDDD would leave rax as 00000000`DDDDDDDD. What happens when we use move zero extend, movzx, instead of a standard move instruction?

Zero Extension

mov rax, 0xFFFFFFFFFFFFFFFF
mov dx, 0xDDDD
movzx rax, dx
mov rcx, rax				; rcx = 0x000000000000DDDD

A similar sequence of operations, different results. The movzx instruction takes an 8-bit or 16-bit register or memory source (it has no immediate form, which is why the value is staged in dx first) and a wider register destination. It copies the source into the low bits of the destination and fills every remaining bit with 0. In this case, we write FFFFFFFF`FFFFFFFF to rax, then movzx copies the 16 bits of dx into rax and zeroes the upper 48 bits. The processor zero extends the value to the size of the destination operand. This is a cheap instruction, and it’s common to see wherever a byte or word is widened, such as when loading an unsigned char, which is why it shows up constantly in string handling, encryption functions, and obfuscated code.
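In C terms, the difference between a partial register write and a zero extension can be sketched with masks and casts. The function names here are hypothetical and the 16-bit width is chosen for illustration: write_low16 models a mov into a 16-bit sub-register, and zero_extend16 models movzx with a 16-bit source and a 64-bit destination.

```c
#include <stdint.h>

/* Like a mov into a 16-bit sub-register: replace the low 16 bits
   and keep the upper 48 bits of the full register. */
uint64_t write_low16(uint64_t reg, uint16_t value)
{
    return (reg & ~(uint64_t)0xFFFF) | value;
}

/* Like movzx with a 64-bit destination: the low 16 bits come from
   the source and every bit above them becomes 0. */
uint64_t zero_extend16(uint16_t value)
{
    return (uint64_t)value;
}
```

Starting from a register full of 1 bits, write_low16 leaves 0xFFFFFFFFFFFFDDDD while zero_extend16(0xDDDD) produces 0x000000000000DDDD. In C source, this is what a conversion from an unsigned 8-bit or 16-bit type to a wider type compiles down to.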

— Move Sign Extend

Much like movzx there is an instruction for move with sign extension. The instruction movsx does the same copying of the source operand to the destination, however, it performs a sign extension depending on the operand-size. An example is provided below!

xor rcx, rcx
mov ax, 0FFFFh
movsx ecx, ax
mov rax, rcx

First, we zero out rcx, set ax to FFFF (65535 as an unsigned value, or -1 as a signed 16-bit value), then perform a move with sign extension from ax (16-bits in width) to ecx (32-bits in width), and finally copy rcx into rax. This can be a little bit confusing since we know that rcx is 0 and ax is now FFFF, but what rcx holds after the movsx executes is not exactly clear. Let’s put a pretend breakpoint on movsx ecx, ax and observe the contents of the registers.

rax = 00001ABCDEF0FFFF
rcx = 0000000000000000

ax = FFFF
eax = DEF0FFFF

I’ve placed some garbage value in rax for a little realism and to show that the write of FFFF to ax only wrote to the lower word of eax. Let’s execute the movsx instruction and observe the contents again.

rax = 00001ABCDEF0FFFF 
rcx = 00000000FFFFFFFF 

ax = FFFF 
eax = DEF0FFFF

We see that the value of FFFF was copied to ecx but an extra 16 bits were also written. This is the sign extension: the top bit of ax is 1, so movsx fills bits 16 through 31 of ecx with 1s. It didn’t set the entire value of rcx to FFFFFFFF`FFFFFFFF because the sign extension only goes up to the size of the destination operand, which is the 32-bit ecx here. The upper 32 bits of rcx read as 0 because any write to a 32-bit register in 64-bit mode clears them. You can sign extend all the way to 64-bits by using a 64-bit register as the destination, which makes the assembler emit a REX.W prefix to select the 64-bit operand size (more on that later).

movsx rcx, ax

The above would fill all of rcx with the sign of the source operand (ax), giving FFFFFFFF`FFFFFFFF.
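The two forms of movsx shown above can be mimicked in C with signed casts. This is a sketch with made-up function names, and it assumes the usual two’s complement behavior when a value such as 0xFFFF is converted to int16_t.

```c
#include <stdint.h>

/* Like "movsx ecx, ax": sign extend 16 -> 32 bits. Writing the
   32-bit register also zeroes bits 32..63 of rcx. */
uint64_t movsx_r32_r16(uint16_t ax)
{
    int32_t widened = (int16_t)ax;       /* sign extension to 32 bits */
    return (uint64_t)(uint32_t)widened;  /* upper half reads as zero  */
}

/* Like "movsx rcx, ax": sign extend 16 -> 64 bits. */
uint64_t movsx_r64_r16(uint16_t ax)
{
    int64_t widened = (int16_t)ax;       /* sign extension to 64 bits */
    return (uint64_t)widened;
}
```

With ax = 0xFFFF the first function gives 0x00000000FFFFFFFF, matching the register dump, and the second gives 0xFFFFFFFFFFFFFFFF. A positive source such as 0x7FFF comes out unchanged from both.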

— Load/Store Status Flags

We talked about status flags in a bit of detail earlier, and their importance isn’t to be dismissed. In practice, I’ve encountered some initially obscure instructions like lahf. If you’re a first-timer analyzing some target and encounter this, my first suggestion would be looking at the instruction manual, but I didn’t even know that was a thing when I first started. The lahf instruction loads the low byte of the flags register into the ah register, the upper byte of the lower word of rax. This means that ah would hold the sign flag, zero flag, auxiliary carry flag, parity flag, and the carry flag. The overflow flag sits higher up in RFLAGS and is not included. It’s not very often you’ll see this instruction, but it’s worth noting in the event you do. The sahf instruction is the store counterpart: it takes those five flag bits from ah and writes them back into their respective flags in RFLAGS.
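The layout of ah after lahf is fixed by the architecture: SF in bit 7, ZF in bit 6, AF in bit 4, PF in bit 2, and CF in bit 0, with bit 1 always reading as 1 and bits 3 and 5 as 0. A minimal sketch of packing and unpacking that byte, using made-up names (low_flags, lahf_pack, sahf_unpack):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the five flags held in the low byte of RFLAGS. */
typedef struct {
    bool sf, zf, af, pf, cf;
} low_flags;

/* Build the byte that lahf would place in ah. */
uint8_t lahf_pack(low_flags f)
{
    return (uint8_t)((f.sf << 7) | (f.zf << 6) | (f.af << 4) |
                     (f.pf << 2) | (1 << 1)    | (f.cf << 0));
}

/* Recover the flags that sahf would load from ah. */
low_flags sahf_unpack(uint8_t ah)
{
    low_flags f;
    f.sf = (ah >> 7) & 1;
    f.zf = (ah >> 6) & 1;
    f.af = (ah >> 4) & 1;
    f.pf = (ah >> 2) & 1;
    f.cf = (ah >> 0) & 1;
    return f;
}
```

If you ever see lahf followed by shifts, masks, or a test against ah, the code is most likely picking individual flags out of this byte.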

— Load Effective Address

As opposed to the other instructions in this section you’ll encounter load effective address quite often. It’s important to cover this instruction prior to our string operations section since lea (load effective address) is so frequently used when loading data offsets or pointers to objects. The instruction is quite simple in how it works and is for some reason over-complicated in discussions. The instruction takes the first operand (a register destination) and stores the effective address of the second operand. What is the effective address? It’s… just the address of the data. The lea instruction takes the second operand and, if necessary, performs calculations to generate the address of the data.

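So what’s the difference between mov and lea? Well, mov with a memory source copies the contents of the object at that address into the destination operand, while lea computes the address itself and loads that pointer into the destination operand without touching memory. In some instances, where the address is a fixed location, you can trivially replace lea with a mov of the address. I wouldn’t recommend it, however, because lea can also compute an address built from a base register, a scaled index register, and a displacement in a single instruction.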

Let’s take a look at a quick example:

int32_t p1 = 0;
printf( "%d\n", p1 );

And the disassembly:

lea rcx, qword ptr ds:[fmt_string_address]
mov edx, eax
call printf

When lea is executed it calculates the address of the format string, which is stored in the .rdata section of the program, and stores that address in rcx. If we replaced the lea with mov, rcx would instead hold the bytes found at that address, which for this example are the ASCII values of the format string’s characters (%d\n) and whatever follows them. Since printf dereferences rcx as a pointer to the format string, that bogus value would most likely generate an access violation and your program would crash. You will sometimes encounter lea used in more complex calculations like this snippet I pulled out of a random disassembly:

push    rbp
sub     rsp, 40h
mov     [rsp+50h], rcx
lea     rax, [rsp+58h]
mov     [rax], rdx

In this case, it’s storing the address of the stack location [rsp+58h]. This happens to be taken out of the disassembly of the printf function, so after storing the stack address in rax it stores the second argument of printf in the storage pointed to by rax (which is rsp+58h). It may seem a bit confusing at first, but once you finish the string operations section it’ll be quite obvious how lea works. And don’t be alarmed if you still get it messed up, everyone confuses themselves once in a while.
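A loose C analogy, assuming a table of 8-byte elements: mov with a memory operand behaves like reading an element, while lea behaves like taking its address. The function names and the fixed offset of two elements (16 bytes) are assumptions made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Roughly "mov rax, [base + index*8 + 16]": read memory at the address. */
uint64_t load_like_mov(const uint64_t *base, size_t index)
{
    return base[index + 2];
}

/* Roughly "lea rax, [base + index*8 + 16]": compute the address only,
   nothing is read from memory. */
const uint64_t *address_like_lea(const uint64_t *base, size_t index)
{
    return &base[index + 2];
}
```

Given uint64_t table[4] = {10, 20, 30, 40}, load_like_mov(table, 1) yields 40 while address_like_lea(table, 1) yields a pointer to table[3]. Because lea never dereferences anything, compilers also use it as a cheap way to do arithmetic of the form base + index * scale + constant.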

String Operations

The most common and confusing thing when starting out can be understanding string processing instructions. There’s a lot to them and we cover it all in this section. If you get stumped in later articles when we do crackme’s it’ll most likely be on these string processing instructions and deciphering where the data is flowing and what operations are being performed on the data. These string operations are essential to understand.

— String Example

Being able to identify how strings are copied intrinsically is super useful because sometimes functions like strcmp, memcpy, or a custom implementation will be inlined in code. We’re going to look at an example that copies one string to another, and we’ll encounter some familiar instructions along the way. The original source will not be provided for this example, and the pseudo-C will only be provided at the end of the analysis. Try it on your own this time!

                push    rbp
                sub     rsp, 50h
                lea     rbp, [rsp+20h]
                mov     [rbp+28h], rdi
                mov     [rbp+20h], rsi
                lea     rax, qword ptr ds:[unk1]
                mov     [rbp+8], rax
                lea     rax, qword ptr ds:[unk2] 
                mov     [rbp+10h], rax
                mov     rax, [rbp+8]
                mov     rdx, [rbp+10h]
                mov     rsi, rax
                mov     rdi, rdx
                mov     rax, rsi

loc_140001039:
                mov     dl, [rdi]
                inc     rdi
                mov     [rsi], dl
                inc     rsi
                test    dl, dl
                jnz     short loc_140001039
                mov     [rbp+18h], rax
                lea     rax, fmt
                mov     rdx, [rbp+8]
                mov     rcx, rax
                call    printf
                mov     [rbp+0], eax
                mov     eax, 0
                mov     rsi, [rbp+20h]
                mov     rdi, [rbp+28h]
                lea     rsp, [rbp+30h]
                pop     rbp
                retn

The first two instructions should set off a few bells. The value used for the stack allocation is 50h (80), and 80 modulo 16 is 0, so the stack is misaligned. Is it? If you thought no, you’re correct. It’s not misaligned. When the function is entered the call instruction has just pushed an 8-byte return address, which leaves rsp 8 bytes off a 16-byte boundary. Prior to allocating stack space we pushed one of our registers, rbp, onto the stack, which is another 8 bytes and brings rsp back onto a 16-byte boundary. This means that when we perform sub rsp, 50h our function has 88 bytes allocated in total, and because 80 is a multiple of 16 rsp stays aligned as the ABI requires before any call is made. There are a few variations in function prologues and this is one of the more common sequences of instructions you’ll experience.
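The alignment argument is just arithmetic on rsp, which can be checked with a few lines of C. This sketch assumes the caller had rsp on a 16-byte boundary at the moment of the call, as the Microsoft x64 ABI requires; the function names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Follow rsp from the call site through the prologue of the listing. */
uint64_t rsp_after_prologue(uint64_t rsp_at_call_site)
{
    uint64_t rsp = rsp_at_call_site;
    rsp -= 8;    /* call pushes the return address */
    rsp -= 8;    /* push rbp                       */
    rsp -= 0x50; /* sub rsp, 50h                   */
    return rsp;
}

bool is_16_byte_aligned(uint64_t rsp)
{
    return (rsp & 0xF) == 0;
}
```

Starting from an aligned value such as 0x7FF000, rsp is misaligned right after the call, aligned again after push rbp, and still aligned after the 50h byte allocation, for a total drop of 60h bytes.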

After the prologue we meet lea again. It’s not immediately obvious what’s going on, but it’s storing the address of the stack location [rsp+20h] in rbp. Recall that rbp is commonly referred to as the base pointer, and here it points to a seemingly arbitrary stack location. Why 20h? How much space do we normally allocate for spill space? 32 bytes (20h). The lowest 32 bytes of the allocation, from rsp up to rsp+20h, are set aside as that spill space, and rbp is pointed at the first byte above it. This is what’s called setting up the stack frame.

The Stack Frame

A stack frame is an area of stack space that represents a procedure call and the arguments associated with the procedure call. The caller places any stack-passed arguments first, the call instruction then pushes the return address, and finally the called function allocates space for its saved registers and local storage.

It should be starting to make a little bit more sense. The spill space is allocated as well as storage for our saved registers and local variables. In this instance, the function we’re analyzing is the main entry point of our program. All the code is executing in there. That function takes two arguments, the count of arguments and the command line argument array, but neither is used here, which is why rcx and rdx are never read before being overwritten. We now see that this sequence is setting up our stack frame for the function. The reason it uses [rsp+20h] as the base of the frame is that the lowest 32 bytes of the allocation are reserved for calls to other functions. Different calling conventions have different stack-maintenance responsibilities. Under the Microsoft x64 calling convention, which this listing follows (note that printf receives its arguments in rcx and rdx), the caller must provide 32 bytes of spill space for every function it calls. The 32 bytes from rsp to rsp+20h are therefore the spill space for printf. Knowing the differences between calling conventions is a must and I encourage you to learn them from the direct links here or in the recommended reading section.

What that all means is that the first 32 bytes of the stack allocation aren’t used by our function directly; they belong to the functions it calls. If we take the 80 bytes allocated by sub rsp, 50h, subtract those 32 bytes, and then subtract the 16 bytes at [rbp+20h] and [rbp+28h] that hold saved copies of rsi and rdi, we’re left with 32 bytes for local storage. 32 divided by 8 is 4, meaning there are 4 local slots in this function: [rbp+0], [rbp+8], [rbp+10h], and [rbp+18h]. Now that we know how many locals are used tracking variable movement is a lot easier. This helps us realize that rbp marks the lowest stack slot that holds our function’s own data: the base of the stack (or call) frame. So when we see rbp used with a positive offset, we know it refers to one of this function’s own slots.

The next two instructions store rdi and rsi on the stack. Under this calling convention both are nonvolatile registers, meaning a function that wants to use them must hand them back unchanged, so the function saves them here and reloads them just before it returns.

mov [rbp+28h], rdi 
mov [rbp+20h], rsi
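One way to keep the offsets straight is to write the allocation down as a C struct, lowest address first. This is a hypothetical model based only on the offsets visible in the listing; the field names are mine, and the compiler has no such struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the 50h bytes reserved by "sub rsp, 50h". */
struct main_frame {
    uint8_t  callee_spill[0x20]; /* rsp+00h: reserved for functions this one calls */
    uint64_t slot_rbp_00;        /* rsp+20h = rbp+00h */
    uint64_t slot_rbp_08;        /* rsp+28h = rbp+08h */
    uint64_t slot_rbp_10;        /* rsp+30h = rbp+10h */
    uint64_t slot_rbp_18;        /* rsp+38h = rbp+18h */
    uint64_t rsi_copy;           /* rsp+40h = rbp+20h, reloaded before retn */
    uint64_t rdi_copy;           /* rsp+48h = rbp+28h, reloaded before retn */
};

/* Convert an rbp-relative offset from the listing to an rsp-relative
   one, given "lea rbp, [rsp+20h]". */
size_t rbp_to_rsp_offset(size_t rbp_offset)
{
    return rbp_offset + 0x20;
}
```

The struct is exactly 50h bytes, and offsetof(struct main_frame, rdi_copy) comes out as 48h, the same slot that mov [rbp+28h], rdi writes to.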

To understand how this would look in a stack view, see below.

stack view

The diagram above shows what each instruction of the opening sequence is referencing, and how they all work together. If you were to omit the frame pointer (rbp) and look at where the rdi and rsi registers store their values you’d see they wrote to [rsp+48h] and [rsp+40h], the top two slots of the allocation, just below the saved rbp. Let’s bring our disassembly back into view.

                push    rbp
                sub     rsp, 50h
                lea     rbp, [rsp+20h]
                mov     [rbp+28h], rdi
                mov     [rbp+20h], rsi
                lea     rax, qword ptr ds:[unk1]
                mov     [rbp+8], rax
                lea     rax, qword ptr ds:[unk2] 
                mov     [rbp+10h], rax
                mov     rax, [rbp+8]
                mov     rdx, [rbp+10h]
                mov     rsi, rax
                mov     rdi, rdx
                mov     rax, rsi

loc_140001039:
                mov     dl, [rdi]
                inc     rdi
                mov     [rsi], dl
                inc     rsi
                test    dl, dl
                jnz     short loc_140001039
                mov     [rbp+18h], rax
                lea     rax, fmt
                mov     rdx, [rbp+8]
                mov     rcx, rax
                call    printf
                mov     [rbp+0], eax
                mov     eax, 0
                mov     rsi, [rbp+20h]
                mov     rdi, [rbp+28h]
                lea     rsp, [rbp+30h]
                pop     rbp
                retn

It gets a bit easier here after we get past the details of the opening 5 instructions. We perform an lea to load the pointer of an item into rax; for this example, it’s obviously a string. Then we store rax into [rbp+8], or location 28 in our stack diagram. The same goes for the next two instructions except it is loading the address of a different string and storing it at [rbp+10h]. The next four instructions are copying the contents of registers into other registers. This is where inlining has occurred. We know this because the two string pointers are loaded into rsi and rdi, the registers traditionally used as the index registers for string operations, exactly as if arguments were being prepared, yet no call instruction follows. The body of the function has been placed inline instead. We should make note of the mov rax, rsi instruction since that is preserving the original pointer address to unk1.

Labels in Disassembly

When reading a disassembly listing any time you notice a label such as loc_x it should be in the back of your mind that there is a conditional somewhere else in the code that references it. It could be used in an error condition, a loop, an if/else, a goto, etc.

As soon as we see the label loc_140001039, we need to make note of any nearby reference to it. There is one: the jnz loc_140001039 only 5 instructions away. This is indicative of a loop. Let’s look at the code that’s looping.

loc_140001039:
                mov     dl, [rdi]
                inc     rdi
                mov     [rsi], dl
                inc     rsi
                test    dl, dl
                jnz     short loc_140001039

Let’s make some notes about this sequence.

dl is the lowest byte of rdx
rdi contains the pointer to unk2[0] (the base of the string)
[rdi] accesses the contents the pointer addresses, so the first character in unk2

After reading these notes we can analyze what’s going on. If you’ve programmed in C or a similar language, you know that a character in a string is one byte in size. The sub-register dl is also one byte in size. The instruction mov dl, [rdi], therefore, reads the byte at the address held in rdi and copies it into dl. Only one byte is copied from that location. Next, rdi is incremented by 1, which means it now points to the next character in the string, since arrays are allocated contiguously in memory. The character in dl is then copied into the location pointed to by rsi, and rsi is incremented so that it points to the next character in its sequence. Then comes a test instruction, one of the instances where it shows itself. It performs a bitwise AND on the two operands, dl and dl, discards the result, and sets the flags accordingly. This is common to see in string looping sequences. test works here because the bitwise AND of a value with itself yields that same value, so the result is zero only when the character itself is zero. If the character is the null byte, the result of the test will be 0 and the zero flag will be set, which means the jnz branch will not be taken. Simply put: the loop has finished because it encountered a 0 byte.
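To make the test dl, dl / jnz pairing concrete, here is a minimal C sketch that models only the zero flag. The function names zf_after_test and jnz_taken are made up for illustration; a real CPU updates several other flags as well, which this sketch ignores.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the zero flag (ZF) after `test dl, dl`.
   The CPU ANDs dl with itself, throws the result away, and sets ZF
   when that result is zero. */
bool zf_after_test(uint8_t dl)
{
    uint8_t result = dl & dl;   /* AND of a value with itself is that value */
    return result == 0;         /* ZF is set only for the zero byte */
}

/* jnz ("jump if not zero") branches while ZF is clear. */
bool jnz_taken(uint8_t dl)
{
    return !zf_after_test(dl);
}
```

For any non-zero character jnz_taken returns true and the loop goes around again; for the null byte it returns false and execution falls through.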

We know that strings have a null terminator (null byte) appended to the end of their sequence, so this loops until the end of the source string is encountered. Once the loop ends, execution continues in a linear path through the rest of the excerpt. The operation performed on these two strings should be clear at this point. This is a string copy! An unsafe one at that, since it will copy until the end of the source string is hit, but what about the destination? If the destination buffer is smaller than the source string, the loop will keep overwriting data far beyond the end of that buffer.
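For those who find it easier to read C, the six instructions of the loop can be sketched roughly as follows. This covers only the loop and the mov rax, rsi that precedes it, not the whole excerpt. The function name copy_like_listing is hypothetical, and the variables are named after the registers purely for readability; it is not the original source.

```c
/* Hypothetical helper: a line-by-line C rendering of the loop.
   rsi is the destination pointer, rdi is the source pointer. */
char *copy_like_listing(char *rsi, const char *rdi)
{
    char *rax = rsi;      /* mov rax, rsi  : remember the destination start */
    char dl;

    do {
        dl = *rdi;        /* mov dl, [rdi] : read one byte from the source  */
        rdi++;            /* inc rdi                                        */
        *rsi = dl;        /* mov [rsi], dl : write it to the destination    */
        rsi++;            /* inc rsi                                        */
    } while (dl != 0);    /* test dl, dl / jnz short loc_140001039          */

    return rax;           /* the value later stored by mov [rbp+18h], rax   */
}
```

Calling copy_like_listing(buf, "hello") with a 16-byte buf leaves "hello" and its terminator in buf and returns buf. Notice that nothing in the loop knows how large the destination is.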

Unsafe Copy Operations

There is a reason that unsafe copy operations are flagged by many compilers. Unsafe copies like the one depicted above are frequently used in buffer overflow exploits. This one, in particular, could be weaponized to hijack the control flow of the program. This is another reason why keeping an eye out for sequences like this will help you when reverse engineering or building exploits.
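For contrast, a copy that is told the size of its destination cannot run past it. The sketch below is one possible way to write that; bounded_copy is a hypothetical helper, not a standard library function, and its return convention is an assumption made for this example.

```c
#include <stddef.h>

/* Hypothetical helper: copies at most dst_size - 1 bytes from src and
   always null-terminates dst. Returns 0 if the whole string fit, or -1
   if it was truncated or dst_size is 0. */
int bounded_copy(char *dst, size_t dst_size, const char *src)
{
    size_t i;

    if (dst_size == 0)
        return -1;          /* no room even for the terminator */

    /* Stop at the source terminator or one byte before the end of dst. */
    for (i = 0; i + 1 < dst_size && src[i] != '\0'; i++)
        dst[i] = src[i];

    dst[i] = '\0';
    return src[i] == '\0' ? 0 : -1;
}
```

In a disassembly, a copy like this usually shows up as a loop with two exit conditions: the familiar test against the null byte, plus a compare against a length or end pointer. Seeing only the first one is the hint that the copy is unbounded.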

At this point, I’d like to challenge you to apply what you’ve learned and convert the disassembly to pseudocode. The pseudocode that you should’ve constructed is available here, but I encourage you not to look until you’ve spent time attempting it yourself.

Conclusion

In this crash course on x64 assembly we have covered quite a lot, even with just simple examples. There’s no way to pack years of learning assembly, or all the tricks and nuances of instructions and examples, into a single article, but I hope this first part has helped build a solid foundation for you to begin learning assembly. The material that could go here could fill a book, and I intend to include as much as I can to make the foundation as solid as possible; however, this should not be seen as a one-stop shop for learning assembly. That being said, in the next part of Accelerated Assembly we will cover more advanced examples like bitmasking, bit rotation, string encryption, rolling encryption, and some examples that use a few instructions as anti-debugging mechanisms. We’ll tear down some built-out examples of authorization, encryption, and a game example.

As always, feel free to share any questions or feedback you may have! Thanks for reading!

Legal Notice: All of this information is intended for educational purposes only. I do not endorse using this knowledge for illegal activity.
