Assembly language, which dates back to the 1950s, was once the most powerful programming language known. I am not saying that it has lost its power; rather, it has evolved so much that every microprocessor in the computing world still uses one of its dialects. Let me take you through some basics before we dig into its role in malware analysis. First of all, this blog is not intended to teach you how to write assembly, but I will tell you enough to understand what existing assembly code does. Let's start.
Assembly language is a low-level programming language. It is the closest language to what the machine understands, i.e., binary machine code. A program written in assembly consists mainly of instructions, data, comments and pseudo-operations (directives to the assembler). An instruction usually contains a mnemonic (like ADD, MOV, etc.) followed by zero or more operands. Mnemonics are the words that identify the operation to execute, such as MOV to move data. A simple example of such an instruction is:
MOV A, 0x02
The above instruction copies the byte value 0x02 (the source operand) into A (the destination operand). Unlike ADD, MOV does not combine the value with whatever A held before; it simply overwrites it. See, that is simple. The assembler is the tool that transforms the above code into machine code. It creates the object code by translating the mnemonics and operands contained in the instructions into their numerical equivalents.
Each instruction in assembly language corresponds to an opcode (operation code), which tells the CPU exactly what to do. For example, the instruction MOV ecx, 0x29 is encoded as the bytes B9 29 00 00 00, where B9 is the opcode for "move a 32-bit immediate into ecx" and the remaining four bytes are the value, as shown in the following picture:
In this case, the machine architecture is x86, which is little-endian. So the 32-bit value 0x29 is stored least significant byte first, as 29 00 00 00. To know more about endianness, click here: Endianness.
Operands are the parts of an instruction that refer to the data the operation works on. An operand can be one of three types:
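To make the encoding concrete, here is a minimal Python sketch that builds the bytes for a "MOV r32, imm32" instruction. The function name encode_mov_reg_imm32 and the REG32 table are my own for illustration. It only covers this one instruction form (opcode B8 plus the register number, followed by a little-endian 32-bit value), not a full assembler.

```python
import struct

# Register numbers used by the x86 "MOV r32, imm32" encoding (opcode B8 + number).
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def encode_mov_reg_imm32(reg, value):
    """Encode 'MOV reg, value' for a 32-bit register and a 32-bit immediate."""
    opcode = 0xB8 + REG32[reg.lower()]
    # "<I" packs an unsigned 32-bit integer in little-endian byte order.
    return bytes([opcode]) + struct.pack("<I", value & 0xFFFFFFFF)


encoded = encode_mov_reg_imm32("ecx", 0x29)
print(" ".join(f"{b:02X}" for b in encoded))  # B9 29 00 00 00
```

The same bytes appear in the example above: B9 selects ecx, and 29 00 00 00 is 0x29 with its least significant byte first.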
- Immediate: a fixed value, like 0x29 in the instruction mentioned above.
- Register: a register in the CPU, like ecx in the example mentioned above.
- Memory Address: a memory address that contains the value of interest.
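The three operand types can be told apart mechanically. The sketch below is only an illustration: classify_operand is a hypothetical helper, and it assumes Intel-style syntax, where a memory operand is written in square brackets such as [ebx+4]. Real assemblers accept many more forms than this.

```python
# 32-bit general-purpose register names; a deliberately small, illustrative set.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "esp", "ebp"}


def classify_operand(operand):
    """Return 'register', 'memory' or 'immediate' for a simple operand string."""
    text = operand.strip().lower()
    if text in REGISTERS:
        return "register"
    if text.startswith("[") and text.endswith("]"):
        return "memory"
    try:
        int(text, 0)  # base 0 accepts both decimal and 0x-prefixed hex
    except ValueError:
        raise ValueError(f"unrecognized operand: {operand!r}")
    return "immediate"


for op in ("ecx", "0x29", "[ebx+4]"):
    print(op, "->", classify_operand(op))
```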
Most Commonly Used Arithmetic Instructions
The most commonly used arithmetic instructions in assembly language are:
Now, let's talk about some specifics.
x86 Assembly Language
x86 assembly language is a family of backward-compatible assembly languages used to produce object code for the x86 class of processors, which began with the Intel 8086. It uses an instruction format similar to other assembly languages, consisting of mnemonics, registers, addressing modes, etc. that the CPU can understand and follow.
Many compilers use it as an intermediate language while translating a high-level language to machine language. We are going to focus on x86 assembly because most malware is compiled for the widely used x86 architecture, including your own 32-bit Windows machine.
Here is a cheat sheet giving a brief description of almost all the mnemonics used in x86 assembly language: Code Table.
You will see a lot of instructions with registers as operands. You must learn their names and uses to understand a complete instruction. The following table gives a brief overview:
Each general-purpose register is 32 bits wide. The leading E in a name such as EAX stands for "extended": EAX is the 32-bit extension of the 16-bit AX register. These registers are typically used to store data or memory addresses. The basic function of each of them is as follows:
- AX – multiply/divide, string load & store, generally contains the return value of function calls.
- CX – count for string operations & shifts
- DX – port address for IN and OUT, sometimes may be used like AX as well
- BX – base register for memory addressing
- SP – points to top of stack
- BP – points to base of stack frame
- SI – points to the source in string operations
- DI – points to the destination in string operations
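The "extended" naming is easier to see with numbers. The following Python sketch treats EAX as a plain 32-bit integer and pulls out the smaller registers that overlap it: AX (the low 16 bits), AH (the high byte of AX) and AL (the low byte of AX). The function name split_eax is mine, and the value 0x12345678 is just a sample.

```python
def split_eax(eax):
    """Return the AX, AH and AL views of a 32-bit EAX value."""
    eax &= 0xFFFFFFFF      # keep only 32 bits
    ax = eax & 0xFFFF      # low 16 bits of EAX
    ah = (ax >> 8) & 0xFF  # high byte of AX
    al = ax & 0xFF         # low byte of AX
    return ax, ah, al


ax, ah, al = split_eax(0x12345678)
print(f"AX={ax:#06x} AH={ah:#04x} AL={al:#04x}")  # AX=0x5678 AH=0x56 AL=0x78
```

The other general registers with letter names (EBX, ECX, EDX) split the same way.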
Segment registers help determine where the 64K segments start in memory. To know more about segmentation in x86 architecture, follow this link: Memory Segmentation.
The status flags live in the 32-bit EFLAGS register. As the name suggests, they control CPU operations or indicate the results of CPU instructions.
- ZF – stands for the zero flag. When the result of an operation is zero, it is set.
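- CF – stands for the carry flag. It is set when the result of an unsigned operation is too large or too small for the destination operand, i.e. there is a carry or borrow out of the most significant bit.
- SF – stands for the sign flag. It is set when the result of an operation is negative.
- TF – stands for the trap flag. When it is set, the CPU executes one instruction at a time (single-stepping), which is used for debugging.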
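To see how ZF, CF and SF react to a result, here is a small Python sketch that mimics a 32-bit ADD. It is a simplified model, not an emulator: add32 is a hypothetical helper, and it computes only these three flags, leaving out the others that a real ADD also updates.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b):
    """Add two 32-bit values and report the ZF, CF and SF flags."""
    full = (a & MASK32) + (b & MASK32)  # may need more than 32 bits
    result = full & MASK32              # what actually fits in the register
    flags = {
        "ZF": result == 0,                # result is zero
        "CF": full > MASK32,              # carry out of bit 31
        "SF": bool(result & 0x80000000),  # top bit set, negative if signed
    }
    return result, flags


print(add32(0xFFFFFFFF, 1))  # wraps to 0: ZF and CF set
print(add32(0x7FFFFFFF, 1))  # 0x80000000: SF set
```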
In the x86 architecture, EIP, the instruction pointer, contains the address of the next instruction to be executed by the CPU. In simple words, it tells the CPU which command comes next.
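A toy model helps show how EIP moves. The Python sketch below steps through a nine-byte program that I made up for this purpose (MOV ecx, 2; DEC ecx; JNZ back to the DEC; RET) and records the value of EIP before each instruction. The run function is hypothetical and understands only these four opcodes. Offsets start at 0 rather than at a real load address.

```python
import struct


def run(code, max_steps=100):
    """Execute a tiny subset of x86 and return (EIP trace, final ECX)."""
    eip, ecx, zf = 0, 0, False
    trace = []
    for _ in range(max_steps):
        trace.append(eip)
        opcode = code[eip]
        if opcode == 0xB9:    # MOV ecx, imm32 (5 bytes)
            ecx = struct.unpack_from("<I", code, eip + 1)[0]
            eip += 5
        elif opcode == 0x49:  # DEC ecx (1 byte), updates ZF
            ecx = (ecx - 1) & 0xFFFFFFFF
            zf = ecx == 0
            eip += 1
        elif opcode == 0x75:  # JNZ rel8 (2 bytes)
            rel = struct.unpack_from("<b", code, eip + 1)[0]
            eip += 2          # EIP first moves past the jump itself
            if not zf:
                eip += rel    # then the signed offset is applied
        elif opcode == 0xC3:  # RET: stop the toy program
            break
        else:
            raise ValueError(f"unsupported opcode {opcode:#04x} at {eip}")
    return trace, ecx


program = bytes.fromhex("B9 02 00 00 00 49 75 FD C3")
print(run(program))  # ([0, 5, 6, 5, 6, 8], 0)
```

EIP normally just advances by the length of each instruction, and a taken jump is nothing more than a different value written into it.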
Why do we need to learn it?
The simple answer would be: to read disassembly. When basic static and dynamic analysis fail to give you the whole story behind a piece of malware, and you still have the urge to break it into pieces to understand how it was created and how destructive it can be, well, you have to learn it.
The picture below shows how malware authors use a high-level language to create their malware. The code is then compiled into machine code, the form that the CPU understands. Recovering the original high-level source from that machine code is almost impossible for a malware analyst. However, converting it to a human-readable form is still possible with a disassembler. One of the most popular disassemblers is IDA Pro.
If you have gone through the mnemonics cheat sheet and understood the use of each register mentioned above, then you can work out what the disassembled code is actually trying to do. Now, all you need is some hands-on practice. Go, find some interesting executables lying in your "Software and Tools" folder and open them one by one in the various disassemblers available online. Get familiar with those instructions, mnemonics and registers. Once you are ready, we'll start our next walk, into malware debugging.
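To give a feel for what a disassembler does, here is a deliberately tiny Python sketch that turns a few x86 opcodes back into mnemonics. It is nowhere near a real tool such as IDA Pro. The disassemble function is my own, it assumes 32-bit code, and it knows only NOP, RET, PUSH r32, POP r32 and MOV r32, imm32. Any other byte is printed as raw data.

```python
import struct

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def disassemble(code):
    """Linear sweep over code, returning a list of (offset, text) pairs."""
    offset, lines = 0, []
    while offset < len(code):
        op = code[offset]
        size = 1
        if op == 0x90:
            text = "nop"
        elif op == 0xC3:
            text = "ret"
        elif 0x50 <= op <= 0x57:
            text = f"push {REG32[op - 0x50]}"
        elif 0x58 <= op <= 0x5F:
            text = f"pop {REG32[op - 0x58]}"
        elif 0xB8 <= op <= 0xBF and offset + 5 <= len(code):
            imm = struct.unpack_from("<I", code, offset + 1)[0]
            text = f"mov {REG32[op - 0xB8]}, 0x{imm:x}"
            size = 5
        else:
            text = f"db 0x{op:02x}"  # unknown or truncated: show the raw byte
        lines.append((offset, text))
        offset += size
    return lines


for off, text in disassemble(bytes.fromhex("55 B9 29 00 00 00 5D C3")):
    print(f"{off:04x}  {text}")
```

A real disassembler handles hundreds of opcodes, prefixes and addressing modes, but the basic loop is the same: read a byte, work out the instruction and its length, then move on to the next one.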
That’s all for now. I will come back soon with more as I always do.