Techniques of binary code analysis: methods, problems, tools

Konstantin Panarin, developer in the low-level application analysis group, ptsecurity.ru

Original Russian title, translated: "Technologies for binary application code analysis: requirements, problems, tools", Konstantin Panarin



#whoami

• Konstantin Panarin, Positive Technologies, [email protected]

• Developer in the low-level application analysis group

AGENDA

• Purposes of binary analysis

• Overview of techniques and the problems that arise

• Overview of modern analysis tools

Purposes of binary analysis

• Error discovery

• Vulnerability discovery

• Searching for backdoors and undocumented features

• Recovery of program logic (reverse engineering)

• Test generation (more details later)

Specifics of binary analysis

• Almost complete lack of type information

• Intensive use of obfuscation and anti-debugging techniques: CFG flattening, virtual machines, dead code insertion

• Difficulty of gathering “meta information” of different kinds (for example, exception handlers)

• Semantic complexity of individual assembler instructions (especially in the x86 instruction set): XLAT, DIV, CMPXCHG and so on

Methods of binary analysis

Types of analysis:

• Static analysis

  • No program execution

• Dynamic analysis

  • One execution trace is analyzed per program run

• Combined analysis

Analysis techniques:

• Symbolic execution

  • As a general rule, used in static analysis

• Analysis of marked data (taint analysis)

  • As a general rule, used in dynamic analysis

• Fuzzing

  • Expected input data is replaced by randomly generated bytes

• And many others


In practice, analysis tools combine several techniques, because each method has its own limitations. Applying different approaches one after another helps to overcome those limitations partially, and sometimes completely.
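The fuzzing idea from the list of techniques (expected input replaced by randomly generated bytes) can be sketched as follows. This is only an illustration: `parse_record` is a hypothetical toy target with a planted bug, and `next_random` is a small deterministic generator chosen so that runs are reproducible. Neither comes from the talk.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical target: returns -1 where a real parser would overflow
   (declared payload length exceeds the bytes actually present). */
int parse_record(const uint8_t *buf, size_t len) {
    if (len < 2 || buf[0] != 0x7f)        /* magic byte check */
        return 0;
    if ((size_t)buf[1] > len - 2)         /* length field is too large */
        return -1;
    return 0;
}

/* Small linear congruential generator; deterministic for a given seed. */
uint32_t next_random(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;
}

/* Feeds `iterations` random 8-byte inputs to the target and returns
   how many of them triggered the planted bug. */
int fuzz(uint32_t seed, int iterations) {
    uint8_t buf[8];
    int crashes = 0;
    for (int i = 0; i < iterations; i++) {
        for (size_t j = 0; j < sizeof buf; j++)
            buf[j] = (uint8_t)next_random(&seed);
        if (parse_record(buf, sizeof buf) != 0)
            crashes++;
    }
    return crashes;
}
```

Purely random bytes pass the magic byte check only rarely, which is one reason fuzzing is usually combined with the other techniques in the list.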

Static vs dynamic analysis

Dynamic analysis:

• Run-time information is available: process memory map, addresses of indirect calls and so on

• Program execution may require a specific environment

• It’s not always possible to reproduce the results of a previously run analysis

Static analysis:

• Generally faster

• One analysis run can potentially cover an infinite number of execution paths

• Works even when some parts of the code or libraries are missing

• Unable to cope with obfuscation and encryption

Techniques – symbolic execution

• Key idea: concrete input data (e.g. function arguments) is replaced by symbolic values

• The analysis tool operates on symbolic expressions instead of their concrete counterparts

• Symbolic execution can, in principle, cover all execution paths in a single run

• Every execution path corresponds to a program “state”, which holds all the constraints on the symbolic variables created so far (path and value constraints)

• SMT solver: a tool designed to solve constraints on symbolic variables

Symbolic execution: example

Let’s find values of x and y that force the execution flow to reach the label ERROR.

int twice(int v) {
    return 2 * v;
}

void test(int x, int y) {
    int z = twice(y);
    if (x == z) {
        if (x > y + 10)
            ERROR;          /* the point whose reachability is analyzed */
    }
}

int main() {
    int x = read();         /* read() stands for any source of input */
    int y = read();
    test(x, y);
}

Taken from http://www.srl.inf.ethz.ch/pa2015/Lecture8.pdf


Initial state, after both inputs are read:

Value constraints: x -> x0, y -> y0

Path constraints: true


After z = twice(y):

Value constraints: x -> x0, y -> y0, z -> 2*y0

Path constraints: true


The conditional branch if (x == z) creates two different states.

State a (branch taken):

Value constraints: x -> x0, y -> y0, z -> 2*y0

Path constraints: x0 = 2*y0

State b (branch not taken):

Value constraints: x -> x0, y -> y0, z -> 2*y0

Path constraints: x0 != 2*y0


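Inside the first branch, the condition if (x > y + 10) splits the state again.

Value constraints (both states): x -> x0, y -> y0, z -> 2*y0

Path constraints, branch taken: x0 = 2*y0 ∧ x0 > y0 + 10

Path constraints, branch not taken: x0 = 2*y0 ∧ x0 <= y0 + 10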


Reachability condition of label ERROR:

Value constraints: x -> x0, y -> y0, z -> 2*y0

Path constraints: x0 = 2*y0 ∧ x0 > y0 + 10


An SMT solver returns a satisfying assignment for this condition, for example:

x0 = 40, y0 = 20
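The last step of the example can be reproduced with a minimal sketch. It assumes a brute-force search over a small range as a stand-in for a real SMT solver (the talk names Z3, STP and Boolector; none of them is used here). The function names `reaches_error` and `solve_by_search` are illustrative.

```c
#include <stdbool.h>

/* Path constraint of the state that reaches ERROR:
   x0 == 2*y0  and  x0 > y0 + 10. */
bool reaches_error(int x0, int y0) {
    return x0 == 2 * y0 && x0 > y0 + 10;
}

/* Stand-in for an SMT solver: exhaustive search over [-limit, limit].
   On success stores the first model found in *x0, *y0 and returns true. */
bool solve_by_search(int limit, int *x0, int *y0) {
    for (int y = -limit; y <= limit; y++) {
        for (int x = -limit; x <= limit; x++) {
            if (reaches_error(x, y)) {
                *x0 = x;
                *y0 = y;
                return true;
            }
        }
    }
    return false;
}
```

With a limit of 100 this search stops at x0 = 22, y0 = 11, while the slide shows x0 = 40, y0 = 20. Both satisfy the constraint: a solver is only required to return some model, not a particular one.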

Symbolic execution: applications

• Symbolic execution was originally used for test generation (in the 1970s):

  • Input values of the tested program are marked as symbolic

  • If an error or vulnerability is found at some point in the executable, the reachability constraints for that point are solved

  • The resulting conditions on the input data form the required test

• This scheme is widely used in different verification systems

Symbolic execution: general scheme

• Translator into IR: each assembler instruction is translated into a set of IR instructions. Example of translation:

  x86: mov eax, ecx

  IR:
  STR R_ECX:32, , V_00:32
  STR V_00:32, , R_EAX:32

• Pool of states (one state per execution path): State №1, State №2, State №…, State №500. Each state holds the following data:

  • Current IP (instruction pointer)

  • Symbolic context (registers, memory cells)

  • Constraints

• Searcher: selects a state from the pool

• Executor (director): processes the selected state

• Interpreter: contains handlers for each IR instruction

• Conditional branch with condition X: two new states are created and added into the pool

  • New state a: Constraints += X

  • New state b: Constraints += ~X

• If some “interesting” point is reached, check its reachability: extract the path constraints from the state and solve the corresponding SMT task

• SMT solvers: Z3, STP, Boolector
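The branch-handling step of the scheme (one state in, two states out, both added to the pool) could look roughly like this. It is a simplified sketch: the symbolic context (registers, memory cells) is left out, constraints are reduced to an identifier plus a negation flag, and all type and function names are assumptions made for illustration.

```c
#include <stddef.h>

#define MAX_CONSTRAINTS 8
#define MAX_STATES 16

/* One branch condition; `negated` marks the ~X side. */
typedef struct {
    int condition_id;
    int negated;
} Constraint;

/* One state per execution path (symbolic context omitted for brevity). */
typedef struct {
    unsigned ip;                              /* current instruction pointer */
    Constraint constraints[MAX_CONSTRAINTS];  /* path constraints so far */
    size_t n_constraints;
} State;

typedef struct {
    State states[MAX_STATES];
    size_t count;
} Pool;

/* Forks `s` at a conditional branch with condition X.
   State a gets X and continues at taken_ip,
   state b gets ~X and continues at fallthrough_ip.
   Returns 0 on success, -1 if the pool or the constraint list is full. */
int fork_state(Pool *pool, const State *s, int condition_id,
               unsigned taken_ip, unsigned fallthrough_ip) {
    if (pool->count + 2 > MAX_STATES || s->n_constraints >= MAX_CONSTRAINTS)
        return -1;
    State a = *s, b = *s;
    a.ip = taken_ip;
    a.constraints[a.n_constraints++] = (Constraint){condition_id, 0};
    b.ip = fallthrough_ip;
    b.constraints[b.n_constraints++] = (Constraint){condition_id, 1};
    pool->states[pool->count++] = a;
    pool->states[pool->count++] = b;
    return 0;
}
```

Because every symbolic branch adds a state to the pool, the fixed-size pool in this sketch fills up quickly, which is the path explosion problem in miniature.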

Symbolic execution – existing problems

• path explosion (how to generate a smaller number of states?)

• loop unrolling (how to process loops whose exit condition depends on a symbolic variable?)

• symbolic pointers (how to handle load and store operations whose address is itself symbolic?)

• constraint complexity (some generated constraints are too hard for any SMT solver to solve exactly)

• external resources (how to process file handles and other references to external objects?)

Symbolic execution – possible solutions

• path explosion: merge several states into a bigger one (but when and how?)

• path explosion: process several states simultaneously (parallel symbolic execution)

• loop unrolling, symbolic pointers: use specific SMT logics (how effective is it?)

• external resources: use a DSL to describe external calls in terms of the solver’s expressions

Analysis of marked data (taint analysis)

• Purely dynamic analysis method

• Links the execution trace to the data the program processes while it runs

• Answers the question of how the program processes certain pieces of input data

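Taint analysis: basic idea

• Main concepts: shadow memory and taint propagation.

• Shadow memory: stores a taint mark for every register and memory cell of the analyzed program

• Taint propagation: the rules by which an instruction transfers taint marks from its source operands to its destination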

Taint propagation: examples

mov eax, tainted_input
xor eax, eax                ; eax is UNTAINTED
-----------------------------------------
push tainted_input
pop eax                     ; eax is TAINTED
-----------------------------------------
xor eax, eax
cmp eax, tainted_input      ; AF, CF, OF, PF, SF, ZF are TAINTED
-----------------------------------------
mov eax, tainted_input
mov ecx, untainted_input
add ecx, eax                ; ecx is TAINTED
-----------------------------------------
mov eax, tainted_input
mov ecx, untainted_input
mov ax, cx                  ; ax is UNTAINTED, eax is TAINTED

Taken from http://defcon.org.ua/data/1/4_Oleksyk_Code_Analysis.pdf
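Three of the propagation rules in the examples above (xor of a register with itself, add, and a 16-bit mov into a tainted 32-bit register) can be written down as a small sketch. It assumes byte-granular taint marks, one bit per register byte. The register set, type and function names are illustrative and not taken from any particular tool.

```c
#include <stdint.h>

enum { EAX, ECX, NUM_REGS };

#define TAINT_NONE  0x0u
#define TAINT_ALL   0xFu   /* all four bytes of a 32-bit register */
#define TAINT_LOW16 0x3u   /* the two bytes that form ax / cx */

/* Shadow state: one taint bit per register byte (bit 0 = lowest byte). */
typedef struct {
    uint8_t reg_taint[NUM_REGS];
} TaintContext;

/* mov r32, <source with taint mask src>: destination takes the source taint. */
void taint_mov32(TaintContext *c, int dst, uint8_t src) {
    c->reg_taint[dst] = src;
}

/* mov r16, r16: only the low two bytes of the destination are replaced. */
void taint_mov16(TaintContext *c, int dst, int src) {
    c->reg_taint[dst] = (uint8_t)((c->reg_taint[dst] & ~TAINT_LOW16) |
                                  (c->reg_taint[src] & TAINT_LOW16));
}

/* add dst, src: conservatively, any tainted input byte taints the whole result. */
void taint_add32(TaintContext *c, int dst, int src) {
    uint8_t merged = c->reg_taint[dst] | c->reg_taint[src];
    c->reg_taint[dst] = merged ? TAINT_ALL : TAINT_NONE;
}

/* xor dst, src: xor of a register with itself always yields zero,
   so the result no longer depends on the input. */
void taint_xor32(TaintContext *c, int dst, int src) {
    if (dst == src)
        c->reg_taint[dst] = TAINT_NONE;
    else
        taint_add32(c, dst, src);
}
```

After `taint_mov16(&c, EAX, ECX)` with a fully tainted EAX and a clean ECX, the mask of EAX becomes 0xC: the ax part is clean while the upper two bytes stay tainted, which matches the last example.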

Taint analysis: scheme

• Program code:

  push ebp
  mov ebp, esp
  lea eax, [esp+8]
  ret

• Runtime analysis of machine instructions; instruction under analysis: add eax, [esp+8]

• Instruction handler: syntax parsing, extraction of the instruction’s operands, address resolution (for memory operands)

• Operands: dest: eax; src: eax, 0x7f2300

• Taint context (SHADOW MEMORY): EAX: not tainted, EBX: not tainted, ECX: tainted, EDI: tainted

• Context reading: eax – not tainted, 0x7f2300 – tainted

• Taint propagation

• Context writing: eax – tainted
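The three stages of the scheme (context reading, propagation, context writing) for the instruction add eax, [mem] can be sketched as below. The shadow memory here is a plain list of tainted byte addresses, which is an assumption made to keep the example short, and the handler expects the memory operand to be already resolved to an address. All names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TAINTED 64

/* Sparse shadow memory: a list of tainted byte addresses,
   plus the taint mark of the single register used in this example. */
typedef struct {
    uint32_t addrs[MAX_TAINTED];
    size_t count;
    bool eax_tainted;
} Shadow;

/* Marks one byte address as tainted; returns false if the table is full. */
bool taint_memory(Shadow *s, uint32_t addr) {
    if (s->count == MAX_TAINTED)
        return false;
    s->addrs[s->count++] = addr;
    return true;
}

/* True if any byte in [addr, addr + size) is tainted. */
bool mem_is_tainted(const Shadow *s, uint32_t addr, size_t size) {
    for (size_t i = 0; i < s->count; i++) {
        if (s->addrs[i] >= addr && s->addrs[i] - addr < size)
            return true;
    }
    return false;
}

/* Handler for "add eax, dword ptr [addr]" after address resolution. */
void handle_add_eax_mem32(Shadow *s, uint32_t addr) {
    bool src_reg = s->eax_tainted;               /* context reading: eax */
    bool src_mem = mem_is_tainted(s, addr, 4);   /* context reading: memory */
    s->eax_tainted = src_reg || src_mem;         /* context writing: eax */
}
```

With 0x7f2300 marked as tainted and eax clean, the handler leaves eax tainted, as in the scheme. A production tool would more likely use a page-indexed bitmap than a linear list.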

Taint analysis

Usage of taint analysis:

• A tainted EIP suggests the possibility of control flow hijacking (for example, as a result of a stack or heap overflow).

• Tainted arguments of particular functions (the printf family, system) suggest a possible vulnerability.

• Tainted resources (handles, mutexes) that should not depend directly on user input suggest a possible logic error in the program.

Drawbacks:

• Requires a detailed model of every assembler instruction, which may be tedious for some architectures (x86).

• Ideally, taint analysis should instrument all code executed by the operating system (both in user mode and in kernel mode), which is not always possible.

Combined analysis aka concolic execution

Concrete + symbolic = concolic:

• Create snapshots of the entire process at chosen control points

• Instrument the concrete execution trace and at the same time fill a queue of symbolic constraints: for each conditional branch on the trace, push its constraints into the queue

• Roll back to the previous control point, select a new symbolic condition from the queue, solve the corresponding SMT task and substitute the solution into the concrete execution context (registers and memory areas)

• Instrument the new trace with the new parameters
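The loop described above can be sketched on the test(x, y) function from the symbolic execution example. This is a heavy simplification and should be read as such: instead of process snapshots and rollback the program is simply re-run from the start, only the last branch of each trace is flipped instead of keeping a queue, and a brute-force search replaces the SMT solver. All names are illustrative.

```c
#include <stdbool.h>

#define MAX_BRANCHES 4

/* Trace of one concrete run: outcomes of the branches that were evaluated. */
typedef struct {
    int n;
    bool taken[MAX_BRANCHES];
} Trace;

/* The two branch conditions of test(), as predicates over the inputs. */
bool cond(int index, int x, int y) {
    return index == 0 ? (x == 2 * y) : (x > y + 10);
}

/* Instrumented concrete run of test(); returns true if ERROR is reached. */
bool run_concrete(int x, int y, Trace *t) {
    t->n = 0;
    t->taken[t->n++] = cond(0, x, y);
    if (!t->taken[0])
        return false;
    t->taken[t->n++] = cond(1, x, y);
    return t->taken[1];
}

/* Finds inputs that follow trace `t` up to branch `flip` and take the
   opposite side there. Exhaustive search stands in for an SMT solver. */
bool solve_flipped(const Trace *t, int flip, int limit, int *x, int *y) {
    for (int b = -limit; b <= limit; b++) {
        for (int a = -limit; a <= limit; a++) {
            bool ok = true;
            for (int i = 0; i < flip && ok; i++)
                ok = cond(i, a, b) == t->taken[i];
            if (ok && cond(flip, a, b) != t->taken[flip]) {
                *x = a;
                *y = b;
                return true;
            }
        }
    }
    return false;
}

/* Runs concretely, flips the last branch of the trace, and repeats.
   Returns true and stores the inputs if ERROR is reached within max_runs. */
bool concolic_search(int x, int y, int limit, int max_runs, int *ex, int *ey) {
    Trace t;
    for (int run = 0; run < max_runs; run++) {
        if (run_concrete(x, y, &t)) {
            *ex = x;
            *ey = y;
            return true;
        }
        if (!solve_flipped(&t, t.n - 1, limit, &x, &y))
            return false;
    }
    return false;
}
```

Starting from x = 1, y = 5 with a search limit of 50, the first run fails the x == 2*y check, the second run passes it but fails x > y + 10, and the third run reaches ERROR.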

Existing tools (open source)

KLEE

• Created as a tool for generating high-coverage tests

• Based on LLVM IR

• Uses symbolic execution

• Automatic test generation

Triton

• Uses concolic execution

• Directly converts asm instructions into the solver’s (Z3) expressions (bypassing an internal representation)

Other: FuzzBall, BitBlaze

• None of the tools is of production quality

• Every tool is focused on solving one particular task

Existing tools (closed source)

MAYHEM

• Designed to automatically search for vulnerabilities and generate exploits

• Able to work with symbolic pointers

• Winner of the DARPA Cyber Grand Challenge in 2016

CodeSurfer, VeraCode

• Little is known about their inner structure

Conclusion

• Methods of binary analysis still require a lot of careful research

• At present, no universal binary analysis tool exists

• Every tool is focused on solving one particular task

• Positive Technologies is working on its own tool. STAY TUNED!

Thank you for your attention!
