ZHENG LI ( 李征 ) [email protected] Program Analysis (软件源代码分析技术)

软件源代码分析技术collectively known as the front end of the compiler. The second part, or back end starts by generating low level code from the (possibly optimized) AST ...cist.buct.edu.cn/staff/zheng/COMP544-PA/2-Lexical.pdf ·


Lexical and Syntax Analysis


Topics Covered Today

▪ Compilation

▪ Lexical Analysis

▪ Semantic Analysis


Compilation

• Translating from a high-level language to machine code is organized into several phases or passes.

• In the early days, passes communicated through files, but this is no longer necessary.


Language Specification

• We must first describe the language in question by giving its specification.

• Syntax:
    • Defines symbols (vocabulary)
    • Defines programs (sentences)

• Semantics:
    • Gives meaning to sentences.

• The formal specifications are often the input to tools that build translators automatically.


Compiler passes

[Figure: compiler passes. Source program -> lexical scanner -> parser -> semantic analyzer (front end) -> translator -> optimizer -> final assembly (back end) -> target program. The symbol table manager and the error handler are connected to the passes.]


Symbol Table Management

▪ The symbol table is a data structure used by all phases of the compiler to keep track of user-defined symbols and keywords.

▪ During early phases (lexical and syntax analysis), symbols are discovered and put into the symbol table.

▪ During later phases, symbols are looked up to validate their usage.
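The insert-then-look-up life cycle described above can be sketched in a few lines of Python. This is only an illustration: the class name `SymbolTable`, its methods, and the attribute names (`kind`, `type`) are assumptions made for this example, not something the slides prescribe.

```python
class SymbolTable:
    """A minimal symbol table: maps a name to a dict of attributes."""

    def __init__(self, keywords=()):
        self._entries = {}
        # Keywords are pre-loaded so they can be told apart from identifiers.
        for word in keywords:
            self._entries[word] = {"kind": "keyword"}

    def insert(self, name, **attributes):
        """Early phases: record a symbol, or add attributes to an existing one."""
        entry = self._entries.setdefault(name, {"kind": "identifier"})
        entry.update(attributes)
        return entry

    def lookup(self, name):
        """Later phases: return the entry, or None if the symbol is unknown."""
        return self._entries.get(name)


table = SymbolTable(keywords=("if", "then", "else"))
table.insert("count")               # discovered during lexical analysis
table.insert("count", type="int")   # refined by a later phase
print(table.lookup("count"))
print(table.lookup("missing"))
```

A later phase that gets `None` back from `lookup` has found a use of an undeclared symbol, which is one way usage can be validated.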


Symbol Tables

Regular Expression    Token    Attribute-Value
ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE

Note: Each token has a unique token identifier that defines its category of lexemes.
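As a rough sketch of how a scanner could return the (token, attribute-value) pairs listed in the table, the following Python uses the standard `re` module. The function name `scan`, the group names, and the use of a list index as the "pointer to table entry" are assumptions made for this example.

```python
import re

RELOP_ATTRIBUTE = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}
KEYWORDS = {"if", "then", "else"}

# Longer operators are listed first so that "<=" is not split into "<" and "=".
TOKEN_RE = re.compile(r"""
    (?P<ws>\s+)
  | (?P<word>[A-Za-z][A-Za-z0-9]*)
  | (?P<num>[0-9]+)
  | (?P<relop><=|<>|>=|<|=|>)
""", re.VERBOSE)


def scan(text):
    """Return (pairs, symbols): token/attribute pairs and a stand-in symbol table."""
    symbols = []  # the index into this list plays the role of the pointer
    pairs = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"invalid character {text[pos]!r} at position {pos}")
        pos = match.end()
        kind, lexeme = match.lastgroup, match.group()
        if kind == "ws":
            continue  # whitespace yields no token
        if kind == "word" and lexeme in KEYWORDS:
            pairs.append((lexeme, None))  # keywords carry no attribute
        elif kind == "relop":
            pairs.append(("relop", RELOP_ATTRIBUTE[lexeme]))
        else:
            if lexeme not in symbols:
                symbols.append(lexeme)
            token = "id" if kind == "word" else "num"
            pairs.append((token, symbols.index(lexeme)))
    return pairs, symbols


pairs, symbols = scan("if x <= 10 then y")
print(pairs)
print(symbols)
```

All six relational operators share the single token `relop`; only the attribute value tells them apart, which keeps the grammar seen by the parser small.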


Error Management

▪ Errors can occur at all phases in the compiler.

▪ Examples include invalid input characters, syntax errors, and semantic errors.

▪ Good compilers attempt to recover from errors and continue.


Lexical analyzer

▪ Also called a scanner or tokenizer

▪ Converts a stream of characters into a stream of tokens

▪ Tokens are:
    Keywords such as for, while, and class.
    Special characters such as +, -, (, and <
    Variable name occurrences
    Constant occurrences such as 1, 0, true.


Lexical analyzer

▪ The lexical analyzer is usually a subroutine of the parser.

▪ Each token is a single entity. A numerical code is usually assigned to each type of token.


▪ Lexical analyzers perform:
    Line reconstruction
        delete comments
        delete whitespace
        perform text substitution
    Lexical translation: translation of lexemes -> tokens
        Often additional information is associated with a token.
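The line-reconstruction steps listed above can be illustrated with a small Python sketch. The `//` comment syntax, the `macros` dictionary, and the function name `reconstruct_line` are assumptions chosen for this example; a real scanner would also have to respect string literals, which this sketch ignores.

```python
import re


def reconstruct_line(line, macros):
    """Delete a trailing comment, drop whitespace, and apply text substitution."""
    line = re.sub(r"//.*", "", line)  # delete comment (assumed // syntax)
    # Splitting into lexemes discards the whitespace between them.
    lexemes = re.findall(r"[A-Za-z_][A-Za-z0-9_]*|[0-9]+|\S", line)
    # Text substitution: replace each macro name by its definition.
    return [macros.get(lexeme, lexeme) for lexeme in lexemes]


print(reconstruct_line("x = MAX + 1   // bump the limit", {"MAX": "100"}))
```

The result is a list of lexemes, ready for the lexical translation step that maps each one to a token.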


Token Definitions

letter -> A | B | C | … | Z | a | b | … | z

digit -> 0 | 1 | 2 | … | 9

id -> letter ( letter | digit )*

Shorthand Notation:

“+” : one or more      r* = r+ | ε   and   r+ = r r*

“?” : zero or one      r? = r | ε

[range] : set range of characters (replaces “|”)

[A-Z] = A | B | C | … | Z

id -> [A-Za-z][A-Za-z0-9]*
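The same definitions can be tried out with Python's `re` module, whose syntax supports the `*`, `+`, `?`, and `[range]` shorthands. The names `ID_LONGHAND`, `ID`, and `is_id` are introduced here only for illustration.

```python
import re

# Longhand form, written with explicit alternation as in the first definition of id.
ID_LONGHAND = re.compile(r"(?:[A-Z]|[a-z])(?:[A-Z]|[a-z]|[0-9])*")

# Shorthand form using character ranges.
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def is_id(text):
    """True if the whole string is an identifier under the definition above."""
    return ID.fullmatch(text) is not None


for candidate in ["oldsum", "x1", "1x", ""]:
    # Both forms describe the same language, so they always agree.
    assert is_id(candidate) == (ID_LONGHAND.fullmatch(candidate) is not None)
    print(repr(candidate), is_id(candidate))

# r+ = r r*: both patterns accept exactly the non-empty digit strings.
assert bool(re.fullmatch(r"[0-9]+", "2024")) == bool(re.fullmatch(r"[0-9][0-9]*", "2024"))
```

`fullmatch` is used rather than `match` so that a string such as `x1!` is rejected instead of being accepted on its prefix.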


Example: extracting lexemes and producing the corresponding tokens.

sum = oldsum - value / 100;

Token          Lexeme
IDENT          sum
ASSIGN_OP      =
IDENT          oldsum
SUBTRACT_OP    -
IDENT          value
DIVISION_OP    /
INT_LIT        100
SEMICOLON      ;
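A minimal tokenizer that produces this token/lexeme listing could look as follows. It is a sketch built on Python's `re` module; the token names come from the example, while `TOKEN_SPEC`, `MASTER`, `SKIP`, and `tokenize` are names assumed for this illustration.

```python
import re

# One (token name, regular expression) pair per token category.
TOKEN_SPEC = [
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),
    ("INT_LIT", r"[0-9]+"),
    ("ASSIGN_OP", r"="),
    ("SUBTRACT_OP", r"-"),
    ("DIVISION_OP", r"/"),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),  # whitespace, discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token, lexeme) pairs for the given source text."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"invalid character {source[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for token, lexeme in tokenize("sum = oldsum - value / 100;"):
    print(f"{token:<12} {lexeme}")
```

An unexpected character raises `ValueError`; a production scanner would more likely report the error and continue.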


Parser

▪ Performs syntax analysis

▪ Imposes syntactic structure on a sentence.

▪ Parse trees are used to expose the structure.
    These trees are often not explicitly built.
    Simpler representations of them are often used.

▪ A parser accepts a string of tokens and builds a parse tree representing the program.


Parser

▪ The collection of all the programs in a given language is usually specified using a list of rules known as a context-free grammar.


▪ A grammar has four components:
    A set of tokens known as terminal symbols
    A set of variables or non-terminals
    A set of productions, where each production consists of a non-terminal, an arrow, and a sequence of tokens and/or non-terminals
    A designation of one of the non-terminals as the start symbol
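The four components map directly onto a small data structure. The sketch below is one possible encoding in Python; the class name `Grammar`, the method `is_well_formed`, and the tiny expression grammar are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # tokens
    nonterminals: frozenset   # variables
    productions: tuple        # pairs (head, body); body is a tuple of symbols
    start: str                # designated start symbol

    def is_well_formed(self):
        """Check that the four components are consistent with each other."""
        if self.start not in self.nonterminals:
            return False
        if self.terminals & self.nonterminals:
            return False  # a symbol cannot be both a terminal and a non-terminal
        symbols = self.terminals | self.nonterminals
        return all(
            head in self.nonterminals and all(symbol in symbols for symbol in body)
            for head, body in self.productions
        )


# Hypothetical grammar:  expr -> expr - term | term ;  term -> id | num
expression_grammar = Grammar(
    terminals=frozenset({"id", "num", "-"}),
    nonterminals=frozenset({"expr", "term"}),
    productions=(
        ("expr", ("expr", "-", "term")),
        ("expr", ("term",)),
        ("term", ("id",)),
        ("term", ("num",)),
    ),
    start="expr",
)
print(expression_grammar.is_well_formed())
```

Each production stores its head and body separately, so the "arrow" of the written notation is implicit in the pair.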


Abstract Syntax Tree


▪ The parse tree is used to recognize the components of the program and to check that the syntax is correct.

▪ As the parser applies productions, it usually generates the components of a simpler tree (known as the Abstract Syntax Tree).

▪ The meaning of a component is derived from the way the statement is organized in a subtree.
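To make the last point concrete, here is a small Python sketch of an abstract syntax tree for the expression `oldsum - value / 100` and an evaluator that walks it. The node classes (`Num`, `Name`, `BinOp`) and the `evaluate` function are assumptions for illustration, not a representation taken from the slides.

```python
import operator
from dataclasses import dataclass


@dataclass
class Num:
    value: int


@dataclass
class Name:
    ident: str


@dataclass
class BinOp:
    op: str
    left: object
    right: object


OPERATORS = {"+": operator.add, "-": operator.sub,
             "*": operator.mul, "/": operator.truediv}


def evaluate(node, env):
    """Compute the value of a subtree; env maps variable names to values."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Name):
        return env[node.ident]
    if isinstance(node, BinOp):
        return OPERATORS[node.op](evaluate(node.left, env), evaluate(node.right, env))
    raise TypeError(f"unknown node: {node!r}")


env = {"oldsum": 50, "value": 200}

# oldsum - (value / 100): the division is the deeper subtree, so it happens first.
tree = BinOp("-", Name("oldsum"), BinOp("/", Name("value"), Num(100)))

# (oldsum - value) / 100: same leaves, different organization, different meaning.
other = BinOp("/", BinOp("-", Name("oldsum"), Name("value")), Num(100))

print(evaluate(tree, env), evaluate(other, env))
```

The two trees contain the same leaves and operators; only their shape differs, and that shape alone decides the result.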


Comparison with Lexical Analysis

Phase     Input                     Output
Lexer     Sequence of characters    Sequence of tokens
Parser    Sequence of tokens        Parse tree


Semantic Analyzer


▪ The semantic analyzer completes the symbol table with information on the characteristics of each identifier.

▪ The symbol table is usually initialized during parsing.

▪ One entry is created for each identifier and constant.

▪ Scope is taken into account: two different variables with the same name will have different entries in the symbol table.
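One common way to give same-named variables separate entries is a stack of per-scope dictionaries. The Python sketch below assumes that design; the class `ScopedSymbolTable` and its method names are hypothetical.

```python
class ScopedSymbolTable:
    """Symbol table with nested scopes, kept as a stack of dictionaries."""

    def __init__(self):
        self._scopes = [{}]  # index 0 is the global scope

    def enter_scope(self):
        self._scopes.append({})

    def exit_scope(self):
        self._scopes.pop()

    def declare(self, name, **info):
        """Create a new entry in the innermost scope."""
        current = self._scopes[-1]
        if name in current:
            raise KeyError(f"{name!r} is already declared in this scope")
        entry = {"name": name, "depth": len(self._scopes) - 1, **info}
        current[name] = entry
        return entry

    def lookup(self, name):
        """Search from the innermost scope outwards; None if undeclared."""
        for scope in reversed(self._scopes):
            if name in scope:
                return scope[name]
        return None


table = ScopedSymbolTable()
outer_x = table.declare("x", type="int")
table.enter_scope()
inner_x = table.declare("x", type="float")  # same name, separate entry
seen_inside = table.lookup("x")
table.exit_scope()
seen_outside = table.lookup("x")
print(seen_inside["type"], seen_outside["type"])
```

Inside the nested scope the name resolves to the inner entry; after the scope is closed it resolves to the outer one again.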


Translator

▪ The lexical scanner, parser, and semantic analyzer are collectively known as the front end of the compiler.

▪ The second part, or back end, starts by generating low-level code from the (possibly optimized) AST.


▪ Rather than generate code for a specific architecture, most compilers generate an intermediate language.

▪ Three-address code is popular.
    Really a flattened tree representation.
    Simple.
    Flexible (captures the essence of many target architectures).
    Can be interpreted.
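The "flattened tree" idea can be shown with a post-order walk that emits one instruction per operator. In this Python sketch an expression tree is a nested tuple `(op, left, right)` and each instruction is a tuple `(op, arg1, arg2, result)`; both encodings, the temporary names `t1`, `t2`, and the function `to_three_address` are assumptions for illustration.

```python
from itertools import count


def to_three_address(expr):
    """Flatten a nested (op, left, right) tuple into three-address instructions."""
    code = []
    temps = count(1)

    def visit(node):
        if not isinstance(node, tuple):
            return str(node)  # a leaf: variable name or constant
        op, left, right = node
        left_place = visit(left)    # children first (post-order)
        right_place = visit(right)
        temp = f"t{next(temps)}"    # fresh temporary holds this subtree's value
        code.append((op, left_place, right_place, temp))
        return temp

    result = visit(expr)
    return code, result


# oldsum - value / 100
code, result = to_three_address(("-", "oldsum", ("/", "value", 100)))
for op, arg1, arg2, dest in code:
    print(f"{dest} := {arg1} {op} {arg2}")
```

Every instruction names at most two operands and one result, which is what makes the form easy to interpret or to map onto many target architectures.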


Optimizers

▪ Intermediate code is examined and improved.

▪ Can be simple:
    changing “a:=a+1” to “increment a”
    changing “3*5” to “15”

▪ Can be complicated:
    reorganizing data and data accesses for cache efficiency

▪ Optimization can improve running time by orders of magnitude, often also decreasing program size.
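The two simple rewrites above can be sketched as a function over single three-address instructions. The tuple encoding `(op, arg1, arg2, dest)` and the mnemonics `ADD` and `INC` are assumptions for this example; `MULT` and `MOV` follow the mnemonics used in the deck's compiler overview example.

```python
def optimize(instruction):
    """Apply simple local rewrites to one (op, arg1, arg2, dest) instruction."""
    op, arg1, arg2, dest = instruction
    if op == "MULT":
        if isinstance(arg1, int) and isinstance(arg2, int):
            return ("MOV", arg1 * arg2, dest)  # constant folding: 3*5 becomes 15
        if arg1 == 1:
            return ("MOV", arg2, dest)         # multiplying by 1 is just a copy
        if arg2 == 1:
            return ("MOV", arg1, dest)
    if op == "ADD" and arg1 == dest and arg2 == 1:
        return ("INC", dest)                   # a := a + 1 becomes "increment a"
    return instruction                         # nothing to improve


print(optimize(("MULT", 3, 5, "x")))
print(optimize(("MULT", 1, "d", "c")))
print(optimize(("ADD", "a", 1, "a")))
```

Rewrites like these look at one instruction at a time; the more complicated optimizations mentioned above need information about the whole program.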


Code Generation

▪ Generation of “real executable code” for a particular target machine.

▪ It is completed by the Final Assembly phase.

▪ Final output can either be
    assembly language for the target machine
    object code ready for linking

▪ The “target machine” can be a virtual machine (such as the Java Virtual Machine, JVM), and the “real executable code” is “virtual code” (such as Java Bytecode).


Compiler Overview

Source Program:
    IF (a<b) THEN c=1*d;

Lexical Analyzer -> Token Sequence:
    IF ( ID “a” < ID “b” ) THEN ID “c” = CONST “1” * ID “d”

Syntax Analyzer -> Syntax Tree:
    IF_stmt
        cond_expr: <
            a
            b
        list
            assign_stmt
                lhs: c
                rhs: *
                    1
                    d

Semantic Analyzer -> 3-Address Code:
    GE a, b, L1
    MULT 1, d, c
    L1:

Code Optimizer -> Optimized 3-Addr. Code:
    GE a, b, L1
    MOV d, c
    L1:

Code Generation -> Assembly Code:
    loadi R1,a
    cmpi R1,b
    jge L1
    loadi R1,d
    storei R1,c
    L1:
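The last step of this example, from optimized three-address code to assembly, can be imitated with a small template-based generator. The Python below is only a sketch: the tuple encoding of instructions, the `LABEL` pseudo-instruction, and the fixed use of register `R1` are assumptions, and only the two instruction kinds that occur in the example are handled.

```python
def generate(three_address):
    """Translate a list of three-address tuples into assembly lines."""
    assembly = []
    for instruction in three_address:
        op = instruction[0]
        if op == "GE":
            # GE left, right, label: jump to label when left >= right
            _, left, right, label = instruction
            assembly += [f"loadi R1,{left}", f"cmpi R1,{right}", f"jge {label}"]
        elif op == "MOV":
            # MOV source, dest: copy through the register
            _, source, dest = instruction
            assembly += [f"loadi R1,{source}", f"storei R1,{dest}"]
        elif op == "LABEL":
            assembly.append(f"{instruction[1]}:")
        else:
            raise ValueError(f"no template for {op!r}")
    return assembly


optimized = [("GE", "a", "b", "L1"), ("MOV", "d", "c"), ("LABEL", "L1")]
print("\n".join(generate(optimized)))
```

Each intermediate instruction expands into a fixed template of machine instructions, which is the simplest form of code generation.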


Exercise: Abstract Syntax Tree

x := a + b;

y := a * b;

while (y > a) {

a := a + 1;

x := a + b;

}


Email: [email protected]

http://cist.buct.edu.cn/staff/zheng/

Office: 科410