Value-Based Program Characterization and Its Application to Software Plagiarism Detection




Embedded Lab. Park Yeongseong

ICSE 2011

Yoon-Chan Jhi, Xinran Wang, Sencun Zhu, Peng Liu, Dinghao Wu
Penn State University

Xiaoqi Jia
State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences

Contents

Introduction
State of the art
Core values
Design
Experiment
Discussion
Conclusion
Q&A

Introduction

Identifying the same or similar code is very important
Previous works
◦ Static source code comparison – C1
◦ Static executable code comparison – C2
◦ Dynamic control flow based methods – C3
◦ Dynamic API based methods – C4

Introduction

Three highly desired requirements
◦ R1 – Resiliency
◦ R2 – Ability to directly work on binary executables
◦ R3 – Platform independence
BUT: no existing category satisfies all three requirements (unmet requirements listed per category)
◦ Static source code comparison – C1: R1, R2
◦ Static executable code comparison – C2: R1
◦ Dynamic control flow based methods – C3: R1, R3
◦ Dynamic API based methods – C4: R3

Introduction

Introduce new approach
◦ Core-values
5 optimization options (-O0 to -O3, -Os)
3 compilers (GCC, TCC, WCC)
KlassMaster, Thicket, Loco/Diablo obfuscators

State of the art

Code Obfuscation Techniques
◦ data obfuscation, control obfuscation, layout obfuscation and preventive transformations
◦ indirect branches, control-flow flattening, function-pointer aliasing
Static Analysis Based Plagiarism Detection
◦ String-based
◦ AST-based
◦ Token-based
◦ PDG-based
◦ Birthmark-based

State of the art

Dynamic Analysis Based Plagiarism Detection
◦ Whole program path based (WPP)
◦ Sequence of API function calls birthmark (EXESEQ)
◦ Frequency of API function calls birthmark (EXEFREQ)
◦ System call based birthmark

Core values

Runtime values
◦ The output operands of the machine instructions executed
Core values
◦ Constructed from runtime values
Eliminate non-core values (for a runtime value v of program P run on input I)
◦ If v is not derived from I, v is not a core-value of P
◦ If v is not in the set of runtime values of f(P), a semantics-preserving transformation of P, v is not a core-value of P
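The two elimination rules can be read as a filter over a recorded trace of runtime values. The sketch below is only an illustration under stated assumptions: the `RuntimeValue` record, its `derived_from_input` flag (imagined as the result of some taint-tracking step), and the use of a second, semantically equivalent build for the membership check are all hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeValue:
    """One output operand observed during execution (hypothetical record)."""
    value: int
    derived_from_input: bool  # assumed to be supplied by a taint-tracking step


def candidate_core_values(trace, other_version_values):
    """Drop values ruled out by the two elimination rules.

    trace: runtime values of the program, in execution order.
    other_version_values: set of runtime values observed in another,
        semantically equivalent build of the same program (assumption).
    """
    kept = []
    for rv in trace:
        if not rv.derived_from_input:
            continue  # rule 1: not derived from the input
        if rv.value not in other_version_values:
            continue  # rule 2: absent from the other version's runtime values
        kept.append(rv.value)
    return kept


trace = [
    RuntimeValue(0x10, derived_from_input=False),  # e.g. a loop counter
    RuntimeValue(42, derived_from_input=True),
    RuntimeValue(7, derived_from_input=True),
    RuntimeValue(99, derived_from_input=True),     # appears only in this build
]
print(candidate_core_values(trace, other_version_values={42, 7, 16}))  # [42, 7]
```

Values that survive both rules are only candidates: the rules eliminate non-core values, they do not prove that what remains is a core value.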

Core values

Design - Value Sequence Extraction

Not all values associated with the execution of a program are core-values
◦ Value-updating instruction
◦ Related to the program’s semantics

Design - Value Sequence Refinement and Similarity Metric

To refine value sequences
◦ Sequential refinement – reduction rate 16% to 34%
◦ Optimization-based refinement – 5 optimization options
◦ Address removal – exclude pointer values
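The refinement steps and a similarity score might be sketched as follows. The slide does not spell out the details, so several things here are assumptions: sequential refinement is taken to mean collapsing runs of identical consecutive values, optimization-based refinement is taken to mean keeping only values that also occur in the other builds, the `is_address` predicate is a hypothetical stand-in for pointer detection, and the score is assumed to be the longest-common-subsequence (LCS) length divided by the plaintiff sequence length.

```python
def sequential_refinement(seq):
    """Collapse runs of identical consecutive values (assumed interpretation)."""
    out = []
    for v in seq:
        if not out or out[-1] != v:
            out.append(v)
    return out


def optimization_refinement(seq, other_builds):
    """Keep values that also occur in every other build's value sequence."""
    common = set(seq).intersection(*map(set, other_builds))
    return [v for v in seq if v in common]


def remove_addresses(seq, is_address):
    """Exclude pointer values, as judged by a caller-supplied predicate."""
    return [v for v in seq if not is_address(v)]


def lcs_length(a, b):
    """Length of the longest common subsequence, O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(plaintiff, suspect):
    """Assumed metric: |LCS(plaintiff, suspect)| / |plaintiff|."""
    if not plaintiff:
        return 0.0
    return lcs_length(plaintiff, suspect) / len(plaintiff)


# Hypothetical data: one raw sequence plus sequences from two other builds.
raw = [5, 5, 0x7FFF0010, 12, 12, 30, 8]
other_builds = [[5, 0x7FFF0010, 12, 30, 8, 77], [5, 12, 0x7FFF0010, 8, 30]]


def is_address(v):
    return v >= 0x10000000  # crude illustrative heuristic, not from the slides


refined = sequential_refinement(raw)
refined = optimization_refinement(refined, other_builds)
refined = remove_addresses(refined, is_address)
print(refined)                                # [5, 12, 30, 8]
print(similarity(refined, [9, 5, 12, 4, 8]))  # 0.75
```

Dividing by the plaintiff length makes the score asymmetric: it measures how much of the plaintiff's refined sequence reappears, in order, in the suspect's sequence.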

Design - Overview

Experiment

Intel Quad-Core 2.00 GHz CPU
4GB RAM
Linux machine
QEMU 0.9.1
Questions
1. resilient
2. false accusation
3. credible

Experiment - Obfuscation tool (resiliency)

Obfuscation techniques
◦ SandMark, KlassMaster: Java bytecode obfuscators
Test application: JLex
◦ Lexical analyzer

Experiment - Similar Programs (false accusation)

Test application
◦ 5 individual XML parsers: expat, libxml2, Parsifal, rxp, xercesc

Experiment - Different Programs (credible)

Test application
◦ bzip2, gzip, oggenc, 9 of 11 programs
Result
◦ Similarity scores between 0 and 0.27
◦ zip and gzip similarity scores are 1.0
   Same compression algorithm: deflate
◦ zip and bzip2 similarity scores are 0.01 to 0.03
   Different compression algorithm: block sorting

Conclusion

Introduces a novel approach to dynamic characterization of executable programs.
The value-based method successfully discriminates 34 plagiarisms obfuscated by SandMark, KlassMaster, and Thicket.

Q&A
