21
Department of Computer Science, Graduate School of Information Science & Technology, Osaka University A clone detection approach for a collection of similar large-scale software products Eunjong Choi† , Norihiro Yoshida‡, Yoshiki Higo†, Katsuro Inoue† †Osaka University Institute of Science and Technology 1

A clone detection approach for a collection of similar large-scale software products

  • Upload
    alma

  • View
    32

  • Download
    0

Embed Size (px)

DESCRIPTION

A clone detection approach for a collection of similar large-scale software products. Eunjong Choi† , Norihiro Yoshida‡, Yoshiki Higo†, Katsuro Inoue† †Osaka University ‡Nara Institute of Science and Technology. Software Development for Mobile Device (1/2). - PowerPoint PPT Presentation

Citation preview

Page 1: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology,Osaka University

A clone detection approach for a collection of similar large-scale software products

Eunjong Choi†, Norihiro Yoshida‡, Yoshiki Higo†, Katsuro Inoue†

†Osaka University‡Nara Institute of Science and Technology

1

Page 2: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Software Development for Mobile Device (1/2)

Releases a new model in regular and rapid rushed intervals

Adapts to various country constraints and needse.g. Oshaifu-Keitai for Japan

2

Page 3: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Software Development for Mobile Device (2/2)

Develop software by reusing common pieces and implement unique pieces for each feature.

3

+ + +

Reusedpieces

Unique features

Page 4: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Reused Source Code Pieces

Source code is reused in code fragment level (code clones) and file level

.

Detecting and managing reused pieces is necessarye.g. Inconsistency management, plagiarism detection

4

A code clone : a code fragment that has lexically, syntactically, or semantically similar code fragments in source code

Page 5: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Code Clone

Generated by:Code reuse by copy & pasteStereotyped functions or tool generated code

5

Code Clone Code Clone

Code Clone

A clone set: A set of code clones that are similar or identical to each other

Page 6: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Code Clone Detection

Various detection techniques and tools have been proposed e.g. Text-based and line-based(dup)[Baker1995],

token-base(CP-minder)[Li2006]

CCFinder [Kamiya2002] A token-base clone detection toolMulti language support (C, C++, COBOL, Java, ...)Good speed (5MLOC/20m)

6

[Baker1995] B. S. Baker. On nding duplication and near-duplication in large software systems. In Proc. of WCRE, pages 86, July 1995.[Li 2006] Z. Li, S. Lu, S. Myagmar and Y. Zhou. CP-Miner: Finding Copy-Paste and Related Bugs in Large-Scale Software Code. IEEE Transactions on Software Engineering, 32: pages 176-192, 2006[Kamiya2002] T. Kamiya, S. Kusumoto and K. Inoue. CCFinder: A Multilinguistic Token-Based Code Clone Detection System for Large Scale Source Code. IEEE TSE, 28: 654-670, 2002

Page 7: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Example of Clone Detection Technique: CCFinder

Source files

Lexical analysis

Transformation

Token sequence

Match detection

Transformed token sequence

Clones on transformed sequence

Formatting

Code clones

1. static void foo() throws RESyntaxException { 2. String a[] = new String [] { "123,400", "abc", "orange 100" }; 3. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); 4. int sum = 0; 5. for (int i = 0; i < a.length; ++i) 6. if (pat.match(a[i])) 7. sum += Sample.parseNumber(pat.getParen(0)); 8. System.out.println("sum = " + sum); 9. }10. static void goo(String [] a) throws RESyntaxException {11. RE exp = new RE("[0-9,]+");12. int sum = 0;13. for (int i = 0; i < a.length; ++i)14. if (exp.match(a[i]))15. sum += parseNumber(exp.getParen(0));16. System.out.println("sum = " + sum);17. }

static void foo ( ) { String a

[ ] = new String [ ] { "123,400" ,

"abc" , "orange 100" } ;

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

sum

+= pat . getParen 0

; System . out . println ( "sum = "

+ sum ) ; }

throws RESyntaxException

Sample . parseNumber (

) )

if pat

. match a [ i ]( ) )

org . apache . regexp

. RE pat = new org . apache . regexp

. RE ( "[0-9,]+" ) ;

static void goo (

) {

String

a [ ]

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

System . out . println ( "sum = " + sum

) ; }

throws RESyntaxException

if exp

. match a [ i ]( ) )

exp =

new RE ( "[0-9,]+" ) ;

(

RE

sum

+= exp . getParen 0

;

parseNumber ( ) )(

(

(

[ ] = new String [ ] {

} ;

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

sum

+= pat . getParen 0

; System . out . println ( "sum = "

+ sum ) ; }

Sample . parseNumber (

) )

if pat

. match a [ i ]( ) )

pat = new

RE ( "[0-9,]+" ) ;

static void goo (

) {

String

a [ ]

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

System . out . println ( "sum = " + sum

) ; }

throws RESyntaxException

if exp

. match a [ i ]( ) )

exp =

new RE ( "[0-9,]+" ) ;

(

RE

sum

+= exp . getParen 0

;

parseNumber ( (

(

(

static void foo ( ) { String athrows RESyntaxException

$

RE

$ . ) )

Lexical analysis

Transformation

Token sequence

Match detection

Transformed token sequence

Clones on transformed sequence

Formatting

[ ] = new String [ ] {

} ;

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

sum

+= pat . getParen 0

; System . out . println ( "sum = "

+ sum ) ; }

Sample . parseNumber (

) )

if pat

. match a [ i ]( ) )

pat = new

RE ( "[0-9,]+" ) ;

static void goo (

) {

String

a [ ]

int sum = 0

; for ( int i = 0 ; i <

a . length ; ++ i )

System . out . println ( "sum = " + sum

) ; }

throws RESyntaxException

if exp

. match a [ i ]( ) )

exp =

new RE ( "[0-9,]+" ) ;

(

RE

sum

+= exp . getParen 0

;

parseNumber ( ) )(

(

(

static void foo ( ) { String athrows RESyntaxException

$

RE

$ .

[ ] = [ ] {

} ;

=

; for ( = ; <

. ; ++ )

+= .

; . . (

+ ) ; }

. (

) )

if

. [ ]( ) )

=

( ) ;

static (

) {[ ]

=

; ( = ; <

. ; ++ )

. . ( +

) ; }

throws

if

. [ ]( ) )

=

new ( ) ;

(

+= .

;

( ) )(

(

(

static $ ( ) {throws

$

$ .

$ $ $ $

$ $

$ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

new

for for

new

[ ] = [ ] {

} ;

=

; for ( = ; <

. ; ++ )

+= .

; . . (

+ ) ; }

. (

) )

if

. [ ]( ) )

=

( ) ;

static (

) {[ ]

=

; ( = ; <

. ; ++ )

. . ( +

) ; }

throws

if

. [ ]( ) )

=

new ( ) ;

(

+= .

;

( ) )(

(

(

static $ ( ) {throws

$

$ .

$ $ $ $

$ $

$ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $

$ $ $ $ $

Lexical analysis

Transformation

Token sequence

Match detection

Transformed token sequence

Clones on transformed sequence

Formatting

7

1. static void foo() throws RESyntaxException { 2. String a[] = new String [] { "123,400", "abc", "orange 100" }; 3. org.apache.regexp.RE pat = new org.apache.regexp.RE("[0-9,]+"); 4. int sum = 0; 5. for (int i = 0; i < a.length; ++i) 6. if (pat.match(a[i])) 7. sum += Sample.parseNumber(pat.getParen(0)); 8. System.out.println("sum = " + sum); 9. }10. static void goo(String [] a) throws RESyntaxException {11. RE exp = new RE("[0-9,]+");12. int sum = 0;13. for (int i = 0; i < a.length; ++i)14. if (exp.match(a[i]))15. sum += parseNumber(exp.getParen(0));16. System.out.println("sum = " + sum);17. }

Lexical analysis

Transformation

Token sequence

Match detection

Transformed token sequence

Clones on transformed sequence

Formatting

Page 8: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Problem of Code Clone Detection Tools

Take enormous time for existing tools to detect code clones on large-scale software

Suggest an approach for detecting code clone for a collection of similar large-scale software productsExcluding detecting code clones among each

set of files that are identical each other

8

A identical file set : A set of files that are identical each other

Page 9: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Overview Of Our Approach

Step1. Calculate MD5 hash. Step2. Prepare Input Files for CCFinder Step3. Detect code clones using CCFinder Step4. Generate all clone sets

9

(1) Calculate MD5 Hash

(2) Prepare Input Files

for CCFinder

(3) Detect Code Clones Using

CCFinder

(4) Generate All Clone

Sets

Hashed files

Input Files for

CCFinder

Identical File Sets

Clone Sets

All Clone Sets

Source Files

Page 10: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Step1. Calculate MD5 Hash

Creates MD5 hash value of input filesMD5 hash does not require any large

substitution tables

10

Ce9e187434e35746abf2C9ad2A77bdd27ed90608d1262244897ccd1164dc3

------------------------------------------------------

Calculate

MD5 Hash

Source Files Source Files

Page 11: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Step2. Prepare Input Files for CCFinder (1/2)

Detect identical file sets

11

0cc0cc 0cc

A1 A2 A3 C1 D1

175 A9

. . . .be05be05

. . . .

B2B1

MD5 Hash

Identical File Set A

Identical File Set B

Identical File SetsInput Files for CCFinder

Page 12: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University 12

C1 D1

175 A9

. . . . be05be05

B2B1

. . . .

Identical File Set A

Identical File Set B

Identical File SetsInput Files for CCFinder

0cc0cc 0cc

A1 A2 A3

Step2. Prepare Input Files for CCFinder (2/2)

Prepare Input Files for CCFinder

Page 13: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Step3. Detect Code Clones

Use CCFinder to detect code clones

13

C1 D1

. . . .

B2B1

. . . .

Identical File Set A

Identical File Set B

Identical File Sets

A1 A2 A3

Detected Code Clones

Code Clones Detected by CCFinder

Page 14: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Step3. Detect Code Clones

Use CCFinder to detect code clones

14

C1 D1

. . . .. . . .

Identical File Set A

Identical File Set B

Identical File Sets

A1 A2 A3

Detected Code Clones

B2B1 D1

Clone Sets Detected by CCFinder

Clone Set 1

Clone Set 2

Page 15: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Step4. Generate all clone sets

Generate all clone sets

15

A1 A2 A3

. . . .

B2B1

. . . .

Identical File Set A

Identical File Set B

A2 A3

C1 D1 A1

D1 B2B1

All Clone Sets

Clone Set 1

Clone Set 2. . . .

Identical File Sets

Page 16: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Overview of Case Study (2/2)

ApproachCompare detection time between our method and

using only CCFinderConfirm that the detection result of our method is

the same as the one of using only CCFinder

Detection Environment64 bits Windows 7 Professional workstation

equipped with 2 processors and 24 gigabytes of main memory.

16

Page 17: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Results of Case Study (1/2)

Detection TimeOur approach detects code clones faster than

using only CCFinder.

17

Project Name Only CCFinder Our Approach#Clone Sets

Time (sec.)

#Clone Sets

#Identical File Sets

Time (sec.)

Apache ant 11,169 241 10,692 3,083 89Linux kernel 24,235 1,119 21,343 1,967 168Galxy y pro(GT-B5510 model

325,274 113,445

148,761 28,181 11,902

Page 18: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Results of Case Study (2/2)

Accuracy of Results : manually checked outputsArbitrary selected 30 clone sets that are

detected by our approach from each OSSSelected 30 identical file sets from each OSS

project.

18

Page 19: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Summary

Suggest an approach for detecting code clones for a collection of similar large-scale software products. MD5 hash to identify identical file setsCCFinder to detect code clones.

Apply our approach to three OSS projects and compared code clone detection time between using only CCFinder and our approach.Our approach takes shorter time to detect code clones.

19

Page 20: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Future Work

Improve for detecting files with slightly modification as identical file setsOur current approach detects file that are

identical each as a identical file set Apply to various size of software projects in

different domains Introduce other code clones detection tools

and compare results from them in the case study

20

Page 21: A clone detection approach  for  a collection of similar  large-scale software  products

Department of Computer Science, Graduate School of Information Science & Technology, Osaka University

Thank you for your attentions

21