Reverse Architecting Arie van Deursen



2

Outline

• Legacy systems
• Reverse architecting
• Architecture exploration
  – Extraction
  – Abstraction
  – Presentation
• Evaluation

3

Motivation

• Multi-channel distribution
  – Web enable existing applications
• Due diligence / QA
  – Company merger
• Helping software immigrants
• Estimating new functionality

Documentation at best out of date

4

Legacy Systems

Definition: Any information system that significantly resists evolution to meet new and changing business requirements

Characteristics
• Large
• Geriatric
• Outdated languages
• Outdated databases
• Isolated

5

Software Volume

• Capers Jones software size estimate: 700,000,000,000 lines of code
  – 7 × 10^9 function points
  – 1 fp ~ 110 lines of code
• Total nr of programmers: 10,000,000
  – 40% new dev., 45% enhancements, 15% repair
  – (2020: 30%, 55%, 15%)
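The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only recomputes the slide's own numbers; the variable names are illustrative, and the per-programmer figure is a derived value, not something the slide states.

```python
# Figures taken from the slide; all names are illustrative only.
FUNCTION_POINTS = 7 * 10**9   # estimated software volume in function points
LOC_PER_FP = 110              # approximate lines of code per function point
PROGRAMMERS = 10_000_000      # total number of programmers

# 7e9 fp * 110 loc/fp = 7.7e11 lines, the same order as the quoted 700 billion.
total_loc = FUNCTION_POINTS * LOC_PER_FP

# Derived value (not on the slide): average volume per programmer.
loc_per_programmer = total_loc // PROGRAMMERS

# Split of programmers over activities, using the first set of percentages.
split = {"new development": 0.40, "enhancements": 0.45, "repair": 0.15}
headcount = {activity: round(PROGRAMMERS * share) for activity, share in split.items()}
```

With these numbers, more than half of all programmers work on existing code rather than new development.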

6

Legacy By Example

7

Reverse Architecting: Motivation

• Architecture description lost or outdated
• Obtain advantages of explicit architecture:
  – Stakeholder communication
  – Explicit design decisions
  – Transferable abstraction
• Architecture conformance checking
• Quality attribute analysis

8

Software Architecture

The structure(s) of a system, which comprise the software components, the externally visible properties of those components, and the relationships among them

9

Architectural Structures

• Module structure
• Data model structure
• Process structure
• Call structure
• Type structure
• GUI flow
• ...

10

The 4 + 1 View Model

• Logical view
• Process view
• Development view
• Physical view
• Use case view

Extract & compare!

11

Reverse Engineering

The process of analyzing a subject system with two goals in mind:
• to identify the system's components and their interrelationships; and
• to create representations of the system in another form or at a higher level of abstraction.

Ranges from decompilation to reverse architecting.

12

Reengineering

The examination and alteration of a subject system to reconstitute it in a new form, and the subsequent implementation of that new form

Beyond analysis -- actually improve.

13

Reengineering

14

Program Understanding

The task of building mental models of an underlying software system at various abstraction levels, ranging from models of the code itself to ones of the underlying application domain, for software maintenance, evolution, and reengineering purposes

50% of maintenance effort!

15

Cognitive Processes

• Building a mental model
• Top down / bottom up / opportunistic
• Generate and validate hypotheses
• Chunking: create higher structures from chunks of low-level information
• Cross referencing: understand relationships

16

Supporting Program Understanding

• Architects build up mental models:
  – various abstractions of software system
  – hierarchies for varying levels of detail
  – graph-like structures for dependencies
• How can we support this process?
  – infer number of predefined abstractions
  – enrich system’s source code with abstractions
  – let architect explore result

17

Architecture Exploration

Lesson from compiler construction: split processing into separate stages
• parsing turns source code into intermediate form
• optimisation improves intermediate form
• code generation emits the machine code

Goal: translate source code into a form that can easily be processed by humans

Similarity with compilers: they translate source code into a form that can be processed by machines

18

Architecture Exploration

• Extract source models from system artifacts
• Query/manipulate to infer new knowledge
• Present different views on results

[Diagram: artifacts, extract, repository, query, results, view]
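The three stages can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the tooling used in the talk: the artifacts, program names, and the functions `extract`, `query`, and `view` are all hypothetical, and the repository is simply a set of fact triples.

```python
import re
from collections import defaultdict

# Hypothetical artifacts: program name -> source text.
ARTIFACTS = {
    "PAYROLL": "CALL TAXCALC\nCALL PRINTER\n",
    "TAXCALC": "CALL RATES\n",
    "PRINTER": "",
    "RATES": "",
}

def extract(artifacts):
    """Extract: derive (caller, 'calls', callee) facts from the sources."""
    repository = set()
    for program, text in artifacts.items():
        for callee in re.findall(r"\bCALL\s+([A-Z][A-Z0-9-]*)", text):
            repository.add((program, "calls", callee))
    return repository

def query(repository, start):
    """Query: infer new knowledge, here every program reachable from `start`."""
    graph = defaultdict(set)
    for caller, _, callee in repository:
        graph[caller].add(callee)
    reachable, todo = set(), [start]
    while todo:
        for callee in graph[todo.pop()]:
            if callee not in reachable:
                reachable.add(callee)
                todo.append(callee)
    return reachable

def view(results):
    """Present: render the results as a sorted, one-per-line listing."""
    return "\n".join(sorted(results))

report = view(query(extract(ARTIFACTS), "PAYROLL"))
```

Keeping the repository as plain facts means new queries and views can be added without re-running the extraction.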


20

Source Model Extraction

• Derive information from system artifacts
  – variable usage, call graphs, file dependencies, database access, …
• Challenges
  – Accurate & complete results
  – Flexible: easy to write and adapt
  – Robust: deal with irregularities in input

21

Grammar Challenges

• Syntax errors
• Language dialects
• Local idioms
• Missing parts
• Embedded languages
• Preprocessing

• Additional problem: grammar availability
  – process languages without grammar (e.g. undisclosed proprietary languages)
  – development of full grammar is expensive (Cobol: 1500 productions, 4-5 months)

22

Processing Artifacts

• Syntactical analysis
  – generate / hand-code / reuse parser
• Lexical analysis
  – tools like perl, grep, Awk or LSME, MultiLex
  – generally easier to develop

              accurate  complete  flexible  robust
syntactical      +         +         –         –
lexical          –         –         +         +
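A lexical extractor in the spirit of the perl/grep/Awk approach can be sketched with a single regular expression. The example below is an assumption-laden illustration: the sample source, the pattern `COPY_RE`, and the function `copybooks` are hypothetical, and the pattern covers only the simplest form of a Cobol COPY statement.

```python
import re

# Matches lines of the form "COPY <name>." and captures the copybook name.
# Only this simplest form is handled; that is the accuracy price of lexical analysis.
COPY_RE = re.compile(r"^\s*COPY\s+([A-Z0-9-]+)\s*\.", re.IGNORECASE | re.MULTILINE)

def copybooks(source):
    """Return the sorted, de-duplicated copybook names included by `source`."""
    return sorted({name.upper() for name in COPY_RE.findall(source)})

# Hypothetical input with an irregular line and a commented-out COPY.
SAMPLE = """\
       DATA DIVISION.
       COPY CUSTREC.
       THIS LINE IS NOT VALID COBOL AT ALL %%
       copy addrrec.
      * COPY OLDREC.
"""

found = copybooks(SAMPLE)
```

The invalid line does not stop the extraction, which is the robustness the table credits to lexical analysis; a parser built from a full grammar would reject this input.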

23

Island Grammars

Grammar containing:
• detailed productions for constructs of interest
• liberal productions that catch remainder

Islands: accuracy & completeness

Water: robustness

24

Island Grammars

[Figure: an input, its parse tree under a “standard” grammar, and its parse tree under an island grammar]

25

Island Grammars

Accept larger language: catch dialects, syntax errors, embedded languages, …

[Figure: languages L and L_island]

26

Island Grammars

• Often smaller grammar
• can share productions
• can have different structure

[Figure: grammars G_L, G_i and G_i’]

27

Example (Water)

    lexical syntax
      ~[] -> Water {avoid}

    context-free syntax
      Water -> Part
      Part* -> Input

Water is “fall-back”

28

Example (Program Calls)

    lexical syntax
      ~[] -> Water {avoid}
      [A-Z][A-Z0-9]* -> Id

    context-free syntax
      Water -> Part
      Part* -> Input
      "CALL" Id -> Call
      Call -> Part

Water is “fall-back”
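The same idea can be imitated outside a grammar formalism. The sketch below is a hand-written approximation in Python, not a translation of the grammar: it works on whitespace-separated tokens rather than characters, and the names `Call`, `Water`, and `parse` are illustrative. The island is the `CALL Id` construct; every other token falls back to water.

```python
import re
from typing import NamedTuple

class Call(NamedTuple):
    """Island: a recognised program call."""
    target: str

class Water(NamedTuple):
    """Water: any token that is not part of an island."""
    text: str

ID = re.compile(r"[A-Z][A-Z0-9]*")

def parse(source):
    """Split the input into parts, each either a Call island or Water."""
    tokens = source.split()
    parts, i = [], 0
    while i < len(tokens):
        # Detailed production: the keyword CALL followed by an identifier.
        if tokens[i] == "CALL" and i + 1 < len(tokens) and ID.fullmatch(tokens[i + 1]):
            parts.append(Call(tokens[i + 1]))
            i += 2
        else:
            # Liberal fall-back: accept the token as water and move on.
            parts.append(Water(tokens[i]))
            i += 1
    return parts

# Hypothetical input mixing valid statements, garbage, and a malformed call.
parts = parse("MOVE X TO Y CALL RA31 ?? garbage CALL 42")
calls = [part.target for part in parts if isinstance(part, Call)]
```

Because the fall-back accepts anything, the parse never fails; only the constructs of interest need to be described precisely.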


30

Query and Manipulate

• Goals:
  – infer new knowledge & abstractions
  – filter information
• Example structures: perform graph, call graph (OI, PVL), screen flow, batch job, subsystem dbs

In search for more abstraction
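One common way to gain abstraction is to lift a low-level relation to a coarser level. The sketch below lifts a program-level call graph to subsystem level and filters out calls that stay inside one subsystem. The call edges, the program-to-subsystem mapping, and the function `lift` are hypothetical examples, not data from the talk.

```python
# Hypothetical extracted facts: program-level call edges.
CALLS = {
    ("PAY01", "PAY02"),
    ("PAY02", "TAX01"),
    ("TAX01", "DB01"),
    ("PAY01", "DB01"),
}

# Hypothetical containment relation: program -> subsystem.
SUBSYSTEM_OF = {
    "PAY01": "payroll",
    "PAY02": "payroll",
    "TAX01": "tax",
    "DB01": "storage",
}

def lift(calls, subsystem_of):
    """Lift program-level call edges to edges between subsystems."""
    lifted = set()
    for caller, callee in calls:
        source, target = subsystem_of[caller], subsystem_of[callee]
        if source != target:  # filter: hide calls internal to one subsystem
            lifted.add((source, target))
    return lifted

subsystem_calls = lift(CALLS, SUBSYSTEM_OF)
```

Four program-level edges collapse into three subsystem-level edges, and the internal payroll call disappears from the view.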

31

Combining Data & Functionality

• Cluster analysis
  – technique for finding groups in data
  – Relies on metrics to compare distance between data items
• Concept analysis
  – for finding groups too
  – Relies on maximal subsets of data items sharing a set of features

32

Cluster Analysis

• Calculate distance (similarity) number between all data items (record fields)
• Use clustering to find hierarchy

Field Name   P1  P2  P3  P4
NAME          1   0   0   0
TITLE         1   0   0   0
INITIAL       1   0   0   0
PREFIX        1   0   0   0
NUMBER        0   0   0   1
NUMBER-EXT    0   0   0   1
ZIPCODE       0   0   0   1
STREET        0   0   1   1
CITY          0   1   0   1

33

Dendrogram

[Figure: dendrogram for the field usage table, built up step by step over a distance axis from 0 to 1. Leaves, in order: Name, Title, Initial, Prefix; Number, Nb-Ext, Zipcode; City; Street. Annotation shown when City is added: “Distance is 1”.]
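The clustering step can be reproduced on the field usage table. The sketch below assumes Euclidean distance and single linkage, which the slides do not specify; the function name `clusters_at` is illustrative. It returns the clusters that exist when the dendrogram is cut at a given distance.

```python
from itertools import combinations
from math import dist

# Usage of each record field in programs P1..P4, copied from the usage table.
USAGE = {
    "NAME": (1, 0, 0, 0), "TITLE": (1, 0, 0, 0), "INITIAL": (1, 0, 0, 0),
    "PREFIX": (1, 0, 0, 0), "NUMBER": (0, 0, 0, 1), "NUMBER-EXT": (0, 0, 0, 1),
    "ZIPCODE": (0, 0, 0, 1), "STREET": (0, 0, 1, 1), "CITY": (0, 1, 0, 1),
}

def clusters_at(threshold, items=USAGE):
    """Single-linkage clusters: keep merging while two clusters are within `threshold`."""
    clusters = [frozenset([name]) for name in items]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            # Assumed metric: Euclidean distance between the closest members.
            gap = min(dist(items[x], items[y]) for x in a for y in b)
            if gap <= threshold:
                clusters = [c for c in clusters if c not in (a, b)] + [a | b]
                merged = True
                break
    return set(clusters)

at_zero = clusters_at(0)  # fields with identical usage
at_one = clusters_at(1)   # additionally merges fields at distance 1
```

Under these assumptions, cutting at distance 0 groups the name-related fields and the number-related fields, and cutting at distance 1 adds City and Street to the number-related group.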

40

Dendrogram from Real Data

[Figure: dendrogram over a distance axis from 0 to 2. Leaf labels as grouped in the figure: AmountAccountOfficeName; BankCityIntAccountOfficeType; PaymentKindRelationNr; ChangeDate; TitleCdPrefixInitial; ZipCdCountyCd; StreetNr; MortSeqNrMortNr; CityStreet; Name]

41

Concept Analysis

• Relies on maximal subsets of data items sharing a set of features
• Concept analysis finds a lattice

Field Name   P1  P2  P3  P4
NAME          x
TITLE         x
INITIAL       x
PREFIX        x
NUMBER                    x
NUMBER-EXT                x
ZIPCODE                   x
STREET                x   x
CITY              x       x

42

Concept Lattice

Each concept pairs a set of features with a set of items (field names):
• top: All Variables
• P1: Name, Title, Initial, Prefix
• P4: Number, Nb-Ext, Zipcode, Street, City
• P3 P4: Street
• P2 P4: City
• bottom: P1 P2 P3 P4
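The concepts in the lattice can be computed directly from the feature table. The sketch below uses a brute-force closure over all feature subsets, which is fine for four features but is not an efficient lattice construction algorithm; the names `extent`, `intent`, and `concepts` are illustrative.

```python
from itertools import combinations

# Which programs (features) use which field (item), copied from the feature table.
TABLE = {
    "NAME": {"P1"}, "TITLE": {"P1"}, "INITIAL": {"P1"}, "PREFIX": {"P1"},
    "NUMBER": {"P4"}, "NUMBER-EXT": {"P4"}, "ZIPCODE": {"P4"},
    "STREET": {"P3", "P4"}, "CITY": {"P2", "P4"},
}
FEATURES = {"P1", "P2", "P3", "P4"}

def extent(features):
    """All items that have every feature in `features`."""
    return frozenset(item for item, has in TABLE.items() if features <= has)

def intent(items):
    """All features shared by every item in `items`."""
    return frozenset(f for f in FEATURES if all(f in TABLE[i] for i in items))

def concepts():
    """Enumerate (items, features) pairs that are maximal in both directions."""
    found = set()
    for size in range(len(FEATURES) + 1):
        for combo in combinations(sorted(FEATURES), size):
            items = extent(set(combo))
            found.add((items, intent(items)))
    return found

lattice = concepts()
```

For this table the enumeration yields six concepts, matching the nodes listed for the lattice: top, P1, P4, P3 P4, P2 P4, and bottom.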

46

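[Figure: concept lattice in which each concept is labelled with its fields and program numbers; annotations mark concepts with one field and concepts with many fields]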

47

System Views

• Grouping method based on feature table
  – Metrics or subset based
• Find alternative system views:
  – Kruchten’s logical view
  – Object-based view on procedural code
  – Starting point for “objectification”
• Keep “human in the loop”

48

Types

• A type describes a set of possible values
• A type groups variables
• A type encapsulates representation
• Parameter types provide interfaces
• Types provide component connectors

Types are architectural structures

49

But types are already available...

Not in a legacy language like Cobol:
• Data division declares variables + structure
• No separation between type/variable.
• Repeated structure per variable.
• No enumeration types, no ranges.
• No parameters for sections

Similar problems with other legacy languages

50

Automatic Type Inference

• Group variables based on usage
• Initially: each variable has a unique primitive type
• From statements infer equivalencies:
  – Assignment: v := e
  – Comparison: e1 > e2
  – Computation: e1 + e2

Example

    DATA DIVISION.
    01 PERSON.
       03 INITIALS      PIC X(05).
       03 NAME          PIC X(27).
       03 STREET        PIC X(18).
    01 TAB000.
       03 A00-NAME-PART.
          05 A00-POS    PIC X(01) OCCURS 40.
       03 A00-MAX       PIC S9(03) COMP-3 VALUE 40.
       03 A00-FILLED    PIC S9(03) COMP-3 VALUE 0.
    01 N000.
       03 N100          PIC S9(03) COMP-3 VALUE 0.
    ...
    PROCEDURE DIVISION.
    R210-INITIAL SECTION.
        MOVE INITIALS TO A00-NAME-PART.
        PERFORM R300-COMPOSE-NAME.
    R300-COMPOSE-NAME SECTION.
        ...
        PERFORM UNTIL N100 > A00-MAX
        ...
        IF A00-FILLED = N100
        ...

• N100, A00-MAX and A00-FILLED are equivalent
• INITIALS is a subtype of A00-NAME-PART
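The two inferences in the example can be sketched with a union-find structure. This is an illustration under stated assumptions, not the inference algorithm from the talk: comparisons merge equivalence classes, and a MOVE is recorded as a subtype relation as in the example. The class `TypeClasses` and its methods are hypothetical names.

```python
class TypeClasses:
    """Equivalence classes of variables plus recorded subtype relations."""

    def __init__(self):
        self.parent = {}         # union-find forest over variable names
        self.subtype_of = set()  # (source, target) pairs from assignments

    def find(self, var):
        """Return the representative of the class containing `var`."""
        self.parent.setdefault(var, var)
        while self.parent[var] != var:
            self.parent[var] = self.parent[self.parent[var]]  # path halving
            var = self.parent[var]
        return var

    def equate(self, a, b):
        """Comparison or computation on a and b: both get the same type."""
        self.parent[self.find(a)] = self.find(b)

    def assign(self, source, target):
        """MOVE source TO target: the source type is a subtype of the target type."""
        self.find(source)
        self.find(target)
        self.subtype_of.add((source, target))

    def same_type(self, a, b):
        return self.find(a) == self.find(b)

types = TypeClasses()
types.equate("N100", "A00-MAX")            # PERFORM UNTIL N100 > A00-MAX
types.equate("A00-FILLED", "N100")         # IF A00-FILLED = N100
types.assign("INITIALS", "A00-NAME-PART")  # MOVE INITIALS TO A00-NAME-PART
```

Equivalence is transitive here, so A00-MAX and A00-FILLED end up in the same class even though no statement mentions both.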

56

System Level Types

• Propagate types across modules
  – Calls
  – Database operations
  – File I/O
  – Include files / copybooks
• Lift type dependencies to package level

57

Type Inference Case Study (I)

• 100,000 lines Cobol / CICS system
• First param of all batch progs: program-fields
  – info required for restart and error recovery
  – literals in subroutine field: all progs
• First param of all online progs: dfhcommarea
  – mapped to appropriate record --> type

58

Type Inference Case Study (II)

• Programs with integer parameter
  – Used as enumeration type
  – Value represents function to be performed
  – Program as package
• Parameter links
  – Formal parameters of same type
  – RA31.6 = RA36.4
• Relations between copybooks

59

Presentation of Results


60

Presentation Desiderata

• Show multiple structures
• Show relationships between structures
• Multiple levels of abstraction
  – Zoom in, zoom out
• Visual as well as textual information
  – Graph visualization
  – Browsing and searching

61

Presenting Architectures Using Hypertext

• Hyperlinked pages for system elements
• Multiple structures, multiple views
• Backbone: system hierarchy, sources
• Abstractions become additional navigation structures
• Text & clickable graphs

62

Types of navigation

• Vertical browsing
  – supported by hierarchical structures
  – zoom into more detailed level
  – system, subsystem, program, …, source
• Horizontal browsing
  – supported by graph-like structures
  – find related elements on the same abstraction level
  – called programs, variables of same type, etc.
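Both kinds of navigation can be generated from the same extracted relations. The sketch below renders one HTML page per system element, with vertical links taken from a hierarchy and horizontal links taken from a graph. The function `page`, the link labels, and the element names are hypothetical and do not reflect the output of any tool named in these slides.

```python
from html import escape

def page(name, parent, children, related):
    """Render one HTML page for a system element with both kinds of links."""
    def links(names):
        return ", ".join(f'<a href="{escape(n)}.html">{escape(n)}</a>' for n in names)

    lines = [f"<h1>{escape(name)}</h1>"]
    if parent:
        # Vertical browsing: back up the hierarchy.
        lines.append(f"<p>Up: {links([parent])}</p>")
    if children:
        # Vertical browsing: zoom into the more detailed level.
        lines.append(f"<p>Zoom in: {links(children)}</p>")
    if related:
        # Horizontal browsing: elements on the same abstraction level.
        lines.append(f"<p>Related: {links(related)}</p>")
    return "\n".join(lines)

# Hypothetical program page: part of subsystem "payroll", calls TAX01.
html = page("PAY01", parent="payroll", children=[], related=["TAX01"])
```

Generating every page from the repository keeps the documentation consistent with the extracted facts whenever the sources change.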

63

Presentation Challenges

• Handling abstractions not visible in code
• Giving abstractions a meaningful name
  – e.g., name for inferred type
• Defining starting points for browsing
  – lists of types, programs, copybooks, words, lits
  – add cross-cutting hyperlinks on all levels

64

Advanced Documentation Generation

• DocGen
  – Provide technical documentation
  – Used for all ABN AMRO Cobol sources
  – Customizable product line
• TypeExplorer
  – Include inferred types as navigation structure
  – Advance level of abstraction

65

Tool Sets

• Rigi (Victoria)
• Bauhaus (Stuttgart)
• Dali (SEI)
• Portable Bookshelf (Toronto)
• DocGen (Amsterdam)

Extract, Query, Abstract, Present, Visualize, Browse, Search

66

SWARM / WCRE 2001

• The UML
• Rationale recovery
• Pattern-oriented software architecture
• Architecture description languages
• Dynamic analysis
• Software product lines
• Software architecture “user’s guide”

67

Summary

• Extract, abstract, present
• Multiple structures
• Zoom in/out, switch abstraction levels
• Browse / hypertext
• Compiler construction technology
• Active area of research
• Experiment in your projects

68

Further Reading (I)

A. van Deursen and T. Kuipers. Identifying Objects using Cluster & Concept Analysis. ICSE’99.

A. van Deursen and T. Kuipers. Building Documentation Generators. ICSM’99.

A. van Deursen and L. Moonen. Exploring Legacy Systems Using Types. WCRE’00.

A. van Deursen. Software Architecture Recovery and Modeling. WCRE’2001 workshop report. Applied Computing Review, ACM, 2002.

69

Further Reading (II)

www.cwi.nl/~arie/papers/

www.cwi.nl/~arie/swarm2001/

www.program-transformation.org