Analysis - Improving GP with Statistics, Chap. 8, Presenter: 김정집

1

Analysis - Improving GP with Statistics

Chap. 8

Presenter: 김정집

2

0 Introduction

The gap between theory and practice is wide
Measurement is important for
    dynamic run control
    data preprocessing
    judging the significance or meaning of a run
Online analysis tools
Offline analysis tools

3

1 Statistical Tools for GP
1.1 Basic Statistics Concepts

Statistical population
    the entire group of instances (measured and unmeasured)
Sample
    a subset of a statistical population
Statistical significance level
    a percentage value, chosen for judging the value of a measurement

4

1.2 Basic Tools for GP

Confidence interval
    a range around a measured occurrence of an event in which the statistician estimates that a specified portion of future measurements of that same event will fall
Example
    3000 programs per population, the average individual has 200 nodes: 3000 * 200 = 600,000 nodes
    suppose that 600 out of 1000 sampled nodes were introns
    the 95% confidence interval is 57%~63%
    with 2000 sampled nodes: 59%~61%
    with 30 sampled nodes: 40%~78%
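As a rough illustration of where such an interval comes from, the sketch below uses the normal approximation for a sampled proportion. The function name and the choice of this particular approximation are assumptions, not something stated on the slide. For 600 introns in 1000 sampled nodes it gives roughly 57% to 63%; for other sample sizes its widths may differ somewhat from the figures quoted above.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation interval for a sampled proportion.

    z = 1.96 corresponds to a two-sided 95% confidence level.
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# 600 introns among 1000 sampled nodes, as in the example above
low, high = proportion_confidence_interval(600, 1000)
print(f"{low:.1%} to {high:.1%}")

# A much smaller sample of 30 nodes gives a far wider interval
small_low, small_high = proportion_confidence_interval(18, 30)
print(f"{small_low:.1%} to {small_high:.1%}")
```

The interval narrows only with the square root of the sample size, which is why a sample of 30 nodes says little about the intron rate.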

5

Correlation Measures

Correlation coefficient
    a value of 0.8 indicates a strong relationship; strictly, the squared coefficient (0.8^2 = 0.64) is the share of the variation in one variable that may be explained by variations in the other variable
    + : increasing values of the first variable are related to increasing values of the second variable
Student's t-test
    t-statistic >= 2 : the two variables are related at the 95% confidence level or better
Use of correlation analysis in GP
    e.g. the relationship between mutation rate and performance
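A minimal sketch of this kind of analysis, assuming a Pearson correlation coefficient and the usual t-statistic for testing whether a correlation differs from zero. The mutation rates and fitness values are hypothetical numbers made up for illustration, not measurements from any GP run.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t-statistic for a correlation r measured over n pairs (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical data: mutation rate per run and best fitness reached (higher is better)
mutation_rate = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50]
best_fitness = [0.52, 0.55, 0.61, 0.66, 0.70, 0.69, 0.74, 0.78]

r = pearson_r(mutation_rate, best_fitness)
t = t_statistic(r, len(mutation_rate))
print(f"r = {r:.2f}, r^2 = {r * r:.2f}, t = {t:.2f}")
print("related at roughly the 95% level:", abs(t) >= 2)
```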

6

Testing Propositions

Multiple regression
    a more sophisticated technique than the simple correlation coefficient
F-test
    tests whether a proposition is not false
Caveat
    statistical tests rest on a number of assumptions
    in practice, the tests work fairly well even if the assumptions are not met

7

2. Offline Preprocessing and Analysis

Task of the researcher
    select data series and data instances
    determine transformations on the data
Preprocessing and analysis
    preprocessing to meet input representation constraints
    preprocessing to extract useful information from the data to enable the machine learning system to learn
    analyzing the data to select a training set

8

2.1 Feature Representation Constraints

Representation of features
    ANN: [-1, 1]
    Boolean system: 0 or 1, true or false
    GP
        great freedom in the representation of the features
        can accept any input that can be handled by the computer language

9

2.2 Feature Extraction

Feature extraction
    extract useful types of information from the raw data
    filter out noise
Principal Components Analysis (PCA)
    purpose: reducing redundancy in the raw data
    extracts the useful variation from several partially correlated data series and condenses that information into fewer but completely uncorrelated data series
    not an automatic process

10

Extraction of Periodic Information in Time Series Data

Simple techniques
    simple or exponential moving averages (SMAs or EMAs)
    an SMA serves as a sort of low-pass filter, at the cost of lag
Discrete Fourier transform
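The smoothing and lag of moving averages can be seen on a small example. This is a sketch with assumed function names and a made-up step series; the EMA is seeded with the first value, which is one common convention among several.

```python
def simple_moving_average(series, window):
    """Mean of the last `window` values; defined once a full window is available."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

def exponential_moving_average(series, alpha):
    """EMA with smoothing factor alpha in (0, 1], seeded with the first value."""
    ema = [series[0]]
    for value in series[1:]:
        ema.append(alpha * value + (1.0 - alpha) * ema[-1])
    return ema

# Hypothetical step series: the level jumps from 0 to 1 at index 5
step = [0.0] * 5 + [1.0] * 5

sma = simple_moving_average(step, window=3)
ema = exponential_moving_average(step, alpha=0.5)
print(sma)  # reaches the new level only after a full window has passed
print(ema)  # approaches the new level gradually
```

A longer window (or a smaller alpha) filters more noise but reacts later to a real change in the series.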

11

2.3 Analysis of Input Data

Selecting a training set
    how to choose among input series
    how to choose training instances
Choosing among input series
    meta-learning approach
    correlation coefficients between each potential input and the output -> narrow down which inputs to use
    correlation coefficients between the potential inputs themselves -> group the inputs
    try different runs with different combinations of variables -> select the variables that are associated with good runs

12

Choosing Training Instances

Data mining
    many more training instances are available than a GP system could possibly digest
Approach
    select a random sample of training instances
    calculate the approximate sample size
    make sure that the random sample picked matches the sampled distribution closely
    the GP system is programmed to pick a new small training set

13

3 Offline Postprocessing
3.1 Measurement of Processing Effort

Processing effort
    the number of individuals that have to be processed in order to find a solution
Instantaneous probability
    the probability that a certain run with M individuals generates a solution in generation i
Success probability
    the probability that one obtains a solution for the given problem if one performs a run over i generations

14

the probability of finding a solution by generation i, using R runs

how many runs do we need to solve a problem with a certain probability z?
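A sketch of how the number of runs could be computed, assuming the standard relation for independent runs: if P(M, i) is the success probability of a single run by generation i, then R runs succeed with probability z = 1 - (1 - P(M, i))^R. Solving for R gives the function below. The function names and the effort formula M * (i + 1) * R (generations counted from 0) are assumptions of this sketch.

```python
import math

def runs_required(p_success, z=0.99):
    """Smallest number of independent runs R with 1 - (1 - p_success)**R >= z."""
    if p_success <= 0.0:
        return math.inf  # a run that never succeeds cannot reach probability z
    if p_success >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - z) / math.log(1.0 - p_success))

def processing_effort(population_size, generation, p_success, z=0.99):
    """Individuals to process: M * (i + 1) * R(z), with generations counted from 0."""
    return population_size * (generation + 1) * runs_required(p_success, z)

# Hypothetical values: M = 500, half of all runs have found a solution by generation 19
print(runs_required(0.5, z=0.99))
print(processing_effort(500, 19, 0.5, z=0.99))
```

Evaluating the effort for every generation i and taking the minimum shows at which generation it is cheapest to stop a run and start a new one.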

15

3.2 Trait Mining

Code generated by GP
    contains problem-relevant but redundant code, and code that is irrelevant to the problem at hand
Avoiding useless code
    restrict the size and complexity of the programs
Gene banking
    keeps book on all expressions evolved so far during a GP run

16

4 Analysis and Measurement of Online Data

4.1 Online Data Analysis

monitor the transition from randomness to stability
highlight how the transition takes place
raise the possibility of being able to control GP runs through feedback from the online measurements

17

4.2 Measurement of Online Data

Generational online measurement
    measures the population as a whole
    avg. fitness, percentage of the population that is comprised of introns, ...

Steady-State Online Measurement

18

4.3 Survey of Available Online Tools

Fitness
    best fitness
    average fitness
    variance of the fitness

19

Diversity

Genotypic diversity: structural difference
Phenotypic diversity: behavioral difference
Measuring genotypic diversity
    contains no quality (fitness) information
    based on a comparison of the structure of the individuals only

20

Edit distance

the number of elementary substitution operations necessary to traverse the search space from one program to another

Fixed Length Genomes

Tree Genomes
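A minimal sketch of both cases, assuming Hamming distance for fixed length genomes and, for trees, a simple overlay at the root that counts differing or unmatched nodes. The tree measure is an illustrative simplification, not a minimal tree edit distance, and all names are hypothetical. Trees are written as nested tuples of the form (label, child, child, ...).

```python
from itertools import combinations

def hamming_distance(genome_a, genome_b):
    """Number of positions at which two equal-length genomes differ."""
    if len(genome_a) != len(genome_b):
        raise ValueError("genomes must have the same length")
    return sum(a != b for a, b in zip(genome_a, genome_b))

def tree_size(tree):
    """Number of nodes in a nested-tuple tree."""
    return 1 + sum(tree_size(child) for child in tree[1:])

def tree_overlay_distance(tree_a, tree_b):
    """Overlay two trees at the root; count differing labels and unmatched nodes."""
    distance = 0 if tree_a[0] == tree_b[0] else 1
    children_a, children_b = tree_a[1:], tree_b[1:]
    for child_a, child_b in zip(children_a, children_b):
        distance += tree_overlay_distance(child_a, child_b)
    # Children without a counterpart contribute all of their nodes
    shared = min(len(children_a), len(children_b))
    for extra in children_a[shared:] + children_b[shared:]:
        distance += tree_size(extra)
    return distance

def mean_pairwise_distance(population, distance):
    """Average distance over all pairs: one possible genotypic diversity measure."""
    pairs = list(combinations(population, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

print(hamming_distance("10110", "10011"))
tree_a = ("+", ("x",), ("*", ("x",), ("y",)))
tree_b = ("+", ("y",), ("x",))
print(tree_overlay_distance(tree_a, tree_b))
print(mean_pairwise_distance(["0000", "0011", "1111"], hamming_distance))
```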

21

Phenotypic Diversity

Fitness variance
Fitness histograms

Entropy
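These three measures can be sketched on a list of fitness values. The binning into equal-width fitness classes and the use of Shannon entropy with the natural logarithm are assumptions of this sketch, as are the function names and the sample values.

```python
import math
from collections import Counter

def fitness_variance(fitnesses):
    """Population variance of the fitness values."""
    mean = sum(fitnesses) / len(fitnesses)
    return sum((f - mean) ** 2 for f in fitnesses) / len(fitnesses)

def fitness_histogram(fitnesses, bin_width):
    """Count individuals per fitness class (bins of equal width)."""
    return Counter(math.floor(f / bin_width) for f in fitnesses)

def fitness_entropy(fitnesses, bin_width):
    """Shannon entropy (natural log) of the fitness-class proportions."""
    histogram = fitness_histogram(fitnesses, bin_width)
    total = len(fitnesses)
    return -sum((count / total) * math.log(count / total)
                for count in histogram.values())

# Hypothetical populations: one converged, one spread over four fitness classes
converged = [1.0, 1.0, 1.0, 1.0]
spread = [0.1, 1.1, 2.1, 3.1]
print(fitness_variance(converged), fitness_entropy(converged, bin_width=1.0))
print(fitness_variance(spread), fitness_entropy(spread, bin_width=1.0))
```

Entropy is zero when every individual falls into the same fitness class and is largest when the classes are equally populated.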

22

Measuring Operator Effects

Crossover effects
    comparison of the average fitness of both parents with the average fitness of both offspring
    alternatively, the fitness of children and parents is compared one by one
    formal way
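The comparison of averages can be sketched as follows. This assumes that higher fitness is better and that a relative change within a small tolerance counts as neutral; the tolerance value, the three category names, and the function names are illustrative assumptions rather than the definition used in the source.

```python
from collections import Counter

def classify_crossover(parent_fitnesses, offspring_fitnesses, tolerance=0.025):
    """Classify one crossover event, assuming HIGHER fitness is better."""
    before = sum(parent_fitnesses) / len(parent_fitnesses)
    after = sum(offspring_fitnesses) / len(offspring_fitnesses)
    # Relative change of the average; fall back to the absolute change at zero
    change = (after - before) / abs(before) if before != 0 else after - before
    if change > tolerance:
        return "constructive"
    if change < -tolerance:
        return "destructive"
    return "neutral"

def crossover_effect_rates(events):
    """Share of constructive, neutral and destructive events in a list of events."""
    counts = Counter(classify_crossover(parents, offspring)
                     for parents, offspring in events)
    total = sum(counts.values())
    return {kind: counts[kind] / total
            for kind in ("constructive", "neutral", "destructive")}

# Hypothetical events: ((parent fitnesses), (offspring fitnesses))
events = [((10, 20), (5, 10)), ((10, 20), (15, 15)), ((10, 20), (20, 30))]
print(crossover_effect_rates(events))
```

Tracking these rates per generation gives one of the online measurements that can later be used for run control.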

23

Intron Measurements

Intron counting
    the process of ascertaining the number of introns
Intron counting in tree structures
    flag: indicates whether a node has the same output as one of its inputs
Intron counting in linear genomes
    replace an instruction with a NOP instruction; if there is no change for any of the fitness cases, classify this instruction as an intron
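The NOP replacement test for linear genomes can be sketched on a toy register machine. The instruction format, the three operations, and the convention that register 0 holds both input and output are all assumptions made for this illustration. Each instruction is tested on its own, so groups of instructions that are only jointly removable are not detected by this simple version.

```python
NOP = ("nop", 0, 0, 0)

def run(program, x, n_registers=3):
    """Execute a linear program; register 0 holds the input and the output."""
    regs = [x] + [0.0] * (n_registers - 1)
    for op, dst, a, b in program:
        if op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "sub":
            regs[dst] = regs[a] - regs[b]
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        # "nop" leaves all registers unchanged
    return regs[0]

def intron_positions(program, fitness_cases):
    """Positions whose replacement by NOP changes no output on any fitness case."""
    baseline = [run(program, x) for x in fitness_cases]
    introns = []
    for i in range(len(program)):
        patched = program[:i] + [NOP] + program[i + 1:]
        if [run(patched, x) for x in fitness_cases] == baseline:
            introns.append(i)
    return introns

# Hypothetical program computing 3 * x, with two instructions that have no effect
program = [
    ("add", 1, 0, 0),  # r1 = x + x
    ("mul", 2, 1, 1),  # r2 is never read for the output
    ("sub", 2, 2, 2),  # r2 is never read for the output
    ("add", 0, 0, 1),  # r0 = x + r1
]
cases = [1.0, 2.0, -3.0]
print(intron_positions(program, cases))
```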

24

Compression Measurement

Compression
    the effective length of an individual is often reduced during evolution
    effective length = total length - intron length

25

Node use and Block activation

Node use
    how often individual nodes are used in the present generation
    measured by incrementing a counter associated with a node
Block activation
    the number of times the root node of a block is executed
Salient block
    code that influences fitness evaluation

26

Real Time Run Control Using Online Measurements

PADO
    a meta-learning module changed the crossover operator itself during the run
AIMGP
    termination condition: when destructive crossover fell below 10% of total events
    this doubled the speed

27

5. Generalization and Induction

Generalization problem
    sampling error
    overfitting
        complexity of the learned solution
        amount of time spent training
        size of the training set

28

5.1 An Example of Overfitting and Poor Generalization

Training function

error function


32

5.2 Dealing with Generalization Issues

Training set and test set
    the best individual from the training set is run on the test set
    which is the best individual of the training set?
    the one with the highest fitness may be overfitted to the training data
Adding a new test set
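One way to read the three-set idea is sketched below: several individuals that do well on the training set are compared on a second set, and only the chosen one is finally scored on a third, untouched set. The set names, the mean squared error measure, and the two candidate functions are hypothetical and chosen only to make the effect visible.

```python
def mean_squared_error(model, cases):
    """Average squared difference between model output and target."""
    return sum((model(x) - y) ** 2 for x, y in cases) / len(cases)

def select_by_validation(candidates, validation_cases):
    """Among individuals that did well in training, pick the lowest validation error."""
    return min(candidates, key=lambda model: mean_squared_error(model, validation_cases))

def simple(x):
    return 2.0 * x

def overfitted(x):
    # Matches the training points x = 0, 1, 2 exactly but deviates elsewhere
    return 2.0 * x + x * (x - 1.0) * (x - 2.0)

# Hypothetical data for the target y = 2x
training = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
validation = [(3.0, 6.0), (4.0, 8.0)]
test = [(5.0, 10.0)]

# Both candidates are perfect on the training set
print(mean_squared_error(simple, training), mean_squared_error(overfitted, training))

chosen = select_by_validation([overfitted, simple], validation)
print(chosen.__name__, mean_squared_error(chosen, test))
```

Because the third set plays no part in the selection, the error measured on it is not biased by the choice of individual.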

33

6 Conclusion

Observe, measure, test
    tools to estimate the predictive value of GP models
    tools to improve the GP system or to test theories about it
