
Analysis - Improving GP with Statistics

Chap. 8

Presenter: 김정집


0 Introduction

- The gap between theory and practice is wide
- Measuring is significant
  - dynamic run control
  - data preprocessing
  - significance or meaning of a run
- Online analysis tools
- Offline analysis tools


1 Statistical Tools for GP

1.1 Basic Statistics Concepts

- Statistical population: the entire group of instances (measured and unmeasured)
- Sample: a subset of a statistical population
- Statistical significance level: a percentage value, chosen for judging the value of a measurement


1.2 Basic Tools for GP

- Confidence interval
  - the range around a measured occurrence of an event in which the statistician estimates that a specified portion of future measurements of that same event will fall
- Example
  - 3000 programs per population, avg. individual is 200 nodes: 3000 * 200 -> 600,000 nodes
  - suppose that 600 out of 1000 sampled nodes were introns
  - the 95% confidence interval is 57%~63%
  - with 2000 sampled nodes: 59%~61%
  - with 30 sampled nodes: 40%~78%
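The slide does not say how its intervals were computed. A minimal sketch, assuming the usual normal approximation for a sampled proportion, is given here; the function name `proportion_ci` is hypothetical. It reproduces the 57%~63% interval for 1000 sampled nodes; the intervals quoted for the other sample sizes may have been computed differently and need not match exactly.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(hits, n, confidence=0.95):
    """Normal-approximation confidence interval for a sampled proportion."""
    p = hits / n
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# 600 of 1000 sampled nodes were introns, as in the example above.
low, high = proportion_ci(600, 1000)
print(f"95% interval: {low:.1%} to {high:.1%}")
```

The interval narrows roughly with the square root of the sample size, which is why very small samples give wide, nearly useless intervals.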


Correlation Measures

- Correlation coefficient
  - a coefficient of 0.8 means that 0.8^2 = 64% of the variance in one variable may be explained by a linear relationship with the other variable
  - + : increasing values of the first variable are related to increasing values of the second variable
- Student's t-test
  - t >= 2 : the two variables are related at the 95% confidence level or better
- Use of correlation analysis in GP
  - ex) relationship between mutation rate and performance
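As an illustration of the mutation rate/performance example, the sketch below computes a Pearson correlation coefficient and the matching t statistic, t = r * sqrt((n - 2) / (1 - r^2)). The data values and the function names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))


# Hypothetical data: mutation rate setting of five runs and the performance reached.
mutation_rate = [1, 2, 3, 4, 5]
performance = [2, 4, 5, 4, 5]
r = pearson_r(mutation_rate, performance)
t = t_statistic(r, len(mutation_rate))
print(f"r = {r:.3f}, r^2 = {r * r:.3f}, t = {t:.2f}")
```

The t >= 2 rule of thumb holds for reasonably large samples. With only five points (three degrees of freedom) the 95% critical value is about 3.18, so a real analysis would need more runs than this toy data set.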


Testing Propositions

- Multiple regression
  - a more sophisticated technique than the simple correlation coefficient
- Testing propositions
  - F-test: tests whether a proposition is not false
- Caveat
  - statistics has a number of assumptions
  - in practice, the tests work pretty well even if the assumptions are not met


2. Offline Preprocessing and Analysis

- Task of the researcher
  - select data series and data instances
  - determine transformations on the data
- Preprocessing and analysis
  - preprocessing to meet input representation constraints
  - preprocessing to extract useful information from the data, to enable the machine learning system to learn
  - analyzing the data to select a training set


2.1 Feature Representation Constraints

- Representation of features
  - ANN: [-1, 1]
  - Boolean system: 0 or 1, true or false
  - GP
    - great freedom of representation of the features
    - can accept inputs that can be handled by the computer language


2.2 Feature Extraction

- Feature extraction
  - extract useful types of information from the raw data
  - filter out noise
- Principal Components Analysis (PCA)
  - purpose: reducing redundancy in raw data
  - extracts the useful variation from several partially correlated data series and condenses that information into fewer but completely uncorrelated data series
  - not an automatic process
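To make the idea of condensing correlated series concrete, here is a minimal sketch of PCA restricted to two data series, where the eigen-decomposition of the 2x2 covariance matrix has a closed form. The function name `pca_2d` and the data are hypothetical, and the sketch assumes the series are not both constant. Real applications with many series would normally rely on a numerical library.

```python
from math import sqrt


def pca_2d(xs, ys):
    """First principal component of two series via the 2x2 covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    a = sum(u * u for u in dx) / (n - 1)              # var(x)
    c = sum(v * v for v in dy) / (n - 1)              # var(y)
    b = sum(u * v for u, v in zip(dx, dy)) / (n - 1)  # cov(x, y)
    # Eigenvalues of [[a, b], [b, c]].
    mid = (a + c) / 2
    spread = sqrt(((a - c) / 2) ** 2 + b * b)
    lam1, lam2 = mid + spread, mid - spread
    # Unit eigenvector belonging to the larger eigenvalue.
    if b != 0:
        vx, vy = b, lam1 - a
    else:
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = sqrt(vx * vx + vy * vy)
    vx, vy = vx / norm, vy / norm
    # Project the centered data onto the first component.
    scores = [u * vx + v * vy for u, v in zip(dx, dy)]
    explained = lam1 / (lam1 + lam2)
    return (vx, vy), scores, explained


# Hypothetical, strongly correlated input series.
series_a = [1.0, 2.0, 3.0, 4.0, 5.0]
series_b = [2.1, 3.9, 6.2, 7.8, 10.1]
direction, scores, explained = pca_2d(series_a, series_b)
print(direction, f"{explained:.1%} of the variance kept in one series")
```

Deciding how many components to keep, and whether the discarded variation was really noise, still takes judgment, which is one reading of "not an automatic process".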


Extraction of Periodic Information in Time Series Data

- Simple techniques
  - simple or exponential moving averages (SMAs or EMAs)
    - SMA: serves as a sort of low-pass filter -> lag
  - discrete Fourier transform
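The three techniques named above can be sketched in a few lines each. The function names, window size, smoothing factor, and test signal are all hypothetical, and the Fourier transform is the naive O(n^2) form rather than an FFT.

```python
import cmath
from math import cos, pi


def sma(series, window):
    """Simple moving average; output starts once a full window is available."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]


def ema(series, alpha):
    """Exponential moving average with smoothing factor alpha in (0, 1]."""
    out = [series[0]]
    for x in series[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out


def dft_magnitudes(series):
    """Magnitudes of the naive discrete Fourier transform."""
    n = len(series)
    return [abs(sum(x * cmath.exp(-2j * pi * k * t / n)
                    for t, x in enumerate(series)))
            for k in range(n)]


# Hypothetical series: exactly two full cycles over eight samples.
signal = [cos(2 * pi * 2 * t / 8) for t in range(8)]
mags = dft_magnitudes(signal)
# Strongest frequency bin, ignoring the constant (k = 0) term.
dominant = max(range(1, len(signal) // 2 + 1), key=lambda k: mags[k])
print(sma([1, 2, 3, 4, 5], 3), ema([1, 2, 3], 0.5), dominant)
```

The SMA output is shorter than its input and each value summarizes the preceding window, which is where the lag comes from.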


2.3 Analysis of Input Data

- Selecting a training set
  - how to choose among input series
  - how to choose training instances
- Choosing among input series
  - meta-learning approach
  - correlation coefficients between each potential input and the output -> narrow down which inputs to use
  - correlation coefficients between the potential inputs -> grouping inputs
  - try different runs with different combinations of variables -> select the variables that are associated with good runs


Choosing Training Instances

- Data mining
  - many more training instances are available than a GP system could possibly digest
- Approach
  - select a random sample of training instances
  - calculate the approximate sample size
  - make sure the random sample that is picked matches the distribution being sampled closely
  - the GP system is programmed to pick a new small training set
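One way to read the approach above is: draw a random sample, check it against the full data, and redraw if it is unrepresentative. The sketch below does this with a deliberately simple criterion (the sample mean must lie within a fraction of a standard deviation of the full mean). That criterion, the function name, and the synthetic data pool are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def draw_matching_sample(values, size, tolerance=0.1, max_tries=100, seed=0):
    """Draw random samples until one has a mean close to the full data's mean.

    'Close' means within `tolerance` standard deviations of the full mean;
    this criterion is an illustrative assumption, not a standard test.
    """
    rng = random.Random(seed)
    full_mean, full_sd = mean(values), pstdev(values)
    for _ in range(max_tries):
        sample = rng.sample(values, size)
        if abs(mean(sample) - full_mean) <= tolerance * full_sd:
            return sample
    raise RuntimeError("no sample matched the full distribution closely enough")


# Hypothetical pool of 10,000 training targets, far more than one run would use.
generator = random.Random(1)
pool = [generator.gauss(0, 1) for _ in range(10000)]
training_set = draw_matching_sample(pool, 200)
print(len(training_set), mean(training_set))
```

Calling the function again with a different `seed` gives a new small training set drawn from the same pool.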


3 Offline Postprocessing

3.1 Measurement of Processing Effort

- Processing effort: the number of individuals that have to be processed in order to find a solution
- Instantaneous probability: the probability that a certain run with M individuals generates a solution in generation i
- Success probability: the probability that one obtains a solution for the given problem if one performs a run over i generations


- the probability of finding a solution by generation i, using R runs
- how many runs do we need to solve a problem with a certain probability z?
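The formulas for this slide did not survive in the text. A common formulation, assumed here, is that R independent runs, each succeeding by generation i with probability P, find a solution with probability z = 1 - (1 - P)^R, so R = ceil(log(1 - z) / log(1 - P)), and the processing effort is M * (i + 1) * R individuals. The function names and example numbers are hypothetical.

```python
from math import ceil, log


def runs_needed(p_success, z=0.99):
    """Independent runs R needed so that at least one succeeds with probability z.

    Derived from z = 1 - (1 - P)^R, where P is the success probability of a
    single run by generation i.
    """
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    if p_success == 1:
        return 1
    return ceil(log(1 - z) / log(1 - p_success))


def processing_effort(pop_size, generation, p_success, z=0.99):
    """Individuals processed: M * (i + 1) * R, counting generation 0."""
    return pop_size * (generation + 1) * runs_needed(p_success, z)


# Hypothetical numbers: M = 500, half of all runs have succeeded by generation 10.
print(runs_needed(0.5), processing_effort(500, 10, 0.5))
```

In practice P is estimated from a batch of runs, so the resulting effort is itself only an estimate.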


3.2 Trait Mining

- Code generated by GP
  - problem-relevant but redundant code, and code irrelevant to the problem at hand
- Avoiding useless code
  - restrict the size and complexity of the programs
- Gene banking
  - keeps book on all expressions evolved so far during a GP run
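As a rough illustration of gene banking, the sketch below counts every subexpression seen in the programs recorded so far. The nested-tuple tree encoding and the `GeneBank` class are hypothetical; the slide does not describe how the bookkeeping is actually implemented.

```python
from collections import Counter


def subtrees(tree):
    """Yield every subexpression of a tree given as nested tuples (op, child, ...)."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from subtrees(child)


class GeneBank:
    """Keeps a count of every expression seen so far during a run."""

    def __init__(self):
        self.counts = Counter()

    def record(self, program):
        self.counts.update(subtrees(program))

    def most_common(self, n=5):
        # Skip bare terminals such as "x"; only compound expressions are reported.
        compound = Counter({e: c for e, c in self.counts.items()
                            if isinstance(e, tuple)})
        return compound.most_common(n)


bank = GeneBank()
bank.record(("+", ("*", "x", "x"), "x"))   # x*x + x
bank.record(("-", ("*", "x", "x"), "1"))   # x*x - 1
print(bank.most_common(1))
```

Expressions that recur across many individuals are candidates for the traits that trait mining tries to identify.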


4 Analysis and Measurement of Online Data

4.1 Online Data Analysis

- monitor the transition from randomness to stability
- highlight how the transition takes place
- raise the possibility of being able to control GP runs through feedback from the online measurements


4.2 Measurement of Online Data

- Generational online measurement
  - population as a whole
  - avg. fitness, percentage of the population that is comprised of introns, ...
- Steady-state online measurement


4.3 Survey of Available Online Tools

- Fitness
  - best fitness
  - avg. fitness
  - variance of the fitness


Diversity

- Diversity
  - genotypic diversity: structural difference
  - phenotypic diversity: behavioral difference
- Measuring genotypic diversity
  - no quality (fitness) information contained
  - based on a comparison of the structure of the individuals only


Edit distance

- Edit distance: the number of elementary substitution operations necessary to traverse the search space from one program to another
- Fixed length genomes
- Tree genomes
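For fixed length genomes the edit distance reduces to counting differing positions (the Hamming distance). For tree genomes the slide gives no formula, so the sketch below uses one simple assumption: overlay the two trees from the root, count positions whose labels differ, and count every node of a subtree that has no counterpart. The nested-tuple encoding and function names are hypothetical.

```python
def hamming_distance(a, b):
    """Substitutions needed to turn one fixed-length genome into another."""
    if len(a) != len(b):
        raise ValueError("fixed-length genomes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def size(tree):
    """Number of nodes in a tree given as nested tuples (label, child, ...)."""
    if isinstance(tree, tuple):
        return 1 + sum(size(c) for c in tree[1:])
    return 1


def overlay_distance(t1, t2):
    """Overlay two trees from the root and count the positions that differ.

    A subtree with no counterpart in the other tree counts once per node.
    """
    label1 = t1[0] if isinstance(t1, tuple) else t1
    label2 = t2[0] if isinstance(t2, tuple) else t2
    kids1 = t1[1:] if isinstance(t1, tuple) else ()
    kids2 = t2[1:] if isinstance(t2, tuple) else ()
    dist = int(label1 != label2)
    for c1, c2 in zip(kids1, kids2):
        dist += overlay_distance(c1, c2)
    # Children present on one side only.
    for extra in kids1[len(kids2):] + kids2[len(kids1):]:
        dist += size(extra)
    return dist


print(hamming_distance([1, 0, 1, 1], [1, 1, 1, 0]))
print(overlay_distance(("+", ("*", "x", "x"), "y"),
                       ("+", ("-", "x", "y"), "y")))
```

Averaging such a distance over pairs of individuals gives one possible genotypic diversity measure for the population.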


Phenotypic Diversity

- Fitness variance
- Fitness histograms
- Entropy
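The three measures listed above can be computed from the fitness values of one population. The sketch assumes entropy means Shannon entropy over the histogram bins; the bin width, fitness values, and function names are hypothetical.

```python
from collections import Counter
from math import log2
from statistics import pvariance


def fitness_histogram(fitnesses, bin_width):
    """Count individuals per fitness bin; bin index is floor(fitness / bin_width)."""
    return Counter(int(f // bin_width) for f in fitnesses)


def entropy(histogram):
    """Shannon entropy (in bits) of the share of the population in each bin."""
    total = sum(histogram.values())
    return -sum((c / total) * log2(c / total) for c in histogram.values())


# Hypothetical fitness values of an eight-individual population.
population_fitness = [0.5, 0.7, 1.2, 1.9, 2.1, 2.4, 3.3, 3.8]
hist = fitness_histogram(population_fitness, bin_width=1.0)
print(pvariance(population_fitness), dict(hist), entropy(hist))
```

A population spread evenly over many bins has high entropy, while one that has converged to a single fitness value has entropy zero.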


Measuring Operator Effects

- Crossover effects
  - compare the avg. fitness of both parents with the avg. fitness of both offspring
  - or compare the fitness of children and parents one by one
  - formal way
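The formal definition is not preserved in the text. One plausible version, assumed here, measures the relative change between the parents' and the offspring's average fitness and labels each event as constructive, neutral, or destructive. The sketch treats fitness as an error to be minimized; the 2.5% neutral band and the function name are hypothetical choices.

```python
def crossover_effect(parent_fitness, offspring_fitness, neutral_band=0.025):
    """Classify one crossover event by the relative change in average fitness.

    Assumes fitness is an error to be minimized and that the parents'
    average fitness is non-zero. The neutral band is an arbitrary
    illustrative threshold.
    """
    before = sum(parent_fitness) / len(parent_fitness)
    after = sum(offspring_fitness) / len(offspring_fitness)
    change = (before - after) / before  # positive: offspring are better
    if change > neutral_band:
        return "constructive", change
    if change < -neutral_band:
        return "destructive", change
    return "neutral", change


print(crossover_effect([10.0, 12.0], [15.0, 18.0]))
print(crossover_effect([10.0, 12.0], [8.0, 9.0]))
print(crossover_effect([10.0, 12.0], [11.0, 11.0]))
```

Tallying these labels over a run shows how the share of destructive events changes as the population evolves.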


Intron Measurements

- Intron counting: the process of ascertaining the number of introns
- Intron counting in tree structures
  - flag: indicates whether a node has the same output as one of its inputs
- Intron counting in linear genomes
  - replace an instruction with a NOP instruction; if there was no change for any of the fitness cases, classify this instruction as an intron
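The NOP test for linear genomes can be sketched with a toy register machine. The three-register instruction format, the example program, and the function names are hypothetical. The sketch also derives the effective length as total length minus intron length.

```python
NOP = ("nop", 0, 0, 0)


def run(program, x):
    """Execute a linear program on three registers; r0 holds input and output."""
    r = [x, 0.0, 0.0]
    for op, dst, a, b in program:
        if op == "add":
            r[dst] = r[a] + r[b]
        elif op == "sub":
            r[dst] = r[a] - r[b]
        elif op == "mul":
            r[dst] = r[a] * r[b]
        # "nop" leaves the registers untouched.
    return r[0]


def find_introns(program, fitness_cases):
    """Indices of instructions whose replacement by NOP changes no output."""
    baseline = [run(program, x) for x in fitness_cases]
    introns = []
    for i in range(len(program)):
        patched = program[:i] + [NOP] + program[i + 1:]
        if [run(patched, x) for x in fitness_cases] == baseline:
            introns.append(i)
    return introns


# Hypothetical program computing 2 * x * x, padded with two useless instructions.
program = [
    ("add", 1, 0, 0),  # r1 = x + x
    ("add", 2, 2, 2),  # r2 = 0 + 0, never used
    ("mul", 0, 1, 0),  # r0 = r1 * r0
    ("sub", 1, 1, 1),  # r1 = 0, after the output is already computed
]
introns = find_introns(program, fitness_cases=[1.0, 2.0, 3.0])
effective_length = len(program) - len(introns)
print(introns, effective_length)
```

Because instructions are tested one at a time, groups of instructions that only matter together can be misjudged, so the count is an approximation.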


Compression Measurement

- Compression
  - the effective length of an individual often is reduced during evolution
  - effective length = total length - intron length


Node Use and Block Activation

- Node use
  - how often individual nodes are used in the present generation
  - measured by incrementing a counter associated with a node
- Block activation
  - the number of times the root node of a block is executed
- Salient block
  - code which influences fitness evaluation


Real Time Run Control Using Online Measurements

- PADO
  - a meta-learning module changed the crossover operator itself during the run
- AIMGP
  - termination condition: when destructive crossover fell below 10% of total events
  - doubled speed
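A termination rule of the kind described for AIMGP might be sketched as follows. Measuring the share of destructive events over a sliding window of recent crossover events is an assumption, as are the window size and the class name; only the 10% threshold comes from the slide.

```python
from collections import deque


class DestructiveCrossoverMonitor:
    """Signals termination when destructive crossover becomes rare.

    The sliding window is a hypothetical choice; the 10% threshold is the
    one quoted above.
    """

    def __init__(self, window=100, threshold=0.10):
        self.events = deque(maxlen=window)
        self.threshold = threshold

    def record(self, destructive):
        self.events.append(bool(destructive))

    def should_terminate(self):
        # Wait for a full window so early noise cannot stop the run.
        if len(self.events) < self.events.maxlen:
            return False
        return sum(self.events) / len(self.events) < self.threshold


monitor = DestructiveCrossoverMonitor(window=10)
for flag in [True] * 5 + [False] * 10:
    monitor.record(flag)
print(monitor.should_terminate())
```

The run loop would call `record` after every crossover event and stop as soon as `should_terminate` returns true.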


5. Generalization and Induction

- Generalization problem
  - sampling error
  - overfitting
    - complexity of the learned solution
    - amount of time spent training
    - size of the training set


5.1 An Example of Overfitting and Poor Generalization

Training function

error function


5.2 Dealing with Generalization Issues

- Training set and test set
  - the best individual from the training set is run on the test set
  - which is the best individual of the training set?
    - the one with the highest fitness may be overfitted to the training data
- Adding a new test set
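One way to act on these points is to choose among candidate individuals by their error on data not used for training, and to keep a further, untouched test set for the final estimate. The sketch below shows only the selection step; the term "validation" for the first held-out set, the candidate tuples, and the error values are hypothetical.

```python
def select_by_validation(candidates):
    """Pick the candidate with the lowest validation error.

    Each candidate is (name, training_error, validation_error).
    """
    return min(candidates, key=lambda c: c[2])


# Hypothetical best-of-run individuals from three runs.
candidates = [
    ("run_a", 0.02, 0.30),  # best on training data, poor on unseen data
    ("run_b", 0.05, 0.08),
    ("run_c", 0.10, 0.12),
]
best_on_training = min(candidates, key=lambda c: c[1])
chosen = select_by_validation(candidates)
print(best_on_training[0], chosen[0])
```

Because the held-out set was used to choose the individual, its error is no longer an unbiased estimate, which is the reason for adding a new test set.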


6 Conclusion

- Observe, measure, test
  - tools to estimate the predictive value of GP models
  - tools to improve GP system or to test theories about