Analysis - Improving GP with Statistics
Chap. 8
Presenter: 김정집

0 Introduction
  Gap between theory and practice is wide
  Measuring is significant
    dynamic run control
    data preprocessing
    significance or meaning of a run
  Online analysis tools
  Offline analysis tools

1 Statistical Tools for GP
1.1 Basic Statistics Concepts
  Statistical population
    the entire group of instances (measured and unmeasured)
  Sample
    a subset of a statistical population
  Statistical significance level
    a percentage value chosen for judging the value of a measurement

1.2 Basic Tools for GP
  Confidence interval
    the range around a measured occurrence of an event within which the statistician estimates that a specified portion of future measurements of that same event will fall
  Example
    3000 programs per population, average individual of 200 nodes: 3000 * 200 = 600,000 nodes
    suppose that 600 out of 1000 sampled nodes were introns
    95% confidence interval: 57%~63%
    with 2000 sampled nodes: 59%~61%
    with 30 sampled nodes: 40%~78%
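
The 1000-node figure in the example can be checked with a short sketch. It assumes the normal approximation for a proportion (z = 1.96 for a 95% level). That choice is an assumption made here and may not be the method behind the other figures on the slide. The function name is hypothetical.

```python
import math

def proportion_confidence_interval(hits, n, z=1.96):
    """Normal-approximation interval for a proportion (assumed method).

    hits: sampled nodes that were introns; n: total nodes sampled;
    z: 1.96 corresponds to roughly a 95% confidence level.
    """
    p = hits / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    # Clamp to [0, 1], since a proportion cannot leave that range.
    return max(0.0, p - half_width), min(1.0, p + half_width)

low, high = proportion_confidence_interval(600, 1000)
print(f"{low:.0%} to {high:.0%}")  # prints "57% to 63%"
```

A smaller sample widens the interval, which is why a 30-node sample says little about the intron share of the whole population.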

Correlation Measures
  Correlation coefficient
    ranges from -1 to 1; a value of 0.8 indicates a strong relation (the squared coefficient, 0.64, is the share of the variation in one variable that may be explained by variations in the other)
    + : increasing values of the first variable are related to increasing values of the second variable
  Student's t-test
    t >= 2: the two variables are related at the 95% confidence level or better
  Use of correlation analysis in GP
    e.g. the relationship between mutation rate and performance
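
As a minimal sketch of these two measures, the code below computes the Pearson correlation coefficient and the matching t statistic, t = r * sqrt((n - 2) / (1 - r^2)). The mutation-rate and fitness numbers are invented purely for illustration. The t >= 2 threshold is a rule of thumb that suits larger samples; small samples need a higher critical value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing whether a correlation r over n pairs is nonzero."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical measurements: one GP run per mutation rate.
mutation_rate = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30]
best_fitness = [0.52, 0.55, 0.61, 0.66, 0.64, 0.70]

r = pearson_r(mutation_rate, best_fitness)
t = t_statistic(r, len(mutation_rate))
print(f"r = {r:.2f}, r^2 = {r * r:.2f}, t = {t:.2f}")
```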

Testing Propositions
  Multiple regression
    a more sophisticated technique than the simple correlation coefficient
  F-test
    tests whether a proposition is not false
  Caveat
    statistical tests rest on a number of assumptions
    in practice, the tests work pretty well even if the assumptions are not met

2 Offline Preprocessing and Analysis
  Task of the researcher
    select data series and data instances
    determine transformations on the data
  Preprocessing and analysis
    preprocessing to meet input representation constraints
    preprocessing to extract useful information from the data to enable the machine learning system to learn
    analyzing the data to select a training set

2.1 Feature Representation Constraints
  Representation of features
    ANN: [-1, 1]
    Boolean system: 0 or 1, true or false
    GP
      great freedom in the representation of the features
      can accept any input that can be handled by the computer language

2.2 Feature Extraction
  Feature extraction
    extract useful types of information from the raw data
    filter out noise
  Principal Components Analysis (PCA)
    purpose: reducing redundancy in the raw data
    extracts the useful variation from several partially correlated data series and condenses that information into fewer but completely uncorrelated data series
    not an automatic process

Extraction of Periodic Information in Time Series Data
  Simple techniques
    simple or exponential moving averages (SMAs or EMAs)
      an SMA serves as a sort of low-pass filter, at the cost of lag
    discrete Fourier transform
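
The two moving averages can be sketched in a few lines. The function names and the smoothing factor `alpha` are illustrative choices, not taken from the slides. The step input at the end shows the lag that smoothing introduces.

```python
def simple_moving_average(series, window):
    """Mean of the last `window` values, one output per full window."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

def exponential_moving_average(series, alpha):
    """Recursive smoothing: s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    if not series:
        return []
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed

# A step from 0 to 1: the averaged series reaches 1 only after a delay.
step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(simple_moving_average(step, 3))
print(exponential_moving_average(step, 0.5))
```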

2.3 Analysis of Input Data
  Selecting a training set
    how to choose among input series
    how to choose training instances
  Choosing among input series
    meta-learning approach
    correlation coefficients between each potential input and the output -> narrow down which inputs to use
    correlation coefficients between the potential inputs -> grouping inputs
    try different runs with different combinations of variables -> select variables that are associated with good runs
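
The two correlation-based steps might look like the following sketch. The series names, the data, and the 0.95 redundancy threshold are all hypothetical. The code ranks candidate inputs by the strength of their correlation with the output, then lists pairs of inputs that correlate so strongly with each other that they could be grouped.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def rank_inputs(inputs, output):
    """Input names ordered from strongest to weakest |r| with the output."""
    return sorted(inputs, key=lambda name: abs(pearson_r(inputs[name], output)),
                  reverse=True)

def redundant_pairs(inputs, threshold):
    """Pairs of inputs whose mutual |r| is at least `threshold`."""
    return [
        (a, b)
        for a, b in combinations(sorted(inputs), 2)
        if abs(pearson_r(inputs[a], inputs[b])) >= threshold
    ]

# Hypothetical candidate input series and target output.
inputs = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5], "c": [4, 1, 3, 2]}
output = [1.1, 2.0, 2.9, 4.2]

print(rank_inputs(inputs, output))    # weakly related inputs come last
print(redundant_pairs(inputs, 0.95))  # candidates for grouping
```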

Choosing Training Instances
  Data mining
    many more training instances are available than a GP system could possibly digest
  Approach
    select a random sample of training instances
      calculate the approximate sample size
      check that the random sample picked closely matches the distribution being sampled
    the GP system is programmed to pick a new small training set

3 Offline Postprocessing
3.1 Measurement of Processing Effort
  Processing effort
    the number of individuals that have to be processed in order to find a solution
  Instantaneous probability
    the probability that a certain run with M individuals generates a solution in generation i
  Success probability
    the probability that one obtains a solution for the given problem if one performs a run over i generations

  The probability of finding a solution by generation i, using R runs
  How many runs do we need to solve a problem with a certain probability z?
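
The formulas for this slide did not survive extraction. The sketch below uses the standard relations for independent runs, which is an assumption about what the slide showed. If a single run succeeds by generation i with probability p, then R runs succeed with probability z = 1 - (1 - p)^R, and reaching a target z takes R = ceil(log(1 - z) / log(1 - p)) runs. The function names are hypothetical.

```python
import math

def success_with_runs(p, runs):
    """Probability that at least one of `runs` independent runs succeeds,
    given single-run success probability p by generation i."""
    return 1.0 - (1.0 - p) ** runs

def runs_needed(p, z):
    """Smallest number of independent runs giving success probability >= z."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if not 0.0 <= z < 1.0:
        raise ValueError("z must be in [0, 1)")
    if p == 1.0:
        return 1
    # At least one run is always required, even for a tiny target z.
    return max(1, math.ceil(math.log(1.0 - z) / math.log(1.0 - p)))

# Example: a run succeeds half the time; target 99% overall success.
print(runs_needed(0.5, 0.99))     # prints 7
print(success_with_runs(0.5, 7))  # prints 0.9921875
```

Multiplying the number of runs by the number of individuals processed per run gives an estimate of the processing effort.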

3.2 Trait Mining
  Code generated by GP
    contains problem-relevant but redundant code, as well as code that is irrelevant to the problem at hand
  Avoiding useless code
    restrict the size and complexity of the programs
  Gene banking
    keeps book on all expressions evolved so far during a GP run

4 Analysis and Measurement of Online Data
4.1 Online Data Analysis
  monitor the transition from randomness to stability
  highlight how the transition takes place
  raise the possibility of being able to control GP runs through feedback from the online measurements

4.2 Measurement of Online Data
  Generational online measurement
    the population as a whole
    average fitness, the percentage of the population that is made up of introns, ...
  Steady-state online measurement

4.3 Survey of Available Online Tools
  Fitness
    best fitness
    average fitness
    variance of the fitness

Diversity
  Genotypic diversity: structural difference
  Phenotypic diversity: behavioral difference
  Measuring genotypic diversity
    contains no quality (fitness) information
    based on a comparison of the structure of the individuals only

Edit Distance
  the number of elementary substitution operations necessary to traverse the search space from one program to another
  Fixed-length genomes
  Tree genomes
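
A rough sketch of both cases follows. For fixed-length genomes the number of substitutions is the Hamming distance. For trees the code uses a simple overlay count: align the two trees from the root, count differing labels, and count every node of an unmatched subtree. That overlay measure is an illustrative assumption, not necessarily the definition used in the source. Trees are written as nested tuples `(label, child, ...)`, a representation chosen here for convenience.

```python
def hamming_distance(a, b):
    """Number of positions at which two fixed-length genomes differ."""
    if len(a) != len(b):
        raise ValueError("genomes must have the same length")
    return sum(x != y for x, y in zip(a, b))

def tree_size(tree):
    """Number of nodes in a tree given as (label, child, child, ...)."""
    _label, *children = tree
    return 1 + sum(tree_size(child) for child in children)

def tree_overlay_distance(t1, t2):
    """Overlay two trees from the root and count the nodes that differ."""
    label1, *kids1 = t1
    label2, *kids2 = t2
    distance = 0 if label1 == label2 else 1
    for c1, c2 in zip(kids1, kids2):
        distance += tree_overlay_distance(c1, c2)
    # Children present in only one tree count with all of their nodes.
    shared = min(len(kids1), len(kids2))
    extra = kids1[shared:] + kids2[shared:]
    distance += sum(tree_size(child) for child in extra)
    return distance

print(hamming_distance("10110", "10011"))  # prints 2

x_plus_xy = ("+", ("x",), ("*", ("x",), ("y",)))  # x + x*y
x_plus_y = ("+", ("x",), ("y",))                  # x + y
print(tree_overlay_distance(x_plus_xy, x_plus_y))  # prints 3
```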

Phenotypic Diversity
  Fitness variance
  Fitness histograms
  Entropy
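
One common way to turn a fitness histogram into a single diversity number is the Shannon entropy of the fitness classes, H = -sum(p_k * log2(p_k)), where p_k is the share of the population in class k. The sketch assumes fitness values are discrete or have already been binned. The function name is hypothetical.

```python
import math
from collections import Counter

def fitness_entropy(fitness_values):
    """Shannon entropy (in bits) of the distribution of fitness classes."""
    counts = Counter(fitness_values)
    total = len(fitness_values)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in counts.values()
    )

print(fitness_entropy([1, 1, 2, 2]))  # two equally common classes: 1.0 bit
print(fitness_entropy([1, 2, 3, 4]))  # four equally common classes: 2.0 bits
```

A population in which every individual has the same fitness has entropy zero, so a falling entropy curve signals loss of phenotypic diversity.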

Measuring Operator Effects
  Crossover effects
    compare the average fitness of both parents with the average fitness of both offspring
    a more formal way: the fitness of children and parents is compared one by one
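
The averaged comparison can be sketched as below. It assumes that higher fitness is better. The `tolerance` band and the three labels are illustrative choices rather than definitions from the slides.

```python
def crossover_effect(parent_fitness, offspring_fitness, tolerance=0.0):
    """Classify one crossover event by comparing average fitness.

    Assumes higher fitness is better; `tolerance` is a hypothetical
    band inside which the change counts as neutral.
    """
    parent_avg = sum(parent_fitness) / len(parent_fitness)
    offspring_avg = sum(offspring_fitness) / len(offspring_fitness)
    change = offspring_avg - parent_avg
    if change > tolerance:
        return "constructive"
    if change < -tolerance:
        return "destructive"
    return "neutral"

# Hypothetical log of (parent fitnesses, offspring fitnesses) per event.
events = [([10, 8], [5, 6]), ([4, 6], [5, 5]), ([1, 2], [3, 4]), ([7, 7], [2, 9])]
labels = [crossover_effect(parents, children) for parents, children in events]
destructive_share = labels.count("destructive") / len(labels)
print(labels, destructive_share)
```

Tracking the destructive share over time gives a single online measurement of how crossover is behaving during a run.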

Intron Measurements
  Intron counting
    the process of ascertaining the number of introns
  Intron counting in tree structures
    a flag on each node indicates whether the node has the same output as one of its inputs
  Intron counting in linear genomes
    replace an instruction with a NOP instruction; if there is no change for any of the fitness cases, classify the instruction as an intron
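
The NOP-replacement test for linear genomes can be sketched on a toy register machine. The instruction format `(op, dst, src)`, the three registers, and the two operations are assumptions made for this example only. Each instruction is replaced by a NOP in turn, and it is flagged as an intron if the output stays the same on every fitness case.

```python
NOP = ("nop", 0, 0)

def run(program, x, n_registers=3):
    """Run a toy linear program; register 0 holds the input and the output."""
    regs = [0.0] * n_registers
    regs[0] = x
    for op, dst, src in program:
        if op == "add":
            regs[dst] += regs[src]
        elif op == "mul":
            regs[dst] *= regs[src]
        elif op != "nop":
            raise ValueError(f"unknown instruction: {op}")
    return regs[0]

def find_introns(program, fitness_cases):
    """Indices of instructions whose replacement by NOP changes no output."""
    baseline = [run(program, x) for x in fitness_cases]
    introns = []
    for index in range(len(program)):
        variant = program[:index] + [NOP] + program[index + 1:]
        if [run(variant, x) for x in fitness_cases] == baseline:
            introns.append(index)
    return introns

program = [
    ("add", 0, 0),  # r0 = 2x
    ("add", 1, 0),  # r1 = 2x
    ("mul", 2, 1),  # r2 = 0 * r1, never read again
    ("add", 0, 1),  # r0 = 4x
]
introns = find_introns(program, [1.0, 2.0, 3.0])
print(introns)                      # prints [2]
print(len(program) - len(introns))  # effective length: prints 3
```

The result is relative to the fitness cases used, and replacing one instruction at a time does not detect groups of instructions that are introns only jointly.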

Compression Measurement
  Compression
    the effective length of an individual is often reduced during evolution
    effective length = total length - intron length

Node Use and Block Activation
  Node use
    how often individual nodes are used in the present generation
    measured by incrementing a counter associated with a node
  Block activation
    the number of times the root node of a block is executed
  Salient block
    code which influences fitness evaluation

Real Time Run Control Using Online Measurements
  PADO
    a meta-learning module changed the crossover operator itself during the run
  AIMGP
    termination condition: when destructive crossover fell below 10% of total events
    doubled speed

5 Generalization and Induction
  Generalization problem
    sampling error
    overfitting
      complexity of the learned solution
      amount of time spent training
      size of the training set

5.1 An Example of Overfitting and Poor Generalization
  Training function
  Error function

5.2 Dealing with Generalization Issues
  Training set and test set
    the best individual from the training set is run on the test set
    which individual is the best on the training set? The one with the highest fitness may be overfitted to the training data.
  Adding a new test set

6 Conclusion
  Observe, measure, test
    tools to estimate the predictive value of GP models
    tools to improve GP systems or to test theories about