Analysis - Improving GP with Statistics
Chap. 8
Presenter: 김정집

0 Introduction

The gap between theory and practice is wide, so measuring is significant.
  dynamic run control
  data preprocessing
  significance or meaning of a run
  online analysis tools
  offline analysis tools

1 Statistical Tools for GP

1.1 Basic Statistics Concepts

statistical population
  the entire group of instances (measured and unmeasured)
sample
  a subset of a statistical population
statistical significance level
  a percentage value, chosen for judging the value of a measurement

1.2 Basic Tools for GP

Confidence interval
  a range around a measured occurrence of an event in which the statistician estimates a specified portion of future measurements of that same event will fall

Example
  3000 programs per population, avg. individual of 200 nodes: 3000 * 200 = 600,000 nodes
  suppose that 600 out of 1000 sampled nodes were introns
  the 95% confidence interval is 57%~63%
  2000 sampled nodes: 59%~61%
  30 sampled nodes: 40%~78%
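The intron example can be checked with a short calculation. The sketch below assumes the normal approximation for a sampled proportion (z = 1.96 for roughly 95%); the slide does not say which method was used, and the function name is made up for illustration. Under this assumption the 1000-node case reproduces 57%~63%, while the other sample sizes come out somewhat different from the slide's figures, which may rest on a different method or on rounding.

```python
import math

def proportion_confidence_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval for a sampled proportion.

    z = 1.96 corresponds to roughly a 95% confidence level (an assumption here).
    """
    p = successes / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Intron share from the slide: 60% of the sampled nodes were introns.
for n in (30, 1000, 2000):
    low, high = proportion_confidence_interval(round(0.6 * n), n)
    print(f"{n:5d} sampled nodes: {low:.1%} ~ {high:.1%}")
```

The interval narrows as the sample grows, which is the point of the example: a few hundred sampled nodes say little, a few thousand say a lot.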

Correlation Measures

Correlation coefficient
  a value of 0.8 is commonly read as 80% of the variation in one variable being explained by variations in the other variable; strictly, the share of variance explained is the square of the coefficient (0.8^2 = 0.64)
  +: increasing values of the first variable are related to increasing values of the second variable

Student's t-test
  t >= 2: the two variables are related at the 95% confidence level or better

Use of correlation analysis in GP
  e.g. the relationship between mutation rate and performance
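As an illustration of the mutation rate/performance case, the following sketch computes a Pearson correlation coefficient and the matching t value, t = r * sqrt((n - 2) / (1 - r^2)). The data are invented for the example and the function names are hypothetical, not taken from any GP system.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """t value for testing a correlation against zero (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Hypothetical data: mutation rate per run and the best fitness reached.
mutation_rate = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.80]
best_fitness = [0.52, 0.55, 0.61, 0.60, 0.68, 0.66, 0.74, 0.79]

r = pearson_r(mutation_rate, best_fitness)
t = t_statistic(r, len(mutation_rate))
print(f"r = {r:.3f}, r^2 = {r * r:.3f}, t = {t:.2f}")
```

The t >= 2 threshold is a rule of thumb for reasonably large samples; with only a handful of runs the critical value is higher, so a t table for n - 2 degrees of freedom is the safer reference.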

Multiple Regression
  a more sophisticated technique than the simple correlation coefficient

Testing Propositions

F-test
  tests whether a proposition is not false

Caveat
  statistical tests rest on a number of assumptions
  in practice, the tests work pretty well even if the assumptions are not met

2 Offline Preprocessing and Analysis

Task of the researcher
  select data series and data instances
  determine transformations on the data

Preprocessing and analysis
  preprocessing to meet input representation constraints
  preprocessing to extract useful information from the data to enable the machine learning system to learn
  analyzing the data to select a training set

2.1 Feature Representation Constraints

Representation of features
  ANN: [-1, 1]
  Boolean system: 0 or 1, true or false
  GP
    great freedom of representation of the features
    can accept inputs that can be handled by the computer language

2.2 Feature Extraction

Feature Extraction
  extract useful types of information from the raw data
  filter out noise

Principal Components Analysis (PCA)
  purpose: reducing redundancy in raw data
  extracts the useful variation from several partially correlated data series and condenses that information into fewer but completely uncorrelated data series
  not an automatic process

Extraction of Periodic Information in Time Series Data
  simple techniques
    simple or exponential moving averages (SMAs or EMAs)
    SMA: serves as a sort of low-pass filter -> lag
  discrete Fourier transform
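The two moving averages named above can be sketched in a few lines. This is a minimal illustration on an invented step series, chosen so that the lag of the filter is visible; the function names and the smoothing factor `alpha` are assumptions for the example.

```python
def simple_moving_average(series, window):
    """Mean of the last `window` values; defined once a full window is available."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]

def exponential_moving_average(series, alpha):
    """EMA with smoothing factor alpha in (0, 1]; larger alpha reacts faster."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed

# Hypothetical series with a step from 0 to 1, to make the lag visible.
step = [0.0] * 5 + [1.0] * 5
print(simple_moving_average(step, window=4))
print(exponential_moving_average(step, alpha=0.5))
```

Both outputs approach 1 only several samples after the step, which is the lag the slide refers to: smoothing out noise is paid for with a delayed reaction to real changes.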

2.3 Analysis of Input Data

Selecting a training set
  how to choose among input series
  how to choose training instances

Choosing among Input Series
  meta-learning approach
  correlation coefficients between each potential input and the output -> narrow down which inputs to use
  correlation coefficients between the potential inputs -> group the inputs
  try different runs with different combinations of variables -> select the variables that are associated with good runs
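The first step, narrowing the inputs by their correlation with the output, might look like the sketch below. The candidate series, their names, and the helper functions are invented for illustration; ranking by absolute correlation is one simple choice, not necessarily the procedure the chapter has in mind.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_inputs(candidates, target):
    """Sort candidate input series by |correlation| with the target, strongest first."""
    scores = {name: correlation(series, target) for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical candidate series and target output.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
candidates = {
    "trend": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
    "inverse": [6.2, 4.9, 4.1, 3.0, 2.2, 0.9],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],
}
for name, r in rank_inputs(candidates, target):
    print(f"{name:8s} r = {r:+.3f}")
```

Applying the same `correlation` function to pairs of inputs would support the grouping step: two inputs that correlate strongly with each other carry largely the same information.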

Choosing Training Instances

Data Mining
  many more training instances are available than a GP system could possibly digest

Approach
  select a random sample of training instances
  calculate the approximate sample size
  check that the random sample that is picked matches the sampled distribution closely
  GP system is programmed to pick a new small training set
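A rough sketch of the sampling approach follows. The instance pool is synthetic, and the check on the sample (mean within a tolerance of the population mean) is a deliberately crude stand-in for a proper distribution comparison; both function names and the tolerance are assumptions.

```python
import random
import statistics

def draw_training_sample(instances, sample_size, seed=None):
    """Pick `sample_size` instances uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(instances, sample_size)

def matches_distribution(sample, population, tolerance=0.1):
    """Crude check: sample mean within `tolerance` population standard deviations."""
    gap = abs(statistics.mean(sample) - statistics.mean(population))
    return gap <= tolerance * statistics.pstdev(population)

# Hypothetical pool of 100,000 one-dimensional training instances.
pool_rng = random.Random(0)
pool = [pool_rng.gauss(0.0, 1.0) for _ in range(100_000)]

sample = draw_training_sample(pool, sample_size=1000, seed=1)
print(len(sample), matches_distribution(sample, pool))
```

Calling `draw_training_sample` with a fresh seed at regular intervals would give the GP system a new small training set each time, in the spirit of the last item above.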

3 Offline Postprocessing

3.1 Measurement of Processing Effort

Processing effort
  the number of individuals that have to be processed in order to find a solution

Instantaneous probability
  the probability that a certain run with M individuals generates a solution in generation i

Success probability
  the probability that one obtains a solution for the given problem if one performs a run over i generations

Using R runs
  the probability of finding a solution by generation i, using R runs
  how many runs do we need to solve a problem with a certain probability z?
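The equations for this slide did not survive, so the sketch below assumes the formulation commonly attributed to Koza: with success probability P for a single run up to generation i, R independent runs succeed with probability z = 1 - (1 - P)^R, hence R(z) = ceil(log(1 - z) / log(1 - P)), and the effort is M * (i + 1) * R(z). The numbers are invented and the function names are hypothetical.

```python
import math

def success_with_runs(p_success: float, runs: int) -> float:
    """z = 1 - (1 - P)^R: chance that at least one of R independent runs succeeds."""
    return 1.0 - (1.0 - p_success) ** runs

def runs_needed(p_success: float, z: float = 0.99) -> int:
    """Smallest R with 1 - (1 - P)^R >= z, for 0 < P < 1 and 0 < z < 1."""
    return math.ceil(math.log(1.0 - z) / math.log(1.0 - p_success))

def processing_effort(pop_size: int, generation: int, p_success: float, z: float = 0.99) -> int:
    """Individuals to process: M * (i + 1) * R(z), counting generation 0."""
    return pop_size * (generation + 1) * runs_needed(p_success, z)

# Hypothetical numbers: 20% of runs with M = 500 succeed by generation 30.
print(runs_needed(0.2))                 # 21 runs for z = 0.99
print(processing_effort(500, 30, 0.2))  # 500 * 31 * 21 = 325500 individuals
```

In practice P is estimated from a batch of runs, so the resulting effort figure carries the same sampling uncertainty as any other measured proportion.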

3.2 Trait Mining

Code generated by GP
  problem-relevant but redundant code, and code irrelevant to the problem at hand

Avoiding useless code
  restrict the size and complexity of the programs

Gene banking
  keeps book on all expressions evolved so far during a GP run

4 Analysis and Measurement of Online Data

4.1 Online Data Analysis
  monitor the transition from randomness to stability
  highlight how the transition takes place
  raise the possibility of being able to control GP runs through feedback from the online measurements

4.2 Measurement of Online Data

Generational Online Measurement
  population as a whole
  avg. fitness, percentage of the population that is comprised of introns, ...

Steady-State Online Measurement

4.3 Survey of Available Online Tools

Fitness
  best fitness
  avg. fitness
  variance of the fitness

Diversity
  genotypic diversity: structural difference
  phenotypic diversity: behavioral difference

Measuring Genotypic Diversity
  no quality (fitness) information contained
  based on a comparison of the structure of the individuals only

Edit distance
  the number of elementary substitution operations necessary to traverse the search space from one program to another

Fixed Length Genomes
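For fixed-length genomes, a natural reading of the edit distance is the number of positions that must be substituted, i.e. the Hamming distance. The sketch below assumes that reading and uses the mean pairwise distance as a simple population-level diversity figure; the bit-string population is invented.

```python
from itertools import combinations

def hamming_distance(genome_a, genome_b):
    """Number of positions at which two equal-length genomes differ."""
    if len(genome_a) != len(genome_b):
        raise ValueError("genomes must have the same length")
    return sum(a != b for a, b in zip(genome_a, genome_b))

def mean_pairwise_distance(population):
    """Average distance over all pairs; 0 means every genome is identical."""
    pairs = list(combinations(population, 2))
    return sum(hamming_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical population of fixed-length bit-string genomes.
population = ["110010", "110011", "010010", "111111"]
print(hamming_distance(population[0], population[1]))  # differ only in the last position
print(mean_pairwise_distance(population))
```

Comparing all pairs costs time quadratic in the population size, so for large populations a random subset of pairs is a reasonable compromise.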

Tree Genomes

Phenotypic Diversity

  Fitness Variance
  Fitness Histograms
  Entropy
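The three phenotypic measures can be computed from the fitness values of one generation. In this sketch the fitness values and the bin width are invented, and entropy is taken as the Shannon entropy of the histogram proportions, which is one common choice rather than a definition given on the slide.

```python
import math
import statistics
from collections import Counter

def fitness_histogram(fitnesses, bin_width):
    """Count individuals per fitness bin; bin k covers [k * w, (k + 1) * w)."""
    return Counter(math.floor(f / bin_width) for f in fitnesses)

def histogram_entropy(histogram):
    """Shannon entropy in bits of the bin proportions."""
    total = sum(histogram.values())
    return -sum((c / total) * math.log2(c / total) for c in histogram.values())

# Hypothetical fitness values of one generation.
fitnesses = [0.12, 0.15, 0.33, 0.35, 0.36, 0.71, 0.74, 0.90]
hist = fitness_histogram(fitnesses, bin_width=0.25)
print("variance:", statistics.pvariance(fitnesses))
print("histogram:", dict(sorted(hist.items())))
print("entropy (bits):", histogram_entropy(hist))
```

Entropy drops toward zero as the population collapses into a single fitness bin, so a falling curve over generations is a sign of lost phenotypic diversity.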

Measuring Operator Effects

Crossover Effects
  comparison of the avg. fitness of both parents with the avg. fitness of both offspring
  the fitness of children and parents is compared one by one
  formal way
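The first comparison, average parent fitness against average offspring fitness, can be sketched as a classifier over crossover events. The labels, the assumption that higher fitness is better, and the neutral band `tolerance` are all choices made for this illustration.

```python
def classify_crossover(parent_fitnesses, offspring_fitnesses, tolerance=0.025):
    """Label one crossover event by the relative change in average fitness.

    Assumes higher fitness is better and a nonzero parent average; `tolerance`
    is a hypothetical band inside which the event counts as neutral.
    """
    parent_avg = sum(parent_fitnesses) / len(parent_fitnesses)
    offspring_avg = sum(offspring_fitnesses) / len(offspring_fitnesses)
    change = (offspring_avg - parent_avg) / abs(parent_avg)
    if change > tolerance:
        return "constructive"
    if change < -tolerance:
        return "destructive"
    return "neutral"

# Hypothetical (parents, offspring) fitness pairs logged during a run.
events = [
    ((0.60, 0.70), (0.30, 0.40)),
    ((0.50, 0.50), (0.50, 0.51)),
    ((0.40, 0.60), (0.65, 0.55)),
]
print([classify_crossover(p, o) for p, o in events])
```

Counting the labels per generation gives a time series of operator effects that can be plotted next to the fitness curves.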

Intron Measurements

Intron counting
  the process of ascertaining the number of introns

Intron counting in tree structures
  flag: indicates whether this node has the same output as one of its inputs

Intron counting in linear genomes
  replace an instruction with a NOP instruction; if there is no change for any of the fitness cases, classify this instruction as an intron
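The NOP-replacement procedure for linear genomes can be shown on a toy register machine. Everything about the machine (two registers, the `add`/`mul`/`mov` instruction set, the tuple encoding) is invented for this sketch; only the replace-and-compare idea comes from the slide.

```python
NOP = ("nop", 0, 0)

def run(program, x):
    """Execute a linear program on two registers; r0 holds the input and the output."""
    regs = [x, 0.0]
    for op, dst, src in program:
        if op == "add":
            regs[dst] += regs[src]
        elif op == "mul":
            regs[dst] *= regs[src]
        elif op == "mov":
            regs[dst] = regs[src]
        # "nop" changes nothing
    return regs[0]

def count_introns(program, fitness_cases):
    """Replace each instruction by NOP in turn; unchanged outputs on every case mark an intron."""
    baseline = [run(program, x) for x in fitness_cases]
    introns = []
    for index in range(len(program)):
        variant = program[:index] + [NOP] + program[index + 1:]
        if [run(variant, x) for x in fitness_cases] == baseline:
            introns.append(index)
    return introns

# Hypothetical program computing 2 * x, padded with two instructions that do nothing useful.
program = [
    ("mov", 1, 0),   # r1 = x
    ("mov", 1, 1),   # r1 = r1 (no effect)
    ("add", 0, 1),   # r0 = x + x
    ("mul", 1, 1),   # r1 = r1 * r1 (never read afterwards)
]
introns = count_introns(program, fitness_cases=[-2.0, 0.5, 3.0])
print("intron positions:", introns)
print("effective length:", len(program) - len(introns))
```

Testing one instruction at a time is an approximation: instructions that only matter in combination, or that matter only on inputs outside the fitness cases, can be misclassified.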

Compression Measurement

Compression
  the effective length of an individual is often reduced during evolution
  effective length = total length - intron length

Node use and Block activation

Node use
  how often individual nodes are used in the present generation
  measured by incrementing a counter associated with a node

Block activation
  the number of times the root node of a block is executed

Salient block
  code which influences fitness evaluation

Real Time Run Control Using Online Measurements

PADO
  a meta-learning module changed the crossover operator itself during the run

AIMGP
  termination condition: when destructive crossover fell below 10% of total events
  doubled speed
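A termination rule of the kind described for AIMGP could be sketched as follows. The event labels, the window of logged events, and the `min_events` guard are assumptions for the illustration; the slide gives only the 10% threshold.

```python
def should_terminate(event_labels, threshold=0.10, min_events=100):
    """Stop once destructive crossover makes up less than `threshold` of the logged events.

    `min_events` is a hypothetical guard against deciding on too few observations.
    """
    if len(event_labels) < min_events:
        return False
    destructive = sum(1 for label in event_labels if label == "destructive")
    return destructive / len(event_labels) < threshold

# Hypothetical logs of crossover outcomes from two stages of a run.
early = ["destructive"] * 70 + ["neutral"] * 25 + ["constructive"] * 5
late = ["destructive"] * 8 + ["neutral"] * 90 + ["constructive"] * 2
print(should_terminate(early), should_terminate(late))
```

The check is cheap enough to run every generation, which is what makes it usable as online feedback rather than as an offline diagnostic.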

5 Generalization and Induction

Generalization problem
  sampling error
  overfitting

  complexity of the learned solution
  amount of time spent training
  size of the training set

5.1 An Example of Overfitting and Poor Generalization

Training function

error function

5.2 Dealing with Generalization Issues

Training set and test set
  the best individual from the training set is run on the test set
  which is the best individual of the training set?

  The highest-fitness one may be overfitted to the training data.

Adding a new test set
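The selection issue can be made concrete with a small sketch. The models and error values are invented; the idea shown is to choose among best-of-run individuals by their error on the test set rather than on the training set, and then to report performance on a further, untouched test set.

```python
def select_by_held_out_error(candidates, held_out_error):
    """Pick the candidate with the lowest error on data not used for training."""
    return min(candidates, key=held_out_error)

# Hypothetical best-of-run models with their errors on each data set.
errors = {
    "model_a": {"train": 0.01, "test": 0.30, "new_test": 0.32},
    "model_b": {"train": 0.05, "test": 0.08, "new_test": 0.09},
    "model_c": {"train": 0.10, "test": 0.12, "new_test": 0.11},
}
best_on_training = min(errors, key=lambda name: errors[name]["train"])
chosen = select_by_held_out_error(errors, lambda name: errors[name]["test"])
print(best_on_training, chosen, errors[chosen]["new_test"])
```

Once the test set has been used to choose a model, its error is no longer an unbiased estimate, which is the reason for adding a new test set.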

6 Conclusion

Observe, measure, test
  tools to estimate the predictive value of GP models
  tools to improve the GP system or to test theories
