K236: Basis of Data Science
Lecture 6: Data Preprocessing
Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai



Page 1

K236: Basis of Data Science
Lecture 6: Data Preprocessing

Lecturer: Tu Bao Ho and Hieu Chi Dam
TA: Moharasan Gandhimathi and Nuttapong Sanglerdsinlapachai

Page 2

Schedule of K236

1. Introduction to data science データ科学入門 6/9

2. Introduction to data science データ科学入門 6/13

3. Data and databases データとデータベース 6/16

4. Review of univariate statistics 単変量統計 6/20

5. Review of linear algebra 線形代数 6/23

6. Data mining software データマイニングソフトウェア 6/27

7. Data preprocessing データ前処理 6/30

8. Classification and prediction (1) 分類と予測 (1) 7/4

9. Knowledge evaluation 知識評価 7/7

10. Classification and prediction (2) 分類と予測 (2) 7/11

11. Classification and prediction (3) 分類と予測 (3) 7/14

12. Mining association rules (1) 相関ルールの解析 7/18

13. Mining association rules (2) 相関ルールの解析 7/21

14. Cluster analysis クラスター解析 7/25

15. Review and Examination レビューと試験 (the date is not fixed) 7/27

Page 3

[Figure: The data analysis process, with data organized by function across five numbered stages. Boxes: create/select target database; select sampling technique and sample data; supply missing values; normalize values; eliminate noisy data; transform values; transform to a different representation; create derived attributes; find important attributes & value ranges; select DM task(s); select DM method(s); extract knowledge; test knowledge; refine knowledge; query & report generation, aggregation & sequences, advanced methods; data warehousing. The preprocessing stage corresponds to Lecture 6; the later mining and evaluation stages to Lectures 7-9, 10-14 and Lecture 8.]

The data analysis process

Page 4

1. Why Preprocess the Data?
2. Data Cleaning
3. Data Integration
4. Data Reduction
5. Data Transformation

Outline

Page 5

Common properties of large real-world databases:

• Incomplete: lacking attribute values or certain attributes of interest

• Noisy: containing errors or outliers

• Inconsistent: containing discrepancies in codes or names

Veracity problem! No quality data, no quality analysis results!

Why preprocess the data?

Page 6

1. Data cleaning
2. Data integration
3. Data reduction (instances and dimensions)
4. Data transformation

Major tasks in data preprocessing

Page 7

Major tasks in data preprocessing

• Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  - Integration of multiple databases, data cubes, or files
• Data transformation
  - Normalization and aggregation
• Data reduction
  - Obtains a reduced representation in volume but produces the same or similar analytical results
• Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

Page 8

1. Why Preprocess the Data?
2. Data Cleaning
3. Data Integration
4. Data Reduction
5. Data Transformation

Outline

Page 9

• Fill in missing values

• Identify outliers and smooth out noisy data

• Correct inconsistent data

Data cleaning tasks

Page 10

Missing data

• Data is not always available
  - e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  - equipment malfunction
  - being inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - not registering the history or changes of the data

• Missing data may need to be inferred.

Page 11

• Missing values may hide a true answer underlying the data

• Many data mining programs cannot be applied to data that includes missing values

Missing values in databases

Class attribute: norm, lt-norm, gt-norm. The other six attributes all have missing values.

Page 12

Methods
1. Ignore the tuples
2. Fill in the missing value manually (tedious + infeasible?)
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill the missing values
5. Use the attribute mean (or mode for a categorical attribute) of all samples belonging to the same class as the given tuple
6. Use the most probable value to fill the missing value
7. Others

Methods applied per attribute in the example table: 2, 4, 5, 3, 6, 6 (a small code sketch follows the table)

Missing values in databases

[Example table omitted: attribute values such as yes/no, unknown, 29, none, 13, and dna, with each column's missing entries filled using one of the methods listed above.]
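A minimal pandas sketch of methods 4 and 5 above (attribute mean, and class-conditional mean); the DataFrame and column names are invented for illustration and are not the lecture's data.

```python
# Minimal sketch of filling missing values with pandas.
# The toy columns ("duration", "wage-increase", "class") are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "duration":      [1, 2, None, 3, None, 2],
    "wage-increase": [2.0, None, 4.5, None, 4.0, 4.5],
    "class":         ["bad", "good", "good", "bad", "good", "bad"],
})

# Method 4: fill a numeric attribute with its overall mean
df["duration"] = df["duration"].fillna(df["duration"].mean())

# Method 5: fill with the mean computed within the same class as the tuple
df["wage-increase"] = df.groupby("class")["wage-increase"].transform(
    lambda s: s.fillna(s.mean()))

print(df)
```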

Page 13

Noisy data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
• Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

Page 14

How to handle noisy data?

• Binning method
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
• Clustering
  - detect and remove outliers
• Combined computer and human inspection
  - detect suspicious values and have a human check them
• Regression
  - smooth by fitting the data to regression functions

Page 15

Binning: to smooth a sorted data value by consulting its "neighborhood", that is, the values around it (local smoothing)

- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin

- Smoothing by bin medians: each bin value is replaced by the bin median

- Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is then replaced by the closest boundary value

How to handle noisy data?

Page 16

• The original data: 9, 21, 24, 21, 4, 26, 28, 34, 29, 8, 15, 25

• Sort the data in increasing order and partition it into (equi-depth) bins:
  4, 8, 9, 15 | 21, 21, 24, 25 | 26, 28, 29, 34

• Smoothing by bin means: 9, 9, 9, 9, 22, 22, 22, 22, 29, 29, 29, 29

• Smoothing by bin boundaries (each value replaced by the closest boundary): 4, 4, 4, 15, 21, 21, 25, 25, 26, 26, 26, 34

How to handle noisy data?
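A minimal NumPy sketch of the example above, assuming three equi-depth bins of four values each; note the exact bin means are 9, 22.75, and 29.25 (rounded on the slide).

```python
# Equi-depth binning and smoothing for the 12-value example above.
import numpy as np

data = np.sort([9, 21, 24, 21, 4, 26, 28, 34, 29, 8, 15, 25])
bins = data.reshape(3, 4)                      # three equi-depth bins of four

# Smoothing by bin means: every value becomes its bin's mean
by_means = np.repeat(bins.mean(axis=1), 4)     # 9, 22.75, 29.25 repeated

# Smoothing by bin boundaries: every value snaps to the nearer of min/max
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)

print(by_means)
print(by_bounds.ravel())   # 4 4 4 15 21 21 25 25 26 26 26 34
```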

Page 17

• Outliers may be detected by clustering analysis

Values that fall outside of the set of clusters may be considered outliers

How to handle noisy data?

Page 18

• Combined computer and human inspection: output patterns with surprising content to a list; a human can identify the actual garbage ones.

• Regression: smooth by fitting the data to a function
  - Linear regression
  - Multiple linear regression: more than two variables, and the data are fit to a multidimensional surface

[Figure: a regression line y = x + 1 fitted to the data; a point (X1, Y1) is smoothed to the value Y1' on the line.]

How to handle noisy data?

Page 19

1. Why Preprocess the Data?
2. Data Cleaning
3. Data Integration
4. Data Reduction
5. Data Transformation

Outline

Page 20

• Data integration combines data from multiple sources (multiple DBs, data cubes, flat files) into a coherent data store.

• Schema integration (entity identification problem): How can equivalent entities from multiple data sources be matched up?

• Redundancy: An attribute may be redundant if it can be “derived” from another table.

Data integration

Page 21

• Redundancy: can be detected by correlation analysis (correlation coefficient), e.g., how strongly one attribute implies another attribute.

• Detection and resolution of data value conflicts

$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B}$$

Data integration
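A small NumPy sketch of the correlation check above; the two attributes are invented (B is simply a scaled copy of A) purely to illustrate a redundant pair.

```python
# Detecting a redundant attribute with the correlation coefficient r_{A,B}.
import numpy as np

A = np.array([12.0, 15.0, 18.0, 22.0, 30.0])   # e.g., price before tax
B = 1.1 * A                                    # e.g., price including tax

r = ((A - A.mean()) * (B - B.mean())).sum() / (
        (len(A) - 1) * A.std(ddof=1) * B.std(ddof=1))
print(r)                        # 1.0: B is fully derivable from A
# equivalently: np.corrcoef(A, B)[0, 1]
```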

Page 22

1. Why Preprocess the Data?
2. Data Cleaning
3. Data Integration
4. Data Reduction
5. Data Transformation

Outline

Page 23

• Data cube aggregation

• Dimension reduction

• Data compression

• Numerosity reduction

• Discretization and concept hierarchy generation

Strategies for data reduction

Page 24

Data cube aggregation

• Aggregation operations are applied to the data in the construction of a data cube

On the left, the sales are shown per quarter. On the right, the data are aggregated to provide the annual sales.
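A minimal pandas sketch of the same roll-up (quarterly sales aggregated to annual totals); the sales figures are invented.

```python
# Roll-up: quarterly sales aggregated to annual sales, as in the figure.
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 300, 416, 380, 610],
})

annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)      # one row per year with the aggregated annual sales
```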

Page 25

A data cube for multidimensional analysis of sales data with respect to annual sales per item type for each branch of the company

Data cube aggregation

Page 26

Data compression: Attribute selection

Attribute subset selection (also called "feature selection")
  - Stepwise forward selection
  - Stepwise backward elimination
  - Combination of forward selection and backward elimination
  - Many other methods
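A minimal sketch of stepwise forward selection under simple assumptions (scikit-learn, a decision tree as the evaluator, and a fixed number of selected attributes); this is one possible realization, not the lecture's prescribed one.

```python
# Greedy stepwise forward selection: repeatedly add the attribute that most
# improves cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
remaining, selected = list(range(X.shape[1])), []

for _ in range(2):                          # keep two attributes (for brevity)
    scores = {j: cross_val_score(DecisionTreeClassifier(random_state=0),
                                 X[:, selected + [j]], y, cv=5).mean()
              for j in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)

print(selected)                              # indices of the chosen attributes
```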

Page 27

• Discrete wavelet transformation (DWT): a linear signal processing technique that, when applied to a data vector D, transforms it to a numerically different vector D’ of wavelet coefficients.

• Store only a small fraction of the strongest of the wavelet coefficients

[Figure: a real data signal, its wavelet transform (WT) coefficients at resolution levels J=-1 and J=-2, and its reconstruction (RWT).]

Data compression: Wavelet transforms
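A small sketch of DWT-based compression, assuming the PyWavelets package (pywt) and a synthetic signal: transform, keep only the strongest coefficients, and reconstruct.

```python
# DWT compression sketch: keep only the strongest wavelet coefficients.
import numpy as np
import pywt

signal = np.sin(np.linspace(0, 4 * np.pi, 64)) + 0.1 * np.random.randn(64)

coeffs = pywt.wavedec(signal, "haar", level=2)          # D -> D' (coefficients)
flat = np.concatenate(coeffs)
threshold = np.quantile(np.abs(flat), 0.75)             # keep the strongest 25%
coeffs = [np.where(np.abs(c) >= threshold, c, 0.0) for c in coeffs]

approx = pywt.waverec(coeffs, "haar")                   # reconstructed signal
print(np.mean((signal - approx) ** 2))                  # reconstruction error
```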

Page 28

• Principal Components Analysis: transform data points from k dimensions into c dimensions (c ≤ k) with minimum loss of information

• PCA searches for c-dimensional orthogonal vectors that can best be used to represent data. The original data are thus projected onto a much smaller space of c dimensions (c principal components)

• Only used for numerical data

Data compression: PCA

[Figure: five data points O1-O5 plotted against the X and Y axes, with two candidate projection axes Z1 and Z2. Question: if we reduce to one dimension, which of Z1 and Z2 is better?]
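A minimal scikit-learn sketch of the reduction in the figure; the five 2-D points are invented stand-ins for O1-O5, and PCA picks whichever single axis (the role of Z1 vs Z2) preserves the most variance.

```python
# PCA sketch: project 2-D points onto the single best axis.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1.0, 1.1], [2.0, 2.1], [3.0, 2.9], [1.5, 1.4], [2.5, 2.6]])

pca = PCA(n_components=1)
Z = pca.fit_transform(X)                    # coordinates along the chosen axis
print(pca.components_)                      # direction of the principal axis
print(pca.explained_variance_ratio_)        # fraction of variance retained
print(Z.ravel())
```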

Page 29

• Can we reduce the data volume by choosing alternative, ‘smaller’ forms of data representation?

• Parametric methods: a model is used to estimate the data, so that typically only the model parameters need be stored instead of the actual data
  - Regression and log-linear models, e.g., y = a x + b
• Non-parametric methods for storing reduced representations of the data include
  - Histograms
  - Clustering
  - Sampling

Numerosity reduction

Page 30

Singleton buckets: each bucket represents one price-value/frequency pair

An equi-width histogram, where values are aggregated so that each bucket has a uniform width of $10

Numerosity reduction: histogram
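A short NumPy sketch of an equi-width histogram as a reduced representation; the $10 bucket width follows the caption above, but the price values themselves are invented.

```python
# Equi-width histogram: bucket counts replace the raw prices.
import numpy as np

prices = np.array([1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 15, 18, 20, 21, 25, 28])
counts, edges = np.histogram(prices, bins=np.arange(0, 31, 10))   # width $10

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"${lo}-${hi}: {c} values")
```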

Page 31

[Figure] A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster "center" is marked with a "+".

Numerosity reduction: Clustering

Page 32

• Simple random sample without replacement of size n (SRSWOR)

• Simple random sample with replacement of size n (SRSWR)

• Cluster sample
• Stratified sample

Numerosity reduction: Sampling

[Figure annotation: equal proportion (e.g., ½)]
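A pandas sketch of the four sampling schemes, assuming a toy table with an invented "stratum" column; for the cluster sample, the strata double as clusters purely for illustration.

```python
# SRSWOR, SRSWR, stratified, and cluster sampling with pandas.
import pandas as pd

df = pd.DataFrame({"id": range(100),
                   "stratum": ["young"] * 60 + ["senior"] * 40})

srswor = df.sample(n=10, replace=False, random_state=0)   # SRSWOR
srswr  = df.sample(n=10, replace=True,  random_state=0)   # SRSWR

# Stratified sample: the same proportion (e.g., 1/2) from every stratum
strat = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.5, random_state=0))

# Cluster sample: pick whole clusters at random and keep all their rows
picked = df["stratum"].drop_duplicates().sample(n=1, random_state=0)
clust = df[df["stratum"].isin(picked)]

print(len(srswor), len(srswr), len(strat), len(clust))
```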

Page 33

1. Why Preprocess the Data?
2. Data Cleaning
3. Data Integration
4. Data Reduction
5. Data Transformation

Outline

Page 34

• Smoothing: to remove noise from data

• Aggregation: summary or aggregation operations are applied to the data

• Generalization: low-level or "primitive" data are replaced by higher-level concepts through the use of concept hierarchies

• Normalization: attribute data are scaled so as to fall within a small specified range, say 0.0 to 1.0

• Attribute construction: new attributes are constructed and added from the given set of attributes to help the mining process: from continuous to discrete (discretization) and from discrete to continuous (word embedding).

Data transformation

Page 35

Min-max and z-score normalization

• Min-max normalization: suppose min_A and max_A are the minimum and maximum values of attribute A. We map a value v of A to v' in the range [newmin_A, newmax_A] by

  $$v' = \frac{v - min_A}{max_A - min_A}\,(newmax_A - newmin_A) + newmin_A$$

• Example: suppose min_A and max_A are $12,000 and $98,000, and we want to map income to the range [0.0, 1.0]. Then $73,600 is transformed to

  $$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0 = 0.716$$

• z-score normalization: the values of an attribute A are normalized based on the mean and standard deviation of A

  $$v' = \frac{v - \bar{A}}{\sigma_A}$$

• Example: if the mean and standard deviation are $54,000 and $16,000, then $73,600 is transformed to

  $$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
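A tiny Python sketch of both formulas, reproducing the $73,600 example above.

```python
# Min-max and z-score normalization, reproducing the example above.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
```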

Page 36

Discretization

• Three types of attributes:
  - Nominal (categorical): red, yellow, blue, green
  - Ordinal: small, middle, large, extremely large
  - Continuous: real numbers

• Discretization: divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis

Page 37

• Binning
• Histogram analysis
• Cluster analysis
• Entropy-based discretization
• Segmentation by natural partitioning

Discretization

Page 38

• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

  $$E(S, T) = \frac{|S_1|}{|S|}\,Ent(S_1) + \frac{|S_2|}{|S|}\,Ent(S_2)$$

Entropy-based discretization

• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

• The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., until the information gain $Ent(S) - E(T, S) > \delta$ no longer holds.

• Experiments show that it may reduce data size and improve classification accuracy.
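A short Python sketch of one splitting step: compute E(S, T) for every candidate boundary T and keep the minimizer. The attribute values and class labels are invented.

```python
# One step of entropy-based discretization on a toy attribute.
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
labels = np.array(["a", "a", "a", "b", "b", "b", "b", "a"])

def ent(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def e_after_split(t):                     # E(S, T) for boundary t
    left, right = labels[values <= t], labels[values > t]
    n = len(labels)
    return len(left) / n * ent(left) + len(right) / n * ent(right)

candidates = (values[:-1] + values[1:]) / 2           # midpoints as boundaries
best_t = min(candidates, key=e_after_split)
print(best_t, ent(labels) - e_after_split(best_t))    # boundary and info gain
```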

Page 39

What is word embedding?

• Word embedding: mapping a word (or phrase) from its original high-dimensional input space to a lower-dimensional numerical vector space.

• Word2vec is a group of related models that are used to produce word embeddings.
  - These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words.
  - Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.

• Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.
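A minimal sketch of training word2vec embeddings, assuming the gensim library (4.x API); the three-sentence "corpus" is only for illustration, since useful embeddings need a large corpus.

```python
# Training word2vec embeddings on a toy corpus with gensim.
from gensim.models import Word2Vec

corpus = [
    ["data", "preprocessing", "improves", "data", "quality"],
    ["noisy", "data", "needs", "cleaning"],
    ["missing", "values", "are", "filled", "during", "cleaning"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv["data"][:5])             # first few dimensions of one vector
print(model.wv.most_similar("data"))    # nearby words in the embedding space
```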

Page 40

Some more complex data transformation

[Figure: a mapping f: X → F from the input space X to a feature space F where the problem can be solved. Latent semantic indexing factorizes the normalized word-document co-occurrence matrix C (words × documents) into U (words × dims), D (dims × dims), and V (documents × dims). Topic models factorize C into F (words × topics) and Q (documents × topics).]
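A small scikit-learn sketch of latent semantic indexing: build a term-document count matrix and factorize it with truncated SVD. The toy documents are invented, and the matrix is oriented documents × words (the transpose of the slide's C).

```python
# Latent semantic indexing via truncated SVD of a count matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data cleaning fills missing values",
        "data integration merges databases",
        "word embeddings map words to vectors"]

C = CountVectorizer().fit_transform(docs)     # documents x words counts
svd = TruncatedSVD(n_components=2)
doc_vectors = svd.fit_transform(C)            # each document in 2 latent dims
print(doc_vectors)
```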

Page 41

• Data preprocessing is an important issue, as real-world data tend to be incomplete, noisy, and inconsistent

• Data cleaning routines can be used to fill in missing values, smooth noisy data, identify outliers, and correct data inconsistencies

• Data integration combines data from multiple sources to form a coherent data store

• Data transformation routines convert the data into appropriate forms for analysis

• Data reduction techniques can be used to obtain a reduced representation of the data while minimizing the loss of information content

• Automatic generation of concept hierarchies can involve different techniques for numeric data, and may be based on the number of distinct values of attributes for categorical data

• Data preprocessing remains an active area of research

Summary

Page 42

Homework

The "labor.arff" data provided by WEKA has 57 instances, 16 descriptive attributes, and a class attribute with two values, 'bad' and 'good'. The attributes of "labor.arff" have many missing values. Do the following:

(1) Use the methods in Lecture 6 to treat the missing values of all attributes in “labor.arff”

(2) Explain why the method you used for each attribute is appropriate.

Submit the written report (pdf) by July 7, 2017. Hint:

1. You can use the ARFF-Viewer in the 'Tools' menu of WEKA to visualize "labor.arff".
2. You have at least two ways to work on the labor data (labor.arff):

• Use the tool ‘arff2csv.zip’ at our website http://www.jaist.ac.jp/~bao/K236/ to convert the data into Excel format, and use the data represented in Excel for your preprocessing, or

• Take the ‘labor’ data from UCI: http://archive.ics.uci.edu/ml/machine-learningdatabases/labor-negotiations/C4.5/ and store it in Excel format (or whatever you like) to process.