Как приготовить тестовые данные для Big Data проекта. Пример из практики

Как приготовить тестовые данные для Big Data проекта. Пример из практики.

Dmitriy Romanov. Itera Consulting UA

Part 1. Introduction

Part 2. Data Sampling Approach Description

Part 3. Data Sampling Tool Implementation

List of Contents

Testing BI and DWH07.03.2013

ETL and DWH

CSV

STG2STG1 DWHData-mart

Cube

BISourcesystem

Mapping & report logic

DB2 ETL

File

Report

Report

Sources of Errors

ETL ETL

How we test ETL processes ?

Test Data DSTSRC

ETL

Error ?

CHECK TYPES

Row Count

Field Values

Aggregations Validation

…

Extraction – all data extracted correctlyTransformation – works as expectedLoad – whole data arrived to destination

Testing BI and DWH07.03.2013

How to get Test Data

“Random”Production Data

Subset

Full sized Production

Data

Sampled

Production data

Coverage

Preparation Cost orExecution Time

GeneratedTest Data

Part 1

Data Sampling Approach description

8

Data sampling. Main idea

"Compress" similar data in source dataset and take for further analysis only "unique" rows

Clustering is one of the data mining techniques, which helps in grouping of most similar rows

9

Data sampling. Clustering

1. Three main data types in sources- Numeric data- Categorical (unordered) data- Linked data

2. At the start numeric and categorical data sets can be split into proportional parts (quartiles).

3. Linked data requires linked content analyses and currently treated as unordered data

Data sampling. Defining quartiles

Partitioning can be done with T-SQL NTILE function.Example for ordered data

With some limitations such method also can be used for categorical data too:

FieldVal Qrt. FieldVal Qrt. FieldVal Qrt. FieldVal Qrt. FieldVal Qrt. FieldVal Qrt. FieldVal Qrt. FieldVal Qrt.

Ansvar 1 Delkasko 2 Innbo 3 Leiebil 4Naturskd Bed 5

Næringsbygg 6 Standard 7

Utvidet m/UP 8

Bedrift 1 Død 2 Invaliditet 3 Med. invalid 4Naturskd Ko 5

Panthaver 6 Super 7 Verdisak 8

Brann Tyveri 1 Død/bruk 2 Kasko 3 MLV 4

Naturskd MLV 5

Reise/Helse 6

Till. k.inbo 7

Veterinær 8

Brann/Tyveri 1

Førerulykke 2 Kontor 3 Naturskade 4

Naturskd NBy 5

Reisedekning 6

Utgår Syk 7

Yrkesskade 8

11

Data sampling. Quartiles properties.

Every quartile can be described by:

Quartiles definitions Examples:

Quartile Attribute Description

Qmin Minimal value in quartile

Qmed Median value in quartile

Qmax Maximal value in quartile

Records count records in original dataset belonged to quartile

"Quartilized" view of source data.

- Quartiles should be defined for all fields in the source table- Every field value should be replaced by quartile code which this value belongs to- Quartiles Minimum, Maximum, Median values should get own code

Example :Q1 – value from quartile #1 QnMin/QnMax/QnMed – fields with Min/Max/Med values for N quartile correspondently

Col001 Col002 Col003 Col004 Col005 Col006 Col007 Col008 Col009 Col010 Col011 Col012Q1 Q2Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1MinQ1 Q4Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1Q1 Q4Min Q0 Q2 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1MinQ1 Q2Min Q0 Q1 Q1Min Q1Min Q1Med Q1Med Q1 Q1 Q1Min Q1Q1 Q4Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1Q1 Q2Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1Q1 Q2Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1MinQ1 Q4Min Q0 Q2 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1Q1 Q4Min Q0 Q1 Q1Min Q1Min Q1Med Q1 Q1 Q1 Q1Min Q1Min

Representative set built from clusters

- For more reliable result set must contain all (if exists) Qmin, Qmed, Qmax values for every quartile. - All other values can be ignored in further analysis and will be replaced by Q0 (other) code.- Each unique combination of quartile codes forms cluster

Data sampling. Clustered view resulting set

Cluster Col001 Col002 Col003 Col004 Col005 Col006 Col007 RecCNT1 Q0 Q4Min Q0 Q0 Q1Min Q1Min Q1Med 1512 Q0 Q4Med Q0 Q0 Q1Min Q1Min Q1Med 23 Q0 Q4Min Q0 Q0 Q1Min Q1Min Q1Med 284 Q0 Q2Min Q0 Q0 Q1Min Q1Min Q1Med 155 Q0 Q2Min Q0 Q3Min Q1Min Q1Min Q1Med 126 Q0 Q2Min Q0 Q0 Q1Min Q1Min Q1Med 117 Q0 Q2Min Q0 Q4Max Q1Min Q1Min Q1Med 1

Perform grouping of similar records and calculate how many rows each cluster contains (RecCNT column)

15

Data sampling. Storing resulting set

Take only one row from every cluster and place it to Resulting set

Resulting set can be stored in following forms:- Set of "INSERT" statements

- Cut original CSV file (if available) and left only one row from every cluster

DELETE from J0.STG1_P_LIV_SALDOGOinsert J0.STG1_P_LIV_SALDO(Col001,Col002,Col003,Col004,Col005,Col006,Col007,Col008,Col009,Col010,Col011,Col012,Col013,Col014,Lastet_dat) values('0A319121','20','20360',null,null,'20','2A','00003207H','320040704','30115335797',null,'000000B','30115335797','02','May 3 2012 1:26PM')…

Part 2

Data Sampling Tool Implementation

17

Data sampling tool. Main idea

1. Keep things as simple as possible

- Dynamic SQL source generation, parameterization widely used

- Results of each step are fixed as SQL statement (level of code

understandability/control remain high as possible)

- Fully automated partitioning/clustering/filtering.

2. Break down process into independent parts. Keep in mind possibility of

future extension and customization

3. All data manipulations performed on T-SQL and massive data tasks

delegated to the SQL Server

18

Data sampling tool. Brief overview

- Control logic implemented on Powershell, database part on T-SQL

- Analyze data from Stage1 table and CSV files

- Received number of quartiles/column as parameter

- Generate quartiles definitions and build clusters

- Only one record from cluster goes into representative set

- Representative set can be stored in:

a) Set of insert SQL statements

b) Copy of source text file with only meaningful rows

19

Data sampling tool. Structure

Create clustered view of source data

Analyze STG1 table

Filter & Store result

STG1 Table nameNumber of quartiles

List of quartiles{XML}


List of RowID's{FILE}


Inserts_to_stg1_tbl.sqlOr<Copy of source file with only lines corresponded to RowID's list>

Input Process Output

Grouping similar records to clusters

Source for "view"{SQL}


Quartile Definition Example

"Clusterised" View Example

Saving Results Example

Results Example

24

Data sampling tool. Structure

Create clustered view of source data

Analyze STG1 table

Filter & Store result

STG1 Table nameNumber of quartiles





Inserts_to_stg1_tbl.sqlOr<Copy of source file with only lines corresponded to RowID's list>

Input Process Output

Grouping similar records to clusters



Data sampling tool. Execution results

Original datasetTable Name: STG1_GEN_ORDRE_TRecords : 11,912,182Columns: 35

2 4 8 12 160

20000400006000080000

100000

Result Set Row Count

2 4 8 12 160

100

200

300

400

500

Total Quartile Count

2 4 8 12 160

50

100

150

200

250

300

Time (min)

Qrt/ColumnTotal

Quartiles Count

Result Set Row Count

Time

2 67 8754 86

4 126 19459 95

8 242 32231 100

12 346 57258 143

16 436 83605 255

Questions ?

Education

Как приготовить тестовые данные для Big Data проекта. Пример из практики