7/30/2019 datmindata minig
1/39
Data Mining
Rajagopal Sukumar
Cognizant Technology Solutions
Agenda
What is Data Mining ?
Data Mining Techniques
Data Mining Process
Our work in Data Mining
Tools available in the market
What is Data Mining ?
Data mining is the search for relationships
and global patterns that exist in large
databases but are `hidden' among the vast
amount of data
These relationships represent valuable
knowledge about the database and the
objects in the database and, if the database
is a faithful mirror, of the real world registered
by the database.
What is Data Mining ?
The analogy with the mining process is
described as:
Data mining refers to "using a variety of
techniques to identify nuggets of information or
decision-making knowledge in bodies of data,
and extracting these in such a way that they can
be put to use in the areas such as decision
support, prediction, forecasting and estimation.
The data is often voluminous, but as it stands of
low value as no direct use can be made of it; it is
the hidden information in the data that is useful"
Why do we need Data Mining ?
We need it because everybody needs it !
To uncover strategic competitive insight
to drive market share and profits
What can we do with our data ?
Derive Quantitative Information
How many people bought our products last month ?
Explain Past Results
Why have monthly sales of our products declined
sharply ?
Discover Hidden Patterns
Households with a male HOH (Head of Household) are more likely to
have both cats and dogs than those with a female HOH. The actual
ratio is 7:3.
Predict Future Results
So those households in our customer base that have a male
Head of Household are likely to have both cats and dogs. If we
are a pet food supplier, think about the value of this prediction !
Transforming Data
Data
Facts/Information
Knowledge
Recommendations/Decisions
OLAP Vs. Data Mining
                      OLAP                Data Mining
Focus                 Summary Data        Detail Data
Dimensions            Limited             Lots
No. of attributes     Total in the tens   Hundreds
Size of datasets      Small to medium     Millions
Analysis              Deductive           Predictive
Technique             Slice and Dice      Automatic Discovery
State of technology   Mature              Mature in Statistical Analysis / Emerging in Knowledge Discovery
Data Mining Methods
Decision Trees
Case Based Reasoning
Neural Networks
Genetic Algorithms
Linear and Non-Linear Regression Analysis
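Decision Tree
Toy Type   Buyer Sex   Sales Month   Location   Qty
Car        Boys        Jan           FL         50,000
Car        Boys        Jan           GA         10,000
Doll       Girls       Feb           FL         20,000
Doll       Girls       Feb           CA         15,000
Car        Boys        Mar           NY         20,000
[Figure: decision tree branching on Toy Type (Car, ...), Buyer Sex (Boys, Girls), Sales Month (Jan, Feb, ...) and Location; leaf FL 50,000 is marked Highest and leaf GA 10,000 is marked Lowest]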
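A minimal sketch of how the decision tree on this slide can be read, assuming the five-row toy sales table and a fixed split order (Toy Type, then Buyer Sex, then Sales Month, with Location as the leaf). The names `build_tree` and `extremes` are hypothetical, and this only walks a hand-chosen tree; it is not a tree-induction algorithm such as C4.5.

```python
from collections import defaultdict

# The five rows of the toy sales table: (toy_type, buyer_sex, month, location, qty)
ROWS = [
    ("Car", "Boys", "Jan", "FL", 50_000),
    ("Car", "Boys", "Jan", "GA", 10_000),
    ("Doll", "Girls", "Feb", "FL", 20_000),
    ("Doll", "Girls", "Feb", "CA", 15_000),
    ("Car", "Boys", "Mar", "NY", 20_000),
]

def build_tree(rows):
    """Nest the rows as toy_type -> buyer_sex -> month -> {location: qty}."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    for toy, sex, month, location, qty in rows:
        tree[toy][sex][month][location] = qty
    return tree

def extremes(leaves):
    """Return the (highest, lowest) selling locations among one branch's leaves."""
    highest = max(leaves, key=leaves.get)
    lowest = min(leaves, key=leaves.get)
    return highest, lowest

tree = build_tree(ROWS)
# Follow the branch Car -> Boys -> Jan and compare its leaves.
print(extremes(tree["Car"]["Boys"]["Jan"]))
```

Following the Car, Boys, Jan branch singles out FL as the highest leaf and GA as the lowest, which matches the markers in the figure.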
Case based Reasoning (CBR)
Finds the closest situation that occurred
in the past and adopts the same
solution that was the right one
Disadvantage is that CBR systems do
not create rules or models summarizing
the past experiences
Example: Help Desk Support Systems
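A minimal sketch of the retrieval step in a help desk setting, assuming each past case is stored as a set of symptom keywords plus the solution that worked. The cases, the Jaccard similarity measure and the function names are all hypothetical choices for illustration.

```python
def jaccard(a, b):
    """Similarity of two keyword sets: shared keywords / all keywords."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical case base: (symptoms, solution that was the right one)
CASES = [
    ({"printer", "offline", "network"}, "Restart the print spooler"),
    ({"password", "expired", "login"}, "Reset the password"),
    ({"screen", "flicker", "driver"}, "Update the display driver"),
]

def closest_solution(symptoms, cases=CASES):
    """Find the most similar past case and adopt its solution."""
    best_case = max(cases, key=lambda case: jaccard(symptoms, case[0]))
    return best_case[1]

print(closest_solution({"login", "password", "locked"}))
```

Note that nothing is generalized here: the case base is searched afresh for every query, which is the disadvantage the slide mentions.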
Neural Networks
Mimic the way learning occurs in the
brain
They are used extensively in the
business world as predictive models
Each neuron takes many inputs and
generates an output that is a non-linear
function of the weighted sum of inputs
Neural Networks
[Figure: network diagram with inputs Toy Type, Buyer Sex, Location, Sale Month and Quantity; nodes n1, n2, n3 and n4; outputs Good and Bad]
Neural Networks
y = Good or Bad
y = w1n1 + w2n2 + w3n3 + w4n4
The weights w1..w4 can be calculated
using backward propagation by training
the net using known values of y and the
inputs
Then the net can be used for
predictions
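A minimal sketch of training the weights and then predicting, assuming a single output neuron that applies a sigmoid to the weighted sum w1n1 + w2n2 + w3n3 + w4n4. The training data, learning rate and function names are hypothetical; with only one neuron, backward propagation reduces to plain gradient descent on the weights.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, inputs):
    """Non-linear (sigmoid) function of the weighted sum of the inputs."""
    return sigmoid(sum(w * n for w, n in zip(weights, inputs)))

def train(samples, labels, rate=0.5, epochs=200):
    """Batch gradient descent on the log loss, starting from zero weights."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for inputs, y in zip(samples, labels):
            error = predict(weights, inputs) - y  # how far the output is from the known y
            for i, n in enumerate(inputs):
                grads[i] += error * n
        weights = [w - rate * g for w, g in zip(weights, grads)]
    return weights

# Hypothetical known values: four inputs n1..n4, label 1 = Good, 0 = Bad
SAMPLES = [(1, 0, 0, 1), (1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1)]
LABELS = [1, 1, 0, 0]

weights = train(SAMPLES, LABELS)
verdicts = ["Good" if predict(weights, s) > 0.5 else "Bad" for s in SAMPLES]
print(verdicts)
```

A real network would add hidden layers and bias terms; they are left out here to keep the weight update easy to follow.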
Genetic Algorithms
Mimic the evolutionary process of
natural selection
A fitness function determines
which solutions are better fits
Then genetic operations (mutation and
mating) are performed to generate more
solutions
Currently in research mode rather than
in practical applications
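A minimal sketch of the loop described above, assuming a toy problem in which a solution is a list of bits and fitness is simply the number of ones. The population size, mutation rate, seed and the function names are hypothetical choices for illustration.

```python
import random

def fitness(bits):
    """Toy fitness function: more ones means a better fit."""
    return sum(bits)

def evolve(pop_size=20, length=16, generations=40, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # selection: keep the fitter half
        children = [population[0][:]]          # carry the best solution over unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = mother[:cut] + father[cut:]  # mating (one-point crossover)
            if rng.random() < 0.2:               # occasional mutation
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        population = children
    return max(population, key=fitness), history

best, history = evolve()
print(fitness(best), history[0])
```

Because the best solution of each generation is copied into the next one, the recorded best fitness can never decrease.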
Linear and Non-Linear Regression
Searching for a dependence of the
target variable on other variables, expressed
as a function of some predetermined
polynomial form
Quantity = A * Buyer Sex + B * Location +
C * Month (This is linear !)
Solving this equation for A, B, C using
the available data yields a predictive
model
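A minimal sketch of solving for A, B and C by least squares, assuming Buyer Sex, Location and Month have already been encoded as numbers. The encodings, the sample rows and the function names are hypothetical; the quantities are generated from known coefficients so that the fit can be checked.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_linear(features, targets):
    """Least-squares coefficients from the normal equations (X^T X) w = X^T y."""
    k = len(features[0])
    xtx = [[sum(row[i] * row[j] for row in features) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * t for row, t in zip(features, targets)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical numeric encodings: (buyer_sex, location, month)
FEATURES = [(0, 1, 1), (1, 2, 1), (0, 3, 2), (1, 1, 3), (1, 3, 2)]
# Quantities generated with A = 2, B = 3, C = 5 so the result is checkable
TARGETS = [2 * s + 3 * l + 5 * m for s, l, m in FEATURES]

a, b, c = fit_linear(FEATURES, TARGETS)
print(round(a, 6), round(b, 6), round(c, 6))
```

With noisy real data the recovered coefficients would only approximate the underlying relationship, and categorical fields such as Location would usually need a more careful encoding than a single number.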
Usage
Clustering
Grouping data into disjoint sets that are
similar in some respect. It also attempts to
place dissimilar data in different clusters.
For example, in the context of supermarket
data, clustering of sale items to
perform effective shelf space
organization is a typical application
Clustering algorithms typically use a
distance function to separate data
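A minimal sketch of distance-based clustering, assuming k-means with Euclidean distance as one possible choice of algorithm and distance function. The six sale items, described here by two made-up numeric attributes, and the starting centroids are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid
    to the mean of its members; repeat a fixed number of times."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave a centroid in place if it attracted no points
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical sale items described by two numeric attributes
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = kmeans(POINTS, [POINTS[0], POINTS[-1]])
print(labels)
```

The resulting clusters are disjoint by construction, since every point receives exactly one label.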
Usage
Classification
Classifies data into distinctive groups
For example, people can be categorized
into the classifications of babies,
children, teenagers, adults, and elderly.
For instance, an age of two years or younger
can be mapped to babies.
Once data is classified, traits of these
groups can be summarized
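A minimal sketch of mapping ages to the five groups and then summarizing them. Only the rule for babies (two years or younger) comes from the slide; the other age boundaries and the sample ages are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import Counter

# Upper age bound of each group except the last; only the first bound (2)
# is taken from the slide, the rest are assumed.
UPPER_AGES = [2, 12, 19, 64]
GROUPS = ["babies", "children", "teenagers", "adults", "elderly"]

def classify(age):
    """Map an age in years to its group."""
    return GROUPS[bisect_left(UPPER_AGES, age)]

ages = [1, 2, 7, 15, 34, 41, 70]
summary = Counter(classify(age) for age in ages)  # one simple trait: group size
print(summary)
```

Once every record carries a group label, other traits such as average spend per group can be summarized in the same way.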
Usage
Deviation Detection
Extracting anomalies or deviations in the
data
An anomaly may show a new fact of great
interest
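A minimal sketch of one simple way to extract deviations, assuming a value counts as an anomaly when it lies more than two standard deviations from the mean. The threshold, the sample figures and the function name are hypothetical.

```python
from statistics import mean, pstdev

def deviations(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily sales counts with one unusual day
daily_sales = [20, 21, 19, 22, 20, 21, 19, 95]
print(deviations(daily_sales))
```

The flagged value is only a candidate: whether it is a data error or a new fact of interest still has to be decided by looking at it.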
Usage
Association Rules
Extracting associations between data
items. Can be used to predict the value of
one object based on the value of another.
Find a model that identifies the most
predictive characteristics of people
buying toy pickup trucks ?
Answer - During summer vacation,
single parent families with certain
income levels buy toy pickup trucks
Association Rules
70% of customers who order pens and
pencils also order writing tablets
If Writing Tablets are high margin items,
discover all associations that have
Writing Tablets as a consequent
If pencils are low margin items, discover
all associations that have pencils as an
antecedent to determine the impact of
discontinuing pencils
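A minimal sketch of how the 70% figure can be computed as the confidence of the rule {pen, pencil} -> {writing tablet}. The order data below is made up so that the rule comes out at exactly 70%, and the function names are hypothetical.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)

def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)

# Hypothetical orders: 10 contain pen and pencil, 7 of those add a writing tablet
ORDERS = (
    [{"pen", "pencil", "writing tablet"}] * 7
    + [{"pen", "pencil"}] * 3
    + [{"eraser"}] * 5
)

print(confidence(ORDERS, {"pen", "pencil"}, {"writing tablet"}))
```

Filtering the discovered rules by consequent (writing tablets) or by antecedent (pencils) then answers the two margin questions on the slide.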
Data Preparation
Data Cleansing
Inconsistencies
Toy types soft and plush mean the same
Stale Data
Address changes are not reflected correctly
Typographical Errors
Words are misspelled or typed incorrectly
Missing Values
Tough problem to address
Data Cleansing - Missing Values
Treatment of missing numeric values is
more difficult
Artificial assignment changes the distribution
and statistics of the field
Assign using average values
Segment data using another variable and
assign segment averages
Build a model and impute the missing
values (the best method)
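A minimal sketch of the segment-average option, assuming records are dictionaries and a missing value is stored as `None`. The field names, the sample records and the function name are hypothetical; the overall average is used as a fallback when a segment has no known values.

```python
from collections import defaultdict
from statistics import mean

def impute_by_segment(records, segment_key, value_key):
    """Fill missing values with the average of the record's segment."""
    known = defaultdict(list)
    for record in records:
        if record[value_key] is not None:
            known[record[segment_key]].append(record[value_key])
    overall = mean(v for values in known.values() for v in values)
    filled = []
    for record in records:
        record = dict(record)  # copy so the input is left untouched
        if record[value_key] is None:
            values = known.get(record[segment_key])
            record[value_key] = mean(values) if values else overall
        filled.append(record)
    return filled

# Hypothetical records segmented by location, two of them with a missing qty
RECORDS = [
    {"location": "FL", "qty": 50.0},
    {"location": "FL", "qty": None},
    {"location": "FL", "qty": 30.0},
    {"location": "GA", "qty": 10.0},
    {"location": "GA", "qty": None},
]

filled = impute_by_segment(RECORDS, "location", "qty")
print([record["qty"] for record in filled])
```

Segment averages distort the field's distribution less than a single global average, but they still shrink its variance, which is why a fitted model is listed as the better method.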
Data Transformation
Ratio Variables
Time derivatives
Discretization using quantiles
Discretization using other mathematical
transforms
Time Derivatives
Variation of data over time is very
important to understand
For example, toy sales time derivative = toy
sales of current month - toy sales of
previous month
Cyclic Association Rules can be
identified
monthly sales of goods may have different
correlations based on the season
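A minimal sketch of the month-over-month difference described above, assuming monthly sales are held in a list ordered by month. The figures and the function name are hypothetical.

```python
def month_over_month(sales):
    """Current month's sales minus the previous month's, for each month after the first."""
    return [current - previous for previous, current in zip(sales, sales[1:])]

# Hypothetical monthly toy sales, oldest month first
toy_sales = [60_000, 35_000, 20_000, 45_000]
changes = month_over_month(toy_sales)
print(changes)
```

The derived series has one element fewer than the original and can be fed to the mining algorithm as an additional variable.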
Data Mining Process
Choose the study
Classification/Clustering
Deviation Detection
Affinity Analysis
Run the algorithm on the prepared data
Analyze the outputs
Make decisions
Our Approach
Demystification of Data Mining
Built a Windows based Prototype to
demonstrate decision trees
Working on adding a module to our
Adhoc Query Generator - Extempore
What is Extempore ?
EXTract M204 and Process On REquest
Generates native M204 UL code
Reports generated on multiple M204 files
without any M204 coding
Complex report formatting with the help of
reporting tools like info-maker
Provides user friendly GUI
Dynamically generates customized reports
What is Extempore ?
Structured user interface
Point & click methodology
Limited M204 knowledge required to use
Quick access to M204 data
Reports can be copied/saved and reused
Data retrieved can be saved in formats like
Excel, CSV or HTML tables to be used by
other systems
Online & batch modes of execution
Extempore Architecture
[Figure: architecture diagram with components labelled CT LIB and JANUS]
RPC to Sybase & results from RPC to client
Sybase routes client RPC to M204
Hidden connection from M204 to Sybase to read report specification
Tools in the market
IBM Intelligent Miner
Data Mind Corp's Data Mind Professional Edition
Angoss Software's Knowledge Seeker
Neuralware's Neuralworks Predict
Pilot Software's Discovery Server
Redbrick Systems' Data Mine
Thinking Machines Corp's Darwin
Web sites
Excellent reference sites
http://www.thearling.com
http://www.kdnuggets.com
Source code sites
C4.5 Decision Tree Algorithm
http://ftp.cs.su.oz.au/pub/ml/
OC1 Decision Tree Algorithm
http://www.cs.jhu.edu/
Thank You !