
ICT619 Intelligent Systems

Topic 6: Data Mining

Data Mining

- Introduction
- Business Applications of Data Mining
- Data Mining Activities
- Data Mining Techniques
- How to Apply Data Mining
- Data Mining Development Methodology

Why data mining?

- "Customers who bought this title also bought ..." (from Amazon.com)
  - Why? More effective (targeted) marketing
  - How? Targeting through association
- Business data is abundant, typically in terabytes
  - Sources include point-of-sale (POS) devices, customer call detail databases, web log files in e-commerce, etc
- Data is being collected mostly for improving the efficiency of underlying operations, but not for analysis

Why data mining? (cont’d)

- Useful information (business intelligence) for gaining competitive advantage can be extracted by "mining" data
- Examples: underlying trends, associations or patterns in market behaviour
- According to Hirji (2001), "... data mining is the analysis and non-trivial extraction of data from databases for the purpose of discovering new and valuable information, in the form of patterns and rules, from relationships between data elements."

Data mining in perspective

- OLAP with data warehouses tells us what is happening and how
- Data mining tells us what is likely to happen
- Data mining is knowledge discovery in (commercial) databases (KDD)
- Data mining is a process rather than a product

Data mining in perspective (cont’d)

- Statistical methods do not scale up to today's problems
- New "intelligent" tools are needed
- Data mining draws from artificial intelligence/soft computing, database theory, data visualization, marketing, statistics, and so on
- Our objectives:
  - Understand the role of data mining in business
  - Distinguish between different data mining techniques
  - Understand how to go about making use of data mining

Business Applications of Data Mining

- Fastest growing segment of the business intelligence market
- Increasingly an integral and necessary component of an organization’s portfolio of analytical techniques
- Data mining for marketing
  - Uses data on customer behaviour to identify target groups for marketing
  - Reduces cost by avoiding groups unlikely to respond

Business Applications of Data Mining (cont’d)

- Data mining for customer relationship management
  - Anticipating customers’ needs and responding to them proactively
- Data mining in R&D
  - Can lower costs during the R&D phase of the product life cycle by analysing voluminous test data
  - Bioinformatics: data mining in biology and medicine

Data Mining Activities

Two broad groups: directed and undirected data mining

- Directed data mining
  - We know what we are looking for
  - We aim to find the value of a pre-identified target variable in terms of a collection of input variables, eg, classifying insurance claims
- Undirected data mining
  - Finds patterns in data
  - Leaves it to the user to find the significance of these patterns
  - Eg, identifying groups of customers with similar buying patterns

Different types of data mining tasks

Data Mining Tasks

- Classification
  - Assigns a given object to a predefined category (class) based on the object’s attributes (features)
  - Objects to be classified are generally database records
  - Discrete outcomes: yes/no, low/medium/high, etc
  - Examples of classification tasks:
    - Assigning keywords to articles
    - Classifying credit applicants as low, medium or high risk
    - Assigning customers to predefined customer segments

Data Mining Tasks (cont’d)

- Estimation
  - Continuously varying outcomes
  - Eg, income, or the probability of a customer leaving (known in data mining circles as churning)
  - Outcomes can also be used for classification by ranking and thresholding
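
The idea of turning an estimate into a classification by ranking and thresholding can be sketched as follows. This is a minimal illustration, not a method prescribed by the slides: the customer identifiers, the churn scores and the 0.5 cut-off are all hypothetical values chosen for the example.

```python
def classify_by_threshold(scores, threshold):
    """Turn estimated scores (e.g. churn probabilities) into yes/no labels."""
    return {customer: score >= threshold for customer, score in scores.items()}


def top_n(scores, n):
    """Rank customers by estimated score, highest first, and keep the top n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]


# Hypothetical estimated churn probabilities for four customers.
churn_scores = {"c1": 0.82, "c2": 0.15, "c3": 0.55, "c4": 0.40}

# Thresholding: everyone at or above the (assumed) cut-off is labelled a likely churner.
labels = classify_by_threshold(churn_scores, threshold=0.5)

# Ranking: pick the two customers with the highest estimated churn probability.
most_at_risk = top_n(churn_scores, n=2)
```

Thresholding suits cases where a fixed cut-off is meaningful, while ranking suits cases where only a fixed number of customers can be contacted.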

- Prediction
  - Classification or estimation task performed to predict some future behaviour
  - Examples include:
    - Predicting which customers will churn in the next six months
    - Predicting the size of a balance that will be transferred

Data Mining Tasks (cont’d)

- Finding affinity groupings or association rules
  - Finds out which things go together, eg, in a supermarket shopping trolley
  - Used for arranging items on shelves or in catalogues
  - Identifies cross-selling opportunities
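
As a rough illustration of "which things go together", the sketch below counts how often items co-occur in shopping baskets. The slides do not name any measures; support and confidence are the standard ones for association rules and are used here as an assumption. The baskets are made-up data.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


# Hypothetical supermarket baskets, one set of items per trolley.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

# How often bread and milk appear together.
bread_milk_support = support(baskets, {"bread", "milk"})

# Strength of the rule "bread -> butter".
bread_to_butter = confidence(baskets, {"bread"}, {"butter"})
```

A rule with high confidence and reasonable support is a candidate for shelf placement or cross-selling decisions.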

- Clustering
  - Segments a group of diverse records into subgroups or clusters containing similar records
  - No predefined classes in clustering; records are grouped based on similarities in their attributes
  - Eg, people with similar buying habits
  - The data miner must interpret the clusters and decide what to do

Data Mining Tasks (cont’d)

- Description and visualisation
  - Helps increase our understanding of the people, products or processes that produced the data
  - A good description can provide an explanation of their behaviour
  - Data visualisation can be very effective in explaining things by exploiting our ability to use visual cues

Data Mining Techniques

- Our aim is a basic understanding of data mining techniques, to find out
  - When to apply them
  - How to interpret their results
  - How to evaluate their performance
- Three major approaches are:
  - Decision trees
  - Automatic cluster detection
  - Artificial neural networks (supervised and unsupervised)

Decision Trees

- Visual representation of a reasoning process
- Particularly suitable for solving classification problems
- Consists of internal nodes, leaf nodes and edges

Fig. A sample decision tree for catalogue mailing (Ganti et al. 1999).

Decision Trees (cont’d)

- Each leaf node is labelled with a class label
- The class label is decided by the class of the records that ended up in that leaf during training
- A leaf node may also contain a value that depends on the average of the values of such records

Fig. A sample decision tree for catalogue mailing (Ganti et al. 1999). Group A contains any self-employed person aged <= 40 and earning a salary of more than $50,000.

Decision Trees (cont’d)

- Each edge originating from an internal node is labelled with a splitting predicate involving that node’s splitting attribute
- The splitting predicates force any record to take a unique path from the root to exactly one leaf node

Fig. A sample decision tree for catalogue mailing (Ganti et al. 1999). Group A contains any self-employed person aged less than 41 and earning a salary of more than $50,000.

How decision trees work

- Each record with N attributes is a point in an N-dimensional record space
- Each branch in a decision tree is a test on a single variable that splits the space into two or more regions
- With each successive test and split, the resulting regions become more and more segregated, with increasing homogeneity among the records
- Ultimately, the leaf nodes will contain the purest batches of records

How decision trees work (cont’d)

- For example, in the sample decision tree, any self-employed person aged <= 40 and earning a salary of more than $50,000 will be classified as belonging to group A

How decision trees work (cont’d)

- Overfitting in decision trees
  - An overfitted decision tree is one that correctly classifies every single record in the training set
  - Such a tree is unlikely to generalise to new data sets
  - To prevent overfitting, a test data set is used to prune the decision tree once it has been built using the training data set

How decision trees are built

- Recursive partitioning
  - An iterative process of splitting the training data into partitions (regions of record space)
  - Initially, all records are in a training set consisting of pre-classified records
  - An algorithm splits up the data, trying every possible binary split on every field of the records
  - The best split is defined as one that creates partitions where a single class predominates

How decision trees are built (cont’d)

- Recursive partitioning (cont’d)
  - The most important task in building a decision tree is to decide which of the attributes (independent fields in a record) gives the best split
  - The measure used to evaluate a potential splitter is the reduction in diversity (or increase in purity)
  - The best split has the largest reduction in diversity
  - One measure of diversity is the Gini index. For a node with two classes in proportions $p_1$ and $p_2 = 1 - p_1$, it is $2 p_1 (1 - p_1) = 2 p_1 p_2$
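
The split evaluation described above can be sketched in a few lines. The sketch uses the general form of the Gini index, 1 minus the sum of squared class proportions, which for two classes equals 2 * p1 * (1 - p1). Weighting each partition by its share of the records is a common convention and an assumption here, as are the function names and the toy age data.

```python
from collections import Counter


def gini(labels):
    """Gini diversity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def diversity_reduction(parent, left, right):
    """Parent diversity minus the size-weighted diversity of the two partitions."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted


def best_threshold(values, labels):
    """Try every binary split `value <= t` on one numeric field and keep the best."""
    best = (None, 0.0)
    for t in sorted(set(values))[:-1]:  # splitting at the maximum leaves one side empty
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        gain = diversity_reduction(labels, left, right)
        if gain > best[1]:
            best = (t, gain)
    return best


# Hypothetical pre-classified training records: one numeric field and a class label.
ages = [25, 32, 38, 45, 52, 60]
groups = ["A", "A", "A", "B", "B", "B"]
threshold, gain = best_threshold(ages, groups)
```

A full tree builder would repeat this search over every field and then recurse into each partition.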

How decision trees are built (cont’d)

- Recursive partitioning (cont’d)
  - The splitting process is applied to each of the new partitions, and so on, until no more useful splits can be found
  - A node becomes a leaf node when no split can be found that significantly decreases the diversity
- Pruning
  - The full decision tree needs to be pruned to improve its performance
  - Pruning is done by removing leaves and branches (edges leading to leaves) that fail to generalise
  - There are a number of pruning methods
  - Eg, a tree is pruned back to the subtree that minimises error on the test set

How decision trees are built (cont’d)

- Different types of decision trees
  - Types depend upon
    - the number of splits allowed at each level
    - how these splits are chosen when the tree is built
    - how the tree is pruned to prevent overfitting
- More broadly, decision trees can be grouped as:
  - Classification trees (leaves represent classes)
  - Regression trees (leaves represent a numeric value)

Algorithms for building decision trees

- Most notable are CHAID, C4.5/C5.0 and CART
- Data mining software tools allow approximation of any of these algorithms by providing a choice of
  - splitting criteria and pruning strategies
  - control parameters such as maximum tree depth

Application of decision trees

- Useful when the data mining task is classification of records or prediction of outcomes
- Also chosen to generate understandable rules, which can be explained and translated into SQL or a natural language
- For example,
  IF age < 41
  AND income > $50,000
  AND employment = self
  THEN belongs to group A
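
A rule read off a decision tree translates directly into code. The sketch below encodes the group A rule as stated in the figure caption of the sample tree (self-employed, aged under 41, salary of more than $50,000). The function name, parameter names and the "self" encoding of employment are hypothetical.

```python
def belongs_to_group_a(age, income, employment):
    """Group A rule from the sample tree: self-employed, under 41, income over $50,000."""
    return age < 41 and income > 50_000 and employment == "self"


# Hypothetical records checked against the rule.
in_group = belongs_to_group_a(age=35, income=60_000, employment="self")
too_old = belongs_to_group_a(age=45, income=60_000, employment="self")
low_income = belongs_to_group_a(age=35, income=40_000, employment="self")
```

Each root-to-leaf path in a tree yields one such rule, which is why trees are easy to explain to business users.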

Automatic Cluster Detection

- Aims to discover structure in a complex data set as a whole in order to carve it up into simpler groups
- Examples of clustering
  - finding products that should be grouped together in a catalogue, or
  - identifying groups of customers with similar tastes in music
- There are many methods for finding clusters in data; a prominent one is K-means clustering

K-means clustering

- Available in a wide variety of commercial data mining tools
- Divides the data set into a predetermined number, k, of clusters
- Initial clusters are centred at random points (seeds) in the record space

K-means clustering (cont’d)

- Records are assigned to the clusters through an iterative process
- In the first step, k data points are selected to be the seeds
- Each seed is an embryonic cluster with only one element
- In the second step, each record is assigned to the cluster whose centroid is nearest to that record
- This forms the new clusters with new intercluster boundaries

K-means clustering (cont’d)

- The centroid of a cluster of records is calculated by taking the average of each field over all the records in that cluster
- After each assignment step the centroids are recalculated, and the process repeats until the cluster assignments stop changing
- Euclidean distance is the measure most commonly used by data mining software
- The distance between two points $P(x_1, x_2, \dots, x_n)$ and $Q(y_1, y_2, \dots, y_n)$ in n-dimensional space is $\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$
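
The k-means steps described on these slides (seed selection, assignment to the nearest centroid, centroid recalculation) can be sketched as below. This is a minimal illustration under stated assumptions rather than what any particular data mining tool does: the function names, the fixed random seed, the iteration cap and the toy two-dimensional records are all hypothetical.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(records):
    """Average of each field over all records in a cluster."""
    n = len(records)
    return tuple(sum(field) / n for field in zip(*records))


def k_means(records, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k data points as seeds (each is a one-element embryonic cluster).
    centroids = rng.sample(records, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each record to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(r, centroids[c])) for r in records
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Step 3: recalculate each centroid from its current members.
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = centroid(members)
    return assignment, centroids


# Hypothetical records forming two well-separated groups.
records = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignment, centroids = k_means(records, k=2)
```

In practice fields are usually scaled to comparable ranges first, since Euclidean distance is dominated by whichever field has the largest numeric spread.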

K-means clustering (cont’d)

- In the k-means method, the original choice of the value of k determines the number of clusters that will be found
- Unless advance knowledge is available on the likely number of clusters, the value of k is determined by trial and error
- Best results are obtained when k matches the underlying structure of the data

Interpreting clusters

- Automatic clustering is undirected data mining
  - We look for something useful without having to know what we are looking for
  - Both an advantage and a possible disadvantage!
- The most frequently used approaches to interpreting clusters are
  - Building a decision tree with the cluster labels as the target variable, and using it to derive rules explaining how to assign new records to the correct cluster
  - Using visualisation to see how the clusters are affected by changes in input variables
  - Examining the differences in the distributions of variables from cluster to cluster, one variable at a time
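
The last approach, comparing variables from cluster to cluster one at a time, can be approximated with a simple per-cluster summary. The sketch below reports only the mean of each variable, which is a simplification of comparing full distributions. The field names, customer records and cluster labels are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def cluster_profiles(records, cluster_labels):
    """Mean of each variable within each cluster."""
    by_cluster = defaultdict(list)
    for record, label in zip(records, cluster_labels):
        by_cluster[label].append(record)
    return {
        label: {var: mean(r[var] for r in members) for var in members[0]}
        for label, members in by_cluster.items()
    }


# Hypothetical customer records and the cluster each was assigned to.
customers = [
    {"age": 22, "monthly_spend": 40},
    {"age": 26, "monthly_spend": 60},
    {"age": 58, "monthly_spend": 200},
    {"age": 62, "monthly_spend": 240},
]
labels = [0, 0, 1, 1]
profiles = cluster_profiles(customers, labels)
```

Variables whose means differ sharply between clusters are the natural starting point for describing what each cluster represents.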

Application of clusters

- Clustering is used
  - When natural groupings are suspected, eg, groups representing customers or products that have a lot in common with each other
  - When there are many competing patterns in the data, making it hard to spot any single pattern
- Creating clusters reduces the complexity within clusters so that other data mining techniques are more likely to succeed

Artificial Neural Networks

- The main generic application of artificial neural networks is pattern recognition or classification
- Estimation and prediction can be viewed as variants of classification
- The best ANN model for performing classification is the backpropagation network (or the multilayer perceptron)
- The ANN model particularly suited for clustering is the Kohonen net or the self-organising map (SOM)

Artificial Neural Networks (cont’d)

- SOM learning algorithms are unsupervised
- Clusters are represented in a SOM by groups of adjacent neurons in the output layer
- A SOM reduces dimensionality from N to 2
- A SOM can serve as a clustering tool as well as a visualisation tool for high-dimensional data
- SOMs are claimed to be often more effective than k-means for complex-shaped clusters

Page 37: ICT619 Intelligent Systems Topic 6: Data Mining. ICT6192 Data Mining  Introduction  Business Applications of Data Mining  Data Mining Activities

ICT619ICT619 3737

Application of neural nets

Artificial neural networks can produce very good results

But they require extensive data preparation involving normalisation and conversion of categorical values to numeric values

They do not work well when there are many hundreds or thousands of input features, because training phases become long

They are difficult to understand because they represent complex non-linear models

Unlike decision trees, they do not readily produce rules

A good choice for most classification and prediction tasks when the results are more important than understanding how the model works
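The data preparation mentioned above (normalisation and conversion of categorical values to numeric values) might look like the following sketch. Min-max scaling and one-hot encoding are common choices but are assumptions here, since the slide does not name a specific method; the function names and sample values are hypothetical.

```python
def min_max_normalise(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(values):
    """Turn each categorical value into a 0/1 vector, one slot per category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


ages = [20, 35, 50]                      # hypothetical numeric input
regions = ["north", "south", "north"]    # hypothetical categorical input

print(min_max_normalise(ages))   # [0.0, 0.5, 1.0]
print(one_hot_encode(regions))   # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

One-hot encoding is used in this sketch rather than assigning codes such as 1, 2, 3, because arbitrary integer codes would suggest an ordering between categories that the network could mistake for a real numeric relationship.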

How to Apply Data Mining

Four ways of utilising data mining expertise in business:

1. By purchasing ready-made scores (such as the creditworthiness of a loan applicant) from outside vendors

2. By purchasing software that embodies data mining expertise designed for a particular application such as credit approval, fraud detection or churn prevention

3. By hiring outside consultants to perform data mining for special projects

4. By developing data mining skills within the business organization

Purchasing scores is quick and easy, but the intelligence is limited to single score values

Purchasing Software

Two possibilities:

- The software may be an actual model, e.g. in the form of a set of rules for decision support, or a fully trained neural network applied to a particular domain

- The software may embody knowledge of the process of building models for a particular domain in the form of a model-creation wizard or template

Purchased models work well if the products, customers, and market conditions match those used to develop the model

Tasks for the data mining model builder

Model-building software automates the process of creating candidate models and selecting the ones that perform best

Significant tasks left for the user:

- Choosing a suitable business problem to be addressed by data mining

- Identifying and collecting data that is likely to contain the information needed to answer the business question

- Pre-processing the data so that the data mining tool can make use of it

- Transforming the database so that the input variables needed by the model are available

- Designing a plan of action based on the model and implementing it in the marketplace

- Measuring results of the actions and feeding them back into the database for future mining
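The automated step mentioned at the top of this slide (creating candidate models and selecting the ones that perform best) can be sketched roughly as below. This is only an illustration under stated assumptions: the candidates are simple threshold rules on one hypothetical input, the held-out records are invented, and accuracy is assumed as the performance measure. Real model-building software would generate far richer candidates.

```python
def make_threshold_model(threshold):
    """Candidate model: predict True when the input reaches the threshold."""
    def model(x):
        return x >= threshold
    return model


def accuracy(model, rows):
    """Fraction of (input, label) rows that the model classifies correctly."""
    correct = sum(1 for x, label in rows if model(x) == label)
    return correct / len(rows)


def select_best(candidates, validation_rows):
    """Return the name of the candidate with the highest validation accuracy."""
    return max(candidates, key=lambda name: accuracy(candidates[name], validation_rows))


# Hypothetical held-out records: (monthly spend, responded to offer).
validation_rows = [(10, False), (25, False), (40, True),
                   (55, True), (30, False), (45, True)]

candidates = {
    "threshold_20": make_threshold_model(20),
    "threshold_35": make_threshold_model(35),
    "threshold_50": make_threshold_model(50),
}

print(select_best(candidates, validation_rows))  # threshold_35
```

Everything around this step (choosing the problem, assembling and transforming the data, acting on the result and measuring the outcome) remains with the user, as the list above points out.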

Hiring Outside Experts

Recommended approach if:

- The organization is in the early stages of integrating data mining into its business

- The data mining activity is to be a one-off process

Not recommended if:

- It is to be an ongoing process, e.g. data mining for customer relationship management

Outside expertise for data mining is likely to be available in three possible places

Hiring Outside Experts (cont’d)

Outside expertise for data mining is likely to be available in three possible places:

1. A data mining software vendor

2. Data mining centres
   Usually collaborations between universities and private companies, e.g. Monash Data Mining Centre

3. Consulting companies
   The consulting company chosen should have experience specifically in the area of interest to the organisation

Developing In-house Expertise for Data Mining

Applies particularly to companies that have many products and customers

Should be a core competency of all large-scale businesses

Data Mining Development Methodology

Best practice is yet to emerge (Hirji 2001)

A proposed five-stage model (Cabena et al. 1998):

1. Business objective determination
2. Data preparation
3. Data mining
4. Results analysis
5. Knowledge assimilation

Data Mining Development Methodology (cont’d)

Cabena’s five-stage model:

- Business objective determination: clearly identifying the business problem to be mined

- Data preparation: data selection, preprocessing and transformation

- Data mining: algorithm selection and execution

- Results analysis: has anything new or interesting been found?

- Knowledge assimilation: formulating ways of exploiting the new information extracted

A Case Study (Hirji 2001)

Involved a large fast food outlet

Brought out some deficiencies of the above methodology

A new set of stages for data mining development and use was proposed:

1. Business objective determination
2. Data preparation
3. Data audit
4. Interactive data mining and results analysis
5. Back end data mining
6. Results synthesis and presentation

A Case Study (Hirji 2001) (cont’d)

The case study used IBM’s Intelligent Miner for Data on AIX as the data mining tool

It took 20 actual days of effort across the 6 stages above

Back end data mining involves data enrichment and additional data mining algorithm execution by the data mining specialist

Distribution of time required:

- 45% taken up by stages 4, 5 and 6

- 30% required by the data preparation stage (70% predicted in the earlier model)

- Use of a data warehouse saved time needed for selecting, cleaning, transforming, coding, and loading the data
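As a quick worked check of the figures above, the percentages can be converted into days. This sketch assumes the percentages apply to the 20 days of total effort, which the slide implies but does not state outright. The remaining 25% is simply what is left over for the other stages; the slide does not break it down further.

```python
TOTAL_DAYS = 20  # total effort reported for the case study


def days_for(percent, total_days=TOTAL_DAYS):
    """Convert a percentage share of the total effort into days."""
    return percent * total_days / 100


print(days_for(45))            # stages 4, 5 and 6: 9.0 days
print(days_for(30))            # data preparation: 6.0 days
print(days_for(70))            # data preparation if the 70% prediction had held: 14.0 days
print(days_for(100 - 45 - 30)) # share left for the other stages: 5.0 days
```

Under this reading, data preparation took about 6 days instead of the 14 that the earlier model would have predicted, a difference the slide attributes to the data warehouse.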

A Case Study (Hirji 2001) (cont’d)

Interactive data mining and results analysis stage

- Linking data mining results with business strategy and using application software such as spreadsheets to perform sensitivity analysis of results obtained

- Aims to demonstrate how data mining results support business strategy

REFERENCES

Berry, M., & Linoff, G. Mastering Data Mining, Wiley Computer Publishing, New York 2000.

Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., & Zanasi, A. Discovering Data Mining: From Concept to Implementation, Prentice Hall, Englewood Cliffs, NJ 1998.

Dhar, V., & Stein, R. ”Deriving Rules from Data” in Seven Methods for Transforming Corporate Data into Business Intelligence, Prentice Hall 1997, pp. 167-189, 251-258.

Ganti, V., Gehrke, J., & Ramakrishnan, R. Mining Very Large Databases, IEEE Computer, Vol. 32, No. 8, August 1999, pp. 38-45.

Hirji, K. Exploring Data Mining Implementation, Communications of the ACM, Vol. 44, No. 7, July 2001, pp. 87-93.

Web site on Data Mining and Web Mining - http://www.kdnuggets.com/software/suites.html