ali-a-jalil
Outline
Introduction
Data Mining vs Text Mining
Text Mining Process
Text Mining Applications
Challenges in Text Mining
Conclusion
Introduction
• What is Text Mining?
– Text mining is the analysis of data contained in natural language text.
Introduction
• Why Text Mining?
– A massive amount of new information is being created: the world’s data doubles every 18 months (Jacques Vallee, Ph.D.)
– 80-90% of all data is held in various unstructured formats
– Useful information can be derived from this unstructured data
Unstructured Data Examples “Ore”
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• Customer complaint letters
• Contracts
• Transcripts of phone calls with customers
• Technical documents
Reasons for Text Mining
[Figure: bar chart comparing Collections of Text with Structured Data; axis: Percentage (0 to 90)]
How Text Mining Differs from Data Mining
Data Mining
• Identify data sets
• Select features
• Prepare data
• Analyze distribution
Text Mining
• Identify documents
• Extract features
• Select features by algorithm
• Prepare data
• Analyze distribution
Mining
Filtering: remove punctuation and special characters.
Segmentation: split the document into words.
Stemming: techniques used to find the root (stem) of a word. E.g.,
– user, users, used, using
• Stem (root): use
– engineering, engineered, engineer
• Stem (root): engineer
Usefulness
• Improving the effectiveness of retrieval and text mining
– matching similar words
• Reducing index size
– combining words with the same root may reduce index size by as much as 40-50%.
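The filtering and segmentation steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `filter_text` and `segment` are hypothetical, and lowercasing the tokens is an added assumption, not something the slide prescribes.

```python
import re

def filter_text(text: str) -> str:
    """Filtering: replace punctuation and special characters with spaces."""
    return re.sub(r"[^\w\s]", " ", text)

def segment(text: str) -> list[str]:
    """Segmentation: split the filtered text into lowercase word tokens."""
    # Lowercasing is an assumption made here so that "Users" and "users" match.
    return filter_text(text).lower().split()

print(segment("Users engineered it; the engineering works!"))
```

Replacing punctuation with a space rather than deleting it keeps words on either side of a symbol from being glued together.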
Mining
Basic stemming methods
• Remove endings
– If a word ends with a consonant other than s, followed by an s, then delete the s.
– If a word ends in es, drop the s.
– If a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th.
– If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
– …...
• Transform words
– If a word ends with “ies” but not “eies” or “aies” then “ies ”
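The ending-removal rules listed above translate fairly directly into code. The sketch below is a rough illustration, not a complete stemmer: `simple_stem` is a hypothetical name, the order in which the rules are tried is a choice made here, and because the transform rule is cut off on the slide, the code assumes the common replacement of "ies" by "y".

```python
VOWELS = set("aeiou")

def is_consonant(ch: str) -> bool:
    return ch.isalpha() and ch not in VOWELS

def simple_stem(word: str) -> str:
    w = word.lower()
    # Transform rule. ASSUMPTION: the slide's rule is truncated; "ies" -> "y" is assumed.
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Word ends in "es": drop the "s".
    if w.endswith("es"):
        return w[:-1]
    # Consonant other than "s", followed by "s": delete the "s".
    if len(w) >= 2 and w.endswith("s") and is_consonant(w[-2]) and w[-2] != "s":
        return w[:-1]
    # Word ends in "ing": delete it unless one letter or "th" would remain.
    if w.endswith("ing"):
        rest = w[:-3]
        return rest if len(rest) > 1 and rest != "th" else w
    # Word ends in "ed" preceded by a consonant: delete it unless one letter would remain.
    if w.endswith("ed") and len(w) >= 3 and is_consonant(w[-3]):
        rest = w[:-2]
        return rest if len(rest) > 1 else w
    return w

for word in ["users", "engineering", "engineered", "thing", "studies"]:
    print(word, "->", simple_stem(word))
```

Rules this simple are crude: "using", for instance, is reduced to "us" rather than "use", which is why practical stemmers add further rules and exception lists.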
Mining
Eliminating excessive words: remove words that carry no meaning by themselves, such as prepositions, conjunctions, and conditional particles.
This is done by comparing each word against a list of such words.
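Comparison against a word list, as described above, amounts to a set lookup. In this minimal sketch the contents of `STOP_WORDS` are a small illustrative sample chosen here, not a list taken from the slides.

```python
# Illustrative sample only: a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to",
              "and", "or", "but", "if", "unless"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "analysis", "of", "text"]))
```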
Canonical Names
President Bush
Mr. Bush
George Bush
Canonical Name:
George Bush
• The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document
• Reduces ambiguity of variants
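One very simple way to pick a canonical name from the variants above is to strip titles and keep the variant with the most remaining name parts. This is only a heuristic assumed for illustration: the `TITLES` set and the `canonical_name` function are hypothetical, and the sketch presumes the variants have already been recognized as referring to the same person.

```python
# Hypothetical title list, used only for this illustration.
TITLES = {"president", "mr.", "mr", "mrs.", "mrs", "ms.", "ms", "dr.", "dr"}

def canonical_name(variants: list[str]) -> str:
    """Return the variant with the most name parts once titles are removed."""
    def name_tokens(variant: str) -> list[str]:
        return [t for t in variant.split() if t.lower() not in TITLES]
    best = max(variants, key=lambda v: len(name_tokens(v)))
    return " ".join(name_tokens(best))

print(canonical_name(["President Bush", "Mr. Bush", "George Bush"]))
```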
Mining
Clipping: eliminate words that appear with very high or very low frequency.
o Low-frequency words form small clusters that are not useful, and high-frequency words appear everywhere, so they are not useful either.
o There are many ways to calculate a word’s frequency in a document or a collection of documents.
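As one possible reading of clipping, the sketch below counts in how many documents each word occurs (document frequency) and keeps only words inside a band. The function name `clip_vocabulary` and the thresholds `min_df` and `max_df_ratio` are assumptions chosen for illustration; the slides do not specify a particular frequency measure or cutoff.

```python
from collections import Counter

def clip_vocabulary(docs: list[list[str]], min_df: int = 2,
                    max_df_ratio: float = 0.9) -> set[str]:
    """Keep words that are neither too rare nor too common across documents."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per document
    n = len(docs)
    return {w for w, c in df.items() if c >= min_df and c / n <= max_df_ratio}

docs = [
    ["text", "mining", "data"],
    ["text", "mining", "web"],
    ["text", "spam", "web"],
    ["text", "claims", "mining"],
]
print(sorted(clip_vocabulary(docs)))
```

Here "text" is clipped because it occurs in every document, while "data", "spam", and "claims" are clipped because each occurs in only one.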
Mining
Clustering: group interrelated documents based on their topics.
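The slides do not name a clustering algorithm, so the following is only one simple possibility: single-pass (leader) clustering with Jaccard similarity over each document's word set. The names `jaccard` and `cluster_documents` and the `threshold` value are assumptions made for this sketch.

```python
def jaccard(a: set, b: set) -> float:
    """Share of words two documents have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_documents(docs: list[list[str]], threshold: float = 0.3) -> list[list[int]]:
    """Assign each document to the first cluster whose leader is similar enough,
    otherwise start a new cluster. Returns clusters as lists of document indices."""
    clusters = []  # each entry: (leader word set, member indices)
    for i, tokens in enumerate(docs):
        terms = set(tokens)
        for leader, members in clusters:
            if jaccard(terms, leader) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((terms, [i]))
    return [members for _, members in clusters]

docs = [
    ["insurance", "claim", "fraud"],
    ["insurance", "claim", "payout"],
    ["patent", "portfolio", "filing"],
    ["patent", "filing", "law"],
]
print(cluster_documents(docs))
```

The result depends on document order and on the threshold, which is one reason more robust methods are usually preferred for large collections.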
Text Mining: Analysis
• Which words occur most often?
• Which words are most interesting?
• Which words help define the document?
• What are the interesting text phrases?
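Two of these questions can be approximated with simple counts. In the sketch below, raw frequency stands in for "most present", and a TF-IDF weight stands in for "words that help define the document". TF-IDF is a common choice assumed here, not a method named in the slides, and `tf_idf` is a hypothetical helper.

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {word: score} dict per document.
    score = (term frequency in the document) * log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        scores.append({w: (c / len(tokens)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores

docs = [["text", "mining", "text"], ["text", "spam"]]
print(Counter(docs[0]).most_common(1))  # most frequent word in the first document
print(tf_idf(docs)[0])                  # words unique to a document score higher
```

A word that occurs in every document receives a score of zero, which matches the intuition that it does not help distinguish one document from another.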
Text mining applications
• Call Center Software.
• Anti-Spam.
• Market Intelligence.
• Web mining.
Actual examples
• A clinical center in the USA was able to identify a gene responsible for a harmful disease by processing more than 150,000 newspapers.
• Text mining of the Holy Quran.
• Etc.
Challenges in Text Mining
• Information is in unstructured textual form, written in natural language (NL).
• It is not readily accessible to computers.
• Huge collections of documents must be handled.
• A skilled person is required to choose which documents to process and to analyze the output.
• It requires more time.
• Cost: $50,000 for the software alone.
More information
• The Central Intelligence Agency (CIA) is the strongest supporter of text mining.
- The events of September 11.
- Mining of e-mail, chat rooms, and social networks.
- It therefore supports many companies, such as Attensity, Inxight, and Intelliseek.
More information
• SPSS company statistics: text mining software has very few users compared with data mining software.
Conclusion
• Finally, most sources indicate that the field of text mining is still in the research phase,
• and its applications are still of limited operational use at present,
• but it offers promising possibilities in many areas: it helps to make sense of huge amounts of text and to extract the core information that is important and useful.