Text mining

TEXT MINING

seminar submitted by:

Ali Abdul_Zahraa, MSc, Mathcomp, UOK

ali.abdulzahraa@gmail.com

Outline

Introduction

Data Mining vs Text Mining

Text Mining Process

Text Mining Applications

Challenges in Text Mining

Conclusion

Introduction

• What is Text Mining?

– Text mining is the analysis of data contained in natural language text

Introduction

• Why Text Mining?

– Massive amounts of new information are being created: the world’s data doubles every 18 months (Jacques Vallee, Ph.D.)

– 80-90% of all data is held in various unstructured formats

– Useful information can be derived from this unstructured data

Unstructured Data Examples “Ore”

• Email

• Insurance claims

• News articles

• Web pages

• Patent portfolios

• Customer complaint letters

• Contracts

• Transcripts of phone calls with customers

• Technical documents

Reasons for Text Mining

[Figure: bar chart comparing collections of text with structured data; axis: Percentage, 0 to 90]

How Text Mining Differs from Data Mining

Data Mining

• Identify data sets

• Select features

• Prepare data

• Analyze distribution

Text Mining

• Identify documents

• Extract features

• Select features by algorithm

• Prepare data

• Analyze distribution

Mining

Filtering: remove punctuation and special characters.

Segmentation: segment the document into words.

Stemming: techniques used to find the root/stem of a word. E.g.,

– user, users, used, using

• Stem (root): use

– engineering, engineered, engineer

• Stem (root): engineer

Usefulness

• improving the effectiveness of retrieval and text mining

– matching similar words

• reducing indexing size

– combining words with the same roots may reduce indexing size by as much as 40-50%.
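The filtering and segmentation steps on this slide can be sketched in a few lines of Python. This is only a minimal illustration, assuming whitespace-delimited text; the function names `filter_text` and `segment` are hypothetical and not part of the seminar.

```python
import re

def filter_text(text: str) -> str:
    """Filtering: replace punctuation and special characters with spaces."""
    # [^\w\s] matches anything that is not a letter, digit, underscore or
    # whitespace; the extra "|_" also removes underscores.
    return re.sub(r"[^\w\s]|_", " ", text)

def segment(text: str) -> list:
    """Segmentation: split the filtered, lower-cased text into words."""
    return filter_text(text).lower().split()

if __name__ == "__main__":
    print(segment("Hello, world! Text-mining rocks."))
```

Lower-casing is an extra normalization step assumed here so that "Text" and "text" count as the same word.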

Mining

Basic stemming methods

• remove ending

– if a word ends with a consonant other than s, followed by an s, then delete the s.

– if a word ends in es, drop the s.

– if a word ends in ing, delete the ing unless the remaining word consists only of one letter or of th.

– if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.

– …...

• transform words

– if a word ends with “ies” but not “eies” or “aies”, then replace “ies” with “y”.
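The rules listed on this slide can be turned into a small function. The sketch below is one possible reading, not a complete stemmer: the order in which the rules are tried is an assumption, the "ies" rule is taken in its commonly cited form ("ies" becomes "y"), and `basic_stem` is a hypothetical name.

```python
VOWELS = set("aeiou")

def basic_stem(word: str) -> str:
    """Apply the slide's suffix rules; at most one rule fires per word."""
    w = word.lower()
    # Transform rule: "ies" -> "y", but not for "eies" or "aies".
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Word ends in "es": drop the "s".
    if w.endswith("es"):
        return w[:-1]
    # Word ends in a consonant other than "s", followed by "s": delete the "s".
    if len(w) >= 2 and w.endswith("s") and w[-2] not in VOWELS and w[-2] != "s":
        return w[:-1]
    # Word ends in "ing": delete it unless one letter or "th" would remain.
    if w.endswith("ing"):
        rest = w[:-3]
        return rest if len(rest) > 1 and rest != "th" else w
    # Word ends in "ed" preceded by a consonant: delete it unless a single
    # letter would remain.
    if w.endswith("ed") and len(w) >= 3 and w[-3] not in VOWELS:
        rest = w[:-2]
        return rest if len(rest) > 1 else w
    return w

if __name__ == "__main__":
    for word in ["users", "engineering", "engineered", "studies", "thing"]:
        print(word, "->", basic_stem(word))
```

Rules this simple are crude: they do not reduce "used" to "use", for example, which is why practical systems rely on more elaborate stemming algorithms.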

Mining

Eliminate excessive words (stop words): words that carry no meaning by themselves, such as prepositions, conjunctions, and conditional particles.

This is done by comparing each word against a list of these words.
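The list comparison described here can be sketched as follows. The word list is a tiny illustrative sample, not a real stop list, and `remove_stop_words` is a hypothetical name.

```python
# A deliberately small, illustrative stop list (prepositions, conjunctions,
# articles, a conditional particle); real lists are much longer.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "but", "if",
    "of", "in", "on", "at", "to", "for", "with", "by",
}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that do not appear in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

if __name__ == "__main__":
    print(remove_stop_words(["the", "analysis", "of", "text", "in", "documents"]))
```

Using a set for the stop list keeps each lookup at constant time, which matters when the collection is large.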

Canonical Names

President Bush

Mr. Bush

George Bush

Canonical Name:

George Bush

• The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document

• Reduces ambiguity of variants
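One very rough way to pick a canonical name from a set of variants is to strip titles and keep the variant with the most remaining name parts. This heuristic is an assumption made for illustration, not the method used by any particular tool; `TITLES` and `canonical_name` are hypothetical.

```python
# Hypothetical, incomplete list of titles to ignore when comparing variants.
TITLES = {"president", "mr", "mrs", "ms", "dr"}

def canonical_name(variants):
    """Return the variant that keeps the most name parts once titles are removed."""
    stripped = []
    for variant in variants:
        parts = [p for p in variant.split() if p.rstrip(".").lower() not in TITLES]
        stripped.append(" ".join(parts))
    # max() returns the first candidate in case of a tie.
    return max(stripped, key=lambda name: len(name.split()))

if __name__ == "__main__":
    print(canonical_name(["President Bush", "Mr. Bush", "George Bush"]))
```

The sketch assumes the variants have already been recognized as referring to the same person; deciding that is the harder part of the problem.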

Mining

Clipping: eliminate words that appear with very high or very low frequency.

o Low-frequency words form small clusters that are not useful, and high-frequency words appear almost everywhere, so they are not useful either.

o There are many ways to calculate a word’s frequency in one or more documents.
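As one of the many possible frequency measures, the sketch below uses document frequency (the number of documents that contain a word). The thresholds `min_df` and `max_df_ratio` and the name `clip_vocabulary` are assumptions chosen for illustration.

```python
from collections import Counter

def clip_vocabulary(docs, min_df=2, max_df_ratio=0.9):
    """Return the words that are neither too rare nor too common.

    docs is a list of token lists. A word is kept if it occurs in at least
    min_df documents and in at most max_df_ratio of all documents.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {
        word
        for word, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    }

if __name__ == "__main__":
    docs = [
        ["text", "mining", "data"],
        ["text", "mining", "web"],
        ["text", "spam", "data"],
        ["text", "email", "spam"],
    ]
    print(sorted(clip_vocabulary(docs)))
```

In this sample, "text" is clipped because it occurs in every document, and "web" and "email" are clipped because each occurs in only one.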

Mining

Clustering: group interrelated documents based on their topics.
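A minimal sketch of document clustering, assuming each document is already reduced to a list of words: a document joins the first cluster whose first member shares enough vocabulary with it (Jaccard similarity), otherwise it starts a new cluster. This single-pass scheme and the `threshold` value are illustrative choices, not the method from the seminar.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two documents."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_documents(docs, threshold=0.3):
    """Greedy single-pass clustering; returns lists of document indices."""
    clusters = []
    for i, doc in enumerate(docs):
        for cluster in clusters:
            # Compare against the first document placed in the cluster.
            if jaccard(doc, docs[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters

if __name__ == "__main__":
    docs = [
        ["spam", "email", "filter"],
        ["gene", "disease", "clinical"],
        ["email", "spam", "detection"],
        ["disease", "gene", "study"],
    ]
    print(cluster_documents(docs))
```

The result depends on the order of the documents and on the threshold; more robust methods, such as k-means over weighted word vectors, avoid the order dependence at the cost of more computation.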

Text Mining: Analysis

• Which words occur most often?

• Which words are most interesting?

• Which words help define the document?

• What are the interesting text phrases?
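One common way to approach the question of which words help define a document is TF-IDF weighting, which scores a word highly when it is frequent in one document but rare across the collection. The slides do not name this technique; it is introduced here only as an illustrative sketch, and `tf_idf` is a hypothetical name.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dictionary per document.

    tf  = occurrences of the word in the document / document length
    idf = log(number of documents / number of documents containing the word)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

if __name__ == "__main__":
    docs = [["text", "gene", "gene"], ["text", "spam"]]
    for doc_scores in tf_idf(docs):
        print(max(doc_scores, key=doc_scores.get), doc_scores)
```

A word that occurs in every document, such as "text" in this sample, gets a score of zero: it is present everywhere and therefore says nothing about any single document.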

Text mining applications

• Call Center Software.

• Anti-Spam.

• Market Intelligence.

• Web mining.

Actual examples

• A clinical center in the USA was able to identify one of the genes responsible for a harmful disease by processing more than 150,000 newspapers.

• Text mining of the Holy Quran.

• Etc.

Challenges in Text Mining

• Information is in unstructured textual form, written in natural language (NL).

• It is not readily accessible to computers.

• It involves dealing with huge collections of documents.

• It requires a skilled person to choose which documents to process and to analyze the output.

• It requires more time.

• Cost: $50,000 for the software alone.

More information

• The Central Intelligence Agency (CIA) is the strongest supporter of text mining.

- The events of September 11.

- Mining of e-mail, chat rooms, and social networks.

- It therefore supports many companies, such as Attensity, Inxight, and Intelliseek.

More information

• SPSS company statistics: users of text mining software are few compared with users of data mining software.

Conclusion

• Finally, most sources indicate that the field of text mining is still in the research phase

• and that its applications are still in limited use at present

• but the possibilities it offers, helping to understand huge amounts of text and to extract the important and useful core of the information, give it prospects in many areas.