Alan Nochenson IST 511 10/1/2012

10-1 Vocab of Terms



Page 2: 10-1 Vocab of Terms

Motivation
Real-world example
Techniques: tokenization, stop words, normalization, stemming/lemmatization

Page 3: 10-1 Vocab of Terms

Using a variety of techniques, we want to improve IR systems so that they “understand” more of what we want from a query.

E.g., when searching for a paper about Facebook, the following queries should all return the paper: “The facebook”, “facebook”, “face-book”.

Page 7: 10-1 Vocab of Terms

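Damerau–Levenshtein distance is the minimum number of single-character operations needed to turn one word into another. The operations are insert, delete, change (substitute one character), and swap (transpose two adjacent characters).

adidas ≈ adiidas ≈ adifas (each misspelling is distance 1 from adidas). But: cat != rat != hat, even though these are also distance 1 apart, so a small distance alone does not mean two words are the same.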
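The four operations above can be sketched in Python as follows. This is a minimal illustration, not code from the slides: it implements the restricted variant of Damerau–Levenshtein (often called optimal string alignment), and the function name `osa_distance` is made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # change (or match)
            )
            # swap: the last two characters are transposed
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("adidas", "adiidas"))  # 1 (one insert)
print(osa_distance("adidas", "adifas"))   # 1 (one change)
print(osa_distance("cat", "rat"))         # 1, yet these are different words
```

A distance threshold of 1 would therefore merge the adidas misspellings correctly but would also merge cat, rat, and hat, which is the tradeoff the slide points out.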

Page 8: 10-1 Vocab of Terms

Breaking up sentences on a variety of rules

Split on non-alphanumeric?
Good: The dog ran to the park
Bad: Ms. O’Hannety went to O’Flaggerty’s pub (Ms, O, Hannety, went, to, O, Flaggerty, s, pub)

Split on space?
Bad: San Francisco is a great city. (the name “San Francisco” is split into two separate tokens)
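Both splitting rules are easy to try out. The sketch below uses only Python's standard library; the helper names `split_non_alnum` and `split_space` are invented for this illustration, and the regular expression only treats ASCII letters and digits as alphanumeric.

```python
import re


def split_non_alnum(text: str) -> list[str]:
    """Split on every run of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def split_space(text: str) -> list[str]:
    """Split on whitespace only."""
    return text.split()


print(split_non_alnum("The dog ran to the park"))
# ['The', 'dog', 'ran', 'to', 'the', 'park']

print(split_non_alnum("Ms. O'Hannety went to O'Flaggerty's pub"))
# ['Ms', 'O', 'Hannety', 'went', 'to', 'O', 'Flaggerty', 's', 'pub']

print(split_space("San Francisco is a great city."))
# ['San', 'Francisco', 'is', 'a', 'great', 'city.']
```

Splitting on whitespace also leaves the trailing period attached to the last token, so neither rule is sufficient on its own.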

Page 9: 10-1 Vocab of Terms

E.g., Lebensversicherungsgesellschaftsangestellter = “life insurance company employee”

Would not get split by any of the previously mentioned methods

Page 10: 10-1 Vocab of Terms

Drop common ‘useless’ words

How useless are they? (“President of the USA”)

Including them is not a big problem, space- or time-wise
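As a rough illustration of what dropping stop words does to a phrase query, consider the sketch below. The stop list is a tiny hypothetical one chosen for this example, not a standard list, and `remove_stop_words` is an invented helper name.

```python
# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop every token whose lowercase form is in the stop list."""
    return [tok for tok in tokens if tok.lower() not in STOP_WORDS]


print(remove_stop_words("President of the USA".split()))
# ['President', 'USA']
```

The remaining tokens no longer say how "President" and "USA" relate to each other, which is one reason the dropped words are not always useless.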

Page 11: 10-1 Vocab of Terms

What I did at Amazon (codenamed BrandSims normalization)

Maps words/phrases that are semantically related to each other, so they can refer to the same content

E.g. Alan went to the store = Alan go store
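One simple way to make related terms refer to the same content is a lookup table of equivalence classes. The sketch below is only an illustration of that general idea; it is not the BrandSims system, whose details the slides do not give, and the table `CANONICAL` and function `normalize` are hypothetical.

```python
# Hypothetical equivalence table: each variant maps to one canonical form.
CANONICAL = {
    "the facebook": "facebook",
    "face-book": "facebook",
}


def normalize(term: str) -> str:
    """Lowercase, collapse whitespace, then map known variants to a canonical term."""
    key = " ".join(term.lower().split())
    return CANONICAL.get(key, key)


for query in ["The facebook", "facebook", "face-book"]:
    print(query, "->", normalize(query))
# Each of the three queries maps to "facebook".
```

Applying the same function to both indexed text and incoming queries is what lets the three spellings match the same document.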

Page 12: 10-1 Vocab of Terms

Accents are mainly dropped since they were not always supported

Problematic since in certain languages accents are critical to understanding
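Dropping accents can be sketched with Python's standard `unicodedata` module: decompose each character, then discard the combining marks. The helper name `strip_accents` is made up for this example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("résumé"))  # resume
print(strip_accents("peña"))    # pena
```

The second example shows the problem: "peña" and "pena" are two different Spanish words, and after stripping they become indistinguishable.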

Page 13: 10-1 Vocab of Terms

Standardize to all caps or all lowercase (more common)

Everywhere in the sentence? Bad: We went to the White House (lowercasing everything gives “white house”)

A better solution is to lowercase only words at the beginning of a sentence and in titles
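The difference between lowercasing everything and lowercasing only the sentence-initial word can be sketched as follows. This covers only the beginning-of-sentence part of the heuristic (not titles), and `fold_sentence_start` is a hypothetical helper name.

```python
def fold_sentence_start(sentence: str) -> str:
    """Lowercase only the first word, leaving mid-sentence capitals intact."""
    words = sentence.split()
    if words:
        words[0] = words[0].lower()
    return " ".join(words)


sentence = "We went to the White House"
print(sentence.lower())               # we went to the white house
print(fold_sentence_start(sentence))  # we went to the White House
```

The heuristic is still imperfect: a sentence that begins with a proper noun would have that name lowercased as well.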

Page 14: 10-1 Vocab of Terms

More complicated than previous normalization techniques

Goal is to remove things like tense, number, possession from strings

Page 15: 10-1 Vocab of Terms

Chop off the end of the word
Con: Crude and sometimes ineffective
Pro: Fast and no overhead

E.g. cookies -> cooki, cup -> c
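To make the "chop off the end" idea concrete, here is a deliberately naive suffix stripper. It is a toy under stated assumptions, not Porter's algorithm or any published stemmer; the suffix list and the name `crude_stem` are invented for this example.

```python
# Hypothetical suffix list, checked in order; the first match is removed.
SUFFIXES = ("ing", "es", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, never returning an empty string."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


print(crude_stem("cookies"))  # cooki
print(crude_stem("cats"))     # cat
print(crude_stem("bus"))      # bu  (over-stemming: the final "s" was not a plural)
```

No dictionary or context is consulted, which is why this is fast but produces non-words and occasional wrong merges.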

Page 16: 10-1 Vocab of Terms

Use a vocab list and morphological (structural) list [which may or may not help much]

Recognize context in a sentence (saw would become see if used as a verb, not a noun)
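A lemmatizer differs from a stemmer in that it looks words up, and the lookup can depend on the part of speech. The sketch below is a toy with a hand-written, hypothetical table (`LEMMAS`) and assumes the part-of-speech tag is supplied by the caller; a real system would use a full vocabulary and a tagger.

```python
# Hypothetical lookup table keyed by (word, part of speech).
LEMMAS = {
    ("saw", "VERB"): "see",
    ("went", "VERB"): "go",
    ("cookies", "NOUN"): "cookie",
}


def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form for (word, pos), or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)


print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```

The same surface form maps to different lemmas depending on its role in the sentence, which a suffix-chopping stemmer cannot capture.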

Porter’s algorithm:

Page 17: 10-1 Vocab of Terms

Understand the type of queries that will be submitted

It is all about tradeoffs between precision and recall

These techniques can be used differently depending on the context.
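The precision/recall tradeoff mentioned above can be made concrete with the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. The document IDs below are made up for illustration.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Compute (precision, recall) for one query from sets of document IDs."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2"}                # hypothetical ground truth
strict = {"d1"}                        # e.g. exact matching only
loose = {"d1", "d2", "d3", "d4"}       # e.g. aggressive normalization

print(precision_recall(strict, relevant))  # (1.0, 0.5)
print(precision_recall(loose, relevant))   # (0.5, 1.0)
```

Aggressive normalization, stemming, or fuzzy matching tends to move a system from the first case toward the second: more relevant documents are found, along with more irrelevant ones.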
