Outline
• Problem description
• General approach
• ML algorithms
• Important concepts
• Assignments
• What’s next?
Two types of problems
• Classification problem
• Sequence Labeling problem
• In both cases:
– A predefined set of labels: C = {c1, c2, …, cn}
– Training data: {(xi, yi)}, where yi ∈ C; yi is known (supervised) or unknown (unsupervised)
– Test data
NLP tasks
• Classification problems:
– Document classification
– Spam detection
– Sentiment analysis
– …
• Sequence labeling problems:
– POS tagging
– Word segmentation
– Sentence segmentation
– NE detection
– Parsing
– IGT detection
– …
Step 1: Preprocessing
• Converting the NLP task to a classification or sequence labeling problem
• Creating the attribute-value table:
– Define feature templates
– Instantiate feature templates and select features
– Decide what kind of feature values to use (e.g., binarizing features or not)
– Convert a multi-class problem to a binary problem (optional)
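To make the attribute-value table concrete, here is a minimal Python sketch (the function name is hypothetical) that instantiates a word-unigram feature template and optionally binarizes the values:

from collections import Counter

def instantiate_unigram_template(doc, binarize=False):
    # One row of the attribute-value table: feature name -> feature value
    counts = Counter(doc.split())
    if binarize:
        return {"word=" + w: 1 for w in counts}          # presence/absence
    return {"word=" + w: c for w, c in counts.items()}   # raw frequency

print(instantiate_unigram_template("the dog chased the cat"))
# {'word=the': 2, 'word=dog': 1, 'word=chased': 1, 'word=cat': 1}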
Feature selection
• Dimensionality reduction
– Feature selection:
• Wrapper methods
• Filtering methods: mutual information, χ², information gain, …
– Feature extraction:
• Term clustering
• Latent semantic indexing (LSI)
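As a sketch of one filtering method, the χ² score for a single feature/class pair, computed from a 2×2 contingency table (toy counts; the function name is hypothetical):

def chi2_score(n11, n10, n01, n00):
    # n11: feature present & in class; n10: present & not in class;
    # n01: absent & in class;          n00: absent & not in class
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

# Filtering: rank every feature by its score and keep the top k.
print(chi2_score(40, 10, 60, 890))  # high score = feature and class look dependent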
Step 2: Training and decoding
• Choose a ML learner
• Train on the training data and evaluate on the development set, with different settings of the non-model parameters
• Choose the setting that works best on the development set (see the tuning sketch below)
• Run the learner on the test data with the best setting
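A minimal sketch of this tuning loop, with hypothetical train/evaluate functions:

def tune(train_data, dev_data, settings, train, evaluate):
    # Try each setting of the non-model parameters (e.g., k for kNN,
    # beam size) and keep the one that scores best on the dev set.
    best_setting, best_score = None, float("-inf")
    for setting in settings:
        model = train(train_data, **setting)
        score = evaluate(model, dev_data)
        if score > best_score:
            best_setting, best_score = setting, score
    return best_setting  # then run once on the test data with this setting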
Step 3: Post-processing
• Convert the label sequence into the output we want
• System combination:
– Voting: majority voting, weighted voting (sketched below)
– More sophisticated models
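A sketch of voting-based system combination (equal weights reduce it to plain majority voting):

from collections import defaultdict

def weighted_vote(predictions, weights=None):
    # predictions: one label per classifier; weights: one weight per classifier
    if weights is None:
        weights = [1.0] * len(predictions)      # plain majority voting
    score = defaultdict(float)
    for label, weight in zip(predictions, weights):
        score[label] += weight
    return max(score, key=score.get)

print(weighted_vote(["pos", "neg", "pos"]))                   # pos
print(weighted_vote(["pos", "neg", "neg"], [0.9, 0.3, 0.3]))  # pos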
Main ideas
• kNN and Rocchio: finding the nearest neighbors / prototypes
• DT and DL: finding the right group
• NB, MaxEnt: calculating P(y | x)
• Bagging: reducing instability
• Boosting: forming a committee
• TBL: improving the current guess
Training
• kNN: no training
• Rocchio: calculate prototypes
• DT: build a decision tree:
– Choose a feature and then split the data
• DL: build a decision list:
– Choose a decision rule and then split the data
• TBL: build a transformation list:
– Choose a transformation and then update the current label field
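For example, Rocchio training reduces to computing one prototype (centroid) per class; a sketch assuming instances are sparse feature dicts:

from collections import defaultdict

def rocchio_prototypes(X, y):
    # The prototype of a class is the mean vector of its training instances.
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for x, label in zip(X, y):
        counts[label] += 1
        for feat, val in x.items():
            sums[label][feat] += val
    return {c: {f: total / counts[c] for f, total in feats.items()}
            for c, feats in sums.items()}

print(rocchio_prototypes([{"a": 2}, {"a": 4}], ["c1", "c1"]))  # {'c1': {'a': 3.0}}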
Training (cont)
• NB: calculate P(ci) and P(fj | ci) by simple counting.
• MaxEnt: calculate the weights of feature functions by iteration.
• Bagging: create bootstrap samples and learn base classifiers.
• Boosting: learn base classifiers and their weights.
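A sketch of the NB training above, by simple counting; the add-alpha smoothing is an assumption the slides do not specify:

from collections import defaultdict
import math

def train_nb(docs, labels, vocab, alpha=1.0):
    # docs: lists of features; returns log P(c) and smoothed log P(f|c)
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: defaultdict(int))
    for doc, c in zip(docs, labels):
        class_count[c] += 1
        for f in doc:
            feat_count[c][f] += 1
    n = len(docs)
    log_prior = {c: math.log(k / n) for c, k in class_count.items()}
    log_cond = {}
    for c in class_count:
        denom = sum(feat_count[c].values()) + alpha * len(vocab)
        log_cond[c] = {f: math.log((feat_count[c][f] + alpha) / denom)
                       for f in vocab}
    return log_prior, log_cond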
Testing
• kNN: calculate the distance between x and each xi, and find the k nearest neighbors.
• Rocchio: calculate the distance between x and each prototype, and pick the closest one.
• DT: traverse the tree
• DL: find the first matched decision rule.
• TBL: apply transformations one by one.
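A sketch of kNN testing as described above (Euclidean distance is one choice of measure; any distance or similarity works):

import math
from collections import Counter

def knn_classify(x, train, k=3):
    # train: list of (vector, label); vote among the k nearest neighbors
    dist = lambda a, b: math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))
    nearest = sorted(train, key=lambda pair: dist(x, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([0, 0], "A"), ([0, 1], "A"), ([5, 5], "B"), ([6, 5], "B")]
print(knn_classify([1, 0], train))  # A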
Testing (cont)
• NB: calculate P(ci) ∏j P(fj | ci) for each class and pick the best class.
• MaxEnt: calculate P(y | x) from the learned feature weights and pick the best class.
• Bagging: run the base classifiers and choose the class with the highest number of votes.
• Boosting: run the base classifiers and calculate the weighted sum of their votes.
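And a matching sketch of NB decoding, using the model trained above (the floor score for unseen features is a hypothetical choice):

def classify_nb(doc, log_prior, log_cond, unseen=-20.0):
    # Pick argmax_c [ log P(c) + sum_j log P(fj | c) ]
    def score(c):
        return log_prior[c] + sum(log_cond[c].get(f, unseen) for f in doc)
    return max(log_prior, key=score)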
Sequence labeling problems
• With classification algorithms:
– Having features that refer to previous tags
– Using beam search to find good sequences (sketched below)
• With sequence labeling algorithms:
– HMM
– TBL
– MEMM
– CRF
– …
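A sketch of beam search over tag sequences; score() is a hypothetical classifier call that may condition on the previous tags:

def beam_search(words, labels, score, beam_size=3):
    # score(words, i, prev_tags, t) -> log P(tag t at position i | context)
    beam = [([], 0.0)]                      # (partial tag sequence, log prob)
    for i in range(len(words)):
        expanded = [(tags + [t], logp + score(words, i, tags, t))
                    for tags, logp in beam for t in labels]
        beam = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam_size]
    return beam[0][0]                       # best-scoring full tag sequence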
Semi-supervised algorithms
• Self-training
• Co-training
• …
• Main idea: adding some unlabeled data to the labeled data
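A sketch of the self-training loop (train/classify are hypothetical functions; classify returns a label and a confidence):

def self_train(labeled, unlabeled, train, classify, threshold=0.9, rounds=5):
    # Repeatedly train, label the unlabeled pool, and move confident
    # predictions into the labeled set.
    for _ in range(rounds):
        model = train(labeled)
        scored = [(x, classify(model, x)) for x in unlabeled]
        confident = [(x, lab) for x, (lab, conf) in scored if conf >= threshold]
        if not confident:
            break
        labeled = labeled + confident
        moved = {id(x) for x, _ in confident}
        unlabeled = [x for x in unlabeled if id(x) not in moved]
    return train(labeled)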
Unsupervised algorithms
• MLE
• EM:
– General algorithm: E-step, M-step
– EM for PM models:
• Forward-backward for HMM
• Inside-outside for PCFG
• IBM models for MT
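As a toy illustration of the E-step/M-step cycle (a two-coin mixture, not one of the PM models above; equal mixing weights are assumed):

from math import comb

def em_two_coins(heads, n=10, pa=0.6, pb=0.5, iters=20):
    # heads[i] = number of heads in session i of n flips from coin A or B
    for _ in range(iters):
        # E-step: soft assignment P(coin A | session) under current pa, pb
        resp = []
        for h in heads:
            la = comb(n, h) * pa ** h * (1 - pa) ** (n - h)
            lb = comb(n, h) * pb ** h * (1 - pb) ** (n - h)
            resp.append(la / (la + lb))
        # M-step: re-estimate each coin's bias from the soft counts
        pa = sum(r * h for r, h in zip(resp, heads)) / (n * sum(resp))
        pb = sum((1 - r) * h for r, h in zip(resp, heads)) / (n * sum(1 - r for r in resp))
    return pa, pb

print(em_two_coins([5, 9, 8, 4, 7]))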
Concepts
• Attribute-value table
• Feature templates vs. features
• Weights:
– Feature weights
– Classifier weights
– Instance weights
– Feature values
Concepts (cont)
• Maximum entropy vs. Maximum likelihood
• Maximize likelihood vs. minimize training error
• Training time vs. test time
• Training error vs. test error
• Greedy algorithm vs. iterative approach
Concepts (cont)
• Local optima vs. global optima
• Beam search vs. Viterbi algorithm
• Sample vs. resample
• Model parameters vs. non-model parameters
Assignments
• Read code:
– NB: binary features?
– DT: difference between DT and C4.5
– Boosting: AdaBoost and AdaBoost.M2
– MaxEnt: binary features?
• Write code:
– Info2Vectors
– BinVectors
– χ²
• Complete two projects
Projects
• Steps:
– Preprocessing
– Training and testing
– Postprocessing
• Two projects:
– Project 1: Document classification
– Project 2: IGT detection
Project 1: Document classification
• A typical classification problem
• Data are prepared already:
– Feature template: words appearing in the document
– Feature value: word frequency
Project 2: IGT detection
• Can be framed as a sequence labeling problem:
– Preprocessing: define the label set
– Postprocessing: convert the tag sequence into spans
• Here: a sequence labeling problem solved with a classification algorithm plus beam search
• To use classification algorithms:
– Preprocessing:
• Define features
• Choose feature values
• …
Project 2 (cont)
• Preprocessing:
– Define the label set
– Define feature templates
– Decide on feature values
• Training and decoding:
– Write beam search
• Postprocessing:
– Convert the label sequence into spans (see the sketch below)
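A sketch of the span-conversion step, assuming BIO labels (the actual label set is whatever you define in preprocessing):

def labels_to_spans(tags):
    # Convert e.g. ["O", "B-IGT", "I-IGT", "O"] into (type, start, end) spans
    spans, start, cur = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes last span
        if tag.startswith("B-") or tag == "O" or (cur and tag[2:] != cur):
            if cur is not None:
                spans.append((cur, start, i))
            start, cur = (i, tag[2:]) if tag != "O" else (None, None)
        elif tag.startswith("I-") and cur is None:  # stray I- opens a span
            start, cur = i, tag[2:]
    return spans

print(labels_to_spans(["O", "B-IGT", "I-IGT", "O", "B-IGT"]))
# [('IGT', 1, 3), ('IGT', 4, 5)]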
Project 2 (cont)
• Presentation
• Final report
• A typical conference paper:
– Introduction
– Previous work
– Methodology
– Experiments
– Discussion
– Conclusion
Using Mallet
• Difficulties:
– Java
– A large package
• Benefits:
– Java
– A large package
– Many learning algorithms: comparing the implementation with “standard” algorithms
Course summary
• 9 weeks: 18 sessions
• 2 kinds of problems
• 9 supervised algorithms
• 1 semi-supervised algorithm
• 1 unsupervised algorithm
• 4 related issues: feature selection, multiclass → binary conversion, system combination, beam search
• 2 projects
• 1 well-known package
• 9 assignments, including 1 presentation and 1 final report
• N papers
What’s next?
• Learn more about the algorithms covered in class.
• Learn new algorithms:
– SVM, CRF, regression algorithms, graphical models, …
• Try new tasks:
– Parsing, spam filtering, reference resolution, …
Misc
• Hw7: due tomorrow 11pm
• Hw8: due Thursday 11pm
• Hw9: due 3/13 11pm
• Presentation: No more than 15+5 minutes
What must be included in the presentation?
• Label set
• Feature templates
• Effect of beam search
• 3+ ways to improve the system and results on dev data (test_data/)
• Best system: results on dev data and the setting
• Results on test data (more_test_data/)