Mlas06 Nigam Tie 01


Slide 1/52

Machine Learning for Information Extraction: An Overview

Kamal Nigam, Google Pittsburgh

With input, slides and suggestions from William Cohen, Andrew McCallum and Ion Muslea

Slide 2/52

Example: A Problem

Genomics job

Mt. Baker, the school district

Baker Hostetler, the company

Baker, a job opening

Slide 3/52

    Example: A Solution

Slide 4/52

Job Openings:
Category = Food Services
Keyword = Baker
Location = Continental U.S.

Slide 5/52

    Extracting Job Openings from the Web

    Title: Ice Cream Guru

Description: If you dream of cold creamy

Contact: [email protected]

    Category: Travel/Hospitality

    Function: Food Services

Slide 6/52

    Potential Enabler of Faceted Search

Slide 7/52

    Lots of Structured Information in Text

Slide 8/52

    IE from Research Papers

Slides 9-13/52

What is Information Extraction?

Recovering structured data from formatted text

Identifying fields (e.g. named entity recognition)

Understanding relations between fields (e.g. record association)

Normalization and deduplication

Today, focus mostly on field identification & a little on record association

Slide 14/52

IE Posed as a Machine Learning Task

Training data: documents marked up with ground truth

In contrast to text classification, local features are crucial. Features of:

Contents

Text just before item

Text just after item

Begin/end boundaries

00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun

prefix contents suffix
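To make the prefix/contents/suffix decomposition concrete, here is a minimal sketch, not from the tutorial, of turning a candidate span and its local context into features; the window size and feature names are illustrative assumptions.

    # Minimal sketch: local features for one extraction candidate.
    # The window size and feature names are illustrative assumptions.
    def candidate_features(tokens, start, end, window=3):
        """Features of the span tokens[start:end] plus its local context."""
        prefix = tokens[max(0, start - window):start]
        contents = tokens[start:end]
        suffix = tokens[end:end + window]
        features = set()
        for w in contents:
            features.add("contents=" + w.lower())
            if w[:1].isupper():
                features.add("contents-capitalized")
        for i, w in enumerate(prefix):
            features.add("prefix[-%d]=%s" % (len(prefix) - i, w.lower()))
        for i, w in enumerate(suffix):
            features.add("suffix[+%d]=%s" % (i + 1, w.lower()))
        features.add("length=%d" % len(contents))
        return features

    tokens = "00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
    print(sorted(candidate_features(tokens, start=11, end=13)))  # "Sebastian Thrun"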

Slide 15/52

Good Features for Information Extraction

Example word features:

identity of word
is in all caps
ends in "-ski"
is part of a noun phrase
is in a list of city names
is under node X in WordNet or Cyc
is in bold font
is in hyperlink anchor

Features of past & future:

last person name was female
next two words are "and Associates"

begins-with-number
begins-with-ordinal
begins-with-punctuation
begins-with-question-word
begins-with-subject
blank
contains-alphanum
contains-bracketed-number
contains-http
contains-non-space
contains-number
contains-pipe
contains-question-mark
contains-question-word
ends-with-question-mark
first-alpha-is-capitalized
indented
indented-1-to-4
indented-5-to-10
more-than-one-third-space
only-punctuation
prev-is-blank
prev-begins-with-ordinal
shorter-than-30

Creativity and Domain Knowledge Required!

Slide 16/52

Good Features for Information Extraction

Is Capitalized
Is Mixed Caps
Is All Caps
Initial Cap
Contains Digit
All lowercase
Is Initial
Punctuation
Period
Comma
Apostrophe
Dash
Preceded by HTML tag

Character n-gram classifier says string is a person name (80% accurate)

In stopword list (the, of, their, etc)
In honorific list (Mr, Mrs, Dr, Sen, etc)
In person suffix list (Jr, Sr, PhD, etc)
In name particle list (de, la, van, der, etc)
In Census lastname list; segmented by P(name)
In Census firstname list; segmented by P(name)
In locations lists (states, cities, countries)
In company name list (J. C. Penny)
In list of company suffixes (Inc, & Associates, Foundation)

Word Features: lists of job titles, lists of prefixes, lists of suffixes, 350 informative phrases

HTML/Formatting Features:
{begin, end, in} x {HTML tags} x {lengths 1, 2, 3, 4, or longer}
{begin, end} of line

Creativity and Domain Knowledge Required!
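A few of the features above written as predicates, to show how such lists become code; a sketch only, with illustrative lists (real systems add Census name lists, city lists, HTML features, and so on).

    # A few of the word features above, written as simple predicates.
    # The honorific list is truncated for illustration.
    WORD_FEATURES = {
        "is-capitalized":    lambda w: w[:1].isupper(),
        "is-all-caps":       lambda w: w.isupper(),
        "contains-digit":    lambda w: any(c.isdigit() for c in w),
        "all-lowercase":     lambda w: w.islower(),
        "ends-in-ski":       lambda w: w.lower().endswith("ski"),
        "in-honorific-list": lambda w: w.rstrip(".") in {"Mr", "Mrs", "Dr", "Sen"},
    }

    def word_features(w):
        return [name for name, test in WORD_FEATURES.items() if test(w)]

    print(word_features("Dr."))       # ['is-capitalized', 'in-honorific-list']
    print(word_features("Kowalski"))  # ['is-capitalized', 'ends-in-ski']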

Slide 17/52

IE History

Pre-Web

Mostly news articles

De Jong's FRUMP [1982]: hand-built system to fill Schank-style scripts from news wire

Message Understanding Conference (MUC) DARPA [87-95], TIPSTER [92-96]

Most early work dominated by hand-built models, e.g. SRI's FASTUS, hand-built FSMs.

But by the 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek 97], BBN [Bikel et al 98]

Web

AAAI '94 Spring Symposium on Software Agents: much discussion of ML applied to the Web. Maes, Mitchell, Etzioni.

Tom Mitchell's WebKB, '96: build KBs from the Web.

Wrapper Induction: initially hand-built, then ML: [Soderland 96], [Kushmerick 97], ...

Slide 18/52

Landscape of ML Techniques for IE:

Any of these models can be used to capture words, formatting or both.

Classify Candidates: "Abraham Lincoln was born in Kentucky." -> Classifier: which class?

Sliding Window: "Abraham Lincoln was born in Kentucky." -> Classifier: which class? Try alternate window sizes.

Boundary Models: "Abraham Lincoln was born in Kentucky." -> Classifier: which class? (BEGIN / END)

Finite State Machines: "Abraham Lincoln was born in Kentucky." -> Most likely state sequence?

Wrapper Induction: "Abraham Lincoln was born in Kentucky." -> Learn and apply pattern for a website (PersonName)

Slide 19/52

    Sliding Windows & Boundary Detection

Slides 20-24/52

Information Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

E.g. looking for seminar location


Slide 25/52

Information Extraction with Sliding Windows
[Freitag 97, 98; Soderland 97; Califf 98]

00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun
w_{t-m} ... w_{t-1} (prefix) | w_t ... w_{t+n} (contents) | w_{t+n+1} ... w_{t+n+m} (suffix)

Standard supervised learning setting:

Positive instances: candidates with real label

Negative instances: all other candidates

Features based on candidate, prefix and suffix

Special-purpose rule learning systems work well:

courseNumber(X) :-
    tokenLength(X,=,2),
    every(X, inTitle, false),
    some(X, A, ..., inTitle, true),
    some(X, B, ..., tripleton, true)
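A sketch of the candidate-generation side of this setup; the maximum window length and the tokenization are assumptions, and any off-the-shelf classifier can then be trained on the resulting labeled instances.

    # Sketch of the sliding-window setup: every short span is a candidate;
    # spans matching an annotation are positive, all others negative.
    def windows(tokens, max_len=4):
        for start in range(len(tokens)):
            for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
                yield (start, end)

    def training_instances(tokens, true_spans):
        """Pair each candidate window with a positive/negative label."""
        for span in windows(tokens):
            yield span, (span in true_spans)

    tokens = "Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split()
    data = list(training_instances(tokens, true_spans={(8, 10)}))  # Sebastian Thrun
    print(len(data), [span for span, label in data if label])  # 34 candidates, 1 positive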

Slide 26/52

Rule-learning approaches to sliding-window classification: Summary

Representations for classifiers allow restriction of the relationships between tokens, etc.

Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)

Use of these heavyweight representations is complicated, but seems to pay off in results

Slides 27-31/52

IE by Boundary Detection

GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University

3:30 pm
7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

E.g. looking for seminar location


Slide 32/52

BWI: Learning to detect boundaries

Another formulation: learn three probabilistic classifiers:

START(i) = Prob(position i starts a field)

END(j) = Prob(position j ends a field)

LEN(k) = Prob(an extracted field has length k)

Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i)

LEN(k) is estimated from a histogram

[Freitag & Kushmerick, AAAI 2000]
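A sketch of the scoring rule, with the learned detectors replaced by made-up probability tables:

    # Sketch of BWI-style scoring: choose the span (i, j) maximizing
    # START(i) * END(j) * LEN(j - i). All numbers below are toy values.
    def best_span(start_p, end_p, len_hist):
        """start_p[i] = P(position i starts a field); end_p[j] = P(position j
        ends a field); len_hist[k] = histogram estimate of P(length = k)."""
        best, best_score = None, 0.0
        for i, s_i in enumerate(start_p):
            for j, e_j in enumerate(end_p):
                score = s_i * e_j * len_hist.get(j - i, 0.0)
                if score > best_score:
                    best, best_score = (i, j), score
        return best, best_score

    start_p = [0.01, 0.70, 0.05, 0.20]   # position 1 likely starts a field
    end_p = [0.01, 0.05, 0.10, 0.80]     # position 3 likely ends one
    len_hist = {1: 0.2, 2: 0.6, 3: 0.2}  # most fields are two tokens long
    print(best_span(start_p, end_p, len_hist))  # ((1, 3), 0.336)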

Slide 33/52

BWI: Learning to detect boundaries

BWI uses boosting to find detectors for START and END

Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i)

Each pattern is a sequence of tokens and/or wildcards like: anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, ...

Weak learner for patterns uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns

Slide 34/52

BWI: Learning to detect boundaries

Field         F1
Person Name   30%
Location      61%
Start Time    98%

Slide 35/52

Problems with Sliding Windows and Boundary Finders

Decisions in neighboring parts of the input are made independently from each other.

Naïve Bayes Sliding Window may predict a seminar end time before the seminar start time.

It is possible for two overlapping windows to both be above threshold.

In a Boundary-Finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

Slide 36/52

    Finite State Machines

Slide 37/52

Hidden Markov Models

Finite state model / graphical model: a chain of states S_{t-1}, S_t, S_{t+1}, ..., each emitting an observation O_{t-1}, O_t, O_{t+1}, ...

Generates a state sequence and an observation sequence o1 o2 o3 o4 o5 ...

Parameters: for all states S = {s1, s2, ...}

Start state probabilities: P(s1)

Transition probabilities: P(st|st-1)

Observation (emission) probabilities: P(ot|st), usually a multinomial over an atomic, fixed alphabet

Training: maximize probability of training observations (w/ prior):

P(s,o) = Π_{t=1..|o|} P(st|st-1) P(ot|st)

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
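As a concrete illustration, a tiny sketch, not from the slides, that computes the joint probability for a toy two-state model; all probability tables are made-up numbers.

    # Sketch: P(s,o) = product over t of P(st|st-1) P(ot|st),
    # for a toy "name"/"other" model with made-up parameters.
    start = {"name": 0.3, "other": 0.7}             # P(s1)
    trans = {"name": {"name": 0.6, "other": 0.4},   # P(st|st-1)
             "other": {"name": 0.2, "other": 0.8}}
    emit = {"name": {"Lawrence": 0.3, "Saul": 0.3},             # P(ot|st)
            "other": {"spoke": 0.2, "Yesterday": 0.2}}

    def joint(states_seq, obs):
        p = start[states_seq[0]] * emit[states_seq[0]].get(obs[0], 0.0)
        for t in range(1, len(obs)):
            p *= trans[states_seq[t - 1]][states_seq[t]] * \
                 emit[states_seq[t]].get(obs[t], 0.0)
        return p

    print(joint(["name", "name", "other"], ["Lawrence", "Saul", "spoke"]))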

Slide 38/52

IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Lawrence Saul spoke this example sentence.

and a trained HMM, find the most likely state sequence (Viterbi):

s* = argmax_s P(s,o)

Any words said to be generated by the designated "person name" state are extracted as a person name:

Person name: Lawrence Saul
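A minimal Viterbi decoder over the same kind of toy model (an illustrative sketch; the two-state parameterization and its numbers are assumptions, not the tutorial's code):

    # Minimal Viterbi sketch: most likely state sequence under an HMM.
    def viterbi(obs, states, start, trans, emit):
        # delta[s] = probability of the best path so far ending in state s
        delta = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
        backptrs = []
        for o in obs[1:]:
            prev, delta, ptr = delta, {}, {}
            for s in states:
                best = max(states, key=lambda r: prev[r] * trans[r][s])
                delta[s] = prev[best] * trans[best][s] * emit[s].get(o, 0.0)
                ptr[s] = best
            backptrs.append(ptr)
        last = max(states, key=lambda s: delta[s])
        path = [last]
        for ptr in reversed(backptrs):  # walk the back-pointers home
            path.append(ptr[path[-1]])
        return path[::-1]

    states = ["name", "other"]
    start = {"name": 0.3, "other": 0.7}
    trans = {"name": {"name": 0.6, "other": 0.4},
             "other": {"name": 0.2, "other": 0.8}}
    emit = {"name": {"Lawrence": 0.3, "Saul": 0.3},
            "other": {"Yesterday": 0.2, "spoke": 0.2}}
    print(viterbi(["Yesterday", "Lawrence", "Saul", "spoke"],
                  states, start, trans, emit))
    # ['other', 'name', 'name', 'other']: "Lawrence Saul" tagged as the person name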

Slide 39/52

Generative Extraction with HMMs

Parameters: {P(st|st-1), P(ot|st)} for all states st and words ot.

Parameters define the generative model:

P(s,o) = Π_{t=1..|o|} P(st|st-1) P(ot|st)

[McCallum, Nigam, Seymore & Rennie 00]

Slide 40/52

HMM Example: Nymble [Bikel, et al 97]

Task: Named Entity Extraction

States: Person, Org, Other, (five other name classes), plus start-of-sentence and end-of-sentence.

Transition probabilities: P(st | st-1, ot-1), backing off to P(st | st-1), then P(st)

Observation probabilities: P(ot | st, st-1) or P(ot | st, ot-1), backing off to P(ot | st), then P(ot)

Results: train on 450k words of news wire text.

Case    Language   F1
Mixed   English    93%
Upper   English    91%
Mixed   Spanish    90%

Other examples of HMMs in IE: [Leek 97; Freitag & McCallum 99; Seymore et al. 99]
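The back-off idea in miniature (a sketch only: Nymble weights and discounts its back-off levels based on counts, while this sketch simply falls through to the next less-conditioned table):

    # Sketch of a back-off chain for P(st | st-1, ot-1): use the most
    # detailed estimate whose context was observed in training.
    def backoff_transition(s, prev_s, prev_o, p3, p2, p1):
        if (s, prev_s, prev_o) in p3:
            return p3[(s, prev_s, prev_o)]  # P(st | st-1, ot-1)
        if (s, prev_s) in p2:
            return p2[(s, prev_s)]          # P(st | st-1)
        return p1.get(s, 0.0)               # P(st)

    p3 = {("Person", "start-of-sentence", "Mr."): 0.5}
    p2 = {("Person", "start-of-sentence"): 0.2}
    p1 = {"Person": 0.1}
    print(backoff_transition("Person", "start-of-sentence", "Mr.", p3, p2, p1))  # 0.5
    print(backoff_transition("Person", "start-of-sentence", "the", p3, p2, p1))  # 0.2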

Slide 41/52

Regrets from Atomic View of Tokens

Would like richer representation of text: multiple overlapping features, whole chunks of text.

Example word features:

identity of word
is in all caps
ends in "-ski"
is part of a noun phrase
is in a list of city names
is under node X in WordNet or Cyc
is in bold font
is in hyperlink anchor

Features of past & future:

last person name was female
next two words are "and Associates"

Line, sentence, or paragraph features:

length
is centered in page
percent of non-alphabetics
white-space aligns with next line
containing sentence has two verbs
grammatically contains a question
contains links to authoritative pages

Emissions that are uncountable

Features at multiple levels of granularity

Slide 42/52

Problems with Richer Representation and a Generative Model

These arbitrary features are not independent:

Overlapping and long-distance dependencies

Multiple levels of granularity (words, characters)

Multiple modalities (words, formatting, layout)

Observations from past and future

HMMs are generative models of the text, P(s,o). Generative models do not easily handle these non-independent features. Two choices:

Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!

Ignore the dependencies. This causes over-counting of evidence (a la naïve Bayes). Big problem when combining evidence, as in Viterbi!

Slide 43/52

Conditional Sequence Models

We would prefer a conditional model: P(s|o) instead of P(s,o):

Can examine features, but is not responsible for generating them.

Don't have to explicitly model their dependencies.

Don't waste modeling effort trying to generate what we are given at test time anyway.

If successful, this answers the challenge of integrating the ability to handle many arbitrary features with the full power of finite state automata.

Slide 44/52

Conditional Markov Models

Generative (traditional HMM), with states S_{t-1}, S_t, S_{t+1}, ... generating observations O_{t-1}, O_t, O_{t+1}, ...:

P(s,o) = Π_{t=1..|o|} P(st|st-1) P(ot|st)

Conditional, with transitions now conditioned on the observations:

P(s|o) = Π_{t=1..|o|} P(st|st-1, ot)

Standard belief propagation: forward-backward procedure. Viterbi and Baum-Welch follow naturally.

Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]
MaxEnt POS Tagger [Ratnaparkhi, 1996]
SNoW-based Markov Model [Punyakanok & Roth, 2000]

Slide 45/52

Exponential Form for Next State Function

Capture dependency on st-1 with |S| independent functions, Pst-1(st|ot).

Each state contains a "next-state classifier" that, given the next observation, produces a probability of the next state, Pst-1(st|ot):

Pst-1(st|ot) = (1/Z(ot, st-1)) exp( Σk λk fk(ot, st) )

(λk: weight, fk: feature)

Recipe:
- Labeled data is assigned to transitions.
- Train each state's exponential model by maximum entropy.
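One state's next-state classifier in miniature; the feature names and weights below are toy assumptions.

    # Sketch of a next-state classifier:
    # P_{st-1}(st | ot) = exp(sum_k lambda_k * f_k(ot, st)) / Z(ot, st-1).
    import math

    def next_state_dist(active_feats, states, weights):
        """weights maps (feature, next_state) -> lambda_k."""
        score = {s: sum(weights.get((f, s), 0.0) for f in active_feats)
                 for s in states}
        z = sum(math.exp(v) for v in score.values())  # per-state normalizer
        return {s: math.exp(score[s]) / z for s in states}

    states = ["person", "other"]
    weights = {("is-capitalized", "person"): 2.0,
               ("in-stopword-list", "other"): 1.5}
    print(next_state_dist({"is-capitalized"}, states, weights))
    # {'person': ~0.88, 'other': ~0.12}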

Slide 46/52

Label Bias Problem

Consider an MEMM with two paths, 0-1-2-3 for "rib" and 0-4-5-3 for "rob", and enough training data to perfectly model it, so that Pr(0123|rib) = 1 and Pr(0453|rob) = 1 in the training data. But:

Pr(0123|rob) = Pr(1|0,r)/Z1 * Pr(2|1,o)/Z2 * Pr(3|2,b)/Z3 = 0.5 * 1 * 1

Pr(0453|rib) = Pr(4|0,r)/Z1 * Pr(5|4,i)/Z2 * Pr(3|5,b)/Z3 = 0.5 * 1 * 1

Because each state's next-state distribution is locally normalized, states with a single outgoing transition must pass all their probability mass forward regardless of the observation, so the wrong paths score as highly as the right ones.

Slide 47/52

From HMMs to MEMMs to CRFs

s = s1 s2 ... sn    o = o1 o2 ... on

HMM (a special case of MEMMs and CRFs):

P(s,o) = Π_{t=1..|o|} P(st|st-1) P(ot|st)

MEMM:

P(s|o) = Π_{t=1..|o|} P(st|st-1, ot)
       = Π_{t=1..|o|} (1/Z(ot, st-1)) exp( Σj λj fj(st, st-1) + Σk μk gk(ot, st) )

Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 2001]:

P(s|o) = (1/Zo) Π_{t=1..|o|} exp( Σj λj fj(st, st-1) + Σk μk gk(ot, st) )

Slide 48/52

Conditional Random Fields (CRFs) [Lafferty, McCallum, Pereira 2001]

Markov on s, conditional dependency on o: the states St, St+1, St+2, St+3, St+4 form a chain, jointly conditioned on O = Ot, Ot+1, Ot+2, Ot+3, Ot+4.

P(s|o) = (1/Zo) exp( Σ_{t=1..|o|} Σk λk fk(st, st-1, o, t) )

The Hammersley-Clifford-Besag theorem stipulates that the CRF has this form: an exponential function of the cliques in the graph.

Assuming that the dependency structure of the states is tree-shaped (a linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|^2), just like HMMs.
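To make the form concrete, a tiny sketch that scores label sequences with weighted features and normalizes by Zo by brute-force enumeration, which is fine for a toy example; real implementations use the dynamic programming noted above. The features and weights are toy assumptions.

    # Sketch of a linear-chain CRF: P(s|o) proportional to
    # exp(sum over t, k of lambda_k f_k(st, st-1, o, t)).
    import itertools, math

    def features(s_t, s_prev, obs, t):
        yield ("trans", s_prev, s_t)   # transition feature
        yield ("emit", s_t, obs[t])    # observation feature

    def score(seq, obs, w):
        total, prev = 0.0, "START"
        for t, s in enumerate(seq):
            total += sum(w.get(f, 0.0) for f in features(s, prev, obs, t))
            prev = s
        return total

    def prob(seq, obs, w, labels):
        z = sum(math.exp(score(list(q), obs, w))      # global normalizer Z_o
                for q in itertools.product(labels, repeat=len(obs)))
        return math.exp(score(seq, obs, w)) / z

    w = {("emit", "name", "Saul"): 2.0, ("trans", "name", "name"): 1.0}
    print(prob(["name", "name"], ["Lawrence", "Saul"], w, ["name", "other"]))
    # ~0.68: normalization is global, unlike the MEMM's per-state Z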

Slide 49/52

Training CRFs

Maximize log-likelihood of parameters Λ = {λk} given training data D = {⟨o, s⟩^(i)}:

L(Λ) = Σi log PΛ(s^(i)|o^(i)) - Σk λk^2 / 2σ^2

Log-likelihood gradient:

∂L/∂λk = Σi Ck(s^(i), o^(i)) - Σi Σs PΛ(s|o^(i)) Ck(s, o^(i)) - λk/σ^2

where Ck(s, o) = Σt fk(st, st-1, o, t)

i.e. (feature count using correct labels) - (feature count using labels assigned by current parameters) - (smoothing penalty)

Methods:
- iterative scaling (quite slow)
- conjugate gradient (much faster)
- conjugate gradient with preconditioning (super fast)
- limited-memory quasi-Newton methods (also super fast)

Complexity comparable to standard Baum-Welch

[Sha & Pereira 2002] & [Malouf 2002]
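The gradient can be checked on a tiny example by brute force: the observed feature count minus the expected count under the current parameters, minus the smoothing term. A sketch with toy features and weights, assuming σ^2 = 100:

    # Sketch of the CRF gradient for one weight lambda_k:
    # observed count - expected count - lambda_k / sigma^2.
    import itertools, math

    def feature_counts(seq, obs):
        counts, prev = {}, "START"
        for t, s in enumerate(seq):
            for f in (("trans", prev, s), ("emit", s, obs[t])):
                counts[f] = counts.get(f, 0) + 1
            prev = s
        return counts

    def gradient(k, true_seq, obs, w, labels, sigma2=100.0):
        def score(seq):
            return sum(w.get(f, 0.0) * n
                       for f, n in feature_counts(seq, obs).items())
        seqs = [list(q) for q in itertools.product(labels, repeat=len(obs))]
        z = sum(math.exp(score(q)) for q in seqs)
        expected = sum(math.exp(score(q)) / z * feature_counts(q, obs).get(k, 0)
                       for q in seqs)
        observed = feature_counts(true_seq, obs).get(k, 0)
        return observed - expected - w.get(k, 0.0) / sigma2

    w = {("emit", "name", "Saul"): 2.0, ("trans", "name", "name"): 1.0}
    print(gradient(("emit", "name", "Saul"), ["name", "name"],
                   ["Lawrence", "Saul"], w, ["name", "other"]))
    # ~0.048: still slightly positive, so this weight would be nudged up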

Slide 50/52

Sample IE Applications of CRFs

Noun phrase segmentation [Sha & Pereira 03]

Named entity recognition [McCallum & Li 03]

Protein names in bio abstracts [Settles 05]

Addresses in web pages [Culotta et al. 05]

Semantic roles in text [Roth & Yih 05]

RNA structural alignment [Sato & Sakakibara 05]

Slide 51/52

Examples of Recent CRF Research

Semi-Markov CRFs [Sarawagi & Cohen 05]

Awkwardness of token-level decisions for segments; a segment sequence model alleviates this

Two-level model with sequences of segments, which are sequences of tokens

Stochastic Meta-Descent [Vishwanathan 06]

Stochastic gradient optimization for training: take gradient steps with small batches of examples

Order of magnitude faster than L-BFGS, with the same resulting accuracies for extraction

Slide 52/52

Further Reading about CRFs

Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. In Introduction to Statistical Relational Learning. Edited by Lise Getoor and Ben Taskar. MIT Press. 2006.

http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf