CLTL presentation: training an opinion mining system from KAF files using CRF

OM, CRF and KAF

Rubén Izquierdo Beviá
CLTL tutorial
11-July-2013

OM, CRF and KAF

1. OM: Develop an Opinion Miner tool
2. CRF: Using supervised Machine Learning (CRF)
3. KAF: Using KAF files as input

Opinion Miner

Detecting and extracting fine-grained opinions in text.

Opinion elements:
  Expression: the actual subjective statement
  Holder: mentions of whom the opinion is from
  Target: what the opinion is about

Example: My wife said that the room was really dirty.

[Figure: the example sentence annotated with HOLDER, TARGET and EXPRESSION spans]

CRF: Conditional Random Fields

Statistical modeling method
Obtains a conditional probability distribution over sequences
Suitable for segmenting and labeling structured data (sequences, trees…)
Expressions, holders and targets are sequences

Many different packages:
  Mallet (http://mallet.cs.umass.edu)
  CRFSuite (http://www.chokkan.org/software/crfsuite)

Most used input format: sequential data
  One token per line, represented by features

KAF

KAF modified for OpeNER

Different layers for different information

All the features are extracted from the KAF

No external linguistics processors are called

First steps

Define which will be our “output classes”
  Target, Holder, Positive, Negative

Define which features will represent each token
  Token, lemma, pos, polarity, entity, and bi/tri-grams around

Study the input format of your selected CRF package (CRFSuite in my case)

CRFSuite input format

One file with all data
Sequences separated by empty lines
One token per line with the format: CLASS [TAB] FEATS

CLASS: O | B-class | I-class
  O: no class
  B-class: the first element of a sequence of type “class”
  I-class: element inside of a sequence of type “class”

FEATS: feat1=val1 [TAB] feat2=val2 …

Example line:
  B-NP a=He b=reckons c=the d=He|reckons e=P
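The format described above can be produced with a few lines of Python. This is a minimal sketch, not the script used in the tutorial: the function names `bio_labels`, `format_sequence` and `format_dataset` are hypothetical, and spans are assumed to be given as (start, end, class) with an exclusive end index.

```python
def bio_labels(length, spans):
    """Return one BIO label per token for a sequence of `length` tokens.

    `spans` is a list of (start, end, cls) tuples, `end` exclusive (assumed layout).
    """
    labels = ["O"] * length
    for start, end, cls in spans:
        labels[start] = "B-" + cls          # first element of the sequence
        for i in range(start + 1, end):
            labels[i] = "I-" + cls          # elements inside the sequence
    return labels


def format_sequence(labels, feature_rows):
    """Render one sequence as lines of CLASS [TAB] feat1=val1 [TAB] feat2=val2 ..."""
    lines = []
    for label, feats in zip(labels, feature_rows):
        fields = [label] + ["%s=%s" % (name, value) for name, value in feats]
        lines.append("\t".join(fields))
    return "\n".join(lines)


def format_dataset(sequences):
    """Join (labels, feature_rows) pairs, separating sequences with an empty line."""
    return "\n\n".join(format_sequence(l, f) for l, f in sequences) + "\n"


if __name__ == "__main__":
    # He reckons the current account -> NP, VP, NP chunks
    print(bio_labels(5, [(0, 1, "NP"), (1, 2, "VP"), (2, 5, "NP")]))
```

Tokens that contain a colon may need escaping, since CRFsuite appears to use the colon to attach a weight to an attribute; the CRFsuite manual is the place to confirm this.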

Simple Example

We want to train a chunker (also sequences)

Tagged data:
  He/PRP reckons/VBZ the/DT current/JJ account/NN

Chunks: NP (He), VP (reckons), NP (the current account)

Features per token: token (t), pos (p), previous token (pt), next token (nt), previous pos (pp), next pos (np)

B-NP t=He p=PRP pt=O nt=reckons pp=O np=VBZ
B-VP t=reckons p=VBZ pt=He nt=the pp=PRP np=DT
B-NP t=the p=DT pt=reckons nt=current pp=VBZ np=JJ
I-NP t=current p=JJ pt=the nt=account pp=DT np=NN
I-NP t=account p=NN pt=current nt=O pp=JJ np=O
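The six features of this example can be computed with a one-token window over the tagged sentence. The sketch below is illustrative only: `context_features` is a hypothetical helper, and it assumes that a missing neighbour at the sentence boundary is written as `O`, as the slide does for `pt=O` and `nt=O`.

```python
def context_features(tagged):
    """Build (name, value) features for each (token, pos) pair in `tagged`."""
    boundary = ("O", "O")  # placeholder used outside the sentence (assumption)
    rows = []
    for i, (tok, pos) in enumerate(tagged):
        prev_tok, prev_pos = tagged[i - 1] if i > 0 else boundary
        next_tok, next_pos = tagged[i + 1] if i + 1 < len(tagged) else boundary
        rows.append([
            ("t", tok), ("p", pos),
            ("pt", prev_tok), ("nt", next_tok),
            ("pp", prev_pos), ("np", next_pos),
        ])
    return rows


if __name__ == "__main__":
    sentence = "He/PRP reckons/VBZ the/DT current/JJ account/NN"
    tagged = [tuple(item.rsplit("/", 1)) for item in sentence.split()]
    for row in context_features(tagged):
        print(" ".join("%s=%s" % (name, value) for name, value in row))
```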

My approach

1. Obtain features for each single token
   1. Input KAF
   2. Output ‘TAB’ format
   3. Our own customized feature extractor

2. Generate the final set of features (context)
   1. Input TAB format
   2. Output ‘CRF’ format
   3. One existing python script

KAF feature extractor

Python script that reads a KAF file and generates the ‘TAB’ format

KafParser + Python script
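A rough idea of what such an extractor does, sketched with the standard library instead of KafParser. Everything here is an assumption made for illustration: the sample document, the column order (word id, token, lemma, pos) and the function name `kaf_to_tab`. It also assumes the usual KAF layout, where `<wf>` elements carry a `wid` attribute and each `<term>` points to its tokens through `<span><target id="..."/></span>`.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample; real KAF files contain more layers and attributes.
SAMPLE_KAF = """<KAF>
  <text>
    <wf wid="w1" sent="1">The</wf>
    <wf wid="w2" sent="1">room</wf>
  </text>
  <terms>
    <term tid="t1" lemma="the" pos="D"><span><target id="w1"/></span></term>
    <term tid="t2" lemma="room" pos="N"><span><target id="w2"/></span></term>
  </terms>
</KAF>"""


def kaf_to_tab(kaf_xml):
    """Return one TAB line per <wf>: wid, token, lemma, pos."""
    root = ET.fromstring(kaf_xml)
    # Map each word id to the lemma and pos of the term that spans it.
    term_info = {}
    for term in root.iter("term"):
        for target in term.iter("target"):
            term_info[target.get("id")] = (term.get("lemma", "-"), term.get("pos", "-"))
    lines = []
    for wf in root.iter("wf"):
        lemma, pos = term_info.get(wf.get("wid"), ("-", "-"))
        lines.append("\t".join([wf.get("wid"), wf.text or "", lemma, pos]))
    return "\n".join(lines)


if __name__ == "__main__":
    print(kaf_to_tab(SAMPLE_KAF))
```

Further columns (polarity, entity, the gold opinion class) would be added the same way, by reading the corresponding KAF layers.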



Converting to CRF

Python script:
  Specify the format of your tab file
  Specify the “templates” (features) for each token
  Run the script using the TAB and generate OUT
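The three steps above could look like the following sketch. It is not the existing script mentioned in the tutorial: the column layout `FIELDS`, the template notation (tuples of (field, offset) pairs) and the feature naming are assumptions chosen for illustration. A template is skipped for a token when any of its offsets falls outside the sequence.

```python
FIELDS = ["token", "pos", "label"]  # assumed column layout of the TAB file

TEMPLATES = [
    (("token", 0),),                 # current token
    (("pos", 0),),                   # current pos
    (("token", -1),),                # previous token
    (("token", 1),),                 # next token
    (("token", 0), ("token", 1)),    # bigram: current and next token
]


def read_tab(text, fields=FIELDS):
    """Parse TAB text into sequences of dict rows; empty lines separate sequences."""
    sequences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sequences.append(current)
                current = []
            continue
        current.append(dict(zip(fields, line.split("\t"))))
    if current:
        sequences.append(current)
    return sequences


def apply_templates(seq, templates=TEMPLATES, label_field="label"):
    """Render one sequence as CLASS [TAB] features, one line per token."""
    lines = []
    for i, row in enumerate(seq):
        feats = []
        for template in templates:
            values = []
            for field, offset in template:
                j = i + offset
                if not 0 <= j < len(seq):
                    break                      # template reaches outside the sequence
                values.append(seq[j][field])
            else:
                name = "|".join("%s[%d]" % (f, o) for f, o in template)
                feats.append("%s=%s" % (name, "|".join(values)))
        lines.append("\t".join([row[label_field]] + feats))
    return "\n".join(lines)


if __name__ == "__main__":
    tab = "He\tPRP\tB-NP\nreckons\tVBZ\tB-VP\n"
    for seq in read_tab(tab):
        print(apply_templates(seq))
```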

Extracting opinions: Training

1. Get all KAF files with annotations
2. Obtain TAB file for each file
3. Convert to CRF for each file
4. Create a single training file with all CRF files
5. Train the MODEL with crfsuite:
   crfsuite learn -m my_model my_data.crf
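Steps 4 and 5 can be scripted as in the sketch below. The helper names `merge_crf_files`, `train_command` and `train` are hypothetical; the only CRFsuite call used is the `crfsuite learn -m` command shown on the slide, and it is assumed that the `crfsuite` executable is on the PATH.

```python
import subprocess
from pathlib import Path


def merge_crf_files(paths, out_path):
    """Concatenate per-document CRF files, keeping an empty line between them."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            text = Path(path).read_text(encoding="utf-8").strip("\n")
            if text:
                # The trailing empty line keeps the last sequence of one file
                # separate from the first sequence of the next.
                out.write(text + "\n\n")


def train_command(model_path, data_path):
    """Build the command line: crfsuite learn -m MODEL DATA."""
    return ["crfsuite", "learn", "-m", str(model_path), str(data_path)]


def train(model_path, data_path):
    """Run the training; raises CalledProcessError if crfsuite fails."""
    subprocess.run(train_command(model_path, data_path), check=True)
```

A typical call would be `merge_crf_files(crf_paths, "my_data.crf")` followed by `train("my_model", "my_data.crf")`.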

Extracting opinions: Tagging one KAF file

1. Generate TAB file: one line for each TOKEN (<wf>)
2. Convert to CRF
3. Tag with the trained model:
   crfsuite tag -m my_model my_kaf.crf
4. Read and align output from crfsuite


5. Generate the KAF layer
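Reading and aligning the tagger output can be sketched as follows. The code assumes that the tagger prints one predicted label per line with an empty line after each sequence, and that the token ids are kept in the same order as the lines of the CRF file. The function names and the labels in the usage example are illustrative, not taken from the tutorial.

```python
def read_predictions(output_text):
    """Split tagger output into sequences of labels; empty lines separate sequences."""
    sequences, current = [], []
    for line in output_text.splitlines():
        label = line.strip()
        if label:
            current.append(label)
        elif current:
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences


def bio_to_spans(token_ids, labels):
    """Group aligned (token id, BIO label) pairs into (class, [token ids]) spans."""
    if len(token_ids) != len(labels):
        raise ValueError("tokens and labels are not aligned")
    spans, current = [], None
    for tid, label in zip(token_ids, labels):
        cls = label[2:]
        starts_new = label.startswith("B-") or (
            # An I- label with no matching open span is treated as a new span.
            label.startswith("I-") and (current is None or current[0] != cls)
        )
        if starts_new:
            current = (cls, [tid])
            spans.append(current)
        elif label.startswith("I-"):
            current[1].append(tid)
        else:
            current = None  # O label closes any open span
    return spans


if __name__ == "__main__":
    # "My wife said that the room was really dirty" with hypothetical predictions
    ids = ["w%d" % i for i in range(1, 10)]
    labels = ["B-Holder", "I-Holder", "O", "O", "B-Target", "I-Target",
              "O", "B-Negative", "I-Negative"]
    print(bio_to_spans(ids, labels))
```

Each resulting span holds the token ids needed to build the corresponding element of the KAF opinion layer.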

How to adapt this?

1. Adapt the KAF feature extractor (+++)
2. Adapt the TAB-CRF converter (+)
3. Train your model (+)
4. Adapt the CRF -> KAF de-converter (++)

Technology
