NLTK Book Chapter 2

NLTK Book Chapter 2 @torithetorick

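Goals of Chapter 2

Learn how to access the features of documents using NLTK's built-in corpora and lexicons.

If you can use conditional frequency distributions, you are in good shape.

Python-specific topics are mostly left out.

For example, method definitions and class definitions.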

NLTK corpora

Gutenberg Corpus

A small selection of texts from Project Gutenberg, an archive of some 25,000 books

Web & Chat Corpus

Brown Corpus

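500 sources, categorized by genre

Differences in style between genres

Reuters Corpus

10,788 tagged news documents

Inaugural Address Corpus

Inaugural addresses of successive US presidents

NLTK lexicons

Word Corpus

Just a plain list of words

Pronouncing Corpus

Includes pronunciation information

Comparative wordlist (Swadesh wordlist)

Comparison across multiple languages

WordNet

How inaugural addresses reflect their times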

Example 1/5

List the number of occurrences of modal verbs (auxiliary verbs that express the speaker's attitude) in the news category of the Brown Corpus.

>>> import nltk
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist(w.lower() for w in news_text)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print(m + ':', fdist[m], end=' ')
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
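To make the counting step concrete without downloading a corpus, here is a minimal sketch using the standard library's collections.Counter, which NLTK 3's FreqDist is built on. The token list is invented for illustration and is not Brown Corpus data; the names tokens and modal_counts are assumptions of this sketch.

```python
from collections import Counter

# Hypothetical token list standing in for brown.words(categories='news').
tokens = ["We", "can", "go", ",", "but", "they", "Could", "not", ".",
          "You", "may", "stay", "and", "you", "can", "rest", "."]

# Lowercase every token before counting, as in the example above.
counts = Counter(w.lower() for w in tokens)

modals = ["can", "could", "may", "might", "must", "will"]

# A Counter returns 0 for words it has never seen, so absent modals are safe.
modal_counts = {m: counts[m] for m in modals}
print(modal_counts)
```

Lowercasing first is what lets "Could" and "could" fall into the same bucket.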

Practice 1/3

List the number of occurrences of wh-words in one category of the Brown Corpus.

>>> fiction_text = brown.words(categories='fiction')
>>> fdist = nltk.FreqDist(w.lower() for w in fiction_text)
>>> wh_words = list(set(w.lower() for w in fiction_text
...     if w.lower().startswith('wh')))
>>> for wh in wh_words:
...     print(wh + ':', fdist[wh])

Practice 1/3

>>> wh_words
['wheel', 'whites', 'whisper', 'whitely', 'whatever',
'whigs', "what'd", 'whir', 'whip', 'whisked',
'wheedled', 'whirl', 'what', 'whiskey', 'whereas',
'when', 'wheels', 'wholly', 'whereabouts', 'whom',
'whinnied', 'which', 'whirled', 'white-clad',
'white', 'whispered', 'wharves', 'who', 'whose',
"what's", "who's", 'whistling', 'whisky-on-the-rocks',
"who'd", 'why', 'whirring', 'whisky',
'whether', 'where', 'whooping', 'wherever',
'whistled', 'whenever', 'while', 'wheezed',
"white's", 'wheeling', 'whole', 'whipping']

Practice 1/3

what: 186
when: 192
whom: 8
which: 124
who: 112
whose: 11
what's: 6
why: 42
who's: 3
whether: 11
whenever: 8
where: 89

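Conditional frequency distributions

P(A|B) is the probability of A given B.

In practice, the distribution is more often used to count A separately for each B.

Look at the distribution of words for each genre.

Plot the results or put them in a table.

In Python, where we previously handled a list of words, we now handle a list of (condition, word) pairs.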

text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

FreqDist(text)

pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]

ConditionalFreqDist(pairs)
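As a rough picture of what a conditional frequency distribution holds, the sketch below builds one counter per condition with the standard library only. It is not NLTK's implementation; the pairs are invented, and the variable name cfd is reused only for readability.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, word) pairs; real ones would come from a corpus.
pairs = [("news", "the"), ("news", "jury"), ("news", "the"),
         ("romance", "the"), ("romance", "love"), ("romance", "love")]

# One Counter per condition: cfd[condition][word] -> count.
cfd = defaultdict(Counter)
for condition, word in pairs:
    cfd[condition][word] += 1

print(sorted(cfd))                    # the conditions
print(cfd["news"]["the"])             # count of 'the' under 'news'
print(cfd["romance"].most_common(1))  # most frequent word under 'romance'
```

Each condition gets its own independent frequency distribution, which is why the same word can have different counts under different conditions.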

Example 2/5

For several languages in the Universal Declaration of Human Rights corpus, plot the cumulative frequency distribution of word lengths.

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
...     'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)
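To show what "cumulative" means for a word-length distribution, here is a small sketch for a single condition using only the standard library. The word list is invented and merely stands in for udhr.words(...); the names length_counts and cumulative are assumptions of this sketch.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical word list standing in for one language of the corpus.
words = ["all", "human", "beings", "are", "born", "free", "and", "equal"]

# Frequency of each word length.
length_counts = Counter(len(w) for w in words)

# Running total over increasing lengths: how many words have length <= n.
lengths = sorted(length_counts)
cumulative = list(accumulate(length_counts[n] for n in lengths))

for n, total in zip(lengths, cumulative):
    print(n, total)
```

The last cumulative value always equals the total number of words, so curves for different languages can be compared by how quickly they approach their own totals.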

Practice 2/3

For one language in the Universal Declaration of Human Rights corpus, plot the frequency distribution of characters.

>>> from nltk.corpus import udhr

>>> raw_text = udhr.raw('Japanese_Nihongo-UTF8')

>>> nltk.FreqDist(raw_text).plot(20)

Example 3/5

Plot the number of occurrences of the words america and citizen in US presidential inaugural addresses against the year of inauguration.

>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ['america', 'citizen']
...     if w.lower().startswith(target))
>>> cfd.plot()
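The generator above packs three nested loops and a filter into one expression. Unrolled as plain loops, the same idea looks like the sketch below. The speeches dictionary is an invented stand-in for the corpus: the file ids follow the year-prefixed naming that fileid[:4] relies on, but the tokens are made up.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the inaugural corpus: file id -> tokens.
speeches = {
    "1789-Washington.txt": ["Fellow", "Citizens", "of", "the", "Senate"],
    "1865-Lincoln.txt": ["American", "slavery", "and", "every", "citizen"],
    "2009-Obama.txt": ["America", "Americans", "citizens"],
}

cfd = defaultdict(Counter)
for fileid, words in speeches.items():
    year = fileid[:4]                      # first four characters are the year
    for w in words:
        for target in ["america", "citizen"]:
            # Prefix match, so 'Americans' and 'Citizens' are counted too.
            if w.lower().startswith(target):
                cfd[target][year] += 1

for target in sorted(cfd):
    print(target, sorted(cfd[target].items()))
```

Here the condition is the target word and the sample is the year, which is why the plot has one line per target with years on the x-axis.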

How inaugural addresses reflect their times

Practice 3/3

In the news and romance categories of the Brown Corpus, find out which day of the week is mentioned most often.

>>> cfd = nltk.ConditionalFreqDist(
...     (genre, w)
...     for genre in ['news', 'romance']
...     for w in brown.words(categories=genre))
>>> days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
...     'Friday', 'Saturday', 'Sunday']

Practice 3/3

>>> cfd.tabulate(samples=days)
           Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday
   news        54        43        22        20        41        33        51
romance         2         3         3         1         3         4         5
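The table answers the question by eye. As a small sketch, the same answer can be read off programmatically. The counts are copied from the output above; the names table and most_mentioned are illustrative and not part of NLTK.

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Counts copied from the tabulated output above, in the same day order.
table = {
    "news":    [54, 43, 22, 20, 41, 33, 51],
    "romance": [2, 3, 3, 1, 3, 4, 5],
}

# For each genre, pick the (day, count) pair with the largest count.
most_mentioned = {
    genre: max(zip(days, counts), key=lambda pair: pair[1])
    for genre, counts in table.items()
}

for genre, (day, count) in most_mentioned.items():
    print(genre, day, count)
```

With these numbers, Monday leads in news and Sunday leads in romance.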

WordNet

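A lexicon with a conceptual hierarchical structure

Developed at Princeton University since 1985

The Japanese WordNet was released in 2009 by the National Institute of Information and Communications Technology (NICT)

The latest version is ver. 1.1, released in 2012

BSD license (free of charge)

There are also variants such as WordNet-Affect, which adds emotion information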

http://wndomains.fbk.eu/wnaffect.html

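WordNet terminology

Synset: a set of synonyms

Lemma: a headword (the "base form" of a word)

Hypernym

If every X is a kind of Y, then Y is a hypernym of X.

Hyponym

If every Y is a kind of X, then Y is a hyponym of X.

Holonym

If X is a part of Y, then Y is a holonym of X.

Meronym

If Y is a part of X, then Y is a meronym of X.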
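The four terms come in two inverse pairs: hypernym/hyponym for "kind of" and holonym/meronym for "part of". The sketch below illustrates that with tiny hand-written dictionaries. The entries are invented for illustration and are not taken from WordNet; invert is a hypothetical helper.

```python
# Hypothetical toy relations (not WordNet data).
hypernym_of = {"car": "motor_vehicle", "truck": "motor_vehicle"}  # kind of
holonym_of = {"wheel": "car", "engine": "car"}                    # part of

def invert(relation):
    """Turn a child -> parent mapping into parent -> [children]."""
    inverse = {}
    for child, parent in relation.items():
        inverse.setdefault(parent, []).append(child)
    return inverse

# Hyponyms are the inverse of hypernyms; meronyms the inverse of holonyms.
hyponyms_of = invert(hypernym_of)
meronyms_of = invert(holonym_of)

print(hyponyms_of)
print(meronyms_of)
```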

Example 4/5

Look up the word motorcar in WordNet.

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('motorcar')
[Synset('car.n.01')]
>>> wn.synset('car.n.01').definition()
'a motor vehicle with four wheels; usually propelled by an internal combustion engine'
>>> wn.synset('car.n.01').lemma_names()
['car', 'auto', 'automobile', 'machine', 'motorcar']

Example 4/5

>>> motorcar = wn.synset('car.n.01')

>>> types_of_motorcar = motorcar.hyponyms()

>>> types_of_motorcar[0]

Synset('ambulance.n.01')

Example 4/5

>>> motorcar.hypernyms()

[Synset('motor_vehicle.n.01')]

>>> paths = motorcar.hypernym_paths()

>>> len(paths)

2

>>> motorcar.root_hypernyms()

[Synset('entity.n.01')]
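More than one hypernym path arises whenever some ancestor has two parents. The sketch below enumerates root-to-node paths in a small hand-made hierarchy. The graph is a simplified assumption for illustration, not the actual WordNet structure, and this hypernym_paths function is a toy, not NLTK's method.

```python
# Hypothetical hierarchy: node -> list of direct hypernyms (parents).
parents = {
    "car": ["motor_vehicle"],
    "motor_vehicle": ["wheeled_vehicle"],
    "wheeled_vehicle": ["container", "vehicle"],  # two parents -> two paths
    "container": ["entity"],
    "vehicle": ["entity"],
    "entity": [],
}

def hypernym_paths(node):
    """Return every path from a root down to node, root first."""
    if not parents[node]:
        return [[node]]
    return [path + [node]
            for parent in parents[node]
            for path in hypernym_paths(parent)]

paths = hypernym_paths("car")
roots = {path[0] for path in paths}

print(len(paths))
print(roots)
```

Both paths end at the same root, which matches the idea that a single most general concept sits at the top of the noun hierarchy.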

Example 5/5

Use WordNet to evaluate the semantic similarity of words.

>>> right = wn.synset('right_whale.n.01')
>>> orca = wn.synset('orca.n.01')
>>> minke = wn.synset('minke_whale.n.01')
>>> tortoise = wn.synset('tortoise.n.01')
>>> novel = wn.synset('novel.n.01')
>>> right.path_similarity(minke)
0.25
>>> right.path_similarity(orca)
0.16666666666666666
>>> right.path_similarity(tortoise)
0.07692307692307693
>>> right.path_similarity(novel)
0.043478260869565216
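Path similarity is 1 / (1 + d), where d is the number of edges on the shortest path between two synsets in the hypernym hierarchy. The sketch below applies that formula to a hand-made toy taxonomy using breadth-first search. The taxonomy is an assumption for illustration and is not extracted from WordNet; this path_similarity is a toy function, not NLTK's method.

```python
from collections import deque

# Hypothetical taxonomy: node -> direct hypernym.
parent = {
    "right_whale": "baleen_whale",
    "rorqual": "baleen_whale",
    "minke_whale": "rorqual",
    "baleen_whale": "whale",
    "toothed_whale": "whale",
    "dolphin": "toothed_whale",
    "orca": "dolphin",
}

# Treat hypernym links as undirected edges.
neighbors = {}
for child, par in parent.items():
    neighbors.setdefault(child, set()).add(par)
    neighbors.setdefault(par, set()).add(child)

def path_similarity(a, b):
    """Return 1 / (1 + edges on the shortest path), or None if unconnected."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1 / (1 + dist)
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(path_similarity("right_whale", "minke_whale"))  # 3 edges apart
print(path_similarity("right_whale", "orca"))         # 5 edges apart
```

A score of 1.0 means the two nodes are identical, and the score shrinks as the concepts sit further apart in the hierarchy.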

Extra

List the words that occur at least n times in the Brown Corpus (here n = 3).

>>> n = 3
>>> fd = nltk.FreqDist(w.lower() for w in brown.words())
>>> words = [vocab for vocab in fd if fd[vocab] >= n]
>>> len(words)
20615
