Text Analysis Method Using Latent Topics for Field Notes in Area Studies Taizo Yamada Historiographical Institute, The University of Tokyo 2013/12/13 PNC2013 1


Page 1: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Text Analysis Method Using Latent Topics for Field Notes in Area Studies

Taizo Yamada Historiographical Institute,

The University of Tokyo

2013/12/13 PNC2013 1

Page 2: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Contribution

Text analysis for Area Studies
– Applying a topic model to a field note for Area Studies
  • We use LDA (Latent Dirichlet Allocation) as the topic model.
  • Similar fragments or scenes in the field note can be obtained.
– Visualization of the relationships between place names
  • The place information does not include latitude and longitude.
  • We do not have any dictionaries of place names.

Page 3: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Outline

Background, purpose
Methodology of text analysis
– Text structuring
– Term extraction
– Characterization of terms
– Method of obtaining similar text fragments
– Visualization and system
Conclusion

Page 4: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Background

Recently, Area Studies has made remarkable progress.
– Researchers in Area Studies can search and analyze large volumes of data easily and quickly,
– using information technology such as web technology, data analysis, data engineering, …
– To promote this analysis, researchers have published databases.
  • Catalogues, images, statistical data, spatial data and temporal data.

To advance the field further,
– we believe that text analysis is one of the essential elements.
– A text such as a field note describes sights, scenes and customs,
– but latent topics or subjects can be key elements characterizing the area.

Page 5: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Purpose

Text analysis method for a field note in Area Studies.
– We prepare a field note database in which the data unit is a description of a sight or a scene.
– To detect latent topics, we use latent Dirichlet allocation (LDA).
  • LDA is a topic model.
  • In LDA, each text can be viewed as a mixture of various latent topics, and each topic can be viewed as a mixture of various words.
– To detect the track of the investigator in a field note:
  • Visualizing the track shows the relations between place names.

Page 6: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Text (1)

Target: Koichi Takaya, “The Field note collection2 Sumatra” (in Japanese)
– 1984.10.19 to 1985.1.18
– Covers the whole of Sumatra Island

Page 7: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Text structuring (1)


Page 8: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Text structuring (1)


Page 9: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Text structuring (2)


Page 10: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Term extraction (1)

Morphological analysis
– mecab + ipadic (morphological analyzer + dictionary)

Text (a scene)

マングローブ。前面の海にはバガン ( 魚取り用の櫓 ) いくつもある。
[English: Mangroves. In the sea in front there are many bagan (platforms for catching fish).]

Result of morphological analysis

マングローブ  名詞,一般,*,*,*,*,マングローブ,マングローブ,マングローブ
。  記号,句点,*,*,*,*,。,。,。
前面  名詞,一般,*,*,*,*,前面,ゼンメン,ゼンメン
の  助詞,連体化,*,*,*,*,の,ノ,ノ
海  名詞,一般,*,*,*,*,海,ウミ,ウミ
に  助詞,格助詞,一般,*,*,*,に,ニ,ニ
は  助詞,係助詞,*,*,*,*,は,ハ,ワ
バガン  名詞,一般,*,*,*,*,*
。  記号,句点,*,*,*,*,。,。,。
魚  名詞,一般,*,*,*,*,魚,サカナ,サカナ
取り  名詞,接尾,一般,*,*,*,取り,トリ,トリ
用  名詞,接尾,一般,*,*,*,用,ヨウ,ヨー
の  助詞,連体化,*,*,*,*,の,ノ,ノ
櫓  名詞,一般,*,*,*,*,櫓,ロ,ロ
。  記号,句点,*,*,*,*,。,。,。
いくつ  名詞,代名詞,一般,*,*,*,いくつ,イクツ,イクツ
も  助詞,係助詞,*,*,*,*,も,モ,モ
ある  動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
。  記号,句点,*,*,*,*,。,。,。
EOS

“名詞”: noun, “助詞”: postpositional particle, “記号”: symbol, “動詞”: verb

Page 11: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Term extraction (2)

Extraction target: nouns only. The following types are not extracted:
– pronoun, number,

Input: the result of morphological analysis shown in Term extraction (1).

Bag-of-Words

Bakauhumi: 1
マングローブ (mangrove): 1
前面 (front): 1
海 (sea): 1
バガン (bagan): 1
魚取り用 (for catching fish): 1
櫓 (platform): 1
ココヤシ (coconut palm): 1
下 (below): 1
家 (house): 1
チョウジ (clove): 1
斜面 (slope): 1

The number of distinct terms is 5,666.
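The noun filter described above can be sketched as follows. This is a minimal illustration, not the author's implementation: it parses MeCab/IPADIC-style output that is already available as text (so MeCab itself is not required), keeps tokens whose part of speech is 名詞, skips the subtypes 代名詞 (pronoun) and 数 (number), and joins runs of consecutive nouns into one term. The joining rule is an assumption inferred from the compound 魚取り用 in the bag-of-words; the function name `extract_terms` is hypothetical.

```python
from collections import Counter

# MeCab/IPADIC-style lines: "surface<TAB>pos,subtype,...".
# Only the first two feature fields are used, so the rest are omitted here.
SAMPLE = "\n".join([
    "マングローブ\t名詞,一般", "。\t記号,句点",
    "前面\t名詞,一般", "の\t助詞,連体化",
    "海\t名詞,一般", "に\t助詞,格助詞", "は\t助詞,係助詞",
    "バガン\t名詞,一般", "。\t記号,句点",
    "魚\t名詞,一般", "取り\t名詞,接尾", "用\t名詞,接尾",
    "の\t助詞,連体化", "櫓\t名詞,一般", "。\t記号,句点",
    "いくつ\t名詞,代名詞", "も\t助詞,係助詞",
    "ある\t動詞,自立", "。\t記号,句点",
    "EOS",
])

EXCLUDED_SUBTYPES = {"代名詞", "数"}  # pronoun, number


def extract_terms(mecab_output):
    """Return noun terms, joining consecutive nouns (assumed rule)."""
    terms, run = [], []
    for line in mecab_output.splitlines():
        if "\t" in line:
            surface, feature_str = line.split("\t", 1)
            feats = feature_str.split(",")
        else:  # "EOS" or an empty line ends any open noun run
            surface, feats = "", [""]
        subtype = feats[1] if len(feats) > 1 else ""
        if feats[0] == "名詞" and subtype not in EXCLUDED_SUBTYPES:
            run.append(surface)
        elif run:
            terms.append("".join(run))
            run = []
    if run:
        terms.append("".join(run))
    return terms


TERMS = extract_terms(SAMPLE)
BAG = Counter(TERMS)  # bag-of-words: term -> frequency in the scene
print(BAG)
```

For the sample scene this yields the six terms that open the bag-of-words listing (マングローブ through 櫓); the remaining entries in the listing come from the neighbouring scenes.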

Page 12: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

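Term extraction (3)

Mark up the extracted terms.
– The terms may characterize the scene in the text.
– The extracted terms differ from scene to scene.

What features do the terms have?
– We need a method for detecting the features.
– But we do not have any thesaurus or dictionaries.

To detect the features, we introduce a topic model.
– Using a topic model, we can detect latent topics as the features.

Field note excerpt:

720km: Jakarta 出発 [Depart Jakarta.]
830km: Bakauhumi (*1)
  ① マングローブ。前面の海にはバガン ( 魚取り用の櫓 ) いくつもある。 [Mangroves. In the sea in front there are many bagan (platforms for catching fish).]
  ② ココヤシ多い。この下に少し家ある。 [Many coconut palms. A few houses beneath them.]
  ③ チョウジの多い斜面。 [A slope with many clove trees.]
853km: 稲。今若実り。 [Rice, now in young grain.]
54km: このあたりよりチョウジ多くなる。その下を時に耕している。トウモロコシを植えるらしい。 [From around here cloves become more common. The ground beneath them is sometimes cultivated. Apparently maize is planted.]
70km: 水田をよく見る。東に海見える。 [Paddy fields are seen often. The sea is visible to the east.]
77-79km: ココヤシが多い。時に水田あり、それ実っている。 [Many coconut palms. Occasional paddy fields, bearing grain.]
85km: ココヤシ園広い。時にチョウジがある。 [Extensive coconut plantations. Occasionally cloves.]
90km: 西海岸に来る。マングローブあるが、その背後にはココヤシ多い。 [Reach the west coast. There are mangroves, but many coconut palms behind them.]
97km: チョウジが多い。この辺りは殆どがジャワ人だという。 [Many cloves. The people around here are said to be mostly Javanese.]
01km: Sidomulyo 。周り、シラス台地。 [Sidomulyo. The surroundings are a shirasu plateau.]
11km: 5 ~ 10 年生のココヤシ多い。他に、チョウジ、バナナ、ランブータン、ドリアン。 [Many coconut palms 5 to 10 years old. Also cloves, bananas, rambutan, durian.]
18km; 左の海にはバガンが 100 基ほど見える。 [About 100 bagan are visible in the sea to the left.]
22km: 海岸は広くココヤシ。これ 60 年生。高みはチョウジ多い。 [The coast is broadly covered with coconut palms, 60 years old. The higher ground has many cloves.]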

Page 13: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Using topic model (1)

We use LDA (Latent Dirichlet Allocation) as the topic model.
– Topic model
  • Models the co-occurrence of terms.
  • The results give a classification of terms.
– Kinds of topic model
  • LSI (Latent Semantic Indexing): introduces latent topics into the VSM (Vector Space Model).
  • PLSI (Probabilistic Latent Semantic Indexing): a redefinition of LSI as a probabilistic model.
  • LDA: an improvement of PLSI based on Bayesian learning.

Page 14: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Using topic model (2)

LDA: D. M. Blei, et al., “Latent Dirichlet Allocation”, 2003.
– A document generation model in which the generating probability of latent topics follows a Dirichlet distribution.
– Latent topics can be determined once the parameters of LDA have been tuned.
– α, β: parameters of LDA
– z: latent topic
– θ: generating probability of latent topics
– d: document, w: term, N: the total number of terms in d
– Dir: Dirichlet distribution
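For reference, the LDA model of Blei et al. (2003) is usually written as below. This restates the formula from that paper and is not necessarily the exact form shown on the slide. Here θ is the per-document topic distribution, z_n is the latent topic of the n-th term, w_n is the n-th term, N is the number of terms in document d, and α, β are the model parameters. Writing the document as d (the paper uses a bold w) is an assumption made to match the wording of the slide.

```latex
\theta \sim \mathrm{Dir}(\alpha), \qquad
p(d \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n}
           p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right)
    \mathrm{d}\theta
```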

Page 15: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Using topic model (2)

Generative process of LDA:
– θ is generated from α (θ follows a Dirichlet distribution).
– A topic z_k is generated according to θ.
– A term is generated according to the topic z_k and β.
– A document is generated from its terms.
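The generative process can be sketched in a few lines of Python using only the standard library. Everything concrete here is an illustrative assumption: the two-topic setting, the vocabulary of English glosses, the values of `ALPHA` and `BETA`, and the function names. The Dirichlet draw uses the standard construction of normalizing independent Gamma draws.

```python
import random

VOCAB = ["mangrove", "sea", "bagan", "coconut", "clove", "paddy"]
ALPHA = [0.5, 0.5]                      # Dirichlet parameter, one entry per topic
BETA = [                                # BETA[k][v] = p(term v | topic k)
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],     # hypothetical "coast" topic
    [0.0, 0.0, 0.0, 0.4, 0.4, 0.2],     # hypothetical "plantation" topic
]


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]


def generate_document(alpha, beta, vocab, n_terms, rng):
    """Generate one document: theta from alpha, then a topic and a term per position."""
    theta = sample_dirichlet(alpha, rng)
    topics, terms = [], []
    for _ in range(n_terms):
        z = rng.choices(range(len(alpha)), weights=theta)[0]  # topic drawn from theta
        w = rng.choices(vocab, weights=beta[z])[0]            # term drawn from beta[z]
        topics.append(z)
        terms.append(w)
    return theta, topics, terms


rng = random.Random(0)
THETA, TOPICS, TERMS = generate_document(ALPHA, BETA, VOCAB, 8, rng)
print(THETA, TERMS)
```

Fitting LDA runs this process in reverse: given only the terms of each scene, it estimates θ for each scene and the term distribution of each topic.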

Page 16: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Detection of latent topic

Features of LDA
– Text
  • A set of terms
  • Has multiple topics
– Term
  • Can belong to multiple topics
  • Is not tied to one specific topic

Spatial change (scene change)
– By visualizing the detection results, we can understand the change.
– Latent topics change as the location changes.

Which scenes are similar?
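The slides do not say how the LDA parameters were estimated. As one possibility, the sketch below fits LDA by collapsed Gibbs sampling, which is a different estimator from the variational method in Blei et al. It runs on a toy corpus in which each scene is a list of integer term ids. The corpus, the hyperparameters `alpha` and `beta`, the iteration count and the function name `lda_gibbs` are all illustrative assumptions.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns (theta, topic-term counts)."""
    rng = random.Random(seed)
    ndk = [[0] * n_topics for _ in docs]               # scene-topic counts
    nkw = [[0] * vocab_size for _ in range(n_topics)]  # topic-term counts
    nk = [0] * n_topics                                # terms assigned per topic
    z = []                                             # topic of every term occurrence
    for d, doc in enumerate(docs):                     # random initial assignment
        zd = []
        for w in doc:
            k = rng.randrange(n_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                            # remove the current assignment
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                            # record the resampled topic
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(ndk, docs)
    ]
    return theta, nkw


# Toy corpus: term ids 0-1 and 2-3 tend to co-occur in different scenes.
DOCS = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
THETA, NKW = lda_gibbs(DOCS, n_topics=2, vocab_size=4, n_iter=50)
for row in THETA:
    print([round(p, 2) for p in row])
```

Each row of `THETA` is the topic mixture of one scene, so a text has multiple topics; a term id can carry counts under more than one topic in `NKW`, so a term is not tied to one topic.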

Page 17: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Similarity between texts (1)

We introduce the VSM (Vector Space Model).
– The VSM requires feature vectors.
– Each element of the vector is the total number of terms per topic.
– Similarity between vectors is calculated by cosine similarity.
  • x, y: texts (scenes)
  • w_{x,k}: the weight of topic z_k in text x
  • w_{x,k} is a tf.idf weight
  • tf_{x,k}: the frequency of topic z_k in text x
  • df_k: the number of texts that have topic z_k
  • N: the number of texts
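A minimal sketch of this similarity computation follows. Each text is represented by the number of its terms assigned to each topic, the counts are weighted by tf.idf, and texts are compared by cosine similarity. The idf form log(N / df) is the common textbook definition and is an assumption here, since the exact formula used in the talk is not available; the counts and function names are hypothetical.

```python
import math


def tfidf_vectors(topic_counts):
    """Weight per-text topic counts by idf = log(N / df) (assumed form)."""
    n_texts = len(topic_counts)
    n_topics = len(topic_counts[0])
    # df[k]: number of texts in which topic k occurs at least once
    df = [sum(1 for row in topic_counts if row[k] > 0) for k in range(n_topics)]
    idf = [math.log(n_texts / df[k]) if df[k] else 0.0 for k in range(n_topics)]
    return [[row[k] * idf[k] for k in range(n_topics)] for row in topic_counts]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical counts: rows are scenes, columns are topics.
COUNTS = [
    [4, 1, 0],
    [3, 2, 0],
    [0, 1, 5],
]
VECTORS = tfidf_vectors(COUNTS)
SIM_01 = cosine(VECTORS[0], VECTORS[1])
SIM_02 = cosine(VECTORS[0], VECTORS[2])
print(SIM_01, SIM_02)
```

In this toy data the second topic occurs in every scene, so its idf is zero and it does not contribute to the similarity. That is the usual effect of idf weighting: topics present everywhere do not help to distinguish scenes.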

Page 18: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Similarity between texts (2)


Page 19: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Track of investigation (1)

Beginning of the text
– Date: Oct. 19, ’84
– “Jakarta より Kotabumi へ行く。” [“Go from Jakarta to Kotabumi.”]
– The text describes the movement from “Jakarta” to “Kotabumi”.

Tracking the movement
– Extracting place names.
– Rule:
  • from: ○○[ から | より | 出発 |…] (から, より: “from”; 出発: “depart”)
  • to: ○○[ へ | まで | に | 泊 |…] (へ, まで, に: “to”; 泊: “stay overnight”)
– Unfortunately, we do not have any dictionaries or gazetteers.
– For the time being, we simply connect the extracted place names.
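The from/to rule can be sketched with regular expressions. This is an illustration under stated assumptions, not the author's extractor: it treats a capitalized Latin-script word as a place-name candidate (as in “Jakarta” and “Kotabumi” in the examples), it uses only the markers listed in the rule, and the names `extract_moves` and `connect` are hypothetical.

```python
import re

# Assumption: place names appear as capitalized Latin-script words.
PLACE = r"([A-Z][A-Za-z]+)"
FROM_RE = re.compile(PLACE + r"\s*(?:から|より|出発)")   # departure markers
TO_RE = re.compile(PLACE + r"\s*(?:へ|まで|に|泊)")      # destination markers


def extract_moves(text):
    """Return (origins, destinations) found by the from/to rule."""
    return FROM_RE.findall(text), TO_RE.findall(text)


def connect(places):
    """Connect place names in order of appearance, with no gazetteer lookup."""
    return list(zip(places, places[1:]))


MOVES = extract_moves("Jakarta より Kotabumi へ行く。")
print(MOVES)

# Place names in the order they appear in the field note excerpt.
EDGES = connect(["Jakarta", "Bakauhumi", "Sidomulyo"])
print(EDGES)
```

Without a gazetteer the rule cannot tell a place name from any other capitalized word followed by one of the markers, which is why the extracted names are only connected in sequence for the time being.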

Page 20: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

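Track of investigation (2)

Using D3.js (http://d3js.org/), Force-Directed Graph

[Figure: force-directed graph of the extracted place names, with labels for the months Oct. ’84, Nov. ’84, Dec. ’84 and Jan. ’85, and for the places Jakarta, Solok, Tembilahan, Pekanbaru and Singapore.]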
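A force-directed graph in D3.js is typically driven by a list of nodes and a list of links. The sketch below shows one way such data could be prepared from connected place names. The key names (`id`, `group`, `source`, `target`) follow a convention common in D3 force-directed examples and are an assumption, as are the function name and the month labels assigned to the two places.

```python
import json


def to_d3_graph(edges, month_of):
    """Build a nodes/links dictionary from (from, to) place-name pairs."""
    names = []
    for source, target in edges:
        for name in (source, target):
            if name not in names:          # keep first-seen order, no duplicates
                names.append(name)
    nodes = [{"id": n, "group": month_of.get(n, "unknown")} for n in names]
    links = [{"source": s, "target": t} for s, t in edges]
    return {"nodes": nodes, "links": links}


# Illustrative input: the first movement in the field note, labeled by month.
EDGES = [("Jakarta", "Kotabumi")]
MONTH_OF = {"Jakarta": "1984-10", "Kotabumi": "1984-10"}  # hypothetical labels
GRAPH = to_d3_graph(EDGES, MONTH_OF)
print(json.dumps(GRAPH, ensure_ascii=False, indent=2))
```

Because the place names have no latitude or longitude, a force-directed layout is a reasonable fit: node positions come from the link structure rather than from coordinates.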

Page 21: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies

Conclusion, Future works

We introduced a text analysis method for field notes in Area Studies.
– Using the topic model LDA
– Tracking of the investigator

Future work
– Improvement of text analysis for Area Studies.
  • What kind of system do researchers in Area Studies want?
  • We will consider the answer and develop a system accordingly.

Page 22: Text Analysis Method Using Latent Topics  for  Field Notes in Area  Studies


Thank you for listening to my presentation.

– E-mail: [email protected]
