Text Analysis Method Using Latent Topics for Field Notes in Area Studies


Taizo Yamada
Historiographical Institute, The University of Tokyo
2013/12/13, PNC 2013

Contribution
  • Text analysis for Area Studies: applying a topic model to a field note.
    • We use LDA (Latent Dirichlet Allocation) as the topic model.
    • Similar fragments or scenes in the field note can be obtained.
  • Visualization of the relationships between place names.
    • The place information does not include latitude and longitude.
    • We do not have any dictionary of place names.

Outline
  • Background and purpose
  • Methodology of text analysis
    • Text structuring
    • Term extraction
    • Characterization of terms
    • Method of obtaining similar text fragments
  • Visualization and system
  • Conclusion

Background
  • Recently, Area Studies has made remarkable progress.
  • Researchers in Area Studies can search and analyze large volumes of data easily and quickly, using information technology such as web technology, data analysis, and data engineering.
  • To promote this analysis, researchers have published databases of catalogues, images, statistical data, spatial data, and temporal data.
  • For the study to progress further, we believe that text analysis is one of the essential elements. A text such as a field note describes sights, scenes, and customs, but its latent topics or subjects can be the key elements that characterize the area.

Purpose
  • A text analysis method for field notes in Area Studies.
  • We prepare a field note database in which the data unit is a description of a sight or a scene.
  • To detect latent topics, we use latent Dirichlet allocation (LDA), which is one kind of topic model. In LDA, each text can be viewed as a mixture of various latent topics, and each topic can be viewed as a mixture of various words.
  • We also detect the track (gait) of the investigator in a field note. Visualizing this track presents the relations between place names.

Text (1)
  • Target: Koichi Takaya, The Field note collection 2, Sumatra (in Japanese)
  • Period: 1984. 10. 19 to 1985. 1. 18
  • Area: Sumatra Island overall

Text structuring (1)

Text structuring (2)

Term extraction (1)
  • Morphological analysis with mecab + ipadic (morphological analyzer; dictionary).
  • Input: a text (a scene). Output: the result of morphological analysis, in which each token is tagged with a part of speech such as noun, postpositional particle, symbol, or verb.
  [Figure: a scene text and the result of its morphological analysis]

Term extraction (2)
  • Extraction target: nouns only.
  • Some noun types are not extracted, such as pronouns and numbers.
  • The extracted terms of a scene form a bag-of-words: each term paired with its count (for example, Bakauhumi: 1).
  • The number of distinct terms is 5,666.
  [Figure: result of morphological analysis and the resulting bag-of-words]

Term extraction (3)
  • Mark up the extracted terms.
  • The terms may characterize the scene in the text.
  • The extracted terms differ from scene to scene.
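The noun-only extraction described above can be sketched as follows. This is a minimal illustration, not the author's code: it assumes MeCab's default output format with ipadic (one token per line, surface form, a tab, then comma-separated features whose first two fields are the part of speech and its subtype). The sample string and the helper name `bag_of_words` are hypothetical.

```python
from collections import Counter

# ipadic noun subtypes that the slides say are NOT extracted.
# "代名詞" = pronoun, "数" = number (assumed to be the relevant ipadic tags).
EXCLUDED_NOUN_SUBTYPES = {"代名詞", "数"}


def bag_of_words(mecab_output):
    """Count nouns in MeCab-style output, skipping pronouns and numbers."""
    counts = Counter()
    for line in mecab_output.splitlines():
        # "EOS" marks the end of a sentence; token lines contain a tab.
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        pos = fields[0]
        subtype = fields[1] if len(fields) > 1 else "*"
        if pos == "名詞" and subtype not in EXCLUDED_NOUN_SUBTYPES:  # 名詞 = noun
            counts[surface] += 1
    return counts


# Hypothetical analyzer output for one short scene (illustrative only).
SAMPLE = "\n".join([
    "ここ\t名詞,代名詞,一般,*,*,*,ここ,ココ,ココ",
    "から\t助詞,格助詞,一般,*,*,*,から,カラ,カラ",
    "ジャカルタ\t名詞,固有名詞,地域,一般,*,*,ジャカルタ,ジャカルタ,ジャカルタ",
    "2\t名詞,数,*,*,*,*,*",
    "出発\t名詞,サ変接続,*,*,*,*,出発,シュッパツ,シュッパツ",
    "。\t記号,句点,*,*,*,*,。,。,。",
    "EOS",
])

if __name__ == "__main__":
    print(bag_of_words(SAMPLE))
```

Running the function once per scene would give one bag-of-words per data unit, which is the input a topic model expects.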

What features do the terms have? We need a method for detecting these features, but we do not have any thesaurus or dictionary.

Therefore, to detect the features, we introduce a topic model. Using a topic model, we can detect latent topics as the features.

[Figure: field note excerpt with distance-keyed entries, e.g. "720km: Jakarta", "830km: Bakauhumi(*1)", "01km: Sidomulyo"]

Using topic model (1)
  • We use LDA (Latent Dirichlet Allocation) as the topic model.
  • Topic model
    • Models the co-occurrence of terms.
    • The results give a classification of terms.
  • Kinds of topic model
    • LSI (Latent Semantic Indexing): a model that introduces latent topics into the VSM (Vector Space Model).
    • PLSI (Probabilistic Latent Semantic Indexing): a redefinition of LSI as a probabilistic model.
    • LDA: an improvement of PLSI based on Bayesian learning.
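For reference, the generative process that LDA assumes is commonly written as below. The symbols are the usual textbook notation and are an assumption here, not necessarily the notation of the slides: $K$ topics, $D$ documents (scenes), $N_d$ terms in document $d$, Dirichlet hyperparameters $\alpha$ and $\beta$, a per-document topic distribution $\theta_d$, and a per-topic term distribution $\phi_k$.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Multinomial}(\phi_{z_{d,n}}), & n &= 1,\dots,N_d
\end{align*}
```

Here $z_{d,n}$ is the topic assigned to the $n$-th term of document $d$, and $w_{d,n}$ is the term itself. The Dirichlet prior on $\theta_d$ is the Bayesian element that distinguishes LDA from PLSI.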

Using topic model (2)
  • A topic z_k is generated according to the topic distribution of the document.
  • A term is generated according to the topic z_k and the term distribution of that topic.
  • A document is generated by generating its terms in this way.

Detection of latent topic
  • Features of LDA
    • Text: a set of terms, having multiple topics.
    • Term: belongs to multiple topics, not only to one specific topic.

  • Spatial change (scene change)
    • By visualizing the detection results, we can follow the change.
    • The latent topics change as the location changes.
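One way to estimate the per-scene topic mixtures is collapsed Gibbs sampling, sketched below in plain Python. This is an assumption for illustration: the slides do not say which inference method or library was used, and the function name `lda_gibbs`, the hyperparameter defaults, and the toy scenes are all hypothetical.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of terms).

    Returns (theta, topic_term): theta[d] is the estimated topic mixture of
    document d, and topic_term[k] counts the terms assigned to topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # n(d, k)
    topic_term = [Counter() for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                        # n(k)
    assign = []

    # Random initial topic assignment for every term occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_term[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_term[k][w] -= 1
                topic_total[k] -= 1
                # Resample a topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_term[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_term[k][w] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_term


if __name__ == "__main__":
    # Hypothetical scenes, each already reduced to its extracted terms.
    scenes = [
        ["rubber", "plantation", "rubber", "road"],
        ["paddy", "village", "paddy", "river"],
        ["rubber", "road", "plantation"],
    ]
    theta, _ = lda_gibbs(scenes, num_topics=2)
    for mixture in theta:
        print([round(p, 2) for p in mixture])
```

Printing the mixtures scene by scene, in the order the scenes occur in the field note, is one simple way to see how the latent topics shift as the investigator moves.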

Which texts are similar?

Similarity between texts (1)

Similarity between texts (2)
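The content of these two slides is not preserved here, so the following is only a sketch of one plausible approach: compare scenes by the cosine similarity of their topic mixtures and rank them. The choice of cosine similarity, the function names, and the example mixtures are assumptions, not the method stated in the presentation.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def most_similar(target, topic_mixtures):
    """Rank all other scenes by similarity to `target`, most similar first."""
    scores = [
        (name, cosine_similarity(topic_mixtures[target], mixture))
        for name, mixture in topic_mixtures.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical topic mixtures for three scenes (two topics each).
    mixtures = {
        "scene-1": [0.9, 0.1],
        "scene-2": [0.8, 0.2],
        "scene-3": [0.1, 0.9],
    }
    for name, score in most_similar("scene-1", mixtures):
        print(name, round(score, 3))
```

Other measures on probability distributions, such as the Jensen-Shannon divergence, could be substituted without changing the ranking logic.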

Track of investigation (1)
  • Beginning of the text: "Date: Oct. 19. 84", followed by the place names Jakarta and Kotabumi.
  • This text indicates a movement from Jakarta to Kotabumi.

Tracking the movement
  • Extract place names.
  • Rule:
    • from: [|||]
    • to: [||||]
  • Unfortunately, we do not have any dictionaries or gazetteers.
  • For the time being, I simply connect the extracted place names.
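Connecting the extracted place names can be sketched as building a small graph. The code below assumes the names are connected in order of appearance and emits the `nodes`/`links` shape commonly fed to D3 force-directed layouts; the function name `build_graph`, the `value` field, and the example sequence are hypothetical, and the extraction rule itself is not reproduced.

```python
import json
from collections import Counter


def build_graph(place_sequence):
    """Connect consecutive place names into a nodes/links structure."""
    # Unique place names, kept in first-seen order.
    nodes = list(dict.fromkeys(place_sequence))
    # Count each directed move between two different consecutive places.
    link_counts = Counter(
        (src, dst)
        for src, dst in zip(place_sequence, place_sequence[1:])
        if src != dst
    )
    return {
        "nodes": [{"id": name} for name in nodes],
        "links": [
            {"source": src, "target": dst, "value": count}
            for (src, dst), count in link_counts.items()
        ],
    }


if __name__ == "__main__":
    # Hypothetical order of extracted place names (illustrative only).
    sequence = ["Jakarta", "Kotabumi", "Solok", "Pekanbaru", "Solok"]
    print(json.dumps(build_graph(sequence), indent=2))
```

Because no gazetteer is available, the graph carries no coordinates; a force-directed layout only needs the node identifiers and the links between them.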

Track of investigation (2)
  • Using D3.js (http://d3js.org/): Force-Directed Graph
  [Figure: force-directed graph of place names; labels include Oct. 84, Nov. 84, Jan. 85, Dec. 84 and the nodes Jakarta, Solok, Tembilahan, Pekanbaru, Singapore]

Conclusion, Future works
  • We introduced text analysis for field notes in Area Studies:
    • using the topic model LDA;
    • tracking of the investigator.
  • Future work
    • Improvement of text analysis for Area Studies.
    • What kind of system do researchers in Area Studies want? We will consider the answer and develop a system accordingly.

Thank you for listening to my presentation.

E-mail: t_yamada@hi.u-tokyo.ac.jp