Cross-Lingual Concern Analysis from Multilingual Weblog Articles

Tomohiro Fukuhara
RACE (Research into Artifacts), The University of Tokyo
5-1-5 Kashiwanoha, Kashiwa, Chiba JAPAN
http://www.race.u-tokyo.ac.jp/~fukuhara/

Takehito Utsuro
Graduate School of Systems and Information Engineering, Tsukuba University
1-1-1 Tennodai, Tsukuba, Ibaraki JAPAN
http://nlp.iit.tsukuba.ac.jp/member/utsuro/

Hiroshi Nakagawa
Information Technology Center, The University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo JAPAN
http://www.r.dl.itc.u-tokyo.ac.jp/~nakagawa/English.html

Abstract

A cross-lingual concern analysis system for multilingual Weblog (blog) articles is proposed. Finding the concerns of people is important in various domains such as business and education. At the same time, a growing variety of languages is appearing on the Internet, so it is useful to support users in finding various viewpoints on a topic across languages. The aim of this research is to help users find differences of concerns using multilingual blog articles. Our prototype system, called KANSHIN, collects and analyzes blog articles written in Chinese, Japanese, Korean, and English (CJKE). Users can find differences of focus on a topic across languages because the system automatically translates keywords into other languages and retrieves articles in each language. An overview of the system and preliminary results are described.

Keywords: Cross-lingual concern analysis, CJKE blog analysis, multilingual Web

1 Introduction

Today many people communicate with each other and share information and knowledge on the Internet. Although English is the dominant language on the current Internet, we expect that information written or spoken in various languages will appear in the near future. We call this phenomenon the multilingual Web. The tendency toward the multilingual Web can be seen in (1) Wikipedia and (2) the blogosphere.

In Wikipedia [1], which is a large-scale public encyclopaedia on the Internet, there are 6 million articles, and these articles are written in 250 languages [2]. The top 5 languages used in Wikipedia are English (1,663,419 articles), German (549,653 articles), French (453,201 articles), Polish (354,394 articles), and Japanese (334,237 articles) [3]. Although articles on the same topic are written in several languages [4], their contents differ by language. By comparing these differences among languages, we can find various viewpoints on that topic.

[1] http://www.wikipedia.org/
[2] http://en.wikipedia.org/wiki/Wikipedia:Multilingual_statistics
[3] As of March 1st, 2007 (from http://en.wikipedia.org/wiki/Wikipedia:Multilingual_statistics).
[4] For example, descriptions of Shinzo Abe, who is the current Prime Minister of Japan, differ by language, especially in Japanese, Chinese, and Korean.

Figure 1: Overview of the cross-lingual concern analysis system using multilingual (CJKE) blog articles.

Another example is the blogosphere. Today, many people around the world read and write blog articles. According to a report [5] that analyzes the state of the blogosphere in June 2006, several languages are used in blog articles, such as English (39%), Japanese (31%), Chinese (12%), and so on. From multilingual blog articles, we can find differences of concerns and opinions about a topic because the concerns of people differ by country and language community. Finding differences of concerns across countries or language communities is useful not only for mutual understanding, but also for solving social problems such as global warming, world poverty, war, conflicts, and so on.

In this paper, we propose a cross-lingual concern analysis system for multilingual blog articles. Figure 1 shows an overview of the prototype system, which collects and analyzes multilingual blog articles. We aim to help users find and compare the concerns of people across languages. To achieve this goal, the system automatically translates keywords written in one language into other languages, and retrieves articles across languages.

This paper consists of the following sections. Section 2 describes previous work on blog analysis systems, and requirements for a cross-lingual concern analysis system. Section 3 describes an overview of the prototype system called KANSHIN. Section 4 shows preliminary results obtained from the system. In Section 5, we discuss related work and a keyword translation issue. In Section 6, we summarize our arguments and describe future work.

2 Previous work

In this section, we describe previous work on blog analysis systems, and requirements for a cross-lingual concern analysis system.

[5] http://www.sifry.com/alerts/archives/000436.html

2.1 Previous work

There are several previous studies and services on blog analysis. Nanno et al. (2004) proposed a system called blogWatcher that collects and analyzes Japanese blog articles. Glance et al. (2004) proposed a system called BlogPulse that analyzes trends of blog articles. With respect to blog analysis services on the Internet, there are several commercial and non-commercial services. Technorati [6], BlogPulse [7], kizasi.jp [8], and blogWatcher [9] are examples of popular blog analysis services. These services, however, are monolingual: each analyzes blog articles in English, Japanese, or another specific language.

With respect to multilingual blog services, Globe of Blogs [10] provides a function for retrieving blog articles across languages. Best Blogs in Asia Directory [11] also provides a retrieval function for Asian-language blogs. These services are not cross-lingual services. Blogwise [12] analyzes multilingual blog articles, but does not provide cross-lingual analysis.

Cross-lingual blog retrieval or blog analysis is needed for understanding differences of viewpoints on a topic. In this research, we focus on a cross-lingual concern analysis system that can retrieve and analyze blog articles written in several languages.

2.2 System requirements

The following functions are needed for cross-lingual concern analysis.

(1) Automatic translation of keywords into other languages.

(2) Clustering and summarizing articles.

(3) Automatic translation of content of blog articles.

Function (1) is needed for realizing cross-lingual concern analysis because the system is required to retrieve and analyze articles written in various languages. Function (2) is required for helping users find important clusters of articles easily. Function (3) is needed for helping users understand the content of blog articles.

In this paper, we implemented function (1) in the prototype system. An overview of the system is described in the next section.

3 Cross-lingual concern analysis system

In this section, we describe a prototype system for cross-lingual concern analysis. We describe (1) an overview of the system, (2) a summary of the blog data, and (3) automatic translation of keywords using Wikipedia.

3.1 Overview

The prototype system, called KANSHIN, collects blog articles written in Chinese, Japanese, Korean, and English. The system provides users with functions for retrieving and analyzing articles.

Figure 1 shows an overview of the system. The system has lists of blog sites for each language. Using these lists, the system collects the RSS [13] or Atom feed files provided by each blog site, extracts keywords from the feed files, and stores the keywords and articles in the corresponding database. We describe the current state of the blog data in Section 3.2.
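The collection step described above can be sketched as follows. This is only an illustration and not the KANSHIN implementation: the sample feed, the function names, and the regular-expression tokenizer (a naive stand-in for the language-specific analysis tools) are all assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up RSS 2.0 feed used only for this illustration.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>Cloning debate</title>
      <description>Animal cloning and stem cell research.</description>
    </item>
    <item>
      <title>Phone cloning</title>
      <description>Reports of mobile phone cloning.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of (title, description) pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        items.append((title, description))
    return items


def extract_keywords(text):
    """Naive tokenizer (assumption): lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def index_feed(xml_text):
    """Count, for each keyword, the number of feed items that contain it."""
    doc_freq = Counter()
    for title, description in parse_rss_items(xml_text):
        # A set is used so that each item counts a keyword at most once.
        doc_freq.update(set(extract_keywords(title + " " + description)))
    return doc_freq


feed_index = index_feed(SAMPLE_RSS)
```

In a real setting the tokenizer would be replaced by a tool suited to each language, since whitespace and letter runs do not delimit words in Chinese or Japanese.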

The system uses several linguistic tools for extracting and indexing keywords from blog articles in each language. For Japanese, we use a morphological analysis tool called Juman [14]. For Korean, we use a morphological analysis tool called KLT [15]. For Chinese, we use a morphological analysis tool called ICTCLAS (Zhang et al., 2003). Chinese characters are divided into two types: (1) traditional Chinese, which is mainly used in Taiwan and Hong Kong, and (2) simplified Chinese, which is mainly used in mainland China. We treat both types in the system. For English, we use a part-of-speech tagger called SS tagger (Tsuruoka and Tsujii, 2005).

Figure 2: Screen image of the prototype system.

[6] http://technorati.com/
[7] http://www.blogpulse.com/
[8] http://kizasi.jp/ (in Japanese)
[9] http://blogwatcher.pi.titech.ac.jp/ (in Japanese)
[10] http://www.globeofblogs.com/
[11] http://www.misohoni.com/bba/
[12] http://www.blogwise.com/
[13] RSS has several expansions, such as RDF Site Summary, Really Simple Syndication, and Rich Site Summary.
[14] http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html (in Japanese)

Users can retrieve and analyze blog articles within each language (monolingual analysis) and across languages (cross-lingual analysis).

With respect to monolingual concern analysis, users can retrieve blog articles, find words that co-occur with a keyword [16], and find daily and monthly topics for each language. We implemented these functions in the previous system (Fukuhara et al., 2005).
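As an illustration of how co-occurring words might be ranked with the Dice coefficient, the following sketch scores each word w against a keyword k as 2|A(k) ∩ A(w)| / (|A(k)| + |A(w)|), where A(x) is the set of articles containing x. The article data, the function names, and the tie-breaking rule are assumptions for this example; the paper does not specify how the system organizes this computation.

```python
def dice_coefficient(docs_with_x, docs_with_y):
    """Dice coefficient between two sets of article identifiers."""
    x, y = set(docs_with_x), set(docs_with_y)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def rank_cooccurring(keyword, articles, top_n=5):
    """Rank words by Dice coefficient with `keyword`.

    `articles` maps an article id to the set of words it contains.
    """
    # Build a word -> set of article ids index.
    postings = {}
    for article_id, words in articles.items():
        for word in words:
            postings.setdefault(word, set()).add(article_id)

    target = postings.get(keyword, set())
    scores = [
        (word, dice_coefficient(target, docs))
        for word, docs in postings.items()
        if word != keyword
    ]
    # Drop words that never co-occur; sort by score, then alphabetically.
    scores = [pair for pair in scores if pair[1] > 0]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


# Made-up toy data: four articles represented as sets of words.
toy_articles = {
    1: {"clone", "human"},
    2: {"clone", "human", "cell"},
    3: {"clone", "phone"},
    4: {"human"},
}
ranking = rank_cooccurring("clone", toy_articles)
```

In this toy data, 'human' appears in two of the three articles that contain 'clone' and in three articles overall, so its score is 2·2/(3+3) = 2/3.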

With respect to cross-lingual concern analysis, users can retrieve articles across the CJKE languages. The system retrieves articles across languages by translating keywords into the other languages. The system provides a comparative trend graph that shows the daily trend of articles containing the keywords. Figure 2 shows a screen image of a result of cross-lingual concern analysis. The system provides a user with graphs of (1) articles by language, (2) the daily trend of articles, and (3) a comparison of the number of articles, together with (4) a list of retrieved articles. We describe the translation procedure in Section 3.3.
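The daily trend comparison can be sketched as a per-language, per-day count of articles that contain the translated keyword. The record layout, the function name, and the sample articles below are assumptions for illustration; the Japanese and Korean keywords are the query strings reported in Section 4.1.

```python
from collections import defaultdict
from datetime import date


def daily_trend(articles, keywords_by_language):
    """Count articles per language and day that contain that language's keyword.

    `articles` is an iterable of (published_date, language, text) records.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for published, language, text in articles:
        keyword = keywords_by_language.get(language)
        # Case-insensitive substring match; a real system would use an index.
        if keyword and keyword.lower() in text.lower():
            counts[language][published] += 1
    return {lang: dict(sorted(days.items())) for lang, days in counts.items()}


# Keywords for 'cloning' in each language (Chinese omitted in this sketch).
keywords = {"en": "cloning", "ja": "クローン", "ko": "클론"}

# Made-up sample records.
sample_articles = [
    (date(2006, 12, 1), "en", "Pros and cons of cloning"),
    (date(2006, 12, 1), "en", "Cloning animals"),
    (date(2006, 12, 2), "ja", "クローン技術について"),
    (date(2006, 12, 2), "en", "Nothing relevant"),
    (date(2006, 12, 3), "ko", "클론 노래"),
]
trend = daily_trend(sample_articles, keywords)
```

Plotting one line per language from `trend` would give a graph of the kind shown in the comparative trend view.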

3.2 Data

Table 1 shows a summary of the blog data stored in the system [17]. For Japanese, 2.6 million blog sites and 139 million articles have been registered since March 18th, 2004. For Chinese, 633 thousand blog sites and 7.0 million articles have been registered since January 25th, 2005. For Korean, 471 thousand blog sites and 28 million articles have been registered since August 1st, 2005. For English, 70 thousand blog sites and 6.0 million articles have been registered since November 11th, 2006. In total, 3.8 million sites and 180 million articles are registered in the system.

[15] http://nlp.kookmin.ac.kr/HAM/kor/ (in Korean)
[16] For calculating co-occurring words, we use the Dice coefficient (Manning and Schütze, 1999, p.299).
[17] Checked on March 10, 2007.

Table 1: Summary of blog data (at March 10, 2007, 0:00)

Language    # of blog sites    # of articles     Days
Chinese             633,678        6,970,596      775
Japanese          2,648,008      139,712,963    1,088
Korean              471,869       28,081,261      587
English              70,167        6,081,399      121
Total             3,823,722      180,846,219        -

Figure 3: A procedure for finding translation of a keyword using Wikipedia.

3.3 Translating keywords using Wikipedia

To translate keywords given by a user, we use Wikipedia [18], which is a public online encyclopedia. Because Wikipedia entries are added and modified by many people, new words that are not registered in traditional dictionaries are added quickly.

Figure 3 shows an overview of the translation procedure. A Wikipedia entry often has hyperlinks to the corresponding entries written in other languages. Therefore, we follow those hyperlinks from an entry written in the source language to the entry in the target language. If there is no entry associated with the keyword, or if there is no hyperlink to the target language, the procedure fails.
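The lookup logic of this procedure, including its two failure cases, can be sketched as below. The in-memory link table is a hand-made, hypothetical excerpt used only for illustration (it is not real Wikipedia data), and the function name and signature are assumptions.

```python
from typing import Dict, Optional, Tuple

# Hypothetical excerpt of interlanguage links, keyed by (language, entry title).
# A real system would obtain these links from Wikipedia itself.
INTERLANGUAGE_LINKS: Dict[Tuple[str, str], Dict[str, str]] = {
    ("en", "Cloning"): {"ja": "クローン", "ko": "클론"},
}


def translate_keyword(
    keyword: str,
    source: str,
    target: str,
    links: Dict[Tuple[str, str], Dict[str, str]] = INTERLANGUAGE_LINKS,
) -> Optional[str]:
    """Follow the interlanguage link from the source entry to the target language.

    Returns None when the procedure fails, that is, when there is no entry
    for the keyword or the entry has no link to the target language.
    """
    entry_links = links.get((source, keyword))
    if entry_links is None:
        return None  # failure case 1: no entry for the keyword
    return entry_links.get(target)  # None is failure case 2: no link to target


japanese = translate_keyword("Cloning", "en", "ja")
chinese = translate_keyword("Cloning", "en", "zh")
unknown = translate_keyword("No such entry", "en", "ja")
```

Returning `None` rather than raising an exception lets a caller skip a language for which no translation is available and still retrieve articles in the remaining languages.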

4 Preliminary results

In this section, we show preliminary results obtained by the system.

4.1 Differences of concerns about ‘cloning’ among languages

We can find differences of focus on a topic by comparing the words that co-occur with the same keyword across languages. We show the words that co-occur with ‘cloning’ in CJKE blog articles. The search period is from December 1st, 2006, through January 25th, 2007. Figure 4 compares the frequency of articles containing the keyword ‘cloning’ for each language. During this period, 5,009 articles were found. Of these, 2,873 articles (57.4%) are English, 1,771 articles (35.4%) are Japanese, 271 articles (5.4%) are Korean, and 94 articles (1.9%) are Chinese.
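The shares quoted above follow directly from the per-language counts. A short check, using the article counts reported in this section (the function name is only illustrative):

```python
# Article counts for 'cloning' reported for the search period.
article_counts = {"English": 2873, "Japanese": 1771, "Korean": 271, "Chinese": 94}


def language_shares(counts):
    """Return each language's share of all retrieved articles, in percent."""
    total = sum(counts.values())
    return {lang: round(100 * n / total, 1) for lang, n in counts.items()}


total_articles = sum(article_counts.values())
shares = language_shares(article_counts)
```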

Figure 4: Comparison of concerns about ‘cloning’ appearing in CJKE blog articles (from December 1st, 2006, through January 25th, 2007). [Line graph: vertical axis, frequency of articles (0 to 140); horizontal axis, date (12/1 to 1/19, weekly ticks); one series each for English, Japanese, Korean, and Chinese.]

In Japanese blogs, we found ‘disease’, ‘technology’, ‘human’, ‘mobile (phone)’, and ‘body’ as co-occurring words. Table 2 shows a list of co-occurring words. The first word, ‘disease’, appears as ‘Crohn’s disease’. This is because, in Japanese, the pronunciation of ‘Crohn’ resembles that of the keyword [19] used for retrieving articles. The fourth word, ‘mobile (phone)’, appears as ‘cloned mobile (phone)’ in blog articles. This is because cellular phone cloning [20] became a topic of gossip in early December 2006.

[18] http://www.wikipedia.org/

In Chinese, ‘human’, ‘time’, ‘country’, ‘problem’, ‘mode’, ‘cell’, ‘period’, and ‘world’ were extracted as co-occurring words in this period. Table 3 shows a list of co-occurring words. We found few interesting articles because only a few articles mentioned ‘cloning’. It seems that Chinese-speaking bloggers have little concern about ‘cloning’.

In Korean, ‘Kang Won Rae’, ‘attack’, ‘song’, ‘star wars’, and ‘phrase and a clause’ were found as co-occurring words [21]. Table 4 shows a list of co-occurring words. The first word, ‘Kang Won Rae’, is the name of a member of CLON, which is one of the popular music groups in South Korea. ‘Kang Won Rae’ was a topic among Korean bloggers because he was in a car accident during this period. Because his name was a topic in this period, and the pronunciations of ‘CLON’ and ‘clone’ resemble each other, his name was extracted as a co-occurring word.

In English blogs, ‘cons’, ‘pros’, ‘animal’, ‘information’, and ‘stem’ were found as co-occurring words. Table 5 shows a list of co-occurring words. Although there are many English articles, we found that most of them are Splogs, that is, spam blogs. This phenomenon is characteristic of English blogs; we hardly noticed Splogs in the other languages. Consequently, the counts for ‘cons’ and ‘pros’ are strongly affected by Splog articles. As described in Kolari et al. (2006), filtering out Splog articles is important in the English blogosphere. Filtering out Splogs is our future work.

[19] We used ‘クローン’ for representing ‘clone’.
[20] http://en.wikipedia.org/wiki/Phone_cloning
[21] We used ‘클론’, which is a transliteration of ‘clone’ in Korean, as the query string.

Table 2: Co-occurred words with ‘cloning’ in Japanese blog.

Term                 # of articles
Disease (病)          222
Technology (技術)     149
Human (人間)          146
Body (体)             78
Mobile (携帯)         56

Table 3: Co-occurred words with ‘cloning’ in Chinese blog.

Term                 # of articles
Human (人)            21
Time (时间)           7
Country (国)          6
Problem (问题)        5
Mode (模式)           4
Cell (细胞)           4
Period (时候)         4
World (世界)          4

5 Discussion

In this section, we discuss (1) related work, and (2) the translation procedure using Wikipedia.

5.1 Related work

With respect to cross-lingual concern analysis, Google provides tools such as Google Trends [22], which analyzes trends of keywords submitted to Google search, and Google News [23], which automatically collects and displays the latest news stories written in various languages. In Google Trends, we can compare across languages the frequency of documents containing keyword strings specified by a user. However, Google Trends does not translate keyword strings into other languages.

Whereas Google Trends and Google News are monolingual services (at this moment), we proposed a cross-lingual concern analysis service.

5.2 Discussion about Wikipedia translation

As for keyword translation using Wikipedia, our procedure fails to translate a keyword when there is no entry associated with it in Wikipedia. Improving this procedure is also our future work. In addition to this improvement, we will evaluate the results of translation using Wikipedia.

6 Conclusion

In this paper, we proposed a cross-lingual concern analysis system that collects and analyzes multilingual blog articles. Our prototype system collects and analyzes Chinese, Japanese, Korean, and English blog articles. Users can find differences of focus on a topic across languages because the system automatically translates keywords into other languages by using Wikipedia. We found differences of concerns about ‘cloning’ among CJKE blog articles. Our future work includes (1) incorporating more languages into the system, (2) filtering out English Splog articles, and (3) improving and evaluating the translation procedure using Wikipedia.

[22] http://www.google.com/trends
[23] http://news.google.com/

Table 4: Co-occurred words with ‘cloning’ in Korean blog.

Term                          # of articles
Kang Won Rae (강원래)          27
Attack (습격)                  20
Song (노래)                    14
Star Wars (스타워즈)           13
Phrase and a clause (구절)     11

Table 5: Co-occurred words with ‘cloning’ in English blog.

Term           # of articles
cons           1959
pros           1850
animal         214
information    209
stem           188

References

Fukuhara, T., Murayama, T., and Nishida, T. (2005). Analyzing concerns of people using weblog articles and real world temporal data. In Proceedings of WWW 2005 2nd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics. (Available at http://www.blogpulse.com/papers/2005/fukuhara.pdf, Accessed 2007-03-13).

Glance, N., Hurst, M., and Tomokiyo, T. (2004). Blogpulse: Automated trend discovery for weblogs. In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics. (Available at http://www.blogpulse.com/www2004-workshop.html, Accessed 2007-03-13).

Kolari, P., Finin, T., and Joshi, A. (2006). SVMs for the Blogosphere: Blog identification and Splog detection. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, pages 92–99. AAAI Press. (Available at http://ebiquity.umbc.edu/get/a/publication/213.pdf, Accessed 2007-03-13).

Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press.

Nanno, T., Fujiki, T., Suzuki, Y., and Okumura, M. (2004). Automatically collecting, monitoring, and mining Japanese weblogs. In WWW Alt. ’04: Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, pages 320–321, New York, NY, USA. ACM Press.

Tsuruoka, Y. and Tsujii, J. (2005). Bidirectional inference with the easiest-first strategy for tagging sequence data. In HLT ’05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 467–474, Morristown, NJ, USA. Association for Computational Linguistics.

Zhang, H.-P., Yu, H.-K., Xiong, D.-Y., and Liu, Q. (2003). HHMM-based Chinese lexical analyzer ICTCLAS. In Proceedings of the second SIGHAN workshop on Chinese language processing, pages 184–187, Morristown, NJ, USA. Association for Computational Linguistics.