Project Next Kickoff (May 19, 2014): Failure Analysis and Progress Monitoring in IR (in Japanese)

Failure Analysis and Progress Verification in Information Retrieval Research

Tetsuya Sakai, Waseda University

[email protected]

19th May 2014, Project Next Kickoff@NII

TALK OUTLINE

1. Reliable Information Access (RIA) workshop (2003)

2. Verifying progress across NTCIR rounds

3. Summary

RIA overview [Buckley+04, Harman+04]

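• A six‐week, invitation‐only "summer camp" hosted by NIST in 2003

• 28 participants from 12 IR research organisations took part, bringing their own systems

• Goal: The effectiveness of an IR system varies greatly across topics (search requests). When each system is subjected to failure analysis in isolation, the findings are system‐dependent. By bringing the systems together, identify the system‐independent technical challenges!

• Data: 45 topics from past TREC rounds, together with their relevance judgments (qrels, i.e. query relevance sets). Only the system outputs (runs) were newly created.

• Bottom‐up track: manual failure analysis of search results

• Top‐down track: evaluating the effectiveness of pseudo‐relevance feedback by sharing intermediate results across systems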

RIA failure analysis process [Buckley+04, p.6]

1. The topic (or pair of topics) for the day was determined, with a leader being assigned the topic, on a rotating basis among all participants

2. Each participant was assigned one of the six standard runs (or systems) to examine, either individually or as a team

3. Each participant or team spent from 60 to 90 minutes investigating how their assigned system did on the assigned topic, examining how the system did absolutely, how it did compared to the other systems, and how performance could be improved for it. A template (see Figure 1) was generally filled out to guide both the investigation and subsequent discussions

4. All participants assigned to a topic discussed the topic for 20 to 30 minutes, in separate rooms if there were two topics. The failures of each system were discussed, along with any conclusions about the difficulty of the topic itself.

5. The topic leader summarized the results of the discussion in a short report (a template was developed for this by week 3 of the workshop). If there were 2 topics assigned for the day, each leader would give a short presentation on the results to the workshop as a whole.

RIA failure analysis categorisation [Buckley+04, pp.10‐12]

1. General success ‐ present systems worked well

2. General technical failure (stemming, tokenization)

3. All systems emphasize one aspect; missing another required term

4. All systems emphasize one aspect; missing another aspect

5. Some systems emphasize one aspect; some another; need both

6. All systems emphasize one irrelevant aspect; missing point of topic

7. Need outside expansion of “general” term (Europe for example)

8. Need QA query analysis and relationships

9. Systems missed difficult aspect that would need human help

10. Need proximity relationship between two aspects

Antarctic vs Antarctica

“What disasters have occurred in tunnels used for transportation?”

“How much sugar does Cuba export and which countries import it?”

“What are new methods of producing steel?”

“What countries are experiencing an increase in tourism?”

Categorisation done by one person

RIA failure analysis conclusions [Buckley+04, p.12]

• The first conclusion is that the root cause of poor performance on any one topic is likely to be the same for all systems.

• The other major conclusion to be reached from these category assignments is that if a system can realize the problem associated with a given topic, then for well over half the topics studied (at least categories 1 through 5), current technology should be able to improve results significantly. This suggests it may be more important for research to discover what current techniques should be applied to which topics, than to come up with new techniques.


Improvements that don’t add up [Armstrong+09]

Armstrong et al. analysed 106 papers from SIGIR ’98‐’08 and CIKM ’04‐’08 that used TREC data, and reported:

• Researchers often use low baselines

• Researchers claim statistically significant improvements, but the results are often not competitive with the best TREC systems

• IR effectiveness has not really improved over a decade!

[Figure: two panels, “What we want” and “What we’ve got?”]

“Running on the spot?” [Armstrong+09]

Each line represents a statistically significant improvement over a baseline

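Verifying progress over the three years from NTCIR‐3 to NTCIR‐5 (1) [Sakai06FIT]

No significant differences among the top three teams

The top two teams both use ChaSen, and the documents they retrieve are similar

All of the top three teams use BM25, pseudo‐relevance feedback, result merging and similar techniques; no innovative technology is seen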

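Verifying progress over the three years from NTCIR‐3 to NTCIR‐5 (2) [Sakai06FIT]

For each team, an unpaired test comparing "old system × old topic set" with "new system × new topic set" shows no significant difference in any case.

Running team TSB's old and new systems on the same topic set (NTCIR‐3) allows a paired test, which shows a significant difference. Progress?

Running the same new system (TSB) on the old and new topic sets and applying an unpaired test shows no significant difference. Can the topic sets be regarded as roughly equivalent?

(ignoring the effect of tuning)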
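The contrast between paired and unpaired tests on this slide can be illustrated with a small sketch. It assumes randomization (permutation) tests, which are a stand‐in chosen here and not necessarily the tests used in [Sakai06FIT]. The function names and the per‐topic scores are invented for illustration.

```python
import random
from statistics import mean


def paired_randomization_test(old, new, trials=10000, seed=0):
    """Two-sided sign-flip test on per-topic differences (same topics, two systems)."""
    if len(old) != len(new):
        raise ValueError("a paired test needs one score per topic from each system")
    rng = random.Random(seed)
    diffs = [n - o for o, n in zip(old, new)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the sign of each per-topic difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return hits / trials


def unpaired_permutation_test(xs, ys, trials=10000, seed=0):
    """Two-sided permutation test on the difference of means (two separate samples)."""
    rng = random.Random(seed)
    observed = abs(mean(xs) - mean(ys))
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(xs)]) - mean(pooled[len(xs):])) >= observed:
            hits += 1
    return hits / trials


# Invented per-topic scores for illustration only (not NTCIR data).
old_scores = [0.20, 0.35, 0.50, 0.10, 0.65, 0.40, 0.30, 0.55]
new_scores = [0.24, 0.38, 0.56, 0.13, 0.70, 0.43, 0.35, 0.59]

p_paired = paired_randomization_test(old_scores, new_scores)
p_unpaired = unpaired_permutation_test(old_scores, new_scores)
print(f"paired p = {p_paired:.4f}, unpaired p = {p_unpaired:.4f}")
```

With these invented scores the new system is slightly better on every topic. The paired test therefore reports a small p‐value, while the unpaired test, which ignores the topic pairing and is dominated by between‐topic variance, does not.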

Verifying progress [Sakai06FIT]

[Figure: 2×2 design crossing systems (NTCIR‐X systems; NTCIR‐(X‐1) systems (Revived Runs)) with topic sets (NTCIR‐(X‐1) topic set; NTCIR‐X topic set). The four cells are labelled (a), (b), (c) and (d); the two kinds of comparison are labelled “Check topic set equivalence” and “Check progress”. Example labels: INTENT‐2 Systems; INTENT‐1 Systems (INTENT‐2 Retro Runs).]

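“To determine whether progress has been made, NTCIR as a whole should adopt a mechanism that verifies the equivalence of collections through 'fixed algorithm, old and new collections' experiments, while clarifying the degree of progress through 'fixed collection, old and new algorithms' experiments” [p.4]

Comparison of (c) and (d) [or of (b) and (a)]

Comparison of (d) and (a) [or of (c) and (b)]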
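The proposal quoted above can be read as a 2×2 design over system generation and topic set. The following is a minimal sketch of how the valid comparisons fall out of that design. The "old"/"new" labels and the function name `plan_comparisons` are hypothetical, and the slide's (a) to (d) labels are deliberately not mapped onto cells here.

```python
from itertools import combinations


def plan_comparisons(cells):
    """Classify each pair of (system, topic_set) cells in the 2x2 design.

    Fixed topic set, different systems -> progress check (paired test possible).
    Fixed system, different topic sets -> topic set equivalence check (unpaired test).
    Pairs that differ in both dimensions are confounded and skipped.
    """
    plan = []
    for (sys1, top1), (sys2, top2) in combinations(sorted(cells), 2):
        if top1 == top2 and sys1 != sys2:
            plan.append(("check progress", (sys1, top1), (sys2, top2), "paired"))
        elif sys1 == sys2 and top1 != top2:
            plan.append(("check topic set equivalence", (sys1, top1), (sys2, top2), "unpaired"))
    return plan


# Hypothetical cell labels: (system generation, topic set generation).
cells = [("new", "new"), ("new", "old"), ("old", "old"), ("old", "new")]
for kind, left, right, test in plan_comparisons(cells):
    print(f"{kind}: {left} vs {right} ({test} test)")
```

The sketch yields two progress checks and two equivalence checks, matching the two alternatives listed for each kind of comparison. Old system on old topics versus new system on new topics is skipped because system and topic set change together.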

Verifying progress @NTCIR (2007‐) [See References (2) slide]

[Figure: the 2×2 design of systems × topic sets with cells (a), (b), (c) and (d), as on the preceding slide]

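• CLIR@NTCIR‐6: (a)+(b). Test collection reusability is (implicitly) assumed

• PATENT@NTCIR‐6: (a)+(b). In invalidity search the set of relevant documents can be regarded as complete, so this is OK!

• PatentMT@NTCIR‐10: in addition to (a), compared (b) and (c). Under automatic evaluation, (b) and (c) are evaluated under the same conditions, so this is OK!

• INTENT@NTCIR‐10: ran (a), (b), (c) and (d). (b) was not used because the data are not reusable. Comparing (a) and (d) for the same team yielded a statistically significant difference

Due to improved fundamentals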


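Summary: lessons from experience in information retrieval evaluation

Steps to take in order to solve existing technical problems

1. Collect data on the current state (avoid dependence on any particular system)

2. Failure analysis and systematisation (which phenomena occur, how often, and why? Establish a template early to make post‐hoc analysis efficient)

3. Technical improvement and innovation to reduce failures

4. Statistical monitoring of technical progress (significance tests, effect sizes, confidence intervals)

5. GOTO Step 1

At the same time, preserve a certain degree of freedom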
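Step 4 names effect sizes and confidence intervals alongside significance tests. The following is a minimal sketch of those two quantities for a paired comparison on a fixed topic set. The standardised mean difference and the percentile bootstrap are illustrative choices made here, not necessarily the ones used in the cited NTCIR overviews. The function names and scores are invented.

```python
import random
from statistics import mean, stdev


def paired_effect_size(old, new):
    """Mean per-topic difference divided by the standard deviation of the differences."""
    diffs = [n - o for o, n in zip(old, new)]
    return mean(diffs) / stdev(diffs)


def bootstrap_ci(old, new, level=0.95, resamples=5000, seed=0):
    """Percentile bootstrap interval for the mean per-topic difference.

    Topics are resampled with replacement; the systems stay fixed.
    """
    rng = random.Random(seed)
    diffs = [n - o for o, n in zip(old, new)]
    means = sorted(mean(rng.choices(diffs, k=len(diffs))) for _ in range(resamples))
    lo_idx = round((1 - level) / 2 * resamples)
    hi_idx = round((1 + level) / 2 * resamples) - 1
    return means[lo_idx], means[hi_idx]


# Invented per-topic scores for illustration only (not NTCIR data).
old_scores = [0.20, 0.35, 0.50, 0.10, 0.65, 0.40, 0.30, 0.55]
new_scores = [0.24, 0.38, 0.56, 0.13, 0.70, 0.43, 0.35, 0.59]

effect = paired_effect_size(old_scores, new_scores)
low, high = bootstrap_ci(old_scores, new_scores)
print(f"effect size = {effect:.2f}, 95% CI for mean gain = [{low:.3f}, {high:.3f}]")
```

Reporting the interval alongside the test result shows how large the gain plausibly is, not only whether it is distinguishable from zero.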

References (1)

• Armstrong, T.G., Moffat, A., Webber, W. and Zobel, J.: Improvements that Don’t Add Up: Ad‐hoc Retrieval Results Since 1998, ACM CIKM 2009, pp.601‐610, 2009.

• Buckley, C. and Harman, D.: Reliable Information Access Final Workshop Report, 2004. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.105.3683&rep=rep1&type=pdf

• Harman, D. and Buckley, C.: The NRRC Reliable Information Access (RIA) Workshop, ACM SIGIR 2004, pp.528‐529, 2004.

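• Sakai, T.: NTCIR公式結果に基づく文書検索技術の進歩に関する一考察 [A study of progress in document retrieval technology based on official NTCIR results] (in Japanese), FIT 2006, LD‐007. http://ci.nii.ac.jp/lognavi?name=nels&lang=jp&type=pdf&id=ART0009459193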

References (2) – Available from http://research.nii.ac.jp/ntcir/publication1‐en.html

• Fujii, A., Iwayama, M. and Kando, N.: Overview of the Patent Retrieval Task at the NTCIR‐6 Workshop, NTCIR‐6, pp.359‐365, 2007.

• Goto, I., Chow, K.P., Lu, B., Sumita, E. and Tsou, B.K.: Overview of the Patent Machine Translation Task at the NTCIR‐10 Workshop, NTCIR‐10, pp.260‐286, 2013.

• Kishida, K., Chen, K.‐H., Lee, S., Kuriyama, K., Kando, N. and Chen, H.‐H.: Overview of CLIR Task at the Sixth NTCIR Workshop, NTCIR‐6, 2007.

• Sakai, T., Dou, Z., Yamamoto, T., Liu, Y., Zhang, M., Song, R., Kato, M.P. and Iwata, M.: Overview of the NTCIR‐10 INTENT‐2 Task, NTCIR‐10 Proceedings, pp.94‐123, 2013.
