Information Integration and Text Mining for the Life Sciences
Junichi TSUJII
University of Tokyo, Japan
University of Manchester
UK National Centre for Text Mining, UK
© 2010 Junichi Tsujii (University of Tokyo / University of Manchester), licensed under CC Attribution 2.1 Japan
• Background
• Information integration in the life sciences
• Text processing and information integration
• A research example
• Future challenges
Increase in Medline
[Figure: yearly increments (axis 0 to 600,000) and accumulation (axis 0 to 14,000,000), plotted by year]
G‐protein coupled receptor
Before 1988: 9 papers
1992: 256 papers
2005: 14,000 papers
Articles added (MEDLINE alone)
More than 0.5 million per year
More than 1.3 thousand per day
Medline Access
1997: 0.163 M accesses/month
2006: 82.027 M accesses/month
500 times more [D.L. Banville 2006]
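The two headline figures on this slide follow from simple arithmetic. A quick check in Python, using only the numbers quoted above; the 365-day year is the only assumption:

```python
# Figures quoted on the slide.
articles_per_year = 500_000      # "more than 0.5 million per year"
accesses_1997 = 0.163e6          # accesses per month in 1997
accesses_2006 = 82.027e6         # accesses per month in 2006

# Assumption: a 365-day year.
articles_per_day = articles_per_year / 365
access_growth = accesses_2006 / accesses_1997

print(f"articles per day: {articles_per_day:,.0f}")
print(f"access growth 1997 to 2006: {access_growth:.0f}x")
```

Both results are consistent with the rounded values on the slide (about 1.3 thousand per day, about 500 times more).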
NaCTeM (www.nactem.ac.uk)
• First such centre in the world
• Funding: JISC, BBSRC, EPSRC
• Consortium investment
• Chair in TM (Prof. J. Tsujii, Univ. Tokyo)
• Location: Manchester Interdisciplinary Biocentre (MIB), www.mib.ac.uk, funded by the Wellcome Trust
• Initial focus: biomedical academic community
• Extend services to industry
• Extend focus to other domains (social sciences)
Semantic Web
Tim Berners‐Lee
The Semantic Web is an extension of the current web in which information is given well‐defined meaning, better enabling computers and people to work in cooperation.
‐‐ Tim Berners‐Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001
Autonomous Processing of Meaning by Agents:
The Semantic Web will bring structure to the meaningful content of web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users.
Expressing Meanings explicitly:
Human language thrives when using the same term to mean somewhat different things, but automation does not.
Using a different URI – Universal Resource Identifier – for each specific concept solves that problem. An address that is a mailing address can be distinguished from one that is a street address, and both can be distinguished from an address that is a speech.
Concept ID
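The idea of one identifier per concept can be sketched in a few lines of Python. The URIs below are hypothetical (the example.org namespace and the sense labels are illustrations only, not identifiers from any real vocabulary):

```python
# Hypothetical concept URIs: the example.org namespace and the sense labels
# are illustrations, not identifiers from any real vocabulary.
CONCEPT_URIS = {
    ("address", "mailing"): "http://example.org/concept/mailing-address",
    ("address", "street"): "http://example.org/concept/street-address",
    ("address", "speech"): "http://example.org/concept/speech",
}

def concept_uri(term: str, sense: str) -> str:
    """Return the URI that identifies one specific sense of an ambiguous term."""
    return CONCEPT_URIS[(term.lower(), sense)]

def same_concept(uri_a: str, uri_b: str) -> bool:
    """Two mentions denote the same concept only if their URIs are identical."""
    return uri_a == uri_b

mailing = concept_uri("address", "mailing")
street = concept_uri("address", "street")
print(same_concept(mailing, street))
```

The surface word "address" is the same in all three entries; only the identifier tells a program which meaning is intended.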
Ontologies:
A program that wants to compare or combine information across the two databases has to know that these two terms are being used to mean the same thing.
The most typical kind of ontology for the Web has a taxonomy and a set of inference rules.
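A taxonomy plus an inference rule can be illustrated with a toy example. This is a minimal sketch, not any particular ontology language: the class names and the single rule are assumptions chosen for illustration.

```python
# Hypothetical toy ontology: class names and the rule are illustrative only.
IS_A = {                      # taxonomy: child class -> parent class
    "kinase": "enzyme",
    "enzyme": "protein",
    "protein": "molecule",
}

def ancestors(cls: str) -> list:
    """Walk the taxonomy upwards from cls, nearest parent first."""
    chain = []
    while cls in IS_A:
        cls = IS_A[cls]
        chain.append(cls)
    return chain

def infer_types(asserted: dict) -> dict:
    """Inference rule: if x has type T and T is_a U, then x also has type U."""
    return {entity: {cls, *ancestors(cls)} for entity, cls in asserted.items()}

# One asserted fact; the remaining types are derived by the rule.
facts = infer_types({"TAK1": "kinase"})
print(sorted(facts["TAK1"]))
```

Only "TAK1 is a kinase" is stated explicitly; that it is also an enzyme, a protein, and a molecule follows from the taxonomy.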
Information integration using entities, concepts, and ontologies
Examples from the life sciences
Normalization of entities
Surface named entities are mapped to unique IDs in an ontology
Named entity recognition + disambiguation
MEDIE: a retrieval system based on biological events
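Entity normalization can be sketched as a lexicon lookup followed by a disambiguation step. The following is a minimal illustration, not MEDIE's implementation: the lexicon, the IDs (prefixed EX:), and the context hints are all made up.

```python
import re
from typing import Optional

# Hypothetical lexicon: surface forms and IDs are made up for illustration.
LEXICON = {
    "nf-kappa b": ["EX:0001"],
    "nf-kb": ["EX:0001"],
    "stat6": ["EX:0002"],
    "cat": ["EX:0003", "EX:0004"],   # ambiguous: an enzyme or an animal
}
# Context words that favour one candidate ID over the others.
CONTEXT_HINTS = {
    "EX:0003": {"enzyme", "activity"},
    "EX:0004": {"animal", "feline"},
}

def normalize(mention: str, sentence: str) -> Optional[str]:
    """Map a surface mention to a unique ID, using sentence context if ambiguous."""
    candidates = LEXICON.get(mention.lower(), [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    # Pick the candidate whose hint words overlap most with the sentence.
    return max(candidates, key=lambda cid: len(words & CONTEXT_HINTS.get(cid, set())))

print(normalize("NF-kB", "NF-kB was activated."))
print(normalize("CAT", "CAT enzyme activity was measured."))
```

Different spellings of the same entity resolve to one ID, while an ambiguous spelling resolves to different IDs depending on context.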
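The need for text processing
Rapid growth of electronic documents and their widespread use
– From PubMed to full papers in electronic journals
– Text other than papers, such as institutional archives and electronic medical records
– Papers and supplementary data
– Raw data, metadata, and text
• Reducing the enormous manual effort of curation
• Fine-grained information integration
• Integration with non-textual structured data and experimental data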
Semantics‐based, Fine‐Grained Information Access
Document Retrieval, Information Retrieval
– Unit of retrieval: article, document
– Expression of user intention: controlled or non‐controlled keywords
– Indexes: character sequences, keywords
Question Answering
Semantics‐based, Fine‐Grained Information Access system
– Unit of retrieval: paragraphs, sentences, phrases
– Expression of user intention: simple but semantically enriched
– Indexes: semantics‐based structured meta‐data
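The contrast between the two styles of access can be made concrete with a toy index. This is a sketch under stated assumptions: the sentences, the (subject, relation, object) metadata, and the function names are hypothetical, and the metadata is assumed to come from upstream parsing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexedSentence:
    doc_id: str
    text: str
    subject: str      # semantic metadata, assumed to come from upstream parsing
    relation: str
    obj: str

# Hypothetical, hand-written index entries.
INDEX = [
    IndexedSentence("doc1", "Y activates the expression of X.", "Y", "activate", "X"),
    IndexedSentence("doc1", "X is expressed in T cells.", "X", "express", "T cells"),
    IndexedSentence("doc2", "X activates Z.", "X", "activate", "Z"),
]

def keyword_search(term):
    """Coarse-grained: return IDs of documents containing the keyword."""
    return sorted({s.doc_id for s in INDEX if term.lower() in s.text.lower()})

def semantic_search(subject=None, relation=None, obj=None):
    """Fine-grained: return sentences whose metadata matches every given slot."""
    def matches(s):
        return ((subject is None or s.subject == subject)
                and (relation is None or s.relation == relation)
                and (obj is None or s.obj == obj))
    return [s for s in INDEX if matches(s)]

print(keyword_search("X"))
print([s.text for s in semantic_search(relation="activate", obj="X")])
```

The keyword query returns whole documents that merely mention X, whereas the structured query "what activates X?" returns the single sentence that states it.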
Coarse‐grained text retrieval
Fine‐grained information access
EXAMPLE: PATHTEXT
NACTEM (U‐MANCHESTER), U‐TOKYO, SBI
B. Kemper, T. Matsuzaki, Y. Matsuoka, Y. Tsuruoka, H. Kitano, S. Ananiadou, J. Tsujii: PathText: a text mining integrator for biological pathway visualizations, Bioinformatics, Vol. 26 (12), Oxford University Press, 2010
Toll‐Like Receptor (TLR) pathway
Oda K, Matsuoka Y, Funahashi A, Kitano H: A comprehensive pathway map of epidermal growth factor receptor signaling. Mol Syst Biol 2005, 1:2005.0010.
Nodes: 652
Links: 444
600 papers were read to construct the pathway
Knowledge Integration: Pathways and Literature
Pathways integrate biological knowledge pieces into coherent interpretations
Pathways have been recognized as important means of representing biological knowledge.
Medline contains over 18 million articles
More than 0.5 million articles are being added every year, which means 1.3 thousand articles per day
Pathways
Literature
Pathways and Literature
• Pathway construction and literature
– Pathway construction mostly relies on literature
• Most important discoveries are reported in published papers.
• The full context of each discovery is described by the paper reporting it.
• Pathway maintenance and literature
– New discoveries should lead to revisions of the relevant portions of pathways.
– However, the rapidly growing amount of literature makes it extremely difficult to identify relevant new discoveries.
PathText: NaCTeM, U‐Tokyo, SBI
Network for Simulation : Quantitative Model
SBML
Pathway : Qualitative Model
Cell Designer
Literature : Piecewise Knowledge
Interpretation, Abstraction
Enrichment, Grounding
University of Tokyo; NaCTeM/University of Manchester; Systems Biology Institute/OIST
[Diagram labels: SBML network; network by CellDesigner (nodes IKK, IKK_p, TAK1); text mining resources (KLEIO, MEDIE, Info‐PubMed, FACTA); GUI; visualization; kinetic parameters; textual semantics; user semantics]
GENIA event ontology
[Figure: GENIA event ontology annotated with event counts]
• GENIA event ontology
– 30 GO terms under Biological Process
– Regulation
• Regulatory events
• Causal relationship
– Artificial process (experimental)
• Artificially performed processes
• E.g. transfection, treatment, …
– Correlation (experimental)
• Meaning ‘any’ relation between events
Events of the Shared Tasks (BioNLP 09)
Evaluation: BioNLP 2009 Shared Task Data
• BioNLP ST 2009 evaluation server (F‐score, %)
Event class    Top system at the 2009 evaluation campaign    Our current system
Simple         70.21                                         72.91
Binding        44.41                                         51.63
Regulation     40.11                                         44.00
ALL            51.95                                         55.96
24 teams joined the campaign. All other systems scored below 45.00.
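The gain over the 2009 top system can be read off these figures directly. A small sketch that computes it, together with the usual F‐score definition; it assumes the figures are F‐scores in percent, as is standard for the BioNLP'09 evaluation, and the variable names are illustrative:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Scores copied from the slide: (top 2009 system, current system).
SCORES = {
    "Simple": (70.21, 72.91),
    "Binding": (44.41, 51.63),
    "Regulation": (40.11, 44.00),
    "ALL": (51.95, 55.96),
}

# Absolute gain of the current system, in percentage points.
gains = {cls: round(cur - top, 2) for cls, (top, cur) in SCORES.items()}

for cls, gain in gains.items():
    print(f"{cls:<10} +{gain:.2f}")
```

The largest gain is on Binding events, the smallest on Simple events, which were already the easiest class.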
S1 = We found that Y activates the expression of X
S2 = We examined the effect of Y on expression of X
S3 = These results suggest that Y has no effect on expression of X
S4 = Y is known to increase expression of X
S5 = Addition of Y slightly increased the expression of X
S6 = These results suggest that Y might affect the expression of X
The same events
Nawaz, R., Thompson, P. and Ananiadou, S. (2010). Evaluating a meta‐knowledge annotation scheme for bio‐events. In: Proceedings of the Workshop on Negation and Speculation in Natural Language Processing, pp. 69‐77
Annotation Scheme
(10 July 2010, Evaluating a Meta‐Knowledge Annotation Scheme for Bio‐Events)
Bio‐Event (centred on an event trigger)
• Class / Type (grounded to an event ontology)
• Participants: Theme(s), Actor(s)
• Knowledge Type: Investigation, Observation, Analysis, General
• Manner: High, Low, Neutral
• Certainty Level: L3, L2, L1
• Polarity: Negative, Positive
• Source: Other, Current
Hyper‐Dimensions
1) New Knowledge (Yes/No)
2) Hypothesis (Yes/No)
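The dimensions of the scheme map naturally onto a small record type. The following is a sketch, not the authors' data format: the dimension values come from the slide, while the class name, field names, defaults, and the sample event label are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum

# Dimension values are taken from the slide; the class and field names,
# and all default values, are assumptions of this sketch.
KnowledgeType = Enum("KnowledgeType", "INVESTIGATION OBSERVATION ANALYSIS GENERAL")
Manner = Enum("Manner", "HIGH LOW NEUTRAL")
Certainty = Enum("Certainty", "L3 L2 L1")
Polarity = Enum("Polarity", "NEGATIVE POSITIVE")
Source = Enum("Source", "OTHER CURRENT")

@dataclass
class BioEvent:
    trigger: str                                  # word the event is centred on
    event_type: str                               # class grounded to an event ontology
    themes: list = field(default_factory=list)    # participants: Theme(s)
    actors: list = field(default_factory=list)    # participants: Actor(s)
    knowledge_type: KnowledgeType = KnowledgeType.GENERAL
    manner: Manner = Manner.NEUTRAL
    certainty: Certainty = Certainty.L3
    polarity: Polarity = Polarity.POSITIVE
    source: Source = Source.CURRENT
    new_knowledge: bool = False                   # hyper-dimension 1
    hypothesis: bool = False                      # hyper-dimension 2

# "We found that Y activates the expression of X", annotated by hand.
event = BioEvent(
    trigger="activates",
    event_type="Positive_regulation",             # illustrative label
    themes=["expression of X"],
    actors=["Y"],
    knowledge_type=KnowledgeType.OBSERVATION,
    new_knowledge=True,
)
print(event.knowledge_type.name, event.polarity.name)
```

Keeping each dimension as a separate enumerated field means two sentences about the same event can share trigger, type, and participants while differing only in their meta‐knowledge.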
Annotation Scheme
Bio‐Event (centred on an event trigger)
• Class / Type (grounded to an event ontology)
• Participants: Theme(s), Actor(s)
• Knowledge Type, with example clue words:
– Investigation: examined, investigated, studied
– Observation: found, observed, report (past tense)
– Analysis: suggest, indicate, conclude
– General
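Clue words of this kind suggest a simple lookup baseline. The sketch below assumes the three clue‐word groups on the slide correspond, in order, to Investigation, Observation, and Analysis; the exact‐token matching is a simplification of this example, not the annotation procedure itself.

```python
import re

# Clue words from the slide; how they are matched is an assumption of this sketch.
CLUES = {
    "Investigation": {"examined", "investigated", "studied"},
    "Observation": {"found", "observed", "report"},
    "Analysis": {"suggest", "indicate", "conclude"},
}

def knowledge_type(sentence: str) -> str:
    """Return the first knowledge type whose clue word occurs in the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    for label, clues in CLUES.items():
        if tokens & clues:
            return label
    return "General"   # no clue word found

for s in [
    "We examined the effect of Y on expression of X",
    "We found that Y activates the expression of X",
    "These results suggest that Y has no effect on expression of X",
    "Y is known to increase expression of X",
]:
    print(knowledge_type(s), "|", s)
```

Because matching is on exact tokens, inflected forms such as "suggests" are missed; a real system would need lemmatization and would have to handle tense, as the "(past tense)" note on the slide indicates.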
Annotation Scheme
Bio‐Event (centred on an event trigger)
• Class / Type (grounded to an event ontology)
• Participants: Theme(s), Actor(s)
• Polarity: Negative, Positive
– Example clue words for Negative: no, not, fail, lack, unable, independent, exception
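Polarity can be approximated the same way. This is a deliberately naive sketch: the clue words are those on the slide, but treating any occurrence in the sentence as negating the event ignores scope, which is an assumption made for brevity.

```python
import re

# Negation clue words from the slide.
NEGATION_CLUES = {"no", "not", "fail", "lack", "unable", "independent", "exception"}

def polarity(sentence: str) -> str:
    """Label a sentence Negative if it contains any negation clue word."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return "Negative" if any(t in NEGATION_CLUES for t in tokens) else "Positive"

print(polarity("These results suggest that Y has no effect on expression of X"))
print(polarity("We found that Y activates the expression of X"))
```

A sentence can contain several events, so a clue word elsewhere in the sentence does not necessarily negate the event of interest; the scheme attaches polarity to each event rather than to the sentence.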
BioNLP 2011 Shared Task
• Scientific Committee
Jun’ichi Tsujii (Univ. Tokyo, NaCTeM) - Chair
Sophia Ananiadou (NaCTeM, Manchester)
Kevin Cohen (Colorado)
Claire Nedellec (INRA)
Andrey Rzhetsky (Univ. Chicago)
Bruno Sobral (Virginia Bioinformatics Inst.)
Tapio Salakoski (Univ. Turku)
Toshihisa Takagi (DBCLS)
• Organizing Committee
Jin-Dong Kim (DBCLS) - Chair
Sampo Pyysalo (Univ. Tokyo) - Chair
Tomoko Ohta (Univ. Tokyo)
Robert Bossy (INRA)
Chunhong Mao (Virginia Bioinformatics Inst.)
Dan Sullivan (Virginia Bioinformatics Inst.)
Rafal Rak (NaCTeM, Manchester)
Nguyen Luu Thuy Ngan (Univ. Tokyo)
Main Task: Epigenetics and post‐translational modifications
U‐Tokyo
• Basic task setting and data following BioNLP'09 shared task format
• DNA modification and PTM events similar to '09 Phosphorylation events
• Existing retrainable systems can be applied with little modification
• New event types: DNA methylation, six PTM types, reverse reactions (e.g. deacetylation) and catalysis: 15 event types in total
• New PTM‐specific participant roles (optional subtask)
• Side chain attached to proteins in Glycosylation
• Context gene affected by histone modifications
• Annotation for PubMed abstracts relevant to these events
• No further subdomain restrictions, data selected to avoid bias
• Representative of general distribution of epigenetics and PTM‐related publications in the whole literature
Main Task: Epigenetics and post‐translational modifications
• Epigenetic control of gene expression (without changes in DNA sequence) is a major focus of recent study
• Key events: DNA methylation and histone post‐translational modifications (acetylation and methylation)
• Important roles in many biological processes, implicated in cancer
• Phosphorylation, a protein post‐translational modification (PTM), was the most reliably extracted event at the BioNLP'09 shared task
• 76% F‐score for extraction of phosphorylated protein and site
Main Task: Infectious Diseases
NaCTeM, U‐Tokyo, Virginia Tech
• Task setup and core events following BioNLP'09 Shared Task
• Expression, Catabolism, Localization, Binding, etc.
• New event type: Process
• High‐level biological processes such as “virulence” frequently discussed without stating specific participants (e.g. Theme)
• New entity types (given, NER not required)
• Chemical, Organism, Two‐component system
• New subtask (optional)
• Identification of environmental variables (Acidity and Temperature) specifying the conditions in which events are stated to occur
Normalization of events
STAT protein nuclear translocation (GO:0007262)
In the training set (800 abstracts), the exact phrase “STAT protein nuclear translocation” never occurs. However, the concept itself occurs 10 times under other wordings, including:
• nuclear translocation of STAT6
• nuclear translocation of the latent transcription factor, STAT6
• nuclear translocation of STAT6
• translocation into nucleus of signal transducers and activators of transcription (STAT)
• STAT5A and STAT5B containing complexes . . . these complexes rapidly translocated (within 1 min) into the nucleus
• STAT5B containing complexes . . . these complexes rapidly translocated (within 1 min) into the nucleus
• STAT1 nuclear import
• nuclear import of NF‐kappa B, AP‐1, NFAT, and STAT1
• STAT1 in Jurkat T lymphocytes is significantly inhibited by a cell‐permeable peptide carrying the NLS of the NF‐kappa B p50 subunit. NLS peptide‐mediated disruption of the nuclear import ...
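The difficulty is that none of these wordings matches the ontology term as a string. A rough sketch of pattern‐based normalization follows; the three patterns and the co‐occurrence rule are assumptions of this example and far cruder than a parser‐based approach:

```python
import re

GO_ID = "GO:0007262"  # STAT protein nuclear translocation (from the slide)

# Hypothetical patterns: a STAT protein, a movement word, and the nucleus.
STAT_ABBR = re.compile(r"\bSTAT\d*[AB]?\b")   # case-sensitive: STAT, STAT6, STAT5A
STAT_FULL = re.compile(r"signal transducers? and activators? of transcription", re.I)
MOVE = re.compile(r"\btransloca\w*|\bimport\b", re.I)
PLACE = re.compile(r"\bnucle(?:ar|us)\b", re.I)

def normalize_event(text: str):
    """Return GO_ID if the text mentions a STAT protein, a movement, and the nucleus."""
    has_stat = STAT_ABBR.search(text) or STAT_FULL.search(text)
    if has_stat and MOVE.search(text) and PLACE.search(text):
        return GO_ID
    return None

for phrase in [
    "nuclear translocation of STAT6",
    "STAT1 nuclear import",
    "these STAT5B complexes rapidly translocated into the nucleus",
    "nuclear translocation of NF-kappa B",
]:
    print(normalize_event(phrase), "|", phrase)
```

Mere co‐occurrence does not check that the STAT protein is the thing being translocated, so this rule would over‐generate on longer sentences; the listed examples show why syntactic analysis is needed to link the participant to the event.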
MEDLINE → MEDIE Workflow
• Input: all abstracts in PubMed
10,630,000 abstracts
19,950,000 bibliographic units
• Processing: POS tagging, NER, deep parsing, event recognition, indexing for MEDIE (GCL)
• Complex workflow
More than 10 modules
• Computing environment
Grid with 300‐1000 processors
[Diagram: MEDLINE → MEDIE workflow, showing processing modules, intermediate processing results, and index files]
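The shape of such a workflow can be sketched as a chain of stages applied to each abstract. The stage names follow the slide, but the stages themselves are placeholders that only record that they ran; none of this reflects the actual MEDIE modules or their interfaces.

```python
from typing import Callable

Doc = dict  # one abstract plus its accumulated annotations

def make_stage(name: str) -> Callable[[Doc], Doc]:
    """Build a placeholder stage that only records that it ran."""
    def stage(doc: Doc) -> Doc:
        return {**doc, "annotations": doc.get("annotations", []) + [name]}
    return stage

# Stage names follow the slide; the implementations are stand-ins.
STAGE_NAMES = ("pos_tagging", "ner", "deep_parsing", "event_recognition", "indexing")
PIPELINE = [make_stage(name) for name in STAGE_NAMES]

def run_pipeline(doc: Doc) -> Doc:
    """Apply every stage in order; each stage sees the previous stage's output."""
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

def process_all(abstracts: list) -> list:
    # Each abstract is processed independently of the others, so this loop
    # is the part that could be distributed across many processors.
    return [run_pipeline({"text": text}) for text in abstracts]

results = process_all(["Y activates the expression of X.", "X activates Z."])
print(results[0]["annotations"])
```

Because stages depend on each other but abstracts do not, the natural unit of parallelism is the abstract, which fits a grid of several hundred processors working through millions of abstracts.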
Thank you!
[U‐Tokyo] Yusuke Miyao (NII), Takuya Matsuzaki, Tomoko Ohta, Jin‐Dong Kim (DBCLS), Rune Saetre, Yoshinobu Kano, Naoaki Okazaki, Makoto Miwa, Sampo Pyysalo, Tadayoshi Hara, Yue Wang
[NaCTeM] Sophia Ananiadou, John McNaught, William Black, Balakrishna Kolluru, Tingting Mu, Chikashi Nobata, Rafal Rak, Angel Restificar, C.J. Rupp, Paul Thompson, Xinglong Wang, Raheel Nawaz