7/27/2019 Thang
http://slidepdf.com/reader/full/thang 1/25
Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in
Bioinformatics
Tat Thang
Parallel and Distributed Computing Centre,
School of Computer Engineering, NTU, Singapore
Michael Li
Semantic Technology Group,
Institute for Infocomm Research (I2R), A-Star, Singapore
11th Feb 2011
Overview
• Motivation
• Problem Definition
• Objective
• Proposed Architecture
• A case study in Bio-informatics
• Demo
• Future works
• Summary
Motivation
• Deluge of biological data
• Biomedical data is available on heterogeneous
databases
• Data: structured and semi/un-structured
formats
• Demand for fast, large-scale and cost-effective computing strategies
Problem Definition
• Data
– PubMed contains 20+ million abstracts
– UniProt contains 13.5+ million records
• Case study on antiviral proteins
– Over 70,000 citations in PubMed
– Over 14,000 proteins in UniProt
• Integration and Analysis
Related Works
• Using NLP to link documents to existing ontologies (e.g. GoPubMed, Textpresso)
– No querying & reasoning
– Not scalable
• RDF/OWL based integration tools (e.g. TopBraid Suite)
– No NLP
– Not bio specific. Also not biologist friendly
• Cloud-based bio data mining works (e.g. Kudtarkar P, 2010)
– Still in early stages
– Challenging to perform semantic integration on cloud
Objective
To provide a framework that enables
• Better data infrastructure
– Scalability
– Management of heterogeneity
– Cost-effectiveness
• Better data analytics
– Integrative data mining
– Visual query interface
Proposed Framework
Our Approach
Data Infrastructure Module Data Analytics Module
Our Approach
Diagram: the Data Infrastructure module and the Data Analytics module, with components Biomedical sources, Web Crawler, Parser, Cloud-based data store, Knowle Population Service, Ontology, Query & Reasoner, User Interface
Our Approach
• Data Infrastructure Module
– Cloud based: Amazon EC2, Hadoop, Microsoft Azure
– Parallel processing: MapReduce
– Distributed storage: BigTable, HBase, HDFS
• Data Analytics Module
– Non-semantic: database driven
– Semantic: ontology driven (Knowle, AllegroGraph, TopBraid)
Data Infrastructure Module (Hadoop)
• Software framework for data-intensive and
distributed applications
• Hadoop Distributed File System (HDFS) provides a distributed,
scalable, and portable file system that supports large data sets
• Hadoop MapReduce allows programs to process large amounts of
data in parallel
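
As a rough illustration of the MapReduce programming model, the sketch below runs the map, shuffle and reduce phases inside a single Python process. It does not use Hadoop itself, and the function names (map_phase, shuffle, reduce_phase, word_count) and the two sample abstracts are hypothetical, chosen only for this example.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word.strip(".,;:()"), 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)

def word_count(docs):
    """Run map, shuffle and reduce sequentially over a dict of documents."""
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample input, not taken from PubMed.
docs = {
    "a1": "Interferon induces antiviral proteins.",
    "a2": "Antiviral proteins inhibit viral replication.",
}
counts = word_count(docs)
```

On a real cluster the map calls would run on different nodes and the shuffle would move data over the network; only the shape of the computation is shown here.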
Cloud Based Data Store
Hadoop Distributed File System
Name node
- Metadata (in memory)
- Data nodes
- Data blocks
- Node attributes
- Name of files
- Mapping of block-node
Secondary name node
Data nodes (five shown in the diagram)
- Store file contents
- Each file is chunked into blocks
- Each block is spread across the data nodes
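
The chunking and block-to-node mapping described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not HDFS code: the round-robin placement, the replication parameter, the tiny block size, and the file and node names are all hypothetical and chosen for readability.

```python
def split_into_blocks(data, block_size):
    """Chunk a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(file_name, data, data_nodes, block_size, replication=1):
    """Assign each block to data nodes round-robin.

    Returns (blocks, block_map); block_map stands in for the name node's
    in-memory mapping from block id to the data nodes holding that block.
    """
    blocks = split_into_blocks(data, block_size)
    block_map = {}
    for index in range(len(blocks)):
        block_id = f"{file_name}#{index}"
        block_map[block_id] = [
            data_nodes[(index + r) % len(data_nodes)] for r in range(replication)
        ]
    return blocks, block_map

# Hypothetical node names and a toy 10-byte "file" with 4-byte blocks.
nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks, block_map = place_blocks("sample.dat", b"abcdefghij", nodes,
                                 block_size=4, replication=2)
```

The data nodes hold the block contents, while the name node only needs the small block_map structure, which is why it can be kept in memory.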
Data Analytics Module (Knowle)
• Semantic Technology Toolkit
• Knowle services used in the Data Analytics Module
– Data/Text mining
– Ontology Population
– Ontology Query
– Visual Ontology Query
Developed at the Institute for Infocomm Research, Singapore
Our Approach
Diagram: the Data Infrastructure module and the Data Analytics module, with components Biomedical data sources, Web Crawler, Parser, Cloud-based data store, Knowle Population Service, Ontology, Query & Reasoner, User Interface
Web Crawler
Diagram: bio-medical data sources (UniProt, PubMed), UniProt Crawler, PubMed Crawler, cloud-based data store
Parser
Diagram: cloud-based data store (crawled UniProt data, crawled PubMed data), UniProt Parser, PubMed Parser, Knowle Ontology Population Service
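
The slides do not say which export format the crawlers store, so the following is only a sketch of what a UniProt parser could look like, assuming the UniProtKB flat-file text format (two-letter line codes such as ID, AC and DE, with "//" closing each record). The function name parse_uniprot_flat and the sample record, including its accession numbers, are made up for illustration.

```python
def parse_uniprot_flat(text):
    """Extract entry name, accessions and description from flat-file text.

    Only the ID, AC and DE line types are handled; all others are ignored.
    """
    records = []
    current = {"accessions": [], "description": []}
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            if current.get("id"):
                current["description"] = " ".join(current["description"])
                records.append(current)
            current = {"accessions": [], "description": []}
            continue
        code, value = line[:2], line[5:].strip()
        if code == "ID":
            current["id"] = value.split()[0]
        elif code == "AC":
            current["accessions"] += [a.strip() for a in value.split(";") if a.strip()]
        elif code == "DE":
            current["description"].append(value)
    return records

# Fabricated sample record; not a real UniProt entry.
sample = "\n".join([
    "ID   EXAMPLE_HUMAN           Reviewed;         100 AA.",
    "AC   X00001; X00002;",
    "DE   RecName: Full=Hypothetical antiviral protein;",
    "//",
])
records = parse_uniprot_flat(sample)
```

Each returned dict is the kind of parsed record that could then be handed to an ontology population step.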
Ontology
Protein Ontology
Protein + Literature Ontology
Ontology Populator
Inputs: parsed UniProt data, parsed PubMed data
Knowle Ontology Population Service: populate concepts, assert DatatypeProperties, assert ObjectProperties
Knowle Text Mining Service: entity detection, relation extraction
Output: ontology triplestore (Protein + Literature ontology)
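
To make the population steps concrete, here is a minimal sketch that turns parsed records into subject-predicate-object triples. It is not the Knowle service: the "ex:" names (ex:Protein, ex:Citation, ex:name, ex:title, ex:mentionedIn) are hypothetical, the sample records are invented, and the entity detection is a naive substring match standing in for real text mining.

```python
def populate(proteins, citations):
    """Build a set of (subject, predicate, object) triples from parsed records."""
    triples = set()
    for p in proteins:
        subj = f"ex:protein/{p['accession']}"
        triples.add((subj, "rdf:type", "ex:Protein"))   # populate concept
        triples.add((subj, "ex:name", p["name"]))       # datatype property (literal)
    for c in citations:
        subj = f"ex:citation/{c['pmid']}"
        triples.add((subj, "rdf:type", "ex:Citation"))  # populate concept
        triples.add((subj, "ex:title", c["title"]))     # datatype property (literal)
        for p in proteins:
            # Naive entity detection: case-insensitive name match in the abstract.
            if p["name"].lower() in c["abstract"].lower():
                # Object property linking two individuals.
                triples.add((f"ex:protein/{p['accession']}", "ex:mentionedIn", subj))
    return triples

# Invented sample records, for illustration only.
proteins = [{"accession": "X00001", "name": "ExampleProtein"}]
citations = [
    {"pmid": "1", "title": "First title", "abstract": "We study exampleprotein in cells."},
    {"pmid": "2", "title": "Second title", "abstract": "An unrelated abstract."},
]
triples = populate(proteins, citations)
```

Datatype properties attach literal values to an individual, while object properties connect two individuals; the second kind is what links the protein and literature parts of the ontology.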
Query & Reasoner
Diagram components: User Interface, Knowle Query Service, Sesame, SAIL, OWLIM Reasoner, Ontology Triplestore
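
What querying with reasoning adds over plain lookup can be shown with a toy example. The sketch below is not Sesame or OWLIM code: it forward-chains a single RDFS-style subclass rule over an in-memory set of triples and then answers a triple-pattern query. The function names and the "ex:" identifiers are hypothetical.

```python
def infer_types(triples):
    """Apply one rule until nothing new appears:
    (x rdf:type C) and (C rdfs:subClassOf D)  =>  (x rdf:type D)."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        for s, p, o in list(inferred):
            if p != "rdf:type":
                continue
            for child, parent in subclass_pairs:
                if o == child and (s, "rdf:type", parent) not in inferred:
                    inferred.add((s, "rdf:type", parent))
                    changed = True
    return inferred

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Hypothetical facts, for illustration only.
facts = {
    ("ex:AntiviralProtein", "rdfs:subClassOf", "ex:Protein"),
    ("ex:Protein", "rdfs:subClassOf", "ex:Molecule"),
    ("ex:p1", "rdf:type", "ex:AntiviralProtein"),
}
without_reasoning = match(facts, p="rdf:type", o="ex:Molecule")
with_reasoning = match(infer_types(facts), p="rdf:type", o="ex:Molecule")
```

Without inference the query for instances of ex:Molecule finds nothing; after inference it finds ex:p1, because the subclass chain has been followed.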
User Interface
Diagram components: Ontology Triplestore, Knowle Population Service, Search, Web Crawler, Parser, KnowleGator, Ontology, Visual Query, Visual Query Translator, Ontology Query & Reasoner
A case study in Bio-informatics
• Integration, cross-querying from PubMed and UniProt
• Data
– 70,054 citations from PubMed
– 14,527 proteins in UniProt
• Infrastructure (virtual computers)
– 4 data nodes (RAM: 1 GB, CPU: Intel Xeon 2.4 GHz)
– 2 master nodes (1 name node, 1 secondary name node) (RAM: 512 MB, CPU: Intel Xeon 2.4 GHz)
– 1 virtual CPU = Intel Xeon 2.4 GHz
Demo
• Data
– UniProt: 853 antiviral protein entries
– PubMed: 2,000 citations
Demo Snapshot
Summary
• We proposed a new framework
– Data infrastructure module (cloud-based infrastructure)
– Data analytics module (semantic technologies)
• We tested on a prototype
– Using our own infrastructure
– With integration, cross-querying from PubMed
and UniProt
Future works
• Integrated user interface
• Explore other cloud-based data store: HBase,
BigTable
• Apply map-reduce concept on data analytics
and crawling
• Integrate Knowle into cloud-based
environment