Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics. Tat Thang (Parallel and Distributed Computing Centre, School of Computer Engineering, NTU, Singapore); Michael Li (Semantic Technology Group, Institute for Infocomm Research (I2R), A-Star, Singapore). 11th Feb 2011


Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics

Tat Thang

Parallel and Distributed Computing Centre,

School of Computer Engineering, NTU, Singapore

Michael Li

Semantic Technology Group,

Institute for Infocomm Research (I2R), A-Star, Singapore

11th Feb 2011


Overview

• Motivation

• Problem Definition

• Objective

• Proposed Architecture

• A case study in Bio-informatics

• Demo

• Future works

• Summary


Motivation

• Deluge of biological data

• Biomedical data is spread across heterogeneous databases

• Data come in structured, semi-structured, and unstructured formats

Demand for fast, large-scale and cost-effective computing strategies


Problem Definition

• Data

 – PubMed contains 20+ million abstracts

 – UniProt contains 13.5+ million records

• Case study on antiviral proteins

 – Over 70,000 citations in PubMed

 – Over 14,000 proteins in UniProt

• Integration and Analysis


Related Works

• Using NLP to link documents to existing ontologies (e.g. GoPubMed, Textpresso)

 – No querying & reasoning

 – Not scalable

• RDF/OWL based integration tools (e.g. TopBraid Suite)

 – No NLP

 – Not bio specific. Also not biologist friendly

• Cloud-based bio data mining works (e.g. Kudtarkar P 2010)

 – Still in early stages

 – Challenging to perform semantic integration on cloud


Objective

To provide a framework that enables

• Better data infrastructure

 – Scalability

 – Management of heterogeneity

 – Cost-effectiveness

• Better data analytics

 – Integrative data mining

 – Visual query interface


Proposed Framework

Our Approach

[Figure: the framework is split into a Data Infrastructure Module and a Data Analytics Module]


Our Approach

[Figure: architecture diagram spanning the Data Infrastructure module and the Data Analytics module. Components shown: Biomedical sources, Web Crawler, Parser, Cloud-based data store, Knowle Population Service, Ontology, Query & Reasoner, User Interface]


Our Approach

• Data Infrastructure Module

 – Cloud based: Amazon EC2, Hadoop, Microsoft Azure

 – Parallel processing: MapReduce

 – Distributed Storage: BigTable, HBase, HDFS

• Data Analytics Module

 – Non-semantic: database driven

 – Semantic: ontology driven (Knowle, AllegroGraph, TopBraid)


Data Infrastructure Module (Hadoop)

• Software framework for data-intensive, distributed applications

• The Hadoop Distributed File System (HDFS) provides a distributed, scalable, and portable file system that supports large data sets

• Hadoop MapReduce allows programs to run in parallel over large amounts of data
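The MapReduce model named above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample documents are assumptions made for illustration. It counts term occurrences across a few made-up abstracts, the kind of job a Hadoop cluster would spread over its data nodes.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit a (term, 1) pair for every whitespace-separated token in one document."""
    for token in text.lower().split():
        yield token, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single count."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle, and reduce sequentially in one process."""
    pairs = [pair
             for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


# Hypothetical input; these strings are not taken from PubMed.
sample_docs = {
    "doc1": "antiviral protein inhibits virus",
    "doc2": "protein binds antiviral protein",
}
term_counts = run_job(sample_docs)
```

Because each call to `map_phase` touches only one document and each call to `reduce_phase` touches only one key, both phases can in principle be run on separate machines, which is what makes the model suitable for large data sets.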


Cloud Based Data Store

Hadoop Distributed File System

Name node

- Metadata (in memory)

- Data nodes

- Data blocks

- Node attributes

- Names of files

- Mapping of blocks to nodes

Secondary name node

Data nodes (five shown in the diagram)

- Store file contents

- Each file is chunked into blocks

- Each block is spread across data nodes
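The division of labour described above, where the name node holds only metadata and the data nodes hold the blocks, can be sketched as a toy model in Python. The class `ToyNameNode`, the round-robin placement, and the tiny block size are assumptions for illustration; real HDFS uses far larger blocks and also replicates each block, which this sketch omits.

```python
from itertools import cycle


def split_into_blocks(data: bytes, block_size: int):
    """Chunk a file's contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Tracks which data node holds which block; block contents live elsewhere."""

    def __init__(self, data_nodes):
        self.data_nodes = list(data_nodes)
        self.block_map = {}  # file name -> [(block_id, node_name), ...]
        # Stand-in for the data nodes' local disks: node name -> {block_id: bytes}
        self.storage = {name: {} for name in self.data_nodes}
        self._next_node = cycle(self.data_nodes)

    def put(self, file_name, data, block_size=4):
        """Split the file and spread its blocks over the data nodes round-robin."""
        placements = []
        for index, block in enumerate(split_into_blocks(data, block_size)):
            node = next(self._next_node)
            block_id = f"{file_name}#{index}"
            self.storage[node][block_id] = block
            placements.append((block_id, node))
        self.block_map[file_name] = placements

    def get(self, file_name):
        """Reassemble a file by following the block-to-node mapping in order."""
        return b"".join(self.storage[node][block_id]
                        for block_id, node in self.block_map[file_name])


# Hypothetical file name and contents.
name_node = ToyNameNode(["dn1", "dn2", "dn3"])
name_node.put("uniprot.txt", b"ABCDEFGHIJ", block_size=4)
restored = name_node.get("uniprot.txt")
```

Here the ten-byte file becomes three blocks placed on `dn1`, `dn2`, and `dn3`, and `get` recovers the original bytes purely from the name node's mapping.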


Data Analytics Module (Knowle)

• Semantic Technology Toolkit

• Knowle services used in Data Analytics Module

 – Data/Text mining

 – Ontology Population

 – Ontology Query

 – Visual Ontology Query

Developed at the Institute for Infocomm Research, Singapore


Our Approach

[Figure: architecture diagram spanning the Data Infrastructure module and the Data Analytics module. Components shown: Biomedical data sources, Web Crawler, Parser, Cloud-based data store, Knowle Population Service, Ontology, Query & Reasoner, User Interface]


Web Crawler

[Figure: a UniProt Crawler and a PubMed Crawler sit between the bio-medical data sources (UniProt, PubMed) and the cloud-based data store]


Parser

[Figure: a UniProt Parser and a PubMed Parser sit between the crawled UniProt and PubMed data in the cloud-based data store and the Knowle Ontology Population Service]
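As an illustration of the parsing step, here is a minimal Python sketch. It assumes the crawled UniProt data is stored in the UniProtKB flat-file text format, where each line begins with a two-letter line code such as `ID`, `AC`, or `DE`; the slides do not say which format was actually used. The function name `parse_uniprot_flat` and the sample record are hypothetical.

```python
def parse_uniprot_flat(record: str) -> dict:
    """Extract entry name, accessions, and description from one flat-file record."""
    fields = {}
    for line in record.splitlines():
        if line.startswith("//"):  # "//" marks the end of a record
            break
        # Line code, then three spaces, then the line's content.
        code, _, value = line.partition("   ")
        if code and value:
            fields.setdefault(code, []).append(value.strip())

    accessions = [acc.strip()
                  for value in fields.get("AC", [])
                  for acc in value.split(";")
                  if acc.strip()]
    return {
        "entry_name": fields["ID"][0].split()[0] if "ID" in fields else None,
        "accessions": accessions,
        "description": " ".join(fields.get("DE", [])),
    }


# Made-up record for illustration; not a real UniProt entry.
SAMPLE = """\
ID   EXAMPLE_HUMAN           Reviewed;         100 AA.
AC   X00001; X00002;
DE   RecName: Full=Hypothetical antiviral protein;
//
"""
parsed = parse_uniprot_flat(SAMPLE)
```

The resulting dictionary is the kind of structured output that a population service could then turn into ontology individuals and property assertions.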


Ontology

[Figure: two ontology diagrams, labelled "Protein Ontology" and "Protein + Literature Ontology"]


Ontology Populator

[Figure: elements shown are parsed UniProt data, parsed PubMed data, the Protein + Literature ontology, the ontology triplestore, and two Knowle services]

• Knowle Ontology Population Service

 – Populate concepts

 – Assert DatatypeProperties

 – Assert ObjectProperties

• Knowle Text Mining Service

 – Entity detection

 – Relation extraction
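The three population steps named on this slide (populate concepts, assert datatype properties, assert object properties) can be sketched with plain Python tuples standing in for a triplestore. Everything here is an assumption for illustration: the `ToyTripleStore` class, the `ex:` names such as `ex:Protein` and `ex:mentionedIn`, and the sample data. It does not reproduce the Knowle service's actual interface.

```python
RDF_TYPE = "rdf:type"


class ToyTripleStore:
    """In-memory set of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def populate_concept(self, individual, concept):
        """Declare an individual as an instance of an ontology concept."""
        self.triples.add((individual, RDF_TYPE, concept))

    def assert_datatype_property(self, individual, prop, value):
        """Attach a literal value; literals are quoted to tell them apart from names."""
        self.triples.add((individual, prop, f'"{value}"'))

    def assert_object_property(self, subject, prop, obj):
        """Link two individuals."""
        self.triples.add((subject, prop, obj))


def populate(store, proteins, citations, mentions):
    """Load parsed proteins and citations, then link them via extracted mentions."""
    for accession, name in proteins.items():
        node = f"ex:protein/{accession}"
        store.populate_concept(node, "ex:Protein")
        store.assert_datatype_property(node, "ex:name", name)
    for pmid, title in citations.items():
        node = f"ex:citation/{pmid}"
        store.populate_concept(node, "ex:Citation")
        store.assert_datatype_property(node, "ex:title", title)
    # Each mention pair stands in for the output of relation extraction.
    for accession, pmid in mentions:
        store.assert_object_property(
            f"ex:protein/{accession}", "ex:mentionedIn", f"ex:citation/{pmid}")


# Made-up identifiers and text.
store = ToyTripleStore()
populate(
    store,
    proteins={"X00001": "Hypothetical antiviral protein"},
    citations={"0000001": "A made-up citation title"},
    mentions=[("X00001", "0000001")],
)
```

The object-property triple is what joins the two sources: once a protein individual points at a citation individual, a single query can cross from UniProt-derived data to PubMed-derived data.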


Query & Reasoner

[Figure: components shown are the Ontology Triplestore, OWLIM Reasoner, SAIL, Sesame, Knowle Query Service, and User Interface]
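To show what a reasoner adds on top of plain querying, the following Python sketch applies two RDFS-style rules (subclass transitivity and type inheritance) until no new triples appear, then answers a pattern query. It does not use OWLIM or Sesame; the functions `infer_types` and `match`, and the `ex:` names, are assumptions for illustration only.

```python
def infer_types(triples):
    """Naive forward chaining over two rules until a fixpoint is reached.

    Rule 1: (A subClassOf B), (B subClassOf C)  =>  (A subClassOf C)
    Rule 2: (x type A),       (A subClassOf B)  =>  (x type B)
    """
    inferred = set(triples)
    while True:
        subclass_edges = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        new = set()
        for s, p, o in inferred:
            if p in ("rdfs:subClassOf", "rdf:type"):
                for child, parent in subclass_edges:
                    if child == o:
                        new.add((s, p, parent))
        if new <= inferred:  # nothing new was derived
            return inferred
        inferred |= new


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))


# Made-up ontology fragment and individual.
asserted = {
    ("ex:AntiviralProtein", "rdfs:subClassOf", "ex:Protein"),
    ("ex:Protein", "rdfs:subClassOf", "ex:Biomolecule"),
    ("ex:p1", "rdf:type", "ex:AntiviralProtein"),
}
closure = infer_types(asserted)
proteins_without_reasoning = match(asserted, p="rdf:type", o="ex:Protein")
proteins_with_reasoning = match(closure, p="rdf:type", o="ex:Protein")
```

Without inference the query for instances of `ex:Protein` finds nothing, because `ex:p1` was only asserted to be an `ex:AntiviralProtein`; after inference the same query returns it.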


User Interface

[Figure: components shown are the Ontology Triplestore, Knowle Population Service, Search, Web Crawler, Parser, KnowleGator, Ontology Visual Query, Visual Query Translator, and Ontology Query & Reasoner]


A case study in Bio-informatics

• Integration and cross-querying of PubMed and UniProt

• Data

 – 70,054 citations from PubMed

 – 14,527 proteins in UniProt

• Infrastructure (virtual computers)

 – 4 data nodes (RAM: 1 GB, CPU: Intel Xeon 2.4 GHz)

 – 2 master nodes (1 name node, 1 secondary name node) (RAM: 512 MB, CPU: Intel Xeon 2.4 GHz)

 – 1 virtual CPU = Intel Xeon 2.4 GHz


Demo

• Data

 – UniProt: 853 antiviral protein entries

 – PubMed: 2,000 citations


Demo Snapshot


Summary

• We proposed a new framework

 – Data infrastructure module (cloud-based infrastructure)

 – Data analytics module (semantic technologies)

• We tested it on a prototype

 – Using our own infrastructure

 – With integration and cross-querying of PubMed and UniProt


Future works

• Integrated user interface

• Explore other cloud-based data stores: HBase, BigTable

• Apply the MapReduce concept to data analytics and crawling

• Integrate Knowle into the cloud-based environment
