Thang

7/27/2019 Thang

http://slidepdf.com/reader/full/thang 1/25

Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in Bioinformatics

Tat Thang

Parallel and Distributed Computing Centre,

School of Computer Engineering, NTU, Singapore

Michael Li

Semantic Technology Group,

Institute for Infocomm Research (I2R), A-Star, Singapore

11th Feb 2011



Overview

• Motivation

• Problem Definition

• Objective

• Proposed Architecture

• A case study in Bio-informatics

• Demo

• Future works

• Summary



Motivation

• Deluge of biological data

• Biomedical data is available on heterogeneous databases

• Data: structured and semi/un-structured formats

• Demand for fast, large-scale and cost-effective computing strategies



Problem Definition

• Data

– PubMed contains 20+ million abstracts

– UniProt contains 13.5+ million records

• Case study on antiviral proteins

– Over 70,000 citations in PubMed

– Over 14,000 proteins in UniProt

• Integration and Analysis



Related Works

• Using NLP to link documents to existing ontologies (e.g. GoPubMed, Textpresso)

– No querying & reasoning

– Not scalable

• RDF/OWL based integration tools (e.g. TopBraid Suite)

– No NLP

– Not bio specific. Also not biologist friendly

• Cloud-based bio data mining works (e.g. Kudtarkar P 2010)

– Still in early stages

– Challenging to perform semantic integration on cloud



Objective

To provide a framework that enables

• Better data infrastructure

– Scalability

– Management of heterogeneity

– Cost-effectiveness

• Better data analytics

– Integrative data mining

– Visual query interface



Proposed Framework

Our Approach

Data Infrastructure Module Data Analytics Module



Our Approach

[Figure: Our Approach. Data Infrastructure module and Data Analytics module, with components: Biomedical sources, Web Crawler, Cloud-based data store, Parser, Knowle Population Service, Ontology, Query & Reasoner, User Interface]



Our Approach

• Data Infrastructure Module

– Cloud based: Amazon EC2, Hadoop, Microsoft Azure

– Parallel processing: MapReduce

– Distributed Storage: BigTable, HBase, HDFS

• Data Analytics Module

– Non-semantic: database driven

– Semantic: ontology driven (Knowle, AllegroGraph, TopBraid)



Data Infrastructure Module (Hadoop)

• Software framework for data-intensive and distributed applications

• Hadoop Distributed File System (HDFS) provides a distributed, scalable, and portable file system that supports large data sets

• Hadoop MapReduce allows programs to process large amounts of data in parallel
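The map, shuffle and reduce steps behind Hadoop MapReduce can be illustrated in a few lines of plain Python. This is only a single-process sketch of the programming model, not Hadoop code: the function names and the two sample documents are placeholders invented for illustration.

```python
from collections import defaultdict


def mapper(doc_id, text):
    """Map step: emit one (word, 1) pair per token in a document."""
    for token in text.lower().split():
        word = token.strip(".,;:()")
        if word:
            yield word, 1


def shuffle(pairs):
    """Group mapper output by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(word, counts):
    """Reduce step: sum the partial counts collected for one key."""
    return word, sum(counts)


def run_job(documents):
    """Run map, shuffle and reduce in sequence over all documents."""
    pairs = (pair for doc_id, text in documents.items()
             for pair in mapper(doc_id, text))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


# Placeholder inputs, not real abstracts.
DOCS = {
    "doc1": "Antiviral protein inhibits viral replication.",
    "doc2": "The protein binds viral RNA.",
}
counts = run_job(DOCS)
print(counts["protein"], counts["viral"])
```

On a real cluster the mapper and reducer run on many nodes at once and the shuffle moves data across the network; the structure of the job stays the same.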



Cloud Based Data Store

Hadoop Distributed File System

[Figure: HDFS layout with one name node, one secondary name node, and five data nodes]

Name node

- Meta data (in memory)

- Data nodes

- Data blocks

- Node attributes

- Name of files

- Mapping of block-node

Secondary name node

Data node

- Stores file contents

- File is chunked to blocks

- Each block is spread to data nodes
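The division of labour between the name node (metadata only) and the data nodes (block contents) can be sketched with a toy in-memory model. Everything here is an assumption made for illustration: the class names, the 8-byte block size and the round-robin placement are not taken from HDFS, and real HDFS additionally replicates each block, which this sketch omits.

```python
from itertools import cycle


def split_into_blocks(data, block_size):
    """Chunk a byte string into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Holds metadata only: file name -> ordered list of (block_id, node)."""

    def __init__(self, data_nodes):
        self.block_map = {}
        self._next_node = cycle(list(data_nodes))  # round-robin placement (assumption)

    def allocate(self, filename, n_blocks):
        placement = [(f"{filename}#blk{i}", next(self._next_node))
                     for i in range(n_blocks)]
        self.block_map[filename] = placement
        return placement


class ToyCluster:
    """Data nodes store block contents; the name node only records where."""

    def __init__(self, node_names, block_size):
        self.block_size = block_size
        self.name_node = ToyNameNode(node_names)
        self.storage = {name: {} for name in node_names}

    def put(self, filename, data):
        blocks = split_into_blocks(data, self.block_size)
        placement = self.name_node.allocate(filename, len(blocks))
        for (block_id, node), block in zip(placement, blocks):
            self.storage[node][block_id] = block

    def get(self, filename):
        placement = self.name_node.block_map[filename]
        return b"".join(self.storage[node][block_id]
                        for block_id, node in placement)


cluster = ToyCluster(["dn1", "dn2", "dn3", "dn4"], block_size=8)
payload = b"ID   EXAMPLE_ENTRY; placeholder bytes"
cluster.put("uniprot.dat", payload)
print(cluster.name_node.block_map["uniprot.dat"])
```

Reading a file back consults the name node for the block-to-node mapping first and only then touches the data nodes, which is why the mapping is kept in memory.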



Data Analytics Module (Knowle)

• Semantic Technology Toolkit

• Knowle services used in Data Analytics Module

– Data/Text mining

– Ontology Population

– Ontology Query

– Visual Ontology Query

Developed at the Institute for Infocomm Research, Singapore



Our Approach

[Figure: Our Approach. Data Infrastructure module and Data Analytics module, with components: Biomedical data sources, Web Crawler, Parser, Knowle Population Service, Ontology, Query & Reasoner, User Interface]



Web Crawler

[Figure: Web Crawler. Bio-medical data sources (UniProt, PubMed), UniProt Crawler, PubMed Crawler, Cloud-based data store]
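The slides do not say how the crawlers fetch records. As one possible approach, a PubMed crawler could page through search results using the public NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URLs and makes no network calls; the search term, the page size of 500 and the function name are assumptions.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_page_urls(term, total, page_size=500):
    """Build one search URL per page of results, without fetching anything."""
    urls = []
    for start in range(0, total, page_size):
        params = {
            "db": "pubmed",
            "term": term,
            "retstart": start,     # offset of the first result on this page
            "retmax": page_size,   # number of results per page
        }
        urls.append(f"{ESEARCH}?{urlencode(params)}")
    return urls


# 70,054 is the citation count quoted in the case study; the term is assumed.
urls = esearch_page_urls("antiviral protein", total=70054)
print(len(urls))
print(urls[0])
```

Because each page is an independent request, the list of URLs is a natural unit of work to distribute across crawler processes.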



Parser

[Figure: Parser. Crawled UniProt data and crawled PubMed data, UniProt Parser, PubMed Parser, Knowle Ontology Population Service]
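A parser for crawled UniProt data might look like the following minimal sketch. It assumes the crawled files use the UniProt flat-file layout (a two-letter line code, then the content, with `//` closing each entry) and extracts only a few fields. The sample entries are invented placeholders, not real UniProt records, and the dictionary keys are hypothetical.

```python
import re


def new_entry():
    """Empty container for the fields collected from one entry."""
    return {"name": None, "accessions": [], "description": [], "pubmed_ids": []}


def parse_entries(text):
    """Yield one dict per entry of a UniProt-style flat file."""
    entry = new_entry()
    for line in text.splitlines():
        if line.startswith("//"):              # end-of-entry marker
            entry["description"] = " ".join(entry["description"])
            yield entry
            entry = new_entry()
            continue
        code, content = line[:2], line[5:].strip()
        if code == "ID":                       # entry name
            entry["name"] = content.split()[0]
        elif code == "AC":                     # accession numbers
            entry["accessions"] += [a.strip() for a in content.split(";") if a.strip()]
        elif code == "DE":                     # description lines
            entry["description"].append(content)
        elif code == "RX":                     # literature cross-references
            entry["pubmed_ids"] += re.findall(r"PubMed=(\d+)", content)


# Invented placeholder entries for illustration only.
SAMPLE = """\
ID   DEMO1_EXAMPLE           Reviewed;         120 AA.
AC   X00001; X00002;
DE   RecName: Full=Hypothetical antiviral protein;
RX   PubMed=0000001;
//
ID   DEMO2_EXAMPLE           Unreviewed;        95 AA.
AC   X00003;
DE   SubName: Full=Hypothetical protein 2;
//
"""
entries = list(parse_entries(SAMPLE))
print(entries[0]["name"], entries[0]["pubmed_ids"])
```

The PubMed identifiers on the `RX` lines are what make cross-querying between protein records and citations possible later on.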




Ontology

[Figures: Protein Ontology; Protein + Literature Ontology]



Ontology Populator

[Figure: Ontology Populator. Inputs: parsed UniProt data, parsed PubMed data, Protein + Literature ontology. Output: Ontology Triplestore]

Knowle Ontology Population Service

- Populate concepts

- Assert DatatypeProperties

- Assert ObjectProperties

Knowle Text Mining Service

- Entity detection

- Relation extraction
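The three population steps (populate concepts, assert DatatypeProperties, assert ObjectProperties) can be sketched as plain subject-predicate-object triples. This is not the Knowle API, which the slides do not describe: the `ex:` names, the input dictionaries and the sample values are all hypothetical.

```python
TYPE = "rdf:type"


def populate(proteins, citations):
    """Turn parsed records into a set of (subject, predicate, object) triples."""
    triples = set()
    for prot in proteins:
        subject = f"ex:protein/{prot['accession']}"
        # Populate concept: the individual is an instance of a class.
        triples.add((subject, TYPE, "ex:Protein"))
        # Datatype property: the object is a literal value.
        triples.add((subject, "ex:name", f'"{prot["name"]}"'))
        # Object property: the object is another individual.
        for pmid in prot["pubmed_ids"]:
            triples.add((subject, "ex:citedIn", f"ex:citation/{pmid}"))
    for cit in citations:
        subject = f"ex:citation/{cit['pmid']}"
        triples.add((subject, TYPE, "ex:Citation"))
        triples.add((subject, "ex:title", f'"{cit["title"]}"'))
    return triples


# Invented placeholder records for illustration only.
PROTEINS = [{"accession": "X00001",
             "name": "Hypothetical antiviral protein",
             "pubmed_ids": ["0000001"]}]
CITATIONS = [{"pmid": "0000001", "title": "Placeholder title"}]

triples = populate(PROTEINS, CITATIONS)
for triple in sorted(triples):
    print(triple)
```

The `ex:citedIn` triple is the link between the protein side and the literature side of the combined ontology.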



Query & Reasoner

[Figure: Query & Reasoner. Components: Ontology Triplestore, OWLIM Reasoner, SAIL, Sesame, Knowle Query Service, User Interface]
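What a reasoner adds on top of a triplestore can be shown with a toy example: triples that were never asserted become answers to a query once subclass relations are taken into account. The sketch below applies two simple subclass rules until nothing new is derived. It is not how OWLIM or Sesame are implemented, and the class and instance names are hypothetical.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def materialize(asserted):
    """Add implied triples until no rule produces anything new."""
    closure = set(asserted)
    while True:
        derived = set()
        for sub, p, sup in closure:
            if p != SUBCLASS:
                continue
            for s, q, o in closure:
                if q == SUBCLASS and s == sup:
                    derived.add((sub, SUBCLASS, o))   # subclass chains are transitive
                elif q == TYPE and o == sub:
                    derived.add((s, TYPE, sup))       # instances inherit superclasses
        if derived <= closure:
            return closure
        closure |= derived


def instances_of(triples, cls):
    """Query: all subjects typed with the given class."""
    return sorted(s for s, p, o in triples if p == TYPE and o == cls)


# Invented placeholder facts for illustration only.
ASSERTED = {
    ("ex:AntiviralProtein", SUBCLASS, "ex:Protein"),
    ("ex:Protein", SUBCLASS, "ex:Biomolecule"),
    ("ex:protein/X00001", TYPE, "ex:AntiviralProtein"),
}
closure = materialize(ASSERTED)
print(instances_of(ASSERTED, "ex:Protein"))
print(instances_of(closure, "ex:Protein"))
```

Without reasoning, the query for `ex:Protein` finds nothing because the individual was only asserted as an `ex:AntiviralProtein`; with the derived triples it is returned.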



User Interface

[Figure: User Interface. Components: Ontology Triplestore, Knowle Population Service, Search, Web Crawler, Parser, KnowleGator, Ontology Visual Query, Visual Query Translator, Ontology Query & Reasoner]
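A visual query translator turns a drawn graph of concepts and relations into a textual ontology query. The slides do not describe the translator's output format, so the sketch below assumes SPARQL as the target and a simple node/edge description as the input; the prefix, class and property names are hypothetical.

```python
def to_sparql(nodes, edges, prefixes):
    """Translate a node/edge description of a query into SPARQL text.

    nodes: variable name -> class it must belong to
    edges: (subject variable, property, object variable) triples
    """
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()]
    lines.append("SELECT " + " ".join(f"?{var}" for var in nodes))
    lines.append("WHERE {")
    for var, cls in nodes.items():
        lines.append(f"  ?{var} a {cls} .")          # 'a' is SPARQL shorthand for rdf:type
    for subj, pred, obj in edges:
        lines.append(f"  ?{subj} {pred} ?{obj} .")
    lines.append("}")
    return "\n".join(lines)


# Invented placeholder query: proteins together with the citations that mention them.
PREFIXES = {"ex": "http://example.org/onto#"}
NODES = {"protein": "ex:Protein", "citation": "ex:Citation"}
EDGES = [("protein", "ex:citedIn", "citation")]

query = to_sparql(NODES, EDGES, PREFIXES)
print(query)
```

Each box the user draws becomes a typed variable and each arrow becomes a triple pattern, so the user never has to write query syntax by hand.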



A case study in Bio-informatics

• Integration, cross-querying from PubMed and UniProt

• Data

– 70,054 citations from PubMed

– 14,527 proteins in UniProt

• Infrastructure (virtual computers)

– 4 data nodes (RAM: 1 GB, CPU: Intel Xeon 2.4 GHz)

– 2 master nodes (1 name node, 1 secondary name node) (RAM: 512 MB, CPU: Intel Xeon 2.4 GHz)

– 1 virtual CPU = Intel Xeon 2.4 GHz



Demo

• Data

– UniProt: 853 antiviral protein entries

– PubMed: 2,000 citations



Demo Snapshot



Summary

• We proposed a new framework

– Data infrastructure module (cloud-based infrastructure)

– Data analytics module (semantic technologies)

• We tested on a prototype

– Using our own infrastructure

– With integration and cross-querying from PubMed and UniProt



Future works

• Integrated user interface

• Explore other cloud-based data stores: HBase, BigTable

• Apply the MapReduce concept to data analytics and crawling

• Integrate Knowle into the cloud-based environment




http://www.a-star.edu.sg/astar/index.do
