

Page 1: Storage Solutions for Bioinformatics

Storage Solutions for Bioinformatics

Li Yan
Director of FlexLab, Bioinformatics core technology laboratory

[email protected]
http://www.genomics.cn/FlexLab/index.html

Science and Technology Division, BGI-Shenzhen

Page 2: Storage Solutions for Bioinformatics

OUTLINE

• Background

• Hardware Infrastructure of Data Storage

• Data Management

• Data Storage Architecture In BGI

• Distributed Computing on Storage Server

Page 3: Storage Solutions for Bioinformatics

Background: Fast Growing Big Data

Page 4: Storage Solutions for Bioinformatics

Sequencing, sequencing and sequencing

Page 5: Storage Solutions for Bioinformatics

Background

Page 6: Storage Solutions for Bioinformatics

Fast growing big data

• From small genomes to large complex genomes
  E. coli genome: 4.9 Mb
  Caenorhabditis elegans genome: 100 Mb
  Human genome: 3 Gb
  Wheat genome: 16 Gb
  Salamander genome: 45 Gb

• From one sample to populations
  Human genome: 3 billion DNA subunits (A, T, C, G)
  80~100X sequencing: 600GB of raw data for an individual study
  1000 Genomes Project: 600TB of raw data for a population study

• From first-generation sequencing to second-generation sequencing
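The raw-data figures above can be checked with a back-of-the-envelope calculation. The sketch below assumes roughly 2 bytes per sequenced base in uncompressed raw data (one for the base call, one for its quality score) and treats the population study as 1000 samples at the same depth. Both are assumptions for illustration, not statements from the slides, and the function name is hypothetical.

```python
# Rough estimate of raw sequencing data volume.
# Assumption (not from the slides): ~2 bytes per sequenced base,
# one for the base call and one for its quality score.
BYTES_PER_BASE = 2


def raw_data_bytes(genome_size_bases: int, depth: int, samples: int = 1) -> int:
    """Return the approximate raw data size in bytes."""
    return genome_size_bases * depth * BYTES_PER_BASE * samples


HUMAN_GENOME = 3_000_000_000  # 3 billion bases

# One individual sequenced at 100X depth.
one_sample = raw_data_bytes(HUMAN_GENOME, depth=100)

# Assumed population study: 1000 individuals at the same depth.
population = raw_data_bytes(HUMAN_GENOME, depth=100, samples=1000)

print(one_sample / 1e9, "GB")   # 600.0 GB
print(population / 1e12, "TB")  # 600.0 TB
```

Under these assumptions the estimate matches the slide: about 600GB per individual and about 600TB for a thousand individuals.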

Page 7: Storage Solutions for Bioinformatics

Long-Term Data Storage Needs

• Properly secure the data
  Plan for data redundancy, which generally means mirroring data with two or more copies

• Available (24x7x365) for all kinds of uses
  Readily accessible and in the right format

• Fast data transfer for collaborations
  Fast network server (Aspera) instead of mailing a hard drive

• Scalable, easy to scale up
  Choosing reliable file systems
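The redundancy requirement (two or more copies) can be sketched as a copy step followed by a checksum check. This is a minimal illustration using only the Python standard library; the function names, the file name, and the use of temporary directories as stand-ins for separate storage systems are all assumptions, not part of the presentation.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror(source: Path, replica_dirs: list[Path]) -> list[Path]:
    """Copy source into every replica directory and verify each copy."""
    expected = sha256_of(source)
    copies = []
    for directory in replica_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies


# Temporary directories stand in for two independent storage systems.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample = root / "sample.fastq"  # hypothetical file name
    sample.write_text("@read1\nACGT\n+\nIIII\n")
    copies = mirror(sample, [root / "copy_a", root / "copy_b"])
    verified = all(sha256_of(c) == sha256_of(sample) for c in copies)
    print(len(copies), verified)
```

Verifying a checksum after each copy is what turns "we copied it" into "we have a second good copy".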

Page 8: Storage Solutions for Bioinformatics

Hardware infrastructure of data storage

Page 9: Storage Solutions for Bioinformatics

Type of Storage infrastructure

• Disk library
  A high-capacity storage system that holds a quantity of CD-ROM, DVD or magneto-optic (MO) disks in a storage rack and feeds them to one or more drives for reading and writing.

• Magnetic tape
  A high-capacity data storage system for storing, retrieving, reading and writing multiple magnetic tape cartridges.

• Redundant array of independent disks (RAID)
  RAID is a storage technology that combines multiple disk drive components into a logical unit.

• Direct-attached storage (DAS)
  A digital storage system directly attached to a server or workstation, without a storage network in between.

• Network-attached storage (NAS)
  File-level computer data storage connected to a computer network, providing data access to heterogeneous clients.

• Storage area network (SAN)
  A dedicated network that provides access to consolidated, block-level data storage.
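One way a RAID array turns several drives into a single fault-tolerant logical unit is XOR parity, as used in parity-based levels such as RAID 5. The slides do not say which RAID level is meant, so the sketch below is only an illustration of the general idea, with made-up block contents and a hypothetical helper name.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per (simulated) disk; the contents are made up.
data_blocks = [b"ACGT", b"TTGA", b"CCAG"]

# The parity block is stored on a fourth disk.
parity = xor_blocks(data_blocks)

# Simulate losing the second disk: XOR of the survivors and the parity
# block reproduces the missing block.
survivors = [data_blocks[0], data_blocks[2], parity]
recovered = xor_blocks(survivors)

print(recovered)  # b'TTGA'
```

The same property explains the "possible false sense of security" often noted for RAID: single parity survives one lost disk, not two.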

Page 10: Storage Solutions for Bioinformatics

Type of Storage: Pros, Cons, General use

Disk library
• Pros: fast; high storage capacity; high data availability
• Cons: not as easily accessible as DAS; intended for write-once, read-rarely information
• General use: disk-to-disk backup; archiving; near-line storage

Magnetic tape
• Pros: low cost per megabyte; portable; unlimited capacity (with multiple tapes)
• Cons: inconvenient for fast recovery of individual files or groups of files
• General use: archiving; limited-budget businesses; offsite storage

Redundant array of independent disks (RAID)
• Pros: fast; high storage capacity; high data availability; reliable; security; fault tolerance
• Cons: possible false sense of security; some recovery difficulty on some systems; high cost for optimum systems
• General use: swap files; Internet service providers; redundant storage

Page 11: Storage Solutions for Bioinformatics

Type of Storage: Pros, Cons, General use

Direct-attached storage (DAS)
• Pros: simple; low starting cost; easy to use
• Cons: needs separate storage for each server; not easy to transfer data over the network; server takes the application processing load
• General use: data and application sharing; data backup; archiving

Network-attached storage (NAS)
• Pros: fast file access for multiple clients; ease of data sharing; high storage capacity; redundancy; ease of drive mirroring; consolidated resources
• Cons: less convenient than SAN for moving large blocks of data
• General use: backup; archiving; redundant storage

Storage area network (SAN)
• Pros: excellent for moving large blocks of data; exceptional reliability; easily available; fault tolerance; scalability
• Cons: expensive; lack of standardization; management complexity
• General use: large databases; bandwidth-intensive applications; mission-critical applications

Page 12: Storage Solutions for Bioinformatics

Software Level of Data storage

Page 13: Storage Solutions for Bioinformatics

Data flow of NGS

[Diagram labels, in order of appearance: Sequencer; Raw Data; Alignment; Assembly; Association; Complex workflow; Meaningful Biology Data; Data Store]

Complex workflow
• Annotation of features
• Variations/Mutations
• Protein structure
• Gene expression
• Function networks

Page 14: Storage Solutions for Bioinformatics

Data Management

• Classify the data into different levels
  First Level of Storage: dynamic, fast, temporary
  Second Level of Storage: slower than the first level, but enduring and safe
  Third Level of Storage: high-capacity medium for backups and archives

• Choosing file systems
  Current popular distributed file systems include: Lustre, HDFS, MogileFS, FreeNAS, FastDFS, OpenAFS, MooseFS, pNFS, and GoogleFS.

Page 15: Storage Solutions for Bioinformatics

Classify the data into different levels

• First Level of Storage: dynamic, fast, temporary
    • Intermediate results of data analysis
    • Reference data
    • …

• Second Level of Storage: slower than the first level, but enduring and safe
    • Sequencing raw data
    • Meaningful data

• Third Level of Storage: high-capacity medium for backups and archives
    • Backups and archives of raw data and meaningful data
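The three-level classification can be written down as a simple lookup from kind of data to storage level. The sketch below is only an illustration: the category strings and the function name are hypothetical labels chosen to mirror the bullet points, not an actual BGI policy.

```python
from enum import Enum


class StorageLevel(Enum):
    FIRST = "dynamic, fast, temporary"
    SECOND = "slower than first level, but enduring and safe"
    THIRD = "high-capacity medium for backups and archives"


# Hypothetical category names that mirror the bullet points on the slide.
LEVEL_BY_KIND = {
    "intermediate_result": StorageLevel.FIRST,
    "reference_data": StorageLevel.FIRST,
    "raw_sequencing_data": StorageLevel.SECOND,
    "meaningful_data": StorageLevel.SECOND,
    "backup": StorageLevel.THIRD,
    "archive": StorageLevel.THIRD,
}


def storage_level(kind: str) -> StorageLevel:
    """Return the storage level for a kind of data, or fail loudly."""
    try:
        return LEVEL_BY_KIND[kind]
    except KeyError:
        raise ValueError(f"unknown data kind: {kind!r}") from None


print(storage_level("raw_sequencing_data").name)  # SECOND
```

Failing on unknown kinds, rather than defaulting to a level, keeps unclassified data from silently landing on temporary storage.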

Page 16: Storage Solutions for Bioinformatics

Storage Server: Distributed file systems

Distributed File systems

• Lustre
  Lustre is a large-scale, secure, reliable, highly available cluster file system, developed and maintained by Sun. Lustre can support more than 10,000 nodes and petabyte-scale storage systems.

• Hadoop (HDFS)
  Hadoop is not just a distributed file system (HDFS) for storage; it is a framework designed for running large-scale distributed applications on clusters of general-purpose computing devices.

• OneFS
  OneFS allows a single cluster to scale to more than 1.6 petabytes of capacity and up to 10 GB/s (gigabytes per second) of throughput.

Page 17: Storage Solutions for Bioinformatics

Distributed File systems

• MogileFS (www.danga.com)

• FreeNAS (www.openqrm.org)

• FastDFS (code.google.com/p/fastdfs)

• OpenAFS (www.openafs.org)

• MooseFS (derf.homelinux.org)

• pNFS (www.pnfs.com)

• GoogleFS

Page 18: Storage Solutions for Bioinformatics

Data compression & Data security

• Data compression
  Commonly used: Lempel-Ziv, BWT
  Specific to DNA sequences: Biocompress, GenCompress, CTW-LZ, GeNML, fqzcomp, sam_comp

• Data security
  RAID system failure / redundancy
  File system
  Network
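Both general-purpose families named above are available in the Python standard library: zlib implements DEFLATE (an LZ77 variant plus Huffman coding) and bz2 implements bzip2, which is built on the Burrows-Wheeler transform. The sketch below compares them on a randomly generated sequence and adds a plain 2-bit-per-base packing as the simplest DNA-specific baseline. The packing functions are hypothetical helpers for illustration and are not the algorithm of any tool listed on the slide.

```python
import bz2
import random
import zlib

# Synthetic test sequence; real reads are less random and compress better.
random.seed(0)
sequence = "".join(random.choice("ACGT") for _ in range(100_000))
raw = sequence.encode("ascii")

deflated = zlib.compress(raw, 9)  # Lempel-Ziv family (DEFLATE)
bzipped = bz2.compress(raw, 9)    # BWT-based (bzip2)

CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_2bit(seq: str) -> bytes:
    """Pack four bases per byte; a short final chunk is zero-padded."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Inverse of pack_2bit; length removes the padding bases."""
    bases = "ACGT"
    out = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            out.append(bases[(byte >> shift) & 3])
    return "".join(out[:length])


packed = pack_2bit(sequence)
print(len(raw), len(deflated), len(bzipped), len(packed))
```

The 2-bit packing only handles the four letters A, C, G and T; quality scores, ambiguous bases and read names need a format-aware compressor.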

Page 19: Storage Solutions for Bioinformatics

Data Storage Architecture In BGI

Page 20: Storage Solutions for Bioinformatics

Data Storage Architecture In BGI

[Diagram labels: Sequencers; Compute Nodes; Tape Library; Archiving; Two Copies. Arrows labeled Write, Read and ReadWrite.]

Page 21: Storage Solutions for Bioinformatics

Data Storage Architecture In BGI

[Diagram as on page 20, with the added label: First Level Storage]

Page 22: Storage Solutions for Bioinformatics

Data Storage Architecture In BGI

[Diagram as on page 20, with the added label: Second Level Storage]

Page 23: Storage Solutions for Bioinformatics

Data Storage Architecture In BGI

[Diagram as on page 20, with the added label: Third Level Storage]

Page 24: Storage Solutions for Bioinformatics

Data Storage Architecture In BGI

[Diagram repeated from page 20, without a storage-level label]

Page 25: Storage Solutions for Bioinformatics

Distributed Computing on Storage Server

Page 26: Storage Solutions for Bioinformatics

Traditional Genome Assembly

Costly, Unscalable

[Diagram labels: Users; NGS read file; Sequence Assembly; Storage; Large-memory server (>500GB)]

Page 27: Storage Solutions for Bioinformatics

Distributed Genome Assembly

Assembly ……

Several storage servers (16 × IBM3630 for a human genome)

Cost-effective, Scalable
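The slide does not say how the assembly work is divided among the storage servers. One common way to spread such work is to assign each k-mer to a server by a stable hash, so that no single machine has to hold the whole data set in memory. The sketch below shows that idea only as an assumption: the function names are hypothetical, and the figure of 16 servers is borrowed from the slide purely as an example.

```python
import zlib
from collections import defaultdict


def kmers(read: str, k: int):
    """Yield every substring of length k in the read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def owner(kmer: str, n_servers: int) -> int:
    """Stable hash partitioning; crc32 gives the same value on every run."""
    return zlib.crc32(kmer.encode("ascii")) % n_servers


def partition(reads, k: int, n_servers: int):
    """Group distinct k-mers by the server that would own them."""
    shards = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            shards[owner(kmer, n_servers)].add(kmer)
    return shards


# Toy reads; 16 servers as in the human-genome example on the slide.
shards = partition(["ACGTACGT", "CGTACGTA"], k=4, n_servers=16)
for server_id in sorted(shards):
    print(server_id, sorted(shards[server_id]))
```

Because the hash is deterministic, every occurrence of the same k-mer lands on the same server, which is what lets each server count or link its k-mers independently.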

Page 28: Storage Solutions for Bioinformatics

Hecate

• Constructing de Bruijn Graph
• Solving Tiny Repeats
• Merging Bubbles
• Scaffolding
• Merging Contigs
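The first step named above, constructing a de Bruijn graph, can be illustrated with a textbook construction: every k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is a minimal single-machine sketch with a hypothetical function name, not Hecate's implementation.

```python
from collections import defaultdict


def de_bruijn(reads, k: int) -> dict:
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping toy reads.
graph = de_bruijn(["ACGTC", "CGTCA"], k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

In this toy example the two reads share the k-mers CGT and GTC, so the graph is a single path AC, CG, GT, TC, CA that spells the merged sequence ACGTCA.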

Page 29: Storage Solutions for Bioinformatics

Gaea 2.1

[Diagram labels: Reads; Reference genome; Preprocessing; Locating; Aligning; SNP calling]

• Distributed indexing for load balancing
• Dynamic programming for robust gap alignment
• Standard mapping quality for SNP calling
• Flexible splitting tolerates more mismatches
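The slide does not specify which dynamic-programming algorithm Gaea uses for gap alignment. As a generic illustration of the technique, the sketch below computes a Needleman-Wunsch global alignment score with a linear gap penalty. The scoring values and the function name are assumptions chosen for the example.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Needleman-Wunsch score, keeping only one row of the table in memory."""
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against the empty string
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4
print(global_alignment_score("ACGT", "AGT"))   # 1
```

In the second call the best alignment opens a single gap opposite the C (three matches at +1, one gap at -2), which is the kind of read-versus-reference difference that gap alignment is meant to tolerate.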

Page 30: Storage Solutions for Bioinformatics

Q&A