The most professional mobile app analytics and developer services platform

王春国 email: [email protected] WeChat/QQ: 715356603

Agenda

• Mobile Big Data
• Tech stack
• Real-time data flow
• Hadoop architecture
• Data warehouse
• Solutions
• Q&A

Mobile Big Data

Mobile Data Features

• Diversity
• Fragmentation
• Multi-dimensional
• High frequency
• High-speed growth
• Low quality

• 10+ billion installations
• ~3+ billion requests, peak 60,000/s
• ~5 TB+ per day
• ~1,000 nodes
• 2 to 2.5 billion messages
• 500+ jobs
• 16 thousand+ apps
• 65 thousand+ developers

Tech Stack

• Java, Scala, Python, Shell, C …
• Kafka, Storm
• Hive, Pig
• MapReduce
• Redis, MongoDB, HBase
• Excel, R
• Finagle
• Git

Data Collection

Architecture

Real Time Data Flow

Batch Mode

Data Warehouse

Solutions

Protobuf

• Serializes structured data (think XML)
• Flexible, efficient, simple
• Language-independent
• Smaller
• Faster
• Simpler format
• Less ambiguous
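To make the "smaller" claim concrete, here is a minimal sketch that hand-encodes a single integer field in the Protocol Buffers wire format (a varint tag followed by a varint value) and compares it with an XML rendering of the same value. It does not use the protobuf library, and the element names `event` and `duration` are hypothetical, chosen only for illustration.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one varint field: tag = (field_number << 3) | wire type 0."""
    tag = (field_number << 3) | 0
    return encode_varint(tag) + encode_varint(value)


# Field 1 carrying the value 150, versus a hypothetical XML equivalent.
binary = encode_int_field(1, 150)
xml = b"<event><duration>150</duration></event>"

print(binary.hex())           # 089601
print(len(binary), len(xml))  # binary payload vs XML payload, in bytes
```

The binary form carries a field number instead of a field name, which is where most of the size difference comes from; the schema (the .proto file) supplies the names on both ends.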

Hive ORCFile Features

• Reduces the NameNode's load
• Light-weight indexes
  - skip row groups
  - seek to a given row
• Block-mode compression
• Bounds the amount of memory needed for reading or writing
• Metadata stored using Protocol Buffers
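The "skip row groups" idea behind the light-weight indexes can be pictured with a toy model: keep a (min, max) pair per group of rows and only scan groups whose range could contain the value being searched for. This is a simplified sketch, not ORC's actual on-disk format; `build_index`, `count_equal`, and the group size of 10 are assumptions made for illustration.

```python
from typing import List, Tuple

RowGroup = Tuple[int, int, List[int]]  # (min, max, rows)


def build_index(values: List[int], group_size: int) -> List[RowGroup]:
    """Split a column into row groups and record min/max for each group."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g), g) for g in groups]


def count_equal(index: List[RowGroup], target: int) -> Tuple[int, int]:
    """Count rows equal to target; return (matches, groups actually scanned)."""
    matches = 0
    scanned = 0
    for lo, hi, rows in index:
        if target < lo or target > hi:
            continue  # statistics rule this group out, so skip it unread
        scanned += 1
        matches += sum(1 for v in rows if v == target)
    return matches, scanned


values = list(range(100))        # a sorted column of 100 rows
index = build_index(values, 10)  # 10 row groups of 10 rows each
matches, scanned = count_equal(index, 42)
print(matches, scanned)          # 1 1
```

Skipping works best when the column is sorted or clustered; on randomly ordered data most groups span the full value range and few can be ruled out.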

Hive ORCFile Structure

HQL: SELECT COUNT(1) FROM TABLE(ORCFile vs TextFile)

[Chart: ORCFile vs TextFile, time (s)]

LZMA Compression

• Faster compression speed
• Faster decompression speed
• Smaller memory requirements for decompression
• Smaller code size for decompression
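A quick way to compare codecs on your own logs is to compress the same payload with each and look at the output sizes. The sketch below uses only the Python standard library, so it covers gzip and LZMA but not LZO (which has no stdlib binding). The log line is synthetic and highly repetitive, so the ratios it prints are not representative of real traffic.

```python
import gzip
import lzma

# Synthetic, hypothetical log line repeated many times.
line = b"2014-01-01 12:00:00 appkey=demo event=launch duration=150\n"
raw = line * 20000

gz = gzip.compress(raw)
xz = lzma.compress(raw)

# Both codecs are lossless: decompression must return the original bytes.
assert gzip.decompress(gz) == raw
assert lzma.decompress(xz) == raw

sizes = {"raw": len(raw), "gz": len(gz), "lzma": len(xz)}
for name, size in sizes.items():
    print(f"{name:>5}: {size} bytes")
```

To compare speed as well as size, wrap each `compress` and `decompress` call with `time.perf_counter()` and run it against a sample of real data.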

[Chart: gz vs lzo vs lzma, log (G)]

Blend Scheduler

• Fair Scheduler
• Map Slot <-> Reduce Slot
• More efficient
• Full use of cluster resources
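One reading of "Map Slot <-> Reduce Slot" is that an idle slot of one type can be lent to tasks of the other type. Under that assumption (the slides do not spell out the mechanism), the toy model below compares how many slots stay busy with fixed slot types versus interchangeable ones. The function names and the numbers are hypothetical.

```python
def busy_slots_fixed(map_slots: int, reduce_slots: int,
                     pending_maps: int, pending_reduces: int) -> int:
    """Each slot type can only run its own kind of task."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def busy_slots_blended(map_slots: int, reduce_slots: int,
                       pending_maps: int, pending_reduces: int) -> int:
    """Any free slot can run either kind of task."""
    return min(map_slots + reduce_slots, pending_maps + pending_reduces)


# A map-heavy moment: 6 map slots, 4 reduce slots, 10 maps and 1 reduce waiting.
fixed = busy_slots_fixed(6, 4, 10, 1)
blended = busy_slots_blended(6, 4, 10, 1)
print(fixed, blended)  # 7 10
```

With fixed slot types, three reduce slots sit idle while four map tasks wait; with interchangeable slots, all ten are in use.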

Data Skew

Row key design: date + appkey

Row key design: md5(date + appkey)[0:4] + date + appkey
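The two row key designs can be sketched as follows. With date + appkey, every key written on the same day shares a prefix and lands in the same key range; prepending the first four characters of an MD5 digest spreads those keys out. The sketch assumes plain string concatenation with no separator, a hexadecimal digest, and a `YYYYMMDD` date format, none of which the slides specify; the appkey values are made up.

```python
import hashlib


def plain_row_key(date: str, appkey: str) -> str:
    """Row key as date + appkey: all keys for one day sort together."""
    return date + appkey


def salted_row_key(date: str, appkey: str) -> str:
    """Row key as md5(date + appkey)[0:4] + date + appkey."""
    digest = hashlib.md5((date + appkey).encode("utf-8")).hexdigest()
    return digest[:4] + date + appkey


date = "20140101"
appkeys = [f"app{i:04d}" for i in range(100)]  # hypothetical appkeys

plain = [plain_row_key(date, k) for k in appkeys]
salted = [salted_row_key(date, k) for k in appkeys]

# Number of distinct leading characters: a rough proxy for how widely
# the keys spread across the key space.
print(len({k[0] for k in plain}), len({k[0] for k in salted}))
```

Because the prefix is derived from the key itself, a reader who knows the date and appkey can recompute it for point lookups. A scan over all apps for one date, however, can no longer be a single contiguous range.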

Bulk Load

MapReduce -> put HBase Table

HDFS -> HFile -> Table: 4 min 10 s

Welcome to Umeng!