Introduction to Cloud Computing

http://net.pku.edu.cn/~course/cs402/2009/

Peng Bo (彭波) [email protected]
School of Electronics Engineering and Computer Science, Peking University (北京大学信息科学技术学院)

6/30/2009

Outline

What is cloud computing?
What is large-scale data processing?
What are the goals and content of this course?

Cloud Computing

What is Cloud Computing?

1. First, write down your own opinion of “cloud computing”, whatever comes to mind.

2. Questions: What? Who? Why? How? Pros and cons?

3. The most important question: how does it relate to me?

Cloud Computing is…

No software
Access everywhere via the Internet
Power: large-scale data processing
Appeal for startups
Cost efficiency
It is really convenient
Software as platform

Cons: security, data lock-in

SaaS, PaaS, Utility Computing

Software as a Service (SaaS)

a model of software deployment whereby a provider licenses an application to customers for use as a service on demand.

http://en.wikipedia.org/wiki/Software_deployment


Platform as a Service (PaaS)

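For developing web applications and services, PaaS provides a complete, Internet-based, integrated environment covering development, testing, deployment, operation, and maintenance. In particular, it is built on a multi-tenant architecture from the start: users do not need to deal with multi-user concurrency themselves, because the platform handles it, including concurrency management, scalability, failure recovery, and security.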

Utility Computing

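“Pay-as-you-go”: it is like plugging into a wall socket. The voltage you get is the same as what Microsoft gets; you simply use less and pay less. The goal of utility computing is to give computing resources the same kind of service model: users can draw on the computing resources that a Fortune 500 company owns, and use less, pay less. This is an important aspect of cloud computing.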

Cloud Computing is…


Key Characteristics

Illusion of infinite computing resources available on demand

Elimination of an up-front commitment by cloud users (startup launch costs)

Ability to pay for use of computing resources on a short-term basis as needed: billing in small time slices. The report notes that utility computing failed in practice on this point.

Very large datacenters

Large-scale software infrastructure

Operational expertise

Why now?

The practice of very large-scale datacenters, enabled by new technology trends and business models

Pay-as-you-go computing

Key Players

Amazon Web Services
Google App Engine
Microsoft Windows Azure

Key Applications

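Mobile interactive applications. Tim O’Reilly believes the future belongs to services that can deliver information to users in real time. Mobile is bound to be key, and running the back end in a datacenter is a natural model, especially for mashup-style services.

Parallel batch processing. Cloud computing is a natural fit for large-scale data processing; MapReduce and Hadoop play an important role here. Moving data into and out of the cloud is a major cost, and Amazon has begun to host large public datasets for free.

The rise of analytics. Among database applications, transaction-based applications are still growing, while analytics applications are growing rapidly, driven strongly by data mining, user behavior analysis, and similar applications.

Extension of compute-intensive desktop applications. For compute-intensive tasks, even MATLAB and Mathematica now have cloud computing extensions, woo~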

Cloud Computing = Silver Bullet?

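On March 7, Google Docs suffered an incident in which a large number of users' files were exposed. A US privacy protection group subsequently petitioned the government to take action against Google so that it strengthens the security of its cloud computing products.

Problem of data lock-in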

Challenges


Some other Voices

“It’s stupidity. It’s worse than stupidity: it’s a marketing hype campaign. Somebody is saying this is inevitable, and whenever you hear somebody saying that, it’s very likely to be a set of businesses campaigning to make it true.”
Richard Stallman, quoted in The Guardian, September 29, 2008

“The interesting thing about Cloud Computing is that we’ve redefined Cloud Computing to include everything that we already do. . . . I don’t understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads.”
Larry Ellison, quoted in the Wall Street Journal, September 26, 2008

What does it matter to ME?!

What would you want to do with 1,000 PCs, or even 100,000 PCs?

Cloud is coming…

Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, which is faster than Moore's Law.

“Data Center is a Computer”

Parallelism everywhere
Massive, scalable, reliable
Resource management
Data management
Programming model & tools

http://en.wikipedia.org/wiki/Moore%27s_law
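
To put the server-doubling claim in perspective, here is a small arithmetic sketch. The 14-month doubling period comes from the slide; the 24-month period used for Moore's Law is an assumption (18 months is also commonly quoted), and the function name is only illustrative.

```python
def yearly_growth(doubling_months: float) -> float:
    """Growth factor over 12 months for a quantity that doubles every `doubling_months`."""
    return 2 ** (12 / doubling_months)


servers = yearly_growth(14)  # doubling period quoted on the slide
moore = yearly_growth(24)    # ASSUMED Moore's Law period; 18 is another common figure

print(f"Servers:     x{servers:.2f} per year")
print(f"Moore's Law: x{moore:.2f} per year (assuming a 24-month doubling period)")
```

Under either common reading of Moore's Law, a 14-month doubling period gives the larger yearly factor, which is the point the slide is making.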

Large-Scale Data Processing

Happening everywhere!

Molecular biology (cancer): microarray chips
Particle events (LHC): particle colliders
Simulations (Millennium): microprocessors
Network traffic (spam): fiber optics

Figures shown on the slide: 300M/day, 1B, 1M/sec

Photo: Maximilien Brice, © CERN

How much data?

Internet Archive has 2 PB of data + 20 TB/month
Google processes 20 PB a day (2008)
“All words ever spoken by human beings”: ~5 EB
CERN’s LHC will generate 10-15 PB a year
Sanger anticipates 6 PB of data in 2009

“640K ought to be enough for anybody.”

NERSC User George Smoot wins 2006 Nobel Prize in Physics

Smoot and Mather, 1992: the COBE experiment showed anisotropy of the CMB

Cosmic Microwave Background Radiation (CMB): an image of the universe at 400,000 years

The Current CMB Map

• Unique imprint of primordial physics through the tiny anisotropies in temperature and polarization.

• Extracting these micro-Kelvin fluctuations from inherently noisy data is a serious computational challenge.

Source: J. Borrill, LBNL

Evolution of CMB Data Sets: Cost > O(Np^3)

Experiment          Nt        Np       Nb       Limiting data  Notes
COBE (1989)         2x10^9    6x10^3   3x10^1   Time           Satellite, workstation
BOOMERanG (1998)    3x10^8    5x10^5   3x10^1   Pixel          Balloon, 1st HPC/NERSC
(4yr) WMAP (2001)   7x10^10   4x10^7   1x10^3   ?              Satellite, analysis-bound
Planck (2007)       5x10^11   6x10^8   6x10^3   Time/Pixel     Satellite, major HPC/DA effort
POLARBEAR (2007)    8x10^12   6x10^6   1x10^3   Time           Ground, NG-multiplexing
CMBPol (~2020)      10^14     10^9     10^4     Time/Pixel     Satellite, early planning/design

data compression
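
The scaling in the table's title can be made concrete with a little arithmetic. The sketch below copies the Np values from the table and evaluates Np^3, the lower bound on cost given in the title. Treating Np^3 as a cost with constant factor 1 is an assumption made only for illustration, and the names are hypothetical.

```python
# Np (number of pixels) for each experiment, copied from the table above.
N_PIXELS = {
    "COBE (1989)": 6e3,
    "BOOMERanG (1998)": 5e5,
    "WMAP (2001)": 4e7,
    "Planck (2007)": 6e8,
    "POLARBEAR (2007)": 6e6,
    "CMBPol (~2020)": 1e9,
}


def cubic_cost(n_pixels: float) -> float:
    """Np^3: the cost lower bound from the slide title (constant factor ASSUMED to be 1)."""
    return n_pixels ** 3


baseline = cubic_cost(N_PIXELS["COBE (1989)"])
for name, n_p in N_PIXELS.items():
    cost = cubic_cost(n_p)
    print(f"{name:18s} Np^3 = {cost:.1e}  ({cost / baseline:.1e} x COBE)")
```

Because the cost is cubic, each factor of 10 in pixel count multiplies the bound by 1000, which is why the later experiments call for major HPC efforts.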

Example: Wikipedia Anthropology

Experiment
Download entire revision history of Wikipedia
4.7 M pages, 58 M revisions, 800 GB
Analyze editing patterns & trends

Computation
Hadoop on 20-machine cluster

Result
Increasing fraction of edits are for work indirectly related to articles

Kittur, Suh, Pendleton (UCLA, PARC), “He Says, She Says: Conflict and Coordination in Wikipedia,” CHI, 2007

Example: Scene Completion

Image Database Grouped by Semantic Content
30 different Flickr.com groups
2.3 M images total (396 GB)

Select Candidate Images Most Suitable for Filling Hole
Classify images with gist scene detector [Torralba]
Color similarity
Local context matching

Computation
Index images offline
50 min. scene matching, 20 min. local matching, 4 min. compositing
Reduces to 5 minutes total by using 5 machines

Extension
Flickr.com has over 500 million images …

Hays, Efros (CMU), “Scene Completion Using Millions of Photographs,” SIGGRAPH, 2007

Example: Web Page Analysis

Experiment
Use web crawler to gather 151M HTML pages weekly, 11 times
Generated 1.2 TB of log information
Analyze page statistics and change frequencies

Systems Challenge
“Moreover, we experienced a catastrophic disk failure during the third crawl, causing us to lose a quarter of the logs of that crawl.”

Fetterly, Manasse, Najork, Wiener (Microsoft, HP), “A Large-Scale Study of the Evolution of Web Pages,” Software-Practice & Experience, 2004

[Figure: Subject genome -> Sequencer -> Reads. The sequencer outputs many short, overlapping reads, e.g. GATGCTTACTATGCGGGCCCC and CGGTCTAATGCTTACTATGC]

DNA Sequencing

ATCTGATAAGTCCCAGGACTTCAGT

GCAAGGCAAACCCGAGCCCAGTTT

TCCAGTTCTAGAGTTTCACATGATC

GGAGTTAGTAAAAGTCCACATTGAG

Genome of an organism encodes genetic information in a long sequence of 4 DNA nucleotides: A, T, C, G
Bacteria: ~5 million bp
Humans: ~3 billion bp

Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300 bp)
Shorter reads, but much higher throughput
Per-base error rate estimated at 1-2% (Simpson et al., 2009)

Recent studies of entire human genomes have used 3.3 (Wang et al., 2008) & 4.0 (Bentley et al., 2008) billion 36 bp reads
~144 GB of compressed sequence data
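
The figures on this slide fit together with some simple arithmetic. The sketch below uses only numbers quoted above, except where a comment marks an assumption: it assumes a single sequencing machine and roughly one byte stored per base.

```python
READ_LENGTH_BP = 36               # read length, from the slide
READS = 4.0e9                     # number of reads (Bentley et al., 2008), from the slide
HUMAN_GENOME_BP = 3e9             # ~3 billion bp, from the slide
MACHINE_BP_PER_DAY = (1e9, 2e9)   # 1-2 Gbp per day, from the slide

total_bp = READS * READ_LENGTH_BP             # total bases sequenced
coverage = total_bp / HUMAN_GENOME_BP         # average times each genome position is read
days_fast = total_bp / MACHINE_BP_PER_DAY[1]  # ASSUMES a single machine at 2 Gbp/day
days_slow = total_bp / MACHINE_BP_PER_DAY[0]  # ASSUMES a single machine at 1 Gbp/day

print(f"Total bases: {total_bp / 1e9:.0f} Gbp")
print(f"Coverage:    {coverage:.0f}x")
print(f"One machine: {days_fast:.0f} to {days_slow:.0f} days")
```

The total of 144 Gbp is consistent with the slide's ~144 GB figure if about one byte is stored per base, which is an assumption here rather than something the slide states.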

[Figure: Alignment. Subject reads such as GCTTATCTAT, TCTAGATGCT and CTATGCGGGC are lined up against the reference sequence CGGTCTAGATGCTTAGCTATGCGGGCCCCTT; several reads carry a T where the reference has a G]

[Figure: the subject reads tiled along the sequence CGGTCTAGATGCTTATCTATGCGGGCCCCTT; labels: Reference sequence, Subject reads]

Example: Bioinformatics

Evaluate running time on a local 24-core cluster
Running time increases linearly with the number of reads

Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
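
The read-mapping task behind this example can be illustrated with a deliberately naive sketch: slide each read along the reference and keep every position with at most k mismatches. This is a brute-force stand-in, not the CloudBurst algorithm; the function names are hypothetical, and the reference and reads are taken from the alignment slide earlier in the deck.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def map_read(reference: str, read: str, max_mismatches: int = 1) -> List[Tuple[int, int]]:
    """Return (offset, mismatches) for every placement within the mismatch budget."""
    hits = []
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = hamming(window, read)
        if mismatches <= max_mismatches:
            hits.append((offset, mismatches))
    return hits


REFERENCE = "CGGTCTAGATGCTTAGCTATGCGGGCCCCTT"
READS = ["GCTTATCTAT", "TCTAGATGCT", "CTATGCGGGC"]

for read in READS:
    print(read, map_read(REFERENCE, read))
```

Each read is processed independently of the others, which is what makes the problem a good fit for splitting across many machines.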


Example: Data Mining

del.icio.us crawl -> a bipartite graph covering 802,739 webpages and 1,021,107 tags.

Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang: PFP: Parallel FP-Growth for Query Recommendation. RecSys 2008: 107-114

http://www.sigmod.org/dblp/db/indices/a-tree/l/Li:Haoyuan.html

Large-Scale Data Processing + Cloud Computing

An Example

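Data Processing Task

Word frequency count: count the number of occurrences of each word in a document collection. Try it on these collections:

In early 2006, we collected 870 million distinct web pages in China, about 2 TB in total.

Commercial search engines such as Google and Yahoo collect 100+ billion pages.

How do we process data at this scale?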

Divide and Conquer

[Figure: “Work” is partitioned into w1, w2, w3; each piece goes to a “worker”; the workers produce r1, r2, r3, which are combined into the “Result”]
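
The partition / worker / combine pattern from the figure can be sketched on one machine. This is a minimal illustration, assuming the work is a list of numbers to be summed; threads stand in for the worker machines, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, n_parts):
    """Split `work` into at most n_parts contiguous chunks (w1, w2, w3, ...)."""
    size = max(1, -(-len(work) // n_parts))  # ceiling division
    return [work[i:i + size] for i in range(0, len(work), size)]


def worker(chunk):
    """Each worker turns its chunk w_i into a partial result r_i (here, a partial sum)."""
    return sum(chunk)


def combine(partials):
    """Merge the partial results r1, r2, r3, ... into the final result."""
    return sum(partials)


def divide_and_conquer(work, n_workers=3):
    chunks = partition(work, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return combine(partials)


print(divide_and_conquer(list(range(10))))
```

In CPython, threads do not speed up CPU-bound work like this; they are used here only to keep the sketch short. On a real cluster, each chunk would be shipped to a separate machine.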

What is MapReduce?

A programming model for parallel/distributed computing

Input -> split -> shuffle -> output

Typical problem solved by MapReduce

Read input: data as key/value records

Map: extract something from each record
map (in_key, in_value) -> list(out_key, intermediate_value)
Processes an input key/value pair and emits intermediate key/value pairs

Shuffle: exchange data so that intermediate results with the same key are gathered on the same node

Reduce: aggregate, summarize, filter, etc.
reduce (out_key, list(intermediate_value)) -> list(out_value)
Merges all values for one key, computes, and emits the combined result (usually just one)

Write output
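
The map / shuffle / reduce steps above can be mimicked in a few lines on a single machine. The sketch below is an in-memory toy, not Hadoop's or Google's runtime; `run_mapreduce`, the sample records, and the two user functions are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle, and reduce over an iterable of (in_key, in_value) records."""
    # Map: each input record yields zero or more (out_key, intermediate_value) pairs.
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(map_fn(in_key, in_value))

    # Shuffle: gather all intermediate values that share a key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)

    # Reduce: turn each key's list of values into a list of output values.
    return {key: list(reduce_fn(key, groups[key])) for key in sorted(groups)}


def parse_pairs(in_key, in_value):
    """Toy map function: extract (name, number) pairs from text like 'a:3 b:5'."""
    for token in in_value.split():
        name, number = token.split(":")
        yield name, int(number)


def keep_max(out_key, values):
    """Toy reduce function: keep only the largest value seen for a key."""
    yield max(values)


RECORDS = [("record1", "a:3 b:5"), ("record2", "a:7 b:1")]
print(run_mapreduce(RECORDS, parse_pairs, keep_max))
```

The user supplies only the two small functions; the driver owns the grouping step, mirroring the division of labor between application code and the runtime.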

Word Frequencies in Web pages

Input: one document per record

The user implements the map function, whose input is
key = document URL
value = document contents

map emits (potentially many) key/value pairs: for each word occurrence in the document, it emits one record <word, “1”>

Example continued:

The MapReduce runtime system (library) gathers all records with the same key together (shuffle/sort)

The user implements the reduce function, which computes over the values for one key: here, their sum

Reduce emits <key, sum>
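
The word-frequency example can be written out as a single-machine sketch. It follows the slide's conventions (map emits <word, "1">, reduce emits <key, sum>), but sorting and `itertools.groupby` only imitate the runtime's shuffle/sort; the function names and sample documents are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def map_word_count(url, contents):
    """Map: emit <word, "1"> for every word occurrence in the document."""
    for word in contents.split():
        yield word, "1"


def reduce_word_count(word, counts):
    """Reduce: sum the counts collected for one word and emit <word, sum>."""
    return word, sum(int(c) for c in counts)


def word_count(documents):
    """Run map, then sort and group by key (imitating shuffle/sort), then reduce."""
    pairs = [kv for url, text in documents.items() for kv in map_word_count(url, text)]
    pairs.sort(key=itemgetter(0))
    return dict(
        reduce_word_count(word, [count for _, count in group])
        for word, group in groupby(pairs, key=itemgetter(0))
    )


# Hypothetical input: key = document URL, value = document contents.
DOCS = {
    "http://example.com/a": "the cloud is the computer",
    "http://example.com/b": "the data center is a computer",
}
print(word_count(DOCS))
```

Splitting on whitespace is the simplest possible tokenizer; real web pages, and Chinese text in particular, would need proper word segmentation before the map step.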

MapReduce Runtime System


History of Hadoop

2004 - Initial versions of what is now the Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January 2006 - Doug Cutting joins Yahoo!
February 2006 - Apache Hadoop project officially started to support the standalone development of Map-Reduce and HDFS.
March 2006 - Formation of the Yahoo! Hadoop team
April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark)
October 2006 - Research cluster reaches 600 nodes
December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
January 2007 - Research cluster reaches 900 nodes
April 2007 - Research clusters - 2 clusters of 1000 nodes
Sep 2008 - Scaling Hadoop to 4000 nodes at Yahoo!

From Theory to Practice

You <-> Hadoop Cluster

1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
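
The steps above can be written down as the shell commands they usually correspond to. The sketch below only builds and prints the command lines; it does not run them. The host name, paths, jar, and class name are hypothetical, the jar is assumed to be on the cluster already, and running the `hadoop` commands over `ssh` is one possible setup rather than the course's prescribed one.

```python
import shlex


def workflow_commands(cluster, local_data, remote_dir, hdfs_in, hdfs_out, jar, main_class):
    """Return the command lines for steps 1, 2, 4, 5 and 6 as argument lists."""
    name = local_data.rsplit("/", 1)[-1]
    remote_data = f"{remote_dir}/{name}"
    remote_result = f"{remote_dir}/result"
    return [
        ["scp", local_data, f"{cluster}:{remote_data}"],                        # 1. scp data to cluster
        ["ssh", cluster, "hadoop", "fs", "-put", remote_data, hdfs_in],          # 2. move data into HDFS
        ["ssh", cluster, "hadoop", "jar", jar, main_class, hdfs_in, hdfs_out],   # 4. submit MapReduce job
        ["ssh", cluster, "hadoop", "fs", "-get", hdfs_out, remote_result],       # 5. move data out of HDFS
        ["scp", "-r", f"{cluster}:{remote_result}", "."],                        # 6. scp data from cluster
    ]


# All values below are placeholders for illustration only.
COMMANDS = workflow_commands(
    cluster="student@hadoop.example.edu",
    local_data="data/pages.txt",
    remote_dir="/tmp/student",
    hdfs_in="input/pages.txt",
    hdfs_out="output/wordcount",
    jar="wordcount.jar",
    main_class="WordCount",
)
for argv in COMMANDS:
    print(shlex.join(argv))
```

Step 3 (develop code locally) has no command here because it happens in the editor; step 4a amounts to repeating steps 3 and 4 until the job output looks right.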

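Course Goals and Content

Course Goals

Master the MapReduce programming model and the use of its runtime environment.
Master the basic methods for parallelizing algorithms under the MapReduce model.
Understand the implementation techniques behind MapReduce's distributed runtime environment.
Understand the state of the art and the key problems in large-scale data processing and algorithm parallelization for cloud computing.
Understand and cultivate the habit of thinking about problems in parallel.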

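Course Content

LEC#  TOPICS / ABSTRACT

1  Course introduction: cloud computing; the MapReduce environment
   Introduces cloud computing technology against the background of large-scale data processing. Teaching and practice on the MapReduce platform is the core of the course.

2  MapReduce principles; the Inverted Index problem
   The basic principles of MapReduce, approached from functional languages. Analysis of the Inverted Index problem and its MapReduce implementation.

3  Fundamentals of parallel and distributed systems; the PageRank problem
   Introduces the design of large-scale parallel and distributed systems. Analysis of the PageRank problem and its MapReduce implementation.

4  MapReduce system design and implementation; the Clustering problem
   Analysis of the system design of MapReduce and its design considerations. Analysis of the Clustering problem and its MapReduce implementation.

5  High-level MapReduce applications; the frequent itemset mining problem
   Introduces applications and developments built on top of MapReduce. Analysis of the frequent itemset mining problem and its MapReduce implementation.

6  Project discussion
   Discussion of course projects.

7  Invited talk
   Talks by invited researchers and engineers from academia or industry.

8  Project reports
   Student reports on course projects.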

Grading Policy

30% Assignments
20% Readings
50% Course project

Hw1 - Read: Intro Distributed system; Intro MapReduce Programming.
  http://code.google.com/edu/parallel/dsd-tutorial.html
  http://code.google.com/edu/parallel/mapreduce-tutorial.html
Hw2 - Read MapReduce [1]
Hw3 - Read GFS [2]
Hw4 - Read Pig Latin [3]

Lab 1 - Introduction to Hadoop, Eclipse
Lab 2 - A Simple Inverted Index
Lab 3 - PageRank over Wikipedia Corpus
Lab 4 - Clustering the Netflix Movie Data

Course Requirements

Proficiency in one programming language
Lots of Java programming practice

Teachers and Resources

Course website: http://net.pku.edu.cn/~course/cs402/2009/
Discussion group: http://groups.google.com/group/cs402pku
Hadoop home page: http://hadoop.apache.org/core/
Resources: http://net.pku.edu.cn/~course/cs402/2008/resource.html

Instructor: Yan Hongfei (闫宏飞)
Teaching assistant: Chen Rishan (陈日闪)

Homework

Register: http://net.pku.edu.cn/~course/cs402/2009/

Form groups of 3-4 people to prepare for the course project; groups that mix different specialties are encouraged.

Lab 1 - Introduction to Hadoop, Eclipse

HW Reading 1: Intro Distributed system; Intro Parallel Programming.
http://code.google.com/edu/parallel/dsd-tutorial.html
http://code.google.com/edu/parallel/mapreduce-tutorial.html

Summary

Cloud computing brings:
The possibility of using unlimited resources on demand, anytime and anywhere
The possibility of constructing and deploying applications that automatically scale to tens of thousands of computers
The possibility of constructing and running programs that deal with prodigious volumes of data
…

How to make it real?
Distributed file system
Distributed computing framework
…

Q&A

References

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.

[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. Bolton Landing, NY, USA: ACM Press, 2003.

[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. Vancouver, Canada: ACM, 2008.

Google App Engine

App Engine handles HTTP(S) requests, nothing else
Think RPC: request in, processing, response out
Works well for the web and AJAX; also for other services

App configuration is dead simple
No performance tuning needed

Everything is built to scale
“Infinite” number of apps, requests/sec, storage capacity

APIs are simple, stupid

App Engine Architecture

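[Figure: App Engine architecture. Requests and responses (req/resp) pass through a Python VM process that runs the app and the stdlib on a read-only file system (R/O FS). Stateful APIs: memcache, datastore. Stateless APIs: mail, images, urlfetch]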

Microsoft Windows Azure


Amazon Web Services

Amazon’s infrastructure (auto scaling, load balancing)

Elastic Compute Cloud (EC2) - scalable virtual private server instances
Simple Storage Service (S3)
Simple Queue Service (SQS) - messaging
SimpleDB - database
Flexible Payments Service, Mechanical Turk, CloudFront, etc.

Amazon Web Services

Very flexible, lower-level offering (closer to hardware) = more possibilities, higher performing

Runs the platform you provide (machine images)
Supports all major web languages
Industry-standard services (move off AWS easily)

Requires much more work, longer time-to-market
Deployment scripts, configuring images, etc.
Various libraries and GUI plug-ins for AWS do help

Price of Amazon EC2
