28
Web Content Cartography Bernhard Ager Wolfgang M¨ uhlbauer Georgios Smaragdakis Steve Uhlig Technische Universtit¨ at Berlin / T-Labs ETH Z¨ urich Internet Measurement Conference 2011 Ager, M¨ uhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 1

Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Web Content Cartography

Bernhard Ager† Wolfgang Muhlbauer‡

Georgios Smaragdakis† Steve Uhlig†

†Technische Universtitat Berlin / T-Labs‡ETH Zurich

Internet Measurement Conference 2011

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 1

Page 2: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Motivation

Motivation

Content is King

• Web traffic currently dominates: ∼ 60 %

• Hosting infrastructures are the work-horse of content delivery

• But: “The only constant is change”: Hyper-giants, Meta CDNs,IETF CDNi, virtualization, applications

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 2

Page 3: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Motivation

How is the hosting landscape evolving?

We need to characterize hosting infrastructures

• Researchers: Understand the content eco-system better

• Content providers: Discover choice of available infrastructures

• ISPs: Perform strategic decisions: Peering, CDN infrastructure

• Infrastructures: Understand position in the market

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 3

Page 4: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Motivation

How we complement existing work

Earlier approaches to characterize infrastructures

Hyper-giants, Google [La10]; Hosting models [Le09]; Rapidshare [An09],Akamai and Limelight [Hu08]; Akamai [Su06]; Akamai, Digital Island, and12 more [Kr01]; ...

... and how our approach is different

• No a-priori signatures• Aiming at the broad picture

• Automatable, lightweight

[La10] C. Labovitz, S. Lekel-Johnson, D. McPherson, J. Oberheide, and F. Jahanian. Internet Inter-Domain Traffic. In Proc.ACM SIGCOMM, 2010.

[Le09] T. Leighton. Improving Performance on the Internet. Commun. ACM, 2009.[An09] D. Antoniades, E. Markatos, and C. Dovrolis. One-click Hosting Services: A File-Sharing Hideout. In Proc. ACM IMC,

2009.[Hu08] C. Huang, A. Wang, J. Li, and K. Ross. Measuring and Evaluating Large-scale CDNs. In Proc. ACM IMC, 2008.[Su06] A. Su, D. Choffnes, A. Kuzmanovic, and F. Bustamante. Drafting Behind Akamai: Inferring Network Conditions Based

on CDN Redirections. IEEE/ACM Trans. Netw., 2009.[Kr01] B. Krishnamurthy, C. Wills, and Y. Zhang. On the Use and Performance of Content Distribution Networks. In Proc.

ACM IMW, 2001.

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 4

Page 5: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Motivation

How we complement existing work

Earlier approaches to characterize infrastructures

Hyper-giants, Google [La10]; Hosting models [Le09]; Rapidshare [An09],Akamai and Limelight [Hu08]; Akamai [Su06]; Akamai, Digital Island, and12 more [Kr01]; ...

... and how our approach is different

• No a-priori signatures• Aiming at the broad picture

• Automatable, lightweight

[La10] C. Labovitz, S. Lekel-Johnson, D. McPherson, J. Oberheide, and F. Jahanian. Internet Inter-Domain Traffic. In Proc.ACM SIGCOMM, 2010.

[Le09] T. Leighton. Improving Performance on the Internet. Commun. ACM, 2009.[An09] D. Antoniades, E. Markatos, and C. Dovrolis. One-click Hosting Services: A File-Sharing Hideout. In Proc. ACM IMC,

2009.[Hu08] C. Huang, A. Wang, J. Li, and K. Ross. Measuring and Evaluating Large-scale CDNs. In Proc. ACM IMC, 2008.[Su06] A. Su, D. Choffnes, A. Kuzmanovic, and F. Bustamante. Drafting Behind Akamai: Inferring Network Conditions Based

on CDN Redirections. IEEE/ACM Trans. Netw., 2009.[Kr01] B. Krishnamurthy, C. Wills, and Y. Zhang. On the Use and Performance of Content Distribution Networks. In Proc.

ACM IMW, 2001.

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 4

Page 6: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Outline

1 Motivation

2 Approach

3 Data

4 Results

5 Conclusion

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 5

Page 7: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Approach

What are the characteristics of content hosting?

Web content cartography

• What are those hosting infrastructures?

• Where are they located?• At the network level• Geographically

• Who is operating them?

• Which role does each infrastructure play?

We propose web content cartography:building maps of hosting infrastructures

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 6

Page 8: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Approach

A sketch of HTTP content delivery

Observation

DNS exposesnetwork footprint

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 7

Page 9: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Approach

Identifying infrastructures

Features

• IP address, /24

• Prefix, AS

Two-level clustering process

• First phase: k-means

• Second phase: based on address space

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 8

Page 10: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Data

Collecting data

Hostnames

Requirement: Good coverage of hosting infrastructures

• Extracted from Alexa top 1 Mio. list

• 2000 top, 2000 tail, ∼ 3000 embedded, ∼ 850 cnames

Traces

Requirement: Sampling a large enough network footprint

• Script

• Run by volunteers

• Trace collection via website

Traces 133ASN 78Countries 27Continents 6

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 9

Page 11: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Data

Estimating coverageHow should you choose vantage points?

0 20 40 60 80 100 120

020

0040

0060

0080

00

Trace

Num

ber

of /2

4 su

bnet

wor

ks d

isco

vere

d

OptimizedMax randomMedian randomMin random

Insights

• Optimized: first 30traces from 30 ASs in 24countries ⇒ samplingdiversity comes fromgeographic and networkdiversity

• Median: tail traces yield20 /24s per trace ⇒limited utility whenadding more traces

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 10

Page 12: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Data

Estimating coverageHow should you choose hostnames?

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Similarity

CD

F

EmbeddedTop 2000TotalTail 2000

Insights

• embedded: similaritylow ⇒ better distributed

• tail: similarity high ⇒mostly centralized

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 11

Page 13: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Results

Characterizing infrastructures

Rank # hostnames owner content mix

1 476 Akamai3 108 Google4 70 Akamai5 70 Google6 57 Limelight7 57 ThePlanet

12 28 Wordpressonly on top, both on top and embedded, only on embedded, tail.

Main findings in Top 20

• tail content is important: consolidation

• Some companies run multiple infrastructures

• embedded often dominating

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 12

Page 14: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Results

Content potential and monopoly

Location CP

NCP CMI

AS 1 1

0.75 0.75

AS 2 0.5

0.25 0.5

Content Potential (CP)

Fraction of content available from a location.

Normalized Content Potential (NCP)

CP weighted by distributedness.

Content Monopoly Index (CMI)

CMI = NCP / CP

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 13

Page 15: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Results

Content potential and monopoly

Location CP NCP

CMI

AS 1 1 0.75

0.75

AS 2 0.5 0.25

0.5

Content Potential (CP)

Fraction of content available from a location.

Normalized Content Potential (NCP)

CP weighted by distributedness.

Content Monopoly Index (CMI)

CMI = NCP / CP

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 13

Page 16: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Results

Content potential and monopoly

Location CP NCP CMI

AS 1 1 0.75 0.75AS 2 0.5 0.25 0.5

Content Potential (CP)

Fraction of content available from a location.

Normalized Content Potential (NCP)

CP weighted by distributedness.

Content Monopoly Index (CMI)

CMI = NCP / CP

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 13

Page 17: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Results

Normalized content potential: Top 12 ASs

1 2 3 4 5 6 7 8 9 10 11 12

CPNCP

Rank

Pot

entia

l

0.00

0.05

0.10

0.15

Rank AS name CMI1 Chinanet 0.6992 Google 0.9963 ThePlanet.com 0.9854 SoftLayer 0.9675 China169 BB 0.5766 Level 3 0.1097 China Telecom 0.4708 Rackspace 0.9549 1&1 Internet 0.969

10 OVH 0.96911 NTT America 0.07012 EdgeCast 0.688

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 14

Page 18: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Results

Comparing AS rankings

CAIDA-cone [CAIDA]

• Number ofcustomer ASs

Arbor [La10]

• Inter-AS trafficvolume

Normalized potential

• Weighted contentavailability

Rank CAIDA-cone Arbor Normalized potential1 Level 3 Level 3 Chinanet2 AT&T Global Crossing Google3 MCI Google ThePlanet4 Cogent/PSI * SoftLayer5 Global Crossing * China169 backbone6 Sprint Comcast Level 37 Qwest * Rackspace8 Hurricane Electric * China Telecom9 tw telecom * 1&1 Internet

10 TeliaNet * OVH

[La10] C. Labovitz, S. Lekel-Johnson, D. McPherson, J. Oberheide, and F. Jahanian. Internet Inter-Domain Traffic. In Proc.ACM SIGCOMM, 2010.

[CAIDA] http://as-rank.caida.org/

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 15

Page 19: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Conclusion

Conclusion

Summary

• Lightweight discovery of hosting infrastructures

• Characterization of hosting infrastructures• We can detect the inhomogenous use of infrastructures

• Content-centric AS rankings• “Content monopolies”: Google, Chinese ISPs• Complementary to traditional rankings

Future work

• Relate with other metrics: traffic volume, finances, ...

• Explore the interplay of content delivery with the topology

• Break-down content by other categories: language, category, ...

• Follow-up work: increase coverage

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 16

Page 20: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Appendix

Backup slides

Backup slides

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 17

Page 21: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Appendix Background

Top 20 content clusters by hostname count

Rank # hostnames # ASes # prefixes owner content mix1 476 79 294 Akamai2 161 70 216 Akamai3 108 1 45 Google4 70 35 137 Akamai5 70 1 45 Google6 57 6 15 Limelight7 57 1 1 ThePlanet8 53 1 1 ThePlanet9 49 34 123 Akamai

10 34 1 2 Skyrock11 29 6 17 Cotendo12 28 4 5 Wordpress13 27 6 21 Footprint14 26 1 1 Ravand15 23 1 1 Xanga16 22 1 4 Edgecast17 22 1 1 ThePlanet18 21 1 1 ivwbox.de19 21 1 5 AOL20 20 1 1 Leaseweb

only on top, both on top and embedded, only on embedded, tail.

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 18

Page 22: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Appendix Background

Marginal utility: hostnames

0 2000 4000 6000

020

0040

0060

0080

00

Hostname ordered by utility

Num

ber

of /2

4 su

bnet

wor

ks d

isco

vere

d

TotalEmbeddedTop 2000Tail 2000

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 19

Page 23: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Appendix Background

Content exchange matrix: top

Requested Served fromfrom Africa Asia Europe N. America Oceania S. America

Africa 0.3 18.6 32.0 46.7 0.3 0.8

Asia 0.3 26.0 20.7 49.8 0.3 0.8

Europe 0.3 18.6 32.2 46.6 0.2 0.8

N. America 0.3 18.6 20.7 58.2 0.2 0.8

Oceania 0.3 20.8 20.5 49.2 5.9 0.8

S. America 0.2 18.7 20.6 49.3 0.2 10.1

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 20

Page 24: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Appendix Background

Content exchange matrix: embedded

Requested Served fromfrom Africa Asia Europe N. America Oceania S. America

Africa 0.3 26.9 35.5 35.8 0.3 0.6

Asia 0.3 37.9 18.3 40.1 1.1 0.6

Europe 0.3 26.8 35.6 35.6 0.4 0.6

N. America 0.3 26.5 18.4 52.9 0.3 0.6

Oceania 0.3 29.2 18.5 38.7 11.3 0.6

S. America 0.3 26.4 18.2 39.3 0.3 14.2

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 21

Page 25: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Appendix Background

Sizes of similarity clusters

1 5 10 50 500

12

510

2050

100

500

Infrastructure cluster by rank

Num

ber

of h

ostn

ames

on

infr

astr

uctu

re

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 22

Page 26: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Appendix Background

Hosting model of the less distributed infrastructures

1 (2

620)

2 (2

89)

3 (6

1)

4 (2

6)

5 (9

)

6 (1

1)

7 (6

)

8 (2

)

9 (3

)

10 (

2)

13 countries12 countries11 countries10 countries9 countries8 countries7 countries6 countries5 countries4 countries3 countries2 countries1 country

Fra

ctio

n

0.0

0.2

0.4

0.6

0.8

1.0

Number of ASN for infrastructure(Number of infrastructure clusters in parenthesis)

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 23

Page 27: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Appendix Background

Determining the hosting modelAn example: Skyrock vs. Cotendo

Rank # hostnames # ASes # prefixes owner content mix

10 34 1 2 Skyrock11 29 6 17 Cotendo

only on top, both on top and embedded, only on embedded, tail.

Skyrock

• 4 /24-subnetworks

• Website offering blogs/OSN

• Single country: France

⇒ Data center

Cotendo

• 21 /24-subnetworks

• Website offers CDN service

• 8 countries on 4 continents

⇒ Global scale CDN

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 24

Page 28: Web Content Cartography - MIT CSAILpeople.csail.mit.edu/.../IMC2011/IMC2011-presentation.pdfWeb Content Cartography Bernhard Agery Wolfgang Muhlbauer z Georgios Smaragdakisy Steve

Appendix Background

Determining the hosting modelAn example: Skyrock vs. Cotendo

Rank # hostnames # ASes # prefixes owner content mix

10 34 1 2 Skyrock11 29 6 17 Cotendo

only on top, both on top and embedded, only on embedded, tail.

Skyrock

• 4 /24-subnetworks

• Website offering blogs/OSN

• Single country: France

⇒ Data center

Cotendo

• 21 /24-subnetworks

• Website offers CDN service

• 8 countries on 4 continents

⇒ Global scale CDN

Ager, Muhlbauer, Smaragdakis, Uhlig (TUB/T-Labs, ETH) Web Content Cartography IMC’11 24