Web Archive Profiling For Efficient Memento Aggregation Sawood Alam Old Dominion University, Norfolk, Virginia - 23529 Advisor: Michael L. Nelson Doctoral Consortium JCDL’16 June 19, 2016 Supported in part by the International Internet Preservation Consortium (IIPC)

JCDL 2016 Doctoral Consortium - Web Archive Profiling


Page 1: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Web Archive Profiling For Efficient Memento Aggregation

Sawood Alam
Old Dominion University, Norfolk, Virginia - 23529

Advisor: Michael L. Nelson

Doctoral Consortium JCDL’16
June 19, 2016

Supported in part by the International Internet Preservation Consortium (IIPC)

Page 2: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Motivation

Page 5: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Memento Aggregator

Page 11: JCDL 2016 Doctoral Consortium - Web Archive Profiling

From: Michael Nelson [mailto:[email protected]]

Sent: Wednesday, December 02, 2015 12:33 PM

To: Jones, Gina

Cc: Rourke, Patrick; Grotke, Abigail

Subject: Re: WebSciDL

Hi Gina, I'll investigate. memgator is software that one my students wrote, but I suspect the traffic you're seeing is b/c it is deployed in http://oldweb.today/ can you share the IP addr from where you're seeing the traffic? I presume the requests are for Memento TimeMaps? It should not being actually scraping HTML pages.

regards,

Michael

On Wed, 2 Dec 2015, Jones, Gina wrote:

> Hi Michael, we have a slight configuration issue with the current OW

> set up for our webarchives. I think, from looking at the logs, that

> "MemGator:1.0-rc3 <@WebSciDL>" is really causing some issues on our wayback.

> Do you know who is running this scraper? Itʼs not part of memento is it?

>

> Gina Jones

> Web Archiving Team

> Library of Congress

From: Ilya Kreymer <[email protected]>

Date: Wed, 2 Dec 2015 10:33:56 -0800

Subject: high traffic on oldweb!

To: Herbert Van de Sompel <[email protected]>, Sawood Alam <[email protected]>

Hi Herbert, Sawood,

Herbert: Perhaps you are lucky that I am not using the LANL aggregator, as the traffic has gotten really high, and also I was asked to remove an archive due to the traffic it was causing temporarily..

I am thinking that ability to remove source archives quickly is an important aspect of an aggregator.

Sawood: Hopefully yours will support something like this so I don't need to restart the container to change the archivelist ;)

Ilya

Broadcasting is Bad

Page 12: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Availability and Overlap

● Archives are sparse
● Broadcasting is wasteful; both clients and archives suffer

Page 13: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Memento Routing

Page 14: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Routing Pros & Cons

● Pros
  ○ Minimizes traffic and resource consumption
  ○ Improves throughput
● Cons
  ○ Upfront profile maintenance cost
  ○ May miss Mementos (false negatives)
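A minimal sketch of the difference between broadcasting and routing, assuming each archive profile can be reduced to a set of URI keys. The archive names, the keys, and the `broadcast` and `route` helpers are hypothetical and are not taken from MemGator.

```python
# Hypothetical profiles: archive name -> set of URI keys it is believed to hold.
PROFILES = {
    "archive-a": {"uk,co,bbc)/", "com,cnn)/"},
    "archive-b": {"uk,co,bbc)/"},
    "archive-c": {"org,example)/"},
}


def broadcast(profiles):
    """Broadcasting queries every known archive, whatever the lookup key is."""
    return sorted(profiles)


def route(key, profiles):
    """Routing queries only the archives whose profile contains the lookup key."""
    return sorted(name for name, keys in profiles.items() if key in keys)


key = "com,cnn)/"
print(len(broadcast(PROFILES)), "requests when broadcasting")
print(len(route(key, PROFILES)), "request(s) when routing:", route(key, PROFILES))
```

If a profile is stale or too coarse, `route` can skip an archive that does hold a memento, which is the false-negative risk listed under Cons.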

Page 16: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Why Small Archives Matter?

● 400B+ web pages at IA do not cover everything

● Top three archives after IA produce full TimeMap 52% of the time (AlSum et al, TPDL 2013)

● Targeted crawls
● Special focus archives
● Restricted resources
● Private archives
● Censorship

Page 17: JCDL 2016 Doctoral Consortium - Web Archive Profiling

While the IA was Down...

$ memgator -f cdxj example.org | cut -c-4 | grep -v "^@" | uniq -c
      2 2002
      1 2005
      1 2008
      6 2009
     67 2010
     17 2011
     64 2012
    108 2013
    108 2014
    186 2015
     51 2016
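The shell pipeline above keeps the first four characters of each line, drops lines starting with `@`, and counts the remaining year prefixes. A rough Python equivalent is sketched below. The sample text is made up for illustration (it is not real MemGator output), and the sketch assumes only what the pipeline implies: metadata lines start with `@` and memento lines start with a datetime whose first four digits are the year.

```python
from collections import Counter

# Illustrative CDXJ-style TimeMap text; entries are invented for this sketch.
SAMPLE_CDXJ = """\
@meta {"note": "metadata lines start with @"}
20020120142510 {"uri": "http://archive.example/20020120142510/http://example.org/"}
20021103001122 {"uri": "http://archive.example/20021103001122/http://example.org/"}
20050304050607 {"uri": "http://archive.example/20050304050607/http://example.org/"}
"""


def mementos_per_year(cdxj_text):
    """Count memento lines per year, skipping blank and '@' metadata lines."""
    years = Counter()
    for line in cdxj_text.splitlines():
        if not line or line.startswith("@"):
            continue
        years[line[:4]] += 1  # same prefix that `cut -c-4` keeps
    return dict(sorted(years.items()))


for year, count in mementos_per_year(SAMPLE_CDXJ).items():
    print(count, year)
```

Unlike `uniq -c`, which counts adjacent runs, `Counter` gives per-year totals regardless of order; the two agree when the TimeMap is sorted by datetime.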

Page 18: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Research Questions

● What do individual web archives hold?
● How much do we need to know about an archive’s holdings?
● What is the optimal level of summarization for better accuracy and increased freshness?
● What are various ways to learn about archives’ holdings?
● How to store and update archives’ profiles to efficiently scale?

Page 19: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Archive Profile

● High-level summary of an archive
● Predicts presence of mementos of a URI-R in an archive
● Provides various statistics about the holdings
● Small in size
● Publicly available
● Easy to update and partially patch
● Useful for Memento query routing and other things

Page 20: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Profiling Policies

● Complete URI-R Profiling (1 URI-R = 1 Profile Key)
  ○ bbc.co.uk/images/logo.png?w=90
  ○ cnn.com/2014/03/15/?id=128734
● TLD-only Profiling (1 TLD = 1 Profile Key)
  ○ com)/
  ○ uk)/
● Middle Ground
  ○ uk,co)/
  ○ uk,co,bbc)/images
  ○ uk,co,bbc)/0/2/1
  ○ com,cnn)/ 201309 ar
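One way to derive keys such as `uk)/`, `uk,co)/`, and `uk,co,bbc)/images` from a URI-R is sketched below. The `profile_key` helper and its parameters are hypothetical; it simply reverses the host labels and keeps a chosen number of host and path segments, and it does not cover the digit-count or datetime and language variants shown in the list.

```python
from urllib.parse import urlsplit


def profile_key(uri, host_segments, path_segments=0):
    """Build a reversed-host key from the first N host labels and M path segments."""
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to find the host
    parts = urlsplit(uri)
    labels = parts.hostname.split(".")[::-1]            # bbc.co.uk -> uk, co, bbc
    segments = [s for s in parts.path.split("/") if s]  # drop empty segments
    host = ",".join(labels[:host_segments])
    path = "/".join(segments[:path_segments])
    return host + ")/" + path


uri = "bbc.co.uk/images/logo.png?w=90"
print(profile_key(uri, 1))     # uk)/
print(profile_key(uri, 2))     # uk,co)/
print(profile_key(uri, 3, 1))  # uk,co,bbc)/images
```

Keeping fewer segments yields fewer distinct keys and a smaller profile, at the price of coarser predictions.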

Page 21: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Available Profiling Resources

Client request

Archive Response

CDX Records

Page 22: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Profiling Strategies

● CDX Profiling
● Fulltext Search Profiling
● Sample URI Profiling
● Response Cache Profiling

Page 23: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Sample Profile

Page 24: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Probability Rank

Page 25: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Archives

Archive      URI-Rs   URI-Ms   Index Size
Archive-It   1.9B     5.3B     1.8TB
UKWA         0.7B     1.7B     0.5TB
Stanford     12M      25M      8.3GB

Page 26: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Sample Query Sets

Sample (1M URIs Each)   In Archive-It   In UKWA   In Stanford   Union {AIT, UK, SU}
DMOZ                    4.097%          3.594%    0.034%        7.575%
MementoProxy            4.182%          0.408%    0.046%        4.527%
IAWayback               3.716%          0.519%    0.039%        4.165%
UKWayback               0.108%          0.034%    0.002%        0.134%
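The percentages in this table can be turned into two quick figures per sample: the share of URIs found in none of the three archives, and the share that is counted more than once when the three per-archive columns are added up. The sketch below only does arithmetic on the table values; the dictionary keys and the `summarize` helper are illustrative names.

```python
# Percent of each 1M-URI sample found per archive, copied from the table.
SAMPLES = {
    "DMOZ":         {"AIT": 4.097, "UKWA": 3.594, "SU": 0.034, "union": 7.575},
    "MementoProxy": {"AIT": 4.182, "UKWA": 0.408, "SU": 0.046, "union": 4.527},
    "IAWayback":    {"AIT": 3.716, "UKWA": 0.519, "SU": 0.039, "union": 4.165},
    "UKWayback":    {"AIT": 0.108, "UKWA": 0.034, "SU": 0.002, "union": 0.134},
}


def summarize(row):
    """Return the double-counted share and the share held by no archive."""
    total = row["AIT"] + row["UKWA"] + row["SU"]
    return {
        "double_counted": round(total - row["union"], 3),
        "in_none": round(100.0 - row["union"], 3),
    }


for name, row in SAMPLES.items():
    print(name, summarize(row))
```

In every sample more than 92% of the URIs are in none of the three archives, so a broadcast request to these archives would usually come back empty.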

Page 27: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Evaluation

● Generate profiles with 23 policies
● Relate CDX Size, URI-M, URI-R, and URI-Key
● Analyze profile growth
● Estimate Relative Cost
● Evaluate Routing Efficiency

Page 28: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Resource Requirement

Page 29: JCDL 2016 Doctoral Consortium - Web Archive Profiling

CDX Size vs URI-M (UKWA 10 Years)

Alpha: 175 bytes per CDX line

Page 30: JCDL 2016 Doctoral Consortium - Web Archive Profiling

URI-M vs URI-R (UKWA 10 Years)

Gamma: 2.46
K: 2.686
Beta: 0.911
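The slide does not state how K and Beta are used. One plausible reading, offered here purely as an assumption, is a power-law fit of the form URI-R ≈ K · URI-M^Beta. The sketch below evaluates that assumed formula for the 181M URI-Ms reported for the seven-year UKWA subset elsewhere in this deck; `estimate_uri_r` is a hypothetical helper.

```python
K = 2.686     # value shown on the slide
BETA = 0.911  # value shown on the slide


def estimate_uri_r(uri_m):
    """Estimate unique URI-Rs from a URI-M count under the ASSUMED power-law fit."""
    return K * uri_m ** BETA


estimate = estimate_uri_r(181e6)
print(f"estimated URI-Rs: {estimate:.3g}")
```

Under this assumption the estimate lands in the same order of magnitude as the 96M URI-Rs reported for that subset, but the formula itself remains a guess.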

Page 31: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Space Cost (UKWA 7 Years)

Phi: 8.5e-07 -- 0.70583

Page 32: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Time Cost (UKWA 7 Years)

Tau: 5.7e-05 -- 6.2e-05
CDX: 45GB
URI-Ms: 181M
URI-Rs: 96M
Time: 3 hours
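The unit of Tau is not given on the slide. If it is read as seconds of profiling time per URI-M (an assumption), multiplying it by the 181M URI-Ms gives a range that brackets the reported 3 hours. The `hours` helper below is a hypothetical name for that arithmetic.

```python
TAU_LOW, TAU_HIGH = 5.7e-05, 6.2e-05  # range shown on the slide
URI_MS = 181e6                        # URI-M count shown on the slide


def hours(tau, uri_ms):
    """Profiling time in hours, ASSUMING tau is seconds per URI-M."""
    return tau * uri_ms / 3600


print(f"{hours(TAU_LOW, URI_MS):.2f} to {hours(TAU_HIGH, URI_MS):.2f} hours")
```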

Page 33: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Archive-It

Page 34: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Fulltext Search Cost

Page 35: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Partial Knowledge

Page 36: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Cost vs Accuracy

Group   Policies                        Cost                  Accuracy
G1      H1P0/TLD                        Bound by # of TLDs    ≈ 0.01
G2      H3P0, DDom, DSub, DPth, DQry    < 0.01                ≈ 0.78
G3      DIni                            ≈ 2 * G2              ≈ 0.88
G4      HxP1                            ≈ 5 * G3              ≈ 0.94
G5      Higher HmPn                     0.4 -- 0.7            Not Explored
G6      URIR                            1.0                   1.0

Page 37: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Work Plan

✓ Baseline Profiling Through CDX Files
✓ Profile Serialization
✓ Fulltext Search Profiling
✓ Sample URI Dataset
➢ Instrumenting Memento Aggregator
➢ Multidimensional Profiling

Page 38: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Publications

TPDL15 Web Archive Profiling Through CDX Summarization

TCDL15 Profiling Web Archives - For Efficient Memento Query Routing

IJDL16 Web Archive Profiling Through CDX Summarization

JCDL16 Poster: MemGator - A Portable Concurrent Memento Aggregator

TPDL16 Web Archive Profiling Through Fulltext Search

RFC Object Resource Stream (ORS) and CDX-JSON (CDXJ) Formats

C4LJ MemGator - A Portable Concurrent Memento Aggregator Architecture

JCDL17 Scalable, Maintainable, and Extensible Web Archive Profile Serialization for Efficient Lookup

JCDL17 URI, Time, and Language Profiling from Live Archives via URI Sampling and Fulltext Search

SIGIR17 Memento Aggregator Routing Based on Probability Distribution of Memento Availability with Archive Profiles

IJDL17 Archive X-Ray - Web Archive Profiling for Efficient Memento Aggregation

Page 39: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Future Work

● Language profiles
● Evaluation of combination profiles, such as URI-Key along with Datetime
● Use archive profiles to generate a rank-ordered list of archives
● Profiles for uses other than Memento routing, such as site-classification-based profiles (e.g., news, wiki, social media, blog)

Page 40: JCDL 2016 Doctoral Consortium - Web Archive Profiling

Conclusions

● Generated profiles with different policies for three archives
● Examined cost-precision tradeoffs of various policies
● Related CDX Size, URI-M, URI-R, and URI-Key
● Gained up to 80% routing accuracy with <1% relative cost while maintaining 0.9 recall