Kulturarwfai.nu/wp-content/uploads/FAI_Janson.pdf• Uppdatera robotprogramvaran till NetarchiveSuite och Heritrix 3.2 • Sluta respektera robots.txt-filer • Wayback Machine för

Kulturarw3
THE SWEDISH WEB: PRESERVATION & ACCESS

#FAI2016
When: Tuesday 25 October 2016, 14:00-14:45
Who: Daniel Jansson
Where: Konferens informationsförvaltningen 2016

Kungliga biblioteket (National Library of Sweden): history

• Legal deposit of printed material since 1661
  - Updated in 1993, only for electronic documents in fixed formats (CD-ROM, floppy disks)
• Legal deposit of audiovisual media since 1979
• The first Swedish web newspaper has been lost
• Kulturarw3 started in the summer of 1996
• E-plikt (electronic legal deposit) since 1 January 2015

Then: Goals

• All web pages in Sweden
  - images, video, etc.
  - .se, .nu and Swedish material under other top-level domains
  - Suecana (foreign material of Swedish interest)

Then: Strategy

• With as little human involvement as possible.
• Take snapshots of the Swedish web a couple of times a year.
  - Captures "everything"
  - Less labour-intensive
  - Computer storage is cheap
  - Drawback: the large volumes make quality control difficult
• Since June 2002: selective harvesting (also snapshots). About 150 daily newspapers every day.

Sweden on the web?

http://www.kb.se/kbstart.htm

Only the domain name is relevant

• .se
• .nu (Niue), popular here in Sweden
• Others: if the server is located in Sweden or the domain has a Swedish owner
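The scoping rule above can be sketched roughly as follows. This is a minimal illustration, not the logic actually used by Kulturarw3: the names `SWEDISH_TLDS`, `host_of` and `in_scope` are hypothetical, and the "server in Sweden or Swedish owner" rule is assumed to be supplied as a hand-maintained set of host names.

```python
from urllib.parse import urlsplit

# Hypothetical names; only the rule itself comes from the slide.
SWEDISH_TLDS = {"se", "nu"}


def host_of(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower().rstrip(".")


def in_scope(url, known_swedish_hosts=frozenset()):
    """Decide from the domain name alone whether a URL counts as Swedish.

    known_swedish_hosts stands in for the "server located in Sweden or
    Swedish owner" rule (an assumption: that list is maintained elsewhere).
    """
    host = host_of(url)
    if not host:
        return False
    tld = host.rsplit(".", 1)[-1]
    return tld in SWEDISH_TLDS or host in known_swedish_hosts
```

With this sketch, `in_scope("http://www.kb.se/kbstart.htm")` is true because of the `.se` suffix, while a `.com` host is accepted only if it appears in `known_swedish_hosts`.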

Sweden on the web?

[Slide by Eld Zierau, The Royal Library of Denmark: "Looking Back, Looking Forward: New Strategies for Coverage of a National Web Sphere", IIPC 2016, Reykjavik, Iceland]

WebDanica project: tested different methods to find Danish webpages

• Internet Archive method: IA data, world-wide collection 2012 (Wide0005)
• NetArchive Link method: NL data, outlinks from Danish broad crawl 2012
• General implementation covering more methods

Very few common results between NL results and IA results (host: first part of the URL, http://abc.xx/def/ghi/...)

Only in NL: 46.552
Only in IA: 43.185
Both in IA and NAL: 2.014

How

• A robot collects web pages by automatically following links and saving the pages.
• Sweeps: the open-source robot Heritrix
  - Developed mainly by the Internet Archive (IA)
  - Written in Java. Large user community.
  - Explicitly designed for web harvesting (not indexing).
• Important! Indexing is not archiving, and archiving is not indexing!
• Also collects images, audio, etc.

How: Flowchart of the web robot

[Flowchart boxes: List of URLs; Inbox with list of URLs to be harvested; Distribute URLs; Harvesting threads; Archived data; Log with new links; Process log; New URLs; URLs already processed]
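The loop in the flowchart can be sketched as below. This is a simplified, single-threaded illustration and not how Heritrix is implemented: the function `crawl` and its parameters `fetch` and `extract_links` are hypothetical, and a real robot distributes the inbox over many harvesting threads.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, max_pages=1000):
    """Harvest pages breadth-first, starting from a list of seed URLs.

    fetch(url) returns the page content or None on failure, and
    extract_links(url, content) returns the URLs found in the page.
    Both are supplied by the caller (hypothetical interfaces).
    """
    inbox = deque(seeds)   # inbox with list of URLs to be harvested
    processed = set()      # URLs already processed
    archive = {}           # archived data: URL -> content

    while inbox and len(archive) < max_pages:
        url = inbox.popleft()
        if url in processed:
            continue       # never harvest the same URL twice
        processed.add(url)

        content = fetch(url)
        if content is None:
            continue       # failed download, nothing to archive

        archive[url] = content
        # The "log with new links": only unseen URLs go back to the inbox.
        for link in extract_links(url, content):
            if link not in processed:
                inbox.append(link)

    return archive
```

Keeping the set of processed URLs separate from the inbox is what stops the robot from looping forever on pages that link to each other.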

How: Finding links; parsing

[Example page "Royal Library of Sweden" with the text "Click on this link to see our visitors addresses." The parser finds two relative links and one absolute link, which resolve to www.kb.se/whitemarble.jpg, www.kb.se/address.html and www.kb.se/logo.gif. Parsing done!]
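A minimal sketch of this parsing step, using only the Python standard library. The HTML in `PAGE` is a hypothetical reconstruction of the example page (the slide does not preserve its markup, nor which of the links were relative), and `LinkExtractor` and `extract_links` are illustrative names. Heritrix itself is written in Java and extracts links in its own way.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect link targets from a page and resolve them to absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # href, src and background cover links, images and backgrounds.
            if value and name in ("href", "src", "background"):
                # urljoin leaves absolute links alone and resolves relative
                # links against the URL of the page they were found on.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical markup for the example page on the slide.
PAGE = """<html><body background="whitemarble.jpg">
<img src="/logo.gif"><h1>Royal Library of Sweden</h1>
Click on this <a href="http://www.kb.se/address.html">link</a>
to see our visitors addresses.
</body></html>"""

if __name__ == "__main__":
    for link in extract_links("http://www.kb.se/kbstart.htm", PAGE):
        print(link)
```

Every resolved URL would then be handed to the robot's inbox as a candidate for harvesting.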

Now: How much has been harvested

Round  Name    Number of files  Size
1      2010-1  240 866 031       9.45 TB
2      2013-1  717 887 978      53.92 TB
3      2013-2  844 741 844      67.78 TB
4      2014-1  702 393 955      59.78 TB
5      2014-2  678 218 510      63.93 TB

Number of objects: 5 000 000 000
Number of bytes: 350 TB


For comparison, all sweeps combined for the period 1997-03-24 to 2005-11-23: 469 million URLs, 17.0 TB

Now: What is being done

• Establishing three main tracks
• An annual harvest in KW3-bulk
• More types of websites harvested on the same model as KW3-dagstidningar (daily newspapers)
  – Mass media
  – KIA-/SiS-index
• Targeted harvesting for particular events

Now: What is being done

• Update the robot software to NetarchiveSuite and Heritrix 3.2
• Stop respecting robots.txt files
• Wayback Machine for access
• Link Kulturarw3 with e-plikt
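What "respecting robots.txt" means for a robot can be illustrated with a short sketch built on Python's standard `urllib.robotparser`. The helper names, the user agent string and the `respect_robots` switch are assumptions made for illustration; Heritrix and NetarchiveSuite have their own configuration for this.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def should_harvest(url, robots_txt, user_agent="example-archive-bot",
                   respect_robots=True):
    """Decide whether to harvest a URL.

    With respect_robots=False the robots.txt rules are ignored, which is
    the policy change listed on the slide.
    """
    if not respect_robots:
        return True
    return allowed_by_robots(robots_txt, user_agent, url)
```

For a robots.txt containing `Disallow: /private/`, a page under `/private/` is skipped by default but harvested once `respect_robots` is set to `False`.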

The future: Kulturarw3 & e-plikt

[Diagram: Kulturarw3 (robot harvesting) and e-plikt (premium material with metadata) are tied together through a catalogue record, http://libris.kb.se/bib/19433251, holding Version 1, Version 2, Version 3 and Version 4.]

Now: E-plikt

+ More versions
+ Premium material (material behind paywalls)
+ Better searchability
+ More and better metadata
- May lack context
- Limited in who is covered


KW3's strengths

Web 2.0

Web 2.0 and advertisements

SvD 1926

DN 1933 DN 1980


Headless browsing: Harvesting "Web 2.0"

(a) The live resource at URI-R http://www.truthinshredding.com/ loads A, B, and C via JavaScript.

(b) Using PhantomJS, the advertisement (B) and video (C) are found but the account frame (A) is missed.

(c) Using Heritrix, the embedded resources A, B, and C are missed.

Figure 1: Neither archival tool captures all embedded resources, but PhantomJS discovers the URI-Rs of two out of three embedded resources dependent upon JavaScript (B, C) while Heritrix misses all of them.

http%3A%2F%2Fwww.truthinshredding.com&gsrc=3p&ic=1&jsh=m%3B%2F_%2Fscs%2Fapps-static...

The page loaded into the iframe uses JavaScript to pull the profile image into the page from URI-R_A1

https://apis.google.com/_/scs/apps-static/_/ss/k=oz.widget.-ynlzpp4csh.L.W.O/m=bdg/am=AAAAAJAwAA4/d=1/rs=AItRSTNrapszOr4y_tKMA1hZh6JM-g1haQ

Embedded Resource B is an advertisement that uses the JavaScript at URI-R_B1

http://pagead2.googlesyndication.com/pagead/show_ads.js

to pull in ads to the page. Embedded Resource C is a YouTube video that is embedded in the page using the following HTML for an iframe:

.

PhantomJS does not load Embedded Resource A, potentially because the host resource completes loading before the page embedded in the iframe can finish loading. PhantomJS stops recording embedded URIs and monitoring the representation after a page has completed loading, and Embedded Resource A executes its JavaScript to load the profile picture after the main representation has completed the page load¹. PhantomJS does discover the advertisement (Embedded Resource B) and the YouTube video (Embedded Resource C). Even though the headless browser used by PhantomJS does not have the plugin necessary to display the video, the URI-R is still discovered by PhantomJS.

¹ PhantomJS scripts can be written to avoid this race condition using longer timeouts or client-side event detection, but this is outside the scope of this paper.

Heritrix fails to identify the URI-Rs for the Embedded Resources A, B, and C. When the memento created by Heritrix is loaded by the Wayback Machine, Embedded Resources A, B, and C are missing. This is attributed to Heritrix, which does not discover the URI-Rs for these resources during the crawl. When viewing the memento through the Wayback Machine, the JavaScript responsible for loading the embedded resources is executed, resulting in either a zombie resource (prima facie violative) or HTTP 404 response (incomplete) for the embedded URI.

Heritrix's inability to discover the embedded URI-Rs could be mitigated by utilizing PhantomJS during the crawl. However, this raises many questions, most notably: How much slower will the crawl time be? How many additional embedded resources could it recover and potentially need to store? Can we optimize the crawl approach based on the detection of deferred representations? Our investigation into these questions will assess the feasibility of combining Heritrix with PhantomJS to balance the speed of Heritrix with the completeness of PhantomJS.

5. COMPARING CRAWLS

We designed an experiment to measure the performance differences between a command-line archival tool (wget [12]), a traditional crawler (the Internet Archive's Heritrix Crawler [23, 30]), and a headless browser client (PhantomJS). Neither Heritrix nor wget execute the client-side JavaScript, while PhantomJS does execute client-side JavaScript.

We constructed a 10,000 URI-R dataset by randomly generating a Bitly URI and extracting its redirection target (identical to the process used to create the Bitly data subset in [1]). We split the 10,000 URI dataset into 20 sets of 500 seed URI-Rs and used wget, Heritrix, and PhantomJS to crawl each set of seed URI-Rs. We repeated each crawl ten times to establish an average performance, resulting in ten different crawls of the 10,000 URI dataset (executing the crawl on one of the 500-URI sets at a time) with wget, Heritrix, and PhantomJS. We measured the increase in frontier size (|F|) and the URIs per second (t_URI) to crawl the resource.

While Heritrix provides a user interface that identifies the crawl frontier size, PhantomJS and wget do not. We calculate the frontier size of PhantomJS by counting the number of embedded resources that PhantomJS requests when rendering the representation. We calculate the frontier size of wget by executing a command² that records the HTTP GET requests issued by wget during the process of mirroring a web resource and its embedded resources. We consider the frontier size to be the total number of resources and embedded resources that wget attempts to download.

We began a crawl of the same 500 URI-Rs using wget, Heritrix, and PhantomJS simultaneously to mitigate the impact of live Web resources changing state during the crawls. For example, if the representation changes (such as includes new embedded resources) in between the times wget, PhantomJS, and Heritrix perform their crawls, the number or representations of embedded resources may change and therefore the representation influenced the crawl performance, not the crawler itself.

We crawled live-Web resources because mementos inherit the limitations of the crawler used to create them. Depending on crawl policies, a memento may be incomplete and different than the live resource. The robots.txt protocol [27, 35], breadth- versus depth-first crawling, or the inability to crawl certain representations (like deferred representations as we discuss in this paper) can all influence the mementos created during a crawl.

5.1 Crawl Time by URI

To better understand how crawl times of wget, PhantomJS, and Heritrix differ, we determined the time needed to execute a crawl. Heritrix has a browser-based user interface that provides the URIs/second (t_URI) metric. We collected this metric from the Web interface for each crawl. We used Unix system times to calculate the crawl time for each PhantomJS and wget crawl by determining the start and stop times for dereferencing each resource and its embedded resources. We compare the wget, PhantomJS, and Heritrix crawl times per URI in Figure 2 and Table 1. Heritrix outperforms PhantomJS, crawling 2.065 URIs/s while PhantomJS crawls 0.170 URIs/s and wget crawls 0.864 URIs/s. Heritrix crawls, on average, 12.13 times faster than PhantomJS and 2.39 times faster than wget.

The performance difference comes from two aspects of the crawl. First, Heritrix executes crawls in parallel with multiple threads being managed by the Heritrix software; this is not possible with PhantomJS on a single core machine since PhantomJS requires access to a headless browser and its associated JavaScript engine, and parallelization will result in process and threading conflicts. Second, Heritrix does not execute the client-side JavaScript and only adds URIs that are extracted from the Document Object Model (DOM), embedded style sheets, and other resources to its frontier.

Figure 2: Heritrix crawls 12.13 times faster than PhantomJS. The error lines indicate the standard deviation across all ten runs.

² We executed wget -T 40 -o outfile -p -O headerFile [URI-R], which downloads the target URI-R and all embedded resources and dumps the HTTP traffic to headerFile.

5.2 URI Discovery and Frontier Size

We performed a string-matching de-duplication (that is, removing duplicate URIs) to determine the true frontier size (|F|).

            Crawl time (URIs/s)    Frontier size
Crawler     t_URI    s_tURI        |F|        s_|F|
wget        0.864    0.855         129,443    3,213.65
Heritrix    2.065    0.137         302,961    1,219.82
PhantomJS   0.170    0.001         531,484    2,036.92

Table 1: Mean and standard deviation of crawl time (URIs/s) and frontier size for wget, Heritrix, and PhantomJS crawls of 10,000 seed URIs.

As shown in Figure 3 and in Table 1, we found that PhantomJS discovered and added 1.75 times more URI-Rs to its frontier than Heritrix, and 4.11 times more URI-Rs than wget. Per URI-R, PhantomJS loads 19.7 more embedded resources than Heritrix and 32.4 more embedded resources than wget. The superior PhantomJS frontier size is attributed to its ability to execute JavaScript and discover URIs constructed and requested by the client-side scripts.

However, raw frontier size is not the only performance metric for assessing the quality of the frontier. PhantomJS and Heritrix discover some of the same URIs, while PhantomJS discovers URIs that Heritrix does not and Heritrix discovers URIs that PhantomJS does not. We measured the union and intersection of the Heritrix and PhantomJS frontiers. As shown in Figure 4(a), per 10,000 URI-R crawl Heritrix finds 39,830 URI-Rs missed by PhantomJS on average, while PhantomJS finds 194,818 URI-Rs missed by Heritrix per crawl on average. PhantomJS and Heritrix find 63,550 URI-Rs in common between the two crawlers. The wget crawl resulted in a frontier of 24,589 URI-Rs, which was a proper subset of both the Heritrix and PhantomJS frontiers.

Figure 3: PhantomJS discovers 1.75 times more embedded resources than Heritrix and 4.11 times more resources than wget. The averages and error lines indicate the standard deviation across all ten runs.

(a) A portion of Heritrix, PhantomJS, and wget frontiers overlap. PhantomJS and Heritrix identify URIs that the others do not.

(b) The frontier of URI-Rs unique to PhantomJS shrinks when only considering the host and path aspects (Base Policy for matching) of the URI-R.

Figure 4: Heritrix, PhantomJS, and wget frontiers as an Euler Diagram. The overlap changes depending on how duplicate URIs are identified.

Figure 5: Frontier size grows linearly with seed size.

Figure 6: Crawl speed is dependent upon frontier size.

This analysis shows that PhantomJS finds 19.70 more embedded resources per URI than Heritrix (Figure 5). Heritrix runs 12.13 times faster than PhantomJS (Figure 6). Note that the red axes in Figures 5 and 6 are unmeasured and only projections of the measured trends, with the projections predicting the performance as the seed list size grows.

5.3 Frontier Properties

During the PhantomJS crawls, we observed that PhantomJS discovers session-specific URI-Rs that Heritrix misses and Heritrix discovers Top Level Domains (TLDs) that PhantomJS misses, presumably from Heritrix's inspection of JavaScript. For example:

http://dg.specificclick.net/?y=3&t=h&u=http%3A%2F%2Fmisscellania.blogspot.com%2Fstorage%2FTwitter-2.png...

from PhantomJS versus

http://dg.specificclick.net/

from Heritrix. The uniquely Heritrix URI-Rs are potentially the base of a URI to be further built by JavaScript. Because PhantomJS only discovers URIs for which the client issues HTTP requests, this URI-R is not discovered by PhantomJS. To determine the nature of the differences between the Heritrix and PhantomJS frontiers, we analyzed the union and intersection between the URI-Rs in the frontiers using different matching policies (Figure 4(b)).
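The effect of a matching policy on the overlap between two frontiers can be sketched as follows. This is an illustration only, not the code used in the paper: `full_key`, `base_key` and `compare_frontiers` are hypothetical names, and the Base Policy is assumed to mean "compare host and path, ignore the query string", as described in the caption of Figure 4(b).

```python
from urllib.parse import urlsplit


def full_key(uri):
    """Strict policy: two URIs match only if the strings are identical."""
    return uri


def base_key(uri):
    """Base policy (assumed): match on host and path, ignoring the query."""
    parts = urlsplit(uri)
    return (parts.hostname or "").lower() + parts.path


def compare_frontiers(frontier_a, frontier_b, key):
    """Count URIs unique to each frontier and shared, under a matching policy."""
    a = {key(uri) for uri in frontier_a}
    b = {key(uri) for uri in frontier_b}
    return {"only_a": len(a - b), "only_b": len(b - a), "both": len(a & b)}


if __name__ == "__main__":
    # Hypothetical two-URI frontiers modelled on the example above.
    phantomjs = ["http://dg.specificclick.net/?y=3&t=h",
                 "http://example.com/a.js"]
    heritrix = ["http://dg.specificclick.net/",
                "http://example.com/a.js"]
    print(compare_frontiers(phantomjs, heritrix, full_key))
    print(compare_frontiers(phantomjs, heritrix, base_key))
```

Under the strict policy the session-specific URI and its bare base count as two different resources; under the base policy they collapse into one, which is why the set unique to PhantomJS shrinks.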

During a crawl of 500 URI-Rs by PhantomJS, 19,022 URI-Rs were added to the frontier for a total of 19,522 URI-Rs

Harvesting: Comment fields

Access: Wayback Machine

Access: Memento

Access: Catalogue records

Access: Emulation

The future: Data mining

Questions?
