Between Dream and Reality: Open Data of the Czech Web Archive


Webarchiv: Memorial of the Czech Internet

OpenAlt 2016
Between Dream and Reality.
Open Data of the Czech Web Archive.

http://www.slideshare.net/webarchivCZ/presentations

Why do we archive the web?
Who archives the web, and how?

Metadata

Rudolf.Kreibich@nkp.cz
head of application support, National Library of the Czech Republic (NK ČR)

Why do we archive the web?

“… more than 70% of the URLs in the Harvard Law Review and 50% of the URLs in opinions of the Supreme Court of the United States point to a web resource that no longer exists.”

Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Jonathan Zittrain, Kendra Albert and Lawrence Lessig. Legal Information Management, Volume 14, Issue 02, June 2014, pp. 88-99. DOI: http://dx.doi.org/10.1017/S1472669614000255. Published online: 12 June 2014.

404 Not Found

The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists. A 404 status code does not indicate whether this lack of representation is temporary or permanent; the 410 (Gone) status code is preferred over 404 if the origin server knows, presumably through some configurable means, that the condition is likely to be permanent.

A 404 response is cacheable by default; i.e., unless otherwise indicated by the method definition or explicit cache controls (see Section 4.2.2 of [RFC7234]).
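
When surveying link rot, the distinction quoted above between 404 and 410 can be kept as separate categories. The following is only a minimal sketch: the function name `classify_status` and the category labels are illustrative assumptions, not part of any standard or of the talk.

```python
from http import HTTPStatus


def classify_status(code: int) -> str:
    """Map an HTTP status code to a coarse link-rot category (labels are illustrative)."""
    if code == HTTPStatus.GONE:
        # 410: the server states that the resource is likely gone for good.
        return "gone-permanently"
    if code == HTTPStatus.NOT_FOUND:
        # 404: missing, but the response does not say whether temporarily or permanently.
        return "missing-unknown-duration"
    if 200 <= code < 300:
        return "alive"
    return "other"


for code in (200, 404, 410, 503):
    print(code, classify_status(code))
```

In practice many servers answer 404 even for permanently removed pages, so the "gone-permanently" bucket will likely undercount.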

✝uri

“It is easier to find a copy of a film from 1924 than a website from 1994.”

M.S. Ankerson. “Writing web histories with an eye on the analog past.” 2012. http://nms.sagepub.com/content/14/3/384.full.pdf+html

“Will it be possible to study our century without web archives?”

Ian Milligan, Professor in the Department of History at the University of Waterloo.

Who archives the web, and how?

“Universal access to all knowledge.” Brewster Kahle

IIPC | International Internet Preservation Consortium

Membership composition

2 regional libraries
32 national libraries (including the Czech Republic)
3 non-profit organizations
9 research organizations or universities

http://netpreserve.org/about-us/members

Heritrix / OpenWayback
harvesting / access

Open-source software
International community
https://github.com/iipc/openwayback

https://github.com/internetarchive/heritrix3

The dark age of JavaScript

“Brozzler is a distributed web crawler (爬虫) that uses a real browser (chrome or chromium) to fetch pages and embedded urls and to extract links.”

https://github.com/internetarchive/brozzler

Heritrix harvests 2065 URL/s
PhantomJS harvests 172 URL/s

=>

scale out the JS interpreters
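
To put the two throughput figures side by side, here is a rough back-of-the-envelope sketch. It assumes the two measurements are directly comparable and that throughput grows linearly with the number of interpreter instances; neither assumption is stated on the slide.

```python
import math

HERITRIX_URLS_PER_S = 2065   # figure from the slide
PHANTOMJS_URLS_PER_S = 172   # figure from the slide

# How many times slower the browser-based harvest is.
slowdown = HERITRIX_URLS_PER_S / PHANTOMJS_URLS_PER_S

# Instances needed to match Heritrix under the linear-scaling assumption.
instances_needed = math.ceil(slowdown)

print(f"slowdown factor: {slowdown:.1f}x")
print(f"PhantomJS instances needed to match Heritrix: {instances_needed}")
```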

Monthly selective harvests

Occasional topical harvests

Semiannual harvests of the .cz domain (in cooperation with nic.cz)

… since 2001

~ 221 TB

~ 6 billion digital objects / URLs

~ 1.2 million .cz domains

less than 1% is freely accessible

=

~ 4,738 websites out of 1.2 million
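
The figures above can be checked with a few lines of arithmetic. This is only a sketch: it treats "TB" as decimal terabytes (10^12 bytes), which the slide does not specify, and the average object size is a derived estimate rather than a number from the talk.

```python
ACCESSIBLE_SITES = 4738          # from the slide
TOTAL_DOMAINS = 1_200_000        # ~1.2 million, from the slide
ARCHIVE_BYTES = 221 * 10**12     # ~221 TB, assuming decimal terabytes
OBJECT_COUNT = 6 * 10**9         # ~6 billion objects, from the slide

# Share of harvested domains that are freely accessible.
share = ACCESSIBLE_SITES / TOTAL_DOMAINS

# Rough average size of one archived object.
avg_object_bytes = ARCHIVE_BYTES / OBJECT_COUNT

print(f"freely accessible: {share:.2%}")
print(f"average object size: {avg_object_bytes / 1000:.1f} kB")
```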

Operations | gradual move to Infrastructure as Code

The good side of the Force

Ansible
Vagrant
Packer

Docker?…

The dark and seductive side

VMware vCenter
IBM GPFS

http://arquivo.pt/search.jsp?l=en&query=prase

“The Common Crawl corpus contains petabytes of data collected over the last 7 years. It contains raw web page data, extracted metadata and text extractions. The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. From Public Data Sets, you can download the files entirely free using HTTP or S3. As the Common Crawl Foundation has evolved over the years, so has the format and metadata that accompany the crawls themselves.”

http://commoncrawl.org/the-data/get-started/

“In my view, Google does not archive; it caches.”

me, on more than one occasion

metadata

WARC | ISO 28500:2009 | currently under revision

WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID:
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID:
WARC-Concurrent-To:
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
WARC-Truncated: length
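
A WARC record header like the one above is a version line followed by `Name: value` fields, so it can be read with very little code. The sketch below is a simplified illustration, not a full WARC parser: the helper name `parse_warc_headers` is hypothetical, the sample contains only a few of the fields shown, and continuation lines and the record payload are ignored.

```python
def parse_warc_headers(block: str) -> tuple[str, dict[str, str]]:
    """Split a WARC header block into its version line and its named fields."""
    lines = [ln for ln in block.strip().splitlines() if ln.strip()]
    version = lines[0].strip()
    fields: dict[str, str] = {}
    for line in lines[1:]:
        # Split at the first colon only, so URLs in the value stay intact.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


SAMPLE = """\
WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
Content-Length: 43428
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
"""

version, fields = parse_warc_headers(SAMPLE)
print(version, fields["WARC-Type"], fields["WARC-Target-URI"])
```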

Wayback CDX Server API
plain text or JSON array of the CDX data

urlkey: org,archive
timestamp: 19970126045828
original: http://www.archive.org:80
mimetype: text/html
statuscode: 200
digest: Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY
length: 1415

https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
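
In the plain-text output, each capture is one line of space-separated values in the field order listed above. A minimal sketch of turning such a line into a named record follows; the helper name `parse_cdx_line` is hypothetical, and the sample line simply joins the values from the slide.

```python
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]


def parse_cdx_line(line: str) -> dict[str, str]:
    """Turn one space-separated CDX line into a field-name -> value mapping."""
    values = line.split()
    if len(values) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(values)}")
    return dict(zip(CDX_FIELDS, values))


# Values taken from the slide, joined in the plain-text layout.
line = ("org,archive 19970126045828 http://www.archive.org:80 "
        "text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415")
record = parse_cdx_line(line)
print(record["timestamp"][:4], record["mimetype"], record["statuscode"])
```

All values are kept as strings here; converting `statuscode` and `length` to integers is left to the caller because some captures may carry placeholder values instead of numbers.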

WAT | Metadata about archived objects | JSON

WARC-Header-Metadata:
  WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
  WARC-Type: response
  WARC-Date: 2014-08-02T09:52:13Z
  …
Payload-Metadata:
  HTTP-Response-Metadata:
    Headers:
      Content-Language:
      Content-Encoding:
      ...
    HTML-Metadata:
      Head:
        Title: BBC NEWS | Africa | Namibia braces for Nujoma exit
        …
        Metas:
          name: keywords
          content: BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service
          …
      Links:
        href: /css/screen/shared/styles.css
        path: STYLE/#text
        …

http://commoncrawl.org/the-data/get-started/ https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat

https://webarchive.jira.com/wiki/display/Iresearch/archive-metadata-extractor.jar
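
Because WAT metadata is deeply nested JSON, a small helper for walking the nesting keeps extraction code readable. The sketch below uses a hand-built sample that follows the outline on the slide; it is not a complete WAT record, the exact nesting in real files is an assumption here, and the helper name `dig` is hypothetical.

```python
from typing import Any


def dig(data: dict[str, Any], *keys: str, default: Any = None) -> Any:
    """Walk nested dictionaries, returning `default` if any key is missing."""
    current: Any = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hand-built sample mirroring the slide's outline; not a complete WAT record.
wat = {
    "WARC-Header-Metadata": {
        "WARC-Target-URI": "http://news.bbc.co.uk/2/hi/africa/3414345.stm",
        "WARC-Type": "response",
    },
    "Payload-Metadata": {
        "HTTP-Response-Metadata": {
            "HTML-Metadata": {
                "Head": {"Title": "BBC NEWS | Africa | Namibia braces for Nujoma exit"},
                "Links": [{"href": "/css/screen/shared/styles.css", "path": "STYLE/#text"}],
            }
        }
    },
}

html = ("Payload-Metadata", "HTTP-Response-Metadata", "HTML-Metadata")
title = dig(wat, *html, "Head", "Title")
hrefs = [link["href"] for link in dig(wat, *html, "Links", default=[])]
print(title)
print(hrefs)
```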

WAT | Metadata about archived objects | JSON
Server response

"Headers" : {
  "Date" : "Sat, 02 Aug 2014 09:52:13 GMT",
  "Cache-Control" : "max-age=0",
  "Connection" : "close",
  "Expires" : "Sat, 02 Aug 2014 09:52:13 GMT",
  "Content-Type" : "text/html",
  "Server" : "Apache",
  "Vary" : "X-CDN",
  "Set-Cookie" : "BBC-UID=15730d9c1b741c0d3942e2aca1317fbf39e57b90be68a329d375ba9d5a8964080CCBot%2f2%2e0%20%28http%3a%2f%2fcommoncrawl%2eorg%2ffaq%2f%29; expires=Sun, 02-Aug-15 09:52:13 GMT; path=/; domain=bbc.co.uk;"

http://commoncrawl.org/the-data/get-started/ https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat

https://webarchive.jira.com/wiki/display/Iresearch/archive-metadata-extractor.jar

WET | Extracted full text

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: <urn:uuid:007d632a-ab5a-4c4e-afc2-c455066a82de>
WARC-Refers-To: <urn:uuid:ffbfb0c0-6456-42b0-af03-3867be6fc09f>
WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
Content-Type: text/plain
Content-Length: 6724

BBC NEWS | Africa | Namibia braces for Nujoma exit
[an error occurred while processing this directive]
…
Your news when you want it
News Front Page
Africa
…
Hausa Portuguese Africa More
Last Updated: Thursday, 22 January, 2004, 00:48 GMT
E-mail this to a friend
Printable version
…
Swapo has been careful to secure the Ovambo vote by ploughing a large slice of development funding into the region, and the people there get more than their fair share of government positions.
For the moment, Mr Nujoma's biggest headache is land reform. Huge tracks of land are still owned by a few white farmers and black Namibians are impatient at the slow pace of reform. White farmers say they are falling over backwards to please the government, but Mr Pahamba says that they are only handing over poor quality land. Meanwhile, the militant black farmer's union is threatening farm occupations similar to those in Zimbabwe.
Guard dogs…

http://commoncrawl.org/the-data/get-started/ https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat

LGA | Metadata for relationships between URLs over time

ID-Map

url: https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en
surt_url: com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw
id: 294869

Example:
{"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en","surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}

ID-Graph

timestamp: 20150209052911
id: 294869
outlink_ids: 31, 31366, 62596, 91594, 91595, …

Example:
{"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366,62596,91594,91595,129599, …]}

https://webarchive.jira.com/wiki/display/ARS/LGA+Overview+and+Technical+Details
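
The ID-Map and ID-Graph files are meant to be joined: the map resolves numeric IDs to URLs, and the graph lists each node's outgoing links at a given time. Below is a minimal sketch of that join, assuming both files are JSON Lines as in the examples above. The first line of each sample comes from the slide (with the outlink list shortened); the remaining ID-Map entries are hypothetical and exist only so that some outlinks resolve.

```python
import json

# First ID-Map line is the slide's example; the other two entries are hypothetical.
id_map_lines = [
    '{"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en",'
    '"surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}',
    '{"url":"http://example.org/","surt_url":"org,example)/","id":31}',
    '{"url":"http://example.com/a","surt_url":"com,example)/a","id":31366}',
]
# Slide's ID-Graph example with the outlink list shortened.
id_graph_lines = [
    '{"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366,62596]}',
]

# Build the lookup table: numeric ID -> URL.
url_by_id = {rec["id"]: rec["url"] for rec in map(json.loads, id_map_lines)}

# Resolve each node's outlinks into (timestamp, source URL, target URL) edges.
edges = []
for node in map(json.loads, id_graph_lines):
    source = url_by_id.get(node["id"], f"<unknown id {node['id']}>")
    for target_id in node["outlink_ids"]:
        target = url_by_id.get(target_id)  # None when the ID is not in our map
        if target is not None:
            edges.append((node["timestamp"], source, target))

for edge in edges:
    print(edge)
```

For real datasets the ID-Map will not fit comfortably in a Python dictionary; a key-value store or a database join would be the more realistic choice.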

WANE | Extracted named entities

url: http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/?like_comment=79&_wpnonce=0fc57aa499&replytocom=93
timestamp: 20141019212346
named_entities:
  locations: North County, America, St. Louis County St. Louis County Police St. Louis County, WordPress.com, Middle East, …
  organizations: Twitter Facebook Google, Google, Facebook, Wal-Mart, CNN, Bearcats, …
  persons: Stell, Tom Jackson, Smith, Pamela Fillingim, Darren Wilson Eric Fowler Eric Vickers Ferguson Ferguson, Ferguson, …
digest: sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX

Extracted with the Stanford Named Entity Recognizer (NER)

http://nlp.stanford.edu/software/CRF-NER.shtml
https://webarchive.jira.com/wiki/display/ARS/WANE+Overview+and+Technical+Details
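
One simple use of WANE data is counting how often each entity appears across captures. The sketch below assumes each record carries `named_entities` as a mapping from entity type to a list of names, which is an interpretation of the flattened slide rather than a confirmed schema. The first record reuses a few names from the slide with a shortened URL; the second record is hypothetical.

```python
from collections import Counter

# Hand-built records in the assumed shape; the second one is hypothetical.
records = [
    {
        "url": "http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/",
        "named_entities": {
            "locations": ["North County", "America", "Middle East"],
            "organizations": ["Google", "Facebook", "CNN"],
            "persons": ["Tom Jackson", "Darren Wilson"],
        },
    },
    {
        "url": "http://example.org/post",
        "named_entities": {
            "locations": ["America"],
            "organizations": ["CNN", "Google"],
            "persons": [],
        },
    },
]

# One counter per entity type, accumulated over all records.
counts: dict[str, Counter] = {}
for record in records:
    for entity_type, names in record["named_entities"].items():
        counts.setdefault(entity_type, Counter()).update(names)

for entity_type, counter in counts.items():
    print(entity_type, counter.most_common(2))
```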

NameTag / CNEC 2.0 | WANE?
http://ufal.mff.cuni.cz/nametag

https://ufal.mff.cuni.cz/cnec/cnec2.0

Open nsfw model

“This repo contains code for running Not Suitable for Work (NSFW) classification deep neural network Caffe models.”

https://github.com/yahoo/open_nsfw/blob/master/

audio2text

NameTag / CNEC 2.0 | WANE?
http://ufal.mff.cuni.cz/nametag

https://ufal.mff.cuni.cz/cnec/cnec2.0

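How should the metadata be made available?

bulk data
bulk data in S3

API
web service

What can be done with the metadata?

evolution of formats on the web
evolution of interlinking between websites
evolution of NSFW websites within the domain
evolution of the graphics-to-text ratio on the web
evolution of web technologies
…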
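
As an illustration of the first idea, the evolution of formats can be approximated by grouping capture metadata by year and MIME type. This is only a sketch on hypothetical `(timestamp, mimetype)` pairs of the kind a CDX index provides; none of the sample values come from the Czech web archive.

```python
from collections import Counter, defaultdict

# Hypothetical (timestamp, mimetype) pairs, e.g. taken from CDX records.
captures = [
    ("19970126045828", "text/html"),
    ("19970301120000", "image/gif"),
    ("20140802095213", "text/html"),
    ("20140802095300", "image/png"),
    ("20140802095301", "image/png"),
]

# Count captures per year and MIME type; the first four digits are the year.
by_year: dict[str, Counter] = defaultdict(Counter)
for timestamp, mimetype in captures:
    by_year[timestamp[:4]][mimetype] += 1

# Print each year's format mix as shares of that year's captures.
for year in sorted(by_year):
    total = sum(by_year[year].values())
    shares = {mime: round(n / total, 2) for mime, n in by_year[year].items()}
    print(year, shares)
```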

Web Archiving Department | ODIF | NK ČR

Head: Jaroslav Kvasnica
Curators: Marie Haškovcová, Monika Holoubková, Markéta Hrdličková
IT Operations: Rudolf.Kreibich@nkp.cz

webarchiv.cz
facebook.com/webarchivcz
slideshare.net/webarchivCZ
github.com/webarchivcz
