Digital Libraries: History, Technology, R&D
Edward A. Fox Professor, Computer Science, Virginia Tech
Blacksburg, VA 24061 USA [email protected] http://fox.cs.vt.edu
6 Jan. 2014
Outline: Acknowledgments, Introduction, History, Technology, Research, Development, Summary and Discussion
HTTP://WWW.QU.EDU.QA/
HTTP://WWW.TAMU.EDU/ HTTP://WWW.PSU.EDU/ HTTP://WWW.VT.EDU/
Funding provided thru the ELISQ project: Electronic Library Institute - SeerQ
Sponsored by Qatar University & Qatar National Library
HTTP://qnl.qa
ELISQ Project Team
Qatar University, Qatar: Mohammed Samaka (Ph.D., Co-Lead PI), Sumaya Ali S A Al-Maadeed (Ph.D., PI), Myrna Tabet, Asad Nafees, Tahseena Moideen
Virginia Tech, USA: Edward Fox (Ph.D., Lead-PI), Tarek Kanan
Penn. State University, USA: C. Lee Giles (Ph.D., PI), Sagnik Ray Choudhury
Texas A&M, USA: Richard Furuta (Ph.D., PI), Hamed Alhoori
Qatar National Library, Qatar: Claudia Lux (PI), Krishna Roy Chowdhury, Postdoc - TBA
Consultants: John Impagliazzo (Ph.D., Key Investigator), Susan Lukesh (Ph.D.), Carole Thompson
This project was made possible by NPRP Grant # 4-029-1-007 from the Qatar National Research Fund (a member of Qatar Foundation).
Acknowledgements
Dr. Mazen Hasna, VP and Chief Academic Officer, Qatar University
Dr. Rashid Alammari, Dean, College of Engineering, Qatar University
Dr. Moumen Hasnah, Director of Academic Research, Qatar University
Dr. Claudia Lux, Qatar National Library Director
Dr. Imad Bachir, Qatar University Library Director
Dr. Munir Tag, Acting Director Technical, ICT Program Manager (QNRF)
Ms. Krishna Roy Chowdhury, Associate Director for Library IT, Qatar National Library
Prof. Sebti Foufou, Head of Department of Computer Science and Engineering, Qatar University
Additional Thanks
Qscience (providing collection): Christopher J. Leonard, Editorial Director; Paul Coyne, CTO
US National Science Foundation (recent and current grants to Fox): IIS-1319578, IIS-0916733, DUE-0840719, OCI-1032677, plus those to PSU, TAMU
Outline: Acknowledgments, Introduction, History, Technology, Research, Development, Summary and Discussion
Introduction
Reasons to be here
o Interested
o Find what to do with your content
o Find how to help your user community
http://www.morganclaypool.com/toc/icr/1/1
1. DL Introduction, 5S framework (2012)
2. DL Quality, Integration (2013)
3. DL Technologies (in press)
4. DL Applications (in press)
DLs Shorten the Chain
[Figure: chain diagram with the nodes Author, Reader, Digital Library, Editor, Reviewer, Teacher, Learner, Librarian]
Digital Library Content
Content types:
o Text documents: articles, reports, books
o Video, audio: speech, music
o Geographic information: (aerial) photos
o Software, programs: models, simulations
o Bio information: genome (human, animal, plant)
o Images and graphics: 2D, 3D, VR, CAT
Content Based Information Retrieval
Digital Library Reference Model 1.0 p. 30 of 234
Informal 5S DL Definitions
DLs are complex systems that:
o help satisfy info needs of users (societies)
o provide info services (scenarios)
o organize info in usable ways (structures)
o present info in usable ways (spaces)
o communicate info with users (streams)
Information Life Cycle
[Figure: life cycle diagram with the labels Authoring, Modifying; Organizing, Indexing; Storing, Retrieving; Distributing, Networking; Retention / Mining; Accessing, Filtering; Using; Creating]
Infrastructure Services
o Repository-Building, Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting
o Repository-Building, Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format)
o Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language)
Information Satisfaction Services
o Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing
SeerSuite is Not Google
Metadata (as in library catalogs) as well as content
Sets of collections, rather than the Web as a whole
o Provided by a curator (e.g., publisher, museum)
o Provided by user submissions
o Or collected by focused ‘crawling’
Tailored services, rather than the same for everyone
o Browsing using categories, preserving, adding value
o Based on studying user requirements, e.g., chemists
Working with entities, rather than just words
o Citations, tables, figures, names, chemical formulas
o Using knowledge bases, machine learning, artificial intelligence
Outline: Acknowledgments, Introduction, History, Technology, Research, Development, Summary and Discussion
History Overview
Since 1991, growing especially out of information retrieval
Connecting the computer, library, and information science communities
NSF DL Initiative 1 in 1994 included funding for Stanford, where Google was prototyped
International conferences in the Americas (JCDL), as well as Europe (TPDL, by DELOS) and Asia (ICADL)
Publishers: ACM, …
DOIs, (Institutional) Repositories
Spinoffs: content & courseware management systems
Recently including (linked) data
www.nsdl.org
Institutional Repositories
“Institutional repositories are digital collections that capture and preserve the intellectual output of a single university or a multiple institution community of colleges and universities.”
Crow, R. “Institutional repository checklist and resource guide”, SPARC, Washington, D.C., USA
www.arl.org/sparc/IR/IR_Guide_v1.pdf
NDLTD: www.ndltd.org
Networked Digital Library of Theses and Dissertations (NDLTD)
Vision: Every thesis and dissertation in the world is:
o Devised to take advantage of the most helpful electronic publishing methods
o Shared globally and easily found
o Supported by a suite of digital library services to aid authors, researchers, learners, universities
o Preserved and migrated permanently
Crisis, Tragedy, and Recovery (CTR) Network / Integrated Digital Event Archive & Library (IDEAL)
Human tragedies that result from man-made and natural events affect people and communities significantly.
During and after a tragic event, there is a series of needs that have to be addressed.
o Compounded by communication failures and a confusing plethora of data and information
CTRnet (Crisis, Tragedy & Recovery Net) Disaster Locations
CTRnet (Crisis, Tragedy & Recovery Net) Word Clouds of Japan Earthquake and Libya Revolution (using tweets)
[Figure: word clouds for the Libya Revolution and for the Japan Earthquake, Tsunami Disaster; updated every 10 minutes]
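A word cloud built from tweets comes down to counting terms and sizing each word by its count. The following is a minimal sketch of that counting step only, not CTRnet's code. The sample tweets, the stop-word list, and the function name `term_weights` are all invented for illustration, and collecting tweets or refreshing on a schedule is not shown.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "in", "of", "to", "and", "is", "for", "on", "after"}


def term_weights(tweets, top_n=5):
    """Return the top_n (term, count) pairs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase, then keep runs of letters as tokens.
        for token in re.findall(r"[a-z']+", tweet.lower()):
            if token not in STOP_WORDS and len(token) > 2:
                counts[token] += 1
    return counts.most_common(top_n)


# Invented sample tweets standing in for a real collection.
tweets = [
    "Earthquake hits Japan, tsunami warning issued",
    "Tsunami warning for the coast after the earthquake",
    "Japan earthquake: rescue teams on the way",
]

weights = term_weights(tweets, top_n=4)
print(weights)
```

A cloud renderer would then map each count to a font size. Rerunning the count over the newest tweets at a fixed interval would give the periodic refresh the slide mentions.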
CTR stakeholders
CINET: Network Science Middleware
Netviz: a course project that aims to develop a visualization component for CINET, which contains large network graphs. The visualization service will get networks from CINET, convert them from Galib to GEXF format, and then visualize the graphs using Gephi.
[Figure: CINET network displayed using Gephi]
Outline: Acknowledgments, Introduction, History, Technology, Research, Development, Summary and Discussion
Web Archiving
Introduction: Web archiving is the process of gathering data recorded on the World Wide Web, storing it, ensuring that it is preserved in an archive, and making the collected data available for future research.
The Internet Archive and several national libraries initiated Web archiving practices in 1996.
Crawler (Heritrix) (for search engines & Web archives)
A Web crawler starts with a list of URLs to visit, called the seeds.
On those pages, it identifies all the hyperlinks, adds them to the list of URLs to visit, and recursively visits the pages they point to, according to a set of policies.
It prioritizes its downloads, since some pages change often.
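The seed-and-frontier loop described above can be sketched in a few lines. This is an illustrative toy, not Heritrix code. The `crawl` function, the `PAGES` dictionary standing in for the Web, and the policy used (stay on allowed hosts, stop after `max_pages`) are assumptions made for the example. Revisit prioritization for pages that change often is not modeled.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, allowed_hosts, max_pages=100):
    """Breadth-first crawl from the seeds; returns URLs in visit order.

    fetch(url) is a caller-supplied function returning HTML text or None.
    The policy is illustrative: stay on allowed hosts, stop at max_pages.
    """
    frontier = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid repeats
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if urlparse(link).hostname in allowed_hosts and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# A tiny in-memory "Web" (hypothetical pages) replaces real HTTP fetching.
PAGES = {
    "http://example.qa/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.qa/a": '<a href="/b">B</a> <a href="http://other.org/">X</a>',
    "http://example.qa/b": '<a href="/">home</a>',
}

order = crawl(["http://example.qa/"], PAGES.get, {"example.qa"})
print(order)
```

Swapping the FIFO queue for a priority queue keyed on expected change rate would be one way to approximate the prioritization the slide mentions.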
Focused Crawlers
For a particular topic or event, to build a Web collection focused on that area
Start with URLs of interest, viewed as seeds to grow from
Expand in a ‘smart’ way to get all and only what is relevant
Use information retrieval / artificial intelligence / machine learning
o Require ‘knowledge bases’ and/or human training examples
Nevertheless, there is a tradeoff between the resulting
o Recall (i.e., coverage of what is out there)
o Precision (i.e., freedom from noise in what is collected)
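The recall/precision tradeoff can be made concrete with a small calculation. In the sketch below, the page ids, relevance scores, and thresholds are invented for illustration. A real focused crawler would get its scores from a classifier or knowledge base, as the slide notes.

```python
def precision_recall(collected, relevant):
    """Precision and recall of a collected set against the truly relevant set."""
    collected, relevant = set(collected), set(relevant)
    hits = len(collected & relevant)
    precision = hits / len(collected) if collected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def collect(scores, threshold):
    """Keep every page whose (hypothetical) relevance score meets the threshold."""
    return {page for page, score in scores.items() if score >= threshold}


# Invented relevance scores for five candidate pages.
scores = {"p1": 0.9, "p2": 0.8, "p3": 0.6, "p4": 0.4, "p5": 0.2}
relevant = {"p1", "p2", "p4"}  # pages a human judged to be on topic

strict = precision_recall(collect(scores, 0.7), relevant)
loose = precision_recall(collect(scores, 0.3), relevant)
print("strict threshold:", strict)
print("loose threshold:", loose)
```

With these numbers the strict threshold collects only relevant pages but misses one of them, while the loose threshold finds every relevant page at the cost of admitting an off-topic one.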
SeerSuite Instantiations
CiteSeerX http://citeseerx.ist.psu.edu
o A scientific literature digital library and search engine
ChemXSeer http://chemxseer.ist.psu.edu
o Portal for researchers in environmental chemistry integrating the scientific literature with experimental, analytical, and simulation results and tools
ArchSeer http://archseer.ist.psu.edu/
o Archeology literature
TableSeer
o ANY fields with tables
CiteSeerX http://citeseerx.ist.psu.edu
o 3 M documents
o Millions of files
o 60 M citations
o 3 to 6 M authors
o 2 to 4 M hits per day
o 100K documents added monthly
o 800K individual users
o Several Tbytes
CiteSeerX crawls researcher homepages on the web for scholarly papers, formerly in computer science
o Converts PDF to text
o Automatically extracts OAI metadata and other data
o Automatic citation indexing, links to cited documents, creation of document page, author disambiguation
o Software is open source and can be used to build other such tools
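Citation indexing can be pictured as inverting each document's reference list, so that a cited work points back to the documents citing it. The sketch below assumes references have already been extracted as title strings. The crude `normalize` key and the sample records are invented for illustration and are not CiteSeerX's actual matching method.

```python
import re
from collections import defaultdict


def normalize(title):
    """Crude matching key: lowercase, collapse anything but letters and digits."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def build_citation_index(documents):
    """Map each cited-title key to the sorted ids of the documents citing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for reference in doc["references"]:
            index[normalize(reference)].add(doc_id)
    return {key: sorted(ids) for key, ids in index.items()}


# Invented records standing in for parsed papers.
documents = {
    "d1": {
        "title": "Focused Crawling",
        "references": ["Digital Libraries: An Overview"],
    },
    "d2": {
        "title": "Metadata Extraction",
        "references": ["digital libraries - an overview", "Focused crawling."],
    },
}

index = build_citation_index(documents)
cited_by = index[normalize("Digital Libraries: An Overview")]
print(cited_by)
```

Because both spellings of the overview title normalize to the same key, the two citing documents are grouped together. Real reference strings vary far more, which is why the slide lists disambiguation as a separate step.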
SeerSuite
Tool kit used to build search engines and digital libraries
o CiteSeerX, MyCiteSeerX, ChemXSeer, ArchSeer, AlgoSeer, AckSeer, BizSeer, CSSeer, CollabSeer, RefSeer, GrantSeer, SeerSeer, YouSeer, etc.
o Built on commercial grade open source tools (Solr/Lucene)
o Penn State expertise: automated specialized metadata extraction
Supports research in
o Indexing and search
o Data mining & structures
o Information and knowledge extraction
o Social networks: name/entity disambiguation
o Scientometrics/informetrics
o Systems engineering
o User interface design (HCI = human-computer interaction)
o Software engineering and management
ChemXSeer Highlights
Portal for academic researchers in chemistry which integrates the scientific literature with experimental, analytical and simulation results and tools
Provides unique metadata extraction, indexing and searching pertinent to the chemical literature by using heuristics combined with machine learning
o Chemical formulae and names
o Tables
o Figures
o Publication functions as in CiteSeerX
o Expert and expertise search
After extraction, data are stored and made accessible to users as XML through an API.
Hybrid repository: serves as a federated information interoperational system
o Scientific papers crawled and indexed from the web
o User submitted papers and datasets (e.g. Excel worksheets, Gaussian and CHARMM toolkit outputs)
o Scientific documents and metadata from publishers, web or archives
Access control for proprietary provided content and user-submitted experiment data
Takes advantage of in-house open source projects such as CiteSeerX/SeerSuite.
Example Formula Search
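Formula search depends on first spotting formula-like strings in text. The sketch below shows only a heuristic pattern-matching step, under the assumption of a small, hand-picked element list. It is not ChemXSeer's extractor, which the slides describe as combining heuristics with machine learning. The names `find_formulas` and `composition` are hypothetical.

```python
import re
from collections import Counter

# Small illustrative subset of element symbols (an assumption, not a full table).
ELEMENTS = ["H", "C", "N", "O", "S", "P", "K", "Na", "Cl", "Ca", "Mg", "Fe"]

# Two-letter symbols go first so "Cl" is tried before "C".
SYMBOL = "|".join(sorted(ELEMENTS, key=len, reverse=True))

# A candidate formula: at least two element tokens, each with an optional count.
FORMULA = re.compile(rf"\b(?:(?:{SYMBOL})\d*){{2,}}\b")
TOKEN = re.compile(rf"({SYMBOL})(\d*)")


def find_formulas(text):
    """Return formula-like strings found in text, in order of appearance."""
    return FORMULA.findall(text)


def composition(formula):
    """Break a formula into element counts, e.g. for indexing by element."""
    counts = Counter()
    for symbol, number in TOKEN.findall(formula):
        counts[symbol] += int(number) if number else 1
    return dict(counts)


text = "Samples of NaCl and H2SO4 were dissolved in H2O; CO2 was released."
found = find_formulas(text)
print(found)
print(composition("H2SO4"))
```

A pattern like this also matches ordinary capitalized words such as "NO" or "OK", which is one reason a learned classifier is needed on top of the heuristic.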
Outline: Acknowledgments, Introduction, History, Technology, Research, Development, Summary and Discussion
Users -- TAMU
Requirements (content, services)
Practices (scholarly, information seeking)
Social framework (collaboration, recommendation)
Interviews, surveys
Evaluations: usability, benefits
Infrastructure -- PSU
Computers, software, launching infrastructure at:
o QU: powerful server, now crawling + ready to help any group interested in curating a collection
o VT, QNL (postdoc), QCRI (Prof. Mitra), …
Adapt to disciplines, interesting parts of documents
Adapt to each collection
o Develop knowledge base and heuristics for the collection
o Change document parser
o Change database to match what occurs
o Change extractors: document -> database
Arabic -- VT
Handle Arabic text documents
o Obtain a suitable category/classification system
o Have people provide a ‘training set’
o Use machine learning to automatically classify future Arabic text documents
Support cross-language information retrieval
o Arabic question against English documents
o English question against Arabic documents
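The classification workflow above (a labeled training set, then automatic classification of new documents) can be illustrated with a tiny multinomial Naive Bayes classifier. This is a sketch of the general technique, not the project's chosen method. The two categories and the four Arabic training sentences are invented, and real Arabic text would also need normalization and stemming, which are omitted here.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest Laplace-smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        counts = word_counts[label]
        denominator = sum(counts.values()) + len(vocab)
        score = math.log(doc_counts[label] / total_docs)
        for word in text.split():
            score += math.log((counts[word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training set: two sports sentences and two economy sentences.
training_set = [
    ("فاز الفريق في المباراة", "sports"),
    ("سجل اللاعب هدفا في المباراة", "sports"),
    ("ارتفع سعر النفط في السوق", "economy"),
    ("البنك يرفع سعر الفائدة", "economy"),
]

model = train(training_set)
print(classify("فاز اللاعب في المباراة", model))
print(classify("سعر النفط", model))
```

Whitespace splitting is the weakest assumption here, since Arabic prefixes and suffixes attach to words. Any serious system would tokenize and normalize before counting.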
Arabic Handwriting -- QU
Images of historic documents
Arabic text extracted
Mapping from a part of the text to the corresponding part of the image
Special tools for
o Those processing the original documents
o Those doing research with the collection
Will allow work on non-textual collections too, e.g., museum images, set of photos for teaching architecture
Accessible Collections in Qatar -- QNL
What collections have the highest priority?
What special handling is needed for each class, for each subclass of collection type?
How do DLs best fit into the activities of the National Library?
Can .qa be fully archived for Wayback Machine use?
Outline: Acknowledgments, Introduction, History, Technology, Research, Development, Summary and Discussion
DL Curriculum Framework
Course structure
o Semester 1: DL collections: development/creation
o Semester 2: DL services and sustainability
Core DL topics
o Digitization, storage, interchange
o Digital objects, composites, packages
o Metadata, cataloging, author submission
o Naming, repositories, archives
o Spaces (conceptual, geographic, 2/3D, VR)
o Architectures (agents, buses, wrappers/mediators), interoperability
o Services (searching, linking, browsing, etc.)
o Intellectual property rights mgmt., privacy, protection (watermarking)
o Archiving and preservation, integrity
Related topics
o Documents, e-publishing, markup
o Info. needs, relevance, evaluation, effectiveness
o Thesauri, ontologies, classification, categorization
o Bibliographic information, bibliometrics, citations
o Routing, filtering, community filtering
o Search & search strategy, info seeking behavior, user modeling, feedback
o Info summarization, visualization
o Multimedia streams/structures, capture/representation, compression/coding
o Content-based analysis, multimedia indexing
o Multimedia presentation, rendering
Modules
http://en.wikiversity.org/wiki/Curriculum_on_Digital_Libraries
o Table 1: Core DL Modules
o Table 2: Information Retrieval Packages
o Table 3: Big Data
o Table 4: Multimedia Software
Like lesson plans, for a training session or lecture
Can be used for self-study, refreshers
http://curric.dlib.vt.edu/modDev/modDev.html
ELISQ Audience (Users)
Primary:
o Librarians and libraries in Qatar
o Researchers and academics
o Government organizations
o Non-Governmental organizations (such as http://www.fsd.org.qa/)
Secondary:
o University / School Students
o Teachers / Faculty
o Managers
o Qatari citizens
o Other stakeholders
http://elisq.qu.edu.qa/
ELISQ Project (1 of 2)
Project Objectives/Aims
A. Research and prototype digital library systems and infrastructure for Qatar, focusing initially on Qatari information related to government and scholarly activities.
Leverage the crawling engine from Penn State's SeerSuite software infrastructure, and extend it beyond its current focus on English to support Arabic-English collections, and to cover a broad range of scholarly disciplines and all types of government information.
ELISQ Project (2 of 2)
Project Objectives/Aims (continued)
B. Research and build the digital library community in Qatar, supporting digital library use, services, collection development, tailored systems, and advancing toward a Knowledge Society.
Study scholarly activities, and engage in community building in Qatar, so DLs can be tailored to specific domains and to the unique needs of Qatar. Through workshops, a consulting center at the proposed Institute, and collaborative efforts with libraries and museums in Qatar, we will identify particular needs and uses, and tailor collections, systems, and services, to lead toward the Qatari Knowledge Society.
Significance to Librarians, Corporations, and Governmental Agencies
The need to preserve cultural and historical heritage =>
o Collections of fragile and precious artifacts =>
o Libraries, museums, and archives developing digital collections =>
o Users from all over the world accessing and studying
A one stop search of:
o Information about Qatar
o Information to preserve the culture of Qatar
Deep indexing, analysis, and retrieval of:
o Resources, reports, statistics, and other types of information
o Information in the Arabic language as well as in English
ELISQ Content
Metadata, data, and many types of documents (including full text)
Qatari resources that first appeared in digital form - ‘born’ digital
At a later stage the project will include:
o Digital versions of material already existing in print
o Multimedia (image, audio, video) forms
Free and open as well as content with limited access
ELISQ Focus
Community in Qatar
o Identify interested stakeholders, to tailor to needs
o Train next generation of digital librarians, archivists, and curators
o Partners helping with additional collection development
Advanced Technology for Enhanced Access
o “Low hanging fruit” by crawling Qatar-related Web
o Improved analysis (citations, tables, chemicals, …)
o Support for both Arabic and English
Outline: Acknowledgments, Introduction, History, Technology, Research, Development, Summary and Discussion
Summary (some highlights)
Introduction to digital libraries: 5S, any content
History: since 1991, Google, repositories
Technology: SeerSuite, Heritrix, Solr, HCI
o Initial collections: Qscience, news, …
Research: extend SeerSuite; Arabic
o Adapt other tools for handwriting collection, non-text collections
Development: consulting center (addressing needs)
Questions for You
What communities should be served?
What collections should be made accessible?
What services are required?
What are the priorities in the above?
Can you help us find suitable partners, content owners, curators, user groups?
Questions for Us?
http://elisq.qu.edu.qa/
http://fox.cs.vt.edu