Vincent S. Smith
The Virtual Taxonomist
Scholarly communication for the facebook generation
Goal…
• Inventory the Earth’s species
• Document their relationships
• “Publish” these data
Data set…
• 1.8M described species (10M names)
• 300M pages (over last 250 years)
• 1.5-3B specimens
People…
• 4-6,000 scientists
• 30-40,000 amateurs
• Many more citizen scientists?
Taxonomy
The foundation of biology
1.8 million species
• Animals: 1.18M spp.
• Plants: 260k spp.
• Other: 193k spp.
• Fungi: 101k spp.
• Bacteria: 9,021 spp.
• Archaebacteria: 259 spp.
Taxonomy is parochial
Information sits in the “long tail” of a power distribution
1.8 million species
• Animals: 1.18M spp.
- Insects: 0.82M spp.
- Molluscs: 117k
- Crustaceans: 39k
- Fish: 25k
- Flatworms: 13.7k
- Birds: 10k
- Sponges: 10k
- Cnidarians: 9k
- Reptiles: 7.1k
- Mammals: 5k
- Amphibians: 5k
- Rotifers: 1.8k
• Insects: 0.82M spp.
- Beetles: 370k spp.
- Bees, wasps & ants: 198k spp.
- Butterflies & moths: 165k spp.
- Flies: 85k spp.
• Papers per species per year
- 0.01 papers per species per year, i.e. 1 paper every 100 years
- Birds: 1 paper per species per yr.
- Mammals: 2 papers per species per yr.
- Elephants: 47 papers per species per yr.
Taxonomy is slow
Most life on earth is still undescribed
Timeline: 1758 to 2008 = 250 yrs; 2008 to 3008 = 1000 yrs!!! (?)
250 years and counting!
The story so far…
• Estimates range from 5-100 million species (prob. 80% undescribed)
• At present rates, most species will be extinct before we get to describe them
• Most descriptions are formulaic; the publication process is slow and involves paper archival
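The “1000 yrs” figure on the timeline appears to follow from the numbers already given. The sketch below is a reconstruction, not the author's stated calculation: it assumes the average description rate of the last 250 years simply continues, and takes the “80% undescribed” estimate at face value. The function name is hypothetical.

```python
def years_to_finish(described, years_elapsed, fraction_undescribed):
    """Extrapolate the historical description rate to the undescribed remainder."""
    rate = described / years_elapsed                  # species described per year
    total = described / (1.0 - fraction_undescribed)  # implied total species
    remaining = total - described
    return rate, remaining / rate


# 1.8M species described between 1758 and 2008; assume 80% still undescribed.
rate, years_left = years_to_finish(1_800_000, 2008 - 1758, 0.80)
print(f"{rate:,.0f} species per year, about {years_left:,.0f} more years")
# prints: 7,200 species per year, about 1,000 more years
```

With the upper estimate of 100 million species the same arithmetic gives a far longer horizon, so the result is only as good as the 80% assumption.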
Most biodiversity (data) is hidden
Taxonomy is hard to find
People & data distributed & highly fragmented
• Small communities working on biodiversity
- Just 4-6,000 taxonomists worldwide
• So is the data we use & publish
- 1.5-3B specimens worldwide (type specimens)
- 300M pages spanning 250 yrs. (all still relevant)
• We use different methods of citation (pp.)
Mol. Phyl. Evol.: 21,964 pp. since 2000
Menopon gallinae, Numidicola antennatus, Amyrsidea ventralis, Somaphantus lusius, Menacanthus stramineus, Colimenopon urocolius, Trinoton anserinum, Meromenopon meropis, Gruimenopon longum, Hoazineus armiferus, Copocephalum zebra, Comatomenopon elbeli/elongatum, Psittacomenopon poicephalus, Odoriphila clayae/phoeniculi, Ardeiphilus trochioxus, Cuculiphilus fasciatus, Ciconiphilus quadripustulatus, Eomenopon denticulatum, Piagetiella bursaepelecani, Osborniella crotophagae, Hohorstiella lata, Neomenopon pteroclurus, Machaerilaemus laticorpus/latifrons, Austromenopon crocatum, Eidmanniella pellucida, Holomenopon brevithoracicum, Dennyus hirundinis, Myrsidea victrix, Ancistrona vagelli, Pseudomenopon pilosum, Bonomiella columbae, Chapinia robusta, Plegadiphilus threskiornis, Actornithophilus uniseriatus, MEGAMENOPON, Rediella mirabilis, Latumcephalum lesouefi/macropus, Paraboopia flava, Paraheterodoxus insignis, Boopia tarsata, Therodoxus oweni, Laemobothrion maximum, Ricinus fringillae, Trochiliphagus abdominalis, Trochiloecetes rupununi, Liposcelis bostrychophilus
• Publications are data rich
DATA
• Linked by taxonomic names
What does this all mean…
• Taxonomy is an information science (formulaic, data rich, parochial, underfunded)
• Taxonomy lends itself to the Web
Getting taxonomy on the Web
Tackling the problems of the taxonomic community
• Scratchpads: Web publishing for taxonomists
• Biodiversity Heritage Library: Digitising heritage literature
• Encyclopedia of Life: A web page for every species
• Plazi.org & iPhylo: Data mining contemporary literature
Biodiversity Heritage Library (BHL)
“Digitising biodiversity literature”
• Biodiversity publications since 1469
- 5.4 million books
- 800,000 monographs
- 40,000 periodicals
• Held by Natural History libraries
- E.g., NHM holds more than 1M books, 250k monographs & periodicals, 0.5M artworks
• BHL partnership of 10 Nat. Hist. libraries
• Sharing the digitisation of contents
• Focus on out of copyright materials
• Partnership with “Internet Archive”
• Make the contents “findable”
Biodiversity Heritage Library (BHL)
“Digitising biodiversity literature”
1. Scan (photograph)
- 1 scribe machine, 3,500 pages per shift per day
- 34 scribe machines now in operation
2. Extract text (OCR)
3. Find keywords
- Taxonomic names
- Author names
- Citations
- Collection data
- Morphological data
- Descriptions
- Identification keys
- Illustrations
- Photographs
4. Index
5. Put on the web
Palma, R.L., and R.L.C. Pilgrim. 2002. A revision of the genus Naubates (Insecta: Phthiraptera: Philopteridae). J. R. Soc. N.Z. 32:7-60.
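As a rough illustration of the “find keywords” step, the sketch below pulls a few fields out of the citation shown on the slide using regular expressions. It is a toy under stated assumptions, not BHL's actual method: the patterns and the function name `extract_keywords` are hypothetical, and real OCR text would need far more robust name recognition.

```python
import re

# The citation from the slide, with spacing restored.
CITATION = (
    "Palma, R.L., and R.L.C. Pilgrim. 2002. A revision of the genus "
    "Naubates (Insecta: Phthiraptera: Philopteridae). "
    "J. R. Soc. N.Z. 32:7-60."
)


def extract_keywords(text):
    """Naive, illustrative extraction of year, genus, classification and pages."""
    year = re.search(r"\b(1[5-9]\d\d|20\d\d)\b", text)
    genus = re.search(r"\bgenus\s+([A-Z][a-z]+)", text)
    # A parenthesised, colon-separated run of capitalised words is assumed
    # to be a higher classification.
    ranks = re.search(r"\(([A-Z][a-z]+(?::\s*[A-Z][a-z]+)+)\)", text)
    pages = re.search(r"(\d+):(\d+)-(\d+)\.?\s*$", text)
    return {
        "year": int(year.group(1)) if year else None,
        "genus": genus.group(1) if genus else None,
        "classification": (
            [part.strip() for part in ranks.group(1).split(":")] if ranks else []
        ),
        "volume": int(pages.group(1)) if pages else None,
        "pages": (int(pages.group(2)), int(pages.group(3))) if pages else None,
    }


keywords = extract_keywords(CITATION)
print(keywords)
```

Patterns like these break quickly on old fonts and OCR errors, which is one reason the slides list OCR quality and better indexing among the challenges.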
Biodiversity Heritage Library (BHL)
“Digitising biodiversity literature”
http://www.biodiversitylibrary.org/
• NHM, London
- 1 scribe machine
- >500k pages
- Focus on exceptionally rare text
• Completed to date:
- 3,802 periodicals (journals)
- 9,181 books
- 5.5 million pages (2% of total)
• Challenges
- Copyright (1923 USA)
- OCR quality (old fonts)
- Better indexing
- Foreign language content
- Needs a critical mass of content to be useful
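The scanning figures quoted in the slides can be combined into a back-of-envelope outlook. This is a sketch under loud assumptions that the slides do not state: that all 300M pages are in scope, and that each of the 34 scribe machines runs exactly one 3,500-page shift every day. The function name is hypothetical.

```python
def scanning_outlook(total_pages, pages_done, machines, pages_per_machine_per_day):
    """Return (fraction scanned, pages scanned per day, days left at that pace)."""
    fraction_done = pages_done / total_pages
    daily_capacity = machines * pages_per_machine_per_day
    days_left = (total_pages - pages_done) / daily_capacity
    return fraction_done, daily_capacity, days_left


# Figures from the slides: 300M pages in total, 5.5M scanned so far,
# 34 machines at 3,500 pages per shift (one shift per day is an assumption).
fraction_done, daily_capacity, days_left = scanning_outlook(
    300_000_000, 5_500_000, 34, 3_500
)
print(f"{fraction_done:.1%} done, {daily_capacity:,} pages/day, "
      f"{days_left / 365:.1f} years left")
```

The computed fraction is consistent with the “2% of total” on the slide; the time estimate is optimistic because copyright restrictions and downtime are ignored.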
Data mining taxonomic publications
“Extracting factual information”
- Taxonomic names
- Author names
- Citations
- Collection data
- Morphological data
- Descriptions
- Identification keys
- Illustrations
- Photographs
Palma, R.L., and R.L.C. Pilgrim. 2002. A revision of the genus Naubates (Insecta: Phthiraptera: Philopteridae). J. R. Soc. N.Z. 32:7-60.
Data mining taxonomic publications
Experimental extraction of factual information
• Plazi.org (D. Agosti et al): manual, slow but accurate
- Article (hand selected)
- Repository (DSpace)
- GoldenGate (manual software)
- Approx. 26 nested fields (TaxonX-XML)
- RSS + TAPIR
• iPhylo (R. Page): automatic, fast but dirty
- “Library” (legal & minable)
- Entity-Attribute-Value Model (database)
- Crawler scripts & web services
- Approx. 12? data objects
- Data visualizations
“A database of everything!”
RDF + RSS
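The Entity-Attribute-Value model named on this slide stores every fact as one (entity, attribute, value) row, so new kinds of data need no schema change. The sketch below shows the general idea with an in-memory SQLite table; the table layout, identifiers such as `pub:1`, and the helper names are illustrative assumptions, not the actual iPhylo schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (entity TEXT, attribute TEXT, value TEXT)")

# Hypothetical facts, loosely based on the citation used elsewhere in the slides.
facts = [
    ("pub:1", "type", "publication"),
    ("pub:1", "year", "2002"),
    ("pub:1", "journal", "J. R. Soc. N.Z."),
    ("pub:1", "mentions_taxon", "Naubates"),
    ("taxon:Naubates", "type", "taxon"),
    ("taxon:Naubates", "order", "Phthiraptera"),
    ("taxon:Naubates", "family", "Philopteridae"),
]
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", facts)


def describe(entity):
    """Collect all attribute/value pairs recorded for one entity."""
    cur = conn.execute(
        "SELECT attribute, value FROM eav WHERE entity = ? ORDER BY attribute",
        (entity,),
    )
    return dict(cur.fetchall())


def entities_with(attribute, value):
    """Find every entity carrying a given attribute/value pair."""
    cur = conn.execute(
        "SELECT entity FROM eav WHERE attribute = ? AND value = ? ORDER BY entity",
        (attribute, value),
    )
    return [row[0] for row in cur]


print(describe("taxon:Naubates"))
print(entities_with("mentions_taxon", "Naubates"))
```

The trade-off is that flexibility comes at the cost of type checking and query convenience, which fits the slide's “fast but dirty” label.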
Encyclopedia of Life (EOL)
“A web page for every species”
http://www.eol.org/
• A web page for all 1.8M species
• Multi-institution collaboration
• $50m funding (5 years)
- MacArthur and Sloan Foundations
• Megascience mashup
- Aggregating data from the web
• Multiple audiences
- Science & outreach
• 10 years to complete
- First draft 2008, “finished” 2017!
Encyclopedia of Life (EOL)
“A web page for every species”
• Huge interest
- 11.5 million hits in first 5 hours
- 500+ press articles
- Pages unavailable for first two days!
• First draft 27 Feb. 2008
- 24 “exemplar” pages
- 30,000 detailed pages (fish & amphib.)
- 1 million “stubs” (names & links)
• Much praise but some criticism
- Growth (needs 1,000 spp. per day)
- Quality vs. quantity of information
- Authoritative “vetting” process
- Credit for “authors”
• Nine more years to go
- Get more content online
- Better tools to engage more people
What is a Scratchpad?
“A Website & publishing platform for taxonomic communities”
1. Your data
2. Uploaded & tagged
3. Published & reviewed on your site
Fast, intuitive, fit for use
What can Scratchpads do?
Import, manage, search & browse:
• DNA & Phylogenies
• Specimens
• Literature
• Images
Integration & connectivity within & between sites:
• Taxonomy
• DNA & Phylogenies
• Specimens
• Literature
• Images
Current Scratchpads
Ants, Bees, Beetles, Big-headed flies, Birds, Blackflies, Ciliates, Cockroaches, Dragon Trees, Dung Beetles, False Buttonweed, Flat worms, Flies, Foraminifera, Fossil Insects, Fungus Gnats, Holometabola, Leaf-miner Flies, Lice, Lichens of Bermuda, Malvaceae, Megalastrum ferns, Milichiid flies, Mosquitoes, Mosses, Nannotax fossils, Nepticuloid moths, Palms, Pearl oysters, Polychaete worms, Scaleworms, Stick insects, Sulawesi Ferns, Termites, Triticid grasses, Weevils, Wood Ferns
Sites: 61, Users: 665, Pages: 130k (since March 2007)
Scratchpad applications
A multipurpose, flexible technology
• eBooks: 4th Edition Howard & Moore, Birds of the world (fact checking, data compilation, 2010, funding)
• eJournals: European Mosquito Bulletin (ISSN 1460-6127), Phasmid Studies (ISSN 0966-0011) (submission, review, & dissemination of articles)
• Image galleries: Nanno fossils, Cockroaches, Stick insects, Flatworms, Grasses, Lichens & many more… (rapid upload, annotation, & display of images)
Scratchpad usage
Content & contributors in the first 15 months
129,896 pages, 665 contributors (June 24 2008)
• Pages:
- Across 61 sites
- In detail: Definitions (41%), References (26%), Associations (8.5%), DNA sequences (6%), Images (4.5%), Maps (2.8%), Specimens (2.1%), Others (1.3%)
• Contributors:
- No more than 10% significantly active
- Contributors in more than 30 countries
- In detail: Europe (55%), Unknown (29%), North America (9%), Asia (3%), Australasia (2.5%), South America (2%), Russia (0.8%), Middle East (0.4%) [Jan. 08]
Scratchpad visitors
Tracking visitors across sites (March 2008)
Popular content: what visitors are looking at
The “long tail” of taxonomy
Visitors want less of more, i.e. everyone wants something different
Scratchpad overview
Scratchpads are integrating taxonomy
“Small pieces loosely joined”
• Scratchpads: Web publishing for taxonomists
• Biodiversity Heritage Library: Digitising heritage literature
• Encyclopedia of Life: A web page for every species
• Plazi.org & iPhylo: Data mining contemporary literature
Integrating taxonomy
Questions?
Scratchpad management
Scalable & sustainable technology
Hardware, software & user support
Virtual machine, open-source software, self-archiving, backed-up, multi-site configuration (easy to move & upgrade, secure & reliable, citable, screencasts, low admin., low marginal costs)
Scratchpad bibliometrics
Metrics of output and use
• Content: 130,000 pages, 665 contributors
• Usage
• Impact (Web equivalent to journal impact factor & personal H-index)
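For reference, the personal H-index mentioned here is the largest h such that h items each have at least h citations. The sketch below implements that standard definition; feeding it per-page counts from a Scratchpad (views, links or citations) is an assumption about how a Web equivalent might work, not something the slides specify.

```python
def h_index(counts):
    """Largest h such that at least h items have a count of h or more."""
    ranked = sorted(counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical per-page citation counts for one contributor.
example_counts = [25, 8, 5, 3, 3]
print(h_index(example_counts))  # prints 3
```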