Knowledge on the Web:

Semantic Web Mining and the Motivation of Volunteers

Bettina Berendt

Humboldt University Berlin, Institute of Information Systems www.wiwi.hu-berlin.de/~berendt

Thanks to ...

my co-authors (credited on the following slides) and the seminar groups who have worked and are working on the EDOC project:

Hanna Brekenfeld, Noppawan Bunyongasena, Thomas Dammeier, Gebhard Dettmar, Kai Dingel, Michael Ferber, Christoph Hanser, Oleg Ishenko, Beate Krause, Altug Kul, Toni Lohde, Egor Nikitin, Thomas Posner, Derya Saki, Mert Sengüner, Daniel Trümper

Semantic Web Mining

Agenda

Macrocosm

Concepts: Semantic Web Mining

Microcosm

Examples: Semantics Mining

“Macrocosm World Wide Web”

The potential

A great deal of knowledge, accessible to humans.

The problems

A great deal of knowledge, accessible to humans.

Web Mining

Forms

Knowledge discovery (aka Data mining):

“the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” 1

Web mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources.

Areas of Web mining:

Web content mining

Web structure mining

Web usage mining

1 Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. Boston, MA: AAAI/MIT Press


Web Mining: Examples

The main problem of Web mining

Common phrases of selected components

1. process; water; air; pressure; gas; body of water; natural gas; high pressure; hot water; fresh water;

2. Mark; Gospel; Matthew; Luke; Rose; Virgin; Virgin Mary; Gospel of John; Gospel of Mark; Gospel of Luke;

3. part; text; Britannica; entry; Encyclopedia Britannica; Encyclopædia Britannica; Encyclopaedia Britannica; domain Encyclopædia Britannica; public domain Encyclopædia Britannica; public domain text;

4. property; theorem; elements; proof; subset; axioms; proposition; natural numbers; fundamental theorem; mathematical logic;

5. Dove; AMD; Dove Streptopelia; imperial crown; Imperial army; imperial court; imperial family; Collared Dove Streptopelia; Imperial Russia;

6. side; feet; long time; long period; right side; left side; long distances; different types; short distance; opposite side;

7. David; bill; Bob; Jim; Allen; Dave; Current stars; former members; Bill Clinton; former President;

8. magazine; newspaper; political parties; public domain text; public opinion; political career; public schools; own right; political life; public service;

9. way; things; boy; cat; long time; same way; same thing; only way; different ways; good thing;

10. problems; zero; sum; digits; ~~; natural numbers; positive integer; mathematical analysis; decimal digits; natural logarithm;

11. population density; couples; races; total area; makeup; Demographics; median age; income; density; housing units;

175. Torres; Iraqi KASUMI KHAZAD Khufu; Granada; Spa; Fra; General information; General Public License; General Bernardo; New Granada; Torres Strait;

176. love; Me; Rolling Stones; love songs; Rolling Stone magazine; Love Me; Fall in Love; Meet Me; love story; professional wrestler;

The Wikipedia 300 Component Model, generated with discrete PCA

http://cosco.hiit.fi/search/H300.html/topic_list

In summary, weaknesses of purely statistical approaches:

Interpretation of the results?

Existence of results?

Correctness?

Inferences?


http://cosco.hiit.fi/search/H300.html/show_topic_id=0











Semantic Web

The Semantic Web

“The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in co-operation.” 1

“The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. It is a collaborative effort led by W3C with participation from a large number of researchers and industrial partners. It is based on the Resource Description Framework (RDF), which integrates a variety of applications using XML for syntax and URIs for naming.” 2

1 Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Sci. American, May.

2 http://www.w3.org/2001/sw/

3 Berners-Lee, T. (2000). Semantic Web XML2000. www.w3.org/2000/Talks/1206-xml2k-tbl/

Category structure:

<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top">
    <tag catid="1"/>
    <d:Title>Top</d:Title>
    <narrow r:resource="Top/Arts"/>
    ....
  </Topic>
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <narrow r:resource="Top/Arts/Books"/>
    ...
    <narrow r:resource="Top/Arts/Artists"/>
    <symbolic r:resource="Typography:Top/Computers/Fonts"/>
  </Topic>
  ....
</RDF>

Resources:

<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns:d="http://purl.org/dc/elements/1.0/" xmlns="http://directory.mozilla.org/rdf">
  ...
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <link r:resource="http://www3...ca/…./file.html"/>
  </Topic>
  <ExternalPage about="http://www…ca/file .html">
    <d:Title>John phillips Blown glass</d:Title>
    <d:Description>A small display of glass by John Phillips</d:Description>
  </ExternalPage>
  <Topic r:id="Top/Computers">
    <tag catid="4"/>
    <d:Title>Computers</d:Title>
    <link r:resource="http://www.cs.tcd.ie/FME/"/>
    <link r:resource="http://foo.asdfsa….."/>
  </Topic>
</RDF>
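To illustrate how such a category structure can be read by a program, here is a minimal Python sketch that extracts the parent/child links expressed by the narrow elements. It assumes a small, complete sample document modelled on the excerpt above (the elided parts are left out); the function name category_edges and the prefix m for the default namespace are illustrative choices, not part of the directory format.

```python
import xml.etree.ElementTree as ET

# Sample modelled on the slide excerpt; elided parts ("...") are omitted.
SAMPLE = """<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://directory.mozilla.org/rdf">
  <Topic r:id="Top">
    <tag catid="1"/>
    <d:Title>Top</d:Title>
    <narrow r:resource="Top/Arts"/>
  </Topic>
  <Topic r:id="Top/Arts">
    <tag catid="2"/>
    <d:Title>Arts</d:Title>
    <narrow r:resource="Top/Arts/Books"/>
    <narrow r:resource="Top/Arts/Artists"/>
    <symbolic r:resource="Typography:Top/Computers/Fonts"/>
  </Topic>
</RDF>"""

R = "http://www.w3.org/TR/RDF/"
# "m" is a hypothetical prefix for the document's default namespace.
NS = {"m": "http://directory.mozilla.org/rdf"}


def category_edges(xml_text):
    """Return (parent, child) pairs, one per <narrow> element."""
    root = ET.fromstring(xml_text)
    edges = []
    for topic in root.findall("m:Topic", NS):
        parent = topic.get(f"{{{R}}}id")
        for narrow in topic.findall("m:narrow", NS):
            edges.append((parent, narrow.get(f"{{{R}}}resource")))
    return edges


if __name__ == "__main__":
    for parent, child in category_edges(SAMPLE):
        print(parent, "->", child)
```

The resulting edge list is the machine-readable hierarchy that a purely statistical analysis of page text would have to guess.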

Semantic Web: Example

Why Semantic Web? Example: structured search (1) – metadata according to Dublin Core (DC)

Semantic search: Example 2 – metadata according to DC + domain ontology

What is an ontology?

An ontology is “an explicit specification of a shared conceptualisation.” (Gruber, 1993)

Gruber, T.R. (1993). Towards principles for the design of ontologies used for knowledge sharing. In N. Guarino & R. Poli (Eds.), Formal Ontologies in Conceptual Analysis and Knowledge Representation. Deventer, NL: Kluwer.

Bozsak, Ehrig, Handschuh, Hotho, Maedche, Motik, Oberle, Schmitz, Staab, Stojanovic, Stojanovic, Studer, Stumme, Sure, Tane, Volz, & Zacharias (2002). KAON - Towards a Large Scale Semantic Web. In Kurt Bauknecht, A. Min Tjoa, & Gerald Quirchmayr (Eds.), E-Commerce and Web Technologies, Third International Conference, EC-Web 2002, Aix-en-Provence, France, September 2-6, 2002, Proceedings (pp. 304-313). Springer: LNCS 2455.

Relational Metadata

[Figure: ontology and relational metadata. Ontology labels: OBJECT, PERSON, RESEARCHER, PROJECT, NAME, TITLE, WORKS-IN, COOPERATES-WITH; rule labels: cooperateswith(X,Y), cooperateswith(Y,X); WWW instance labels: URI-AHO (Andreas Hotho), URI-GST, URI-SWMining (Semantic Web Mining), DAMLPROJ.]

Ontology-based website modelling

The main problem of the Semantic Web

<HTML><HEAD>
<META NAME="DC.Creator" CONTENT="(Scheme=Freetext) Thomas Seilnacht <[email protected]>">
<META NAME="DC.Title" CONTENT="(Scheme=Freetext) 10 Schritte zum Bau der eigenen Homepage">
<META NAME="DC.Date.Created" CONTENT="(Scheme=Freetext) 1998-10-02">
<META NAME="DC.Form" CONTENT="(Scheme=IMT) text/html">
<META NAME="DC.Identifier" CONTENT="(Scheme=URL) http://www.seilnacht.tuttlingen.com/HTML/Homepage.htm">
<META NAME="DC.Description" CONTENT="(Scheme=Freetext) Anleitung zum Bau einer Homepage mit dem Netscape Communicator">
<META NAME="DC.Subject.Keywords" CONTENT="(Scheme=Freetext) Homepage, HTML, Internet, FTP, Polyview, Programmieren, Frames, JavaScript, CGI-Script, Grundbegriffe, Grafik, Freeware, INFORMATISCHE GRUNDBILDUNG">
<META NAME="DC.Type" CONTENT="Kurs/Onlinekurs/Virtuelles Seminar">
<META NAME="DC.Language" CONTENT="Deutsch">
<META NAME="DC.Description" CONTENT="(Scheme=URL) http://dbs.schule.de/db/mlesen.html?Id=7915&KATEGORIE=medien">
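Once an author has supplied Dublin Core META tags like those above, reading them back is straightforward. The following Python sketch, using only the standard library, collects all DC.* tags from a page and separates the optional "(Scheme=...)" prefix from the value. The class name DublinCoreParser, the helper split_scheme, and the shortened sample page are illustrative assumptions.

```python
import re
from html.parser import HTMLParser

# Optional "(Scheme=...)" prefix as used in the CONTENT values on the slide.
SCHEME = re.compile(r"^\(Scheme=(\w+)\)\s*(.*)$", re.DOTALL)


def split_scheme(content):
    """Split '(Scheme=X) value' into (X, value); scheme is None if absent."""
    match = SCHEME.match(content)
    if match:
        return match.group(1), match.group(2)
    return None, content


class DublinCoreParser(HTMLParser):
    """Collects DC.* META tags; a name may occur more than once."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":  # HTMLParser lowercases tag and attribute names
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("DC."):
            value = split_scheme(attributes.get("content") or "")
            self.metadata.setdefault(name, []).append(value)


# Shortened sample taken from the slide.
SAMPLE = """<HTML><HEAD>
<META NAME="DC.Title" CONTENT="(Scheme=Freetext) 10 Schritte zum Bau der eigenen Homepage">
<META NAME="DC.Date.Created" CONTENT="(Scheme=Freetext) 1998-10-02">
<META NAME="DC.Language" CONTENT="Deutsch">
</HEAD></HTML>"""

if __name__ == "__main__":
    parser = DublinCoreParser()
    parser.feed(SAMPLE)
    for name, values in parser.metadata.items():
        print(name, values)
```

Values are kept in lists because the example page itself uses DC.Description twice. Reading the tags is the easy part; the slide's point is the effort needed to write them.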

“Who is supposed to do all this?”

Strategies for creating the Semantic Web

“institutional”: compulsion / extrinsic motivation

“social”: distributed authorship à la Open Source (example: dmoz.org) / intrinsic motivation

“computer science / HCI”: tool support

“computer science / information processing” …

... Semantic Web Mining

Semantic Web Mining: A definition

(1) Mining of the Semantic Web

(2) Mining for the Semantic Web

(3) The iterative process of (1) and (2), in which the semantics obtained by mining are re-used for mining again.

Berendt, Stumme, & Hotho, Proc. ISWC 2002; Stumme, G., Hotho, A., & Berendt, B. (submitted). Semantic Web Mining – State of the Art and Future Directions.

“Microcosm EDOC”

Knowledge contributions: data and metadata

<BIBLIOGRAPHY><FLOAT><PAGENUMBER>136</PAGENUMBER></FLOAT>

<HEAD>Literaturverzeichnis</HEAD>

...

<CITATION WORKTYPE="journal" PUBLISHED="PUBLISHED">

<CUT ID="bib-45-">[2] </CUT><WORKAUTHOR>Albrecht, T. F.; Bott, K.; Meier, T.; Schulze, A.; Koch, M.; Cundiff, S. T.; Feldmann, J.; Stolz, W.; Thomas, P.; Koch, S. W.; Göbel; E. O.</WORKAUTHOR> <ARTICLETITLE>Disorder mediated biexcitonic beats in semiconductor quantum wells</ARTICLETITLE>, <WORKTITLE>Phys. Rev. B</WORKTITLE>, <PUBDATE>1996</PUBDATE>, <NUMBER>54</NUMBER>, <PAGES>4436</PAGES>,

</CITATION> ...



...



Dissertation Markup Language DiML

http://edoc.hu-berlin.de/diml/dtd/xdiml.dtd

...
<!ELEMENT citation (#PCDATA | email | url | note | workauthor | worktitle | articletitle | serialtitle | address | editor | publisher | edition | volume | number | version | pages | pubdate | bible | court | law | cut | pagenumber)*>
<!ATTLIST citation id ID #IMPLIED label CDATA #IMPLIED workType (Book | Journal | Misc) #IMPLIED published (yes|no) 'yes'>
<!ELEMENT note (#PCDATA | em | u | strong | br | sup | tt | sub | link | name | email | organization | term | foreign | url | footnote | endnote | glossref | indexref | pagenumber | q | citation | imath | im)*>
<!ATTLIST note id ID #IMPLIED>
<!ELEMENT workauthor (#PCDATA | given | surname | suffix | organization)*>
<!ATTLIST workauthor role CDATA #IMPLIED ref IDREF #IMPLIED id ID #IMPLIED>
...
<!ELEMENT worktitle (#PCDATA | em | u | strong | br | sup | tt | sub | pagenumber)*>
<!ATTLIST worktitle id ID #IMPLIED type CDATA #IMPLIED>
<!ELEMENT articletitle (#PCDATA | em | u | strong | br | sup | tt | sub | pagenumber)*>
<!ATTLIST articletitle id ID #IMPLIED type CDATA #IMPLIED>
...
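A citation marked up in this way can be turned into a structured record with a few lines of code, which is what makes such metadata useful for search. The sketch below is a minimal Python illustration; it assumes the upper-case tag spelling of the bibliography excerpt shown earlier, uses a shortened author list, and the function name citation_to_record is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample based on the bibliography excerpt shown earlier; the author list
# is shortened here for readability.
SAMPLE = """<CITATION WORKTYPE="journal" PUBLISHED="PUBLISHED">
<CUT ID="bib-45-">[2] </CUT><WORKAUTHOR>Albrecht, T. F.; Bott, K.</WORKAUTHOR> <ARTICLETITLE>Disorder mediated biexcitonic beats in semiconductor quantum wells</ARTICLETITLE>, <WORKTITLE>Phys. Rev. B</WORKTITLE>, <PUBDATE>1996</PUBDATE>, <NUMBER>54</NUMBER>, <PAGES>4436</PAGES>,
</CITATION>"""


def citation_to_record(citation_xml):
    """Map each child element to a lower-case field name and its text."""
    root = ET.fromstring(citation_xml)
    record = {"worktype": root.get("WORKTYPE")}
    for child in root:
        record[child.tag.lower()] = (child.text or "").strip()
    return record


if __name__ == "__main__":
    for field, value in citation_to_record(SAMPLE).items():
        print(f"{field}: {value}")
```

A real converter would also have to cope with nested markup inside titles, which the DTD permits; this sketch ignores it.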

The potential

Once these data and metadata exist ...

... they support powerful searches in distributed archives of (e.g.) electronic theses and dissertations (ETDs)

usually with OAI metadata harvesting. Examples:

o www.ndltd.org
• currently 154 members / repositories

o http://www.cybertesis.net
• currently 27 members / repositories

Benefits for authors:
o Free publication, high-quality archiving
o Guarantee of long-term readability (50 years)
o Authenticity & integrity
o Semantic searchability

... but how does one get the (meta)data?

The problems

Survey

Problem 1: It is not easy (and it is no fun)

Since the start of EDOC (1997): share of online dissertations ~20% (13% incl. the Faculty of Medicine)

Survey of all doctoral and habilitation candidates (just under 2500 people, 12-14% responded)

Main results regarding awareness and use of EDOC services:
o Problems in the flow of information, marketing, and service
o Creating the metadata is perceived as tedious and difficult, in particular the formatting of references, which is usually done after the fact

[Berendt, Brenstein, Li, & Wendland, Proc. ETD 2003; Berendt, Proc. AAAI Spring Symposium KCVC, 2005]

… and this has consequences




<CUT ID="bib-15-">[1] </CUT><WORKAUTHOR>Agarwal, R.; Krueger, B. P.; Scholes, G. D.; Yang, M.; Yom, J.; Mets, L.; Fleming, G. R.</WORKAUTHOR>U<ARTICLETITLE>ltrafast energy transfer in LHC-II revealed by three-pulse photon echo peak shift measurements</ARTICLETITLE>, <WORKTITLE>J. Phys. Chem. B</WORKTITLE>, <PUBDATE>2000</PUBDATE>, <NUMBER>104</NUMBER>, <PAGES>2908</PAGES>,

</CITATION>

...

Why is this a problem?

Cardona, M., & Marx, W. (2004). Verwechselt, vergessen, wiedergefunden. Referenzen – das fehlerhafte Gedächtnis [...]. Physik Journal, 3 (11), 27-29.
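The markup error in the citation above (a stray "U" between WORKAUTHOR and ARTICLETITLE, so that the title starts with "ltrafast") is the kind of slip that a simple check can flag. The following Python sketch reports letters or digits that sit between the child elements of a citation, where only punctuation would be expected. The enclosing CITATION start tag and the shortened author list in the sample are assumptions added to make it parseable, and the function name stray_text is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Sample based on the faulty citation on the slide. The opening <CITATION>
# tag is added and the author list is shortened (both assumptions).
SAMPLE = """<CITATION>
<CUT ID="bib-15-">[1] </CUT><WORKAUTHOR>Agarwal, R.; Krueger, B. P.</WORKAUTHOR>U<ARTICLETITLE>ltrafast energy transfer in LHC-II revealed by three-pulse photon echo peak shift measurements</ARTICLETITLE>, <WORKTITLE>J. Phys. Chem. B</WORKTITLE>, <PUBDATE>2000</PUBDATE>, <NUMBER>104</NUMBER>, <PAGES>2908</PAGES>,
</CITATION>"""


def stray_text(citation_xml):
    """Return (preceding tag, text) for alphanumeric text between elements."""
    root = ET.fromstring(citation_xml)
    findings = []
    for child in root:
        tail = (child.tail or "").strip()
        # Commas and whitespace between fields are fine; letters are not.
        if re.search(r"[A-Za-z0-9]", tail):
            findings.append((child.tag, tail))
    return findings


if __name__ == "__main__":
    print(stray_text(SAMPLE))
```

Such a check only catches text that fell outside the tags; a wrongly tagged but well-formed field would need other heuristics.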

Semantics Mining / usage mining

A third main result of the survey:
o largely unknown and unused are
• structured writing
• structured search

Question: Does the site turn readers into authors?

Data from the Web server log: 10,992 sessions (210,655 pages) from one week in 2003 (towards the end of the first survey)

Methods: semantic enrichment, association rule and sequence mining (tools: WEKA, WUM); clustering, classification

Q: Knowledge provision as a side effect of other activities? (here: Web search)

Non-semantic Web Usage Mining

80.136.155.126 - - [29/Mar/2003:00:02:00 +0100] "GET /favicon.ico HTTP/1.1" 200 1406 "-" "Mozilla/5.0 (Windows; U; Win 9x 4.90; de-DE; rv:1.0.1) Gecko/20020823 Netscape/7.0"

80.136.155.126 - - [29/Mar/2003:00:02:00 +0100] "GET /dissertationen/style/did.css HTTP/1.1" 200 10301 "http://edoc.hu-berlin.de/conferences/conf2/Kuehne-Hartmut-2002-09-08/HTML/kuehne-ch1.html" "Mozilla/5.0 (Windows; U; Win 9x 4.90; de-DE; rv:1.0.1) Gecko/20020823 Netscape/7.0"

66.196.72.44 - - [29/Mar/2003:00:02:38 +0100] "GET /../projekte/epdiss/kolloqu/schu/slide4.html HTTP/1.0" 400 379 "-" "Mozilla/5.0 (Slurp/cat; [email protected]; http://www.inktomi.com/slurp.html)"

66.196.72.44 - - [29/Mar/2003:00:03:09 +0100] "GET /humboldt-vl/hofmann-hasso/PDF/Hofmann.pdf HTTP/1.1" 200 94881 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Q312461)"

66.196.72.21 - - [29/Mar/2003:00:04:14 +0100] "GET /dissertationen/biologie/kernekewisch-michaela/HTML/kernekewisch-vita.html HTTP/1.0" 200 7418 "-" "Mozilla/5.0 (Slurp/cat; [email protected]; http://www.inktomi.com/slurp.html)"

64.68.82.27 - - [29/Mar/2003:00:04:21 +0100] "GET /download/kume/r-lailach-hesse.PDF HTTP/1.0" 200 179357 "-" "Googlebot/2.1 +http://www.googlebot.com/bot.html)"

193.7.255.242 - - [29/Mar/2003:00:07:08 +0100] "GET /dissertationen/radspieler-alexander-2000-09-20/HTML/radspieler-ch2.html HTTP/1.1" 304 - "-" "Firefly/1.0 (compatible; Mozilla 4.0; MSIE 5.5)"

Problem: URLs are not semantic. Analysing the data in this form yields no insight!
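Before any enrichment, the raw log lines shown above have to be split into fields. The following Python sketch parses lines in this format (host, timestamp, request, status, size, referrer, user agent) with one regular expression and flags obvious crawlers. The ROBOT_HINTS list is an illustrative assumption based on the agents visible in the excerpt, not a complete robot filter.

```python
import re

# One regular expression for the log format shown in the excerpt.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumption: substrings that identify the crawlers seen in the excerpt.
ROBOT_HINTS = ("slurp", "googlebot")


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" (e.g. for status 304) means that no body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    agent = record["agent"].lower()
    record["is_robot"] = any(hint in agent for hint in ROBOT_HINTS)
    return record


LINES = [
    '64.68.82.27 - - [29/Mar/2003:00:04:21 +0100] "GET /download/kume/r-lailach-hesse.PDF HTTP/1.0" 200 179357 "-" "Googlebot/2.1 +http://www.googlebot.com/bot.html)"',
    '193.7.255.242 - - [29/Mar/2003:00:07:08 +0100] "GET /dissertationen/radspieler-alexander-2000-09-20/HTML/radspieler-ch2.html HTTP/1.1" 304 - "-" "Firefly/1.0 (compatible; Mozilla 4.0; MSIE 5.5)"',
]

if __name__ == "__main__":
    for line in LINES:
        record = parse_line(line)
        print(record["host"], record["url"], record["status"], record["is_robot"])
```

Even after parsing, the url field is only a path string, which is why the next step maps URLs to concepts.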

Ontology-based behaviour modelling: URLs and application events

URL: Web page with content

Requested service

Berendt, B., Stumme, G., & Hotho, A. (2004). Usage mining for and on the Semantic Web. In H. Kargupta, A. Joshi, K. Sivakumar, & Y. Yesha (Eds.), Data Mining: Next Generation Challenges and Future Directions. Menlo Park, CA: AAAI/MIT Press.

Content received

Data preparation: semantic enrichment

[Figure: concept hierarchy used for semantic enrichment. Root: TOP; second level: AUTHOR, SEARCH, DOC, OTHER; further node labels as extracted: OAI, OTHER, DISS, FULLTEXT, LIST, DNB, AUTHOR, KEYWORD, META, PROJECT, OTHER DOC, MASTER, ABSTRACT, ADVICE, TEMPLATE, FAQ, LATEX, HINWEISE, DIML, README, ACCESS, CONFERENCE, PUBLIC, READ, STUDY, CMS, ABSTRACT, ACCESS, RESULT, …]

regexpr.txt: mapping from URLs to concepts

HOME edoc\.hu-berlin\.de\/$
AUTHOR-START \/e_autoren_en\/$
DISS-ABSTRACT \/abstract\.php3\/habilitationen\/
AUTHOR-ADVICE \/e_autoren\/hinweise\.php\?nav=.*
AUTHOR-ADVICE \/e_rzm\/hinweise\.php.*
...
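The mapping file above can be applied with very little code. The following Python sketch uses the five rules shown on the slide and returns the concept of the first matching pattern. The fallback concept "OTHER" and the first-match-wins order are assumptions made for this illustration; the slide does not say how unmatched URLs or multiple matches were handled.

```python
import re

# The five rules shown in regexpr.txt (concept, regular expression).
RULES = [
    ("HOME", r"edoc\.hu-berlin\.de\/$"),
    ("AUTHOR-START", r"\/e_autoren_en\/$"),
    ("DISS-ABSTRACT", r"\/abstract\.php3\/habilitationen\/"),
    ("AUTHOR-ADVICE", r"\/e_autoren\/hinweise\.php\?nav=.*"),
    ("AUTHOR-ADVICE", r"\/e_rzm\/hinweise\.php.*"),
]
COMPILED = [(concept, re.compile(pattern)) for concept, pattern in RULES]


def url_to_concept(url, default="OTHER"):
    """Return the concept of the first rule whose pattern occurs in the URL.

    Assumptions: rules are tried in file order, and URLs that match no
    rule are mapped to a catch-all concept.
    """
    for concept, pattern in COMPILED:
        if pattern.search(url):
            return concept
    return default


if __name__ == "__main__":
    for url in [
        "http://edoc.hu-berlin.de/",
        "http://edoc.hu-berlin.de/e_autoren/hinweise.php?nav=3",
        "http://edoc.hu-berlin.de/favicon.ico",
    ]:
        print(url, "->", url_to_concept(url))
```

After this step a session is no longer a list of opaque paths but a list of concepts from the hierarchy, which is what the mining step operates on.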

1. A request corresponds to [the interest in]

a) a concept

b) a (multi-)set of concepts

c) a structured set of concepts

2. A unit of analysis is

i. a session, viewed as a (multi-)set of requests

ii. a session, viewed as a sequence of requests

iii. a session, viewed as a graph of requests

iv. a user, modelled by (possibly aggregated) attributes of his or her session(s), plus, where available, other attributes (e.g. place of residence, income, transaction history)

Result of data preparation: data modelling

[Figure: example sessions over the concepts A, B, and C, including A B A and A B C.]
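The different views of a session listed above can be made concrete in a few lines. The following Python sketch takes one session, already mapped to concepts, and derives its set, multiset, sequence, and graph representation. The function name session_views and the choice to represent the graph as a set of directed edges between consecutive requests are illustrative assumptions.

```python
from collections import Counter


def session_views(requests):
    """Return four views of one session (a list of concepts)."""
    return {
        # which concepts were requested at all
        "set": set(requests),
        # how often each concept was requested
        "multiset": Counter(requests),
        # in which order the concepts were requested
        "sequence": tuple(requests),
        # which concept directly followed which (directed edges)
        "graph": set(zip(requests, requests[1:])),
    }


if __name__ == "__main__":
    views = session_views(["A", "B", "A"])
    for name, value in views.items():
        print(name, value)
```

Which view is appropriate depends on the mining method: association rules need sets, sequence mining needs sequences.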

Semantic Web Usage Mining – Step 2: Pattern discovery – Example: sequence mining

“Find out pages that are usually visited together and inspect the navigation paths between them.”

Sequence miner WUM (http://www.hypknowsys.de)

select t
from node as a b, template # _ a * b as t
where a.accesses > 100 and a.support > 100
and b.accesses > 50 and b.support > 50
and ( b.support / a.support ) > 0.5
and a.url startswith "AUTHOR"

- only paths starting from author-relevant content
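To make the intent of the query above tangible, here is a small Python sketch of a comparable analysis. It is not WUM's algorithm or data structure; it is a simplified reading in which the support of a concept is the number of sessions containing it, and the ratio b.support / a.support is approximated by the share of sessions with a in which b occurs later. The function name followed_by_rules, the thresholds, and the toy sessions are all assumptions.

```python
from collections import Counter


def followed_by_rules(sessions, prefix="AUTHOR", min_conf=0.5):
    """Find pairs (a, b) where b follows a in more than min_conf of the
    sessions that contain a; a must start with the given prefix."""
    a_support = Counter()   # sessions containing a
    ab_support = Counter()  # sessions in which b occurs after a
    for session in sessions:
        seen_a, seen_pairs = set(), set()
        for i, a in enumerate(session):
            if not a.startswith(prefix):
                continue
            seen_a.add(a)
            for b in session[i + 1:]:
                if b != a:
                    seen_pairs.add((a, b))
        a_support.update(seen_a)
        ab_support.update(seen_pairs)
    rules = {}
    for (a, b), count in ab_support.items():
        confidence = count / a_support[a]
        if confidence > min_conf:
            rules[(a, b)] = confidence
    return rules


# Toy sessions (made up for illustration), already mapped to concepts.
SESSIONS = [
    ["AUTHOR-START", "AUTHOR-ADVICE", "TEMPLATE"],
    ["AUTHOR-START", "AUTHOR-ADVICE"],
    ["DISS-ABSTRACT", "FULLTEXT"],
    ["AUTHOR-START", "HOME"],
]

if __name__ == "__main__":
    for (a, b), confidence in followed_by_rules(SESSIONS).items():
        print(f"{a} -> {b}: {confidence:.2f}")
```

The absolute thresholds on accesses and support from the query are omitted here because the toy data set is far too small for them.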

Popular entry points and first steps

“Readers” go directly to dissertations and stay there.

Paths to the document template

“Authors” stay with author content.

Readers and authors are different groups; readers do not become authors (at least not within one session)

Only a few visitors use the internal search engine, and they do not experience structured search as an effective or efficient search option.

A separate questionnaire study supports this finding.

Using external search engines makes access to dissertation full texts more likely.

Problem 2: Knowledge provision does not arise as a side effect of other activities (here: Web search)

Excursus: analysis with a given domain ontology: ka2portal.aifb.uni-karlsruhe.de

Are there different “search types” in this online catalogue?

Which (combinations of) search options are popular?

What does this signal about the users' content interests?

http://ka2portal.aifb.uni-karlsruhe.de/







Semantics of requests Step 1: Domain ontology

[Oberle, Berendt, Hotho, & Gonzalez, Proc. AWIC 2003]

• community portal ka2portal.aifb.uni-karlsruhe.de

• ontology-based:

• Knowledge base in F-Logic

• Static pages: annotations

• Dynamic pages: generated from queries

• Queries also in F-Logic

• Logs contain these queries

affiliation

RESEARCHER PERSON PROJECT PUBLICATION RESEARCHTOPIC EVENT ORGANIZATION RESEARCHINTEREST LASTNAME TITLE ISABOUT EVENTS EVENTTITLE WORKSATPROJECT AUTHOR AFFILIATION ISWORKEDONBY PROGRAMCOMMITTEE EMPLOYS NAME RESEARCHGROUPS EMAIL

An example query with concepts and relations:

FORALL N,PEOPLE <-PEOPLE:Employee[affiliation->> "http://www.anInstitute.org"] and PEOPLE:Person[lastName->>N].

Query = feature vector of concepts + relations

Session = feature vector of concepts + relations, summed over all queries in the session

Semantics of requests Step 2: Modelling requests and sessions-as-sets

Clustering, Association rules, Classification, ...
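The idea that a query is a feature vector of concepts and relations, and a session the sum of its query vectors, can be sketched as follows in Python. The vocabulary is a subset of the labels listed above; the tokenisation (strip quoted literals, then match identifiers case-insensitively) is an assumption for illustration and not the preprocessing used in the cited study.

```python
import re
from collections import Counter

# Subset of the concept and relation labels listed on the slide.
VOCABULARY = {
    "PERSON", "RESEARCHER", "PROJECT", "PUBLICATION",
    "AFFILIATION", "LASTNAME", "NAME", "EMAIL",
}


def query_vector(query):
    """Count vocabulary terms in one F-Logic query (sparse vector)."""
    # Drop quoted literals so that URL fragments are not counted.
    without_literals = re.sub(r'"[^"]*"', " ", query)
    tokens = re.findall(r"[A-Za-z]+", without_literals)
    return Counter(t.upper() for t in tokens if t.upper() in VOCABULARY)


def session_vector(queries):
    """Sum the query vectors of all queries in a session."""
    total = Counter()
    for query in queries:
        total.update(query_vector(query))
    return total


# The example query from the slide.
QUERY = (
    'FORALL N,PEOPLE <-PEOPLE:Employee[affiliation->> "http://www.anInstitute.org"] '
    "and PEOPLE:Person[lastName->>N]."
)

if __name__ == "__main__":
    print(query_vector(QUERY))
    print(session_vector([QUERY, QUERY]))
```

Vectors of this kind can be passed directly to standard clustering, association rule, or classification tools.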

The approach to a solution

Make it easier

Semantics Mining / content mining

Which kinds of programs and user interfaces support authors and motivate them to contribute?

... And how can further data be collected in order to understand and support the writing process?

An intelligent authoring tool for creating semantics

Prototype: focus on bibliography annotation
o the core and the most error-prone part of using the document template in EDOC

Based on information extraction

[Berendt, Proc. AAAI Spring Symposium KCVC, 2005]

System architecture

[Figure: system architecture with the components user interface, VBA macro, Web service, citeseer, paratools, TTT, and other WS and info. sources; output: corrected, XML annotated, and formatted.]

Information extraction: reference parsing in 3 tools

Paratools citation parsing

http://paracite.eprints.org

A database of templates of the form

'_AUTHORS_ (_YEAR_). _TITLE_. _PUBLICATION_,_VOLUME_(_ISSUE_):_PAGES_'

Each _XXX_ is associated with a regular expression
o Example: _YEAR_ ([[:digit:]]{4})

2 weighting factors
o reliability: the “syntactic determinacy” of a regular expression
• Ex.: _URL_ > _TITLE_
o concreteness = number of fixed symbols
• Ex.: '_AUTHORS_,_PUBLICATION_, in press' > '_AUTHORS_, _PUBLICATION_'

Templates are matched against the reference. Choose the template with the highest reliability or, if these are equal, the one with the highest concreteness.
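The template-matching scheme described above can be sketched in a few dozen lines of Python. This is not ParaTools code; it only mirrors the description. The placeholder regular expressions, the numeric reliability weights, the rule that a template's reliability is the sum of its placeholders' weights, and the counting of non-space fixed characters as concreteness are all assumptions made for this illustration.

```python
import re

# Hypothetical placeholder table: (regular expression, reliability weight).
# Tightly constrained patterns get a higher weight than free text.
PLACEHOLDERS = {
    "_AUTHORS_": (r"(?P<authors>.+?)", 1),
    "_YEAR_": (r"(?P<year>\d{4})", 3),
    "_TITLE_": (r"(?P<title>.+?)", 1),
    "_PUBLICATION_": (r"(?P<publication>.+?)", 1),
    "_VOLUME_": (r"(?P<volume>\d+)", 2),
    "_ISSUE_": (r"(?P<issue>\d+)", 2),
    "_PAGES_": (r"(?P<pages>\d+(?:-\d+)?)", 2),
}
TOKEN = re.compile(r"(_[A-Z]+_)")


def compile_template(template):
    """Return (compiled regex, reliability, concreteness) for a template."""
    regex, reliability, concreteness = "", 0, 0
    for part in TOKEN.split(template):
        if part in PLACEHOLDERS:
            pattern, weight = PLACEHOLDERS[part]
            regex += pattern
            reliability += weight  # assumption: weights are summed
        else:
            regex += re.escape(part)
            # assumption: fixed symbols = non-space literal characters
            concreteness += len(part.replace(" ", ""))
    return re.compile("^" + regex + "$"), reliability, concreteness


def best_parse(reference, templates):
    """Match all templates; prefer reliability, then concreteness."""
    candidates = []
    for template in templates:
        regex, reliability, concreteness = compile_template(template)
        match = regex.match(reference)
        if match:
            candidates.append((reliability, concreteness, template, match.groupdict()))
    if not candidates:
        return None
    _, _, template, fields = max(candidates, key=lambda c: (c[0], c[1]))
    return {"template": template, **fields}


# Illustrative templates (not taken from the ParaTools database).
TEMPLATES = [
    "_AUTHORS_ (_YEAR_). _TITLE_.",
    "_AUTHORS_ (_YEAR_). _TITLE_. _PUBLICATION_, _VOLUME_ (_ISSUE_), _PAGES_.",
]
REFERENCE = (
    "Cardona, M., & Marx, W. (2004). Verwechselt, vergessen, wiedergefunden. "
    "Physik Journal, 3 (11), 27-29."
)

if __name__ == "__main__":
    print(best_parse(REFERENCE, TEMPLATES))
```

Both templates match the sample reference, but the second contains more tightly constrained fields and is therefore preferred, so the journal, volume, issue, and pages are not swallowed by the title.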

Make it more rewarding

Semantics Mining / content + structure mining: RDI – Rosetta

Bradshaw, S. (2003). Reference Directed Indexing: Redeeming Relevance for Subject Search in Citation Indexes. In Proceedings of the 7th European Conference on Research and Advanced Technology for Digital Libraries.

Bradshaw, S., & Hammond, K. (2000). Guiding people to information: Providing an interface to a digital library using Reference as a basis for indexing. In Proceedings of the Fifth International ACM Conference on Intelligent User Interfaces.

Understand it correctly

Semantics Mining / content + structure mining: SSI

R. Navigli & P. Velardi. Structural Semantic Interconnections: a knowledge-based approach to word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence (27-7), July, 2005.

Basic idea: graphs of meanings induced by WordNet

Using SSI for word sense disambiguation

(“The driver turned on his heel and went back to the truck.”)

Summary and outlook

To motivate volunteers, computational, motivational, and institutional aspects must all be taken into account!

Extension of the intelligent authoring tool:
o Extending its capabilities (citation styles, ...)
o Integration of further information retrieval and mining methods
o Laboratory studies for a first evaluation
o Usage mining for continuous evaluation
o Strengthening the community element!

Outlook 1: Stronger involvement of the community

bibster.semanticweb.org

Recommendations based on items' semantics and their ... similarity to the user's expertise measured by previous externalisations (content of personal database) ... similarity to relevant items measured by previous internalisations (answers to a query) and combinations (addition to the personal database)

Haase, Ehrig, Hotho, & Schnizler, 2004

www.bibserv.org

Outlook 2: Fun!

Thank you for your attention!
