2003.09.18 - SLIDE 1 IS 202 – FALL 2003 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2003 http://www.sims.berkeley.edu/academics/courses/ is202/f03/ SIMS 202: Information Organization and Retrieval Lecture 8: Thesaurus Design

2003.09.18 - SLIDE 1IS 202 – FALL 2003

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2003

http://www.sims.berkeley.edu/academics/courses/is202/f03/

SIMS 202: Information Organization and Retrieval

Lecture 8: Thesaurus Design

2003.09.18 - SLIDE 2IS 202 – FALL 2003

Lecture Overview

• Review
  – Types of Controlled Vocabularies
  – Name Authority Control

• Thesaurus Design and Development
  – Controlled Vocabularies for topical description
  – Thesaurus Design
  – Steps in Thesaurus Development
  – Indexing

• Discussion

2003.09.18 - SLIDE 3IS 202 – FALL 2003

Lecture Overview

• Review
  – Types of Controlled Vocabularies
  – Name Authority Control

• Thesaurus Design and Development
  – Controlled Vocabularies for topical description
  – Thesaurus Design
  – Steps in Thesaurus Development
  – Indexing

• Discussion

2003.09.18 - SLIDE 4IS 202 – FALL 2003

Controlled Vocabularies

• Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information

• That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata

2003.09.18 - SLIDE 5IS 202 – FALL 2003

Controlled Vocabularies

• Names and name authorities

• Gazetteers (geographic names)

• Code lists (e.g., LC language codes)

• Subject heading lists

• Classification schemes

• Thesauri

2003.09.18 - SLIDE 6IS 202 – FALL 2003

Name Authorities: The Problem

• Proliferation of the forms of names
  – Different names for the same person
  – Different people with the same names

• Examples
  – From Books in Print (semi-controlled but not consistent)
  – ERIC author index (not controlled)
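The name-proliferation problem can be illustrated with a minimal sketch of an authority file lookup. The dictionary layout and the function names (`normalize`, `build_lookup`, `authorize`) are assumptions made for illustration, not the format of any real authority file; the single entry is only an example.

```python
import re

# Illustrative authority file: authorized heading -> variant forms seen in records.
AUTHORITY_FILE = {
    "Twain, Mark, 1835-1910": [
        "Mark Twain",
        "Twain, Mark",
        "Clemens, Samuel Langhorne",
    ],
}


def normalize(name):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", "", name).lower().split())


def build_lookup(authority_file):
    """Map every normalized form (heading or variant) to its authorized heading."""
    lookup = {}
    for heading, variants in authority_file.items():
        for form in [heading, *variants]:
            lookup[normalize(form)] = heading
    return lookup


def authorize(name, lookup):
    """Return the authorized heading for a name, or None if it is not in the file."""
    return lookup.get(normalize(name))


LOOKUP = build_lookup(AUTHORITY_FILE)
print(authorize("mark twain", LOOKUP))
```

Qualifiers such as dates in the authorized heading are one common way to keep different people with the same name apart; this sketch only handles the "different names for the same person" half of the problem.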

2003.09.18 - SLIDE 7IS 202 – FALL 2003

Lecture Overview

• Review
  – Types of Controlled Vocabularies
  – Name Authority Control

• Thesaurus Design and Development
  – Controlled Vocabularies for topical description
  – Thesaurus Design
  – Steps in Thesaurus Development
  – Indexing

• Discussion

2003.09.18 - SLIDE 8IS 202 – FALL 2003

Uses of Controlled Vocabularies

• Library subject headings, classification and authority files

• Commercial journal indexing services and databases

• Yahoo, and other web classification schemes

• Online and manual systems within organizations
  – SunSolve
  – MacArthur

2003.09.18 - SLIDE 9IS 202 – FALL 2003

Indexing Languages

• An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents

• An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms
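As a rough sketch of the "index" half of these definitions, the snippet below inverts a set of term assignments so that each index term points to the documents it describes. The document IDs, terms, and the `build_index` name are hypothetical.

```python
from collections import defaultdict


def build_index(assignments):
    """Invert {doc_id: [terms]} into {term: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, terms in assignments.items():
        for term in terms:
            index[term].add(doc_id)
    return index


# Hypothetical assignments made by an indexer.
assignments = {
    "doc1": ["Thesauri", "Indexing"],
    "doc2": ["Indexing"],
}
index = build_index(assignments)
print(sorted(index["Indexing"]))
```

The indexing language itself is what constrains which strings may appear as terms in `assignments`, and how they may be combined.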

2003.09.18 - SLIDE 10IS 202 – FALL 2003

Types of Indexing Languages

• Uncontrolled keyword indexing

• Indexing languages
  – Controlled, but not structured

• Thesauri
  – Controlled and structured

• Classification systems
  – Controlled, structured, and coded

• Faceted classification systems

2003.09.18 - SLIDE 11IS 202 – FALL 2003

Thesauri

• A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among synonymous, equivalent, broader, narrower and other related terms
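One way to picture this definition is as a small data structure in which every preferred term carries its links. The sketch below is an assumption about layout, not a standard format: the field names follow the conventional UF/BT/NT/RT relationship labels, and the sample terms are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    term: str                                      # preferred term (descriptor)
    used_for: set = field(default_factory=set)     # UF: non-preferred synonyms
    broader: set = field(default_factory=set)      # BT
    narrower: set = field(default_factory=set)     # NT
    related: set = field(default_factory=set)      # RT


class Thesaurus:
    def __init__(self):
        self.entries = {}  # preferred term -> Entry
        self.use = {}      # non-preferred term -> preferred term (USE reference)

    def entry(self, term):
        if term not in self.entries:
            self.entries[term] = Entry(term)
        return self.entries[term]

    def add_use_for(self, preferred, variant):
        self.entry(preferred).used_for.add(variant)
        self.use[variant] = preferred

    def add_hierarchy(self, broader, narrower):
        # Record both directions so BT and NT stay reciprocal.
        self.entry(broader).narrower.add(narrower)
        self.entry(narrower).broader.add(broader)

    def add_related(self, a, b):
        self.entry(a).related.add(b)
        self.entry(b).related.add(a)


t = Thesaurus()
t.add_use_for("Automobiles", "Cars")
t.add_hierarchy("Motor Vehicles", "Automobiles")
print(t.use["Cars"], t.entry("Automobiles").broader)
```

Keeping the reciprocal links in one place (`add_hierarchy`, `add_related`) avoids the one-sided references that otherwise have to be caught when cross references are checked.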

2003.09.18 - SLIDE 12IS 202 – FALL 2003

Thesaurus Standards

• National and International Standards for Thesauri
  – ANSI/NISO Z39.19-1994: American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
  – ANSI/NISO Draft Standard Z39.4-199x: American National Standard Guidelines for Indexes in Information Retrieval
  – ISO 2788: Documentation. Guidelines for the establishment and development of monolingual thesauri
  – ISO 5964: Documentation. Guidelines for the establishment and development of multilingual thesauri

2003.09.18 - SLIDE 13IS 202 – FALL 2003

Thesaurus Examples

• Examples
  – The ERIC Thesaurus of Descriptors
  – The Medical Subject Headings (MESH) of the National Library of Medicine
  – The Art and Architecture Thesaurus

2003.09.18 - SLIDE 14IS 202 – FALL 2003

ERIC Thesaurus – Entry

2003.09.18 - SLIDE 15IS 202 – FALL 2003

ERIC Thesaurus – Alphabetic

2003.09.18 - SLIDE 16IS 202 – FALL 2003

ERIC Thesaurus – KWIC Index
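A KWIC (keyword-in-context) index lists each term under every significant word it contains, with the rest of the term shown as context. The sketch below is a generic illustration of that idea, not a reproduction of the ERIC display; the stopword list, the sample terms, and the `kwic` name are assumptions.

```python
STOPWORDS = frozenset({"of", "and", "the", "in", "for"})


def kwic(terms, stopwords=STOPWORDS):
    """Return sorted (keyword, left context, right context) rows for each term."""
    rows = []
    for term in terms:
        words = term.split()
        for i, word in enumerate(words):
            if word.lower() in stopwords:
                continue  # stopwords never serve as access points
            rows.append((word.lower(), " ".join(words[:i]), " ".join(words[i + 1:])))
    return sorted(rows)


for keyword, left, right in kwic(["Adult Education", "Education of Women"]):
    print(f"{left:>15} [{keyword}] {right}")
```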

2003.09.18 - SLIDE 17IS 202 – FALL 2003

ERIC Thesaurus – Hierarchies

2003.09.18 - SLIDE 18IS 202 – FALL 2003

ERIC Thesaurus – Groups

2003.09.18 - SLIDE 19IS 202 – FALL 2003

ERIC Thesaurus – Online

http://www.ericfacility.net/extra/pub/thessearch.cfm

2003.09.18 - SLIDE 20IS 202 – FALL 2003

MESH – Entry

2003.09.18 - SLIDE 21IS 202 – FALL 2003

MESH – Alphabetic

2003.09.18 - SLIDE 22IS 202 – FALL 2003

MESH – Tree Structures

2003.09.18 - SLIDE 23IS 202 – FALL 2003

MESH – KWOC Index

2003.09.18 - SLIDE 24IS 202 – FALL 2003

MESH - Online

http://www.nlm.nih.gov/mesh/meshhome.html

2003.09.18 - SLIDE 25IS 202 – FALL 2003

AAT – Facets

2003.09.18 - SLIDE 26IS 202 – FALL 2003

AAT – Hierarchies (print)

2003.09.18 - SLIDE 27IS 202 – FALL 2003

AAT – Hierarchies (online)

http://www.getty.edu/research/tools/vocabulary/aat/

2003.09.18 - SLIDE 28IS 202 – FALL 2003

AAT – Entry (online)

2003.09.18 - SLIDE 29IS 202 – FALL 2003

Lecture Overview

• Review
  – Types of Controlled Vocabularies
  – Name Authority Control

• Thesaurus Design and Development
  – Controlled Vocabularies for topical description
  – Thesaurus Design
  – Steps in Thesaurus Development
  – Indexing

• Discussion

2003.09.18 - SLIDE 30IS 202 – FALL 2003

Why Develop a Thesaurus?

• To provide a conceptual structure or “space” for a body of information
  – To make it possible to adequately describe the topical content of information resources at an appropriate level of generality or specificity
  – To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material)

2003.09.18 - SLIDE 31IS 202 – FALL 2003

Why Develop a Thesaurus?

• To provide vocabulary (or terminological) control
  – When there are several possible terms designating a single concept, the thesaurus should lead the indexer or searcher to the appropriate concept, regardless of the terms they start with

2003.09.18 - SLIDE 32IS 202 – FALL 2003

Preliminary Considerations

• What is used now?
  – Continue using an existing thesaurus?
  – Ad hoc modification of existing thesaurus?
  – Develop a new well-structured thesaurus?

• What is the scope and complexity of the subject field?

• What kind of retrieval objects or data will be dealt with?

• How exhaustive and specific is the desired description of objects?

2003.09.18 - SLIDE 33IS 202 – FALL 2003

Preliminary Considerations

• The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus
  – It is better to plan for a larger and more comprehensive system than a smaller system that will rapidly become inadequate as the database grows

• Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists

2003.09.18 - SLIDE 34IS 202 – FALL 2003

Development of a Thesaurus

• Term selection

• Merging and development of concept classes

• Definition of broad subject fields and subfields

• Development of classificatory structure

• Review, testing, application, revision

2003.09.18 - SLIDE 35IS 202 – FALL 2003

Flow of Work in Thesaurus Construction

[Flowchart. Boxes and decisions as they appear on the slide:]

Select Sources
Assign Codes
Select Terms
Record Selected Terms
Sort Terms
Merge Identical Terms
Define Broad Subject Fields
Merge Terms in Same Concept Class
Sort Terms into Broad Subject Fields
Define Subfields within One Subject Field
Work Out Detailed Structure of the Subject Field
Select Preferred Terms
All Subfields of Broad Subject finished? (Yes / No)
All Broad Subjects finished? (Yes / No)
Improve Class Structure
Print Classified Index and Review
Discuss with Experts and Users
Select Descriptors and Checklist Items
Produce Full Thesaurus and Check References
Assign Notation
Review and Test
Many Modifications? (Yes / No)
Revise as Needed

Based on Soergel, pp. 327-333

2003.09.18 - SLIDE 36IS 202 – FALL 2003

1. Term Selection

• Select sources for the collection of terms
  – Prearranged sources
  – Open-ended sources

• Assign codes to each source

• Selection of terms
  – For part of prearranged and for all open-ended sources

• Enter terms into database with all information

2003.09.18 - SLIDE 37IS 202 – FALL 2003

1.1 Kinds of Sources

• Prearranged Sources
  – Existing descriptor lists, classification schemes, thesauri
    • This includes universal schemes like DDC or LCSH
  – Nomenclatures of single disciplines
  – Treatises on the terminology of a field
  – Encyclopedias, lexica, dictionaries and glossaries
  – Tables of contents of textbooks and handbooks
  – Indexes of journals or abstracting journals
  – Indexes of other publications in the field

2003.09.18 - SLIDE 38IS 202 – FALL 2003

1.1 Kinds of Sources

• Open-ended sources
  – Lists of search requests or interest profiles
  – Description of projects/activities to be served by the information retrieval system
  – Discussion with specialists in the field
  – Sample of documents in the field
    • Ask users why and how these documents relate to the field
    • Have documents indexed by experts in the field
  – Lists of titles of documents in the field
  – Abstracts and reviews of documents
  – Your own knowledge

2003.09.18 - SLIDE 39IS 202 – FALL 2003

Selection of Sources

• Prearranged sources require less effort in gathering the material, and may already indicate some relationships between terms and concepts and among the terms themselves

• Open-ended sources can reflect current terminology and may provide more complete coverage

• Choose a set of sources that are current, as complete as possible, and considered authoritative

2003.09.18 - SLIDE 40IS 202 – FALL 2003

Selection of Sources

• Each selected source is assigned an ID for tracking its use in the development of the thesaurus
  – Useful when making decisions about which terms to prefer
  – Useful for backtracking when questions arise (where did this come from?)
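Source tracking of this kind can be sketched as a small registry plus a per-term set of source IDs. The IDs, descriptions, and the `record_term` helper below are hypothetical.

```python
from collections import defaultdict

# Hypothetical source registry: source ID -> description.
SOURCES = {
    "S1": "Existing descriptor list",
    "S2": "Sample of search requests",
}

# term -> set of source IDs in which the term was found
term_sources = defaultdict(set)


def record_term(term, source_id):
    """Record that `term` was taken from the source with the given ID."""
    if source_id not in SOURCES:
        raise KeyError(f"unknown source ID: {source_id}")
    term_sources[term].add(source_id)


record_term("Thesauri", "S1")
record_term("Thesauri", "S2")
record_term("Descriptors", "S1")

# Backtracking: where did this term come from?
print(sorted(term_sources["Thesauri"]))
```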

2003.09.18 - SLIDE 41IS 202 – FALL 2003

Selection of Terms

• Terms can be transferred directly from prearranged sources to the recording medium (cards or database)
  – Have to decide which terms and references to include, or to take the whole source

2003.09.18 - SLIDE 42IS 202 – FALL 2003

Selection of Terms

• In open-ended sources you read through the source and pick out terms (i.e. words and phrases) that might be useful in retrieval or as references to other terms

• Alternatively, use keyword and phrase extraction software to create lists of terms and select from those

• Transfer selected terms to the recording medium (cards or database)
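The keyword and phrase extraction option can be approximated, very roughly, by counting words and short phrases across the source texts and keeping the frequent ones as candidates for a human to select from. This is a naive frequency sketch, not a description of any particular extraction tool; the stopword list, thresholds, and sample texts are assumptions.

```python
import re
from collections import Counter

STOPWORDS = frozenset({"a", "an", "and", "for", "in", "of", "the", "to"})


def candidate_terms(texts, max_len=2, min_count=2):
    """Count words and phrases up to max_len words; keep those seen min_count times."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip candidates that start or end with a stopword.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    kept = [(term, c) for term, c in counts.items() if c >= min_count]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


texts = [
    "Thesaurus design and thesaurus construction",
    "Principles of thesaurus construction",
]
for term, count in candidate_terms(texts):
    print(count, term)
```

The output is only a candidate list; deciding which candidates are useful for retrieval remains an intellectual step.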

2003.09.18 - SLIDE 43IS 202 – FALL 2003

Work Form – Still relevant??

From Soergel, p. 399

2003.09.18 - SLIDE 44IS 202 – FALL 2003

2. Merging and Development of Concept Classes

• Sort Term DB into alphabetical order

• First Round
  – Merge information for identical terms, possibly pulling info from additional sources

• Second Round
  – Merge synonyms or terms in the same concept class
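The two merging rounds can be sketched as follows. Round one treats terms as identical when they match after case and whitespace normalization (an assumption about what "identical" means); round two groups terms into concept classes from synonym pairs that an editor supplies. All names and sample data are hypothetical.

```python
from collections import defaultdict


def merge_identical(records):
    """Round 1: merge (term, source_id) records whose terms match after normalization."""
    merged = {}
    for term, source_id in records:
        key = " ".join(term.lower().split())
        slot = merged.setdefault(key, {"forms": set(), "sources": set()})
        slot["forms"].add(term)
        slot["sources"].add(source_id)
    return dict(sorted(merged.items()))  # alphabetical order, as in the sorted term DB


def merge_concept_classes(terms, synonym_pairs):
    """Round 2: group terms into concept classes using editor-supplied synonym pairs."""
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for a, b in synonym_pairs:
        parent[find(a)] = find(b)

    classes = defaultdict(set)
    for t in terms:
        classes[find(t)].add(t)
    return list(classes.values())


merged = merge_identical([("Thesauri", "S1"), ("thesauri", "S2"), ("Indexing", "S1")])
classes = merge_concept_classes(
    ["cars", "automobiles", "trucks"], [("cars", "automobiles")]
)
print(list(merged), classes)
```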

2003.09.18 - SLIDE 45IS 202 – FALL 2003

3. Definition of Broad Subject Fields and Subfields

• Define broad subject fields and sort terms into these broad fields

• Define subfields within each broad field and sort terms into these subfields

• Work out the detailed structure
  – Select preferred terms
  – Merge information for terms in the same concept class

• Repeat these steps
  – For each subfield within a broad field
  – And for each broad field
  – Until all terms have been consolidated and preferred terms selected
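Selecting a preferred term within a concept class is an editorial judgment, but the bookkeeping can be sketched. The rule used below (prefer the form attested in the most sources, ties broken alphabetically) is purely an assumption for illustration; the function name and data are hypothetical.

```python
def select_preferred(concept_class, term_sources):
    """Pick a preferred term and build USE references for the other terms.

    Assumed rule: most sources wins; ties are broken alphabetically.
    """
    preferred = min(
        concept_class,
        key=lambda t: (-len(term_sources.get(t, ())), t),
    )
    use_refs = {t: preferred for t in concept_class if t != preferred}
    return preferred, use_refs


concept_class = {"cars", "automobiles", "autos"}
term_sources = {"cars": {"S1"}, "automobiles": {"S1", "S2"}, "autos": set()}
preferred, use_refs = select_preferred(concept_class, term_sources)
print(preferred, use_refs)
```

In practice the choice would also weigh user terminology and expert advice, which a simple count cannot capture.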

2003.09.18 - SLIDE 46IS 202 – FALL 2003

4. Development of Classificatory Structure

• Produce preliminary version of classified index and update the working database

• Improve classificatory structure

• Reality check
  – Produce and distribute a version of the classified index
  – Distribute to users/experts

2003.09.18 - SLIDE 47IS 202 – FALL 2003

5. Final Stages

• Review

• Testing

• Application

• Revision

2003.09.18 - SLIDE 48IS 202 – FALL 2003

Review

• Discuss classified index with users/experts
  – Select descriptors and checklist descriptors

• Assign notational symbols

• Produce main thesaurus and indexes

2003.09.18 - SLIDE 49IS 202 – FALL 2003

Review (cont.)

• Check cross references and insert where needed

• Produce test version

• Test by indexing

• Modify as needed

• Produce production version

2003.09.18 - SLIDE 50IS 202 – FALL 2003

Testing a Thesaurus

• Assign descriptors to a sample set of NEW documents (use enough to get an idea of any gaps in the thesaurus)

• Test retrieval using sample questions and see how effectively the thesaurus maps each question to the appropriate descriptors
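The first test can be partly mechanized: after indexers note the concepts found in a sample of new documents, list the concepts that have no matching term in the thesaurus. The sketch below assumes a case-insensitive exact match and hypothetical sample data.

```python
def find_gaps(sample_concepts, vocabulary):
    """Return {doc_id: [concepts with no matching thesaurus term]} for a test sample."""
    known = {term.lower() for term in vocabulary}
    gaps = {}
    for doc_id, concepts in sample_concepts.items():
        missing = [c for c in concepts if c.lower() not in known]
        if missing:
            gaps[doc_id] = missing
    return gaps


sample = {"d1": ["Thesauri", "Ontologies"], "d2": ["Indexing"]}
vocabulary = ["thesauri", "indexing"]  # descriptors plus lead-in terms
print(find_gaps(sample, vocabulary))
```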

2003.09.18 - SLIDE 51IS 202 – FALL 2003

Lecture Overview

• Review
  – Types of Controlled Vocabularies
  – Name Authority Control

• Thesaurus Design and Development
  – Controlled Vocabularies for topical description
  – Thesaurus Design
  – Steps in Thesaurus Development
  – Indexing

• Discussion

2003.09.18 - SLIDE 52IS 202 – FALL 2003

The Indexing Process

• Concept identification

• Term selection (via thesaurus)

• Term assignment

2003.09.18 - SLIDE 53IS 202 – FALL 2003

Application: The Indexing Process (Manual)

[Flowchart. Boxes and decisions recoverable from the slide:]

Start
Examine Document and Identify Significant Concepts
Consider First Concept
Establish Term Denoting Concept
Does Thesaurus contain term for Concept? (Yes / No)
Preferred Term? (Yes / No)
Select Preferred Term
Consider Preferred Term
Is Term suitable? (Yes / No)
Select Alternative term to represent Concept
Consider any associated terms in Thesaurus (NT, BT)
Would Concept be better represented by one of these terms? (Yes / No)
Prefer Alternative Term(s)
Can Concept be expressed combining terms? (Yes / No)
Consider Each of These Terms
Admit New Term Into Thesaurus
Assign Terms to Document
Is There Another Concept? (Yes / No)
End

Adapted from ISO 5963, p.5
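A much simplified reading of the manual indexing process can be sketched as a loop over the concepts found in a document. This is not the ISO 5963 procedure itself: it leaves out the suitability checks and the review of broader and narrower terms, and the function name, arguments, and sample data are hypothetical.

```python
def index_document(concepts, preferred, use_refs, combinations):
    """Assign terms to one document; return (assigned terms, newly admitted terms).

    preferred:    set of preferred terms in the thesaurus
    use_refs:     non-preferred term -> preferred term
    combinations: concept -> list of existing terms that jointly express it
    """
    assigned, new_terms = [], []
    for concept in concepts:
        if concept in preferred:
            assigned.append(concept)              # thesaurus term is already preferred
        elif concept in use_refs:
            assigned.append(use_refs[concept])    # follow the USE reference
        elif concept in combinations:
            assigned.extend(combinations[concept])  # express by combining terms
        else:
            new_terms.append(concept)             # candidate for admission
            assigned.append(concept)
    # Drop duplicates while keeping the order of first assignment.
    return list(dict.fromkeys(assigned)), new_terms


assigned, new_terms = index_document(
    ["cars", "road safety", "phonecams"],
    preferred={"automobiles", "roads", "safety"},
    use_refs={"cars": "automobiles"},
    combinations={"road safety": ["roads", "safety"]},
)
print(assigned, new_terms)
```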

2003.09.18 - SLIDE 54IS 202 – FALL 2003

Thesaurus Revision and Updates

• There will always be new concepts, products, or expressions that need to be added to the thesaurus
  – Set a regular schedule of reviews and revisions
  – Collect complaints, problems, etc. and fold into revision of the thesaurus

2003.09.18 - SLIDE 55IS 202 – FALL 2003

References

• Soergel, D. Indexing Languages and Thesauri: Construction and Maintenance. Los Angeles: Melville Publishing Co., 1974.

• Foskett, A.C. The Subject Approach to Information. London: Clive Bingley, 1982.

• Standards:
  – ANSI/NISO Z39.19-1994: American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
  – ANSI/NISO Draft Standard Z39.4-199x: American National Standard Guidelines for Indexes in Information Retrieval
  – ISO 2788: Documentation. Guidelines for the establishment and development of monolingual thesauri
  – ISO 5964: Documentation. Guidelines for the establishment and development of multilingual thesauri

2003.09.18 - SLIDE 56IS 202 – FALL 2003

Lecture Overview

• Review
  – Types of Controlled Vocabularies
  – Name Authority Control

• Thesaurus Design and Development
  – Controlled Vocabularies for topical description
  – Thesaurus Design
  – Steps in Thesaurus Development
  – Indexing

• Discussion

2003.09.18 - SLIDE 57IS 202 – FALL 2003

Simon King - Soergel

• Indexing Languages and Thesauri: Construction and Maintenance was published in 1974 and technology has clearly changed a lot since then. Do Soergel’s careful step-by-step instructions (often referencing use of index cards) still have much value? Are outdated instructions and flowcharts in fact dangerous to current thesaurus builders, allowing them to avoid thinking too much about what they’re doing while mindlessly following directions?

2003.09.18 - SLIDE 58IS 202 – FALL 2003

Simon King - Soergel

• Are his concerns about storage (e.g., attaching too many descriptors to a document) outdated? How about the fact that indexing gets more difficult as the size of the indexing language increases? Can modern technology help us here? Is there an upper limit on the size of a usable indexing language, or should we apply Soergel’s guidelines for eliminating terms that point to few documents to make indexing languages as small as possible?

2003.09.18 - SLIDE 59IS 202 – FALL 2003

Simon King - Soergel

• Do some of the algorithms/procedures he describes (e.g. F04.2(e) and F0.4.4,1 on weighting documents when determining frequencies of concepts/terms) still make sense when building more modern ISAR systems?

2003.09.18 - SLIDE 60IS 202 – FALL 2003

Sean Savage – House of Q.

• The House of Quality's foundation and starting point is a comprehensive and well-defined list of specific Customer Attributes (i.e., phrases that customers use to describe what's important to them in a product.) The CAs specify the "whats" that the team is aiming for and the team uses the CAs to determine Engineering Characteristics: the "hows" for reaching those targets.

2003.09.18 - SLIDE 61IS 202 – FALL 2003

Sean Savage – House of Q.

• The House of Quality seems to me a great technique for communicating about and improving mature products like automobiles and dishwashers, whose "whats" are stable and widely understood. What are we to make of this as we consider new and evolving technologies, where our biggest challenges often hinge upon discovery and definition of the "whats?"

2003.09.18 - SLIDE 62IS 202 – FALL 2003

Sean Savage – House of Q.

• With many such technologies, including phonecams, customers don't yet have a clear idea of specifically what's important to them, and in cases where they do state such opinions, those views often change dramatically as uses of the tools evolve and mature. Customers invent new uses for new tools, and in turn they develop new criteria for rating the tools.

2003.09.18 - SLIDE 63IS 202 – FALL 2003

Sean Savage – House of Q.

• Is The House of Quality irrelevant in the realm of very new technologies? Can we use cornerstones of The House to build more fluid team-communication tools that will allow more flexibility and change in the definition and measurement of the "whats" and that will clarify, rather than confuse, the relations between business functions and user satisfaction in evolving new products?

2003.09.18 - SLIDE 64IS 202 – FALL 2003

Lisa de Larios-Heiman - Sano

• When designing a web page (or anything else), designers have to keep users' needs in mind. But is there a point where those needs should be ignored for the sake of creating clear, extensible designs? Is it possible to be such a good designer, and to have enough expertise on the subject matter, that the designer knows what the users' needs are without consulting them? For example, I disagree with Sano that international groupings of bands are irrelevant to people interested in music, but as a designer he does not recognize that as an important distinction.

2003.09.18 - SLIDE 65IS 202 – FALL 2003

Lisa de Larios-Heiman - Sano

• This article was written at least 7 years ago, and is very dated in its technical references. Is it similarly dated in its discussion of the web design process? Or have the basics of web design not developed as rapidly as the technology, either because they don't need to or because designers aren't responding quickly to the technological changes? Did his dated references make you take his writings less seriously?

2003.09.18 - SLIDE 66IS 202 – FALL 2003

Next Time

• Multimedia Information Organization and Retrieval

• Readings/Discussion:
  – Editing Out Video Editing (Ryan Shaw)
  – Computational Media Aesthetics: Finding Meaning Beautiful (Margaret Spring)
  – The Holy Grail of Content-Based Media Analysis (Hong Qu)
  – Applications of Video-Content Analysis and Retrieval (Dan Perkel)