Ian H. Witten


Page 1: Ian H. Witten

greenstone.org

Ian H. Witten

New Zealand Digital Library Project
Computer Science Department
Waikato University
New Zealand

http://greenstone.org

Browsing around a digital library

Greenstone: Open source system for creating and delivering digital library collections

Page 2: Ian H. Witten
Page 3: Ian H. Witten
Page 4: Ian H. Witten

Agenda

• Context
• Documents and interfaces
– Different document types
– … and interface languages
• Searching and browsing
– Different search indexes
– … and browsing functionality
• Collection configuration
• (Using the Collector)
• The power of open source

Page 5: Ian H. Witten

What we wanted

Greenstone turns a ragtag menagerie of documents in various formats into an easy-to-use collection that can run on a standalone laptop in a Ugandan village’s information center

ALA 2002

Page 6: Ian H. Witten

What we wanted

• “Collections” of digital material
• Individualized, depending on metadata etc
• Up to several GB of text …
• … + associated images, movies, whatever
• Fully searchable
• Served on WWW, or published on CD-ROM
• Multi-platform (Unix + all Windows)
• Multi-format documents
• Multi-lingual: documents and interfaces
• Multimedia
• Metadata: standard and non-standard

Page 7: Ian H. Witten

Collections: on the Web

nzdl.org

(demo, not service)

Page 8: Ian H. Witten
Page 9: Ian H. Witten

Greenstone collections: on CD-ROM

UN and NGOs, e.g.

• UNESCO
• Global Help Project
• United Nations University
• World Health Organization
• Pan American Health Organization

Page 10: Ian H. Witten

Kataayi Multipurpose Cooperative
Rural Uganda
(20 km from Masaka)

Page 11: Ian H. Witten

Humanity Development Library
for sustainable development and basic human needs

Global Help Project, Antwerp (+ UN agencies)

Example

• 160,000 pages
• 30,000 images
• 1230 books
• 340 kg
• US$20,000

• CD-ROM
• US$6
• Win3.1x(!)/95/98/NT
• Stand-alone and intranet server
• Web browser user interface

Page 12: Ian H. Witten

Agenda

• Context
• Documents and interfaces
– Different document types
– … and interface languages
• Searching and browsing
– Different search indexes
– … and browsing functionality
• Collection configuration
• Using the Collector
• The power of open source

Page 13: Ian H. Witten

Collection of pictures (pictures of text)

Alexander Turnbull Library, NZ

Page 14: Ian H. Witten

Voice (and pictures)

Hamilton Public Library

Page 15: Ian H. Witten

Music

Page 16: Ian H. Witten

Chinese documents (pictures of text)

+ Chinese interface

Peking University Library

Page 17: Ian H. Witten

Chinese (Chinese & English interfaces)

Classic Chinese literature

Page 18: Ian H. Witten

Arabic (Arabic & English interfaces)

Famous mosques

Page 19: Ian H. Witten

UNESCO, Paris

French

Page 20: Ian H. Witten

PAHO, WHO

Spanish

Page 21: Ian H. Witten

Turkish

Page 22: Ian H. Witten

Russian collection from Mari El Republic

http://gov.mari.ru/gsdl

Page 23: Ian H. Witten

Agenda

• Context
• Documents and interfaces
– Different document types
– … and interface languages
• Searching and browsing
– Different search indexes
– … and browsing functionality
• Collection configuration
• Using the Collector
• The power of open source

Page 24: Ian H. Witten

Hierarchical document model

Metadata specified at any level

Title metadata
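The hierarchical model on this slide can be pictured as a tree of sections, each of which may carry its own metadata. The following is a minimal Python sketch of that idea, not Greenstone's internal representation: the `Section` class, its field names, and the rule that a missing value falls back to the nearest ancestor are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Section:
    """One node in a document hierarchy (book, chapter, subsection, ...)."""
    text: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)
    children: List["Section"] = field(default_factory=list)
    # Excluded from repr/compare so parent <-> child links do not recurse.
    parent: Optional["Section"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Section") -> "Section":
        """Attach a child section and return it for chaining."""
        child.parent = self
        self.children.append(child)
        return child

    def lookup(self, name: str) -> Optional[str]:
        """Return metadata set on this node, else on the nearest ancestor.

        The fallback to ancestors is an illustrative assumption.
        """
        node: Optional[Section] = self
        while node is not None:
            if name in node.metadata:
                return node.metadata[name]
            node = node.parent
        return None


# Hypothetical example data: metadata at document level and at section level.
book = Section(metadata={"Title": "Example book", "Subject": "Agriculture"})
chapter = book.add(Section(text="Chapter text", metadata={"Title": "Introduction"}))

print(chapter.lookup("Title"))    # section-level Title
print(chapter.lookup("Subject"))  # Subject found on the enclosing book
```

Keeping metadata on every node is what allows a section title to be indexed and displayed separately from the title of the enclosing document.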

Page 25: Ian H. Witten

Searching and browsing

Searching

Metadata-based browsing

Subject, Title, Publisher (Dublin Core)

“HowTo” (ad hoc)

Page 26: Ian H. Witten

Multiple search indexes

metadata

text
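One way to picture multiple search indexes is as separate inverted indexes built over different units (sections or whole documents) and different fields (text or a metadata element). The Python sketch below is a plain, uncompressed inverted index written for illustration only; it is not how Greenstone or MG builds its indexes. The sample records and function names are hypothetical, while the index names follow the `level:field` style used in the collection configuration file later in the talk.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(units):
    """Build term -> set of unit ids from an iterable of (unit_id, text)."""
    index = defaultdict(set)
    for unit_id, text in units:
        for term in tokenize(text):
            index[term].add(unit_id)
    return index


def search(index, query):
    """Return the ids of units that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sections belonging to two documents, b1 and b2.
sections = [
    {"id": "b1.1", "doc": "b1", "Title": "Water supply", "text": "Digging a village well"},
    {"id": "b1.2", "doc": "b1", "Title": "Storage", "text": "Keeping grain dry"},
    {"id": "b2.1", "doc": "b2", "Title": "Wells", "text": "Pump maintenance"},
]

indexes = {
    "section:text": build_index((s["id"], s["text"]) for s in sections),
    "section:Title": build_index((s["id"], s["Title"]) for s in sections),
    "document:text": build_index((s["doc"], s["text"]) for s in sections),
}

# The same query behaves differently depending on the index chosen:
# no single section contains both words, but document b1 does.
print(search(indexes["section:text"], "grain well"))
print(search(indexes["document:text"], "grain well"))
```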

Page 27: Ian H. Witten

metadata

Collection-dependent

Page 28: Ian H. Witten

Multilingual searching

Page 29: Ian H. Witten
Page 30: Ian H. Witten

AZList classifier (Title metadata)

Browsing using classifiers
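An alphabetic classifier of this kind can be thought of as grouping documents by the first letter of a metadata value. The sketch below is a simplified Python illustration, not Greenstone's AZList code: the function name, the `#` bucket for values that do not start with A to Z, and the sample titles are assumptions.

```python
from collections import defaultdict


def az_list(docs, field="Title"):
    """Group metadata values into A-Z buckets by first letter.

    Values that do not start with A-Z go into a catch-all '#' bucket
    (an illustrative choice). Documents lacking the field are skipped.
    """
    buckets = defaultdict(list)
    for doc in docs:
        value = (doc.get(field) or "").strip()
        if not value:
            continue
        first = value[0].upper()
        key = first if len(first) == 1 and "A" <= first <= "Z" else "#"
        buckets[key].append(value)
    # Sort bucket labels, and sort entries case-insensitively within each bucket.
    return {key: sorted(vals, key=str.lower) for key, vals in sorted(buckets.items())}


# Hypothetical documents; the last one has no Title metadata.
docs = [
    {"Title": "banana growing"},
    {"Title": "Apiculture"},
    {"Title": "Aquaculture"},
    {"Title": "1001 recipes"},
    {"Creator": "Unknown"},
]

for letter, titles in az_list(docs).items():
    print(letter, titles)
```

Passing `field="Creator"` would give an alphabetic author browser over the same documents, which is why a classifier is parameterized by the metadata element it uses.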

Page 31: Ian H. Witten

DateList classifier (Date metadata)

Page 32: Ian H. Witten

Hierarchy classifier (Subject metadata)
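A hierarchical browser can be sketched as a tree whose nodes are subject headings and whose leaves are documents. The Python below is only an illustration of that structure: it assumes, hypothetically, that each document's Subject value is a `/`-separated path, which is not the format Greenstone uses for its hierarchy files.

```python
def build_hierarchy(docs, field="Subject", sep="/"):
    """Build a nested tree of subject headings with document titles attached."""
    root = {"children": {}, "docs": []}
    for doc in docs:
        path = doc.get(field)
        if not path:
            continue  # documents without the metadata are not classified
        node = root
        for part in path.split(sep):
            node = node["children"].setdefault(
                part.strip(), {"children": {}, "docs": []}
            )
        node["docs"].append(doc.get("Title", "(untitled)"))
    return root


def render(node, depth=0):
    """Return the tree as indented lines: headings first, then their documents."""
    lines = []
    for name in sorted(node["children"]):
        child = node["children"][name]
        lines.append("  " * depth + name)
        lines.extend("  " * (depth + 1) + "- " + t for t in sorted(child["docs"]))
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical documents with path-style Subject metadata.
docs = [
    {"Title": "Beekeeping basics", "Subject": "Agriculture/Livestock"},
    {"Title": "Soil care", "Subject": "Agriculture"},
    {"Title": "Clean water", "Subject": "Health"},
]

print("\n".join(render(build_hierarchy(docs))))
```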

Page 33: Ian H. Witten
Page 34: Ian H. Witten

Metadata extraction plugins

Acronym extraction plugin
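To give a feel for what acronym extraction involves, here is a deliberately simple Python heuristic: it looks for a capitalized token in parentheses and checks whether the words just before it spell out the same initials. This is an assumption-laden sketch and not the algorithm used by the Greenstone plugin; the function name and the sample sentence are made up for illustration.

```python
import re


def extract_acronyms(text):
    """Map each parenthesized acronym to the preceding words whose initials match."""
    found = {}
    for match in re.finditer(r"\(([A-Z]{2,8})\)", text):
        acronym = match.group(1)
        # Words that appear before the opening parenthesis.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(acronym):]
        if len(candidate) == len(acronym) and all(
            word[0].upper() == letter for word, letter in zip(candidate, acronym)
        ):
            found[acronym] = " ".join(candidate)
    return found


sample = (
    "The World Health Organization (WHO) and the Pan American Health "
    "Organization (PAHO) publish guides. See also (XYZ)."
)
print(extract_acronyms(sample))
```

The heuristic misses acronyms that skip small words such as "of" or "and"; a production extractor would need a more tolerant matching rule.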

Page 35: Ian H. Witten

Language identification plugin

Page 36: Ian H. Witten

Email plugin

Page 37: Ian H. Witten

Phrase hierarchy extraction

+ thesaurus browsing

Page 38: Ian H. Witten

Agenda

• Context
• Documents and interfaces
– Different document types
– … and interface languages
• Searching and browsing
– Different search indexes
– … and browsing functionality
• Collection configuration
• Using the Collector
• The power of open source

Page 39: Ian H. Witten

Collection configuration file

creator [email protected]
maintainer [email protected]
public true
beta true

indexes section:text section:Title document:text
defaultindex section:text

plugin GAPlug
plugin ArcPlug
plugin RecPlug

classify Hierarchy hfile=sub.txt metadata=Subject sort=Title
classify HDLList metadata=Title
classify Hierarchy hfile=org.txt metadata=Organization sort=Title
classify List metadata=Howto

format SearchVList "<td valign=top>[link][icon][/link]</td>
<td>{If}{[parent(All': '):Title],[parent(All': '):Title]: }
[link][Title][/link]</td>"
format CL4VList "<br>[link][Howto][/link]"
format DocumentImages true
format DocumentText "<h3>[Title]</h3>\\n\\n<p>[Text]"

collectionmeta collectionname "greenstone demo"
collectionmeta collectionextra "This is a demonstration collection for the
Greenstone digital library software.\nIt contains a small
subset (11 books) of the Humanity Development Library"
collectionmeta iconcollectionsmall "/gsdl/collect/demo/images/demosm.gif"
collectionmeta iconcollection "/gsdl/collect/demo/images/demo.gif"
collectionmeta .section:Title "section titles"
collectionmeta .document:text "entire books"
collectionmeta .section:text "chapters"

Sections of the file:

• email of creator
• search indexes
• plugins
• classifiers
• how to format documents, query results, classifiers
• name, icon, etc
• description
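Because the configuration file is a list of directives, each starting with a keyword such as `indexes`, `plugin`, `classify`, `format` or `collectionmeta`, it is easy to read programmatically. The Python sketch below is a minimal reader written for illustration, not Greenstone's own parser. It assumes that a quoted value may continue across lines until its closing quote, that lines beginning with `#` are comments, and that quotes are never escaped inside a value; `parse_config` and `SAMPLE` are hypothetical names.

```python
def parse_config(text):
    """Group configuration directives by keyword.

    Returns a dict mapping each directive name to the list of argument
    strings given for it, in file order.
    """
    config = {}
    pending = ""
    for line in text.splitlines():
        pending = f"{pending}\n{line}" if pending else line
        # An odd number of double quotes means a quoted value is still open,
        # so keep reading (assumes no escaped quotes inside values).
        if pending.count('"') % 2:
            continue
        entry, pending = pending.strip(), ""
        if not entry or entry.startswith("#"):
            continue
        directive, _, rest = entry.partition(" ")
        config.setdefault(directive, []).append(rest.strip())
    return config


# A shortened excerpt in the style of the file shown on the slide.
SAMPLE = '''indexes section:text section:Title document:text
defaultindex section:text
plugin GAPlug
plugin ArcPlug
plugin RecPlug
collectionmeta collectionname "greenstone demo"
collectionmeta collectionextra "This is a demonstration collection for the
Greenstone digital library software."
'''

config = parse_config(SAMPLE)
print(config["plugin"])
print(config["indexes"][0].split())
```

Grouping repeated directives into lists reflects how the file is used: there is one `indexes` line, but any number of `plugin`, `classify`, `format` and `collectionmeta` lines.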

Page 40: Ian H. Witten

Alter configuration

• Add full-text index of titles
– additional indexes line: indexes document:Title
• ... or authors
– … need author metadata: indexes document:Creator
• Add alphabetic author browser
– add classifier line: classify AZList -metadata Creator
• Include Word documents
– add plugin line: plugin WordPlug
• Include PDF documents
– (same): plugin PDFPlug
• Separate index for each language
– add languages line: languages en fr es
• Extract acronyms and add list
– plugin option: plugin PDFPlug -extract_acronyms
• Import OAI metadata
– add plugin line: plugin OAIPlug
• Extract phrase hierarchy and add browser
– add classifier line: classify phind
• Alter the format of any of the above
– add format string: format …
• Restrict collection’s interface langs
– add format string: format PreferenceLangs en|fr|es
• Change default interface language
– edit site config file: cgiarg shortname=1 argdefault =fr

Page 41: Ian H. Witten

Agenda

• Context
• Documents and interfaces
– Different document types
– … and interface languages
• Searching and browsing
– Different search indexes
– … and browsing functionality
• Collection configuration
• Using the Collector
• The power of open source

Page 42: Ian H. Witten

Collector = software “wizard” for building new collections

The pen is mightier than the sword!

Building and distributing collections carries responsibilities …
legal … social … ethical …

Be aware of the power of information and use it wisely

Page 43: Ian H. Witten
Page 44: Ian H. Witten
Page 45: Ian H. Witten

Status updated every 5 secs

Page 46: Ian H. Witten
Page 47: Ian H. Witten

Agenda

• Context
• Documents and interfaces
– Different document types
– … and interface languages
• Searching and browsing
– Different search indexes
– … and browsing functionality
• Collection configuration
• Using the Collector
• The power of open source

Page 48: Ian H. Witten

The power of open source: Greenstone uses …

• Ghostscript: Interpreter for Adobe Postscript documents (Postscript plugin)
• Kea: Keyphrase extraction program (to generate metadata)
• pdftohtml: Converter for PDF documents (PDF plugin)
• rtftohtml: Converter for RTF documents (RTF plugin)
• TextCat: Detects languages and document encodings
• wvWare: Converter for Word documents (Word plugin)
• Xlhtml: Converter for Excel/Powerpoint documents (plugins)
• XML::Parser: Parses XML documents, used to read and write Greenstone’s internal XML document format

Page 49: Ian H. Witten

and …

• MG: Creates compressed full-text indexes and performs searches
• GDBM: Database used for metadata etc
• wget: Downloading pages from the Web when creating collections
• YAZ: Client and server implementation of Z39.50
• Stemmer: English language stemmer
• GCC: C/C++ compiler
• CVS: Version control system
• Perl: Used for plugins etc
• Apache: Web server used by many Greenstone installations

Page 50: Ian H. Witten

Greenstone DL software

Access
• Accessible via any Web browser
• Server runs on Windows and Unix
• Collections can be published on CD-ROM

Searching/browsing
• Full-text and fielded search
• Flexible browsing facilities
– Metadata-based (Dublin Core)
– Collection-specific
• Hierarchical phrase browsing supported
• Creates all access structures automatically

Multilingual
• Documents and interfaces
• Chinese, Arabic, Maori, Russian etc (+ European)
• Multimedia: video, audio collections exist

Distributed
• CORBA protocol allows remote access
• Z39.50 server/client for backwards compatibility

Extensible
• Plugins: new document, metadata formats
• Classifiers: new metadata browsers
• What you see, you can get!
• Open-source software: free, extensible

Page 51: Ian H. Witten

UNESCO: Distributing Greenstone DL software

• GNU licensed
• Fully documented
• Trilingual (English/French/Spanish)
• Unix/Windows (3.1/3.11, 95/98/ME, NT/2000/XP)
• Trivial to install
• End-user interface for collection building
• Serve collections on Web or write them to CD-ROM
• Documents on disk and/or Web
• Formats: HTML, Word, PDF, PostScript, plain text, e-mail, …

“Give a man a fish, feed him for a day
Teach a man to fish, feed him for life”

Sustainable development

Greenstone software on CD-ROM

http://greenstone.org

download from http://greenstone.org

Page 52: Ian H. Witten

How to build a digital library
Witten and Bainbridge
Morgan Kaufmann, 2003

Kia papapounamu te moana

kia hora te marino, kia tere te karohirohi,
kia papapounamu te moana

may peace and calmness surround you, may you reside in the warmth of a summer’s haze, may the ocean of your travels be as smooth as the polished greenstone.

Page 53: Ian H. Witten