Post on 13-Jan-2016
greenstone.org
Ian H. Witten
New Zealand Digital Library Project
Computer Science Department
Waikato University
New Zealand
http://greenstone.org
Browsing around a digital library
Greenstone: Open source system for creating and delivering digital library collections
Agenda
Context
Documents and interfaces
– Different document types
– … and interface languages
Searching and browsing
– Different search indexes
– … and browsing functionality
Collection configuration
(Using the Collector)
The power of open source
What we wanted
Greenstone turns a ragtag menagerie of documents in various formats into an easy-to-use collection that can run on a standalone laptop in a Ugandan village’s information center
ALA 2002
“Collections” of digital material
Individualized, depending on metadata etc
Up to several Gb of text …
… + associated images, movies, whatever
Fully searchable
Served on WWW, or published on CD-ROM
Multi-platform (Unix + all Windows)
Multi-format documents
Multi-lingual: documents and interfaces
Multimedia
Metadata: standard and non-standard
What we wanted
Collections: on the Web
nzdl.org
(demo, not service)
Greenstone collections: on CD-ROM
UN and NGOs, e.g.
UNESCO
Global Help Project
United Nations University
World Health Organization
Pan American Health Organization
Kataayi Multipurpose Cooperative
Rural Uganda (20 km from Masaka)
for sustainable development and basic human needs
Example
• 160,000 pages
• 30,000 images
• 1230 books
• 340 kg
• US$20,000

• CD-ROM
• US$6
• Win3.1x(!)/95/98/NT
• Stand-alone and intranet server
• Web browser user interface
Global Help Project, Antwerp (+ UN agencies)
Humanity Development Library
Collection of pictures (pictures of text)
Alexander Turnbull Library, NZ
Voice (and pictures)
Hamilton Public Library
Music
Chinese documents (pictures of text)
+ Chinese interface
Peking University Library
Chinese (Chinese & English interfaces)
Classic Chinese literature
Arabic (Arabic & English interfaces)
Famous mosques
UNESCO, Paris
French
PAHO, WHO
Spanish
Turkish
Russian collection from Mari El Republic
http://gov.mari.ru/gsdl
Hierarchical document model
Metadata specified at any level
Title metadata
Searching and browsing
Searching
Metadata-based browsing
Subject Title Publisher
“HowTo”
Dublin Core ad hoc
Multiple search indexes
metadata
text
metadata
Collection-dependent
Multilingual searching
AZList classifier (Title metadata)
Browsing using classifiers
DateList classifier (Date metadata)
Hierarchy classifier (Subject metadata)
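The idea behind an alphabetic classifier such as AZList can be sketched in a few lines of Python. This is only an illustration of metadata-based browsing, not Greenstone's implementation: the function name `az_list`, the dictionary representation of documents, the sample titles, and the choice to leave out documents that lack the field are all assumptions made for the example.

```python
from collections import defaultdict


def az_list(documents, field="Title"):
    """Bucket documents under the first letter of one metadata field."""
    buckets = defaultdict(list)
    for doc in documents:
        value = doc.get(field, "").strip()
        if not value:
            # Assumption: documents without the field are left out of this browser.
            continue
        first = value[0].upper()
        if first.isalpha():
            key = first
        elif first.isdigit():
            key = "0-9"
        else:
            key = "other"
        buckets[key].append(doc)
    # Sort the buckets by key, and the documents in each bucket by the field value.
    return {
        key: sorted(docs, key=lambda d: d[field].lower())
        for key, docs in sorted(buckets.items())
    }


# Hypothetical documents, each a plain dict of metadata.
docs = [
    {"Title": "Butterfly farming"},
    {"Title": "aquaculture basics"},
    {"Title": "Bee keeping"},
    {"Creator": "unknown"},
]
for letter, entries in az_list(docs).items():
    print(letter, [d["Title"] for d in entries])
```

A date-based or subject-hierarchy browser follows the same pattern with a different bucketing key (a year, or a position in a subject tree) in place of the first letter.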
Metadata extraction plugins
Acronym extraction plugin
Language identification plugin
Email plugin
Phrase hierarchy extraction
+ thesaurus browsing
creator sjboddie@cs.waikato.ac.nz
maintainer sjboddie@cs.waikato.ac.nz
public true
beta true
indexes section:text section:Title document:text
defaultindex section:text
plugin GAPlug
plugin ArcPlug
plugin RecPlug
classify Hierarchy hfile=sub.txt metadata=Subject sort=Title
classify HDLList metadata=Title
classify Hierarchy hfile=org.txt metadata=Organization sort=Title
classify List metadata=Howto
format SearchVList "<td valign=top>[link][icon][/link]</td>
<td>{If}{[parent(All': '):Title],[parent(All': '):Title]: }
[link][Title][/link]</td>"
format CL4VList "<br>[link][Howto][/link]"
format DocumentImages true
format DocumentText "<h3>[Title]</h3>\\n\\n<p>[Text]"
collectionmeta collectionname "greenstone demo"
collectionmeta collectionextra "This is a demonstration collection for the
Greenstone digital library software.\nIt contains a small
subset (11 books) of the Humanity Development Library"
collectionmeta iconcollectionsmall "/gsdl/collect/demo/images/demosm.gif"
collectionmeta iconcollection "/gsdl/collect/demo/images/demo.gif"
collectionmeta .section:Title "section titles"
collectionmeta .document:text "entire books"
collectionmeta .section:text "chapters"
Collection configuration file
email of creator
search indexes
plugins
classifiers
how to format documents, query results, classifiers
name, icon, etc
description
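Because the configuration file is a list of directives followed by arguments, it is straightforward to read mechanically. The Python sketch below groups such a file into (directive, arguments) pairs. It is not Greenstone's own reader: `parse_collect_cfg` is a hypothetical helper, and it assumes that each entry starts with a directive word, that values containing spaces are wrapped in double quotes, that a quoted value may run over several lines, and that quotes are never escaped inside a value.

```python
import shlex


def parse_collect_cfg(text):
    """Group collect.cfg-style text into (directive, [arguments]) pairs."""
    entries = []
    pending = ""
    for line in text.splitlines():
        pending = f"{pending}\n{line}" if pending else line
        # A quoted value may span lines: keep reading until the quotes balance.
        if pending.count('"') % 2 == 1:
            continue
        stripped = pending.strip()
        pending = ""
        if not stripped or stripped.startswith("#"):
            continue
        lexer = shlex.shlex(stripped, posix=True)
        lexer.whitespace_split = True  # split on whitespace only
        lexer.quotes = '"'             # apostrophes are ordinary characters
        lexer.escape = ""              # keep backslashes literal
        lexer.commenters = ""          # '#' inside an entry is not a comment
        tokens = list(lexer)
        entries.append((tokens[0], tokens[1:]))
    return entries


sample = '''public true
indexes section:text section:Title document:text
format CL4VList "<br>[link][Howto][/link]"
collectionmeta collectionname "greenstone demo"
'''
for directive, args in parse_collect_cfg(sample):
    print(directive, args)
```

Each index name in the `indexes` entry has the form level:field, so splitting an argument on the colon separates the indexing level (section or document) from what is indexed (text or a metadata field such as Title).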
Alter configuration

Task                                        Change                    Configuration line
Add full-text index of titles               additional indexes line   indexes document:Title
... or authors                              … need author metadata    indexes document:Creator
Add alphabetic author browser               add classifier line       classify AZList -metadata Creator
Include Word documents                      add plugin line           plugin WordPlug
Include PDF documents                       (same)                    plugin PDFPlug
Separate index for each language            add languages line        languages en fr es
Extract acronyms and add list               plugin option             plugin PDFPlug -extract_acronyms
Import OAI metadata                         add plugin line           plugin OAIPlug
Extract phrase hierarchy and add browser    add classifier line       classify phind
Alter the format of any of the above        add format string         format …
Restrict collection’s interface langs       add format string         format PreferenceLangs en|fr|es
Change default interface language           edit site config file     cgiarg shortname=l argdefault=fr
Collector = software “wizard” for building new collections
The pen is mightier than the sword!
Building and distributing collections carries responsibilities … legal … social … ethical …
Be aware of the power of information and use it wisely
Status updated every 5 secs
The power of open source: Greenstone uses …
Ghostscript    Interpreter for Adobe PostScript documents (PostScript plugin)
Kea            Keyphrase extraction program (to generate metadata)
pdftohtml      Converter for PDF documents (PDF plugin)
rtftohtml      Converter for RTF documents (RTF plugin)
TextCat        Detects languages and document encodings
wvWare         Converter for Word documents (Word plugin)
Xlhtml         Converter for Excel/PowerPoint documents (plugins)
XML::Parser    Parses XML documents, used to read and write Greenstone’s internal XML document format
and …
MG         Creates compressed full-text indexes and performs searches
GDBM       Database used for metadata etc
wget       Downloading pages from the Web when creating collections
YAZ        Client and server implementation of Z39.50
Stemmer    English language stemmer
GCC        C/C++ compiler
CVS        Version control system
Perl       Used for plugins etc
Apache     Web server used by many Greenstone installations
Greenstone DL software
Access
– Accessible via any Web browser
– Server runs on Windows and Unix
– Collections can be published on CD-ROM
Searching/browsing
– Full-text and fielded search
– Flexible browsing facilities
– Metadata-based (Dublin Core)
– Collection-specific
– Hierarchical phrase browsing supported
– Creates all access structures automatically
Multilingual
– Documents and interfaces
– Chinese, Arabic, Maori, Russian etc (+ European)
– Multimedia: video, audio collections exist
Distributed
– CORBA protocol allows remote access
– Z39.50 server/client for backwards compatibility
Extensible
– What you see: you can get!
– Open-source software: free, extensible
– Plugins: new document, metadata formats
– Classifiers: new metadata browsers
UNESCO: Distributing Greenstone DL software
GNU licensed
Fully documented
Trilingual (English/French/Spanish)
Unix/Windows (3.1/3.11, 95/98/ME, NT/2000/XP)
Trivial to install
End-user interface for collection building
Serve collections on Web or write them to CD-ROM
Documents on disk and/or Web
Formats: HTML, Word, PDF, PostScript, plain text, e-mail, …
“Give a man a fish, feed him for a day
Teach a man to fish, feed him for life”
Sustainable development
Greenstone software on CD-ROM
http://greenstone.org
download from http://greenstone.org
How to build a digital library
Witten and Bainbridge
Morgan Kaufmann, 2003

Kia papapounamu te moana
kia hora te marino, kia tere te karohirohi, kia papapounamu te moana
may peace and calmness surround you, may you reside in the warmth of a summer’s haze, may the ocean of your travels be as smooth as the polished greenstone.