Large-scale digitization of archival records
Marc Holtman
Analog original documents in the repository
Scans
Metadata
(Digital) access
Scanning
Searching and consulting
Usability
Findability
Concepts
Principles
Technical aspects
Economic principles
Workflow
In large-scale digitization of archival records
Round of introductions
http://195.242.171.17/hga/virtuelestudiezaal/WebsitePubliek/BeschrijvingOverlijdensAkte.aspx?nn=3&akteid=6305722&containerid=6305718
http://www.nationalarchives.gov.uk/dol/images/examples/pdfs/PROB-11-1-0029.pdf
http://beeldbank.nationaalarchief.nl/nl/afbeeldingen/thema/81-topstukken-van-het-nationaal-archief
http://www.hsp.org/default.aspx?id=1035
http://recordsearch.naa.gov.au/SearchNRetrieve/Interface/DetailsReports/ItemDetail.aspx?Barcode=638501
http://burgerlijkestand.zederik.nl/atlantis/?application=nadere%20toegangen&database=nt&entrypoint=personen&query=&vanaf=0&sessienummer=1.ef00&relatienr=1&extra1=&extra2=&aantal_per_pagina=9&service=object&templatename=item.htm&recordnumber=0&detailobject=%3cd%7cw2kzed02%3aD%3a%2fatlantisdatabase%2fatlantis.nt.db%7c86%7c0%7caca04%7c10000%3e#
A selection of examples from the homework
Books (Internet Archive)
Google books (Google Books)
Munich Digitisation Centre (Digital collections)
Visitors
Year    Reading rooms    Website
1982    24.027
1988    29.788
1992    27.738
1998    26.598           40.048
2002    25.014           224.050
2006    17.958           512.592
2007    92.678           520.483
2008    118.312          538.483
2009    106.000          531.143
The time when digitization was optional is over. The number of website users is rising sharply. But so is the expectation that the documents can be consulted online.
“We want to use the internet to facilitate historical research; Everybody should be able to use all archival collections at home 24/7”
Q. How many scans does digitizing 32 kilometres of archives produce?
1 metre = 7.000 scans
A. 224.000.000 scans
Q. How long does it take to digitize everything?
Production = 10.000 scans per week
A. 431 years
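The two answers above follow from simple arithmetic. A minimal sketch that reproduces them, assuming 52 production weeks per year (the slides do not state this; the constant names are illustrative):

```python
# Figures taken from the slides.
SCANS_PER_METRE = 7_000
SCANS_PER_WEEK = 10_000
# Assumption: production runs 52 weeks per year.
WEEKS_PER_YEAR = 52


def total_scans(shelf_km: float) -> int:
    """Number of scans needed for a given shelf length in kilometres."""
    return round(shelf_km * 1_000 * SCANS_PER_METRE)


def years_needed(scans: int) -> float:
    """Years of production needed at the weekly scanning rate."""
    return scans / SCANS_PER_WEEK / WEEKS_PER_YEAR


scans = total_scans(32)
print(scans)                       # 224000000
print(round(years_needed(scans)))  # 431
```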
The number of documents to be digitized in an archive project quickly runs into the hundreds of thousands or millions
Incidental and structural costs must remain manageable even at these enormous quantities
Incidental and structural costs depend on:
A. Technical aspects: quality standard of the scans / file size
B. Work processes: organization of the reproduction process
99% of all documents in an archive are text documents
National and international guidelines and best practices for digitization often focus on:
- visual material: photos, prints, drawings, etc.
- high-end quality (preservation digitization / substitution)
But what if it becomes large-scale, and the starting point is mainly “consultation” (that is, “reading the text”)?
Principles, concepts, technical aspects and workflow
Large-scale digitization of archival records
We Scan
We Store
We Do
Principles, image quality and workflow principles
Compression and filesize
Workflow, tools and practical issues
Goals of digitization projects vary from access to substitution of the originals
In every project quality standard and method are set, depending on purpose and type of material
For all projects we have one workflow
We always work on project basis
We scan
Digitization at the Amsterdam City Archives in general
We scan
1. At large scale
the more scans being made, the lower the price per scan
Large scale production is a prerequisite in order to keep production costs as low as possible
Documents that are being digitized in this reproduction process can have the following forms
We scan
Small and large size
Bound and loose-leafed entities
Card indexes
Old and modern material
Low and high contrast documents
Text alone, text and image together
Hybrid forms
3. A broad spectrum of document types
Costs for producing and storing scans are determined to a high extent by the quality standard set for the scans
Purpose of the scans: archival research using the web, straight from screen or print
We scan
4. For archival research from screen or print
The higher the standard of quality, the higher the costs will be
In order to keep costs low it is prudent to let the quality standard follow from the requirements the end user places on the scan
Textual information legible in the originals must be legible in the scans
A quality higher than that will inevitably push up both incidental and structural costs
But has no added value for the customer at all
We scan
4. For archival research from screen or print
Specified (basic) quality standard:
Reproduction of all significant information
Reproduction of details which are not part of the textual information is not required
We scan
Scan quality and legibility
High quality scan
Modified scan (contrast)
Optimal tonal range
Example: very “light” original
Excellent flexibility
Poor tonal range
Little flexibility
Practical experience shows that what counts as “good legibility” is very personal.
We decided to solve this problem with a smart filter in the document viewer.
Poor legibility
Excellent legibility
Which one would you buy?
Skimping on the quality of scans (it could be better) is purely an economic decision, not one taken on principle
We scan
4. For archival research from screen or print
Price comparison scanning costs
Price rates scanning, external partner
High-end: 3 – 10 $
Legibility: 0,30 – 0,75 $
Legibility, auto-feed: 0,05 $
It makes sense to let the quality standard follow from the purpose the end user has for the scans
We scan
5. For conservation and security
The scans in the scanning on request service are made for the purpose of access / archival research
Not as a substitute for the originals
Conservation of the originals remains the major concern
Nevertheless, digitization does have a real conservation function
After digitization the originals can no longer be requested in the reading room
This way damage or loss of the originals is ruled out
We scan
6. Always complete files
A file can contain from one to hundreds of documents
By definition the entire file is scanned
Never just a selection of pages
There are a few reasons for this:
The costs for scanning are not so much a factor of quantity, but rather of the manual processing involved
With a selection, it would have to be indicated in the originals or the metadata which documents have been digitized
When shown in the Archiefbank, the user expects completeness
When non-scanned pages have to be digitized later, the entire preparation process has to be gone through once again
We scan
7. Contracting out the scanning to external partners
Contracting out of scanning was a logical choice
The in-house scan facilities are not designed for large-scale digitizing
The complexity of the workflow and of the material to be scanned calls for:
Specialized hardware and software
Specialized set-ups
Knowledge
Very complex technical infrastructure
Investing only makes sense at very high production volumes, organized on a large scale
This calls for intensive collaboration
Also, the workflows of archive and digitizer have to dovetail
We scan
There are many scanning companies
Most do have experience in bulk processing
But not in this degree of complexity and diversity
7. Contracting out the scanning to external partners
Contracting out scanning is more than awarding a contract to a supplier
We scan
Low costs
Customers think a low price is important
Archival research easily runs into the use of dozens to hundreds of documents
100 scans should not cost $ 100
The price of an ordinary copy in our reading room should be the benchmark
The costs when purchasing scans online should be competitive with travel costs when visiting our reading room
This means that costs for producing and storing scans have to be as low as possible
We store
Scans with a file size as small as possible
Storage costs are still considerable when producing large quantities of scans
In order to bring structural costs down, the file size of the scans has to be as small as possible
This can be achieved in three ways:
1. Skimping on resolution
2. Skimping on bit depth / number of colors (only possible in formats like TIFF and PNG)
3. Using (lossless or lossy) compression on the files
We use a combination of 1 and 3
Resolution
The finer the grid used in scanning, the more information and the higher the level of detail
But in any case it remains a strong simplification of reality
At a certain level of detail the individual “grid cells” will always become visible
For text documents the grid must be fine enough to match the details of the textual information. The dot on an i must still be distinguishable as such
But details in the structure of the paper, for example, need not be visible in the scan
Resolution
Resolution is usually expressed in DPI (Dots Per Inch)
Or, more accurately, PPI (Pixels Per Inch)
DPI thus says something about the information density per unit of length
And therefore something about the theoretically achievable quality
But nothing at all about the objective quality of a scan
Both a € 50 scanner from Aldi and a € 50.000 high-end scanner can scan at 300 dpi
But the quality of the scans they produce will clearly differ
The resolving power of a scanner can be measured with test targets on which so-called line pairs are measured
The benchmark resolution is usually 300 dpi
This is based on the smallest letter e (1 mm) in printed matter
Not all documents contain details that small
Resolution
The required resolution can be calculated with, among other methods, the so-called Quality Index:
http://www.library.cornell.edu/preservation/tutorial/conversion/conversion-04.html
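As an illustration, the sketch below uses the commonly cited form of the Quality Index for printed text, QI = (dpi × 0.039 × h) / 2 for grayscale or color scans and / 3 for bitonal scans, with h the height of the smallest character in millimetres. The slides do not give the formula, so treat the constants and the function names as assumptions and check them against the tutorial linked above.

```python
# Assumed constant: rounded millimetre-to-inch conversion used in the formula.
MM_TO_INCH = 0.039


def quality_index(dpi: float, char_height_mm: float, bitonal: bool = False) -> float:
    """Quality Index for a given resolution and smallest character height."""
    divisor = 3 if bitonal else 2
    return dpi * MM_TO_INCH * char_height_mm / divisor


def required_dpi(qi: float, char_height_mm: float, bitonal: bool = False) -> float:
    """Resolution needed to reach a target Quality Index (inverse of the above)."""
    divisor = 3 if bitonal else 2
    return divisor * qi / (MM_TO_INCH * char_height_mm)


# A 1 mm letter e scanned in color at the 300 dpi benchmark:
print(quality_index(300, 1.0))
# Resolution needed for a target index of 8 on the same character:
print(required_dpi(8, 1.0))
```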
Resolution is a strong determinant of file size:
Resolution (A4)    File size
300 dpi            24 MB
400 dpi            44 MB
800 dpi            177 MB
1600 dpi           708 MB
3200 dpi           2,8 GB
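The table can be approximated from first principles: pixels wide × pixels high × bytes per pixel. The sketch below assumes an A4 page of 8.27 × 11.69 inch, 24-bit color, no compression, and sizes expressed in units of 1024 × 1024 bytes; none of these assumptions is stated on the slide, and the results land close to, not exactly on, the listed values.

```python
# Assumption: A4 page size in inches.
A4_INCHES = (8.27, 11.69)


def uncompressed_size_bytes(dpi: int, bits_per_pixel: int = 24,
                            size_in: tuple = A4_INCHES) -> int:
    """Size of an uncompressed scan: one value per pixel, no header overhead."""
    width_px = round(size_in[0] * dpi)
    height_px = round(size_in[1] * dpi)
    return width_px * height_px * bits_per_pixel // 8


def mebibytes(n_bytes: int) -> float:
    """Convert bytes to units of 1024 * 1024 bytes."""
    return n_bytes / 2**20


for dpi in (300, 400, 800, 1600, 3200):
    print(dpi, "dpi:", round(mebibytes(uncompressed_size_bytes(dpi)), 1))
```

Doubling the resolution quadruples the number of pixels, which is why the file size grows so quickly down the table.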
Resolution
Examples
300 dpi
200 dpi
150 dpi
Resolution
Conclusion: at 150 dpi files are small and most text is still perfectly legible
But is it wise to take this as the starting point for digitization?
A lower resolution also means lower structural management costs. In a few years it may be possible to rescan with better technology.
But it is not sufficient if, in the future, we want to deliver higher quality on the basis of these images, apply OCR and/or convert to better compression and file formats.
The choice depends on objectives, resources and quantities
Color
Color depth: bits and bytes
A pixel is a cell with a single color
The smallest unit of a digital file is a bit: it has the value 0 or 1
When a pixel consists of 1 bit, that pixel can have the value black (0) or white (1)
If we want to be able to define more colors per pixel, we have to increase the number of bits per pixel
With 8 bits (each of which can take the value 1 or 0), 256 combinations, and therefore colors, are possible (for example 0 0 0 1 0 0 1 1)
Most cameras use 8 bits per color channel (24 bits in total)
This makes 16,7 million colors possible
24 bits (8 bits per color channel)
8 bits, grayscale
1 bit, black and white
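The relation between bit depth and the number of possible colors is a power of two. A short illustration (the function name is ours, not from the slides):

```python
def number_of_colors(bits_per_pixel: int) -> int:
    """Each extra bit doubles the number of values a pixel can take."""
    return 2 ** bits_per_pixel


print(number_of_colors(1))   # 2 (black or white)
print(number_of_colors(8))   # 256 (grayscale)
print(number_of_colors(24))  # 16777216 (about 16,7 million)

# The example bit pattern from the slide, read as one 8-bit value:
print(int("00010011", 2))    # 19
```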
Compression
A method for describing the information more efficiently
Encoding words (Peer and Spel are Dutch four-letter words, “pear” and “game”)
Peer Spel Spel Spel
Spel Peer Peer Spel
Spel Spel Peer Peer
Storage: 48 letters
P = Peer
S = Spel
Compression: file size decreases
Compression
Result
P S S S
S P P S
S S P P
Storage: 12 letters (plus coding table)
P = Peer
S = Spel
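The word-coding example can be written out directly. This is only a toy illustration of dictionary coding with a fixed table; real image compressors work on pixel data and are far more elaborate.

```python
# The twelve words from the slide, row by row.
words = ("Peer Spel Spel Spel "
         "Spel Peer Peer Spel "
         "Spel Spel Peer Peer").split()

# Coding table from the slide.
table = {"Peer": "P", "Spel": "S"}


def encode(words, table):
    """Replace every word by its one-letter code."""
    return [table[w] for w in words]


def decode(codes, table):
    """Invert the coding table to get the original words back."""
    inverse = {code: word for word, code in table.items()}
    return [inverse[c] for c in codes]


codes = encode(words, table)
print(sum(len(w) for w in words))  # 48 letters before coding
print(sum(len(c) for c in codes))  # 12 letters after coding
print(decode(codes, table) == words)  # True: nothing was lost
```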
Compression
Two kinds of compression:
A. Lossless (exactly reversible)
No information is lost
Compare it to a pillow from which you squeeze all the air before packing it. Take the pillow out of the packaging and it becomes exactly the pillow it was before packing.
B. Lossy (not exactly reversible)
Certain information is thrown away
Again we squeeze the air out of the pillow, but because we want an even smaller package we also remove a few feathers. This need not be a problem, since missing a few feathers may not make the pillow any less comfortable in use. However, feathers that were thrown away will not be put back when the pillow is unpacked again.
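The difference between the two kinds can be shown on a few made-up pixel values. The sketch uses zlib from the Python standard library for the lossless case; the "lossy" step is a crude rounding of pixel values chosen for illustration, not how JPEG or JPEG 2000 actually work.

```python
import zlib

# Made-up grayscale values: a light area and a dark area, repeated.
pixels = bytes([200, 201, 203, 202, 50, 52, 51, 49] * 32)

# Lossless: decompressing gives back exactly the original bytes.
packed = zlib.compress(pixels)
restored = zlib.decompress(packed)
print(len(pixels), len(packed), restored == pixels)


def quantize(data: bytes, step: int = 16) -> bytes:
    """Crude lossy step: round every value down to a multiple of `step`."""
    return bytes((value // step) * step for value in data)


# Lossy: small differences between neighbouring values are thrown away
# and cannot be recovered afterwards (the "feathers" of the analogy).
lossy = quantize(pixels)
print(lossy[:8] == pixels[:8])
```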
Compression and information loss
An often-heard claim:
Do not use lossy compression when storing images, because lossy compression causes loss of information
Lossy compression does indeed cause loss of information, but that does not by definition mean loss of meaningful information
In any case it is better to say: loss of information relative to the uncompressed file.
After all, scanning is inseparably linked to loss of information relative to the original, even when lossless compression is applied.
Lossy compression
Examples
JPEG quality 10 (300 dpi)
JPEG quality 12 (300 dpi)
JPEG quality 4 (300 dpi)
JPEG quality 4 (200 dpi)
JPEG 2000, part 6
Compression and durability
An often-heard claim:
Compressed files are more likely to become corrupt than uncompressed files. Therefore no data compression may be applied.
Research has shown that this claim is not correct.
Another approach to preservation: redundancy in storage
Compressed files in particular lend themselves well to this
We store
Resolution, compression and legibility: an example
300 dpi, high quality JPEG
200 dpi, low quality JPEG
Scans with a file size as small as possible
Comparison between file format, compression, resolution and file size
Format     Compression       Type      Resolution  Color    Avg file size  500.000 scans  %
TIFF       No                ---       300 dpi     24 bits  22,1 MB        11 TB          100%
JPEG       Quality (ps) 12   Lossy     300 dpi     24 bits  7,5 MB         3,7 TB         34%
JPEG       Quality (ps) 10   Lossy     300 dpi     24 bits  2,1 MB         1,1 TB         10%
JPEG       Quality (ps) 4    Lossy     200 dpi     24 bits  255 KB         124 GB         1,1%
JPEG       Quality (ps) 10   Lossy     400 dpi     24 bits  3,3 MB         1,6 TB         15%
JPEG2000   Part 1            Lossless  300 dpi     24 bits  12 MB          6 TB           55%
JPEG2000   Part 6            Lossy     300 dpi     24 bits  120 KB         59 GB          0,5%
We store
Comparison storage costs
File format              Storage   Costs 1 year   Costs 10 years
TIFF uncompressed        11 TB     $ 77.000       $ 770.000
JPEG 10                  1,1 TB    $ 7.700        $ 77.000
JPEG 4 (200 dpi)         124 GB    $ 868          $ 8.680
JPEG 2000 (part 1, ll)   6 TB      $ 42.000       $ 420.000
Storage of 500.000 images; average size per uncompressed scan = 22,1 MB
Price rate: $ 7.000 for 1 TB (NLD, Nov 2009), for storage in a controlled e-repository environment at two separate locations, including IT costs
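The cost columns are consistent with a rate of $ 7.000 per terabyte per year; the "per year" is our reading of the table, not something the slide states. Under that assumption, and counting 1 TB as 1000 GB, the figures can be recomputed as follows:

```python
# Assumption: the quoted price applies per terabyte per year.
PRICE_PER_TB_PER_YEAR = 7_000  # US dollars


def storage_cost(terabytes: float, years: int = 1) -> float:
    """Storage cost in dollars for a given volume and number of years."""
    return terabytes * PRICE_PER_TB_PER_YEAR * years


# Storage volumes from the table, in terabytes (124 GB taken as 0.124 TB).
volumes_tb = {
    "TIFF uncompressed": 11,
    "JPEG 10": 1.1,
    "JPEG 4 (200 dpi)": 0.124,
    "JPEG 2000 (part 1, lossless)": 6,
}

for name, tb in volumes_tb.items():
    print(name, round(storage_cost(tb)), round(storage_cost(tb, years=10)))
```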
Scans with a file size as small as possible
(File)size still does matter!
Projects with different goals, document types and partners take place at the same time
A streamlined, standardized process is indispensable when digitizing on a large scale
Guidelines and best practices often take no account of these complex factors and the number of scans to be produced
We developed a process in which large scale and flexibility are starting points
All digitization projects follow this process
Developing the reproduction process
1. Requesting digitization
2. Providing order number(s)
3. Assessing the originals
4. Preparing the originals
5. Transport
6. Scanning
7. Assigning filenames
8. Transport
9. Checking originals
10. Checking scans
11. Checking metadata
12. Originals back to repository
13. Import in controlled storage system
14. Import in metadata system
15. Export scans
16. Export metadata
17. Import scans
18. Import metadata
We Do
For all projects, at any moment, it has to be clear:
What the current status is of each unit to be digitized
Where each unit can be located
Which current and subsequent tasks are to be performed on each unit
This calls for workflow management with a user-friendly application
We developed a simple but effective workflow application in-house
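The three questions above map onto a small data structure: a unit knows its order number, its location and its position in the list of process steps. The sketch below is a hypothetical illustration of that bookkeeping, not the archive's in-house application; the class, field names and example values are assumptions.

```python
from dataclasses import dataclass

# The eighteen steps of the reproduction process, in order.
STEPS = [
    "Requesting digitization", "Providing order number(s)",
    "Assessing the originals", "Preparing the originals", "Transport",
    "Scanning", "Assigning filenames", "Transport", "Checking originals",
    "Checking scans", "Checking metadata", "Originals back to repository",
    "Import in controlled storage system", "Import in metadata system",
    "Export scans", "Export metadata", "Import scans", "Import metadata",
]


@dataclass
class Unit:
    """One unit to be digitized (hypothetical model)."""
    order_number: str
    location: str
    step: int = 0  # index into STEPS

    def current_task(self) -> str:
        return STEPS[self.step]

    def next_task(self):
        """Return the following task, or None after the last step."""
        return STEPS[self.step + 1] if self.step + 1 < len(STEPS) else None

    def advance(self, new_location=None) -> None:
        """Move the unit to the next step, optionally updating its location."""
        if self.step + 1 >= len(STEPS):
            raise ValueError("unit has already completed the last step")
        self.step += 1
        if new_location is not None:
            self.location = new_location


unit = Unit(order_number="A20758", location="repository")
unit.advance()
unit.advance(new_location="assessment room")
print(unit.order_number, unit.location, unit.current_task(), unit.next_task())
```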
In the following slides we focus on the weekly production of 10.000 scans in the digitizing on request service
1. Requesting digitization
All public files can be requested for digitization via the finding aids in the Archiefbank
Just by clicking on the “digitize” button
Production of 10.000 scans on a weekly basis
We Do
2. Providing order numbers
A unit to be digitized must be identifiable at each step of the handling process
The units therefore get a unique, meaningless order number
In practice: all units to be digitized get an order ticket
An order number is provided by the metadata management system and is the basis for:
Communication with the digitizer
Scanning
Assigning filenames
Registration of filenames
Billing by digitizer
We Do
3. Assessing the originals
The workflow system generates a list of all originals to assess from the repositories
The list is sorted by repository / shelf to make retrieval efficient
We Do
4. Checking the originals
All assessed originals are stored in a special room
In this room all checks are executed
Information about the originals in our management systems is not always complete
A rough check of the originals takes place
A. Content
Copyrights
Publicity
Privacy
If an item falls into one of these categories the request is rejected
B. Condition of the material
Items that are in such a condition that digitizing or transport could cause damage, or that are packaged in a way that makes scanning in conventional set-ups impossible, do not qualify for the standard way of digitization
We Do
Material preparation is kept to a minimum
We do:
Staples are removed as a rule
Small repairs are carried out by our restoration staff
We don’t:
The sequence of the originals as found in the repository is not checked or altered
The originals are not numbered
We Do
Not number the originals
Numbering the originals has one advantage:
The completeness of the scans (compared to the originals) can be guaranteed
But this is only true when the numbering tallies exactly, because:
A missing number in a sequence of scans leads to the conclusion that there is one original that has not been scanned
Numbers that are assigned twice lead to illogical end numbers (100 scans: scan 100 has been numbered as 99)
Experiments with numbering showed that faultless numbering cannot be achieved in practice
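To see why a single numbering mistake undermines the completeness check, consider 100 originals where one number is accidentally used twice. The sketch below is a hypothetical check, not part of the archive's tooling: it reports missing and duplicated numbers, and shows that a complete set of scans then looks as if one original were missing.

```python
from collections import Counter


def check_numbering(numbers, expected_count):
    """Return (missing, duplicates) for numbers expected to run 1..expected_count."""
    counts = Counter(numbers)
    missing = [n for n in range(1, expected_count + 1) if n not in counts]
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    return missing, duplicates


# 100 originals, all scanned, but number 42 was written on two of them,
# so the last original carries number 99 instead of 100.
numbers = list(range(1, 43)) + list(range(42, 100))
missing, duplicates = check_numbering(numbers, expected_count=100)

print(len(numbers))  # 100 scans exist
print(missing)       # [100] looks like a missing original
print(duplicates)    # [42]  the actual cause
```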
We Do
Securing completeness can be realized by other means:
Comparing scans to originals 1:1 after digitization
Scanning the originals twice
Not number the originals
Low quality: # scans = 365
High quality master files: # scans = 365
We Do
5. Transport
For secure transport, special flight cases are used
We Do
6. / 7. Scanning and assigning filenames
Filenames are the key between scans and metadata
After scanning, the scan operator or data manager has to assign filenames to the scans
It has to be perfectly clear which filenames these should be
As a rule filenames contain no meaningful information
Because when the meaning changes, the filenames would have to change too
Assigning filenames at City Archives Amsterdam
Customer request
Management systems
Order ticket: Archive 195, File 836, Order: A20758
Filename: 12 characters (first 6: order number, last 6: serial number)
Range: A20758000001 – A20758999999
Scanning the order:
A20758000001
A20758000002
A20758000003
A20758000004
A20758000005
Scan report: A20758000001, A20758000002, A20758000003, A20758000004, A20758000005
Registration of filenames
Import
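The naming scheme shown above (a six-character order number followed by a six-digit serial number) is easy to generate and to take apart again. A minimal sketch with hypothetical helper names:

```python
def make_filename(order_number: str, serial: int) -> str:
    """Build a 12-character filename: order number plus zero-padded serial."""
    if len(order_number) != 6:
        raise ValueError("order number must have 6 characters")
    if not 1 <= serial <= 999_999:
        raise ValueError("serial number must be between 1 and 999999")
    return f"{order_number}{serial:06d}"


def split_filename(name: str):
    """Split a filename back into (order number, serial number)."""
    if len(name) != 12:
        raise ValueError("filename must have 12 characters")
    return name[:6], int(name[6:])


print([make_filename("A20758", n) for n in range(1, 6)])
print(split_filename("A20758000005"))
```

Because the name carries only the order number and a counter, it stays valid when descriptions of the file change; the link to the meaning lives in the metadata system.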
10. 11. Checking scans and metadata
Scans and metadata are checked efficiently
Where possible checks are automated
An application from which all checks can be executed is in development
Basic checks
Check                  Method
Viruses                Virus checker
Data integrity         MD5 checksum comparison
File format validity   JHOVE
Quality of scans       Visual check of reference scans; visual check of production scans
Completeness           Depends on project
Filenames              Script
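The data integrity check can be sketched with hashlib from the Python standard library: compute the MD5 checksum of the received file and compare it with the checksum supplied by the digitizer. How the checksums are exchanged is not described in the slides; the function names and the idea of a supplied hex string are assumptions.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: str, expected_hex: str) -> bool:
    """True when the file's checksum equals the one supplied with the delivery."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use constant, which matters when master files are tens of megabytes each and arrive by the thousand.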
We Do
13. 14. Import metadata and scans into management systems
After all checks have been approved, scans and metadata are imported into the management systems
The imports are executed automatically, on the basis of scripts and standard protocols for file transfer
After import the “order for digitization” of each unit is completed
We Do
18. Import metadata into the website
The website is hosted at an external location
Metadata are uploaded to the web server by simple HTTP transfer
From any workstation at the archive, directly via the CMS of the website
For exchange of finding aids we use EAD
After import the metadata are optimized for the search system
We Do
17. Import scans into the website
Transport medium
The bandwidth of the internet connections at the archive is still too small for direct sFTP (or suchlike) upload of large quantities of scans to the web server
It seems likely that this will change in the near future
Until then scans are transported on portable USB hard disks
We Do
17. Import scans into the website
Import
After connecting the hard disk to the server the import process starts
Some basic checks are executed on the scans
Derivatives for thumbnails and zoom / contrast functionality are made
We Do
Request completed
When both scans and metadata have been imported, an email is automatically sent to the requester
This email contains a link to the finding aid and thumbnails on the website
The requester can decide whether to buy scans or not
The happy customer:
MARAC Conference October 30 2009
Costs and income
Costs Archiefbank (2008)
Digitization on request    € 140,000
Web services               € 52,000
Digitization projects      € 200,000
Income Archiefbank (2008)
Digitization on request    € 100,000
Project funding            € 330,350
Government                 € 40,000