
NDE

Digitaal Erfgoed Bruikbaar

Distributed network of digital heritage information

- high-level functional design -

Version: 7 July 2017 - draft.

The latest version of this document is available on GitHub. See: https://github.com/netwerk-digitaal-erfgoed/high-level-design


Contents

1. Introduction
2. Status of this document
3. Audience
4. Glossary
    4.1 Collection
    4.2 Cultural heritage institution
    4.3 Cultural heritage object
    4.4 Dataset
    4.5 Dataset profile
    4.6 Object
    4.7 Object description
    4.8 Object profile
    4.9 Organization
    4.10 Organization profile
    4.11 Provenance
    4.12 Provider
    4.13 Term
    4.14 Term description
    4.15 Terminology source
5. Design considerations
    5.1 Simplicity
    5.2 Use terms from terminology sources
    5.3 Refer to information
    5.4 Make information available at the source
    5.5 Be neutral about the nature of information
    5.6 Use open standards for information integration
    5.7 Collect information selectively
6. Building blocks
    6.1 Back end building blocks
        6.1.1 Terminology management system
        6.1.2 Collection management system
        6.1.3 Aggregator
    6.2 Cross-domain building blocks
        6.2.1 Registry
        6.2.2 Network of Terms
        6.2.3 Knowledge Graph
    6.3 Front end building blocks
        6.3.1 Browser
        6.3.2 Service portal
7. Functions
    7.1 Registry
        7.1.1 Register organization profiles
        7.1.2 Register dataset profiles
            7.1.2.1 Register profiles of datasets of term descriptions
            7.1.2.3 Register profiles of datasets of object descriptions
        7.1.3 Register object profiles
        7.1.4 Expose profiles
        7.1.5 Exchange profiles with distributed registries
    7.2 Network of Terms
        7.2.1 Retrieve dataset profiles from Registry
        7.2.2 Load terms in datasets
        7.2.3 Search terms in datasets of external terminology sources
        7.2.4 Search terms by collection manager
        7.2.5 Propose terms by collection manager
        7.2.6 Exchange terms with distributed networks of terms
    7.3 Knowledge Graph
        7.3.1 Retrieve profiles from Registry
        7.3.2 Verify relations in profiles
        7.3.3 Typify objects descriptions
        7.3.4 Retrieve related terms from Network of Terms
        7.3.5 Search for relations by browser
        7.3.6 Exchange relations with distributed knowledge graphs
    7.4 Browser
        7.4.1 Select identifiers from Knowledge Graph
        7.4.2 Retrieve term descriptions from terminology management systems
        7.4.3 Retrieve object descriptions from collection management systems
        7.4.4 Translate user queries to term identifiers


1. Introduction

This document outlines the high-level functional design of the distributed network of digital heritage information. The document was commissioned by work program Usable of the Digital Heritage Network (NDE), a partnership between cultural heritage institutions in the Netherlands. The document describes the design of a new, cross-domain infrastructure for improving the usability of digital heritage information beyond the boundaries of archives, libraries, museums and research institutes.

The design is high-level because it describes the distributed network in general terms, not its details. The design is functional because it describes the functionality of the distributed network, not for instance its organizational or technical considerations. Other types of designs - detailed, non-functional - will be made in the next steps of the program.

The document builds upon three other documents:

1. The National Digital Heritage Strategy offers a perspective on developing a national, cross-sector infrastructure of digital heritage facilities.

2. The Digital Heritage Reference Architecture (DERA 1.0; in Dutch) provides a coherent set of architectural goals, principles and requirements that enable cultural heritage institutions to work together in a shared infrastructure.

3. The position paper Towards a distributed network of heritage information (Word document) proposes an approach for connecting heritage information using state of the art concepts and technologies, such as Linked Data.

2. Status of this document

This is a working draft. It may be updated at any time, evolving as we go. It is published for examination, experimental implementation and evaluation. Feedback is highly appreciated.

This document lays a - mostly theoretical - foundation for the future development of the infrastructure. In the coming months the ideas in this document will be applied and validated in various projects of cultural heritage institutions and related organizations. The outcomes will be used to strengthen the design.

The latest version of this document is available on GitHub.


3. Audience

The intended audience of this document is threefold:

1. IT strategists, architects, analysts and developers from organizations that publish or use heritage information. For example: cultural heritage institutions, service portals or suppliers of IT solutions.

2. Product owners and product managers from organizations that publish or use heritage information. For example: cultural heritage institutions, terminology source providers or aggregator providers.

3. Researchers from the academic community. We hope that researchers will amend and improve the design in this document with insights from their own research findings.

The document assumes knowledge about topics such as aggregation, web architecture, publishing data on the web and Linked Data.


4. Glossary

The glossary explains the key terms used in this design. The definitions will be discussed with partners in the cultural heritage community in order to connect to existing glossaries.

4.1 Collection

A group of cultural heritage objects to be seen, studied, or kept together. For example: the paintings of a museum, the books of a library or the records of an archive. A collection has meaning to its maintainers and users; it deals, for instance, with a certain topic. A collection can be available in one or more datasets, suitable for processing by the distributed network.

4.2 Cultural heritage institution

An organization that curates a collection of cultural heritage objects. For example: an archive, a library, a museum or a research institute that maintains a cultural heritage collection.

4.3 Cultural heritage object

A heritage asset, physical or born-digital, curated or made available by a cultural heritage institution. For example: a book, an article in an electronic journal, an archive of historical records, a monument or a painting.

4.4 Dataset

A selection of data, published or curated by a single agent, and available for access or download in one or more formats [1]. For example: a dataset can contain object descriptions, term descriptions or links between two datasets ("alignments").

4.5 Dataset profile

The metadata of a dataset describing its administrative, descriptive and structural characteristics. For example: the identifier, title, curating organization, language, terms or provenance information of a dataset.

4.6 Object

An item of interest to cultural heritage, physical or born-digital. An object can be anything, depending on the curating institution: it can be a cultural heritage object or stem from a cultural heritage object. For example: a monument (such as a fortification) or the parts of a monument (such as the gates, towers or walls of a fortification). Or a magazine, the articles in the magazine or the paragraphs in an article of the magazine. Or a set of archival records or an annotation by an expert about these records. An object typically has an object description.

4.7 Object description

The administrative, descriptive and structural metadata of an object. For example: the title, date of creation, terms or provenance information of an object. An object description is maintained in a collection management system.

[1] This definition is based on DCAT's. However, the DCAT term "collection" is omitted intentionally:


4.8 Object profile

A small subset of the metadata of an object containing its URI and the URIs of the terms that characterize the object, the datasets to which the object belongs and other, related objects (for instance, the parts of a multi-volume book).
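To make the definition concrete, an object profile could be pictured as a small record that holds nothing but URIs. The Python sketch below is an illustration only, not part of the design: the class name, the field names and all example.org URIs are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectProfile:
    """Hypothetical object profile: a handful of URIs, not the full object description."""

    object_uri: str
    term_uris: tuple[str, ...] = ()            # terms that characterize the object
    dataset_uris: tuple[str, ...] = ()         # datasets to which the object belongs
    related_object_uris: tuple[str, ...] = ()  # e.g. the parts of a multi-volume book

    def all_uris(self) -> set[str]:
        """Return every URI mentioned in the profile."""
        return {
            self.object_uri,
            *self.term_uris,
            *self.dataset_uris,
            *self.related_object_uris,
        }


# Made-up URIs for the first volume of a two-volume book.
profile = ObjectProfile(
    object_uri="http://example.org/objects/book-vol-1",
    term_uris=("http://example.org/terms/multatuli",),
    dataset_uris=("http://example.org/datasets/books",),
    related_object_uris=("http://example.org/objects/book-vol-2",),
)
```

The full object description (title, date of creation and so on) stays in the collection management system; the profile carries only the pointers.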

4.9 Organization

A body that represents a collection of people organized together into a community or other social, commercial or political structure [2]. For example: a cultural heritage institution, an aggregator provider or a terminology source provider.

4.10 Organization profile

The metadata of an organization describing its administrative, descriptive and structural characteristics. For example: the identifier, name or address of an organization.

4.11 Provenance

A statement of any changes in ownership and custody of the resource since its creation that are significant for its authenticity, integrity and interpretation. The statement may include a description of any changes successive custodians made to the resource [3].

4.12 Provider

An organization that contributes data to the distributed network of heritage information. For example: a cultural heritage institution, aggregator provider or terminology source provider.

4.13 Term

A word or phrase used to describe a thing or to express a concept [4]. For example: the name of a person or place, the title of a work or the subject of a painting. A term typically is part of an object description; a term supports the object’s discoverability for users.

4.14 Term description

The administrative, descriptive and structural metadata of a term. For example: the identifier, label, language, scope notes or place in hierarchy of a term. A term description is maintained in a terminology management system.

4.15 Terminology source

A structured list of terms. For example: a classification system of subjects, a thesaurus of persons or an authority list of places or time periods. A terminology source has a controlled vocabulary; its terms are predefined and approved by the maintainer of the source. A terminology source is maintained by a terminology source provider.

[2] https://www.w3.org/TR/2014/REC-vocab-org-20140116/#org:Organization
[3] http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#provenance
[4] https://en.oxforddictionaries.com/definition/term


5. Design considerations

The design of the distributed network of heritage information is guided by a number of design considerations. Some of these considerations elaborate on principles of the Digital Heritage Reference Architecture (DERA); others are new and could supplement the DERA.

5.1 Simplicity

Designing, developing and operating any system is complex. This also holds true for a new infrastructure of heritage information. The distributed network aims to be as simple as possible in order to minimize and control complexity. The distributed network aims to be lightweight, with minimal functionality (“just enough”). As Edward de Bono puts it: “To get simplicity you have to want to get it. To want to get simplicity you have to put a high value on simplicity.”

5.2 Use terms from terminology sources

The use of controlled terms for describing objects is not commonplace among cultural heritage institutions in the Netherlands. One institution uses terms from an authoritative terminology source, another institution uses its own, home-grown terms and yet another institution doesn’t use terms at all. This diversity severely hinders the findability of their information, especially outside of their institution. We want to improve this.

The distributed network makes terms from established terminology sources easily accessible for use. This enables institutions to describe their objects adequately and enables service portals to find these objects easily and unambiguously.

5.3 Refer to information

The current infrastructure of heritage information makes it hard to find information cross-domain. Most information resides in a particular domain, stored in domain-specific aggregators. Other information is available across domains, but mostly stored in thematic aggregators that hold specific subsets of information. We want to improve this.

The distributed network makes information findable cross-domain by registering the locations where information can be found. Similar to the index of a book, which uses pointers that refer to sections in the book, the distributed network uses pointers that refer to information. This makes for a lightweight infrastructure: it does not have to aggregate the information from all institutions in all domains to allow a service portal to find information cross-domain, it simply has to register the places where a portal needs to look.
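The book-index analogy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of the actual infrastructure: the function names and all example.org URIs are hypothetical, and a real registry would persist and expose its pointers rather than keep them in memory.

```python
from collections import defaultdict

# Hypothetical index: term URI -> locations (URLs) where related information can be found.
# The index holds pointers only; the information itself stays at the source.
pointer_index: dict[str, set[str]] = defaultdict(set)


def register_pointer(term_uri: str, location: str) -> None:
    """Record that information about `term_uri` can be found at `location`."""
    pointer_index[term_uri].add(location)


def where_to_look(term_uri: str) -> set[str]:
    """Return the registered locations for a term, or an empty set if none are known."""
    return set(pointer_index.get(term_uri, set()))


# Two institutions in different domains register pointers for the same term.
TERM = "http://example.org/terms/dutch-east-indies"
register_pointer(TERM, "http://museum.example.org/objects")
register_pointer(TERM, "http://archive.example.org/records")
```

A portal that asks `where_to_look(TERM)` receives both locations and can then fetch the information from each source itself.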

5.4 Make information available at the source

The current infrastructure of heritage information predominantly relies on aggregation. Information is copied from one system to another, producing duplicates or derivatives in various systems with various owners. This is inefficient and ineffective. We want to improve this.

The distributed network makes information available in a distributed way, at the source: a cultural heritage institution or a terminology source provider. Information lives in this source; this is where information is maintained and where experts do their work. Service portals that want to use information from a source can get it by addressing the system of the source via standardized protocols for data exchange.

5.5 Be neutral about the nature of information

The distributed network of heritage information is a new infrastructure for exchanging information. Yet the institutions that constitute the Digital Heritage Network vary greatly, ranging from archives to libraries and from museums to research institutes. This poses a challenge: how can an infrastructure support a broad range of organizations with different types of objects and different perspectives on data modeling?

The distributed network wants to solve this by being neutral about the nature of information. The distributed network doesn’t have an overarching data model or super ontology with which institutions must comply. Instead, the distributed network aims to be unopinionated, respecting both the depth and breadth that institutions apply in their information.

5.6 Use open standards for information integration

The current infrastructure of heritage information consists of information silos. Information systems, such as collection management and terminology management systems, use different formats and protocols for exposing information. For example: one system implements open standards, the other proprietary standards and yet another homegrown solutions. These differences cause incompatibilities and make it difficult or impossible to integrate information from various sources - a serious impediment to cross-domain interoperability. We want to improve this.

The distributed network fuels information integration by applying open standards. Open standards allow information to be shared and reused across system, organization and domain boundaries. The distributed network especially adheres to Linked Data principles and methods, for example by using RDF for data interchange, HTTP for data transport and URIs for resource identification and retrieval.
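As a small illustration of RDF as an interchange format with URIs as identifiers, the Python sketch below serializes one statement about a term as an N-Triples line. It is a minimal sketch, not a complete serializer: the subject URI is a made-up example.org address, only a few escape sequences are handled, and the predicate is the `skos:prefLabel` property from the SKOS vocabulary.

```python
# The SKOS property for the preferred label of a concept.
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"


def escape_literal(value: str) -> str:
    """Escape backslashes, double quotes and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def label_triple(subject_uri: str, label: str, lang: str) -> str:
    """Serialize '<subject> skos:prefLabel "label"@lang' as one N-Triples line."""
    return f'<{subject_uri}> <{SKOS_PREF_LABEL}> "{escape_literal(label)}"@{lang} .'


# Hypothetical term URI; a real terminology source would mint its own identifiers.
line = label_triple("http://example.org/terms/multatuli", "Multatuli", "nl")
```

Because subject and predicate are both URIs, any consumer can look them up over HTTP to learn more, regardless of which system produced the line.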

5.7 Collect information selectively

The current infrastructure of heritage information bears a heavy burden. Its aggregators hold an abundance of data and its service portals reharvest or query this data to help users find specific bits - needles in a haystack. We want to improve this.

The distributed network enables portals to select and collect only the information they want - and if they want more, they can follow their nose and look for additional information. Unlike the distributed network, which is largely unopinionated about information, a portal can be as opinionated as it wants to be in order to deliver a meaningful service to its users.


6. Building blocks

The distributed network of heritage information consists of building blocks. A building block, according to The Open Group Architecture Framework (TOGAF), is a package of functionality defined to meet business needs, with published interfaces to access the functionality [5]. There are eight building blocks, divided into three groups or layers: back end building blocks, cross-domain building blocks and front end building blocks. The diagram underneath depicts the blocks:

[Diagram: the building blocks of the distributed network, grouped into three layers]

The blue blocks are back end building blocks. The green blocks are cross-domain building blocks. The red blocks are front end building blocks. Each block is described hereafter.

6.1 Back end building blocks

6.1.1 Terminology management system

A terminology management system manages and publishes the term descriptions of a terminology source. It is used by a terminology source maintainer for describing terms in the source, for example by assigning labels or scope notes.

A terminology management system is either purpose-built or generic. Purpose-built systems include RKDartists& and the shared automated cataloging system of the Dutch libraries (GGC). Generic systems include OpenSKOS and PoolParty, systems that can be used for all sorts of terminology sources, such as the Common Thesaurus for Audiovisual Archives (GTAA) and the Erfgoedthesaurus.

[5] http://pubs.opengroup.org/architecture/togaf8-doc/arch/chap32.html


A terminology management system can contain different types of terminology sources. For example: a terminology source can be national (such as RKDartists&) or international (such as DBpedia). Another example: a terminology source can be self-contained (such as the Art & Architecture Thesaurus (AAT)) or an adaptation, extension or fusion of one or more sources (such as the Visual Thesaurus for Fashion and Costumes (VTMK)). A terminology management system is compliant with the distributed network if it can publish its term descriptions as Linked Open Data in a standardized term vocabulary, such as SKOS.

6.1.2 Collection management system

A collection management system - or collection registration system - manages and publishes the object descriptions of a cultural heritage institution. It is used by a collection manager, such as a cataloger, curator or librarian, for describing objects in the collection of the institution, for example by assigning terms from terminology sources.

A collection management system is compliant with the distributed network if it meets three conditions. First, it can assist a collection manager in finding appropriate terms for describing objects by querying the Network of Terms and storing the matching identifiers as part of the object descriptions. Second, it can publish its object descriptions as Linked Open Data in a domain-specific or domain-independent vocabulary, such as Encoded Archival Description (EAD), Europeana Data Model (EDM) or Schema.org. Third, it can register organization, dataset and object profiles in the Registry.

A cultural heritage institution can use an aggregator if its collection management system is not yet able to publish Linked Data or register profiles on its own.

6.1.3 Aggregator

An aggregator - or harvester - gathers together the metadata of objects from a number of providers into a combined data store. A user can then query the aggregator uniformly, independent of how the providers had originally delivered their data. An aggregator typically uses a protocol such as the OAI Protocol for Metadata Harvesting (OAI-PMH).

An aggregator collects specific types of information from designated institutions. For example: the regional aggregator Collectie Gelderland collects information from institutions in its province, whereas the national aggregator Digitale Collectie collects information from institutions all over the country. Another example: the thematic aggregator Netwerk Oorlogsbronnen collects information related to World War II from a variety of institutions, whereas the domain aggregator Archives Portal Europe (APE) collects information solely from archival institutions.

An aggregator is compliant with the distributed network if it meets two conditions. First, it can publish its object descriptions as Linked Open Data in a domain-specific or domain-independent vocabulary. Second, it can register object profiles of its providers in the Registry.

The role of an aggregator in the distributed network of heritage information will change in time, depending on the capabilities of the collection management systems of cultural heritage institutions. First, an aggregator is necessary for institutions whose collection management systems cannot publish their object descriptions as Linked Data. An aggregator does this for them. Second, an aggregator is necessary for institutions whose collection management systems can publish Linked Data, but have limited possibilities in doing so, for example when exporting or querying object descriptions. A specific Linked Data aggregator - also known as a Linked Data platform or Linked Data wrapper - assists in exposing Linked Data. Third, an aggregator is no longer needed for institutions whose collection management systems can offer their object descriptions as full-fledged Linked Data. An aggregator may then become a browser or service portal.

6.2 Cross-domain building blocks

6.2.1 Registry

The Registry is a cross-domain building block for registering profiles. It contains three types: organization profiles, dataset profiles and object profiles. These profiles are registered in the Registry by their maintainers, notably cultural heritage institutions. The profiles as a whole show which organizations, datasets and objects constitute the distributed network of heritage information.

A profile principally consists of identifiers or URIs. Identifiers point to the places where information about organizations, datasets and objects can be found. The goal of a profile is to facilitate discovery: its identifiers are starting points that enable users of the Registry to find relevant heritage information.

Each institution decides for itself which profiles it wants to register in the Registry. The Registry does not impose conditions on the content, type or purpose of the contributions by institutions. For example: one institution may register multiple dataset profiles in the Registry, one for each dataset, whereas another institution may register one dataset profile of just one dataset.

The Registry can be a distributed service. Rather than having one central service, multiple, autonomous registries can exist for specific purposes. For example: the cultural heritage institutions in the province of Friesland can register their profiles in a Frisian Registry or the archives in the Netherlands can register their profiles in an Archive Registry. These registries can then exchange profiles using the infrastructure of the distributed network.

The functions of the Registry are explained in section “Functions”.

6.2.2 Network of Terms

The Network of Terms is a cross-domain building block for finding terms. It can for instance be used by collection managers of institutions: they can query the Network of Terms and select terms that describe their objects.

The Network of Terms exposes two types of information. First, the term descriptions of selected terminology sources. For example: National Thesaurus for Author Names (NTA) or RKDartists&. Second, the relations between terms, such as synonyms or hypernyms. If a relation exists in a terminology source, the Network of Terms will use it. For example: the NTA refers in its description of Multatuli to another description - that of the actual author, Eduard Douwes Dekker. If a relation does not exist in a terminology source, the maintainer of the Network of Terms can create it. For example: both the NTA and DBpedia have a description of Multatuli, but there is no relation between these descriptions; the sources do not refer to each other. The maintainer can make this relation explicit by registering it in the Registry; the Network of Terms will then adopt it.

A terminology source can be incorporated in the Network of Terms if it meets a number of conditions. First, it is relevant to the Dutch cultural heritage sector and curated by a Dutch cultural heritage institution. Second, it is standardized and widely used. Third, it is publicly available and easily accessible, without financial, technical or legal constraints. This implies that the Network of Terms does not contain terms of proprietary sources nor of institution-specific sources that aren’t used outside of a particular institution. It also implies that the Network of Terms does not contain terms of international sources such as DBpedia, GeoNames or VIAF: incorporating such sources would make the Network too heavy. However, the Network can be a mediator. This allows users to look for terms in external sources by querying the Network, without having to address the sources directly.

The Network of Terms can be a distributed service. Rather than having one central service, multiple, autonomous networks of terms can exist for specific purposes. For example: the terminology sources used by libraries can be put in a Network of Terms for Libraries and the sources used by museums in a Network of Terms for Museums. These networks of terms can then exchange term descriptions using the infrastructure of the distributed network.

The functions of the Network of Terms are explained in section “Functions”.
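The two kinds of information exposed by the Network of Terms, term descriptions and relations between terms, can be sketched as follows. This is a simplified Python illustration under stated assumptions: the class and method names and all URIs are hypothetical, and the maintainer's relation is added directly to the in-memory store, whereas the design has it registered in the Registry first.

```python
from dataclasses import dataclass, field


@dataclass
class TermDescription:
    uri: str
    label: str
    source: str                                     # e.g. "NTA" or "DBpedia"
    related: set[str] = field(default_factory=set)  # URIs of directly related terms


class NetworkOfTerms:
    """Hypothetical in-memory stand-in for the Network of Terms."""

    def __init__(self) -> None:
        self._terms: dict[str, TermDescription] = {}

    def load(self, term: TermDescription) -> None:
        self._terms[term.uri] = term

    def add_relation(self, uri_a: str, uri_b: str) -> None:
        """Record a relation that is absent from the sources, in both directions."""
        self._terms[uri_a].related.add(uri_b)
        self._terms[uri_b].related.add(uri_a)

    def search(self, query: str) -> list[TermDescription]:
        """Case-insensitive label search, sorted by URI for stable results."""
        q = query.casefold()
        hits = (t for t in self._terms.values() if q in t.label.casefold())
        return sorted(hits, key=lambda t: t.uri)


NTA_MULTATULI = "http://example.org/nta/multatuli"
NTA_DEKKER = "http://example.org/nta/eduard-douwes-dekker"
DBPEDIA_MULTATULI = "http://example.org/dbpedia/multatuli"

network = NetworkOfTerms()
# The relation between pen name and actual author already exists in the source.
network.load(TermDescription(NTA_MULTATULI, "Multatuli", "NTA", {NTA_DEKKER}))
network.load(TermDescription(NTA_DEKKER, "Eduard Douwes Dekker", "NTA", {NTA_MULTATULI}))
network.load(TermDescription(DBPEDIA_MULTATULI, "Multatuli", "DBpedia"))
# The relation between the two sources is made explicit by the maintainer.
network.add_relation(NTA_MULTATULI, DBPEDIA_MULTATULI)
```

A collection manager searching for "multatuli" gets both descriptions back and can see from `related` that they refer to each other.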

6.2.3 Knowledge Graph

The Knowledge Graph is a cross-domain building block for finding relations between organizations, datasets, objects and terms. The Knowledge Graph gathers these relations by retrieving the profiles from the Registry and extracting the URIs inside the profiles. The Knowledge Graph then relates and stores these URIs. Subsequently other building blocks, such as a browser or service portal, can query the Knowledge Graph and discover relations. For example: a browser can request the Knowledge Graph to find all URIs of datasets or objects that are related to the subject term Dutch East Indies and belong to the organizations Museum Bronbeek and National Archives.

The Network of Terms and Knowledge Graph serve a distinct purpose. Although both building blocks have information about terms, the Network of Terms exposes all terms of terminology sources regardless of the institutions that use the terms, whereas the Knowledge Graph exposes the terms that are actually being used by institutions and the terms that have a direct relation with these terms, such as identical, related, broader or narrower terms.

For example: the Network of Terms may contain the author terms Eduard Douwes Dekker and his pen name Multatuli, both stemming from the National Thesaurus for Author Names. An institution that owns a copy of Dekker’s book Max Havelaar may use just the term Multatuli in the book’s object description. The Knowledge Graph then knows of Multatuli and also of his orthonym Eduard Douwes Dekker: there is a direct relation between the author terms in the NTA that the Knowledge Graph can use.

Another example: the Network of Terms may contain the cities Eindhoven and Roosendaal and their province Noord-Brabant, originating from the Erfgoedthesaurus. An institution that maintains a monument in Eindhoven may use that term in its object description. The Knowledge Graph then knows of the city and of its province, a broader term. However, the Knowledge Graph doesn’t know of Roosendaal: there is no direct relation between Eindhoven and Roosendaal and no object description referencing Roosendaal.

The Knowledge Graph can be a distributed service. Rather than having one central service, multiple, autonomous knowledge graphs can exist for specific purposes. For example: the relations of libraries can be put in a Knowledge Graph for Libraries and the relations of museums in a Knowledge Graph for Museums. These knowledge graphs can then exchange relations using the infrastructure of the distributed network.

The functions of the Knowledge Graph are explained in section “Functions”.
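The rule in the two examples above, that the Knowledge Graph knows the terms in use plus the terms directly related to them, can be expressed as a one-hop expansion. The Python sketch below is an illustration only: it uses plain labels instead of URIs and treats every relation as an undirected pair, both of which are simplifying assumptions.

```python
def known_terms(used_terms: set[str], relations: set[tuple[str, str]]) -> set[str]:
    """Return the terms in use plus every term exactly one relation away from them."""
    known = set(used_terms)
    for a, b in relations:
        if a in used_terms:
            known.add(b)
        if b in used_terms:
            known.add(a)
    return known


# Direct relations as they might exist in the terminology sources.
relations = {
    ("Eindhoven", "Noord-Brabant"),            # city -> broader term (province)
    ("Roosendaal", "Noord-Brabant"),
    ("Multatuli", "Eduard Douwes Dekker"),     # pen name -> orthonym
}

# Terms that actually occur in object descriptions.
from_monument = known_terms({"Eindhoven"}, relations)
from_book = known_terms({"Multatuli"}, relations)
```

Roosendaal is two relations away from Eindhoven (via Noord-Brabant), so it is absent from `from_monument`, matching the example in the text.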

6.3 Front end building blocks

6.3.1 Browser

A browser is a building block for discovering heritage information for a service portal. The browser retrieves the terms, relations and profiles from the cross-domain building blocks, notably the Knowledge Graph. The browser then retrieves the associated object and term descriptions from the collection and terminology management systems of the providers. The browser saves the information in its data store, ready for querying by service portals.

A browser provides decoupling: a service portal can use it to get information without having to know and implement the interfaces of the underlying building blocks. A browser also makes it easier to develop new service portals. For example: an organization may use it to build both a website and an app - the foundation is the same.

A browser is different from an aggregator. Although both building blocks collect information, an aggregator generally harvests all available information from providers, without knowing what information is relevant to its users. Instead, the browser collects information selectively, depending on the need of its users, the service portals. This turns harvesting into a directed, service-driven process.

A browser supports a number of protocols for retrieving information from providers, depending on the capabilities of their systems. For example: the collection management system of one institution can be queried periodically via a commonly used protocol such as OAI-PMH whereas the system of another institution can be queried in real-time via a modern protocol such as Linked Data Fragments (LDF). Eventually the network will evolve into an open and real-time browsable network of distributed heritage information based on LDF and other state of the art protocols.

A basic browser is going to be developed as a cross-domain building block, suitable for common use cases of service portals. A service portal can develop its own browser if it has specific demands.
The functions of the browser are explained in section “Functions”.
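How a browser might pick a retrieval protocol per provider can be sketched as follows. This Python fragment is a hedged illustration, not the browser's design: the `Provider` class, the function names and the preference order (real-time LDF before periodic OAI-PMH) are assumptions, and no network requests are made.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provider:
    name: str
    protocols: frozenset[str]  # protocols the provider's system supports


# Assumed preference: real-time querying first, periodic harvesting as fallback.
PREFERENCE = ("LDF", "OAI-PMH")


def choose_protocol(provider: Provider) -> Optional[str]:
    """Return the most preferred protocol the provider supports, or None."""
    for protocol in PREFERENCE:
        if protocol in provider.protocols:
            return protocol
    return None


def plan_retrieval(
    selected: dict[str, list[str]], providers: dict[str, Provider]
) -> list[tuple[str, str, list[str]]]:
    """Plan retrieval for the selected URIs only: (provider, protocol, URIs) per provider."""
    plan = []
    for name, uris in sorted(selected.items()):
        protocol = choose_protocol(providers[name])
        if protocol is not None:
            plan.append((name, protocol, uris))
    return plan


providers = {
    "museum-a": Provider("museum-a", frozenset({"OAI-PMH"})),
    "archive-b": Provider("archive-b", frozenset({"OAI-PMH", "LDF"})),
}
# URIs selected earlier, for instance from the Knowledge Graph (made-up values).
selected = {
    "museum-a": ["http://example.org/museum-a/objects/1"],
    "archive-b": ["http://example.org/archive-b/records/7"],
}
plan = plan_retrieval(selected, providers)
```

Only the URIs in `selected` end up in the plan, which is what distinguishes this directed process from harvesting everything a provider offers.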


6.3.2 Service portal

A service portal is a building block that provides a service to users. A service portal uses the cross-domain building blocks of the distributed network, typically by means of a browser.

A service portal can be developed by anyone. For example: cultural heritage institutions, non-profit organizations, governments or businesses. A service portal can be developed for anyone. For example: the general public, exhibition designers, journalists, scientists, students or teachers.

There are two types of service portals. First, a user interface. A user interface is operated by an end user for finding and presenting heritage information. For example: a website that can be used by visitors of a museum to browse the collection or an app that can be used by students to explore the paintings of Rembrandt. Second, a service platform. A service platform combines and enriches heritage information and makes it usable in a specific context, depending on its goals and users. For example: Digitale Collectie, Europeana or Netwerk Oorlogsbronnen. A service platform may or may not have a user interface where its information is accessible to the public. A service platform without user interface is known as a dark portal.


7. Functions

This section outlines the primary functions of the cross-domain building blocks. The functions are a refinement of the descriptions in section “Building blocks”. A function is described only insofar as it interacts with another building block.

7.1 Registry

7.1.1 Register organization profiles

The Registry keeps a list of the organizations that contribute data to the distributed network of heritage information. For example: cultural heritage institutions, aggregator providers or terminology source providers. Each organization has a distinct organization profile, containing for instance its name and identifier. This allows users of the Registry to identify an organization unambiguously.

A maintainer of an organization is responsible for maintaining the profile of his organization. There are two ways of doing this. First, the maintainer can use his registration system - such as a collection management system or terminology management system - to make changes to his profile; the registration system then updates the profile in the Registry. Second, the maintainer can log on to the administration interface of the Registry and make changes there.

The Registry assigns identifiers to the organization profiles. These identifiers can then be used for retrieving the profiles. For example, the identifier of museum A can look like this: http://registry.nde.nl/organizations/a. If a user looks up this URI, the Registry returns the profile of A. If an organization has its own organization identifier - persistent and dereferenceable - it can be added to the profile so as to signify the relation between both.
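The assign-and-look-up behaviour described above can be sketched in Python. This is a minimal in-memory illustration, not the Registry's interface: the class, its method names, the `slug` parameter and the profile keys are assumptions; only the URI pattern is taken from the example in the text.

```python
from typing import Optional

REGISTRY_BASE = "http://registry.nde.nl"  # base of the example URI in the text


class Registry:
    """Hypothetical in-memory registry of organization profiles."""

    def __init__(self) -> None:
        self._profiles: dict[str, dict[str, Optional[str]]] = {}

    def register_organization(
        self, slug: str, name: str, own_identifier: Optional[str] = None
    ) -> str:
        """Store a profile and return the identifier the Registry assigns to it."""
        uri = f"{REGISTRY_BASE}/organizations/{slug}"
        self._profiles[uri] = {
            "identifier": uri,
            "name": name,
            # The organization's own persistent identifier, if it has one.
            "own_identifier": own_identifier,
        }
        return uri

    def lookup(self, uri: str) -> Optional[dict[str, Optional[str]]]:
        """Return the profile registered under `uri`, or None if it is unknown."""
        return self._profiles.get(uri)


registry = Registry()
uri_a = registry.register_organization(
    "a", "Museum A", own_identifier="http://example.org/id/museum-a"
)
profile_a = registry.lookup(uri_a)
```

In a real deployment the lookup would be an HTTP request for the URI; here a dictionary access stands in for it.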

7.1.2 Register dataset profiles The Registry keeps a list of the datasets that are available to the distributed network of heritage information. Each dataset has a distinct dataset profile. The profile contains administrative, descriptive and structural data about the dataset. For example: identifier, title and curating organization. This allows users of the Registry to identify a dataset unambiguously. The actual content of a dataset, however, is not part of the profile. Instead, the profile contains a reference - a URL - to the location where the content can be found. Currently we distinguish four dataset profile types:

1. A profile for a dataset of term descriptions
2. A profile for a dataset of alignments
3. A profile for a dataset of object descriptions
4. A profile for a dataset of annotations

Each type is explained below. A maintainer of an organization is responsible for maintaining the profiles of his datasets. There are two ways of doing this. First, the maintainer can use his registration system - such as a collection management system or terminology management system - to make changes to his profiles; the registration system then updates the profiles in the Registry. Second, the maintainer can log on to the administration interface of the Registry and make changes there.

The Registry assigns identifiers to the dataset profiles. These identifiers can then be used for retrieving the profiles. For example, the identifier of dataset B of organization A can look like this: http://registry.nde.nl/organizations/a/datasets/b. If a user looks up this URI, the Registry returns the profile of B. If an organization has its own dataset identifiers - persistent and dereferenceable - these can be added to the dataset profiles to signify the relation between the two identifiers.
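
The identifier scheme for organization and dataset profiles can be sketched as follows. This is a minimal illustration, not the Registry's actual design: the class and method names (Registry, register_organization, register_dataset, lookup) and the slug parameters are assumptions made for the example; only the URI patterns come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

BASE = "http://registry.nde.nl"  # base URI taken from the examples in the text


@dataclass
class Profile:
    uri: str                              # identifier assigned by the Registry
    title: str
    own_identifier: Optional[str] = None  # the organization's own persistent URI, if any


@dataclass
class Registry:
    """Hypothetical in-memory registry; a real one would persist its profiles."""
    profiles: Dict[str, Profile] = field(default_factory=dict)

    def register_organization(self, slug, name, own_identifier=None):
        uri = f"{BASE}/organizations/{slug}"
        self.profiles[uri] = Profile(uri, name, own_identifier)
        return uri

    def register_dataset(self, org_slug, slug, title, own_identifier=None):
        org_uri = f"{BASE}/organizations/{org_slug}"
        if org_uri not in self.profiles:
            raise KeyError(f"unknown organization: {org_uri}")
        uri = f"{org_uri}/datasets/{slug}"
        self.profiles[uri] = Profile(uri, title, own_identifier)
        return uri

    def lookup(self, uri):
        """Looking up a Registry-assigned URI returns the profile."""
        return self.profiles[uri]
```

In this sketch a dataset can only be registered under an organization that already has a profile, which mirrors the nesting of the two URI patterns.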

7.1.2.1 Register profiles of datasets of term descriptions The Registry maintains the profiles of datasets of term descriptions. A dataset contains the descriptions of terms from a terminology source, such as Iconclass or WO2 Thesaurus. The dataset may include alignments with other terms. For example: the terms in the Erfgoedthesaurus refer to their equivalents in DBpedia. A terminology source may also use a separate dataset and corresponding dataset profile for describing the alignments between terms (see function “Register profiles of datasets of alignments”). A maintainer of a terminology source may register multiple versions of his dataset in the Registry. Each version corresponds to a dataset and has its own dataset profile. For example: if there are two officially supported versions of RKDartists&, dataset profile A describes version one and dataset profile B describes version two. The two profiles may refer to different locations where the term descriptions in the datasets can be found.

7.1.2.2 Register profiles of datasets of alignments The Registry maintains the profiles of datasets of alignments. A dataset contains the relations between terms that are described in different datasets, possibly from different terminology sources. For example: a relation between person A in dataset B and person C in dataset D. This type of dataset is also known as a linkset. A dataset of alignments can be created by either a maintainer of a terminology source or a maintainer of the Network of Terms. The corresponding dataset profile contains provenance information to denote the origin of the alignments.

7.1.2.3 Register profiles of datasets of object descriptions The Registry maintains the profiles of datasets of object descriptions. A dataset contains the descriptions of objects from a cultural heritage institution. For example: the dataset Nationale Bibliografie (“National Bibliography”) comprises the publications of the National Library of the Netherlands. A collection manager may register as many or as few dataset profiles as he sees fit, reflecting the collection strategy of his institution. For example: one organization can register one profile for its entire collection of objects whereas another organization can register separate profiles for each object type in its collection, such as books, magazines and newspapers.

7.1.2.4 Register profiles of datasets of annotations The Registry maintains the profiles of datasets of annotations. Annotations enrich the descriptions of cultural heritage objects.

For example: an expert at an institution can annotate objects by adding historical context information. The resulting set of annotations can be registered in the Registry in its own dataset profile. Another example: an editor of a cultural heritage app can write child-friendly summaries about objects. The set of summaries can be registered in the Registry in a dataset profile. Another example: an end user of a service portal can compose his own collection or mash-up of his favorite paintings via the website of an institution, such as the Rijksmuseum’s Rijksstudio. This kind of user-generated content, too, can be registered in the Registry. A dataset profile of annotations contains provenance information. This allows users - such as service portals or end users - to trace the origin of the dataset and to decide upon its usefulness in a particular context. The design of annotations and their datasets and dataset profiles is not finished yet. Further details are going to be added to this document later on.

7.1.3 Register object profiles The Registry keeps a list of object profiles that are available to the distributed network of heritage information. These profiles are provided by cultural heritage institutions or their aggregator providers. An object profile consists of four types of identifiers. First, the identifier of the object. Second, the identifier of the dataset to which the object belongs. Third, the identifiers of the terms - if any - that describe the object. Fourth, the identifiers of objects - if any - that have a relationship with the object, such as the preliminary studies of a painting or the volumes of a book. Note that other types of information that are typically part of an object’s description, such as its type, title or date of creation, are not part of an object profile. A collection manager of a cultural heritage institution is responsible for maintaining the profiles of his objects. Note that an object profile is a subset of an object description, not a new record that needs separate maintenance. A change to an object profile can be registered in the Registry in either of two ways. First, by notifying the Registry instantly, when the change occurs. This keeps the information up to date; the Registry has the current state of the object’s profile. Second, by notifying the Registry periodically, for example once a day. The Registry will then batch process the changes. The best approach for registering changes needs further investigation and will be added to this document later on. Contrary to an organization profile and a dataset profile, the Registry does not assign an identifier to an object profile. An object profile has its own identifier, issued by the cultural heritage institution or its aggregator provider. This identifier is persistent and dereferenceable. For example, an identifier of an object profile can look like this: https://www.rijksmuseum.nl/nl/collectie/RP-P-OB-79.671. Alternatively, a Handle can be used: http://hdl.handle.net/10934/RM0001.COLLECT.446245. If a user looks up this URI, the provider’s system returns the object’s description, expressed in RDF.

Ideally there is only one object profile of an object in the Registry, identified by the object’s identifier. But this will not always be the case. If a collection management system of a cultural heritage institution cannot issue URIs for its objects, the institution’s aggregator will do this. An institution can, however, deliver its object descriptions to multiple aggregators, in which case each aggregator assigns its own URIs to the descriptions and sends these to the Registry. Consequently the Registry has multiple object profiles for one object.
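
The statement that an object profile is a subset of an object description can be illustrated with a small projection. This is a sketch under assumptions: the dictionary keys ("id", "dataset", "terms", "related", "title", "type") and the names ObjectProfile and to_profile are hypothetical, and the sample description merely combines identifiers that appear as examples elsewhere in this document.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ObjectProfile:
    """The four kinds of identifiers that make up an object profile."""
    object_uri: str
    dataset_uri: str
    term_uris: Tuple[str, ...] = ()
    related_object_uris: Tuple[str, ...] = ()


def to_profile(description: dict) -> ObjectProfile:
    """Project a full description onto its profile; title, type and date are dropped."""
    return ObjectProfile(
        object_uri=description["id"],
        dataset_uri=description["dataset"],
        term_uris=tuple(description.get("terms", [])),
        related_object_uris=tuple(description.get("related", [])),
    )


# Hypothetical description; the key names are assumptions for this sketch.
description = {
    "id": "https://www.rijksmuseum.nl/nl/collectie/RP-P-1910-2115",
    "dataset": "http://registry.nde.nl/organizations/a/datasets/b",
    "terms": ["http://www.iconclass.org/rdf/?notation=45K1"],
    "title": "(not part of the profile)",
    "type": "(not part of the profile)",
}
profile = to_profile(description)
```

Because the profile is derived from the description, a collection manager does not have to maintain it as a separate record.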

7.1.4 Expose profiles The primary task of the Registry is to register organization, dataset and object profiles. It leaves the task of using this information to other building blocks. For example: the Network of Terms wants to get the profiles of datasets of term descriptions from the Registry so that it can collect the terms from terminology sources. Another example: a service portal wants to get the profiles of datasets of object descriptions so that it can present the datasets in its user interface, akin to CKAN. The Registry exposes its information by supporting two features. First, by returning profiles of organizations and datasets. These profiles are maintained by the Registry and can be retrieved using their Registry-assigned identifier. For example: http://registry.nde.nl/organizations/a or http://registry.nde.nl/organizations/a/datasets/b. Second, by synchronizing profiles of organizations, datasets and objects. A building block can get the profiles from the Registry by making an initial synchronization request. The Registry, in turn, delivers its current profiles. The building block can stay in sync with the Registry by requesting a change list of the profiles that were added, changed or removed since its previous request. The Registry, in turn, delivers the corresponding profiles.
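
The synchronization feature can be sketched as an append-only change log that a building block replays from its last known position. The document does not prescribe a protocol, so everything here is an assumption for illustration: the Registry class, the changes_since method, the integer position token and the synchronize helper.

```python
class Registry:
    """Hypothetical registry that records every change in an append-only log."""

    def __init__(self):
        self._profiles = {}
        self._log = []  # entries: (action, uri, profile or None)

    def put(self, uri, profile):
        action = "changed" if uri in self._profiles else "added"
        self._profiles[uri] = profile
        self._log.append((action, uri, profile))

    def delete(self, uri):
        del self._profiles[uri]
        self._log.append(("removed", uri, None))

    def changes_since(self, position=0):
        """Return the change list after `position` plus the new position."""
        return self._log[position:], len(self._log)


def synchronize(registry, local, position=0):
    """Replay the change list onto `local` (uri -> profile); return the new position."""
    changes, new_position = registry.changes_since(position)
    for action, uri, profile in changes:
        if action == "removed":
            local.pop(uri, None)
        else:  # "added" or "changed"
            local[uri] = profile
    return new_position
```

An initial synchronization is simply a request from position 0; later requests pass the position returned by the previous call, so only the added, changed or removed profiles are delivered.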

7.1.5 Exchange profiles with distributed registries The Registry can be a distributed service, consisting of multiple registries instead of one comprehensive service. These registries must be able to exchange organization, dataset and object profiles with each other. The design of this distributed service is not finished yet. Further details are going to be added to this document later on.

7.2 Network of Terms

7.2.1 Retrieve dataset profiles from Registry The Network of Terms exposes the term descriptions from selected terminology sources. These sources are registered in dataset profiles in the Registry. The Network obtains these profiles by retrieving (a change list of) the profiles from the Registry periodically, for example once a day. It then processes each profile. First, if a dataset profile is new or has been updated, the Network of Terms records that the term descriptions in the corresponding dataset are eligible for (re-)loading (see function “Load terms in datasets”). Second, if a dataset profile is no longer available in the Registry - for instance: it was deleted by its maintainer - the Network of Terms removes all of its references to this profile and the corresponding term descriptions.

7.2.2 Load terms in datasets The Network of Terms, after retrieving a dataset profile from the Registry, loads the term descriptions in the dataset.

The Network of Terms first addresses the dataset using the access information specified in the dataset profile, such as the location of the dataset and the method for requesting the dataset. The access information depends on the capabilities of the terminology management system. For example: one dataset can be retrieved by downloading a data dump and another dataset can be retrieved by querying the system’s repository via a protocol such as OAI-PMH. The term descriptions in a dataset are expressed in RDF and standard vocabularies, such as SKOS. The Network of Terms performs basic validation to ensure the well-formedness of the structure of the dataset. It does not, however, make corrections. If a dataset, for whatever reason, is ill-formed, it’s the responsibility of the maintainer of the terminology source or alignments to fix it and publish a valid dataset. The Network of Terms then processes the term descriptions. It saves a subset of each description in its data store, consisting of the information that is required for finding and understanding the term, such as type and label. The Network of Terms periodically checks and processes the datasets again, for example once a day. This ensures that the Network has current term descriptions. First, if a description has been added to or updated in the dataset, the Network adds or updates the description too. Second, if a description is no longer available in the dataset, the Network of Terms removes this description. Users of the Network of Terms - notably the maintainers of cultural heritage institutions - may still use references to such an obsolete term. It’s the responsibility of the maintainer of the terminology source to inform his users, for example by requesting them to use alternative references, and the responsibility of the terminology management system to return an appropriate error if an obsolete term description is requested.
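
The periodic add, update and remove cycle can be sketched as a comparison between the data store and a freshly retrieved dataset. This is an illustration only: the names KEPT_FIELDS, summarize and refresh, the dictionary layout and the "t:1" style identifiers are assumptions, and parsing of RDF or SKOS is deliberately left out.

```python
KEPT_FIELDS = ("type", "label")  # subset named in the text; a real store may keep more


def summarize(description):
    """Keep only the fields needed for finding and understanding a term."""
    return {key: description[key] for key in KEPT_FIELDS if key in description}


def refresh(store, dataset):
    """Bring `store` (uri -> summary) in line with `dataset` (uri -> full description).

    Returns the number of added, updated and removed descriptions.
    """
    added = updated = 0
    for uri, description in dataset.items():
        summary = summarize(description)
        if uri not in store:
            added += 1
        elif store[uri] != summary:
            updated += 1
        store[uri] = summary
    removed = [uri for uri in store if uri not in dataset]
    for uri in removed:
        del store[uri]
    return added, updated, len(removed)


# Invented sample data with placeholder identifiers.
store = {}
version_1 = {
    "t:1": {"type": "person", "label": "Rembrandt", "born": "1606"},
    "t:2": {"type": "place", "label": "Leiden"},
}
version_2 = {"t:1": {"type": "person", "label": "Rembrandt van Rijn", "born": "1606"}}
first_run = refresh(store, version_1)
second_run = refresh(store, version_2)
```

The sketch only detects changes in the fields it keeps, so a change to a field outside the stored subset (here "born") does not count as an update.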

7.2.3 Search terms in datasets of external terminology sources The Network of Terms exposes terms from a variety of terminology sources, but it only loads terms from national sources. Loading terms from external sources - for instance: DBpedia - would make the Network large and unwieldy. Instead, the Network acts as a mediator for such sources. This allows a user to query the Network for a term in an external source. Invisible to the user, the Network routes the user’s query to the source, addressing the source’s own services for finding term descriptions. The Network then returns the search result to the user. The design for searching external terminology sources is not finished yet. Further details are going to be added to this document later on.

7.2.4 Search terms by collection manager When describing an object in his collection management system, a collection manager of a cultural heritage institution searches for terms that define the object adequately. The collection manager can then use these terms to enrich or annotate the object description. Ideally the collection management system of the collection manager searches for terms by querying the terminology sources registered in the Registry. This, however, can be challenging: each collection management system must then implement the APIs of the terminology sources and keep track of changes in the Registry, such as the addition of new and the removal of old terminology sources. The Network of Terms therefore supports the collection management system with this search by offering a unified interface to the different terminology sources. This interface allows a collection manager to search for terms. First, by finding the terms that match the collection manager’s query. For example: all terms with a specific status (e.g. no candidate terms) or type (e.g. only “persons”) or stemming from a specific terminology source (e.g. Nederlandse Thesaurus van Auteursnamen or DBpedia) or organization (e.g. Koninklijke Bibliotheek). Second, by auto-completing the collection manager’s query. For example: all terms of which the label starts with “Ams”, possibly limited to a specific type (e.g. only “places”) or one or more terminology sources (e.g. Erfgoedthesaurus). The Network of Terms returns specific information per term description. First, descriptive information for understanding the term, such as type, preferred and alternative label, scope notes and relations. Second, administrative information, such as the candidate status of the term. If a collection manager needs more information about a term, such as the birthdate of a person or the geographic coordinates of a place, the collection management system can get this information from the corresponding terminology source by dereferencing the URI of the term. The collection management system decides what must be done with the resulting information, for example by discarding it after use or by saving it in its data store to speed up future use. Note that the Network of Terms merely facilitates a collection manager in finding terms. The Network, however, does not warn the collection manager of changes. For example: the label of a term may change; this may or may not alter its meaning (“semantic drift”). Even though the identifier of the term remains the same, the new label may or may not suit the collection manager. It’s the responsibility of the maintainer of the terminology source to inform his users about changes.
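
The two search modes of the unified interface, matching and auto-completing, can be sketched over an in-memory list of terms. The Term fields, the function names search and autocomplete, their parameters and the sample terms with example.org identifiers are all assumptions made for this illustration, not an API defined by the design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    uri: str
    label: str
    type: str
    source: str
    candidate: bool = False


def search(terms, text=None, term_type=None, source=None, include_candidates=True):
    """Return the terms that match every condition that was supplied."""
    matches = []
    for term in terms:
        if text is not None and text.casefold() not in term.label.casefold():
            continue
        if term_type is not None and term.type != term_type:
            continue
        if source is not None and term.source != source:
            continue
        if term.candidate and not include_candidates:
            continue
        matches.append(term)
    return matches


def autocomplete(terms, prefix, term_type=None, sources=None):
    """Return sorted labels starting with `prefix`, optionally limited by type and sources."""
    prefix = prefix.casefold()
    return sorted(
        term.label
        for term in terms
        if term.label.casefold().startswith(prefix)
        and (term_type is None or term.type == term_type)
        and (sources is None or term.source in sources)
    )


# Invented sample terms; the identifiers are placeholders.
TERMS = [
    Term("https://example.org/terms/1", "Amsterdam", "places", "Erfgoedthesaurus"),
    Term("https://example.org/terms/2", "Amstelveen", "places", "Erfgoedthesaurus"),
    Term("https://example.org/terms/3", "Amsberg, Claus van", "persons",
         "Nederlandse Thesaurus van Auteursnamen", candidate=True),
]
```

For instance, autocomplete(TERMS, "Ams", term_type="places") yields only the two place names, while search(TERMS, term_type="persons", include_candidates=False) yields nothing because the only person in the sample is a candidate term.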

7.2.5 Propose terms by collection manager A collection manager of a cultural heritage institution may not find the term he’s looking for when describing an object in his collection management system. For example: an artist may not have been included in RKDartists& or a subject may be missing from Iconclass. The collection manager should then be able to propose a term - a candidate term or provisional term - to a terminology source. Ideally the collection manager contacts the maintainer of the terminology source directly, using the procedures set by the source. This, however, can be challenging; a collection manager must know these procedures. The Network of Terms therefore supports the collection manager in proposing a term. The collection manager can send the information about a candidate term to the Network of Terms. This information includes for example the terminology source to which the term should be added, the label of the term and a motivation. The Network of Terms then passes this information to the appropriate terminology source. What happens next depends on the source. For instance, the source can register the candidate term for review by its maintainer.

The source can also generate and return a provisional identifier for this term - a URI. The collection manager of the cultural heritage institution can save this identifier with the object’s description and continue his work. Meanwhile, the maintainer of the terminology source reviews the candidate term and either rejects or approves it. He then notifies the proposer - the collection manager - about his decision. On approval the maintainer of the terminology source adds the term to the dataset of the source. The Network of Terms subsequently loads the dataset, including the description of the newly added term. The maintainer of the institution can then search for the term and use it in his object descriptions. Note that the Network of Terms supports candidate terms only insofar as a terminology source has an API to which the candidate terms can be proposed automatically. If a source doesn’t have an API - for instance: candidate terms can only be proposed manually - the collection manager must contact the terminology source himself.
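
The proposal flow can be sketched as the Network of Terms forwarding a candidate term to a source that has an API and refusing when it does not. All names here (TerminologySource, NetworkOfTerms, ManualProposalRequired, the status values and the example.org URI pattern for provisional identifiers) are hypothetical; what a real source does with a proposal is up to that source.

```python
import itertools


class ManualProposalRequired(Exception):
    """Raised when a source has no API; the collection manager must contact it himself."""


class TerminologySource:
    def __init__(self, name, base_uri, has_api=True):
        self.name = name
        self.base_uri = base_uri
        self.has_api = has_api
        self.candidates = {}
        self._counter = itertools.count(1)

    def propose(self, label, motivation):
        """Register a candidate term and return a provisional identifier."""
        uri = f"{self.base_uri}/candidates/{next(self._counter)}"
        self.candidates[uri] = {"label": label, "motivation": motivation,
                                "status": "candidate"}
        return uri

    def review(self, uri, approve):
        """The maintainer's decision; returns the new status."""
        self.candidates[uri]["status"] = "approved" if approve else "rejected"
        return self.candidates[uri]["status"]


class NetworkOfTerms:
    def __init__(self, sources):
        self.sources = {source.name: source for source in sources}

    def propose(self, source_name, label, motivation):
        source = self.sources[source_name]
        if not source.has_api:
            raise ManualProposalRequired(source_name)
        return source.propose(label, motivation)
```

The collection manager can store the returned provisional identifier with the object description right away; the approved term only becomes searchable after the Network of Terms has reloaded the source's dataset.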

7.2.6 Exchange terms with distributed networks of terms The Network of Terms can be a distributed service, consisting of multiple networks of terms instead of one comprehensive service. These networks of terms must be able to exchange term descriptions with each other. The design of this distributed service is not finished yet. Further details are going to be added to this document later on.

7.3 Knowledge Graph

7.3.1 Retrieve profiles from Registry The Knowledge Graph contains the relations between organizations, datasets, objects and terms. These relations originate from the profiles in the Registry. The Knowledge Graph obtains these by retrieving (a change list of) the profiles from the Registry periodically, for example once a day. It then processes each profile. First, if a profile is new, the Knowledge Graph gets the identifiers in the profile and stores these in its data store. For example: an organization profile includes the identifier of the organization (http://registry.nde.nl/organizations/a); a dataset profile includes the identifier of the license (http://standaarden.overheid.nl/owms/terms/cc0.rdf); an object profile includes the identifier of the object (https://www.rijksmuseum.nl/nl/collectie/RP-P-1910-2115) and its terms (http://www.iconclass.org/rdf/?notation=45K1). Second, if a profile has been updated, the Knowledge Graph gets the identifiers in the profile. New identifiers in the profile are added to the data store and identifiers that no longer exist in the profile are removed from the data store. Third, if a profile is no longer available in the Registry - for instance: it was deleted by its maintainer - the Knowledge Graph removes all references to this profile. As a result the Knowledge Graph has the most recent identifiers used in the latest version of the profiles. The Knowledge Graph can then connect these and create a network of relations, ready for querying by browsers or service portals.
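
The three cases above (new, updated and removed profiles) reduce to set operations on the identifiers stored per profile. The following sketch assumes a plain dictionary as data store and hypothetical function names apply_profile and remove_profile; the sample identifiers are the ones quoted in the text plus one example.org placeholder.

```python
def apply_profile(store, profile_uri, identifiers):
    """Record the identifiers of a new or updated profile.

    Returns (added, removed): identifiers that are new in the profile and
    identifiers that no longer occur in it.
    """
    old = store.get(profile_uri, set())
    new = set(identifiers)
    store[profile_uri] = new
    return new - old, old - new


def remove_profile(store, profile_uri):
    """Drop all references to a profile that was deleted from the Registry."""
    store.pop(profile_uri, None)


store = {}
obj = "https://www.rijksmuseum.nl/nl/collectie/RP-P-1910-2115"
term = "http://www.iconclass.org/rdf/?notation=45K1"
first = apply_profile(store, obj, [obj, term])
second = apply_profile(store, obj, [obj, "https://example.org/terms/new"])
```

A new profile is just the special case in which nothing was stored before, so every identifier ends up in the "added" set.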

7.3.2 Verify relations in profiles The relations in the Knowledge Graph tell how organizations, datasets, objects and terms are connected. The integrity of these relations is therefore of utmost importance. Even though maintainers at cultural heritage institutions and terminology source providers are responsible for managing the relations in their profiles, the Knowledge Graph also verifies the relations in order to safeguard their integrity. First, after the Knowledge Graph has synchronized a profile with the Registry, it validates the relations the profile contains, such as the identifiers of organizations, datasets, objects and terms. If this validation fails - for instance: an identifier is not a valid URI - the Knowledge Graph rejects the relation. If this validation succeeds, the Knowledge Graph accepts the relation provisionally. Second, the Knowledge Graph performs a deep validation of the relations. This is a deferred process that happens asynchronously. The Knowledge Graph fetches the resource to which a relation points, such as an object or term description, by dereferencing its URI. If this validation fails - for instance: the object or term description does not exist - the Knowledge Graph rejects the relation. If this validation succeeds, the Knowledge Graph accepts the relation. Third, the Knowledge Graph periodically validates the accepted relations again, for example once a month. This rerun of the deep validation ensures that the relations remain up-to-date, independent of the changes that were - or weren’t - provided by the maintainer of a cultural heritage institution or terminology source provider. For example: if an object or term description, for whatever reason, no longer exists, making its identifier a dead link, the Knowledge Graph rejects the relation. Note that a description may be temporarily unavailable, for instance due to system maintenance. A relation is therefore rejected only if the description is unavailable several times in a row. The maintainer of the Knowledge Graph can gather the rejected relations periodically and contact the maintainers of the cultural heritage institutions or terminology source providers concerned.
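
The three validation stages can be sketched as a small state machine per relation. Several choices below are assumptions rather than part of the design: the threshold of three consecutive failures (the text only says "several times in a row"), the restriction to http and https URIs, applying the same threshold to the first deep validation, and the names Relation, is_valid_uri and deep_validate. The dereference argument stands in for an actual HTTP request.

```python
from urllib.parse import urlparse

MAX_FAILURES = 3  # assumption: the text only says "several times in a row"


def is_valid_uri(value):
    """Shallow check: an absolute http(s) URI with a host (an assumed rule)."""
    parts = urlparse(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


class Relation:
    def __init__(self, target):
        self.target = target
        self.failures = 0
        # Stage 1: syntactic validation right after synchronization.
        self.status = "provisional" if is_valid_uri(target) else "rejected"

    def deep_validate(self, dereference):
        """Stages 2 and 3: `dereference(uri)` returns True if the resource exists."""
        if self.status == "rejected":
            return self.status
        if dereference(self.target):
            self.failures = 0
            self.status = "accepted"
        else:
            self.failures += 1
            if self.failures >= MAX_FAILURES:
                self.status = "rejected"
        return self.status
```

A single failed request leaves the status unchanged, so a source that is down for maintenance does not lose its relations; a successful request resets the failure counter.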

7.3.3 Typify object descriptions The Knowledge Graph contains the relations of objects, taken from the object profiles in the Registry. The types of the objects, however, are not part of the profiles, but are helpful to users when querying the Knowledge Graph: a user typically does not want to request all object descriptions, regardless of what they represent, but a subset of object descriptions, matching one or more types. For example: publications, not buildings. To enable these queries the Knowledge Graph determines the type of each object description. There are two ways of doing this. First, by using the type information in the profile of the dataset to which an object belongs. This profile can denote that all of the objects in its dataset are of a particular type. Second, by extracting the object’s type from its description, for instance after dereferencing its identifier and verifying its relations (see function “Verify relations in profiles”). The best approach for determining the type needs further investigation and will be added to this document later on. The Knowledge Graph typifies object descriptions at a high level by mapping specific types to generic types. For example: if a collection manager of a cultural heritage institution has typified an object as “Clipping”, the Knowledge Graph typifies it as “Written Work”. This allows users to query the Knowledge Graph without having to know the exact types used by institutions. If, however, a user needs to have the exact types, it can collect the object descriptions and determine the types itself (see function “Retrieve object descriptions from collection management systems”).
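
The mapping from specific to generic types can be sketched as a lookup table. Only the pair “Clipping” to “Written Work” comes from the text; the other table entries, the fallback value, the function name typify and the rule that an object's own type takes precedence over its dataset's type are assumptions, since the design leaves the approach open.

```python
# Only "Clipping" -> "Written Work" is taken from the text; the rest is invented.
GENERIC_TYPES = {
    "Clipping": "Written Work",
    "Book": "Written Work",
    "Newspaper": "Written Work",
}


def typify(object_type=None, dataset_type=None, default="Unknown"):
    """Map a specific type to a generic one.

    The object's own type wins over the type declared for its dataset
    (an assumption for this sketch).
    """
    specific = object_type or dataset_type
    return GENERIC_TYPES.get(specific, default)
```

With this table, typify("Clipping") and typify(dataset_type="Newspaper") both return "Written Work", so a user can query for written works without knowing either specific type.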

7.3.4 Retrieve related terms from Network of Terms The Knowledge Graph contains two types of terms. First, terms that appear in profiles from the Registry. These terms have been assigned by maintainers while describing their organizations, datasets and objects. Second, terms that are related to the terms from the profiles. These related terms support users in finding information. For example: if a user is looking for object descriptions using the geographical term “Den Haag”, the Knowledge Graph also returns object descriptions that use the city’s alternative name, “’s-Gravenhage”. Related terms are not part of the profiles. The Knowledge Graph gathers these by querying the Network of Terms. For each term used in a profile the Knowledge Graph requests its equivalent, associated and hierarchical terms, independent of the originating terminology or alignment sources. The extent to which the Knowledge Graph should retrieve related terms needs further investigation. In principle the Knowledge Graph gathers the directly related terms of a term, but it may be useful to also gather indirectly related terms. For example: the broader term of the broader term of a term, such as the province (“Zuid-Holland”) of the municipality (“Gemeente Den Haag”) of the city of Den Haag.
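
The difference between directly and indirectly related terms can be sketched as a breadth-first walk with a depth limit. The function name related_terms, the depth parameter and the dictionary of relations are assumptions; labels are used instead of URIs for readability, and the sample relations follow the Den Haag example above.

```python
def related_terms(term, relations, depth=1):
    """Collect terms reachable from `term` in at most `depth` steps.

    `relations` maps a term to its directly related terms.
    """
    found = []
    seen = {term}
    frontier = [term]
    for _ in range(depth):
        next_frontier = []
        for current in frontier:
            for neighbor in relations.get(current, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    found.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return found


# Broader-term relations from the example in the text.
BROADER = {
    "Den Haag": ["Gemeente Den Haag"],
    "Gemeente Den Haag": ["Zuid-Holland"],
}
```

With depth=1 only the municipality is returned for “Den Haag”; with depth=2 the province is included as well. The set of visited terms guards against cycles in the relations.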

7.3.5 Search for relations by browser The Knowledge Graph helps the browser of a service portal to find relations between organizations, datasets, objects and terms. For this purpose a browser can query the Knowledge Graph with a set of conditions. In reply the Knowledge Graph returns the relations matching the conditions. For example: if a portal wants to present heritage information related to the former Dutch colony of the Dutch East Indies, the portal’s browser can request the identifiers of objects that match specific organization, dataset and/or term criteria. For instance, the objects must belong to organizations Museum Bronbeek and National Archives, must be part of datasets that are available under the license CC0 and must have the terms “Java” or “Sumatra”. The Knowledge Graph can be queried in real-time, instantly giving answers to questions of users. However, this type of querying is unlikely: the Knowledge Graph only returns identifiers, not the metadata of organizations, datasets, objects or terms. The metadata must be retrieved from their respective sources, such as collection management systems and terminology management systems. This makes real-time querying a considerable challenge to browsers. Instead, the Knowledge Graph will typically be queried periodically, for example once a day. This allows a browser to collect the metadata of the matching identifiers in a deferred process.
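
A query with organization, license and term conditions can be sketched as a filter over flattened relations. The record layout, the function name find_objects and the example.org identifiers are assumptions; organizations and terms are given as names rather than URIs to keep the example readable, and only the CC0 license identifier is taken from elsewhere in this document.

```python
CC0 = "http://standaarden.overheid.nl/owms/terms/cc0.rdf"  # license identifier quoted in the text


def find_objects(profiles, organizations=None, licenses=None, any_terms=None):
    """Return identifiers of objects whose relations meet all supplied conditions."""
    matches = []
    for profile in profiles:
        if organizations is not None and profile["organization"] not in organizations:
            continue
        if licenses is not None and profile["license"] not in licenses:
            continue
        if any_terms is not None and not set(profile["terms"]) & set(any_terms):
            continue
        matches.append(profile["object"])
    return matches


# Invented sample relations with placeholder object identifiers.
PROFILES = [
    {"object": "https://example.org/objects/1", "organization": "Museum Bronbeek",
     "license": CC0, "terms": ["Java"]},
    {"object": "https://example.org/objects/2", "organization": "National Archives",
     "license": CC0, "terms": ["Sumatra", "KNIL"]},
    {"object": "https://example.org/objects/3", "organization": "National Archives",
     "license": "https://example.org/licenses/restricted", "terms": ["Java"]},
    {"object": "https://example.org/objects/4", "organization": "Museum A",
     "license": CC0, "terms": ["Java"]},
]

result = find_objects(
    PROFILES,
    organizations={"Museum Bronbeek", "National Archives"},
    licenses={CC0},
    any_terms={"Java", "Sumatra"},
)
```

The result contains identifiers only; the browser still has to retrieve the metadata of each object from its source.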

7.3.6 Exchange relations with distributed knowledge graphs The Knowledge Graph can be a distributed service, consisting of multiple knowledge graphs instead of one comprehensive service. These knowledge graphs must be able to exchange relations with each other. The design of this distributed service is not finished yet. Further details are going to be added to this document later on.

7.4 Browser

7.4.1 Select identifiers from Knowledge Graph A browser helps a service portal to select the identifiers of organizations, datasets, objects and terms that are related to the topic of the portal. These identifiers can then be used for retrieving information about the organizations, datasets, objects and terms. For example: a service portal wants to present object descriptions from cultural heritage institutions that are part of the Netwerk Zuiderzeecollectie. The portal’s browser selects the identifiers of objects that belong to these institutions by querying the Knowledge Graph, using the identifiers of the institutions, registered in the organization profiles of the Registry, as input. This query returns a set of matching object identifiers. Another example: a service portal wants to present the cultural heritage institutions that have information about the topic Dutch East Indies. The portal’s browser first selects the terms that are related to this topic by querying the Knowledge Graph. This query returns a set of terms, originating from various terminology sources, that have a relation with the topic - for instance: the term “Nederlands-Indië” in DBpedia, the narrower term “Java” in the Brinkman Thesaurus or the related term “KNIL” in WO2 Thesaurus. The browser then selects the curating institutions by querying the Knowledge Graph, using the identifiers of the terms as input. This query returns a set of matching organization identifiers. A browser typically saves the identifiers found by the Knowledge Graph in its own data store. It queries the Knowledge Graph again periodically, for example once a week. This ensures that the browser uses current identifiers: if identifiers are added to or removed from the Knowledge Graph, the browser can update its identifiers accordingly. The design for selecting identifiers is not finished yet. 
For example: instead of selecting object identifiers directly from the Knowledge Graph, it could be an option to select these indirectly, from their sources. A browser then queries the Knowledge Graph using term identifiers as input. In reply the Knowledge Graph returns a set of dataset profiles of which the datasets contain object descriptions referencing the term identifiers. The browser subsequently queries the datasets for retrieving the object descriptions. This approach could benefit the scalability of the distributed network. Further details will be added to this document later on.

7.4.2 Retrieve term descriptions from terminology management systems The Knowledge Graph has limited information about terms from terminology sources: it does not have the full descriptions of terms, such as birthdates of persons or geographic coordinates of places. If a service portal needs these descriptions for serving its users, the portal’s browser can collect the descriptions at their source, the terminology management system. It does so by dereferencing the URIs of the term descriptions. A terminology management system then returns the description for each URI, expressed in RDF and term vocabularies such as SKOS.

The browser typically saves the term descriptions in its own data store, principally for performance. For example: it can build a faceted index by indexing the labels and identifiers in the descriptions. The browser retrieves the term descriptions again periodically, for example once a week. This ensures that the browser uses current descriptions: if the maintainer of a terminology source updates a description - for instance: by correcting the birthdate of a person - the browser can update its description accordingly. The browser acts upon the response of a terminology management system when requesting a term description. For example: a description that no longer exists triggers the system to yield a specific status (410 Gone). This status causes the browser to remove the description from its data store.

7.4.3 Retrieve object descriptions from collection management systems

The Knowledge Graph has limited information about objects from cultural heritage institutions: it does not have the full descriptions of objects, such as their title, summary or date of creation. If a service portal needs these descriptions for serving its users, the portal’s browser can collect the descriptions at their source, the collection management system of the institution (or its aggregator). It does so by dereferencing the URIs of the object descriptions. A collection management system then returns the description for each URI, expressed in RDF, using a domain-specific or domain-independent vocabulary.

The browser, based on the input of the service portal, determines which object descriptions are usable and which are not. For example: a service portal may wish to use only objects of a specific type, such as a book, monument or painting. Or a service portal may wish to use only objects that have an image or that were created in a certain time period. The browser subsequently filters the object descriptions, keeping only those that match the portal’s criteria.

The browser typically saves the remaining object descriptions in its own data store, principally for performance. For example: it can build a full-text index by indexing the words in the descriptions, such as titles or summaries. This allows a user of a service portal to find heritage information not only by using controlled terms but also by using free-form words (which is especially useful for object descriptions that lack terms, such as archival records).

The browser retrieves the object descriptions again periodically, for example once a week. This ensures that the browser uses current descriptions: if a collection manager updates a description - for instance: by changing the summary of a book - the browser can update its description accordingly.

The browser acts upon the response that a collection management system returns when the browser requests an object description. For example: for a description that no longer exists, the system returns a specific HTTP status (410 Gone). This status causes the browser to remove the description from its data store.
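The filtering and full-text indexing of object descriptions can be illustrated with a small sketch. It assumes that the RDF descriptions have already been retrieved and simplified to plain dictionaries. The field names (`type`, `title`, `summary`, `created`, `image`), the function names and the example.org URIs are all hypothetical and serve only as illustration.

```python
import re
from collections import defaultdict


def is_usable(description, wanted_types=None, require_image=False, period=None):
    """Decide whether a description matches the criteria of a service portal."""
    if wanted_types is not None and description.get("type") not in wanted_types:
        return False
    if require_image and not description.get("image"):
        return False
    if period is not None:
        start, end = period
        year = description.get("created")
        if year is None or not (start <= year <= end):
            return False
    return True


def build_fulltext_index(descriptions):
    """Map each lower-cased word in a title or summary to the URIs that use it."""
    index = defaultdict(set)
    for uri, description in descriptions.items():
        text = " ".join(str(description.get(key, "")) for key in ("title", "summary"))
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(uri)
    return dict(index)


# Hypothetical, already simplified object descriptions.
harvested = {
    "https://example.org/object/1": {
        "type": "book", "title": "Kroniek van een dorp",
        "summary": "Geschiedenis van het dorp", "created": 1950, "image": None,
    },
    "https://example.org/object/2": {
        "type": "painting", "title": "Gezicht op de haven",
        "summary": "", "created": 1650,
        "image": "https://example.org/image/2.jpg",
    },
    "https://example.org/object/3": {
        "type": "monument", "title": "Oude molen",
        "summary": "", "created": 1920, "image": None,
    },
}

# A portal that only wants books and paintings created between 1900 and 2000.
usable = {
    uri: d for uri, d in harvested.items()
    if is_usable(d, wanted_types={"book", "painting"}, period=(1900, 2000))
}
index = build_fulltext_index(usable)
```

With these criteria only the first description remains, so a free-form search for "dorp" finds it while a search for "haven" finds nothing.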


7.4.4 Translate user queries to term identifiers

A browser keeps the term descriptions of terminology sources (see function “Retrieve term descriptions from terminology management systems”). The browser can use these descriptions to support a user of a service portal in finding specific information. It does so in two steps.

First, the browser translates a user’s query to identifiers of terms in terminology sources. For example: if a user searches for “NSB” or “Nationaal-Socialistische Beweging”, the browser looks for term descriptions with these labels. The browser may then find a term description in the WO2 Thesaurus with identifier https://data.niod.nl/WO2_Thesaurus/2244. Another example: if a user searches for “Slag Mookerheide”, the browser looks for term descriptions with this label. The browser may then find a term description in DBpedia with identifier http://nl.dbpedia.org/resource/Slag_op_de_Mookerheide.

Second, the browser searches for heritage information that uses the identifiers. For example: the browser can query its collection of object descriptions and find the descriptions that contain the identifiers. For identifier https://data.niod.nl/WO2_Thesaurus/2244 this may result in newspapers published by the NSB or archival records about it. For identifier http://nl.dbpedia.org/resource/Slag_op_de_Mookerheide this may result in paintings of or books about the battle.

Unlike a full-text search, this search for identifiers allows a user to find information with greater accuracy and relevance. A full-text search yields all resources in which the user’s search terms somehow occur, whereas an identifier search yields only those resources to which the identifier was explicitly assigned by a collection manager.
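The two steps can be sketched as follows. This is an illustration under stated assumptions: term and object descriptions are simplified to dictionaries, the labels attached to each term are assumed from the examples above, the object URIs on example.org are hypothetical, and label matching is reduced to a case-insensitive exact match (a real browser would likely match more loosely).

```python
NSB = "https://data.niod.nl/WO2_Thesaurus/2244"
MOOKERHEIDE = "http://nl.dbpedia.org/resource/Slag_op_de_Mookerheide"

# Term descriptions as kept by the browser (labels are assumed for illustration).
term_descriptions = {
    NSB: {"labels": ["NSB", "Nationaal-Socialistische Beweging"]},
    MOOKERHEIDE: {"labels": ["Slag op de Mookerheide", "Slag Mookerheide"]},
}

# Hypothetical object descriptions; "terms" holds the identifiers
# that a collection manager assigned to the object.
object_descriptions = {
    "https://example.org/object/newspaper-1": {"terms": [NSB]},
    "https://example.org/object/painting-1": {"terms": [MOOKERHEIDE]},
    "https://example.org/object/record-1": {"terms": []},
}


def build_label_index(terms):
    """Map each case-folded label to the identifiers of the terms that carry it."""
    index = {}
    for uri, description in terms.items():
        for label in description["labels"]:
            index.setdefault(label.casefold(), set()).add(uri)
    return index


def find_term_ids(query, label_index):
    """Step 1: translate a user's query to term identifiers."""
    return label_index.get(query.strip().casefold(), set())


def find_objects(term_ids, objects):
    """Step 2: find object descriptions that use any of the identifiers."""
    return sorted(
        uri for uri, description in objects.items()
        if term_ids & set(description.get("terms", ()))
    )


label_index = build_label_index(term_descriptions)
term_ids = find_term_ids("nsb", label_index)
matches = find_objects(term_ids, object_descriptions)
```

Here `matches` contains only the newspaper, because it is the only object to which the NSB identifier was explicitly assigned. The archival record without terms is not found this way, which is why the full-text index remains useful alongside the identifier search.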
