Ruben Verborgh - Creating, publishing, and using Connected Data (CC BY-SA 4.0)

Creating, publishing, and using Connected Data

Ruben Verborgh

Kruispuntbank van Ondernemingen (FEDERAL)

Address databases (REGIONAL)

Thousands of letters from the Federal Government were returned every year… because the company's address had changed.

Thousands more letters were returned every year… because the address and the company had never existed.

Kruispuntbank van Ondernemingen (FEDERAL)

Address databases (REGIONAL)

Postdoctoral researcher at Ghent University – iMinds

Semantic Web

Linked Data

Web APIs

Ruben Verborgh

Scalable access to Linked Data

publishing

creating

using

Connected Data

Connected Data

objects

metadata

services

data sources

Linked Data

Linked Data

Bill

knows

Al

Linked Data

Bill

Al

http://dbpedia.org/resource/Bill_Clinton

knows

http://dbpedia.org/resource/Al_Gore

http://xmlns.com/foaf/0.1/knows
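The "Bill knows Al" statement can be written down as a triple of three URLs. The snippet below is a minimal sketch in Python, assuming a plain tuple as the data structure and N-Triples as the output syntax; the function name `to_ntriples` is hypothetical and only serves this illustration.

```python
# A triple is (subject, predicate, object); all three are URLs from the slides.
BILL = "http://dbpedia.org/resource/Bill_Clinton"
AL = "http://dbpedia.org/resource/Al_Gore"
KNOWS = "http://xmlns.com/foaf/0.1/knows"

triple = (BILL, KNOWS, AL)


def to_ntriples(t):
    """Serialize a URL-only triple as one N-Triples line (hypothetical helper)."""
    subject, predicate, obj = t
    return f"<{subject}> <{predicate}> <{obj}> ."


print(to_ntriples(triple))
```

Because every part is a URL, anyone can look up what "Bill", "knows", and "Al" mean instead of guessing from a label.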

Linked Data




Linked Data

mutual relationship between persons


Linked Data

http://xmlns.com/foaf/0.1/Person




http://xmlns.com/foaf/0.1/Person

Linked Data

Bill

knows

Al

publishing

creating

using

Connected Data

Little data is born connected.


DBpedia is the Linked version of Wikipedia.

Structured data is converted into triples by a script.


http://wikipedia.org/wiki/Bill_Clinton

DBpedia is the Linked version of Wikipedia.

Such a script is written by IT people and is specific to each website.

How can we easily make data connected ourselves?

How do we give things a URL?

How do we link those URLs?


structured data

unstructured data

Record ID: 402320
Object Title: College bed/lounge designed by John Andrews, 1965
Registration Number: 2010/9/1

Categories: Sofa-beds|Furniture
Height: 310 mm
Width: 860 mm
Depth:
Diameter:
Weight:

How can we link this piece of data to others?

LCSH: Library of Congress Subject Headings

Where can we find more about "Sofa-beds" and "Furniture"?

AAT: Art and Architecture Thesaurus

DDC: Dewey Decimal Classification

Where can we find more about "Sofa-beds" and "Furniture"?

Why would we use this URL?

"Furniture" is a string.

http://id.loc.gov/authorities/subjects/sh85052522.html#concept identifies a piece of Connected Data that is connected to other data.

Record ID: 402320
Object Title: College bed/lounge designed by John Andrews, 1965
Registration Number: 2010/9/1

Categories: Sofa-beds|Furniture
Height: 310 mm
Width: 860 mm
Depth:
Diameter:
Weight:

How do we get from "Furniture" to "LCSH Furniture"?

http://id.loc.gov/authorities/subjects/sh85052522.html#concept

We query the LCSH dataset using the SPARQL query language.

SELECT * WHERE { ?concept skos:prefLabel "Furniture". }

http://id.loc.gov/authorities/subjects/sh85052522.html#concept
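The lookup above can be generated for every category in a record rather than typed by hand. The following is a minimal sketch in Python, assuming the categories arrive as a `|`-separated string as in the record shown earlier; the helper names are hypothetical, and the `skos:` prefix declaration is the standard SKOS namespace, added here because the slide omits it. Depending on how a vocabulary stores its labels, a language tag on the literal may also be needed.

```python
SKOS_PREFIX = "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>"


def split_categories(field):
    """Split a 'Sofa-beds|Furniture' style field into clean labels."""
    return [part.strip() for part in field.split("|") if part.strip()]


def escape_literal(text):
    """Escape backslashes and double quotes for use inside a SPARQL string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def label_query(label):
    """Build the slide's prefLabel lookup for one label (hypothetical helper)."""
    return (
        f"{SKOS_PREFIX}\n"
        f'SELECT * WHERE {{ ?concept skos:prefLabel "{escape_literal(label)}". }}'
    )


queries = [label_query(c) for c in split_categories("Sofa-beds|Furniture")]
for query in queries:
    print(query)
    print()
```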

We don't have to do this manually for every entry.

OpenRefine

OpenRefine is like Excel for large amounts of data.

OpenRefine can automatically run queries to find links.

With minimal effort, 90% of the dataset is linked.

This is a preprint of an article accepted for publication in Journal of the American Society for Information Science and Technology, copyright © 2012 (American Society for Information Science and Technology)

[Figure 3 (PHM Collection): regions labeled LCSH, AAT, and LCSH + AAT; values shown: 68.4%, 81.1%, 77.1%; 89.8% of records reconciled]

Figure 3: Almost 90% of the PHM records have been reconciled by combining the LCSH and the AAT.

and 19.8% is exclusively matched to the AAT (but only account for 29,676 occurrences). If we take into account both the LCSH and the AAT reconciliation, we can state that 109,599 out of 167,016 rows have been reconciled, or 65.6%. On the record level, this comes down to 67,579 records or 89.8% that have been automatically reconciled, as illustrated to scale in Figure 3.
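The row-level percentage quoted in this excerpt can be checked with one division. A minimal sketch, using only the two row counts stated in the text; the variable names are chosen for this illustration.

```python
# Row counts as stated in the excerpt.
reconciled_rows = 109_599
total_rows = 167_016

row_share = 100 * reconciled_rows / total_rows
print(f"{row_share:.1f}% of rows reconciled")
```

The record-level figure is higher than the row-level one because a record counts as reconciled as soon as at least one of its category rows is matched.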

5.3 Assessing the characteristics of the reconciled terms

Two questions arose about the reconciled headings: 1) how are the terms structured at the syntactic level and 2) do the reconciled headings provide a sufficient level of granularity to offer an added value in the context of information search and retrieval?

5.3.1 Syntactic structure of the reconciled terms

In order to assess the internal structure of the terms reconciled with the LCSH and the AAT, we performed a part-of-speech (POS) analysis with the help of the Natural Language Toolkit [38], a collection of Python modules for advanced text analytics, providing among other tools a probabilistic (maximum entropy) POS tagger. The tags used originate from the Penn Treebank project [39], which is the most widely established reference in the field of Natural Language Processing.

Table 2 shows the five most common structures, with figures and percentages for both the LCSH and AAT (NNS stands for plural common noun; NN for singular or mass noun; JJ for adjective and VBG for gerund, i.e. -ing verbal form). Terms consisting of a single plural noun ("Flatirons") account for about half of all categories within both vocabularies, followed by terms formed by a plural noun modified by another noun ("Chocolate moulds"). Singular or mass nouns ("Glass") come third for LCSH terms, but are rarer in the AAT, as could be expected from the earlier discussion of the singular/plural alternance (see section 4.3.2). More plural nouns follow, modified either by an adjective ("Acoustic guitars") or by a gerund ("Copying machines").

In total, 43 different patterns were identified for the LCSH terms, and 38 for the AAT ones (with a large overlap between the two). These include very uncommon structures such as NN JJ NN NNS (e.g., "Gelatin dry plate negatives") and NN CC NN NN (e.g., "Storage and display furniture"), which account for only one category in the dataset. Apart from the singular-noun discrepancy described above, no substantial difference was found between the LCSH and the AAT, complex terms with three words or more remaining exceptions in both vocabularies.

Two-word terms, however, when added together, represent a large number of the categories, ranging from 39.6% for the LCSH to 43.6% for the AAT. In this context, the skos:altLabel demonstrates its utility in the sample by linking non-preferred terms used by the Powerhouse museum such as "Hand loom", "Chocolate moulds", and "Personal effects" to the topical terms from the LCSH "Handlooms", "Chocolate molds", and "Personal belongings" respectively.

[38] http://www.nltk.org/
[39] http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
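The role of skos:altLabel in the excerpt can be illustrated with a small lookup table. This is a minimal sketch in Python, assuming a hypothetical in-memory vocabulary that holds only the three label pairs quoted in the text; a real vocabulary would be queried remotely or loaded from a file.

```python
# Hypothetical miniature vocabulary: preferred label -> alternative labels.
# The three pairs are the ones quoted in the excerpt.
VOCAB = {
    "Handlooms": ["Hand loom"],
    "Chocolate molds": ["Chocolate moulds"],
    "Personal belongings": ["Personal effects"],
}


def build_index(vocab):
    """Map every preferred and alternative label (case-insensitive) to its preferred label."""
    index = {}
    for preferred, alternatives in vocab.items():
        index[preferred.casefold()] = preferred
        for alternative in alternatives:
            index[alternative.casefold()] = preferred
    return index


INDEX = build_index(VOCAB)


def reconcile(term):
    """Return the preferred label for a term, or None if it is unknown."""
    return INDEX.get(term.strip().casefold())


for term in ["Hand loom", "chocolate moulds", "Sofa"]:
    print(term, "->", reconcile(term))
```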


http://freeyourmetadata.org/publications/


structured data

unstructured data

On March 15th, we visited Washington to see the White House.


Automatically detecting Named Entities in text.

Automatically disambiguating Named Entities in text.

OpenRefine can disambiguate automatically via Web services.
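Disambiguation means choosing which real-world thing a mention such as "Washington" refers to. The toy sketch below, in Python, assumes a hand-made candidate list with made-up context keywords; it only illustrates the idea and is not how OpenRefine or any particular Web service works. The keyword check is a naive substring test.

```python
# Hypothetical candidates for one mention, each with made-up context keywords.
CANDIDATES = {
    "Washington": {
        "http://dbpedia.org/resource/Washington,_D.C.": {"white house", "capitol", "city"},
        "http://dbpedia.org/resource/Washington_(state)": {"seattle", "state", "pacific"},
        "http://dbpedia.org/resource/George_Washington": {"president", "general", "born"},
    }
}


def disambiguate(mention, sentence):
    """Pick the candidate URL whose keywords occur most often in the sentence."""
    text = sentence.casefold()
    scores = {
        url: sum(keyword in text for keyword in keywords)
        for url, keywords in CANDIDATES.get(mention, {}).items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # Without any keyword evidence, refuse to guess.
    return best if scores[best] > 0 else None


sentence = "On March 15th, we visited Washington to see the White House."
print(disambiguate("Washington", sentence))
```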

Even if you are not an IT specialist, you can create Connected Data.

Maybe not for 100% of the data, but very cheaply for 80% to 90%.

free tutorials

Using OpenRefine

Ruben Verborgh, Max De Wilde

free first chapter

Linked Data for Libraries, Archives and Museums

Seth van Hooland, Ruben Verborgh

free first chapter

publishing

creating

using

Connected Data

How do we give users access to Connected Data?

data dump

SPARQL endpoint

a custom API

reusable API

Users download everything and query the data locally.

advantage: simple interface

disadvantages: large files; not up to date
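With a data dump, all query work happens on the user's machine. The snippet below is a minimal sketch in Python, assuming the dump is in N-Triples syntax and contains only URL-to-URL triples; the two-line dump is illustrative data, and the regular expression does not cover literals or blank nodes.

```python
import re

# Illustrative two-line dump in N-Triples syntax (URL-only triples).
DUMP = """\
<http://dbpedia.org/resource/Bill_Clinton> <http://xmlns.com/foaf/0.1/knows> <http://dbpedia.org/resource/Al_Gore> .
<http://dbpedia.org/resource/Al_Gore> <http://xmlns.com/foaf/0.1/knows> <http://dbpedia.org/resource/Bill_Clinton> .
"""

LINE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def parse_dump(text):
    """Parse URL-only N-Triples lines into (subject, predicate, object) tuples."""
    triples = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            triples.append(match.groups())
    return triples


triples = parse_dump(DUMP)

# Local "query": who knows Al Gore?
AL = "http://dbpedia.org/resource/Al_Gore"
who_knows_al = [s for s, p, o in triples if o == AL]
print(who_knows_al)
```

The interface is as simple as a file download, but the local copy is only as fresh as the last download.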


data dump

SPARQL endpoint

a custom API

reusable API

Linked Data consists of triples. SPARQL is a triple query language.

SELECT * { ?movie dbpedia-owl:starring dbpedia:Al_Gore. ?movie rdfs:label ?title. ?movie dbpedia-owl:director ?director. }

Users decide what they want to see.

Endpoints offer millions of triples of Linked Data.

Every user can say: "I want this kind of triples."
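Sending a query to an endpoint is a plain HTTP request: under the SPARQL Protocol, the query text travels in a `query` parameter. The following is a minimal sketch in Python that only builds the request URL and does not contact any server. The endpoint address and the prefix declarations are assumptions added for this illustration; they do not appear on the slides.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint URL; not taken from the slides.
ENDPOINT = "http://dbpedia.org/sparql"

# The Al Gore query, with prefix declarations added (assumed namespaces).
QUERY = """\
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * {
  ?movie dbpedia-owl:starring dbpedia:Al_Gore.
  ?movie rdfs:label ?title.
  ?movie dbpedia-owl:director ?director.
}
"""


def sparql_get_url(endpoint, query):
    """Build a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again yields the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Every such request asks the server to evaluate an arbitrary query, which is where the cost for the publisher comes from.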

SPARQL endpoints have limited availability.

If you have operational need for SPARQL accessible data, you must have your own infrastructure.

No public endpoints. Public endpoints are for lookups and discovery; sort of a dataset demo.

—Orri Erling, OpenLink (2014)

Users choose queries however they like.

advantages: up to date; uniform and flexible to query

disadvantages: high cost for the publisher; low availability


data dump

SPARQL endpoint

a custom API

reusable API

More than 12,000 Web APIs already exist.

So there are 12,000 different ways to do the same thing.

So above all, don't build an API. You don't want to be number 12,001.

“The lie of the API”

APIs make data available the way the publisher wants.

advantages: up to date; cheap to offer

disadvantages: expensive to build and maintain; specific query software needed


data dump

SPARQL endpoint

a custom API

reusable API

How can we make one API for Connected Data?

cheap to offer

easy to query

and still up to date

The basis of Linked Data consists of triples.

Offer data per triple pattern.

Bill_Clinton ? ?

? ? Al_Gore

? knows ?
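Answering a triple pattern takes very little server logic: each position is either a fixed value or a wildcard. Here is a minimal sketch in Python, assuming an in-memory list of triples with shortened names and using `None` for the `?` wildcard; the data and the function name `match` are illustrative.

```python
# Illustrative in-memory data with shortened names instead of full URLs.
TRIPLES = [
    ("Bill_Clinton", "knows", "Al_Gore"),
    ("Al_Gore", "knows", "Bill_Clinton"),
    ("Bill_Clinton", "type", "Person"),
]


def match(pattern, triples=TRIPLES):
    """Return all triples that fit the pattern; None plays the role of '?'."""
    return [
        triple
        for triple in triples
        if all(want is None or want == have for want, have in zip(pattern, triple))
    ]


print(match(("Bill_Clinton", None, None)))  # Bill_Clinton ? ?
print(match((None, None, "Al_Gore")))       # ? ? Al_Gore
print(match((None, "knows", None)))         # ? knows ?
```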

The user's client solves complex questions.

SELECT * {

?movie dbpedia-owl:starring dbpedia:Al_Gore.

?movie rdfs:label ?title.

?movie dbpedia-owl:director ?director.

}
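A client can answer this three-line query by asking for one triple pattern at a time and combining the answers itself. The sketch below, in Python, shows one simple way to do that: it fills in the variables already known and requests the next pattern. The function `fetch` is a hypothetical stand-in for a request to a triple-pattern server, the data is a small illustrative sample with shortened names, and this is not presented as the algorithm of any particular client.

```python
# Illustrative sample data with shortened names instead of full URLs.
DATA = [
    ("An_Inconvenient_Truth", "starring", "Al_Gore"),
    ("An_Inconvenient_Truth", "label", "An Inconvenient Truth"),
    ("An_Inconvenient_Truth", "director", "Davis_Guggenheim"),
    ("Some_Other_Film", "label", "Some Other Film"),
]


def fetch(pattern):
    """Hypothetical stand-in for one server request; None is a wildcard."""
    return [
        t for t in DATA
        if all(want is None or want == have for want, have in zip(pattern, t))
    ]


def is_var(term):
    return term.startswith("?")


def solve(patterns, binding=None):
    """Yield variable bindings that satisfy all patterns, one pattern at a time."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    # Fill in variables we already know; unknown variables become wildcards.
    request = tuple(binding.get(t) if is_var(t) else t for t in first)
    for triple in fetch(request):
        extended = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            if is_var(term) and extended.setdefault(term, value) != value:
                consistent = False
                break
        if consistent:
            yield from solve(rest, extended)


QUERY = [
    ("?movie", "starring", "Al_Gore"),
    ("?movie", "label", "?title"),
    ("?movie", "director", "?director"),
]

results = list(solve(QUERY))
print(results)
```

The server only ever sees simple pattern requests; the joining happens on the client, which is why queries take longer but the server stays cheap.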

Simple servers and smart clients ensure scalability.

advantages: cheap to offer; high availability; up-to-date data

disadvantage: queries run slower

Our research studies the trade-offs between Web APIs.

data dump

SPARQL query results

triple patterns

linkeddatafragments.org

publishing

creating

using

Connected Data

In which ways can we use Connected Data?

offline

like a database

like the Web

Download everything locally, and carry on as usual.


offline

like a database

like the Web

The database philosophy: ask, wait, act.

MySQL database

Result

The database philosophy: ask, wait, act.

SPARQL endpoint

Result


offline

like a database

like the Web

The Web philosophy: ask, then act while data streams in.

Result

Result
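The difference between the two philosophies can be shown with a Python generator. This is a minimal sketch in which `results` is a hypothetical stand-in for answers arriving one by one; the database style collects everything before acting, while the Web style acts on each result as soon as it arrives.

```python
def results():
    """Hypothetical stand-in for results that arrive one at a time."""
    for title in ["first", "second", "third"]:
        yield title


def database_style(source):
    """Ask, wait for the complete answer, then act on all of it."""
    rows = list(source)  # blocks until every result is in
    return [row.upper() for row in rows]


def web_style(source):
    """Ask, then act on each result as soon as it streams in."""
    for row in source:
        yield row.upper()


stream = web_style(results())
first = next(stream)  # available before the remaining results exist
print(first)
print(database_style(results()))
```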

Connected Data starts with intelligent applications.

Don't build intelligent servers.

Build servers that enable clients to respond intelligently.

Pick the low-hanging fruit. Don't wait until the whole tree is ripe.

@RubenVerborgh

ruben.verborgh.org
