
Het publiceren van datasets in het transportdomein voor maximaal hergebruik

Publishing Transport Data for Maximum Reuse

Pieter Colpaert

Supervisors: prof. dr. E. Mannens, dr. ir. R. Verborgh

Dissertation submitted to obtain the degree of Doctor of Industrial Sciences

Department of Electronics and Information Systems
Chair: prof. dr. ir. R. Van de Walle

Faculty of Engineering and Architecture
Academic year 2017–2018


ISBN 978-94-6355-040-6
NUR 982, 993
Legal deposit: D/2017/10.500/75


Examination board
Prof. Oscar Corcho
Prof. Filip De Turck
Prof. Sidharta Gautama
Dr. ir. Philip Leroux
Mr. Björn De Vidts

Chair
Prof. Patrick De Baets

Supervisors
Prof. Erik Mannens
Dr. ir. Ruben Verborgh


Preface

“It is the most beautiful day of the year so far.”

― Jozef Colpaert.

When I started studying engineering, I thought data management was in a far more advanced state. I could not imagine that still, in 2007, manual intervention was needed to find the right data source and to integrate it into your own application. I could also not imagine that it would be illegal to – in your spare time – create a more mobile-friendly webpage for accessing the time schedules of the Belgian railway company. Nonetheless, the iRail project still received a cease and desist letter, claiming a breach of Intellectual Property Rights (ipr). This sparked my interest in data availability in general, as much as it sparked my personal interest in ipr, as I was desperate to understand whether creating such a website was indeed illegal.

iRail did not stop the project. Instead, a non-profit organization was set up in 2010 to foster creativity using mobility data, bringing together a community of enthusiasts – I was one of them – after the story hit the media. The organization released an Application Programming Interface (api) which would allow third parties to integrate railway data within their own services. The api is still online as an open-source project, accepting contributions from other transport data enthusiasts. Thanks to the iRail project, I was able to access most interesting research data in primetime. The query logs of this api, for instance, would prove themselves priceless.

In 2011, I wrote my master’s thesis on extending the Open Data publishing framework The DataTank – which I started earlier that year – with a queryable interface over http. The goal of the thesis was to offer a better experience to developers that wanted to use governmental datasets in their own software. While we did design a query language for small in-memory documents, we did not take into account the scalability of these server interfaces, nor did we study the effects on the information system as a whole. We only tested the overall query execution time, which would show an increasing response time when an increasing amount of datasets would be combined. The querying interface on top of The DataTank later disappeared again from the stable release, yet the ambition to make Open Data more used and useful remained.

Challenged by officials to whom I wanted to prove there was indeed commercial value in Open Data, I co-founded the start-up FlatTurtle with Yeri Tiete, the founder of iRail, and Christophe Petitjean, the eccentric business owner of rentalvalue who came up with the idea to sell information displays to professional real estate owners. These information displays would show the latest information about, for example, public transport, weather, news, or internal affairs. With FlatTurtle, we were unable to reuse datasets by relying on basic building blocks of the Web: at the time, for example, the servers were not using proper cache headers, legal conditions were not clear, and identifiers would conflict and change across data updates.

The company did not make me financially rich, yet what I have been able to learn in terms of running a business in this period was invaluable. Furthermore, it taught me that – while there are a lot of people – there is still a limit to the number of people you can meet in one day. While trying to change the world for the better, whether it is with a product you sell, with a research proposal, or with a general idea – such as the one of Open Data – your impact will be as big as the quality of your pitch to explain the solution. With this in mind during my time pursuing a PhD, I tried my best when I gave one of the many invited talks explaining Linked Open Transport Data and its importance. Having to explain this subject over and over again heavily influenced the first chapters of this book.

At the same time, the iRail non-profit merged together with other initiatives such as OpenStreetMap, Creative Commons and Open Access, into Open Knowledge Belgium. Still today I am part of the board of directors of that non-profit organization, trying to create a world where knowledge creates power for the many, not the few.

When starting my PhD in November 2012, I also thought the field of Web Engineering would be in a more advanced state. In the field of Web apis, for example, new vague paradigms are still popping up today without clear comparisons between their advantages and disadvantages. In this PhD, a modest contribution is made to measure data publishing interfaces for the purpose of public transit route planning. Instead of only measuring response times, I measured the impact of this interface on the information system as a whole, measuring the cost-efficiency of a server interface and its cacheability for a certain mix of queries, and described non-measurable benefits such as flexibility for developers or privacy by design.

Overall, over the last four years, I have been happy. Being able to come home in the evening with a feeling that you are contributing to a better world is how I would describe my dream job. When confronted with these kinds of life questions late at night in a bar with friends, I would – like a geek does – with great pleasure and at great length explain the Kardashev scale. This scale, with three levels, is a method of measuring a civilization’s level of technological advancement. Today, humanity is at level zero, not being able to survive a natural disaster and not being independent of its host planet for its energy supply. Crucial to becoming a type 1 civilization – and it is unclear whether humanity will become one – is to have an information system in which each individual can contribute to the civilization’s knowledge and use it to make informed decisions. Such information systems will, more than ever, play a crucial role in the coming years in – just to name a few – education, science, decision making, and politics.

Belgium in particular has been an interesting country for doing research into governmental organizations. It is a dense country, where, within a small geographic area, different governmental levels, from local to European, and companies of different scale can be studied. This also makes a decentralized approach to data governance a necessity: the Belgian federal government’s datasets have to be interoperable with the datasets from departments and agencies of the Flemish government, as well as with the databases of all local governments. In our work with these organizations, we had an interdisciplinary team in which I worked together with people from, among others, mict and smit. While it is impossible to mention everyone, I owe a big thank you to two people with whom I without doubt have collaborated the most so far. Nils and Mathias, I have had a blast studying the road to Open Data together. I look forward, for the next few years, to making the region of Flanders a leading example in real-time Open Data publishing, and to growing our interdisciplinary team of Open Data researchers, as there is still a lot of work left undone.

Ruben, it is a true honor to be able to be part of your team. You not only set the bar high for yourself and the team, you also know how to guide the team towards success and impact. I look forward to continuing to work on Knowledge on Web-Scale under your supervision as a postdoctoral researcher. Erik, thanks for always having my back.

To my – old and new, close and distant – colleagues, project partners, and people encountered in local, regional, federal, and European government organizations: thank you for being an infinite source of inspiration. You will undoubtedly recognize parts of this dissertation that were the result of a discussion we may have had or a problem you confronted me with.

Mom and dad, you often refer to me as the optimist (I optimistically call it realism). I am happy to be able to live with this trait. As the quote used to introduce this preface goes, I am sure this is not merely caused by genetics, but that it was also caused by the way I was raised. From my perspective – that is all I can speak for – all went well in this process: thank you!

Finally, Annelies, thank you for reminding me from time to time that there is more to life than Open Data.

Pieter


Table of Contents

Chapter 1 — Introduction
  1. Research question
  2. The chapters and publications
  3. Innovation in route planning applications
  4. The projects

Chapter 2 — Open Data and Interoperability
  1. A data format
  2. Documenting meaning
  3. Intellectual Property Rights and Open Data
  4. Sharing data: a challenge raising interoperability
    1. Legally
    2. Technically
    3. Syntactically and semantically
  5. Interoperability layers
  6. The 5 stars of Linked Data
  7. Information management: a proposal
  8. Intelligent Agents
    1. Caching for scalability
    2. The all knowing server and the user agent
    3. Queryability
  9. Conclusion

Chapter 3 — Measuring Interoperability
  1. Comparing identifiers
    1. An initial metric
    2. Identifier interoperability, relevance and number of conflicts
    3. The role of Linked Open Data
  2. Studying interoperability indirectly
  3. Related work
  4. Qualitatively measuring different parameters
    1. Legal interoperability
    2. Technical interoperability
    3. Syntactic interoperability
    4. Semantic interoperability
    5. Querying interoperability
  5. Conclusion

Chapter 4 — Raising Interoperability of Governmental Datasets
  1. Data portals: making datasets discoverable
    1. The DataTank
    2. 5 stars of Open Data Portals
    3. Open Data Portal Europe and the interoperability of transport data
    4. Queryable metadata
  2. Open Data in Flanders
    1. A tumultuous background
    2. Discussing datasets at the Department of Transport and Public Works
    3. Challenges and workshops
    4. Recommendations for action
  3. Local Decisions as Linked Open Data
    1. Implementation and demonstration
  4. Conclusion

Chapter 5 — Transport Data
  1. Data on the road
    1. Constraints for a large-scale its data-sharing system
    2. A use case in Ghent
    3. Access to road networks
  2. Public Transit Time Schedules
    1. Route planning queries
    2. Other queries
    3. Route planning algorithms
    4. Exchanging route planning data over the Web
  3. Conclusion

Chapter 6 — Public Transit Route Planning Over Lightweight Linked Data Interfaces
  1. The Linked Connections framework
  2. Evaluation design
  3. Results of the overall cost-efficiency test
  4. Linked Connections with wheelchair accessibility
    1. Linked Connections with filtering on the client
    2. Linked Connections with filtering on the server and client
  5. Wheelchair accessibility feature experiment
    1. Evaluation design
    2. Results
  6. Discussion
    1. Network latency
    2. Actual query response times
    3. More advanced federated route planning
    4. Denser public transport networks
    5. Real-time updates: accounting for delays
    6. Beyond the eat query
    7. On disk space consumption
    8. Non-measurable benefits
  7. Conclusion and discussion

Chapter 7 — Conclusion


Glossary

api – Application Programming Interface
bgp – Basic Graph Patterns
cors – Cross Origin Resource Sharing
cpu – Central Processing Unit
csa – Connection Scan Algorithm
csv – Comma-Separated Values
dcat – Data CATalogue vocabulary
dtpw – Flemish Department of Transport and Public Works
eat – Earliest Arrival Time
gtfs – General Transit Feed Specification
hateoas – Hypermedia As The Engine Of Application State
html – Hypertext Markup Language
http – HyperText Transfer Protocol
ietf – Internet Engineering Task Force
iiop – Identifier Interoperability
imi – Information Modeling and Interoperability
inspire – Infrastructure for Spatial Information in the European Community
ipr – Intellectual Property Rights
its – Intelligent Transport Systems
json – Javascript Object Notation
lc – Linked Connections
ldf – Linked Data Fragments
meat – Minimum Expected Arrival Time
psi – Public Sector Information
rdf – Resource Description Framework
rest – Representational State Transfer
rsd – Road Sign Database
soa – Service Oriented Architecture
siri – The Service Interface for Real Time Information
tpf – Triple Pattern Fragments
uri – Uniform Resource Identifier
url – Uniform Resource Locator
w3c – World Wide Web Consortium
xml – Extensible Markup Language


Samenvatting

The ways in which travelers want to plan their routes are diverse. A few examples: calculating routes that take a physical disability into account; combining different modes of transport; taking into account whether an end user owns a (foldable) bike, a car, or certain subscriptions; or even calculating routes based on the nicest pictures on social media channels. Rather than merely a mathematical problem, route planning today is a problem that depends on the availability of data: better route planning advice can be given when more datasets are processed during query evaluation.

Public administrations maintain datasets that can contribute to such route planning advice. Today there are clear indications that these administrations are already starting to publish their data on Open Data portals. Yet the cost of reusing those datasets in other systems is still too high today. We can say this simply because we have not yet found evidence that many of these datasets are being used. How can we improve Open Data strategies and make these datasets more used and useful?

Publishing data for maximum reuse means pursuing a lower cost of integrating that data into the systems of third parties. This is an automation problem: in the ideal case, software written to work with one dataset also works directly with a dataset published by another authority. We can lower that adoption cost and automate reuse if we raise the interoperability between all these datasets on the Web. The focus of this PhD is therefore the study of data source interoperability – each source with its own heterogeneity problems – on five different layers. The (i) legal layer describes the question whether the law and the licenses allow us to combine two datasets. On the (ii) technical layer, we study the difficulties that come with physically bringing two datasets together. The (iii) syntactic layer describes whether the format in which a dataset is serialized can be combined with other sources. Furthermore, the syntax can also offer building blocks to identify elements in a standardized way or to classify them within a domain model. This creates the basis for reaching a higher (iv) semantic interoperability, as the identifiers for the same real-world objects can then be aligned.

The goal of this PhD is to study how data source interoperability can be raised, in order to lower the cost of reuse in route planning software.

Once two datasets are interoperable across these four layers, it is not automatically the case that questions can easily be asked over both sources. As we study data sources published by multiple authorities, we also add the (v) querying layer. Two extremes exist today for publishing data: either the queries are evaluated entirely on the server of the data publisher, or only a data dump is offered and the queries are thus evaluated entirely on the infrastructure of the reuser. The Linked Data Fragments (ldf) axis describes a research framework for finding a balance between functionality provided by the data-publishing server and functionality that has to be implemented by the reuser. Instead of offering one data dump, linked fragments are proposed. This allows programs to discover more fragments through the metadata in each fragment.

At each level, we can already propose generic solutions today to maximize the potential reuse of data. The legal aspects, for example, are covered by the Open Definition. This definition requires a document to be annotated with a public license. The only conditions that may be imposed are, first, that attribution may be required, and second, that a derived document may be required to be published under the same license. Legal interoperability rises further when the reuse conditions themselves are machine readable and are, in their own way, also published for maximum reuse.

The Web is our worldwide information system. To ensure technical interoperability, the uniform interface – one of the rest architectural constraints – that we adopt is http. For the identifiers, too, we choose to use web addresses, or http uris. This way, the same identifiers can be used to access different representations of the same object. The identifiers become globally unique strings of characters, which avoids identifier conflicts. Moreover, using the Resource Description Framework (rdf), different serializations can annotate each data element with these http uris, so that every element in such a data source is automatically documented.

In 2015, together with communication scientists, I had the opportunity to study the organizational challenges at the Department of Mobility and Public Works of the Flemish government. Thanks to three European directives (psi, inspire, and its) and their own insights, we could already discern a strategy for maximizing the reuse of their data. How to take such an Open Data strategy to the next level, however, was still unclear. By interviewing 27 data owners and directors, we arrived at a list of proposed actions across all interoperability levels.

None of the widely used specifications in the transport world today – such as gtfs or datex2 – document their data models with uris. To make them usable in the rdf world nonetheless, I built mappings myself and published them.

To plan routes over different sources, we studied existing route planning algorithms. The base algorithm to be selected must also allow an efficient fragmentation strategy. Our hypothesis was that this way a new trade-off could be made, putting forward a cost-efficient interface – government servers should, after all, not have to compute an answer to any arbitrary question – while also allowing sufficient flexibility for reusers. I decided to use the Connection Scan Algorithm for this purpose.

In this book we introduce the Linked Connections (lc) framework. An lc server publishes a sorted list of connections – pairs of a departure and an arrival – in fragments, linked together via next and previous page links. The Connection Scan Algorithm can then be implemented on the infrastructure of the data reuser.

The drawback of this publishing mechanism is, of course, a higher bandwidth consumption. This leads to higher query times when the reuser has to answer such a question for the first time, certainly on a slow network. The reuser, however, can just as well be an intermediate server that sends a concise answer to an end-user device. When we studied the impact of an extra feature on the server, we noticed that with, for example, a wheelchair accessibility filter on the server, both the server and the end-user's device had more work to process the data.

We can conclude that with lc we have created a framework that deals optimally with the five data source interoperability layers. A new trade-off was made between work done by the server and work done by the reuser, with which reuse across the information system can be maximized.

In general, based on this PhD, I can give a number of tips for improving current http interfaces in order to implement a better Open Data policy. (i) Fragment your datasets and publish the documents over the secure https protocol. How the fragments should be chosen differs from case to case. (ii) When your publishing mechanism needs to allow faster answers, more fragments can also be offered on the server side, as well as more filters, aggregated datasets (relevant for time series), and so on. To optimize server scalability, it is important to (iii) implement the right cache headers. To optimize the discoverability of the data, I recommend (iv) adding hypermedia descriptions. (v) Web addresses or http uris, in turn, serve to document identifiers for both things and the domain model. For legal interoperability, (vi) a link to a machine-readable open license must be added to the document. (vii) Also add a Cross Origin Resource Sharing http header, so that web applications on other domains can reuse your data as well. Finally, (viii) dcat-ap metadata can be provided, so that the dataset can also be described in Open Data portals.

This approach does not only work for static data: documents on the Web can, after all, change. The http protocol allows an information resource to be cached for a number of seconds, or to be cached on the basis of a so-called ETag. Even when a document is cached for only a few seconds, a server gains scalability, and a maximum load on a back-end system can be calculated.

Despite the long time the Web has been around – at least in terms of digital developments – there are still organizational challenges in building a worldwide information system for everyone. I hope that with this PhD I have written a reference work that can serve as input for standards, and as an inspiration for people who want to build their own Open Data system at Web scale.


Summary

The way travelers want route planning advice is diverse. To name a few: finding journeys that are accessible with a certain disability; combining different modes of transport; taking into account whether the traveler owns a (foldable) bike, car or public transit subscription; or even calculating journeys with the nicest pictures on social network sites. Rather than merely being a mathematical problem, route planning advice became a data accessibility problem. Better route planning advice can only be given when more datasets can be used within the query evaluation.

Public administrations maintain datasets that may contribute to such route planning advice. Today, there is evidence of such datasets being published on Open Data Portals, yet still the cost for adopting these datasets in end-user systems is too high, as there is no evidence yet of wide reuse of these simple datasets. In order to make these datasets more used and useful, how can we leverage Open Data publishing policies?

Publishing data for maximum reuse means pursuing a lower cost for adoption of your dataset. This is an automation challenge: ideally, software written to work with one dataset works as well with datasets published by a different authority. We can lower the cost for adoption of datasets and automate data reuse when we raise the interoperability between all datasets published on the Web. Therefore, in this PhD we study the interoperability of data sources – each with their heterogeneity problems – and introduce 5 data source interoperability levels. The (i) legal level puts forward the question whether we are legally allowed to bring two datasets together. On the (ii) technical level, we can study whether there are technical difficulties to physically bring the datasets together. The (iii) syntactic interoperability describes whether the serializations can be brought together. Moreover, the syntax should provide building blocks to document identifiers used in the dataset, as well as the domain model used. This creates the basis for reaching a higher (iv) semantic interoperability, as for the same real-world objects, identifiers can be aligned.

The goal of this PhD is to study how to raise the data source interoperability of public datasets, in order to lower the cost for adoption in route planning services.

As we study data sources published by multiple authorities, and as we still need to be able to evaluate queries over these datasets, we also added the (v) querying level. When the other four layers are fulfilled, we can otherwise still not guarantee a cost-efficient way to evaluate queries. Today, two extremes exist to publish datasets on the Web: either the query evaluation happens entirely on the data publisher's interface, or only a data dump is provided, and the query evaluation happens entirely on the infrastructure of a reuser after replicating the entire dataset. The Linked Data Fragments (ldf) axis introduces a framework to study the effort done by clients vs. the effort done by servers, and tries to find new trade-offs by fragmenting datasets into a finite number of documents. By following hypermedia controls within these documents, user agents can discover fragments as they go along.

To each of these layers, we can map generic solutions for maximizing the potential reuse of a dataset. As we are working towards Open Data, the legal aspect is covered by the Open Definition. This definition requires the data to be accompanied by a public license that informs end-users about the restrictions that apply when reusing these datasets. The only restrictions that may apply are the legal obligation to always mention the source of the original document containing the data, and the legal restriction that, when changing this document, the resulting document needs to be published with the same license conditions. The legal interoperability raises when these reuse conditions themselves are also machine interpretable, and are in the same way published for maximum reuse as well.

We are using the Web as our worldwide information system. In order to ensure technical interoperability, the uniform interface – one of the rest architectural constraints – that we adopt is http. Yet, also for the identifiers, we choose to use http identifiers or uris. This way, the same identifiers can be used for accessing different serializations and representations of the same object. This also means the identifiers become a globally unique string of characters, which avoids identifier conflicts. Furthermore, using the Resource Description Framework (rdf), different serializations can annotate each data element with these http uris as well, which enables identifier reuse and linking across independent data sources.
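As a small illustration of how http uris double as identifiers and as documentation hooks, the following Python sketch (using the rdflib library; all uris below are invented for illustration) states two facts about the same stop and links it to an identifier minted by another publisher. It is a minimal sketch of the idea, not the modeling used in the later chapters.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDFS

# Invented identifiers for illustration: two publishers describing one stop.
EX = Namespace("https://data.example.org/stops/")

g = Graph()
stop = EX["gent-sint-pieters"]

# Because the identifier is a globally unique http uri, a second data source
# can add facts about the very same resource without identifier conflicts.
g.add((stop, RDFS.label, Literal("Gent-Sint-Pieters", lang="nl")))
g.add((stop, OWL.sameAs,
       URIRef := EX["gent-sint-pieters"]))  # placeholder; see next statement
g.remove((stop, OWL.sameAs, URIRef))
g.add((stop, OWL.sameAs,
       Namespace("https://other-publisher.example.com/id/station/")["1234"]))

print(g.serialize(format="turtle"))
```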

In 2015, I had the opportunity to study the organizational challenges, together with communication scientists, at the Flemish Department of Transport and Public Works (dtpw) of the Flemish government. Three European directives (psi, inspire, and its), extended with their own insights, created a clear willingness to publish data for maximum reuse. How to implement such an Open Data strategy in a large organization was still unclear. As we interviewed 27 data owners and directors, we came to a list of recommendations for next steps on all interoperability levels.

None of the common specifications – such as gtfs, transmodel, and datex2 – for describing time schedules, road networks, disruptions, and road events has an authoritative Linked Data approach. For the specific case of public transit time schedules, we used the gtfs specification and mapped the terms within the domain model to uris for these to become usable in rdf datasets.

For route planning over various sources, we studied the currently existing public transit route planning algorithms. The base algorithm to be selected, on which other route planning algorithms can be based, needs to work on top of a data model that allows for an efficient fragmentation strategy. Our hypothesis was that this way a new trade-off could be established, putting forward a cost-efficient way of publishing – as governmental organizations cannot afford to evaluate all queries over all datasets on their servers – as well as leaving room for client-side flexibility. For this purpose, we found the Connection Scan Algorithm (csa) to be a good fit.

We introduced the Linked Connections (lc) framework. An lc server publishes an ordered list of connections – departure and arrival pairs – in fragments interlinked with next and previous page links. The csa algorithm can then be implemented on the side of the data consumer. Enabling the client to do the query execution comes with benefits: (i) off-loading the server, (ii) better user-perceived performance, (iii) more datasets can be taken into account, and (iv) privacy by design, as the query itself is never sent to a server.
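To make the client-side division of labor concrete, here is a minimal Python sketch of such a consumer: it follows next-page links through two hypothetical in-memory fragments (standing in for documents fetched over http) and runs a simplified earliest-arrival connection scan. The stop names, times, and paging scheme are invented for illustration and are not the evaluated implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Connection:
    # A single vehicle movement from one stop to the next.
    dep_stop: str
    dep_time: int   # seconds since midnight
    arr_stop: str
    arr_time: int

# Hypothetical fragments: each page holds connections ordered by departure
# time, with a "next" link to the following page (simulated in memory here).
PAGES: Dict[str, dict] = {
    "/connections?page=0": {
        "connections": [
            Connection("A", 28800, "B", 29400),
            Connection("B", 29700, "C", 30300),
        ],
        "next": "/connections?page=1",
    },
    "/connections?page=1": {
        "connections": [Connection("C", 30600, "D", 31200)],
        "next": None,
    },
}

def earliest_arrival(origin: str, destination: str, depart_after: int) -> Optional[int]:
    """Connection Scan: scan connections in departure-time order, keeping the
    earliest known arrival time per stop, while following next-page links."""
    earliest = {origin: depart_after}
    page: Optional[str] = "/connections?page=0"
    while page is not None:
        fragment = PAGES[page]            # with a real server: an HTTP GET here
        for c in fragment["connections"]:
            reachable = earliest.get(c.dep_stop, float("inf")) <= c.dep_time
            if reachable and c.arr_time < earliest.get(c.arr_stop, float("inf")):
                earliest[c.arr_stop] = c.arr_time
        page = fragment["next"]           # hypermedia control to the next fragment
    return earliest.get(destination)

print(earliest_arrival("A", "D", 28800))  # -> 31200
```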

The drawback of this publishing method is a higher bandwidth consumption, and when the client has not cached any resources yet, the querying – certainly when using a slow network – is slow. However, the clients do not necessarily need to be the end-user devices: intermediary servers can also evaluate queries over the Web, and give concise and timely answers to a smaller set of end-users. When studying whether an lc server should also expose the functionality of wheelchair accessibility, we found that both client and server had more work to process the data.

With lc, we designed a framework with a high potential interoperability on all five levels on the Web. We researched a new trade-off for publishing public transport data by evaluating the cost-efficiency. The trade-off chosen allows for flexibility on the client side, while offering a cost-efficient interface to data publishers.

In order to achieve a better Web ecosystem for sharing data, we propose a set of minimum extra requirements when using the http protocol. (i) Fragment your datasets and publish the documents over https. The way the fragments are chosen depends on the domain model. (ii) When you want to enable faster query answering, provide aggregated documents with appropriate links (useful for, for example, time series), or expose more fragments on the server side. For scalability, (iii) add caching headers to each document. For discoverability, (iv) add hypermedia descriptions in the document. (v) Provide a web address (uri) per object you describe, as well as http uris for the domain model. This way, each data element is documented and there will be a framework to raise the semantic interoperability. For the legal interoperability, (vi) add a link to a machine-readable open license in the document. (vii) Add a Cross Origin Resource Sharing http header, enabling access from pages hosted on different origins. Finally, (viii) provide dcat-ap metadata for discoverability in Open Data Portals.
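As a rough illustration of how several of these requirements can come together on a single fragment, the Python sketch below assembles the http headers and a JSON-LD-like body for one hypothetical page of connections. The example.org urls are invented, and the hydra and dct terms are shown as one plausible vocabulary choice rather than a prescription of this summary.

```python
import json

# Hypothetical url scheme for one fragment of a connections dataset.
FRAGMENT = "https://data.example.org/connections?departureTime=2017-06-01T08:00"

# (iii) caching and (vii) cors: headers a server could send with each fragment.
headers = {
    "Content-Type": "application/ld+json",
    "Cache-Control": "public, max-age=30",   # short-lived cache, fine for live data
    "Access-Control-Allow-Origin": "*",      # allow reuse from pages on other origins
}

# (i) a fragmented dataset, (iv) hypermedia links, (v) http uris per object and
# for the domain model, and (vi) a machine-readable license link, as JSON-LD.
body = {
    "@id": FRAGMENT,
    "hydra:next": "https://data.example.org/connections?departureTime=2017-06-01T08:10",
    "hydra:previous": "https://data.example.org/connections?departureTime=2017-06-01T07:50",
    "dct:license": "http://creativecommons.org/publicdomain/zero/1.0/",
    "@graph": [{
        "@id": "https://data.example.org/connections/1234",
        "lc:departureStop": "https://data.example.org/stops/A",
        "lc:arrivalStop": "https://data.example.org/stops/B",
    }],
}
# (viii) a dcat-ap description of the dataset as a whole would live in a
# separate catalogue document pointing to these fragments.

print(json.dumps({"headers": headers, "body": body}, indent=2))
```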

This approach does not limit itself to static data. The http protocol allows for caching resources for smaller amounts of time. Even when a document may change every couple of seconds, the resource can still be cached during that period of time, and a maximum load on a back-end system can be calculated.
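The sketch below shows, with the Python requests library and an invented endpoint, how a reuser could combine such short-lived caching with ETag revalidation: when nothing changed, the server can answer 304 Not Modified and never resends the body.

```python
import requests  # assumes the third-party 'requests' package is available

URL = "https://data.example.org/connections?departureTime=latest"  # hypothetical

# First fetch: the server may return an ETag next to a short max-age.
first = requests.get(URL)
etag = first.headers.get("ETag")

# Revalidation: a conditional request lets the server answer 304 Not Modified
# when the document is unchanged, which bounds the load on the back-end.
conditional = {"If-None-Match": etag} if etag else {}
second = requests.get(URL, headers=conditional)

if second.status_code == 304:
    print("cached representation is still valid")
else:
    print("document changed; received a fresh representation")
```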

Despite the old age of the Web – at least in terms of digital technology advances – there are still organizational challenges to overcome to build a global information system for the many. I hope this PhD can be the input for standardization activities within the (public) transport domain, and an inspiration to publishing on Web-scale for others.


Chapter 1
Introduction

“If you want to go fast, go alone. If you want to go far, go together.”

― African proverb.



How far do you live from work? Keep the answer to this question in mind. Is the unit of measurement you used to answer this question minutes or kilometers? When asking a certain audience this question, each time, a significant number of people answered with a distance in kilometers, while others would answer with a distance in time. Now imagine a software program has to calculate the time distance from one point to another for an end-user. Just imagine the amount of datasets that could be used to come up with a good response to that question… For 4 years I have been working in projects that had one goal in common: sharing data for an unknown number of use cases and an unknown number of users.

In this chapter we first discuss the research question. Then we discuss in more detail the projects that contributed to this research and the structure of the rest of this book.

1. research question

I studied lightweight interfaces for sharing open transport datasets (I will define open in the next chapter). The term Open Transport Data, on the one hand, entails the goal of maximizing the reuse of your transport datasets. (Transport data is used as a focus, yet there is no clear distinction between transport data and other kinds of data. As an illustration, even datasets like criminality rates could at some point be used to provide a better route planning experience. Only in Chapters 5 and 6 will we dive into the specifics of the transport domain.) For example, in order to inform commuters better, a public transport agency wants to make sure the latest updates about their transit schedules are available in every possible end-user interface. Another example of a clear incentive for governmental organizations specifically to publish data for maximum reuse would be policy decisions: datasets maintained by public administrations should be published “once-only” and become as used and useful as possible, as part of their core task. Publishing data for maximum reuse is an automation challenge: ideally, software written to work with one dataset works as well with datasets published by a different authority. We can lower the cost for adoption of datasets and automate data reuse when we raise the interoperability between all datasets published on the Web.

Lightweight interfaces, on the other hand, entail that when publishing the data for maximum reuse – and thus when this data becomes widely adopted – there is not an ever-growing publishing cost that comes with the server interface. We observe today that there are two common ways to share transport data. The first way is to provide an export of all facts in one dump, which reusers can use to answer any question, albeit slowly. The second way is to provide a data service, which reusers can use to answer a set of specific questions quickly. The goal of the Linked Connections framework introduced in Chapter 6 is to experiment with the trade-offs between the effort that needs to be done by reusers, the questions that can be answered quickly by reusers, and the cost-efficiency of the data publishing interface when reuse grows.

Hence, the research question of this PhD is: “how can the data source interoperability of public datasets be raised, in order to lower the cost for adoption in route planning services?”

In order to answer this question, this book will go broader than merely discussing the technical aspects of data sources. I worked in close collaboration with communication scientists to study the management and publishing of data sources qualitatively as well. This might make this research question atypical for a PhD within software engineering.

2. assumptions and limitations

This work assumes http is the uniform interface for Open Data publishing. Other protocols exist (e.g., ftp or e-mail), yet the scale of the adoption of http servers for Open Data is irreversible. If you were reading this dissertation at a time in the future where the http protocol is no longer used (which, at the time of writing, I would describe as “unlikely”), I would first of all have to admit that this assumption in my PhD is terribly wrong. However, the experiments described in the next chapters would also work for other protocols: an identifier strategy would still be needed over this new protocol (Chapter 3 and Chapter 4), as well as a way to fragment these datasets (Chapter 6).

A second assumption is that the datasets that can be published will not have any privacy constraints. In practice, from the moment a dataset may contain something that may identify people, it needs to go through a privacy check before it can be published publicly. As this is a different research area, we made the assumption that every dataset mentioned in this PhD either contains no personal data, or has gone through the necessary checks in order to be disseminated.

A limitation of the current work – yet part of future work – is that we focus on public transport routing and do not calculate routes over road networks. Calculating routes over a road network can, however, be solved in a similar fashion, following the principles described in the conclusion.

Finally, the last limitation is that we describe the cost for adoption, yet do not describe an economic model to calculate such a cost. Instead, we assume that by raising the interoperability, the cost for adoption will lower.

3. the chapters and publications

When I started research at what was back then still called the MultiMediaLab, I was handed a booklet called “Is This Really Science? The Semantic Webber’s Guide to Evaluating Research Contributions” [1], written by Abraham Bernstein and Natasha Noy. In that booklet, a quote from Ernest Rutherford was used: “All science is either physics or stamp collecting”. This bold statement illustrates a useful distinction between two types of research: one that studies a phenomenon and creates hypotheses about it, and one that catalogues and categorizes observations. The next three chapters will show how we see and categorize the domain of data publishing, in order to see more clearly how we can contribute to this domain. In Chapter 6, we introduce and evaluate Linked Connections, in order to prove that it is indeed more cost-efficient to host.

I based my dissertation on a collection of papers that I have (co-)authored. Yet, it also brings together findings from deliverables of the projects I have been part of, invited talks I have given, blog posts I have written, and the side projects I have been side-tracked by out of curiosity. A short overview of the chapters and what they are based on is given below:

Chapter 2 – Open Data and interoperability
This chapter is based on my explanation of Linked Open Data, which I have been teaching over the course of my research position.

Chapter 3 – Measuring interoperability
This chapter is based on the first journal paper I authored for the Computer journal [2]. In the paper, I tried to quantify the interoperability of governmental datasets, in order to find out which datasets on an Open Data Portal would need more work.

Chapter 4 – Raising interoperability of governmental datasets
This chapter describes three projects which each were valorized in a publication. One project was on creating better Open Data Portals [3], another on an Open Data policy for the dtpw [4], and a third project was on creating a Linked Data strategy for local council decisions as a way to simplify administrative tasks [5].

Chapter 5 – Transport data
This chapter comes in two parts: a part about data on the road, which is based on a publication on more recent work about real-time parking availabilities [6], and a part about public transit route planning, which is based on the related work of the paper introducing Linked Connections [7].

Chapter 6 – Public Transit route planning over lightweight Linked Data interfaces
Finally, the last chapter before the conclusion is based on four papers that also nicely illustrate the thought process over time. My PhD Symposium paper written back in 2013 [8] illustrated that I wanted to be able to answer any kind of route planning question (sic) over the Web of data. In 2015 I published a demo paper [9] that explained, in a proof of concept, that I meant to give the client more freedom to calculate routes the way they like. In 2016, I extended this proof of concept with wheelchair accessibility features [10] and looked into what would be more efficient for the information system as a whole. Finally, in 2017 I published the paper evaluating the cost-efficiency of this new lightweight interface [7].

4. innovation in route planning applications

In 2016, everyone wanted an app. Not having an app would mean not being able to call yourself a digitally advanced transit agency, even if it already has a website with exactly the same functionality. Tim Berners-Lee, in 1989, concluded his proposal for the Web with the advice that we should focus on creating a better information system, that works for anything and is portable, rather than again having to work on new fancy graphic techniques and complex extra facilities that do not tackle the root of the problem. Today, his conclusion could not be more on topic.

We do not need a separate route planning app for each transit agency. If you would ask smartphone users, they only need one application that returns a route from one place to another. We can see evidence of public transit authorities that understand this need. Public transit agencies share data among each other to include their data in each of their own apps. However, instead of a solution, this now becomes a quadratic problem, in which each agency has to share their data with each other agency that they deem relevant. If we want an app that works world-wide, this approach will not scale. Moreover, we become dependent on the goodwill of the public transit agencies to implement features for specific use cases. Take for example the ability to take into account your specific set of subscriptions, the ability to take into account wheelchair accessibility, or the ability to assist you in planning your next multi-day international trip.

Organically, we see that public transit companies understand this need, and share their data with Google Maps (at the time of writing, Google Maps is the most popular route planning application). While among digital citizens this move is regarded as a long overdue step in the right direction, it is still questionable whether only Google Maps should receive this data. Giving one company the monopoly on creating a route planning experience for 100% of the population is not the solution either. Instead, we advocate for Open Data: everyone should be able to create and integrate a route planner in their own service offering.

How exactly such an open dataset should be published is the subject of this book. Datasets need to be integrated in various “views”, which all work on top of a similar route planning Application Programming Interface (api). However, we should also be able to create different route planning apis ourselves for different use cases that were not kept in mind by transit agencies when publishing the data. This entails that the raw data should be published – not only the answers to advanced questions.

5. the projects

This book has been written while working on European, Flemish, and bilaterally funded research projects. In November 2012, my first month at the MultiMedia Lab (now Internet and Data Lab), I was tasked with the further development of The DataTank. The DataTank is open-source software to open up datasets over http, while also adding the right metadata to these datasets. The first version of The DataTank was further developed thanks to a project at Westtoer, which needed a single point of reference for their tourism datasets. After that, the project for ewi helped further shape this project as simple data portal software. The DataTank was initially created at Open Knowledge Belgium and iRail (for a background, read the preface) and was installed for the open data portals of, among others, Flanders, the region of Kortrijk, Antwerp, and Ghent. Today, research on this project has ceased and commercial support is available via third parties.

The first two years, I was funded on projects in the domain of e-government, where there was a need for better data dissemination. In all the aforementioned projects, the 5 stars of Linked Open Data were used as a framework. We would, however, never reach the 5 stars and would always be confronted with a glass ceiling: why would 3 stars not be enough? In these two years, we thus mainly described data in various formats, not often with well-aligned data models, using the dcat specification, which at the time was still being built.

Apps for Europe was a different kind of project. Its goal was to support governmental organizations with their first steps towards Open Data and to organize co-creation events. The project provided me with travel budget to travel to, among others, Berlin, Manchester, Amsterdam, Paris, and Switzerland. It provided me with an understanding of when developers would start to use governmental datasets, and gave me a first insight into how and why public administrations maintain certain datasets.

In the next projects, we created our own way of evaluating open data policies by introducing a framework to study data source interoperability. This aligns with the goal to maximize reuse, and thus to provably raise the interoperability of data sources. Times are changing, and instead of e-government, Open Data would now become more eagerly funded under the umbrella of Smart Cities. The first project that was not linked to e-government was the its vis project with its.be, a public-private partnership that works on the European directive regarding Intelligent Transport Systems (its). It too set up a data portal, at data.its.be, yet also took steps in the direction of interoperable semantics by transforming the datex2 specification into a Linked Data vocabulary.

In 2015, I had the opportunity to study the organizational challenges, together with communication scientists, at the Flemish Department of Transport and Public Works (dtpw) of the Flemish government. Three European directives (psi, inspire, and its), extended with their own insights, created a clear willingness to publish data for maximum reuse. How to implement such an Open Data strategy in a large organization was still unclear. As we interviewed 27 data owners and directors, we came to a list of recommendations for next steps on all interoperability levels. This project marked the start of many more projects that would need our help with publishing data for maximum reuse, such as the local council decisions as Linked Data project, the Smart Roeselare project, the reuse assessment for the Flemish Institute for the Sea, and finally the Smart Flanders program.

Table 1: An overview of the projects I was part of from November 2012 until June 2017 at Internet and Data Lab.

Westtoer tourism (2012) – Westtoer advanced The DataTank. Tourism Open Data portal created: http://datahub.westtoer.be

An experimental data publishing platform for ewi (2012–2013) – Advanced The DataTank with a connection to the popular ckan, created the dcat-ap extension. Shaped the current features of The DataTank.

Apps for Europe (2012–2014) – Advanced The DataTank and was the first project to bring together various … Apps for Ghent is still organized each year.

Open Transport Net (2014–2016) – A project on dcat, data portals and Open Transport Data.

Flemish Innovation Study for its.be (2014–2017) – Setting up a data portal and advancing the datex2 specification towards a Linked Data ontology.

An open data vision for the Department of Mobility and Public Works (2015) – A first project to methodologically raise the impact and interoperability of datasets to be published at the Flemish government.

Local Decisions as Linked Data (2016) – Publishing local council decisions as Linked Data for administrative simplification.

The oasis team (2016–2018) – Sharing experience between Ghent and Madrid on publishing Linked Data about public services and transport data.

Smart Roeselare (2017) – Supporting the City of Roeselare with their Smart City vision to make more reuse of their data.

Raising the reuse of datasets maintained by the Flemish Institute for the Sea (2016–2017) – Towards an Open Sea Data innovation lab by making their data used and useful.

Smart Flanders (2017–2020) – Supporting the 13 Flemish center cities and Brussels to publish real-time open data.

Each project contributed in its own way to this dissertation. While some gave me access to research subjects, other projects gave me the freedom to explore research directions I thought were worth exploring further.

references

[1] Bernstein, A., Noy, N. (2014). Is This Really Science? The Semantic Webber’s Guide to Evaluating Research Contributions. Technical report.

[2] Colpaert, P., Van Compernolle, M., De Vocht, L., Dimou, A., Vander Sande, M., Verborgh, R., Mechant, P., Mannens, E. (2014, October). Quantifying the interoperability of open government datasets. Computer (pp. 50–56).

[3] Colpaert, P., Joye, S., Mechant, P., Mannens, E., Van de Walle, R. (2013). The 5 Stars of Open Data Portals. In proceedings of the 7th international conference on methodologies, technologies and tools enabling e-Government (pp. 61–67).

[4] Colpaert, P., Van Compernolle, M., Walravens, N., Mechant, P. (2017, April). Open Transport Data for maximizing reuse in multimodal route planners: a study in Flanders. IET Intelligent Transport Systems. Institution of Engineering and Technology.

[5] Buyle, R., Colpaert, P., Van Compernolle, M., Mechant, P., Volders, V., Verborgh, R., Mannens, E. (2016, October). Local Council Decisions as Linked Data: a proof of concept. In proceedings of the 15th International Semantic Web Conference.

[7] Colpaert, P., Verborgh, R., Mannens, E. (2017). Public Transit Route Planning through Lightweight Linked Data Interfaces. In proceedings of the International Conference on Web Engineering.

[8] Colpaert, P. (2014). Route planning using Linked Open Data. In proceedings of the European Semantic Web Conference (pp. 827–833).

[9] Colpaert, P., Llaves, A., Verborgh, R., Corcho, O., Mannens, E., Van de Walle, R. (2015). Intermodal Public Transit Routing using Linked Connections. In proceedings of the International Semantic Web Conference (Posters & Demos).

[10] Colpaert, P., Ballieu, S., Verborgh, R., Mannens, E. (2016). The impact of an extra feature on the scalability of Linked Connections. In proceedings of ISWC 2016.


Chapter 2
Open Data and Interoperability

“We should work toward a universal linked information system, in which generality and portability are more important than fancy graphics techniques and complex extra facilities.”

― Tim Berners-Lee (1989).


“Where is this website getting its data from?”, I asked myself while being informed that my bus would arrive in five minutes. We all have used the word data before, yet it remains difficult, even in academic circles, to define what the word precisely means. I have, in vain, been looking for the one definition that would help me study publishing data generically. I have been thinking of my own definitions which, each time, I would gradually decide to move away from, as I would always be able to give a counterexample that did not fit the definition. In this chapter, we will not introduce a standard definition for the term “data”. Instead, we will talk about the interoperability between two or more datasets from four perspectives: the syntactic perspective, the semantic perspective, the legal perspective, and the perspective of asking questions. That will introduce and motivate our view – as there is no “one way” to look at this – on data in the large-scale context of the Web.

A datum, the singular form of the word data in Latin, can be translated as “a given”. When someone or something makes, for example, the observation that a certain train will start from the Schaerbeek station, and we record this observation in this text, we have created a datum. When we have many things that are given, we talk about a dataset, or simply, data. (In English, as well as in this PhD thesis, the word data is also often used as a singular word to refer to the abstract concept of a pile of recorded observations.)

Data need a container to be stored in or transmitted with. We can store different observations in a written document, which in turn can be published as a book or can be shared online in a Web format. Data can also be stored within a database, to then be shared with others through, e.g., data dumps or query services.

1. a data format

In order to store or transmit data in a document, we need to agree on a serialization first. A simple example of such a serialization is Comma-Separated Values (csv) [1], in which the elements are separated by commas. Each line in the file contains a record and each record has a value for each column. An illustration of how a train leaving from Schaerbeek station would look in this tabular format is given in Figure 1.

"id","starts from"

"Train P8008","Schaerbeek Station"

Figure 1: Example of a csv file describing a train line that starts in the station of Schaerbeek

This is not the only way in which this data can be serialized into csv. We can imagine that different headings are used, different ways to identify “Train P8008”, or even “starts from” and “Train P8008” switching places. Each serialization only specifies a certain syntax, which tells how data elements are separated from each other. It is up to specifications on a higher abstraction layer to define how these different elements relate to each other, based on the building blocks provided by the serialization format. The same holds true for other serializations, such as the hierarchical JavaScript Object Notation (json) or Extensible Markup Language (xml) formats.
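As an aside, a small sketch can make this concrete: assuming the two lines of Figure 1 are saved as a hypothetical file trains.csv, plain Python can read the tabular serialization and re-serialize the very same facts hierarchically as json; only the container changes, not the data.

```python
# A minimal sketch: the same facts, moved from a tabular serialization (csv)
# to a hierarchical one (json). The file name trains.csv is hypothetical.
import csv
import json

with open("trains.csv", newline="", encoding="utf-8") as handle:
    records = list(csv.DictReader(handle))
# records == [{"id": "Train P8008", "starts from": "Schaerbeek Station"}]

print(json.dumps(records, indent=2))
# [
#   {
#     "id": "Train P8008",
#     "starts from": "Schaerbeek Station"
#   }
# ]
```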

As people decide how datasets are shaped, human language is used to express facts within these serializations. Noam Chomsky, who laid the foundations of generative grammar, built models to study language as if it were a mathematical problem. In Chomskian generative grammar, the smallest building block to express a fact one can think of is a triple. Such a triple contains a subject (such as “Train P8008”), a predicate (such as “starts from”), and an object (such as “Schaerbeek Station”). Within a triple, a subject has a certain relation to an object, and this relation is described by the predicate. In Figure 2, we illustrate how our csv example of Figure 1 would look in a triple structure.


subject: Train P8008
predicate: starts from
object: Schaerbeek Station

Figure 2: The example in Figure 1 encoded and illustrated as a triple

This triple structure – rather than a tabular or hierarchical data model – helps studying data in its most essential form. It allows us to extend the theory we build for one datum or triple to more data. By re-using the same elements in triples, we are able to link and weave a graph of connected statements. Different dedicated serializations for triples exist, such as Turtle [2] and N-Triples [3], yet also specifications exist to encode triples within serializations like json, csv, or xml. (json, csv, and xml are, at the time of writing, popular formats on the Web that can be interpreted by any modern programming language. No knowledge of these serializations is required in the remainder of this book.)

2. documenting meaning

Let us perform a thought experiment and imagine three triples published by three different authorities. One machine publishes the triple in Figure 2, while two others publish the triples illustrated in Figures 3 and 4 – any serialization can be used – representing the facts that the train P8008 starts from Schaerbeek Station, Schaerbeek Station is located in Schaerbeek City, and the Belgian singer Jacques Brel was born in this city.

subject: Schaerbeek Station
predicate: located in
object: Schaerbeek City

Figure 3: On a second machine, the fact that Schaerbeek Station is located in the city of Schaerbeek is published.


When a user agent – software that acts on behalf of a user – visits these three machines, it can now answer more questions than each of the machines would be able to answer on their own, such as: “What trains leave from the city in which Jacques Brel was born?”. A problem occurs, however. How does this user agent know whether “Schaerbeek City” and “Schaerbeek” are the same entity?

subject: Jacques Brel
predicate: born in
object: Schaerbeek

Figure 4: On a third machine, the fact that Jacques Brel, the famous Belgian singer, was born in Schaerbeek is published.

Instead of using words to identify things, numeric identifiers are commonly used. This way, every organization can have its own context in which entities are described and discussed. E.g., the station of Schaerbeek could be given the identifier 132, while the city of Schaerbeek could be given the identifier 121. Yet for an outsider, it becomes unclear what the meaning of 121 and 132 is, as it is unclear where their semantics are documented, if documented at all. (Semantics, in this context, refers to how technology can assist in comparing the meaning of two entities.)

Linked Data solves this problem by using Web identifiers, or http Uniform Resource Identifiers (uris) [5]. It is a method to distribute and scale semantics over large organizations such as the Web. When looking up this identifier – by using the http protocol or a Web browser – a definition must be returned, including links towards other potentially interesting resources. Resources can be anything, including documents, people, physical objects, and abstract concepts [4]. Within the Resource Description Framework (rdf), they can be identified using a uri, or represented by a literal value (such as a date or a string of characters). The triple format to be used in combination with uris is standardized within rdf [4]. In Figure 5, we illustrate how these three triples would look in rdf.



<http://phd.pietercolpaert.be/trains#P8008> <http://phd.pietercolpaert.be/terms#startsFrom> <http://irail.be/stations/NMBS/008811007> .
<http://irail.be/stations/NMBS/008811007> <http://dbpedia.org/ontology/location> <http://dbpedia.org/resource/Schaerbeek> .
<http://www.wikidata.org/entity/Q1666> <http://www.wikidata.org/entity/P19> <http://dbpedia.org/resource/Schaerbeek> .

Figure 5: The three triples are given a global identifier and are added using rdf's simple N-Triples serialization.

The uris used for these triples already existed in other data sources, and we thus favoured using the same identifiers. (One can use Linked Open Vocabularies to discover reusable uris [6].) It is up to a data publisher to make a choice on which data sources can provide the identifiers for a certain type of entities. In this example, we found Wikidata to be a good source to define the city of Schaerbeek and to define Jacques Brel. We however prefer iRail as a source for the stations in Belgium. As we currently did not find any existing identifiers for the train route P8008, we created our own local identifier, and used the domain name of this dissertation as a base for extending the knowledge on the Web.
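To make the thought experiment concrete, the sketch below (assuming the Python rdflib library) merges the three triples of Figure 5 into one local graph and evaluates the combined question with a SPARQL query; the query itself is only an illustration, not part of any standardised interface discussed here.

```python
# A minimal sketch: merge the three triples of Figure 5 client-side and ask
# the question that none of the three machines can answer on its own.
from rdflib import Graph

data = """
<http://phd.pietercolpaert.be/trains#P8008> <http://phd.pietercolpaert.be/terms#startsFrom> <http://irail.be/stations/NMBS/008811007> .
<http://irail.be/stations/NMBS/008811007> <http://dbpedia.org/ontology/location> <http://dbpedia.org/resource/Schaerbeek> .
<http://www.wikidata.org/entity/Q1666> <http://www.wikidata.org/entity/P19> <http://dbpedia.org/resource/Schaerbeek> .
"""

graph = Graph()
graph.parse(data=data, format="nt")  # the three sources, merged locally

# "What trains leave from the city in which Jacques Brel was born?"
query = """
SELECT ?train WHERE {
    ?train <http://phd.pietercolpaert.be/terms#startsFrom> ?station .
    ?station <http://dbpedia.org/ontology/location> ?city .
    <http://www.wikidata.org/entity/Q1666> <http://www.wikidata.org/entity/P19> ?city .
}
"""
for row in graph.query(query):
    print(row.train)  # -> http://phd.pietercolpaert.be/trains#P8008
```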

3. intellectual property rights and open data

(As Intellectual Property Rights (ipr) legislation diverges across the world, we only checked the correctness of this chapter with European copyright legislation [7] in mind.)

When a document is published on the Web, all rights are reserved by default until 70 years after the death of the last author. When these documents are reused, modified, and/or shared, the consent of the copyright holder is needed. This consent can be given through a written statement, but can also be given to everyone at once through a public license. In order to mark your own work for reuse, licenses exist, such as the Creative Commons licenses, that can be reused without having to invent the same legal texts over and over again.

Copyright is only applicable to the container that is used for exchanging the data. To the abstract concept of facts or data, copyright legislation does not apply. The European directive on sui generis database law [8] specifies, however, that databases can be partially protected, if the owner can show that there has been “qualitatively and/or quantitatively a substantial investment in either the obtaining, verification, or presentation of the content of the database” [9]. It allows a database owner to protect its database from (partial) replication by third parties. So, while there is no copyright applicable to data itself, database rights may still be in place to protect a data source. In 2014, the Creative Commons licenses were extended [10] to also contain legal text on the sui generis database law, and have since then also worked for datasets.

Data can only be called Open Data “when anyone is able to freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness)”. (More information on the definition of Open Data maintained by Open Knowledge International is available at opendefinition.org [11].) Some data are by definition open, as there are no Intellectual Property Rights (ipr) applicable. When there is some kind of ipr on the data, an open license is required. This license must allow the right to reuse, modify, and share without exception. From the moment there are custom restrictions (other than preserving provenance or openness), it cannot be called “open”.

While the examples given here may sound straightforward, these two ipr frameworks are the source of much uncertainty. Take for example the case of the Diary of Anne Frank [12], for which it is unclear who last wrote the book. While some argue it is in the public domain, the organization now holding the copyright states that her father made editorial changes, and the father of Anne Frank died much later. For the reason of avoiding complexity when reusing documents – and not only for this reason – it is desirable that the authoritative source can verify the document's provenance or authors at all times, and that a license or waiver is included in the dataset's metadata.

If this book were processed for the data facts that are stored within it, what happens to copyright? Rulings in court help us to understand how this should be interpreted. A case in the online newspaper sector, Public Relations Consultants Association Ltd vs. The Newspaper Licensing Agency Ltd [13] in the UK in 2014, interpreted the 5th article of the European directive on copyright in the way that a copy made for the purpose of text and data mining is incidental, and no consent is required for this type of copies.

Next, considering the database rights, it is unclear what a substantial investment is regarding the data contained within these documents. One of the most prominent rulings for the area of Open (Transport) Data was that in British Horseracing Board Ltd and Others vs. William Hill Organization Ltd [14], which stated that the results of horse races collected by a third party were not infringing the database rights of the horse race organizer. The horse race organizer does not invest in the database, as the results are a natural consequence of holding horse races. In the same way, we argue that the railway schedules of a public transport agency are not protected by database law either, as a public transport agency does not have to invest in maintaining this dataset. These interpretations of copyright and sui generis database rights are also confirmed by a study of the EU Commission on intellectual property rights for text mining and data analysis [9].

4. sharing data: a challenge of raising interoperability

A dataset is created in order for it to be shared at some point in time. If it is not shared with other people, it will need to be shared with other systems (such as an archive) or shared within your own organization. Take for example a dataset that is created within a certain governmental service, for the specific use case of solving questions from members of parliament. (This is a use case we came across when visiting the Flemish Department of Transport and Public Works (dtpw), discussed in Chapter 4.2.) While at first the database's design may not reflect a large number of use cases, the dataset is not just removed after answering this question. Instead, it will be kept on a hard drive at first, in order to be more efficient when a follow-up question is asked. The dataset may also be relevant for answering different questions, and thus a project is created to share this kind of documents proactively [15].

Imagine you are a database maintainer for this project, and someone asks you to share the list of the latest additions. You set up a meeting and try to agree on terms regarding the legal conditions, you agree on how the data will be sent to the third party, discuss which syntax to use, the semantics of the terms that are going to be shared, and which questions should be answered over the dataset. The protocol that is created can be documented, and whenever a new question needs to be answered, the existing protocol can be reused, or, when it no longer covers all the needs, it has to be renegotiated. When more people want to reuse this data, it quickly becomes untenable to keep having meetings with all reusers. Also vice versa: when a reuser wants to reuse as much data as possible, it becomes untenable to have meetings with all data publishers.

In the previous chapter, we discussed datasets for which the goal was to maximize reuse, which entails maximizing both the number of reusers and the number of questions each reuser can solve. In order to motivate data consumers to start reusing a certain dataset, some publishers rely on the intrinsic motivation of citizens [16], yet when performing a cost-benefit analysis, the cost of manually fixing the heterogeneity of datasets is still too high compared to the benefits for the company itself [17]. In this PhD, we explore the possibilities to lower this cost for adoption of a certain dataset by raising the data source interoperability [18], which we define as how easy it is to bring two, or more, datasets together.


4.1. Legally

The first level is the legal level: data consumers must be allowed to query two datasets together. When, for a certain dataset, a specific one-on-one contract needs to be signed before it can be used, the cost for adoption for data consumers becomes too high [17]. When two datasets are made available as Open Data and have an Open License attached to them, the interoperability problems will be lower. Even for datasets that are possibly in the public domain, the Open Data movement advocates for a clear disclaimer on each dataset with the publisher's interpretation.

4.2. Technically

The second level is the technical level, which entails how easy it is to bring two datasets physically together. Thanks to the Internet, we can acknowledge this is possible today, yet the protocols to exchange data diverge, from e-mail, to the File Transfer Protocol (ftp), to the HyperText Transfer Protocol (http).

http [19], the protocol that powers the Web, allows actions to be performed on a given resource, called request methods or http verbs. The protocol defines for every method whether it is a safe method, defined as a method that does not change anything on the server. (It is up to the application's developers to implement these methods. It is not uncommon that the “safe” property is not respected, resulting in undesired consequences, for example in Mendeley back in 2012 [20].)

An example of a safe method is get. By executing a get request, you can download the representation of a certain resource. The same representation can be requested over and over again with the same result, until the resource changes its state – e.g., when a train is delayed (a change in the real world), or when someone adds a comment to an article using a post request. The result of a get request can be cached within clients, in servers, or in intermediate caches.

post requests are not safe. Each time a post request is executed, a change or side-effect may happen on the server. It is thus not cacheable, as each new request must be able to trigger a new change or must be able to result in a different response.
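As an illustration, the following sketch (assuming the Python requests library; the iRail stations document mentioned later in this chapter is used as an example resource, and whether that server actually emits these caching headers depends on its configuration) shows how a safe get request can be repeated cheaply through validators such as an ETag.

```python
# A minimal sketch of why safe requests matter: a get can be validated and
# cached, so repeating the same question does not require resending the body.
import requests

url = "https://api.irail.be/stations/"  # example resource; headers may vary

response = requests.get(url, headers={"Accept": "application/json"})
print(response.status_code)                   # e.g. 200 when successful
print(response.headers.get("Cache-Control"))  # how long the response may be cached
etag = response.headers.get("ETag")           # validator for conditional requests

# A conditional re-request: if the resource did not change its state,
# the server can answer "304 Not Modified" without resending the body.
if etag:
    again = requests.get(url, headers={"Accept": "application/json",
                                       "If-None-Match": etag})
    print(again.status_code)  # 304 if unchanged, 200 otherwise
```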

The protocol also has more than a handful of other methods, has headers to indicate specifics such as which encoding and format to use and how long the response can be cached, and has response codes to indicate whether the request was successful. In this PhD, it is not up for debate whether or not to use http: the scale of adoption it has reached at the time of writing makes it the natural choice for the uniform interface. The rationale followed in this dissertation would also remain valid with other underlying protocols. In 2015, the http/2 protocol [21] was drafted and is today seeking adoption. It is fully backwards compatible with http/1.1, the currently widely adopted specification.

4.3. Syntactically and semantically

The third kind of interoperability describes whether the serializations of both datasets can be read by a user agent. When deserialized, the data and its identifiers can be accessed in memory. The meaning behind these identifiers can conflict when the same identifier is used for something different. When they do not conflict, there may still be multiple identifiers for the same objects. Both situations lower the semantic interoperability.

5. interoperability layers

The term interoperability has been defined in several research papers, both qualitative and quantitative. What the authors of these papers have in common is that they propose to structure interoperability problems in different categories. For example, back in 2000, the imi model [22] was introduced in order to discuss the exchange of object-oriented data between software systems. The imi model has only three levels: the syntax layer, the object layer, and the semantic layer. The syntax layer is responsible for “dumbing down” object-oriented information into document instances and byte streams. The object layer's goal is to offer applications an object-oriented view on the information that they operate upon. This layer defines how hierarchical elements are structured. Finally, the semantic layer is the layer that makes sure the same meaning is used for the same identifiers. The authors argue that each of these three layers should have their own technology to have a fully interoperable service. Today we indeed see that the xml syntax has an object specification called rdf/xml, which standardizes how rdf triples can be contained within such a document.

Interoperability problems were also described as problems that occur while integrating data. The goal of an integration process is to provide users with a unified view on one machine. Four types of heterogeneity are discussed: implementation heterogeneity occurs when different data sources run on different hardware; structural heterogeneity occurs when data sources have different data models, in the same way as the object level in the imi model; syntax heterogeneity occurs when data sources have different languages and data representations; finally, semantic heterogeneity occurs when “the conceptualisation of the different data sources is influenced by the designers' view of the concepts and the context to be modelled”.

Legal, organizational, political, semantic, and technical are then again the five levels into which Europe categorizes datasets on the European Union's data portal, in order to indicate for what interoperability level a dataset could be used. These levels are intended to discuss how data and knowledge are spread on a policy level within big organizations, such as Europe as a whole.

Finally, in a review on interoperability frameworks, four categories of interoperability are again identified: technical, syntactic, semantic, and organizational. The first three are the same as in this dissertation, yet organizational interoperability focuses on high-level problems such as cultural differences and alignment with organizational processes. Within data source interoperability, we consider the effects of organizational heterogeneity to be reflected in the semantic interoperability.


6. the 5 stars of linked data

In order to persuade data managers to publish their data as “Linked Open Data”, Tim Berners-Lee introduced a 5 star rating for data in 2009, cf. Figure 6. The first star advocates to make the data available on the Web in whatever format, similar to our technical interoperability layer. (This idea is also put forward by the World Wide Web Consortium (w3c) in its best practices guide for data on the Web [23].) Furthermore, for open data, it also advocates for an Open License, similar to the legal interoperability layer. The second star requires that the dataset is machine readable. This way, the data that needs to be reused can be copy-pasted into different tools, allowing for the data to be studied in different ways more easily. The third star advocates for an open format, similar to the syntactic interoperability, making sure anyone can read the data without having to install specific software. The fourth star advocates for the use of uris as a way to make your terms and identifiers work in a distributed setting, thus allowing a discussion on semantics using the rdf technology. Finally, the fifth star advocates for reusing existing uri vocabularies, and for linking your data to other datasets on the Web. Only by doing the latter will the Web of data be woven together. The 5 star system to advocate for Linked Open Data has gained much traction and cannot be missing from any introduction to Linked (Open) Data or to maintaining datasets at Web scale. We are however cautious using the 5 stars in our own work, as it could give the impression that a 100% interoperable 5 star dataset exists and that no further investments would be needed at some point to make it better. Realists who rightfully believe a perfect dataset does not exist may wonder why going for 5 star data is needed… Are 3 stars not good enough? When presenting the roadmap as interoperability layers, however, maintaining a dataset is a never-ending effort, in which each interoperability level can be improved. Raising interoperability is not one-sided: the goal is to be as interoperable as possible with an information system. When the information management system, and the datasets in it, put more effort into interoperability, your dataset can also be made more interoperable over time.


Figure 6: The 5 star scheme towards Linked Data, as used by Tim Berners-Lee to advocate for better data exchange over the Web.

7. information management: a proposal

Let's create a system to distribute data within our own organization for the many years to come. Our requirements would be that we want our data policy to scale: when more datasets are added, when more difficult questions need to be answered, when more questions are asked, or when more user agents come into action, we want our system to work in the same way. The system should also be able to evolve while maintaining backwards compatibility, as our organization is going to change over time. Datasets that are published today should still work when accessed once new personnel is in place. Such a system should also have a low entry barrier, as it needs to be adopted by both developers of user agents and data publishers.

Tim Berners-Lee created his proposal for information management at large scale within cern in 1989 [24]. What we now call “the Web” is a knowledge base with all of mankind's data, which still uses the same building blocks as at the time of Tim Berners-Lee's first experiments. It was Roy Fielding who, in 2000 – 11 years after the initial proposal for the Web – derived a set of constraints [26] from what was already created. Defined while standardizing http/1.1, this set of “how to build large knowledge bases” constraints is known today as Representational State Transfer (rest). (For an overview of rest, we refer to the second chapter of the PhD dissertation of Ruben Verborgh, a review of rest after 15 years [25], the original dissertation of Roy Fielding [26], or Fielding's reflections on rest in 2017 [27].) As with any architectural style, developers can choose to follow these constraints, or to ignore them. When following these constraints, rest promises beneficial properties for your system, such as good network efficiency, better scalability, higher reliability, better user-perceived performance, and more simplicity.

Clients and servers implement the http protocol so that their communication is technically interoperable. Just like Linked Data insists on using http identifiers, rest's uniform interface constraint requires that every individual information resource on the Web is accessed through a single identifier – a uri – regardless of the concrete format it is represented in. Through a process called content negotiation, a client and a server agree on the best representation. For example, when a resource “station of Schaerbeek” is identified by the uri http://irail.be/stations/NMBS/008811007 and a Web browser sends an http request with this uri, the server typically sends an html representation of this resource. In contrast, an automated route planning user agent will usually ask and receive a json representation of the same resource using the same uri. This makes the identifier http://irail.be/stations/NMBS/008811007 semantically interoperable, since clients consuming different formats can still refer to the same identifier. This identifier is also sustainable (i.e., semantically interoperable over time), because new representation formats can be supported in the future without a change of protocol or identifier [25].
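A small sketch of content negotiation on this station uri, assuming the Python requests library; which representations the server actually offers may differ from the Accept values shown here.

```python
# A minimal sketch: two clients use the exact same identifier and only
# negotiate a different representation of it.
import requests

uri = "http://irail.be/stations/NMBS/008811007"

# A browser-like client asking for html
html_response = requests.get(uri, headers={"Accept": "text/html"})

# An automated route planning agent asking for a machine-readable format
json_response = requests.get(uri, headers={"Accept": "application/json"})

print(html_response.headers.get("Content-Type"))
print(json_response.headers.get("Content-Type"))
# The identifier stayed the same; only the representation differs.
```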

In order to navigate from one representation to another, controls are given within each representation. Fielding called this Hypermedia As The Engine Of Application State (hateoas): once a user agent has received a start url, it is able to answer its end-user's questions by using the controls provided each step of the way.


8. intelligent agents

Now, we can create a user agent that provides its end-users with the nearest railway station. A user story would look like this: when you push a button, you should see the nearest station relative to your current location. In a Service Oriented Architecture (soa), or how we would naturally design such an interaction in small-scale set-ups, we expose a functionality on the server, which requires the application to send its current location to the server. A url of such a request would look like this: http://{my-service}/nearestStation?longitude=3.14159&latitude=51.315. The server then responds with a concise and precise answer. This minimizes the data that has to be exchanged when only one question is asked, as only one station needs to be transferred. Does this advantage outweigh the disadvantages?

The number of information resources – or documents – that you potentially have to generate on the server is over a trillion. (Assuming that a precision of 11 m, or 4 decimal places in both longitude and latitude, is enough, we would still have 6.48 × 10¹² urls exposed for a simple feature.) As it is unlikely that two people – wanting to know the nearest railway station – are at exactly the same location, each http request has to be sent to the server for evaluation. Rightfully, soa practitioners introduce rate limiting for this kind of requests to keep the number of requests low. An interesting business model is to sell people who need more requests a higher rate limit. Yet, did we not want to maximize the reuse of our data, instead of limiting the number of requests possible?

8.1. Caching for scalability

As there are only 646 stations served by the Belgian railway company, describing this amount of stations easily fits into one information resource identified by one url (for instance, https://api.irail.be/stations). When the server does not expose the functionality to filter the stations on the basis of geolocation, all user agents that want to solve any question based on the location of stations have to fetch the same resource. This puts the server at ease, as it can prepare the right document once each time the stations' list is updated. Despite the fact that now all 646 stations have to be transferred to the user agent, and thus significantly more bandwidth is consumed, the user agent can benefit as well. For example, when soon afterwards a similar question is evaluated, the dataset will already be present in the client's cache, and no data at all will need to be transferred. (In Computer Science, this is also called the principle of locality.) This raises the user-perceived performance of a user interface. When the number of end-users increases by a factor of a thousand per second – not uncommon on the Web – it becomes easier for the server to keep delivering the same file to those user agents that do not have it in cache already. When it is not in the cache of the user agent itself, it might already be in an intermediate cache on the Web, or in the server's cache, so the server does not have to invest cpu time per user. Caching, one of the rest constraints, thus has the potential to eliminate some network interactions and server load. When exploited, better network efficiency, scalability, and user-perceived performance can be achieved. (We empirically study the effect of caching on route planning interfaces in Chapter 7.)
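A sketch of the client-side alternative follows: the full stations document is downloaded once (and can be cached), after which the nearest station is computed locally. The response structure and field names assumed below may differ from what the iRail api actually returns.

```python
# A minimal sketch: one cacheable document for everyone, nearest-station
# filtering on the client. The format parameter and the response shape
# (a "station" list with locationX/locationY/name fields) are assumptions.
import math
import requests

def distance_km(lon1, lat1, lon2, lat2):
    """Haversine distance between two WGS84 coordinates, in kilometres."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

response = requests.get("https://api.irail.be/stations/",
                        params={"format": "json"})  # assumed parameter
stations = response.json().get("station", [])       # assumed field name

here = (3.14159, 51.315)  # longitude, latitude from the user story above
nearest = min(stations,
              key=lambda s: distance_km(here[0], here[1],
                                        float(s["locationX"]),   # assumed field
                                        float(s["locationY"])))  # assumed field
print(nearest["name"])
```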

8.2. The all-knowing server and the user agent

In a closed environment – for instance, when you are creating a back-end for a specific app, you assume the information lives in a Closed World – a server is assumed to be all-knowing. When asking for the nearest station, the server should know all stations in existence and return the closest one. Yet, if I were to live near the border with France, a server that assumes a Belgian context will not be able to give me a correct answer to this question. Instead, I would have to find a different server that also provides an answer to this similar question. Furthermore, if I would love to find the nearest wheelchair-accessible station, no answer can be returned, as the server does not expose this kind of functionality. The server keeps user agents “dumb”.

On the Web, we must take into account an Open World Assumption: one organization can only publish the stations it knows about, or a list of stations it uses for its own use case. (Not to mention the complexities that would arise when datasets are not interoperable – for instance, if not everyone had the same definition of “a station”.) An implication of this [28] is that it becomes impossible for user agents to get the total number of stations: following links, they can keep crawling the Web indefinitely to find out whether someone knows something about a station that was not mentioned before. However, a user agent can be intelligent enough to decide, within its own context, whether or not the answers it has received so far are sufficient. For instance, when creating a public transport app in Belgium, the app can be given a complete list of transport agencies in Belgium, according to its developer. (When a user agent is given a start url, it should be able to follow links to discover other information resources.) When this user agent discovers and retrieves a list of stations published by all transport agencies, it can assume its list will be sufficient for its use case.
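The following sketch illustrates such a link-following agent under the Open World Assumption: it starts from one url, follows the links it discovers, and stops as soon as its own, context-dependent goal is satisfied. The start url and the link field are hypothetical.

```python
# A minimal sketch of a "follow your nose" agent: real documents would expose
# links as hypermedia controls (for example in json-ld or html).
import requests

def crawl(start_url, is_sufficient, max_documents=50):
    """Breadth-first discovery of linked information resources."""
    seen, frontier, gathered = set(), [start_url], []
    while frontier and len(seen) < max_documents:
        url = frontier.pop(0)
        if url in seen:
            continue
        seen.add(url)
        document = requests.get(url, headers={"Accept": "application/json"}).json()
        gathered.append(document)
        # The agent stops when its own goal is reached; under the Open World
        # Assumption it can never know it has seen everything.
        if is_sufficient(gathered):
            break
        frontier.extend(document.get("links", []))  # assumed link field
    return gathered

# Example policy (hypothetical start url): stop once documents from all
# agencies known to the developer have been retrieved.
# documents = crawl("https://example.org/transport-agencies",
#                   lambda docs: len(docs) >= 5)
```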

8.3. Queryability

For scalability and user-perceived performance, we argued that publishing information resources with less server-side functionality is a better way of publishing Open Data. We also argued that in the case of solving all questions on the server, the server pretends to be all-knowing, when in fact it is not: it just has a closed-world approach. The user agent has to adopt the specific context of the server, and is not able to answer questions over multiple servers at once. A user agent should be able to add its own knowledge to a problem, coming from other data sources on the Web or from the local context it currently has.

The fact that two datasets are legally, technically, syntactically, and semantically interoperable does not mean that a user agent can answer questions over these two datasets without any problem. A query answering algorithm also needs to be able to access the right parts of the data at a certain moment. First, we can make the server expose over a trillion information resources of our data, by answering all specific questions on the server. However, as previously discussed, the questions that can be answered then depend on the server infrastructure and the server context. Combining different services like this becomes increasingly difficult. This idea is taken from soa, and is not a great match for maximizing the reuse of open datasets.

Another option is that the data publisher publishes one data dump of all data within an organization. While this is preferred over a service when a diverse range of queries is needed, a data dump has clear drawbacks too. For instance, when the dataset updates, the entire dataset needs to be downloaded and reprocessed by all user agents again. Furthermore, for a user agent that only wants to know the nearest station, the overhead of downloading and interpreting the entire file becomes too big. The chance that data is downloaded that will never be used by the user agent becomes bigger too.

data dump (high client cost, high availability, high bandwidth) ↔ query interface (high server cost, low availability, low bandwidth)

Figure 7: The Linked Data Fragments (ldf) idea plots the expressiveness of a server interface on an axis. On the far left, an organization can decide to provide no expressiveness at all, by publishing one data dump. On the far right, the server may decide to answer any question for all user agents.

Instead of choosing between data dumps and query services, we propose to explore options in-between. Only one very simple question, asking for all data, can be answered by the server itself. It is up to the user agent to solve more specific questions locally. When the dataset is split into two fragments, e.g., all stations in the north of the country and all stations in the south of the country, the user agent can, depending on the question that needs to be solved, now choose to only download one part of the dataset. When publishing data in even smaller fragments, interlinked with each other, we help user agents to solve questions in a more timely fashion. The more fragments are published by the server, the more expressive we call a server interface. With this idea in mind, Linked Data Fragments (ldf) [29], illustrated in Figure 7, were introduced. Publishing data is making trade-offs: based on a publisher's budget, a more or less expressive interface can be installed.
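As a sketch of what such an in-between interface means for a user agent, assume a hypothetical dataset split into a north and a south fragment; the urls, the split criterion, and the field names below are illustrative only.

```python
# A minimal sketch of a client consuming a fragmented dataset: the server only
# publishes two static, cacheable documents, and the more specific selection
# happens on the client.
import requests

FRAGMENTS = {
    "north": "https://example.org/stations/north",  # hypothetical fragment urls
    "south": "https://example.org/stations/south",
}

def stations_near(latitude):
    """Pick the fragment that can contain the answer, then filter locally."""
    fragment = "north" if latitude >= 50.85 else "south"   # assumed split
    document = requests.get(FRAGMENTS[fragment],
                            headers={"Accept": "application/json"}).json()
    return document.get("stations", [])                    # assumed field
```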


Just like a database is prepared to answer certain types of questions, a hypermedia interface can also be modeled for the questions it needs to answer. By providing extra controls and by providing more fragments of the data to be retrieved, the queryability of the interface rises for particular problems.

9. conclusion

The goal of an Open Data policy is to share data with anyone for any purpose. The intention is to maximize the reuse of a certain data source. When a data source needs to attract a wide variety of use cases, a cost-benefit analysis can be made: when the cost for adoption is lower than the benefits of reusing this data, the data is going to be adopted by third parties. This cost for adoption can be lowered by making data sources more interoperable.

Figure 8: The layers of data source interoperability for allowing user agents to query your data.

In order to make it more feasible for developers to make their app work globally instead of only for one system, we introduced the term data source interoperability. We define interoperability as how easy it is to evaluate questions over two data sources. The term can then be generalized by comparing your data source with all other data sources within your organization, or even more generally, all other data sources on the Web. We discussed – and will further study – interoperability on five levels, as illustrated in Figure 8:


1. Are we legally allowed to merge two datasets?
2. Are there technical difficulties to physically bring the datasets together?
3. Can both syntaxes be read by the user agent?
4. Do both datasets use similar semantics for the same identifiers and domain models?
5. How difficult is it to query both datasets at once?

(We only consider data here that has to be maximally disseminated.) On the legal level, public open data licenses help to get datasets adopted. The fact that the content of a license complies with the Open Definition does not mean that the cost for adoption is minimized: licenses that are custom made may not be as easy to use, as it needs to be checked whether they indeed comply. On the technical level, we are still working on better infrastructure with http/2. On the syntactic level, we are working on efficient ways to serialize data facts as triples [30], such as the on-going standardization work on JSON-LD, and on-going research and development on, for example, hdt, csv on the Web, or rdf-thrift. On the semantic level, we are looking for new vocabularies to describe data, avoiding conflicts in identifiers by using uris, and making sure our data terms are documented. And finally, when querying data, we are still researching how we can crawl the entire Web under an Open World assumption [28], or how to query the Web using more structured Linked Data Fragments interfaces.

In this chapter we did not mention data quality. One definition describes data quality as the quality perceived by an end-user who wants to use the data for a certain use case. E.g., “the data quality is not good, as it does not provide me with the necessary precision or timeliness”. However, for other use cases the data may be perfectly suitable. Another definition describes data quality as how closely the data corresponds to the real world. Furthermore, is it really a core task of the government to raise the quality of a dataset beyond the prime reason why the data was created in the first place? When talking about Open Data, the goal is to maximize data adoption by third parties. Even bad quality data – whatever that may be – might also be interesting to publish.

Interoperability is a challenge that is hard to advance on your own: other datasets also need to become interoperable with yours. It is an organizational problem that slowly finds its way into policy making. In the next chapter, we discuss how we can advance interoperability within large organizations.

references

[1] Shafranovich, Y. (2005, October). Common Format and mime Type for csv Files. ietf.

[2] Beckett, D., Berners-Lee, T., Prud'Hommeaux, E., Carothers, G. (2014, February). rdf 1.1 Turtle – Terse rdf Triple Language. w3c.

[3] Carothers, G., Seaborne, A. (2014, February). rdf 1.1 N-Triples – A line-based syntax for an rdf graph. w3c.

[4] Schreiber, G., Raimond, Y. (2014, February). rdf 1.1 Primer. w3c.

[5] Berners-Lee, T., Fielding, R., Masinter, L. (2005, January). Uniform Resource Identifier (uri): Generic Syntax. ietf.

[6] Vandenbussche, P.-Y. (2017). Linked Open Vocabularies (lov): a gateway to reusable semantic vocabularies on the Web. Semantic Web 8.3 (pp. 437–452).

[7] European Parliament (2001). Directive 2001/29/EC of the European Parliament and of the Council of 22 May 2001 on the harmonisation of certain aspects of copyright and related rights in the information society. eur-lex.

[8] European Parliament (1996). Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases. eur-lex.

[9] Triaille, J.-P., de Meeûs d'Argenteuil, J., de Francquen, A. (2014, March). Study on the legal framework of text and data mining.

[10] Creative Commons (2013, December). Creative Commons 4.0/Sui generis database rights draft.

[11] Open Knowledge International (2004). The Open Definition.

[12] Moody, G. (2016, April). Copyright chaos: Why isn't Anne Frank's diary free now? Ars Technica.

[13] Judgment of the Court (Fourth Chamber) (2014, June). Public Relations Consultants Association Ltd v Newspaper Licensing Agency Ltd and Others. eur-lex.

[14] Judgment of the Court (Grand Chamber) (2004, November). The British Horseracing Board Ltd and Others v William Hill Organization Ltd. eur-lex.

[15] Research Service of the Flemish Government (). Overview of datasets of the Flemish Regional Indicators.

[16] Baccarne, B., Mechant, P., Schuurman, D., Colpaert, P., De Marez, L. (2014). Urban socio-technical innovations with and by citizens. Interdisciplinary Studies Journal (pp. 143–156).

[17] Walravens, N., Van Compernolle, M., Colpaert, P., Mechant, P., Ballon, P., Mannens, E. (2016). 'Open Government Data'-based Business Models: a market consultation on the relationship with government in the case of mobility and route-planning applications. In proceedings of 13th International Joint Conference on e-Business and Telecommunications (pp. 64–71).

[19] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., Berners-Lee, T. (1999, June). Hypertext Transfer Protocol – HTTP/1.1. ietf.

[20] Verborgh, R. (2012, July). get doesn't change the world.

[21] Belshe, M., Peon, R., Thomson, M. (2015, May). Hypertext Transfer Protocol Version 2 (http/2). ietf.

[22] Melnik, S., Decker, S. (2000, September). A Layered Approach to Information Modeling and Interoperability on the Web. In proceedings of the ECDL'00 Workshop on the Semantic Web.

[23] Farias Lóscio, B., Burle, C., Calegari, N. (2016, August). Data on the Web Best Practices.

[24] Berners-Lee, T. (1989, March). Information Management: A Proposal.

[25] Verborgh, R., van Hooland, S., Cope, A.S., Chan, S., Mannens, E., Van de Walle, R. (2015). The Fallacy of the Multi-API Culture: Conceptual and Practical Benefits of Representational State Transfer (rest). Journal of Documentation (pp. 233–252).

[26] Fielding, R. (2000). Architectural Styles and the Design of Network-based Software Architectures. University of California, Irvine.

[27] Fielding, R. T., Taylor, R. N., Erenkrantz, J., Gorlick, M. M., Whitehead, E. J., Khare, R., Oreizy, P. (2017). Reflections on the REST Architectural Style and "Principled Design of the Modern Web Architecture". Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (pp. 4–11).

[28] Hartig, O., Bizer, C., Freytag, J.-C. (2009). Executing sparql queries over the Web of Linked Data. In proceedings of International Semantic Web Conference. Springer Berlin Heidelberg.

[29] Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., Haesendonck, G., Colpaert, P. (2016, March). Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web. Journal of Web Semantics.


Chapter 3
Measuring Interoperability

“When you take apart a ship plank by plank and assemble it again, is it still the same ship?”

― Theseus's paradox.


In order to maximize potential reuse, an open data publisher wants its data source to be as interoperable – on all levels – as possible with other datasets world-wide. Currently, there is no way to identify relevant datasets to be interoperable with and there is no way to measure the interoperability itself. In this chapter we discuss the possibility of comparing identifiers used within various datasets as a way to measure semantic interoperability. We introduce three metrics to express the interoperability between two datasets: the identifier interoperability, the relevance, and the number of conflicts. The metrics are calculated from a list of statements which indicate, for each pair of identifiers in the system, whether they identify the same concept or not. The effort to collect these statements is high, yet not only are relevant datasets identified, machine-readable feedback is also provided to the data maintainer. We will therefore also look at qualitative methods to study the interoperability within a large organization.

When raising the interoperability between two or more datasets, the cost for adoption lowers. A user agent that can process one dataset will be able to ask questions over multiple datasets without big investments. If only we could become more interoperable with all other datasets online, then our data would be picked up by all existing user agents. Of course, this is not a one-way effort: other datasets also need to become more interoperable with ours, and with all other datasets on the Web. It becomes a quadratically complex problem, where each dataset needs to adopt ideas from other datasets, managed by different organizations with different ideas. In this chapter, we will look at ways to measure the interoperability of datasets, which in the same way is also a quadratic problem, as each dataset needs to be compared with all other datasets.

How can we measure the impact of a certain technology on interoperability? A first effort we made, in 2014, was comparing identifiers of available open datasets across different cities [1].


1. comparing identifiers

A simplistic approach would be to classify relations between the identifiers (ids) of two datasets in four categories. One dataset can contain the same identifier as the other dataset. When this identifier identifies the same real-world thing in both datasets, we call this a correct id match. When this identifier identifies a different real-world thing, we call this a false id match or an id conflict. In the same way, we have a correct different id and a false different id.

Table 2: A first dataset about the city of Ghent

id long lat type_sanit fee id_ghent

1 3.73 51.06 new_urinal free PS_151

In Table 2, we give an example of an open dataset of public toilets in the city of Ghent. In Table 3, a similar dataset about the city of Antwerp is given. An identifier is each element that is not a literal value such as "3.73". Identifiers may be elements of the data model, as well as real-world objects described within the dataset. In these two tables, we can label some identifiers as conflicting, other elements as correct id matches, and others as correct different ids or false different ids. However, the labeling can happen differently depending on the domain expert. Certainly when "loose semantics" are utilized, it becomes difficult to label an identifier as identifying "the same as" another identifier.

Table 3: A second dataset, now about the city of Antwerp

id long lat type fee description

1 4.41 51.23 urinal none Hessenhuis

1.1. An initial metric

In our research paper in 2013, for the sake of simplicity, we assumed that domain experts would be able to tell whether two identifiers are "the same as" or "not the same as". Given a list of statements classifying these identifiers in four categories, we would be able to derive a metric for the interoperability of these two datasets. The first metric we introduced was the id ratio.

id% = #(correct id matches) / #(real-world concept matches)

Figure 1: The identifier ratio id%

We apply this formula on top of our two example datasets in Table 2 and Table 3. The correct identifier matches are: "id", "long", "lat", "fee" (4). Conflicts would be the identifier "1" (1). Real-world concept matches would be: "id", "long", "lat", "type/type_sanit", "fee", "urinal/new_urinal" (6). (Of course, this depends on our view of the world and our definitions of the things within these datasets.) An initial metric would thus score these two datasets as 66% interoperable, with 1 conflict.

In a simple experiment, we asked programmers to give a score for the interoperability between 5 different datasets and a reference dataset. The outcome revealed that there were clear design issues with this metric: when there is a low number of real-world matches, the score is influenced quickly, as the number of samples to be tested is low. A full report on this experiment, including the research data, can be found in a GitHub repository at pietercolpaert/iiop-demo.

1.2. Identifier interoperability, relevance, and number of conflicts

There are two problems with id%. First, the id% may be calculated for two datasets that are not at all relevant to compare. This can lead to an inaccurately high or low interoperability score, as the number of real-world concept matches is low. Second, when different datasets are brought together, some identifiers are used more than others: an identifier that is used once has the same weight as an identifier that is used in almost all facts.

Instead of calculating an identifier ratio, we introduce a relevance (ρ) metric. A higher score on this metric means that two datasets are relevant matches to be merged. We can now introduce two types of relevance: the relevance of these two datasets as is (ρidentifiers), and the relevance of these two datasets if all identifiers were interoperable (ρreal-world). We define ρ as the number of occurrences of an identifier when both datasets are merged and expressed in triples. The ρidentifiers in our example becomes 8, as 2 times 4 statements can be extracted from the merged table. The ρreal-world becomes 12, as the number of occurrences that would remain if the dataset were 100% interoperable is 2 times 6. We then define the Identifier Interoperability (iiop) as the ratio between these two relevance numbers. For this dataset, the iiop becomes 66% – coincidentally the same as the id%; with more rows in the dataset, it would quickly differ. When repeating the same experiment as with the id%, we now obtain a credible interoperability metric when comparing the ρreal-world, the number of conflicts, and the iiop.
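The same example can be recast with the relevance-based metric; the following sketch is illustrative only, since a real implementation would count identifier occurrences in the merged, triple-based dataset rather than multiply hand-picked numbers:

```python
# Hedged sketch: Identifier Interoperability (iiop) as the ratio of two relevance numbers.
# With one row per example table, counting occurrences reduces to the multiplications in the text.

rows_per_dataset = 1
correct_id_matches = 4            # id, long, lat, fee
real_world_concept_matches = 6    # id, long, lat, fee, type, urinal

rho_identifiers = 2 * rows_per_dataset * correct_id_matches         # occurrences as-is: 8
rho_real_world = 2 * rows_per_dataset * real_world_concept_matches  # if fully interoperable: 12

iiop = rho_identifiers / rho_real_world
print(f"iiop = {iiop:.1%}")  # 66.7%
```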

While these three metrics may sound straightforward, gathering the statements turns out to be a tedious task. What is the threshold to decide whether two identifiers are the same as or not the same as each other [2]? A philosophical discussion arises – cfr. Theseus's paradox – whether something identified by one party can truly be the same as something identified by someone else. Researchers need to be cautious with these statements, as a same as statement is prone to interpretation.

1.3. The role of Linked Open Data

Conflicts are the easiest to resolve. Instead of using local identifiers when publishing the data, all identifiers can be preceded by a base uri. E.g., instead of “1”, an identifier can become http://{myhost}/{resources}/1. This simple trick reduces the number of conflicts to zero.
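A tiny sketch of this trick – the base uris below are made up for illustration – shows how the two colliding local identifiers stop conflicting:

```python
# Hedged sketch: prefix local ids with an authority's base uri so that "1" and "1" no longer clash.
BASE_GHENT = "http://data.gent.example/sanitation/"        # placeholder base uri
BASE_ANTWERP = "http://data.antwerpen.example/sanitation/"  # placeholder base uri

ghent_uri = BASE_GHENT + "1"      # http://data.gent.example/sanitation/1
antwerp_uri = BASE_ANTWERP + "1"  # http://data.antwerpen.example/sanitation/1

# The two local "1" identifiers now live under different authorities and cannot conflict.
assert ghent_uri != antwerp_uri
```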

In order to have persistent identifiers – identifiers that stand the test of time, as cool uris don't change – organizations introduce a uri strategy. This strategy contains the rules needed to build new identifiers for the datasets maintained within the organization.

Another problem that arose was that a third party cannot be entirely sure what was intended with a certain identifier. Linked Data documents these terms by providing a uniform interface for resolving the documentation of these identifiers: the http protocol. This way, a third party can be certain what the meaning is, and when the data is not linked, it can link the data itself more easily by comparing the definitions. The more relevant extra data is provided with such a definition, the easier linking datasets should become.

Linked Data helps to solve semantic interoperability, for which the identifier interoperability is just an indicator. By using http uris, it becomes possible to avoid conflicts – at least when an authority does not create one identifier with multiple conflicting definitions, and when the definition, resolvable through the uri, is clear enough for third parties not to misinterpret it. The effort it would take for data publishers to document their datasets better and to provide “same as” statements with other data sources is similar to just providing uris within the data and linking them with existing datasets from the start. The theoretical framework we built thus provides an extra motivation for Linked Data.

2. studying interoperability indirectly

It is understandable that policy makers today invest time and money in raising interoperability rather than in measuring its current state. When no globally unique identifiers are used, however, can we work on semantic interoperability at all? Even more generally, is interoperability a problem data publishers worry about? Or is their foremost concern to comply with regulations to publish data as is? In this section, we measure interoperability indirectly, by trying to find qualitative answers to these questions: studying the adoption by third parties, studying the technology of published datasets, and finally studying the organizations themselves.

When making the cost-benefit analysis for data reuse, if more third-party adoption can be seen while the benefits did not change, we can conclude that the datasets have become more interoperable. We can also interview market players and ask them how easy it is to adopt certain datasets today; this is an interesting post-assessment. In a study we executed within the domain of multimodal route planners [3], only a limited number of market players could be identified that reused the datasets. Each time, reusers would replicate the entire dataset locally. We concluded that still only companies with large resources can afford to reuse the data. This is again evidence that the cost of adoption, and thus the interoperability problems, need to be lowered.

When studying data policies today, we can observe a certain maturity on the five layers of data source interoperability. For example, when a well-known open license is used, we can assume the legal interoperability is higher than with other datasets. When the data is publicly shared on the Web and accessible through a url, we can say the technical interoperability is high, and when documents and server functionality are documented through hypermedia controls, we can assume the querying interoperability is high. At some level, in order to raise interoperability, we have to decide on a certain standard or technology together. From an academic perspective, we can only observe that such a decision does not hinder other interoperability layers. When studying interoperability, we could then give a higher score to those technologies that are already well adopted. Today, these technologies would be http as a uniform interface, rdf as a framework to raise the semantic interoperability, and the popular Creative Commons licenses for the legal aspect. In the next chapter, we will study organizations based on their reasoning for setting up a certain kind of service today. Epistemologically, new insights can be created by the data owner to publish data in a more interoperable way.

With the upcoming http/2.0 standard, all Web communication will be secure by default (https). In the rest of this book, we will use http as an acronym for the Web's protocol, yet we will assume the transport layer is secure.


3. qualitatively measuring and raising interoperability on the 5 layers

Aspects such as the organizational, cultural, or political environment may be enablers for higher interoperability on several levels, but should as such not be taken into account when studying data source interoperability itself. In, among others, the European frameworks isa² and eif, political interoperability is added as well.

When qualitatively studying dataset interoperability, we can discuss each of the five layers separately. In this section we give an overview of all layers and how they can be used in interviews. Depending on the quality we want to reach in our information system, we may be more strict on each of the levels, and thus for every project a different kind of scoring card can be created. This is in line with other interoperability studies [4], which overall agree that interoperability is notoriously difficult to measure, and thus qualitatively set the expectation for every project.

3.1. Legal interoperability

Political, organizational, or cultural aspects may influence legal interoperability. When studied on a global scale, there is a political willingness to reach cooperation on ipr frameworks: the better the overarching ipr framework, such as a consensus on international copyright law, the better the legal interoperability that is achieved. Today, different governments still argue they need to build their own licenses for open data publishing, which in turn lowers the legal interoperability. When there is a reason to lower interoperability, the interviewer should ask whether and why, in the opinion of the interviewee, these reasons weigh up against the disadvantages of lower interoperability.

When only considering open datasets, measuring the legal interoperability boils down to making sure that a machine can understand what the reuse rights are. The first and foremost measurement we can do is testing whether we can find an open license attached to the data source. The Creative Commons open licenses, for example, each have a uri, which dereferences into an rdf serialization that provides a machine-readable explanation of the license.
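As an illustration of such a machine check – a sketch assuming the rdflib library; the dcat snippet and the small license allowlist are hypothetical – one could test for an open license uri in a dataset's metadata:

```python
# Hedged sketch: check whether dataset metadata carries a machine-readable open license.
from rdflib import Graph, Namespace, URIRef

DCT = Namespace("http://purl.org/dc/terms/")

metadata = """
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
<http://example.org/dataset/toilets> a dcat:Dataset ;
    dct:license <https://creativecommons.org/publicdomain/zero/1.0/> .
"""

# A small, hand-picked allowlist of open license uris, purely for this example.
OPEN_LICENSES = {
    URIRef("https://creativecommons.org/publicdomain/zero/1.0/"),
    URIRef("https://creativecommons.org/licenses/by/4.0/"),
}

g = Graph().parse(data=metadata, format="turtle")
for dataset, license_uri in g.subject_objects(DCT.license):
    verdict = "open license" if license_uri in OPEN_LICENSES else "unknown license"
    print(dataset, verdict)
```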


3.2. Technical interoperability

Next, we can check how the dataset is made available. When the dataset can be reached through a protocol everyone speaks, technical interoperability is 100%. Today, we can safely assume everyone knows the basics of the http protocol. When a dataset has a direct url over which it can be downloaded, we can say it is technically interoperable. The technical interoperability can rise even further when we also refer to related pages in the response headers or body, yet we study those aspects under the querying interoperability layer. An interesting test for whether data is published in a technically interoperable way is that, given a url, one is able to download the entire dataset.
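A minimal sketch of such a test – assuming the Python requests library and a placeholder dataset url – could be:

```python
# Hedged sketch: "given a url, can a generic http client download the entire dataset?"
import requests

DATASET_URL = "http://example.org/dataset/toilets.csv"  # placeholder url for illustration

def technically_interoperable(url: str) -> bool:
    response = requests.get(url, timeout=30)
    # A plain 200 response with a body means any generic http client can fetch the data directly.
    return response.status_code == 200 and len(response.content) > 0

print(technically_interoperable(DATASET_URL))
```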

3.3. Syntactic interoperability

The syntactic interoperability of two datasets is maximized when both datasets use the same serialization. In the case of Open Data, the syntactic interoperability is not worth measuring: given a certain library in a certain programming language, every syntax can just be read in memory without additional cost (e.g., the difference in cost to parse xml vs. json will be marginal). Quantitatively studying the syntactic interoperability thus involves checking whether the actual data is machine readable, in accordance with the third star of Tim Berners-Lee.

However, much again depends on the intent. If it is the core task of a service to build spreadsheets with statistics from various sources, and these statistics are summaries made to, for instance, answer parliamentary questions, then these spreadsheets in a closed format need to be opened up. Will it be an added value if another government service now also provides a machine-readable version of these datasets? Researchers should be careful about what the outcome of blindly measuring interoperability means for internal government processes. Quick wins are not always the best solution: raising syntactic interoperability – and other kinds of interoperability as well – means changing internal processes and software. A holistic view of processes is needed. Interviews with data managers may thus be more constructive than merely measuring the syntactic interoperability.

Some syntaxes allow semantic mark-up, while other specifications and standards for specific user agents do not provide a standard way of embedding triples. Still, in this case, the identifiers for documents and real-world objects should remain technically interoperable. We could therefore measure how ready a certain government is for content negotiation and whether there is a strategy for maintaining identifiers in the long run.

3.4. Semantic interoperability

An indicator for semantic interoperability can be created by comparing identifiers, as shown in the first section. However, we can also qualitatively assess how well semantic interoperability issues are dealt with on a less fundamental level. With a closed world assumption, we can create a contract between all parties that defines which set of words is going to be used. We see evidence of this in specific domains where standards are omnipresent: within the transport domain, for example, datex2 and gtfs are good examples. However, from the moment this data needs to be combined within an open world – answering questions across the borders of datasets – these standards fail at making the connection.

Essentially, researching semantic interoperability boils down to studying how well identifiers are regulated within an information system. Again, a qualitative approach can be taken. A perfect world does not exist, but we can interview governmental bodies to find out why they made certain decisions, and check whether semantic interoperability was a problem they tried to tackle somehow.

Without rdf, there is no automated way to find semantically interoperable properties or entities. Depending on the qualitative study, and whether semantic interoperability is identified as an issue, we can decide to introduce this technology to the organization, if it is not yet known. Even when a full rdf solution is in place, the semantic interoperability can still vary across datasets.

3.5. Querying interoperability

The ldf axis – illustrated again in Figure 2 – was at all times used to discuss the queryability of interfaces. When only a query interface is offered, we would – as a quick win – also point to the possibility of offering data dumps. Hypermedia apis – what this axis advocates for – are not yet part of off-the-shelf products, which makes it particularly difficult for organizations to explore other options.

data dump (high client cost, high availability, high bandwidth) ⟷ query interface (high server cost, low availability, low bandwidth)

Figure 2: The Linked Data Fragments (ldf) idea plots the expressiveness of a server interface on an axis. On the far left, an organization can decide to provide no expressiveness at all, by publishing one data dump. On the far right, the server may decide to answer any question for all user agents.

The best example of full querying interoperability would be that, simply by publishing the data using the http protocol, the data becomes automatically discovered and used in applications.

4. conclusion

Raising interoperability entails making it easier for all user agents on the Web to discover, access, and understand your data. We explored different ways to measure interoperability between two datasets. For Open Data, however, we would then need to generalize – or scale up – these approaches to the interoperability between a dataset and “all other possible open datasets”.

A first way – not advised due to the investments needed – is to compare identifiers between these two datasets or systems. When the same identifiers are used across the two datasets, we can assume a high interoperability. However, whether identifiers actually identify the same thing is prone to interpretation. We see this exercise mainly as proof that Linked Data is the right way forward: by using http uris, we use the uniform interface of the Web to document identifiers. Furthermore, using Web addresses instead of local identifiers makes sure another Web framework – rdf – can be used to discuss the relation between different real-world objects. Through comparing identifiers, we showed the importance of rdf to raise the semantic interoperability. Also within the legal, technical, syntactic, and querying interoperability, rdf may play its role. Without rdf, we would have to rely on a different mechanism to retrieve machine-readable license information, or we would have to rely on a specification that reintroduces syntax rules for hypermedia.

Only time will tell whether using Web addresses for identifiers – and consequently rdf – will become the norm for all aspects. Well-established standards will then have to evolve to rdf as well.

A second way to measure interoperability between datasets is to study the effects. The fact that Open Data by definition should allow data to be redistributed and mashed up with other datasets means it becomes hard to automatically count each access to a data fact originating from your dataset. A successful open data policy can thus be measured by the number of parties that declare that they reuse the dataset. Interviewing the parties that voluntarily declared this fact may yield interesting insights on how to raise the interoperability. However, this is a post-assessment once an Open Data policy (or a data sharing policy) is in place.

A third way is to interview data owners within an organization on their own vision on Open Data. During the interviews, questions can be asked on why certain decisions were taken, each time categorizing the answer at an interoperability layer. This is an epistemological approach, in which data owners gain new insights while explaining their vision within the context of the 5 interoperability layers. Depending on the interviewed organizations, different technologies can be assumed accepted. While some organizations will find it evident to use http as a uniform interface, others may still send data that should be public to all stakeholders using a fax machine. For fewer cases – discussed in the next chapter – rdf was evident. Therefore, we need a good mix of desk research on the quality of the data and interviews with data maintainers in order to create a good overview of the interoperability.

In the next chapter, we introduce the context of the organizations we worked with and choose a qualitative approach to studying interoperability.

references

[1] Colpaert, P., Van Compernolle, M., De Vocht, L., Dimou, A., Vander Sande, M., Verborgh, R., Mechant, P., Mannens, E. (2014, October). Quantifying the interoperability of open government datasets. Computer, pp. 50–56.

[2] Halpin, H., Herman, I., Hayes, P. J. (2010, April). When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web. In proceedings of the World Wide Web conference, Linked Data on the Web (LDOW) workshop.

[3] Walravens, N., Van Compernolle, M., Colpaert, P., Mechant, P., Ballon, P., Mannens, E. (2016). “Open Government Data”-based Business Models: a market consultation on the relationship with government in the case of mobility and route-planning applications. In proceedings of the 13th International Joint Conference on e-Business and Telecommunications (pp. 64–71).

[4] Ford, T. C., Colombi, J. M., Graham, S. R., Jacques, D. R. (2007). A Survey on Interoperability Measurement. In proceedings of the 12th ICCRTS.


Chapter 4
Raising Interoperability of Governmental Datasets

“ We want to see a world where data creates power for the many, not the few. ”
― Open Knowledge International.


Maintaining datasets decentrally is a challenging job of constantly weighing trade-offs: will you invest more in keeping the history of your dataset, or will you invest more in creating specific materialized versions of your data? As researchers within our team at imec, we were able to study different organizations from the inside out to study the barriers to publishing data as Open Data. We discuss three clusters of datasets that we have collaborated on to shape the Open Data policy. First, we introduce a proof of concept for Local Council Decisions as Linked Data, a project that is being developed further in 2017–2018. Then we discuss the datasets of the Department of Transport and Public Works – for whom we did an assessment of their Open Data strategy – and their history. Finally, we discuss the governance of data portals and how we can discover datasets through their metadata. We found that raising the interoperability within these organizations is not easy, yet at a certain moment we were able to suggest actions to improve on the state of the art. In each of the three use cases we formulate a list – supported by an organization – of next actions to take. Datasets necessary for the use case of route planning often originate from datasets that initially were not built with route planning in mind. Yet we would still like route planners to ideally take council decisions into account in order to update their maps.

As we study datasets governed within Europe, we need to understand the policy background. Three European directives – the Public Sector Information (psi) directive, the Infrastructure for Spatial Information in the European Community (inspire) directive, and the Intelligent Transport Systems (its) directive – regulate how, respectively, public sector information, geospatial information, and transport data need to be shared. These directives still leave room for implementation.

First, the psi directive states that each policy document that has been created in the context of a public task should be opened up. Thanks to this directive, open became the default: instead of having to find a reason to make a document publicly available, a good reason now needs to be made public in order to keep certain datasets private.

Next, the inspire directive regulates how to maintain and disseminate geospatial data. It defines the implementation rules needed to set up a Spatial Data Infrastructure (sdi) for the European Community. inspire has a holistic approach, defining rules for, among others, metadata, querying, domain models, and network services. Although it originated for purposes of EU environmental policies and policies or activities which may have an impact on the environment, the 34 themes in inspire can be used within other domains as well. Take for example the domain of public transport, where entities such as the “railway station of Schaerbeek” or administrative units such as “the city of Schaerbeek” also need to be described using spatial features.

Finally, the its directive focuses on the transport domain itself. In essence it is not a data sharing directive, but it has delegated acts in which the sharing of data in an information system is described.

In this chapter, we chronologically run through three periods. The first period was when the psi directive gained traction and the first Open Data Portals were set up; I had just started my research position and was trying to create a framework to study Open Data, its goals, and how we could structure the priorities when implementing such a data portal. Next, we discuss transport datasets within the Flemish Department of Transport and Public Works (dtpw), where the overlap between the three directives becomes apparent. Finally, we discuss an opportunity to publish the content of local council decisions as the authoritative source of a variety of facts.


1. data portals: making datasets discoverable

1.1. The DataTank

The first project I helped shape at imec was The DataTank [1]. It is an open-source project that, in essence, helps to automate data publishing and brings datasets closer to the http protocol and the Web. It contains tools to republish unstructured data in common Web formats. Furthermore, it automatically gets the best out of the http protocol, adding:

• An access-control-allow-origin: * header, which allows the resource to be used within a script on a different domain.

• A caching header, which allows clients to know whether the resource was updated or not.

• Paging headers when the resource would be too large.

• Content negotiation for different serializations.

In order to allow a quick overview of the data in the resource, the software also provides a best-effort html visualization of the data.
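As an illustration of what a reusing client sees – a sketch with a placeholder resource url, not a reference to an actual DataTank deployment – these headers can be inspected with the requests library:

```python
# Hedged sketch: inspect the http headers a DataTank-style resource is expected to expose.
import requests

RESOURCE_URL = "http://data.example.org/toilets"  # placeholder url for illustration

# Ask for a JSON serialization via content negotiation.
response = requests.get(RESOURCE_URL, headers={"Accept": "application/json"}, timeout=30)

print(response.headers.get("Access-Control-Allow-Origin"))  # ideally "*"
print(response.headers.get("Cache-Control"), response.headers.get("ETag"))  # caching information
print(response.headers.get("Link"))          # paging links, if the resource was split
print(response.headers.get("Content-Type"))  # the negotiated serialization
```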

When creating The DataTank, we built it around the 5 stars of Linked Open Data [2]. For each star, our goal was to provide an http interface that would not overload the back-end systems. The data source interoperability can be lifted on the legal level, where the metadata includes a uri to a specific license. Also on the technical level, it publishes a dataset over http and adds headers to each response. Furthermore, syntactically, it provides each dataset in a set of common serializations. Semantically, The DataTank can read, but does not require, data in rdf serializations as well. Using the tdt/triples package, one can automatically look up the documentation of a certain entity using http uri dereferencing [3]. This requires the dataset itself to use uris and an rdf serialization.


1.2. 5 stars of Open Data Portals

When making the requirements analysis of the future roadmap of The DataTank back in 2013, we published the 5 stars of data portals [4]. The idealistic goal was that a data portal should allow the data to be maintained as if it were a common good. The five stars of Open Data Portals are to be interpreted cumulatively and were defined as follows:

★ A dataset registry
A list of links towards datasets that are openly licensed.

★★ A metadata provider
Make sure the authentic sources inside your organization add the right metadata fields (e.g., according to dcat). This list of datasets should in its turn be licensed openly, so that other portals can aggregate or cherrypick datasets.

★★★ A cocreation platform
Support a conversation about your data.

★★★★ A data provider
Make sure the resources inside your organization are given a unique identifier and that you have interoperable access to the datasets themselves, not only their metadata.

★★★★★ A common datahub
Have a way for third parties to contribute to your dataset.

The DataTank's functionality focuses on the second star; development of further stars was never pursued. Data projects such as Wikidata, OpenCorporates, Wikipedia, or OpenStreetMap come closest to what was envisioned with the fifth star.

1.3. Open Data Portal Europe and the interoperability of transport data

The European Data Portal (available at https://www.europeandataportal.eu/) brings together all data that is available on national Open Data Portals in Europe. Analysis of the current 11,003 datasets can create an overview of data interoperability in Europe.

On a legal level, we notice that the majority of datasets use standard licenses, which increases legal interoperability. However, the Belgian public transport datasets are still not on the European data portal, as they do not have an open license. Technically, while 85.5% of the datasets are directly accessible through a url, 1,584 (14.5%) dataset descriptions only link to an html page with instructions on how to access the dataset. On a semantic level, only 295 (2.7%) datasets use an rdf serialization.

1.4. Queryable metadata

Data describing other datasets, typically maintained by a data portal, needs to be available for reuse as well. We can thus apply the theory of maximizing the reuse of datasets to dcat data. The standard license within The DataTank and within ckan – at the time of writing the most popular data portal software – is the Creative Commons Zero waiver, which provides full legal interoperability with any other dataset. The data is technically available over http and is typically available in one or more rdf serializations. However, we still notice some interoperability issues with data portal datasets today.

One of the problems happens – despite using rdf – on the semantic level. For instance, when harvesting the metadata from European member states' data portals onto the European data portal, new uris are created per dataset instead of reusing the existing ones.

DCAT data dump (high availability, high bandwidth) ⟷ SPARQL (query language) endpoint (high server cost, low bandwidth)

Figure 1: The Linked Data Fragments (ldf) axis applied to metadata.

One of the goals behind dcat is to allow metadata sets with similar semantics to be maintained decentrally. However, the querying itself still happens centrally. We proposed an in-between solution with tpf [5]: this way, a European data portal would not have to replicate all datasets anymore, but would only be the authoritative registry of national data portals. Thanks to the tpf interface and extra building blocks, such as a full-text search exposed at the level of each member state, the same user interface could still be created. Furthermore, the harvesting that takes place today would be automated using http caching infrastructure.
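To illustrate what querying dcat metadata could look like on the client side – a sketch assuming rdflib and a placeholder catalogue document; the actual tpf set-up proposed in [5] fetches fragments rather than one file – consider:

```python
# Hedged sketch: list the downloadable distributions described in a dcat catalogue.
from rdflib import Graph

CATALOGUE_URL = "http://data.example.org/catalogue.ttl"  # placeholder catalogue document

g = Graph()
g.parse(CATALOGUE_URL, format="turtle")

query = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>
SELECT ?dataset ?title ?downloadURL WHERE {
    ?dataset a dcat:Dataset ;
             dct:title ?title ;
             dcat:distribution ?distribution .
    ?distribution dcat:downloadURL ?downloadURL .
}
"""
for row in g.query(query):
    print(row.dataset, row.title, row.downloadURL)
```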

2. open data in flanders

In 2015, we had the opportunity to study a common vision throughout all the suborganizations within the Flemish Department of Transport and Public Works (dtpw) in Flanders. We studied the background of several datasets and, without taking a position ourselves, interviewed all data maintainers and department directors on their experience with “Open and Big Data”. On the basis of this input, we created recommendations for action that would help move the Open Data policy within the dtpw forward.

In this section, we first describe the background of different key datasets in Flanders, and then discuss the specific datasets at the dtpw. We subsequently report on the workshops and the challenges that came out of the interviews, and explain our recommendations for action.

2.1. A tumultuous background

In Flanders, the Road Sign Database (rsd) project started in 2008 as a result of an incident the year before, in which a truck got jammed under a bridge because the driver was unaware the truck was too high. The investigation after the incident showed there were no traffic signs notifying drivers of a maximum vehicle height. As a result, the traffic in the city was jammed for several hours. As this was not the first time something like this had happened, the rsd was born by ministerial decree, building a complete central database of traffic signs in Flanders. However, the rsd has been the subject of data management research [6], as the database did not live up to its expectations.

The responsibility to implement this database was split: the Agency for Roads and Traffic (art) would maintain the traffic signs on the regional roads, while the Flemish Department of Transport and Public Works (dtpw) would maintain the traffic signs on the local roads. A company took 360° pictures of all roads in Flanders, and using these pictures, all road signs were indexed in a register. Two years later, by August 2010, this inventory was complete for all 308 municipalities and the dtpw made the rsd of the municipalities available via a software application. In order to keep the database up to date, the municipalities were asked to update it through this application, and requested to sign an agreement. Some municipalities refused to sign the agreement, and others, who did sign, complained this initial application was too complex and untrustworthy. Quickly, fewer than 30% of the municipalities kept using the rsd [6].

In 2011, for its own needs, art launched a roads database instead of the rsd, which now includes specific information on the roads. In the same year, Google Street View was launched in Flanders, providing an application to view the 360° pictures Google took of all streets, which local councils found easier to use than the rsd application. In March 2013, a new, more stable and user-friendly application of the rsd was launched, yet the adoption of the application remained low. The original cost of the rsd was estimated at €5 million, yet by August 2010 this estimate had risen to €15 million, and by March 2013, when the new, more user-friendly version should have launched, it had risen to €20 million. In April 2014, yet another database was launched at the Agency for Geographic Information: the General Information Platform for the Public Domain, which collects all road works in order not to block roads that are currently part of a deviation. To date, local governments are legally bound to filling out numerous databases of the higher government, yet in practice only few databases remain well maintained.


In August 2014, a new minister of mobility was appointed. While the new minister first looked into making it legally binding to fill out the database, he later looked at limiting the scope of the rsd to speed signs, defeating its initial purpose, yet making sure third parties would show the right speed limit in in-car navigation systems. He also commissioned a study to see what other mobility datasets could be shared as Open Data for the purpose of serving route planner developers, which in its turn funded the research reported in this paper.

In order to off-load local governments, one avenue currently being explored is to make local council decisions available as Linked Data, which aligns well with the European “Once Only” principle. As part of the psi directive, local administrations need to open up the decisions taken in the local council. The numerous databases local governments have to fill out today would then be able to fetch the fragments of the local council decisions they are interested in themselves. Today, this approach is being evaluated; perhaps it can be rolled out over all 308 municipalities. We will discuss this project in more detail in the next section.

In order to lower the cost for third parties to integrate the data that will be published, we look into raising the interoperability of these data sources. In the following sections, we describe the method used to find the challenges the dtpw still sees that need to be overcome in order for all agencies to publish their data for maximized reuse.

2.2. Discussing datasets at the Department of Transport and Public Works

Out of 27 interviews with data owners and directors working in the policy domain of mobility, we collected a list of datasets potentially useful for multimodal route planning. Definitions of a dataset, however, diverged depending on who was asked. Data maintainers often mentioned a dataset in the context of an internal database, used to fulfil an internal task, or used to store and share data with another team. For directors, a dataset would be a publicly communicated dataset, e.g., a dataset for which metadata can be found publicly, a dataset that would be discussed in politics, or a dataset the press would write stories about. In other cases, a dataset would exist informally as a web page, or as a small file on a civil servant's hard drive.

A data register of the mobility datasets that are part of the Open Data strategy can now be found at http://opendata.mow.vlaanderen.be/. The list consists of publicly communicated datasets as well as informal data sources published on websites. During the interviews, we were able to gather specific challenges related to specific datasets useful for multimodal route planning, summarized in Table 4: for a dataset to be truly interoperable, all boxes need to be ticked.

Table 4: Selection of studied datasets with their interoperability levels as of October 2016

Dataset                   | legal          | tech     | syntax | semantic | querying
--------------------------|----------------|----------|--------|----------|------------------
Traffic Events            | open license   | yes      | xml    | no uris  | file
Roads database            | open license   | yes      | xml    | no uris  | no
Validated statistics      | open license   | yes      | csv    | no uris  | no
Information websites      | no             | yes      | html   | no uris  | linked documents
Public Transit timetables | closed license | over FTP | ZIP    | no uris  | dump
Road Signs                | no license     | yes      | none   | no uris  | no
Address database          | open license   | yes      | xml    | PoC      | dump and service
Truck Parkings            | open license   | yes      | xml    | no uris  | file
Metadata catalogue        | open license   | yes      | xml    | yes      | dump


Traffic events on the Flemish highways

This dataset is maintained by the Flemish Traffic Center, has an open license, and is publicly available. It describes traffic events on the highways only, to which the core tasks of the traffic center are limited. The dataset can be downloaded in xml. For the semantics in this xml, two versions in two different specifications (otap and datex2) are available, for which the semantics can be looked up manually. The elements described in the files are not given global identifiers, however, making it impossible to refer to a similar object in a different dataset. The dataset is small and is published as a dynamic data dump. As the dataset is small enough to be contained in one file, it can be fetched over http regularly, as can its updates. The http protocol works well for dynamic files, as caching headers can be configured in order not to overload the server when many requests happen in a short time. Except for the semantic interoperability, the file thus also provides a good basis for federated route planning queries.
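A sketch of how a reuser might poll such a dynamic dump without overloading the server – the url is a placeholder, and whether this particular server emits an ETag header is an assumption – could be:

```python
# Hedged sketch: poll a small dynamic xml dump with conditional requests.
import time
import requests

EVENTS_URL = "http://example.org/traffic-events.xml"  # placeholder url for illustration
etag = None

for _ in range(10):  # poll ten times; a real client would keep running
    headers = {"If-None-Match": etag} if etag else {}
    response = requests.get(EVENTS_URL, headers=headers, timeout=30)
    if response.status_code == 304:
        print("not modified, reusing the cached copy")
    elif response.status_code == 200:
        etag = response.headers.get("ETag")
        print(f"fetched {len(response.content)} bytes of traffic events")
    time.sleep(60)  # caching headers keep the server load manageable between polls
```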

Road database for regional roads

The road database for the regional roads is maintained by art. It is a geospatial dataset and already has to comply with the inspire directive. Its geospatial layers are thus already available as web services on the geospatial access point of Flanders: http://geopunt.be. The roadmap in 2016 was to also add an open license and to publish the data as linked files using the tn-its project's specification (http://tn-its.eu/).

Validated statistics of traffic congestion on the Flemish highways

Today, validated statistics of traffic congestion on the Flemish highways are published under the Flemish Open Data License by the Flemish Traffic Center. A website was developed that allows anyone interested to create charts of the data, as well as to export the selected statistics as xls or csv. The legal, technical, and syntactic interoperability are thus fully resolved. Yet when looking at the semantic interoperability, no global identifiers are used within the dataset. Furthermore, when looking at the querying interoperability, machines are even discouraged from using the files, as a test for whether you are a human (a captcha) is used to prevent machines from discovering and downloading the data automatically. When a csv file is requested, the server generates it on the fly from the database with the historic data.

Information Websites

Examples of such datasets are the real-time dataset of whether a bicycle elevator and tunnel is operating (http://fietsersliften.wegenenverkeer.be/), a real-time dataset of whether a bridge is open or not (http://www.zelzatebrug.be/ shows when a bridge north of the city of Ghent will open again when closed), and a dataset of quality labels of car parks next to highways (http://kwaliteitsparkings.be). The three examples mentioned can be accessed in html. Nevertheless, this too is a valuable resource for end-user applications: if the pages were openly licensed and the data were annotated with web addresses, the data could be extracted and replicated with standard tools, and questions could be answered over these different data sources. These three examples are only technically and syntactically interoperable, as they use html to publish the data, yet there are no references to the meaning of the words and terms used. Furthermore, there is no open license on these websites, so reuse of the data is not explicitly allowed. Finally, as the data can easily be crawled by user agents and thus replicated, we reason that, in a limited way, the data could be used in a federated query.

Public transit timetables maintained by De Lijn

Planned timetables, as well as access to a real-time web service, can be requested through a one-on-one contract. This contract results in an overly complex legal interoperability. First, a human needs to request access to the data, which can be denied. Furthermore, the standard contract does not allow a third party to sublicense the data, which makes republishing the data, or a derived product, impossible. The planned timetables can be retrieved in the gtfs specification, which is an open specification, making the dataset syntactically interoperable. The identifiers used within this dataset for, e.g., stops, trips, or routes do not have a persistency strategy; therefore, the semantic interoperability cannot be guaranteed. As a dump is provided, potential reusers have reliable access to the entire data source. The querying interoperability could be higher if the dataset were split into smaller fragments.
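For context, a gtfs dump is a zip archive of csv files; a minimal sketch of how a reuser could read the stops out of such a dump – the file name is a placeholder – looks as follows:

```python
# Hedged sketch: read the stops out of a gtfs dump (a zip archive of csv files).
import csv
import io
import zipfile

with zipfile.ZipFile("delijn-gtfs.zip") as feed:  # placeholder file name
    with feed.open("stops.txt") as stops_file:
        reader = csv.DictReader(io.TextIOWrapper(stops_file, encoding="utf-8-sig"))
        for stop in reader:
            # stop_id is a local identifier without a persistency strategy,
            # which is exactly the semantic interoperability issue noted above.
            print(stop["stop_id"], stop["stop_name"], stop["stop_lat"], stop["stop_lon"])
```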

Road Sign Database (rsd)

The database, in October 2016, is still only available through a restricted application. It is a publicly discussed dataset, as its creation was commissioned by decree. On a regional level, the rsd is in reality two data stores: one database for regional road signs, managed by art, and a database that collects the local road signs, managed by the department itself. Some municipalities would, however, also keep a copy of their own road signs at a local level, leading to many interoperability problems when trying to sync. Sharing this data with third parties happens only through the publicly communicated rsd, which is only accessible through the application of the rsd itself.

Address database

A list of addresses is maintained by yet another agency, called Information Flanders. The database has to be updated by the local administrations, just like the rsd. Thanks to the simplicity of the user interface and the fact that it is mandatory to update the database when changing, removing, or adding addresses, the database is well adopted by the local governments. It is licensed under an open license, and it is published on the Web in two ways: a data dump is updated regularly, and a couple of web services that work on top of the latest dataset are provided. Currently, Information Flanders is creating a Proof of Concept (PoC) to expose the database as Linked Data, such that every address will get a uri.

Truck parkings on the highways

This dataset needs to be shared with Europe, which in its turn makes it publicly available at the European Union's data portal (http://data.europa.eu/euodp/en/data/dataset/etpa). The dataset is publicly available, under an open license, as xml, using the datex2 stylesheet. The file, however, does not contain persistent identifiers, so it is impossible to guarantee the semantic interoperability. As with the traffic events, the file allows querying by downloading the entire file.

Open Data portal’s metadata

In order for datasets to be found by, e.g., route planning user agents, they need to be discoverable. The metadata of all datasets in Flanders is available at http://opendata.vlaanderen.be in rdf/xml. The metadata is licensed under a Creative Commons Zero license, and for each dataset and its way of being downloaded (distribution), a uri is available. To describe the datasets, the uri vocabulary dcat is used, which is a recommendation by the European Commission for describing data catalogues in an interoperable way. However, within inspire, another metadata standard was specified for geospatial data sources. Geodcat-ap is, at the time of writing, being created to align inspire and dcat (https://joinup.ec.europa.eu/node/139283). It is thus far the only dataset that complies, in an early form, to all the interoperability levels introduced in this paper.

2.3. Challenges and workshops

We organised two workshops: one to validate the outcomes of the interviews with the different governmental organisations, the other to align the market needs with the governmental Open Data roadmap. In the first workshop, we welcomed a representative of each organisation within the dtpw that we had already met during a one-on-one interview. In the first half, we had an introductory programme in which we summarised the basics of an Open Data policy: the open definition, the implementation of the psi directive in Flanders, and the interoperability layer model. Furthermore, we gave a short summary of the results of the interviews with the market stakeholders. The key challenges, initially identified by the heads of division of the dtpw, were listed and discussed. In order to identify these challenges, all interviews were first analysed in search of arguments both pro and contra an open data policy. In the second half of this workshop we had three parallel break-out sessions in which we discussed unresolved questions that came out of the interviews. The arguments that returned most often were bundled and summarised into ten key challenges:

Should data publishing be centralised or decentralised within the department and what process should be followed?

This challenge refers to how data should reach the market and the public. A variety of scenarios can be envisaged here, each with benefits and disadvantages. This is not only a very practical challenge, but also one that relates to responsibility, ownership, and the philosophy behind setting up an open data policy. Potential scenarios that may resolve this challenge are also dependent on political decisions and the general vision for data management at the policy level. The workshop showed that a lot of political and related organisational aspects come into play in relation to this challenge. Attention to the balance between what is strategically possible and technically desirable is key in tackling this challenge.

Ensuring reusers interpret the data correctly

This refers to the fact that the context in which data are generated within government needs to be very well understood by potential reusers. Certain types of data require a certain domain expertise to be interpreted in a correct manner. While good and sufficient metadata can partially answer this challenge, in very specific cases a meeting with the related data managers from the opening organisation will be required to avoid misinterpretation.

Acquiring the right means and knowledge on how to publish open data within our organisation

Setting up an open data policy also requires internal knowledge, particularly in larger and complex organisations. This means that the right people need to be identified internally, giving them access to training, while also giving them responsibility and clearing them of other tasks. In other cases, there may be a need to attract external knowledge on the topic that can then be internalised. In any event, proper training (and for example a train-the-trainer programme) is key in developing and executing a successful open data strategy.

Knowing what reusers want

If the goal of opening up is to maximise the reuse of data, it is important to understand what potential reusers are looking for, not only in terms of content but also in terms of required standards, channels, and interactions. Various forms of interaction can be used to gain this insight, and the most appropriate one will depend on the organisations involved and their goals. One-to-one meetings are preferably avoided to alleviate concerns of preferential treatment, but co-creation workshops, conferences, and study days can be a potential solution to this challenge.

Influencing what reusers do with the data

This is a challenge that governments certainly struggle with: providing open data means giving up control over what happens with that data. The question captured here is how governments can guide, nudge, or steer the reuse of data so that the resulting applications, services, or products still support the policies defined by them. While illegal reuse of data is by definition out of the question, the main challenge here – from the government's perspective – is how to deal with undesired reuse. Again, consultation and dialogue are key: if the market understands the logic behind certain datasets as well as the reasons behind the government opening them up, undesired reuse becomes less of a potential issue. If, on the other hand, the market has ideas that the government had not anticipated, a dialogue can take place on the practical implications of that reuse.

Supporting evidence-based policy-making

Open data does not only serve reuse outside of the opening organisation, but can also be put to use within different departments and divisions, for example. The challenge defined here is how to make optimal use of data to shape policies, based on real-life evidence. To resolve this, an internal department or cell that follows up all data-related activities, acts as a single point of contact, and defines data policies could play a role in examining how data can contribute to policy-making.

Creating responsibility

The main question here is where the role of government stops and to what extent it should further enhance or improve datasets beyond its own purposes. Furthermore, the basic minimum quality should be defined and made explicit towards potential reusers. Tackling this challenge means clearly defining a priori where the role of government ends. As this is also a political discussion to some extent, having a clearly communicated policy is key.

Raising government’s efficiency

This challenge deals with the potential gains that open data can bring for the Department as an organisation of organisations, but also for the Department as an actor within the policy domain of Transport and Public Works. Internal processes need to be established to ensure that an efficiency gain at the level of the organisation itself is enabled.

Ensuring sustainability once a dataset is published

Next to covering short-term initiatives, long-term processes also need to be set up within the organisation so that the open data policy is sustainable both for the government organisation and for the outside world. This means a smart design of such processes and guidelines, which are also constantly evaluated and tested against practice.

Ensuring the technical availability of datasets

This final challenge questions the basic guarantees that government should provide towards the publication and availability of the datasets. At which point does this become a service that does not necessarily need to be provided by government for free, and what is the basic level of support (e.g., is a paid sla provided for 24/7 data availability and tech support)? Again, this decision is a political one, but clearly communicating to stakeholders and the market what they can expect is most important. This means having an internal discussion to define these policies.

These challenges were discussed in smaller groups during the workshop in order to formulate solutions. By giving answers or providing “ways out” of these questions, the participants were challenged to think together and develop a solution that is carried by everyone in the organisation.

In the second workshop, we invited several market players reusing Flemish Open Data. As a keynote speaker, we invited CityMapper (http://citymapper.com), who outlined what data they need to create a worldwide multimodal route planner.

2.4. Recommendations for action

The three directives (psi, inspire, and its) were often regarded as the reference documents to be implemented. The best practices for psi, as put forward by the “Interoperability solutions for public administrations, businesses and citizens” (isa²) programme, focus on Linked Data standards for semantic interoperability. The inspire directive for geospatial data, however, brings forward a national access portal for geospatial data services, in which datasets are made available through services. There is a metadata effort, called geodcat-ap, which brings the metadata from these two worlds together in one Linked Data specification. The its directive also puts forward its own specifications, such as netex at http://netex-cen.eu/, datex2 at http://www.datex2.eu/, and siri at http://www.siri.org.uk/. These specifications do not require persistent identifiers and do not make use of uris for the data model. We advised the department to comply first with the isa² best practices, as getting persistent, auto-documented identifiers is the only option today to raise the semantic interoperability at web scale. For datasets that already complied with the inspire or its directive, the department would also make these available as data dumps (e.g., as with the roads database).

The Flemish government has style guidelines for their websites. We advised implementing extra guidelines for the addition of structured data, e.g., with rdfa. Next, a conclusion from the first workshop was to invest in guidelines for the creation of databases. This should ensure that each internally and externally communicated dataset is annotated with the right context.

In order to overcome the many organisational challenges, recommendations for action were formulated and accepted by the board of directors:

• Keeping a private data catalogue for all datasets that are created (open and non-open);

• All ICT policy documents need to have references to the Open Data principles outlined in the vision document;

• The department of dtpw is responsible for following up these next steps, and will report to the board of directors;

• Opening up datasets will be part of the roadmap of each sub-organisation within dtpw;

• On fixed moments, there will be meetings with the Agency Information Flanders to discuss the Open Data policy.

Finally, specific recommendations to data owners, as exemplified in the table, were also given.


3. local decisions as linked open data

Probably one of the most ambitious projects I have been part of is the one on Local Decisions as Linked Open Data [7]. The core task of a local government is to make decisions and document them for the many years to come. Local governments provide the decisions, or minutes, from these meetings to the Flemish Agency for Domestic Governance as unstructured data. These decisions are the authoritative source for – to name a few – the mandates and roles within the government, the mobility plan, street name changes, or local taxes.

Base registries are trusted authentic information sources controlled by an appointed public administration or an organization appointed by the government.

The rsd and the address database – as discussed earlier – are good examples of such base registries. As these examples illustrate, maintaining a base registry comes with extra maintenance costs to create the dataset and keep it up to date. Could we circumvent these extra costs by relying on a decentralised way of publishing the authoritative data?

In other countries, we see prototyping happening with the same ideas in mind. OpenRaadsInformatie publishes information from 5 local councils in the Netherlands as Open Data, as does the OParl project for local councils in Germany. Each of these projects uses its own style of json api. The data from the municipalities is collected through apis and by scraping websites, and is transformed to Linked Open Data. According to the Dutch project’s evaluation, the lack of metadata at the source has a direct impact on the cohesion between the different assets, because they cannot be interlinked. Next, the w3c Open Gov community group is discussing and preparing an rdf ontology to describe, among others, people, organizations, events, and proposals. Finally, in Flanders, the interoperability program of the Flemish Government, “Open Standards for Linked Organizations”, also referred to as oslo², focuses on the semantic level and extends the isa Core Vocabularies to facilitate the integration of the Flemish base registries with one another and their implementation in business processes of both the public and private sector.


We interviewed local governments on how they register and publish Local Council Decisions. We then organized three workshops which formulated the input for a proof of concept: two workshops were organized to create a preliminary domain model, and one workshop was organized to create wireframes on how Local Council Decisions would be created and searched through in an ideal scenario. The domain concepts were formalized into two Linked Data vocabularies, one for the metadata and one for describing public mandates, published at https://lblod.github.io/vocabulary. The proof of concept consists of four components:

1. an editor for local decisions,
2. an html page publishing service responsible for uri dereferencing,
3. a crawler for local decisions, and
4. two reuse examples on top of the harvested data.

We introduced a virtual local government called VlaVirGem for which we can publish local decisions. The editor at lblod.github.io/editor is a proof of concept of such an editor, which reuses existing base registries. You can choose to fill out a certain template for decisions that often occur, such as the resignation of a local counselor or the installation of a new one. When filling out the necessary fields, the editor will help you: for example, it will autocomplete people that are currently in office. You will then still be able to edit the official document, which contains more information such as links to legal background, context and motivation, and metadata. When you click the publish button, the decision is published as a plain html file on a file host. The uris are created as hash-uris from the document’s url.

A harvester is then set up using The DataTank. By configuring a rich snippets harvester, html files are parsed and some links are followed to discover the next document to be parsed. The extracted triples are republished, both as raw data and as an overview of the mandates. This data is the start of two reuse demos at http://vlavirgem.pieter.pm: the first generates an automatic list of mandates, and the second is a list of local decisions.

Although Local Council Decisions contain high-quality information in the form of non-structured data, the information in the authoritative source for local mandates today does not. In order to reduce the workload to share this information (e.g., a newly appointed counselor) with other governments or the private sector, the local decision can be published as a Linked Open Data document at the source.

4. conclusion

In the previous chapter, I mentioned three ways to measure interoperability. The first method was to quantify the interoperability by measuring the similarity based on user input. This method remained conceptual and remains untested in a real-world scenario today. On the basis of the projects that received funding over the following years, I hypothesize that for government data today, more obvious big steps towards raising interoperability can be taken that do not require a quantified semantic interoperability approach.

The second way was to study the effects of a publishing strategy. The more reuse can be noticed in services and applications, the better the balance between the cost of adoption and the benefits, and thus the more interoperable a dataset is. For the transport datasets published on the European data portal, only a limited amount of reuse can be found, and the high-impact datasets still have to be discovered. Also at the dtpw, our team interviewed reusers, which indicated that reuse of governmental datasets at this moment is limited [8].

You can see the current reuse cases at the European data portal: https://www.europeandataportal.eu/en/using-data/use-cases

The third way was to qualitatively study organizations on the basis of the interoperability layers. The first approach for this is through desk research. Again, a quick scan through the European or Flemish data portal reveals that the overall interoperability of these datasets is low. A second approach was to study the datasets qualitatively by means of an interview. This is what our team did at the dtpw, which revealed a list of current issues and recommendations for action. The list of current issues could also be a useful means for comparing the results of such qualitative studies.

Finally, we elaborated on a proof of concept built for local council decisions as Linked Data. Council decisions annotated with Linked Data have the benefit of less manual work, and civil servants can search more easily through current legislation. Our team also noticed a potential quality gain in editing due to correct legal references (even references to decisions of their own municipality) and the use of qualitative factual data (e.g., addresses linked to the Central Reference Address Database). Finally, there are also efficiency gains in the publication of the decisions, which are automatically published on the website of the local government and in the codex, and which become suitable for reuse by third parties without additional effort. Local Decisions as Linked Data is today a typical example of how the Flemish government can stimulate decentralized data governance, yet offer centralized tools for local governments that are not able to follow up on European data standards.

In the city of Ghent in Flanders, for one, it would take more than 2 months to update all route planning systems when updates happen.

How long would it take for a local council decision to get into a route planning system? Still today, someone from a local, regional or national government would need to contact route planning organizations and provide them with the right updates in the format they need. It is an engineering problem to automate the process of data reuse across the various interoperability issues, but it is also a policy problem to bring a better data culture to a large organization. Today, within the Flemish government, we notice a change towards “assisted decentralization”, in which data systems are architecturally decentralised, but where the regional government provides services to the underlying governments.

In these three use cases, I started from the perspective of a specific data publisher and worked my way up in the organization to see what would be needed for a more interoperable dataset. In the next chapter, we dive deeper into the field of transport data specifically.


references

[1] Colpaert, P., Dimou, A., Vander Sande, M., Breuer, J., Van Compernolle, M., Mannens, E., Mechant, P., Van de Walle, R. (2014). A three-level data publishing portal. In proceedings of the European Data Forum.

[2] Colpaert, P., Vander Sande, M., Mannens, E., Van de Walle, R. (2011). Follow the stars. In proceedings of Open Government Data Camp 2011.

[3] Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R. (2014). Painless uri dereferencing using The DataTank. European Semantic Web Conference (pp. 304–309).

[4] Colpaert, P., Joye, S., Mechant, P., Mannens, E., Van de Walle, R. (2013). The 5 Stars Of Open Data Portals. In proceedings of the 7th international conference on methodologies, technologies and tools enabling e-Government (pp. 61–67).

[5] Heyvaert, P., Colpaert, P., Verborgh, R., Mannens, E., Van de Walle, R. (2015, June). Merging and Enriching dcat Feeds to Improve Discoverability of Datasets. Proceedings of the 12th Extended Semantic Web Conference: Posters and Demos (pp. 67–71). Springer.

[6] Van Cauter, L., Bannister, F., Crompvoets, J., Snoeck, M. (2016). When Innovation Stumbles: Applying Sauer’s Failure Model to the Flemish Road Sign Database Project. IGI Global.

[7] Buyle, R., Colpaert, P., Van Compernolle, M., Mechant, P., Volders, V., Verborgh, R., Mannens, E. (2016, October). Local Council Decisions as Linked Data: a proof of concept. In proceedings of the 15th International Semantic Web Conference.

[8] Walravens, N., Van Compernolle, M., Colpaert, P., Mechant, P., Ballon, P., Mannens, E. (2016). Open Government Data-based Business Models: a market consultation on the relationship with government in the case of mobility and route-planning applications. In proceedings of the 13th International Joint Conference on e-Business and Telecommunications (pp. 64–71).


Chapter 5
Transport Data

“ Logic will get you from A to B; imagination will take you everywhere. ”
― Anonymous.


Route planning applications can better target end-user needs when they have access to a higher quantity of datasets. The algorithms used by these apps need access to datasets that exceed one database and need query functionality that exceeds standard query languages. We rethink the access to data by studying how route planning algorithms work. In this chapter, we describe the state of the art of route planning interfaces. Transport data today is published either as a data dump, or as a web service which answers the entire question on the server side. If we want to advance the state of the art in sharing data for maximum reuse, we will have to establish a new trade-off between client and server effort.

The ways travelers want route planning advice are diverse: from finding journeys that are accessible with a certain disability [1], to taking into account whether the traveler owns a (foldable) bike, a car or a public transit subscription, or even calculating journeys with the nicest pictures on social network sites [2]. However, when presented with a traditional route planning http api taking origin–destination queries, developers of, e.g., traveling tools are left with no flexibility to calculate journeys other than through the functions provided by the server. As a consequence, developers that can afford a larger server infrastructure integrate data dumps of the timetables (and their real-time updates) into their own system. This way, they are in full control of the algorithm, allowing them to calculate journeys in their own manner, across data sources from multiple authorities.

For instance, answering the question “how long do I have to walk from one point to another?” can take into account the geolocation of the streets, the weather conditions at that time of the day, the steepness of the road, whether or not there is a sidewalk, criminality reports to check whether it is safe to walk through these streets, the wheelchair accessibility or the accessibility for the visually impaired of the road, whether the street is blocked by works at that time, etc. We can imagine the complexities that arise if the user does not only want to walk, but also wants advice taking different transport modes into account. An open world approach is needed: a certain dataset should be queried with the assumption that it can only answer part of the question, and that a better answer can always be found by using more datasets [3].

In this chapter, we first scratch the surface of publishing data for route planning on the road as well, by building a proof of concept in collaboration with the city of Ghent. We then discuss the state of the art in the field of public transit route planning, the focus of this book. Finally, we conclude with opportunities to evolve this state of the art with Linked Connections, described in the next chapter.

1. data on the road

As we have discussed in the previous chapter, it is difficult to draw the line between transport data and non-transport data. Even merely administrative datasets, such as the alteration of a street name, may at some point become useful for a route planning algorithm. We first discuss, within the field of Intelligent Transport Systems (its), what the constraints should be for an information system in order to share transport data world-wide.

1.1. Constraints for a large-scale its data-sharing system

Within the domain of its, the its directive helped popularize publicly sharing data. For example, with a delegated regulation elaborating on a European Access Point for Truck Parking Data, it regulates sharing the data through a national access point, similar to the inspire directive, a directive for sharing data within the geospatial domain.

When creating a system to distribute data within our own domain – for the many years to come – an important requirement is that this data policy needs to scale up efficiently. When more datasets are added, when more difficult questions need to be answered, when more questions are asked, or when more user agents come into action, we want our system to work without architectural changes. Furthermore, the system should be able to evolve while maintaining backwards compatibility, as our organization is always changing. Datasets that are published today still have to work when they are accessed after new personnel is in place. Such a system should also have a low entry barrier, as it needs to be adopted by both developers of user agents as well as data publishers.

1.2. A use case in Ghent

The mobility organization of Ghent was given the task to build a virtual traffic centre. The centre should inform Ghentians with the latest up-to-date information about mobility in the city. As the city is in a transition, banning cars from the city centre, this is a project with high expectations. Information to build this traffic centre comes from existing geospatial layers, containing the on-street parking zones with, among others, their tariffs, all off-street parking lots, all streets and addresses, and all traffic signs. Real-time datasets are also available, such as the sensor data from the induction loops, bicycle counters, thermal cameras, and the availability of the off-site parking lots. Third parties also contribute datasets, such as the public transit time schedules and their real-time updates, and traffic volumes in, to and leaving the city. In order to bring these datasets from various sources together and analyse them, both the semantics and the queryability are lacking.

The city of Ghent allowed us to define a couple of uris for parking sites. In this city, a uri strategy to negotiate identifiers has been in place since 2016. The uri strategy defines a base uri at “https://stad.gent/id/”. Using this strategy, the city introduced uris for each parking site, similar to the examples above. When we now point our browser to https://stad.gent/id/parking/P1, we will be directed to a page about this parking space. Furthermore, this identifier is interoperable across different systems: when a computer program retrieves this uri, it can negotiate the content type and request an rdf representation of choice.
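As an illustration, the sketch below shows how a user agent could negotiate an rdf representation for such a parking-site uri from JavaScript. The Turtle media type is only one of the representations a server might offer, so the Accept value here is an assumption rather than a documented guarantee.

// Minimal sketch of http content negotiation against a parking-site uri.
// Which RDF serializations the server actually offers is an assumption.
const PARKING_URI = 'https://stad.gent/id/parking/P1';

async function fetchAsTurtle(uri) {
  const response = await fetch(uri, {
    headers: { Accept: 'text/turtle' }, // ask for an RDF representation
    redirect: 'follow'                  // the identifier redirects to a document about the resource
  });
  console.log('Resolved to', response.url, 'as', response.headers.get('content-type'));
  return response.text();               // the document describing the parking site
}

fetchAsTurtle(PARKING_URI).then(console.log).catch(console.error);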


As there are only a couple of parking sites in the city of Ghent, describing this number of parking sites easily fits into one information resource identified by one url, for instance http://linked.open.gent/parking/. When the server does not expose the functionality to filter the parking sites on the basis of geolocation, all user agents that want to solve any question based on the location of parking sites have to fetch the same resource. This puts the server at ease, as it can prepare the right document once each time the parking sites list is updated. Despite the fact that all parking sites now have to be transferred to the user agent, and thus more bandwidth is consumed, the user agent can benefit as well. When a similar question is executed from the same device, the dataset will already be present in the user agent’s cache, and then no data at all needs to be transferred. This raises the user-perceived performance of a user interface. When the number of end-users increases by a factor of a thousand per second – not uncommon on the Web – it becomes easier for the server to keep delivering the same file for those user agents that do not have it in cache already. When it is not in the user agent’s own cache, it might already be in an intermediate cache on the Web, or in the server’s cache, resulting in less cpu time per user. Caching, another one of the rest constraints, thus has the potential to eliminate some network interactions and server load. When exploited, a better network efficiency, scalability, and user-perceived performance can be achieved.

Without the cors header, a resource is by default flagged as potentially containing private data, and cannot be requested by in-browser scripts at a different domain. More information at enable-cors.org.

In a test environment, we published a Linked Data document at http://linked.open.gent/parking/, by transforming the current real-time xml feeds using a php script. First, this script adds metadata to the document indicating that this data has an open license, making it legally interoperable with other Web resources. Next, we added http headers indicating that this document can be used for Cross-Origin Resource Sharing (cors), as well as an http header stating that this document can be cached for 30 seconds. Finally, we also added content negotiation for different rdf representations.
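The transformation itself was done with a php script; what follows is merely a minimal Node.js sketch of the same three http-level decisions: cors, a 30-second cache lifetime, and a rudimentary form of content negotiation. The serialization helpers are hypothetical placeholders, not the actual code behind linked.open.gent.

// Sketch of the http behaviour described above (not the original php script).
const http = require('http');

// Hypothetical helpers that would serialize the current parking-site data.
const asJsonLd = () => JSON.stringify({ /* parking sites as json-ld, with an open license statement */ });
const asTurtle = () => '# parking sites as turtle, with an open license statement\n';

http.createServer((req, res) => {
  const accept = req.headers.accept || '';
  const wantsTurtle = accept.includes('text/turtle');

  res.setHeader('Access-Control-Allow-Origin', '*');    // cors: reusable by in-browser scripts anywhere
  res.setHeader('Cache-Control', 'public, max-age=30'); // cacheable for 30 seconds
  res.setHeader('Content-Type', wantsTurtle ? 'text/turtle' : 'application/ld+json');
  res.end(wantsTurtle ? asTurtle() : asJsonLd());
}).listen(8080);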

As we would like to solve Basic Graph Pattern (bgp) queries, we added the hypermedia controls requested by the Triple Pattern Fragments (tpf) specification, which details how to filter the triples in this document based on their subject and/or predicate and/or object [4]. As the number of data facts that need to be published is small enough, we directed all controls to the main document, and did not expose extra server functionality.

The latest version of the Triple Pattern Fragments hypermedia specification can be found at https://www.hydra-cg.com/spec/latest/triple-pattern-fragments/.

Following these steps, we made the data queryable through clients that can exploit the Triple Pattern hypermedia controls, such as the Linked Data Fragments client available at http://client.linkeddatafragments.org. This client is able to query over multiple Triple Pattern Fragments interfaces at once, and can thus answer federated queries by following hypermedia controls. The following query selects the name, the real-time availability, and the geolocation from OpenStreetMap (LinkedGeoData) of a parking lot in Ghent with more than 200 spaces available: http://bit.ly/2jUNnES. This demonstrates that complex questions can still be answered over simple interfaces. The overall publishing approach is cost-efficient and stays as close as possible to the http protocol as a uniform interface. The only part where an extra technical investment was needed, was in documenting the definitions of the new uris for the parking sites.
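The exact query sits behind a shortened link, so the sketch below only illustrates the kind of basic graph pattern it expresses. The availability predicate is a placeholder and the real query may use different vocabulary terms; the text can be pasted into the client at http://client.linkeddatafragments.org with the parking fragments interface and LinkedGeoData configured as sources.

// Illustrative shape of the federated query described above.
// ex:availableSpaces is a placeholder predicate, not the published vocabulary.
const query = `
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX ex:   <http://example.org/parking#>

SELECT ?name ?available ?lat ?long WHERE {
  ?parking rdfs:label ?name ;
           ex:availableSpaces ?available ;  # real-time availability (placeholder term)
           geo:lat ?lat ;                   # geolocation, e.g. via LinkedGeoData
           geo:long ?long .
  FILTER (?available > 200)
}`;

console.log(query);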

2. public transit time schedules

The computation of routes within road networks and within public transport networks is remarkably different: while some speed-up methods work well for the former, they do not for the latter [5]. In this chapter, we will focus on public transit route planning, further on referred to as route planning.

A public transit network is considered, in its most essential form, to consist of stops, connections, trips, and a transfer graph [6]:

• A stop p is a physical location at which a vehicle can arrive, drop off passengers, pick up passengers and depart;

• A connection c represents a vehicle driving from a departure stop at a departure time, without intermediate halt, to an arrival stop at an arrival time;

• A trip is a collection of connections followed by a vehicle;

• A transfer is when a passenger changes vehicle at the same stop, or at a nearby stop with a certain path in between.

A network can also optionally contain routes: collections of trips that follow, to a certain extent, the same series of stops. As the number of routes is in many cases much smaller than the number of trips, the names of these routes are, for simplicity, commonly used to inform passengers. In some route planning algorithms, such as Raptor [7] or Trip-based route planning [8], a clustering algorithm may re-cluster the trips into routes favorable for the algorithm.

For future reference, an arrival and a departure are two concepts defined by a location and a timestamp. A departure at a certain stop, linked to an arrival at a different stop, thus creates a connection.
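To make these definitions concrete, the sketch below encodes a stop, a connection, a trip, and a transfer as plain JavaScript objects; all identifiers and times are invented for illustration.

// Plain-object sketch of the essential public transit concepts defined above.
const stops = {
  gentSP:   { id: 'http://example.org/stops/gent-sint-pieters', name: 'Gent-Sint-Pieters' },
  brusselZ: { id: 'http://example.org/stops/brussel-zuid', name: 'Brussel-Zuid' }
};

// A connection: one vehicle movement between two stops without intermediate halt.
const connection = {
  departureStop: stops.gentSP.id,
  departureTime: new Date('2017-04-02T12:00:00Z'),
  arrivalStop: stops.brusselZ.id,
  arrivalTime: new Date('2017-04-02T12:30:00Z'),
  trip: 'http://example.org/trips/IC-538'   // the trip this connection belongs to
};

// A trip is the collection of connections followed by one vehicle.
const trip = { id: 'http://example.org/trips/IC-538', connections: [connection] };

// A transfer: changing vehicles at the same or a nearby stop, with a minimum change time in seconds.
const transfer = { from: stops.brusselZ.id, to: stops.brusselZ.id, minimumChangeTime: 300 };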

2.1. Route planning queries

We differentiate various types of queries over public transport networks [9]:

• An Earliest Arrival Time (eat) query is a route planning question with a time of departure, a departure stop, and an arrival stop, expecting the earliest possible arrival time at the destination;

• The Minimum Expected Arrival Time (meat) is similar to an eat, yet it takes into account the probability of delayed or canceled connections;

• Given a departure stop, a profile query computes for every other stop the set of all earliest arrival journeys, for every departure from the departure stop;

• The multi-criteria profile query calculates a set of journeys of which none is dominated in arrival time or number of transfers, also called a set of Pareto-optimal journeys.

In route planning software, the eat problem is, among others, used for mobility studies, to calculate a latest arrival time for profile queries, or for preprocessing to speed up other methods. For providing actual route planning advice, a profile query is what most end-user interfaces expect. The same problem-solving algorithm can, with minimal adaptations, be used for the latest departure time problem, which requests the last possible time to leave in order to arrive on time at a certain stop.

2.2. Other queries

Within any of the given problems, a user may need extra personal features, such as the ability to request journeys that are wheelchair-accessible, journeys providing the most “happy” route [2], journeys where the user will have a seat [1], or journeys with different transfer settings depending on, e.g., reachability by foldable bike or criminality rates in a station.

Unrelated to the previous problems, end users also still want answers to other questions without a route planning algorithm being in place, such as getting a list of the next departures and arrivals at a certain stop, all stops a certain trip contains, or a list of all stops in the area.

2.3. Route planning algorithms

Only in the last decade have algorithms been built specifically for public transit route planning [10]. Two models were commonly used to represent such a network: a time-expanded model and a time-dependent model [11]. In a time-expanded model, a large graph is modeled with arrivals and departures as the nodes, and edges connect a departure and an arrival. The weights on these edges are constant. In a time-dependent model, a smaller graph is modeled in which vertices are physical stops and edges are transit connections between them. The weights on these edges change as a function of time. On both models, Dijkstra and Dijkstra-based algorithms can be used to calculate routes [11].

Raptor [7] was the first base algorithm to disregard Dijkstra-like graph algorithms and instead exploit the basic elements of a public transit network. It does this by studying the routes within a network. In each round, it computes the earliest arrival time for journeys with a per-round increasing number of transfers. It looks at each route in the network at most once per round. Using simple pruning rules and parallelization over multiple cores, the algorithm can be made even faster. It is currently the algorithm behind software like Open Trip Planner and Bliksemlabs’ RRRR software.

Trip-based route planning [8] uses the same ideas as Raptor and works with an array of trips. In a preprocessing phase, it links trips together when there is a possibility to transfer at a certain stop. During query execution, it can take into account the data generated during the preprocessing phase.

Transfer Patterns [10] preprocesses a timetable such that at execution time, it is known which transfers can be used at a certain departure stop and departure time. The preprocessing requires by design more cpu time and memory than the preprocessing of Trip-based route planning. Scalable Transfer Patterns [12] enhances these results by reducing both the necessary time and space consumption by an order of magnitude. For companies that have the processing power of idle machines at their disposal, the latter is a desirable approach, which makes it the current algorithm behind the Google Maps public transit route planner according to its authors.

Finally, the Connection Scan Algorithm (csa) is an approach for planning that models the timetable data as a directed acyclic graph [9]. By topologically sorting the graph by departure time, the shortest path algorithm only needs to scan through connections within a limited time window. csa can be extended to solve problems where it also keeps the number of transfers limited, as well as to calculate routes with uncertainty. The ideas behind csa can scale up to large networks by using multi-overlay networks [6].

It is a challenge to compare route planning algorithms, as they are often built for answering slightly different questions and are executed on top of different data models. When studying the state of the art in order to find a suitable data publishing model, we were looking for the smallest building block needed both for preprocessors and for actual query execution algorithms (with and without preprocessed data).

2.4. Exchanging route planning data over the Web

The General Transit Feed Specification (gtfs) is a framework for exchanging data from a public transit agency to third parties. gtfs is, at the time of writing, the de facto standard for describing and exchanging transit schedules. It describes the headers of several csv files combined in a zip file. Through a calendar.txt file, you are able to specify on which days of the week a certain service is going to take place during a part of the year. In a calendar_dates.txt file, you are able to specify exceptions to these calendar.txt rules, for example indicating holidays or extra service days. Using these two files, periodic schedules can be described. When an aperiodic schedule needs to be described, mostly only the calendar_dates.txt file is used to indicate when a certain service is running. A gtfs:Service contains all rules on which a gtfs:Trip is taking place, and a gtfs:Trip is a periodic repetition of a trip as defined earlier. Trips in gtfs contain multiple gtfs:StopTimes and/or gtfs:Frequencies. The former – mind the difference with a connection – describes a periodic arrival time and departure time at one certain gtfs:Stop. The latter describes the frequency at which a gtfs:Trip passes by a certain stop. gtfs also describes the geographic shape of trips, fare zones, accessibility, information about an agency, and so forth.
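As a sketch of how these two files interact, the snippet below expands one service into the concrete dates on which it runs. The field names follow the gtfs reference; the sample rows are invented.

// Sketch: expand a gtfs service into concrete dates, combining the weekly
// pattern from calendar.txt with the exceptions from calendar_dates.txt.
const calendarRow = {
  service_id: 'WEEKDAYS', monday: 1, tuesday: 1, wednesday: 1, thursday: 1,
  friday: 1, saturday: 0, sunday: 0, start_date: '20151001', end_date: '20151031'
};
const calendarDates = [
  { service_id: 'WEEKDAYS', date: '20151015', exception_type: 2 } // 2 = service removed (e.g., a holiday)
];

const WEEKDAY_FIELDS = ['sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday'];
const parse = s => new Date(Date.UTC(+s.slice(0, 4), +s.slice(4, 6) - 1, +s.slice(6, 8)));
const format = d => d.toISOString().slice(0, 10).replace(/-/g, '');

function serviceDates(calendar, exceptions) {
  const dates = new Set();
  for (let d = parse(calendar.start_date); d <= parse(calendar.end_date); d.setUTCDate(d.getUTCDate() + 1)) {
    if (calendar[WEEKDAY_FIELDS[d.getUTCDay()]] === 1) dates.add(format(d)); // weekly pattern
  }
  for (const ex of exceptions.filter(e => e.service_id === calendar.service_id)) {
    if (ex.exception_type === 1) dates.add(ex.date);    // extra service day
    if (ex.exception_type === 2) dates.delete(ex.date); // service removed that day
  }
  return [...dates].sort();
}

console.log(serviceDates(calendarRow, calendarDates)); // all weekdays of October 2015 except the 15th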

Other specifications exist, such as the European cen specification Transmodel, the Dutch specification bison, the Belgian specification bltac, and the Service Interface for Real-Time Information (siri) for describing real-time services. They each specify a serialization-specific format for exchanging data, and some specify the questions that should be exposed by a server.

To date, route planning solutions exist as services, such as Navitia.io or Plannerstack, in end-user applications such as CityMapper, Ally or Google Maps, or as open source software, such as Open Trip Planner or Bliksemlabs RRRR. Another common practice is that an agency, such as a railway company, exposes a route planner over http itself. Each of these route planners, however, has the disadvantage that it does not allow querying with an open world assumption: each response is final and is not supposed to be combined with other response documents.

We have given uris to the terms in the gtfs specification through the Linked gtfs vocabulary (base uri: http://vocab.gtfs.org/terms#). The definitions of the above terms can be looked up by replacing gtfs: with the base uri.

3. conclusion

If we may conclude anything from our first proof of concept within the city of Ghent, it is that the field of transport data is not much different from Open Data in general. If we want automated reuse of datasets within route planners, we still need to get some minimum requirements accepted in the entire transport domain. These requirements are situated on the same interoperability levels as those we were already discussing. For the domain-specific parts, we also need more specifications to become Web specifications and to define uris for their terms.

We have done this for both datex2 and gtfs.


Figure 1: When publishing data in fragments, a generic fragmentation strategy can be thought of for continuously updating data as well.

One challenge which we were able to overcome is publishing real-time data over http. The approach is again not much different from that for static documents; only now we need to ensure that the latency with the back-end system is minimized, while setting the right caching headers. Real-time data can be put in fragments as well, allowing multiple observations to be stored in one page. This can be annotated by using a named graph and the prov-o vocabulary. The dataset of parking lots, published as illustrated in Figure 1, is now available at https://linked.open.gent/parking.
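As an illustration of that annotation, the object below sketches how a page could wrap a set of timestamped observations in a named graph with prov:generatedAtTime. The parking vocabulary terms are placeholders; the published dataset may model this differently.

// Sketch of a fragment that stores several timestamped observations, wrapping
// each measurement set in a named graph annotated with prov:generatedAtTime.
const fragment = {
  '@context': {
    prov: 'http://www.w3.org/ns/prov#',
    ex: 'http://example.org/parking#'   // placeholder vocabulary for illustration
  },
  '@graph': [{
    '@id': 'https://example.org/parking/observations/2017-04-02T12:00',
    'prov:generatedAtTime': '2017-04-02T12:00:00Z',
    '@graph': [
      { '@id': 'https://stad.gent/id/parking/P1', 'ex:availableSpaces': 212 },
      { '@id': 'https://stad.gent/id/parking/P2', 'ex:availableSpaces': 35 }
    ]
  }]
};

console.log(JSON.stringify(fragment, null, 2));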

[Figure 2 axis labels: a GTFS(-RT) data dump (high adoption cost, high bandwidth, high availability) versus a route planning api (limited queries, low bandwidth, high server cost).]

Figure 2: The Linked Data Fragments (ldf) axis applied to public transit route planning: on both ends, centralization is key, as either the client or the server will centralize all data.

Finally, today one can see clear evidence of the two extremes on the ldf axis – applied to route planning in Figure 2 – within public transit timetables. In the next chapter, we will explore different trade-offs on this axis.


references

[1] Colpaert, P., Ballieu, S., Verborgh, R., Mannens, E. (2016). The impact of an extra feature on the scalability of Linked Connections. In proceedings of iswc2016.

[2] Quercia, D., Schifanella, R., Aiello, L. M. (2014). The shortest path to happiness: Recommending beautiful, quiet, and happy routes in the city. In proceedings of the 25th ACM conference on Hypertext and social media (pp. 116–125). ACM.

[3] Colpaert, P. (2014). Route planning using Linked Open Data. In proceedings of the European Semantic Web Conference (pp. 827–833).

[4] Verborgh, R., Vander Sande, M., Hartig, O., Van Herwegen, J., De Vocht, L., De Meester, B., Haesendonck, G., Colpaert, P. (2016, March). Triple Pattern Fragments: a Low-cost Knowledge Graph Interface for the Web. Journal of Web Semantics.

[5] Bast, H. (2009). Efficient Algorithms: Essays Dedicated to Kurt Mehlhorn on the Occasion of His 60th Birthday (pp. 355–367). Springer Berlin Heidelberg.

[6] Strasser, B., Wagner, D. (2014). Connection Scan Accelerated. Proceedings of the Meeting on Algorithm Engineering & Experiments (pp. 125–137).

[7] Delling, D., Pajor, T., Werneck, R. (2012). Round-Based Public Transit Routing. Proceedings of the 14th Meeting on Algorithm Engineering and Experiments (ALENEX’12).

[8] Witt, S. (2016). Trip-Based Public Transit Routing Using Condensed Search Trees. Computing Research Repository (corr).

[9] Dibbelt, J., Pajor, T., Strasser, B., Wagner, D. (2013). Intriguingly Simple and Fast Transit Routing. Experimental Algorithms (pp. 43–54). Springer.

[10] Bast, H., Carlsson, E., Eigenwillig, A., Geisberger, R., Harrelson, C., Raychev, V., Viger, F. (2010). Fast routing in very large public transportation networks using transfer patterns. Algorithms – esa 2010 (pp. 290–301). Springer.

[11] Pyrga, E., Schulz, F., Wagner, D., Zaroliagis, C. (2008). Efficient models for timetable information in public transportation systems. Journal of Experimental Algorithmics (JEA) (pp. 2–4). ACM.

[12] Bast, H., Hertel, M., Storandt, S. (2016). Scalable Transfer Patterns. In proceedings of the Eighteenth Workshop on Algorithm Engineering and Experiments (alenex). Society for Industrial and Applied Mathematics.

[13] Corsar, D., Markovic, M., Edwards, P., Nelson, J. D. (2015). The Transport Disruption Ontology. In proceedings of the International Semantic Web Conference 2015 (pp. 329–336).


Chapter 6
Public Transit Route Planning Over Lightweight Linked Data Interfaces

“ The impact of a feature on a Web api should be measured across implementations: measurable evidence about features should steer the api design decision process. ”
― Principle 5 of “A Web API ecosystem through feature-based reuse” by Ruben Verborgh and Michel Dumontier.


End-users want to plan public transit journeys based on parameters that exceed a data publisher’s imagination. In the previous chapter we have learned that today, on the one hand, data publishers provide route planning apis, which are rather expensive to host and do not allow a reuser to add functionality to the algorithm. On the other hand, the open data initiative advocates for data publishers to also provide data dumps with real-time changesets, which are however cost-intensive to integrate. We want to enable reusers to create route planners with different requirements, as well as enable public transit agencies to publish the data for unlimited reuse. In order to establish a new trade-off on the Linked Data Fragments axis, Linked Connections introduces a lightweight data interface, over which the base route planning algorithm Connection Scan [1] can be implemented on the client side. In this chapter, we report on testing query execution time and cpu usage of three set-ups that solve the Earliest Arrival Time (eat) problem using query mixes within Belgium: route planning on the server side, route planning on the client side without caching, and route planning on the client side with caching. Furthermore, we study how and where new features can be added to the Linked Connections framework by studying the feature of wheelchair accessibility.

When publishing data for maximum reuse, http caching can make sure that, through a standard adopted by various clients, servers as well as intermediary and neighborhood caches [2] can be exploited to reach the necessary user-perceived performance, as well as the scalability and cost-efficiency of the publishing infrastructure itself. For that purpose, we can take inspiration from the locality of reference principle. Time schedules are particularly interesting from both a geospatial and a temporal locality perspective. Can we come up with a fragmentation strategy for time schedules?

When publishing departure–arrival pairs (connections) in chronologically ordered pages – as demonstrated in our 2015 demo paper [3] – route planning can be executed at data integration time by the user agent rather than by the data publishing infrastructure. This way, Linked Connections (lc) allows for a richer web publishing and querying ecosystem within public transit route planners. It lowers the cost for reusers to start prototyping – also federated and multimodal – route planners over multiple data sources. Furthermore, other sources may give more information about a certain specific vehicle that may be of interest to this user agent. It is not exceptional that route planners also take into account properties such as fares [4], wheelchair accessibility [5], or criminality statistics. In this chapter, we test our hypothesis that this way of publishing is more lightweight: how does our information system scale under more data reuse compared to origin–destination apis?

The rest of this chapter is structured in a traditional academic way: we first introduce more details on the Linked Connections framework, then describe the evaluation’s design, made entirely reproducible, and discuss our open-source implementation. Finally, we elaborate on the results and conclusion.

1. the linked connections framework

The time performance of csa, discussed in the previous chapter, is O(n), with n the number of input connections. These input connections are a subset of the set of connections in the real world. The smaller this subset, the lower the time consumption of the algorithm will be. Lowering the number of connections can be done using various methods involving preprocessing, end-user preferences, multi-overlay networks, geo-filtering, or other heuristics. The csa algorithm allows for an Open World Assumption while querying. The input connections that are fed into the algorithm are not all connections in existence, and when fed with other input connections, it may find a better journey. Furthermore, the stops and trips are discovered while querying: there is no need to keep a list of all trips and stops in existence. Thanks to the locality, extensibility and “open world” properties, three properties favorable for a Web environment, csa was chosen as the base algorithm for our framework.

IN: Ordered list of connections C, query q

arrivalTimes[q.departureStop] ← q.departureTime;
arrivalTimes[q.destinationStop] ← ∞;
c ← C.next();
while (c.departureTime <= arrivalTimes[q.destinationStop]) {
  if (arrivalTimes[c.departureStop] <= c.departureTime
      AND arrivalTimes[c.arrivalStop] > c.arrivalTime) {
    arrivalTimes[c.arrivalStop] ← c.arrivalTime;
    minimumSpanningTree[c.arrivalStop] ← c;
  }
  c ← C.next();
}
return reconstructRoute(minimumSpanningTree, q);

Code snippet 1: Pseudo code solving the earliest arrival time problem using the Connection Scan Algorithm
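For completeness, here is a runnable JavaScript rendering of Code snippet 1. It is only a sketch under the assumption that the connections array is already ordered by departure time; it is not the csa.js library used later in this chapter.

// Minimal runnable sketch of csa for the earliest arrival time (eat) problem.
// `connections` must be ordered by departure time; stops are identifiers,
// times are numbers or Date objects (anything comparable with < and <=).
function earliestArrivalTime(connections, query) {
  const arrivalTimes = { [query.departureStop]: query.departureTime };
  const previousConnection = {}; // per stop: the connection used to reach it

  for (const c of connections) {
    const bestArrival = arrivalTimes[query.destinationStop];
    // Connections are sorted, so scanning can stop once they depart
    // later than the best arrival time found so far.
    if (bestArrival !== undefined && c.departureTime > bestArrival) break;

    const reachableAt = arrivalTimes[c.departureStop];
    const currentBest = arrivalTimes[c.arrivalStop];
    if (reachableAt !== undefined && reachableAt <= c.departureTime &&
        (currentBest === undefined || c.arrivalTime < currentBest)) {
      arrivalTimes[c.arrivalStop] = c.arrivalTime;
      previousConnection[c.arrivalStop] = c;
    }
  }

  // Walk back from the destination to reconstruct the journey.
  const journey = [];
  let stop = query.destinationStop;
  while (previousConnection[stop]) {
    journey.unshift(previousConnection[stop]);
    stop = previousConnection[stop].departureStop;
  }
  return journey.length ? journey : null; // null when the destination is unreachable
}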

To solve the eat problem using csa, timetable information within a certain time window needs to be retrievable. Other route planning questions need to select data within the same time window, and solving these is expected to have similar results. Instead of exposing an origin–destination api or a data dump, a Linked Connections server paginates the list of connections in departure time intervals and publishes these pages over http. A public transit timetable, in the case of the base algorithm csa, is represented by a list of connections, ordered by departure time. Each page contains a link to the next and previous one. In order to find the first page a route planning query needs, the document at the entry point given to the client contains a hypermedia description of how to discover a certain departure time. Both hypermedia controls are expressed using the Hydra vocabulary.

Also preprocessing algorithms [6] scan through an ordered list of connections to, for example, find transfer patterns [7].

The Hydra core vocabulary is documented at http://www.hydra-cg.com/spec/latest/core/.

The base entities that we need to describe are connections, which we documented using the lc Linked Data vocabulary (base uri: http://semweb.mmlab.be/ns/linkedconnections#). Each connection entity – the smallest building block of time schedules – provides links to an arrival stop and a departure stop, and optionally to a trip. It contains two literals: a departure time and an arrival time. Linked gtfs can be used to extend a connection with public transit specific properties such as a headsign, drop-off type, or fare information. Furthermore, it also contains concepts to describe transfers and their minimum change time. For instance, when transferring from one railway platform to another, Linked gtfs can indicate that the minimum change time from one stop to another is a certain number of seconds.

Figure 1: An lc server fragments and publishes a long list of connections, ordered by departure time.

I implemented this hypermedia control as a redirect from the entry point to the page containing connections for the current time. Then, on every page, a description can be found of how to get to a page describing another time range. In order to limit the number of possible documents, we only enable pages for each X minutes, and do not allow pages describing overlapping time intervals. When a time interval is requested for which a page does not exist, the user agent will be redirected to a page containing connections departing at the requested time. The same page also describes how to get to the next or previous page. This way, the client can be certain about which page to ask for next, instead of constructing a new query for a new time interval.

X is a configurable amount.
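A user agent can combine these hypermedia controls with the scan loop sketched earlier. The snippet below is a simplified illustration: it reads the json-ld pages as plain json with the context of Code snippet 2, assumes a hypothetical entry point url, and leaves out error handling and caching.

// Simplified sketch of a user agent that asks for a departure-time page and
// follows hydra:next links until the scan window is covered.
async function scanConnections(entryPoint, departureTime, onConnection, shouldStop) {
  // The server redirects a ?departureTime query to the page for that interval.
  let pageUrl = `${entryPoint}?departureTime=${departureTime.toISOString()}`;
  while (pageUrl) {
    const page = await (await fetch(pageUrl, { headers: { Accept: 'application/ld+json' } })).json();
    for (const c of page['@graph']) {
      if (shouldStop(c)) return;   // e.g., departure time past the best arrival time found so far
      onConnection(c);             // feed the connection to the route planning algorithm
    }
    pageUrl = page['hydra:next'];  // keep following pages forward in time
  }
}

// Hypothetical usage: print one hour of connections from an assumed entry point.
const start = new Date('2017-04-02T12:00:00Z');
scanConnections('https://example.org/connections', start,
  c => console.log(c['lc:departureTime'], c['lc:departureStop'], '→', c['lc:arrivalStop']),
  c => new Date(c['lc:departureTime']) > new Date(start.getTime() + 3600 * 1000));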

{
  "@context": {
    "lc": "http://semweb.mmlab.be/ns/linkedconnections#",
    "hydra": "http://www.w3.org/ns/hydra/core#",
    "gtfs": "http://vocab.gtfs.org/terms#",
    "cc": "http://creativecommons.org/ns#"
  },
  "@id": "http://{host}/2017/apr/2",
  "@type": "hydra:PagedCollection",
  "cc:license": "http://creativecommons.org/publicdomain/zero/1.0/",
  "hydra:next": "http://{host}/2017/apr/3",
  "hydra:previous": "http://{host}/2017/apr/1",
  "hydra:search": {
    "@type": "hydra:IriTemplate",
    "hydra:template": "http://{host}/{?departureTime}",
    "hydra:variableRepresentation": "hydra:BasicRepresentation",
    "hydra:mapping": {
      "@type": "IriTemplateMapping",
      "hydra:variable": "departureTime",
      "hydra:required": true,
      "hydra:property": "lc:departureTimeQuery"
    }
  },
  "@graph": [
    {
      "@id": "http://{host}/a24fda19",
      "@type": "lc:Connection",
      "lc:departureStop": "http://{host}/stp/1",
      "lc:departureTime": "2017-04-02T12:00:00Z",
      "lc:arrivalStop": "http://{host}/stp/2",
      "lc:arrivalTime": "2017-04-02T12:30:00Z",
      "gtfs:trip": "http://{host}/trips/2017/1"
    },
    …
  ]
}

Code snippet 2: Example of a Linked Connections server response in json-ld. The context is, according to the json-ld specification, the part where terms are mapped to uris. After that, the hypermedia section follows, where the current page is linked to the next and previous page. Furthermore, it is also described how to query for connections starting at a certain departure time. The remainder of the file provides an example of a connection. It describes both a departure and an arrival, and it is given a unique identifier. Furthermore, it is extended with a trip id from the gtfs vocabulary.


A naive implementation of federated route planning on top of this interface is also provided, as a client can be configured with more than one entry point. The client then performs the same procedure multiple times in parallel. The “connection streams” are merged and sorted as they are downloaded. A Linked Data solution ensures that the client knows how to link a stop from one agency to a stop from another.
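The merging step can be pictured as a k-way merge over streams that are each already ordered by departure time. The sketch below does this over plain arrays standing in for the download streams of the actual client.

// Sketch of merging connection streams from multiple entry points while
// keeping the overall order by departure time (a k-way merge).
function* mergeByDepartureTime(...streams) {
  const iterators = streams.map(s => s[Symbol.iterator]());
  const heads = iterators.map(it => it.next());
  while (heads.some(h => !h.done)) {
    let best = -1;
    for (let i = 0; i < heads.length; i++) {
      if (heads[i].done) continue;
      if (best === -1 || heads[i].value.departureTime < heads[best].value.departureTime) best = i;
    }
    yield heads[best].value;              // emit the earliest departing connection
    heads[best] = iterators[best].next(); // advance only the stream we consumed from
  }
}

// Hypothetical usage with two already-sorted agency feeds.
const agencyA = [{ departureTime: '2017-04-02T12:00:00Z', id: 'a1' }];
const agencyB = [{ departureTime: '2017-04-02T11:55:00Z', id: 'b1' }];
console.log([...mergeByDepartureTime(agencyA, agencyB)].map(c => c.id)); // [ 'b1', 'a1' ]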

2. evaluation design

We implemented different components in JavaScript for the Node.js platform. We chose JavaScript as it allows us to use the components both in a command-line environment as well as in the browser.

csa.js
A library that calculates a minimum spanning tree and a journey, given a stream of input connections and a query

Server.js
Publishes streams of connections in json-ld, using a MongoDB to retrieve the connections itself

Client.js
Downloads connections from a configurable set of servers and executes queries

gtfs2lc
A tool to convert existing timetables published as open data to the Linked Connections vocabulary

The Belgian railway company estimates delays on its network each minute.

We set up two different servers, each connected to the same MongoDB database, which stores all connections of the Belgian railway system for October 2015:

1. A Linked Connections server with an nginx proxy cache in front, which adds caching headers configured to cache each resource for one minute and compresses the body of the response using gzip;

2. A route planning server which exposes an origin–destination route planning interface on top of the data stored in MongoDB, using the same csa.js library. The code is available at https://github.com/linkedconnections/query-server.

In the lc without cache set-up, the lc server is used as a data source directly by the end-user’s machine. A demo of an implementation in a browser can be viewed at http://linkedconnections.org, where client.js was used as a library in the browser. The lc with cache set-up is where one user agent does all the querying on behalf of all end-users. Finally, there is the traditional origin–destination approach, in which a data maintainer publishes a route planning service over http where one http request equals one route planning question; this was developed to be able to compare with the previous two set-ups.

Figure 2: (a) Client-side route planning; (b) Client-side route planning with client-side cache; (c) Server-side route planning

These tools are combined into different set-ups:

Client-side route planning
The first experiment executes the query mixes using the Linked Connections client. Client caching is disabled, making this simulate the lc without cache set-up, where every request could originate from an end-user’s device.

Client-side route planning with client-side cache
The second experiment does the same as the first experiment, except that it uses a client-side cache, and simulates the lc with cache set-up.

Server-side route planning
The third experiment launches the same queries against the full route planning server. The query server code is used, which relies on the same csa.js library as the client in the previous two experiments.

In order to have an upper and lower bound of a real-world scenario, the first approach assumes every request comes from a unique user agent which cannot share a cache and has caching disabled, while the second approach assumes one user agent does all requests for all end-users and has caching enabled. Different real-world caching opportunities – such as user agents serving multiple end-users in a software as a service model, or shared peer-to-peer neighborhood caching [2] – will result in a scalability in between these two scenarios.

We also need a good set of queries to test our three set-ups with. The iRail project provides a route planning api to apps with over 100k installations on Android and iPhone [8]. The api receives up to 400k route planning queries per month. As the query logs of the iRail project are published as open data, we are able to create realistic query mixes from them with different loads. The first mix contains half the query load of iRail during 15 minutes of a peak hour on the first of October: it is generated by taking the normal query load of iRail during those 15 minutes, randomly ordering the queries, and keeping half of the lines. Our second mix is the normal query load of iRail, selected from the query logs; for each query, we calculated at which second after the start of the benchmark script it should be executed. The third mix is double the load of the second mix, obtained by taking the next 15 minutes of logs, subtracting 15 minutes from the requested departure times, and merging the result with the second mix. The same approach is applied, with limited changes, for the subsequent query mixes, which are taken from the rush hours of the next days. Our last query mix is 16 times the query load of iRail on the 1st of October 2015. The resulting query mixes can be found at https://github.com/linkedconnections/benchmark-belgianrail.
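As an illustration, a minimal sketch of how a doubled mix could be derived from a base mix; the log fields (departureTime, executionOffset) and the merge step are assumptions of this sketch, not the published benchmark scripts:

```javascript
// Sketch: double a query mix by shifting the next 15 minutes of logged queries
// back by 15 minutes and merging them with the base mix.
// The field names below are illustrative assumptions.
const FIFTEEN_MINUTES = 15 * 60 * 1000; // in milliseconds

function doubleQueryMix(baseMix, nextWindow) {
  const shifted = nextWindow.map(query => ({
    ...query,
    departureTime: new Date(query.departureTime.getTime() - FIFTEEN_MINUTES),
    executionOffset: query.executionOffset - FIFTEEN_MINUTES
  }));
  // Keep the mix ordered by the moment at which each query has to be fired.
  return baseMix.concat(shifted)
    .sort((a, b) => a.executionOffset - b.executionOffset);
}
```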

These query mixes are then used to reenact a real-world query load in three different experiments on the three different architectures. We ran the experiments on a quad-core Intel(R) Core(TM) i5-3340M cpu @ 2.70GHz with 8GB of RAM. We launched the components in a single thread, as our goal is to see how fast cpu usage increases when trying to answer more queries. The results will thus not reflect the full capacity of this machine, as in production one would run the applications on different worker threads.

Our experiments can be reproduced using the code at https://github.com/linkedconnections/benchmark-belgianrail. There are three metrics we gather with these scripts: the cpu time used by the server instance (http caches excluded), the bandwidth used per connection, and the query execution time per lc connection. For the latter two, we use a per-connection result to remove the influence of route complexity, as the time complexity of our algorithm is O(n) with n the total number of connections. We thus study the average bandwidth and query execution time needed to process one connection per route planning task.
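For instance, the per-connection normalisation boils down to a division like the following; the measurement field names are assumptions of this sketch rather than the exact benchmark output:

```javascript
// Sketch: normalise one measured query by the number of connections it scanned,
// so that route complexity does not dominate the comparison between set-ups.
function perConnectionMetrics(measurement) {
  return {
    timePerConnection: measurement.queryTimeMs / measurement.connectionsScanned,
    bytesPerConnection: measurement.bytesDownloaded / measurement.connectionsScanned
  };
}
```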

3. results of the overall cost-efficiency test

Figure 3 depicts the percentage of the time over 15 minutes that the server thread was active on a cpu. The server cpu time was measured with the command pidstat from the sysstat package and indicates the server load. The faster this load increases, the sooner extra cpus would be needed to answer more queries.


[Figure 3 plots the server cpu time (%) against x times the iRail query load for the query server, lc with client cache, and lc without client cache.]

Figure 3: cpu time consumed by the server in 3 different scenarios shows that Linked Connections, even without caching, is more light-weight than the traditional approach.

When the query load is half of the real iRail load of October 1st 2015, the lowest server load is observed for the lc set-up with client cache. About double that load is needed by the lc without client cache set-up, and even more by the query server. When doubling the query load, the load lowers slightly for lc with cache, while the other two rise slightly. Continuing this up to 16 times the iRail query load, the load of the query server rises to almost 100%, while lc without cache rises to 30%, and with client cache the measured server load is 12%.


Table 5: Server cpu usage under increasing query load shows that the Linked Connections server is more cost-efficient: more queries can be answered on one core.

query load | lc no cache | lc with cache | Query server
0.5        | 4.61%       | 2.44%         | 4.95%
1          | 4.97%       | 2.32%         | 5.62%
2          | 7.21%       | 3.50%         | 10.41%
4          | 11.23%      | 5.39%         | 21.37%
8          | 17.69%      | 7.98%         | 35.77%
12         | 20.34%      | 9.38%         | 70.24%
16         | 28.63%      | 11.71%        | 97.01%

In Figure 4, we can see the query response time per connection for 90% of the queries. When the query load is half of the real iRail load of October 1st 2015, the fastest solution is the query server, followed by lc with cache. When doubling the query load, the average query execution time is lower in all cases, resulting in the same ranking. When doubling the query load once more, lc with client cache becomes the fastest solution. At 12 times the query load, lc without client cache also becomes faster than the query server. The trend continues until, at 16 times the query load, the query server takes remarkably longer to answer 90% of the queries than the Linked Connections solutions.


[Figure 4 plots the query time per connection (ms) against x times the real iRail query load for the query server, lc with client cache, and lc without client cache.]

Figure 4: This figure shows the average of the response times divided by the number of connections that needed to be processed.

The average bandwidth consumption per connection shows the price of the decreased server load: lc without cache consumes 270B per connection, lc with cache 64B, and the query server 0.8B. The query server only gives one, small, response per route planning question, so the lc solution without client cache needs three orders of magnitude more bandwidth than the query server. The lc solution with a client cache has a remarkably lower average bandwidth consumption per connection. On the basis of these numbers we may conclude that the average cache hit-rate is about 78%.


Table 6: Time in ms per connection within which 90% of the journeys are found, under increasing query load.

query load | lc no cache | lc with cache | Query server
0.5        | 0.319       | 0.211         | 0.130
1          | 0.269       | 0.165         | 0.125
2          | 0.266       | 0.177         | 0.151
4          | 0.273       | 0.177         | 0.188
8          | 0.302       | 0.211         | 0.255
12         | 0.389       | 0.307         | 0.539
16         | 0.649       | 0.369         | 3.703

4. linked connections with wheelchair accessibility

In order to study how to add extra features to this framework, we performed another experiment to test the impact of such an extra feature on both the client and the server. A wheelchair accessible journey has two important requirements. First, all vehicles used for the route should be wheelchair accessible: the trip (a sequence of stops that are served by the same vehicle) should have a wheelchair accessible flag set to true when there is room to host a wheelchair on board. Secondly, every transfer stop – the stop where a person needs to change from one vehicle to another – should be adapted for people with limited mobility. As the lc server only publishes the time tables in pages and does not know where the traveler is going to hop on, it will not be able to filter on wheelchair accessible stops.

When extending the framework of Linked Connections with a wheelchair accessibility feature, there are two possible ways of implementing this:

1. Filter both the wheelchair accessible trips and stops on the client;

2. Only filter the wheelchair accessible trips on the server, while filtering the stops on the client.

4.1. Linked Connections with filtering on the client

In a first approach, the server does not expose wheelchair accessibility information and only exposes the Linked Data of the train schedules using the Linked Connections vocabulary. The client calculates wheelchair accessible journeys on the basis of its own sources. We thus do not extend the server: the same server interface is used as in lc without wheelchair accessibility. The lc client, however, is extended with two filter steps. The first filter removes all connections from the connections stream whose trip is not wheelchair accessible. The second filter is added to csa, to filter the transfer stops. When csa adds a new connection to the minimum spanning tree, it detects whether this would lead to a transfer. csa can use the information about stops in the Linked Open Data cloud to decide whether the transfer can be made and the connection can be added to the minimum spanning tree. To be able to support dynamic transfer times, csa can also request Linked Data on “transfers” from another source. When a transfer is detected, csa will add the transfer time to the departure time of the connection. The trips, stops, and transfers in our implementation are simple json-ld documents. They are connected to a module called the data fetcher that takes care of the caching and pre-fetching of the data.
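A minimal sketch of these two filter steps, assuming plain JavaScript objects for connections, trips, stops, and transfers; the property names are illustrative assumptions and do not follow the exact Linked Connections vocabulary or client API:

```javascript
// Sketch: client-side wheelchair accessibility filtering.
// Property names are illustrative assumptions, not the actual lc client API.
const DEFAULT_TRANSFER_TIME = 5 * 60 * 1000; // fallback transfer time in ms

// Filter 1: drop connections whose trip is not wheelchair accessible.
function filterAccessibleTrips(connections, trips) {
  return connections.filter(connection =>
    trips[connection.trip] && trips[connection.trip].wheelchairAccessible);
}

// Filter 2: inside csa, only allow a transfer at stops adapted for wheelchairs,
// and add the (possibly dynamic) transfer time before the next departure.
function canTransfer(arriving, departing, stops, transfers) {
  const stop = stops[arriving.arrivalStop];
  if (!stop || !stop.wheelchairAccessible) {
    return false; // the transfer stop is not adapted for limited mobility
  }
  const transferTime = transfers[arriving.arrivalStop] || DEFAULT_TRANSFER_TIME;
  return departing.departureTime >= arriving.arrivalTime + transferTime;
}
```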

This solution provides an example of how the client can now calculate routes using more data than the data published by the transit companies. As wheelchair accessibility is specified in the gtfs standard, we can expect the public transit companies to provide this data. In practice we have noticed that the data is mostly not available, which can be considered normal for an optional field. This makes us believe that an external organisation representing the interests of people with limited mobility should be able to publish this data.


4.2. Linked Connections with filtering on the server and client

In the Linked Connections solution with filtering on both server and client-side, the server still publishes the time schedules as Linked Data, yet an extra hypermedia control is added to the lc server to enable the trips filter on the resulting connections. The wheelchair accessibility information is added directly to the server's database, and an index is configured, so that the lc server can query for it when asked.
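A minimal sketch of what this could look like with the MongoDB Node.js driver; the database, collection, and field names are assumptions of this sketch and not the exact schema of the proof of concept:

```javascript
// Sketch: serve a page of connections whose trips are wheelchair accessible.
// Database, collection, and field names are illustrative assumptions.
const { MongoClient } = require('mongodb');

async function getAccessiblePage(mongoUrl, departureTime, pageSize) {
  const client = await MongoClient.connect(mongoUrl);
  const connections = client.db('lc').collection('connections');

  // The index lets the server answer the extra filter without scanning all pages.
  await connections.createIndex({ departureTime: 1, wheelchairAccessible: 1 });

  // Only connections of wheelchair accessible trips end up in the page.
  const page = await connections
    .find({ departureTime: { $gte: departureTime }, wheelchairAccessible: true })
    .sort({ departureTime: 1 })
    .limit(pageSize)
    .toArray();

  await client.close();
  return page;
}
```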

5. wheelchair accessibility feature experiment

5.1. Evaluation design

The purpose of the evaluation is to compare the two lc solutions based on the scalability of the server interface, the query execution time of an eat query, and the cpu time used by the client. For this evaluation, we used the same query mixes as in the previous experiment.

Two lc servers were set up: one for the lc with filtering on the client set-up and one for the lc with trip filtering on the server set-up. The MongoDB instances used by the lc servers are populated with the 2015 connections of the Belgian railway company. In addition, transfers, trips, and stops resources were created as a data source for the wheelchair accessibility information.

5.2. Results

Figure 5 contains the cpu load of the server. The server interface with filtering on the server has a similar scalability to the interface without the filter functionality, although on average 1.24% more processing power is needed.


[Figure 5 plots the server cpu time (%) against x times the query mix for lc with accessibility filtering on the client and lc with accessibility filtering on the server.]

Figure 5: Server cpu load under increasing query load shows that wheelchair accessibility filtering on the server takes more effort for the server under all query loads.

In Figure 6, we can see the cpu load of the client performing the algorithm. When the query load is half of the real iRail load, the load is 20% for filtering on the client and 21% for trip filtering on the server. Continuously increasing the query load results in an increased client cpu load: the client load of the solution with filtering on the client increases from 27% at query mix 1 to 93% at query mix 16, while the solution with filtering on the server-side increases from 30% to 93% respectively. For query loads higher than 12 times the iRail load, the difference between the two solutions becomes less than 1%.


[Figure 6 plots the client cpu usage (%) against x times the query mix for lc with accessibility filtering on the client and lc with accessibility filtering on the server.]

Figure 6: Client cpu load under increasing query load is higher for the lc set-up with filtering on both the client and server-side than for filtering on the client-side only.

Figure 7 shows the average query response time per connection. The fastest solution is the one with filtering on the client for lower query loads, yet for higher query loads the execution times become comparable.

[Figure 7 plots the execution time per connection (ms) against x times the query mix for lc with accessibility filtering on the client and lc with accessibility filtering on the server.]

Figure 7: The average time needed to process one connection in milliseconds shows that for the normal query loads the Linked Connections solution with trip filtering on the server is on average slower than the solution with filtering only on the client. When the query load is 8 times the normal query load, we can no longer notice a difference.


At half of the normal iRail query load, the measured cache hit rate is 76% for filtering on the client and 70% for trip filtering on the server. The cache hit rate slowly decreases to 70% and 64% respectively at query mix 8, and finally to 64% and 61% at the highest query mix, 16. Over all query loads, the cache hit rate measured for the Linked Connections framework with filtering on the client is thus higher than for the framework with filtering on both client and server-side.

When comparing the two lc implementations, we observe that moving the trips filter from the client to the server-side does not improve the query execution time. The cache performance is better for the client-side filtering solution because the hybrid solution needs an extra parameter to query the lc server. This results in more unique requests and consequently in a lower cache hit rate. The loss of cache performance increases the number of requests that the lc server needs to handle, which leads to a higher cpu load at the server-side and an increased query execution time. The difference becomes smaller as the query load increases, as more queries can use the already cached Linked Connections documents.

6. discussion

The results look promising: thanks to a cache hit-rate of 78%, the Linked Connections server is able to handle more queries than the traditional query server approach.

6.1. Network latency

In the evaluation, we tested two extremes: one where the user agent can cache nothing, being an end-user application, and another where the user agent can cache all end-user questions. The latter can be an intermediate traditional api with a back-end that uses something similar to the Linked Connections client. This way, the cost-efficient interfaces of Linked Connections can also be used as a new way to exchange public transit data, while the end-users still get small responses.

6.2. Actual query response times

The average number of connections in the journeys of the results, multiplied by the query execution time per connection, gives an idea of the average waiting time per route planning query. For the Belgian railway system, based on the iRail query logs, we scan an average of 2700 connections per query. With the entire Belgian railway system as input connections, route planning queries will thus on average return results under 2 seconds, even in the case of 16 times the iRail query load (e.g., 2700 connections × 0.649 ms per connection for lc without cache ≈ 1.8 s), again with network latency not taken into account.

6.3. More advanced federated route planning

The current approach is simple: when multiple entrypoints are configured, data will be downloaded by following links in both entrypoints. The stream of connections that comes in is ordered just in time and provided to the csa algorithm. Further research and engineering is needed to see how sources could be selected just in time. E.g., when planning a route from one city to another, a planning algorithm may first prioritize bus routes to border stations in the first city, then query only the overarching railway network, and finally query only the bus company in the destination city.

6.4. Denser public transport networks

The number of connections in a densely populated area like Paris or London is indeed much higher than in other areas. In order to scale this up to denser areas, the key would be to split the dataset into smaller regions. This would allow federated querying techniques to prune the parts of the Linked Connections datasets that need to be queried. This fragmentation strategy could also allow for parallel processing, limiting the time the client is waiting for data to be downloaded.

A generic approach for fragmentation strategies for data in ordered ranges has been put forward in the multidimensional interfaces ontology [9], which takes inspiration from B-trees and R-trees. We leave it as a challenge to future research to apply concepts like multi-overlay networks [1] to Linked Connections.

6.5. Real-time updates: accounting for delays

As the lc pages in the proof of concept are generated from a MongoDB store, updating a departure time or arrival time will result in the new data being queryable online. However, user agents querying pages may experience problems when connections suddenly shift pages. One possible solution is to use Memento on top of the http server, making the lc client able to query a fixed version. Another solution would be to keep the updated connection in all pages. When the algorithm detects a duplicate connection, it can take the last version of the object or choose to restart the query execution.
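A minimal sketch of the second option on the client side, assuming every connection carries an @id and an update timestamp; both property names are assumptions of this sketch:

```javascript
// Sketch: keep only the most recent version of each connection while merging pages.
// The '@id' and 'updatedAt' properties are illustrative assumptions.
function deduplicateConnections(connections) {
  const latest = new Map();
  for (const connection of connections) {
    const seen = latest.get(connection['@id']);
    if (!seen || connection.updatedAt > seen.updatedAt) {
      latest.set(connection['@id'], connection);
    }
  }
  // csa expects the connections stream ordered by departure time.
  return [...latest.values()].sort((a, b) => a.departureTime - b.departureTime);
}
```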

6.6. Beyond the eat query

The basic csa algorithm solving the eat problem can be extended to various other route planning questions, for which we refer to the paper in which csa was introduced [1]. Other queries may for instance include Minimum Expected Arrival Times, a set of Pareto-optimal routes [1], isochrone studies, or analytical queries studying connections with a certain property. For all of these queries, it is never necessary to download the entire dataset: only the window of time that is interesting to the query needs to be used.

6.7. On disk space consumption

gtfs contains rules, while the Linked Connections model contains the result of these rules. In the current implementation, the result of a 2MB gtfs can easily be 2GB of Linked Connections. However, on the one hand, we can come up with systems that generate Linked Connections pages dynamically from a different kind of rules. On the other hand, keeping an identifier on disk for each connection that ever happened and that is planned to happen is an interesting idea for analytical purposes. E.g., in Belgium, iRail is now using the Linked Connections model to keep a history of train delays and an occupancy score [8], and to execute analytical queries over it.

6.8. Non-measurable benefits

While bandwidth, cpu load, and query execution times are measurable, there are also other design aspects to take into account which are not directly measurable. For instance, in the case where end-user machines execute the route planning algorithm, privacy is engineered by design: a man in the middle would not be able to determine from where to where the end-user is going. As the client is now in control of the algorithm, we can give personalized weights to connections based on the subscriptions the end-user has, or calculate different transfer times based on how fast the end-user can walk, without this information having to be shared with the public transit agencies.

7. conclusion

We measured and compared the rise in query execution time and cpu usage between the traditional approach and Linked Connections. We achieved a better cost-efficiency: when the query interface becomes saturated under an increasing query load, the lightweight lc interface has only reached 1/4th of its capacity, meaning that the same load can be served with a smaller machine, or that a larger number of queries can be answered by the same server. As the server load increases, the lc solution even – counter-intuitively – gives faster query results.

These results are strong arguments in favor of publishing timetable data in cacheable fragments instead of exposing origin-destination query interfaces when publishing data for maximum reuse is envisioned. The price of this decreased server load is however paid in bandwidth, which is three orders of magnitude higher. When route planning advice needs to be calculated while on, for example, a mobile phone network, network latency – which was not taken into account during the tests – may become a problem when the cache of the device is empty. An application's server can however be configured with a private origin-destination api, which in its turn is a consumer of a Linked Connections dataset, taking the best of both worlds. When exposing a more expressive server interface, caution is advised: in the case of a wheelchair accessibility feature, for instance, the information system would become less efficient when exposing this feature on the server interface.

Our goal was to enable a more flexible public transport route planning ecosystem. While even personalized routing is now possible, we also lowered the cost of hosting the data and enabled in-browser scripts to execute the public transit routing algorithm. Furthermore, the query execution times achieved with the Linked Connections framework are competitive. Until now, public transit route planning was a specialized domain where all processing happened in memory on one machine. We hope that this is the start of a new ecosystem of public transit route planners.

references

[1] Strasser, B., Wagner, D. (2014). Connection Scan Accelerated. Proceedings of the Meeting on Algorithm Engineering & Experiments (pp. 125–137).

[2] Folz, P., Skaf-Molli, H., Molli, P. (2016). CyCLaDEs: A Decentralized Cache for Triple Pattern Fragments. The Semantic Web. Latest Advances and New Domains: 13th International Conference (pp. 455–469). Springer International Publishing.

[3] Colpaert, P., Llaves, A., Verborgh, R., Corcho, O., Mannens, E., Van de Walle, R. (2015). Intermodal Public Transit Routing using Linked Connections. In proceedings of International Semantic Web Conference (Posters & Demos).

[4] Delling, D., Pajor, T., Werneck, R. (2012). Round-Based Public Transit Routing. Proceedings of the 14th Meeting on Algorithm Engineering and Experiments (ALENEX'12).

[5] Colpaert, P., Ballieu, S., Verborgh, R., Mannens, E. (2016). The impact of an extra feature on the scalability of Linked Connections. In proceedings of iswc2016.

[6] Witt, S. (2016). Trip-Based Public Transit Routing Using Condensed Search Trees. Computing Research Repository (corr).

[7] Bast, H., Carlsson, E., Eigenwillig, A., Geisberger, R., Harrelson, C., Raychev, V., Viger, F. (2010). Fast routing in very large public transportation networks using transfer patterns. Algorithms – esa 2010 (pp. 290–301). Springer.

[8] Colpaert, P., Chua, A., Verborgh, R., Mannens, E., Van de Walle, R., Vande Moere, A. (2016, April). What public transit api logs tell us about travel flows. In proceedings of the 25th International Conference Companion on World Wide Web (pp. 873–878).

[9] Taelman, R., Colpaert, P., Verborgh, R., Mannens, E. (2016, August). Multidimensional Interfaces for Selecting Data within Ordinal Ranges. Proceedings of the 7th International Workshop on Consuming Linked Data.


Chapter 7
Conclusion

“ The future is so bright, we will have to wear shades. ”
― Erik Mannens.


Lowering the cost to reuse a dataset can be done by raising its interoperability with other datasets. A developer can then automate the reuse of more datasets when a user agent can recognize common elements. In this dissertation, I studied how this data source interoperability of Open (Transport) Datasets can be raised. I introduced five layers of data source interoperability that create a framework to answer these questions: legal, technical, syntactic, semantic, and querying. On the one hand, these five layers were used for qualitative research studying public administrations in Flanders. On the other hand, they were used to design a public transport data publishing framework called Linked Connections (lc). With lc, I researched a new trade-off for publishing public transport data by evaluating its cost-efficiency. The trade-off chosen allows for flexibility on the client-side with a user-perceived performance that is comparable to the state of the art. When publishing data in any domain – or when creating standards for publishing data – a similar exercise should be made. I summarize the key takeaways in 8 minimum requirements for designing your next http Open Data publishing interface.

The research question as discussed in Chapter 1 was “How can the data source interoperability of Open (Transport) Data be raised?”. In Chapter 2, I introduced the theoretical framework to study publishing data for maximum reuse in five data source interoperability layers. Chapter 3 then elaborated on how we can – and whether we should – measure the data source interoperability based on these layers, and discussed three possible approaches. In Chapter 4, I applied this to three projects, which were carried out over the course of this PhD, and reasoned that qualitatively studying interoperability would work best at that time for studying how to maximize reuse at the Flemish government. In Chapter 5, I then introduced the specifics of data in the transport domain, which sketched the current state of the art. Finally, Chapter 6 introduced the Linked Connections framework, the conclusion of which contains strong arguments – supported by the evaluation – in favor of Linked Connections for publishing public transport data.

Note: in the upcoming http/2.0 standard, all Web communication needs to be secure (https). Using the http protocol today thus implies using the secured https to identify Web resources. A redirect from an http url to an https url still enables older identifiers to persist.

Lowering the cost of adoption for public datasets is a complex cross-cutting concern. Different parties across different organizations need to – just to name a few – align their vision on Open Data, create and accept domain models, agree upon legal conditions, and pick a Linked Data interface to make their data queryable. To that extent, we need to identify the minimum requirements that would lower the cost of adoption across the entire information system on all interoperability levels. In order to achieve a better Web ecosystem for sharing data in general, I summarized a minimum set of extra requirements when using the http protocol to publish Open Data; a minimal server sketch applying several of them follows the list.

1. Fragment your datasets and publish the documents over http. The way the fragments are chosen depends on the domain model.

2. When you want to enable faster query answering, provide aggregated documents with appropriate links (useful for, e.g., time series), or expose more fragments on the server-side.

3. For scalability, add caching headers to each document.

4. For discoverability, add hypermedia descriptions in the document.

5. Provide a web address (uri) per object you describe, as well as http uris for the domain model. This way, each data element is documented and there is a framework to raise the semantic interoperability.

6. For the legal interoperability, add a link to a machine readable open license in the document.

7. Add a Cross Origin Resource Sharing http header, enabling access from pages hosted on different origins.

8. Finally, provide dcat-ap metadata for discoverability in Open Data Portals.
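As an illustration only, here is a minimal sketch of an http endpoint applying requirements 1, 3, 4, 5, 6, and 7 to a single fragment; the Express set-up, URLs, and vocabulary terms are assumptions of this sketch, not part of any specification discussed above:

```javascript
// Sketch: one dataset fragment served with caching, CORS, a license link,
// and a hypermedia link to the next fragment. All names and URLs are illustrative.
const express = require('express');
const app = express();

app.get('/connections/:page', (request, response) => {
  response.set({
    'Content-Type': 'application/ld+json',
    'Cache-Control': 'public, max-age=30',   // requirement 3: cacheable, even for live data
    'Access-Control-Allow-Origin': '*'       // requirement 7: cross-origin access
  });
  const page = Number(request.params.page);
  response.json({
    '@id': `https://example.org/connections/${page}`,                    // requirement 5: a uri per resource
    'hydra:next': `https://example.org/connections/${page + 1}`,         // requirement 4: hypermedia
    'dct:license': 'https://creativecommons.org/publicdomain/zero/1.0/', // requirement 6: open license
    '@graph': [] // requirement 1: the data elements of this fragment would go here
  });
});

app.listen(3000);
```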

This approach does of course not limit itself to static data. The http protocol allows for caching resources for smaller amounts of time. Even when a document changes every couple of seconds, the resource can still be cached during that period, and a maximum load on a back-end system can be calculated (for example, with a max-age of 30 seconds, each fragment reaches the back-end at most twice per minute per cache, regardless of how many end-users sit behind that cache).

While economists promise a positive economic impact from Open Data, I have not yet seen proof of this impact – to the extent promised – today. Over the next years, we will need to focus on lowering the cost of adoption if we want to see true economic impact. For data publishers, this entails raising the data source interoperability. For data reusers, this entails automating their clients to reuse these public datasets: when a new dataset becomes available and discoverable, the user agent can automatically benefit from it. Today, this is a manual process, where the user agent has to store and integrate all data locally first, or where the user agent has to rely merely on the expressiveness of a certain data publisher's api. When fragmenting datasets in ways similar to Chapter 6, it becomes entirely up to the http client's cache to decide what data to keep locally and what data to download just in time.

There are still open research questions that we are going to tackle in the years to come. For one, I look forward to researching fragmentation strategies within the domain of geospatial data. Our hypothesis is that intermodal route planning, combining route planning over road networks with public transport routing, can be achieved when a similar approach for routable tiles can be found. Moreover, solving full-text search by exposing fragments, inspired by how indexes and tree structures are built today, may afford a more optimized information architecture for federated full-text search. Finally, organizational problems also still need to be tackled: which interfaces need to be hosted by whom?

For the next couple of years, I look forward to building and expanding our IDLab team on Linked Data interfaces, and to working further on projects with the organizations that I got to know best. A new generation of PhD researchers is already working on follow-up projects, building a more queryable future for the many, not the few.
