En toen was er niets meer … (And then there was nothing left …)

Herbert Van de Sompel, VOGIN-IP, Amsterdam, the Netherlands, March 9 2017

Herbert Van de Sompel, LANL & DANS, @hvdsomp

The Web

The Web Evolves

Yet, the Web Exists in a Perpetual Now

Traces of the Past Web Exist

• Content Management Systems

• Web Archives

• Transactional archives

• Search engine caches

• …

But Past and Current Web(s) are Parallel Universes

The Memento Protocol Integrates the Current and Past Web

http://mementoweb.org/guide/rfc/

Original Resource and Mementos

Bridge from Present to Past

Bridge from Past to Present

Memento: Access Versions via the Original URI and a Datetime

[Figure labels: Today; Select Date: Mar 9 1999; Feb 8 1999, Bibliotheca Alexandrina Web Archive]
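
The idea of reaching a past version through the original URI plus a datetime can be sketched in a few lines of Python. This is a minimal illustration, not part of the talk: it only builds the request (nothing is sent), the function name `memento_request` is hypothetical, and the TimeGate prefix is assumed to be the public Memento aggregator. The `Accept-Datetime` header is the one the Memento protocol uses for datetime negotiation.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumption: the public Memento aggregator TimeGate; another TimeGate could be used instead.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def memento_request(original_uri: str, when: datetime) -> tuple[str, dict[str, str]]:
    """Return (TimeGate URL, headers) for a datetime-negotiation request."""
    # Accept-Datetime carries an HTTP-date, which is always expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_PREFIX + original_uri, {"Accept-Datetime": http_date}


url, headers = memento_request(
    "http://www.vogin.nl/", datetime(1999, 3, 9, tzinfo=timezone.utc)
)
print(url)
print(headers)
```

Sending this request with any HTTP client should, under these assumptions, lead via redirect to a Memento whose archival datetime is close to the requested one.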

vogin.nl in 1999

http://web.archive.bibalex.org/web/19990208021257/http://www.vogin.nl/

Memento for Chrome

http://bit.ly/memento-for-chome

Hyperlinks

Eric Sieverts (2017) https://vogin-ip-lezing.net/2017/01/17/linkrot-linkroest-en-webarchieven/

Hyperlinks in Theory

Hyperlinks in Reality

Link Rot

http://404-resto.com/typo3temp/pics/7580ea80fa.jpg

Hyperlinks in Reality

Content Drift

http://icecube.wisc.edu/ on May 8 2009 (left) and August 27 2009 (right)

Content Drift

http://dl00.org in 2000, 2004, 2005, 2008

No Content Drift

http://www.ifa.hawaii.edu/~cowie/k_table.html on June 9 1997 (left) and March 2016 (right)

The Web, All Hyperlinks Subject to Link Rot, Content Drift

The Web, All Hyperlinks Subject to Reference Rot

• Reference Rot hinders our ability to follow links as they were intended when they were put in place:

• Link rot: A link stops working altogether

• Content drift: The linked content changes over time and may eventually no longer be representative of the content that was originally linked

Creating Pockets of Persistence

• How to maintain the integrity of links?

• This challenge exists for the entire web. Some communities with well-managed collections care about addressing it because they consider it a Quality of Service issue:
  • Scholarly communication
  • Cultural heritage
  • Legal publications
  • Government communication
  • Journalism
  • Wikipedia
  • …

• What can these communities do to create Pockets of Persistence?

A Managed Collection Desires Reliable Outlinks

Links to another Managed Collection

Links to Web at Large Resources

Exploring Link Rot & Content Drift

Preamble 2 - Hiberlink Study of Reference Rot in STM Articles

PMC articles published 1997-2012        PMC
Total                                   479,194
With links to articles                  240,857
With links to web-at-large resources    156,160

Links                                   PMC
To articles                             744,678
To web-at-large resources               480,853

Number of Articles & Links - PMC

Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE. https://doi.org/10.1371/journal.pone.0115253

Links to Articles & to Web At Large Resources - PMC

Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE. https://doi.org/10.1371/journal.pone.0115253

Exploring Link Rot & Content Drift

Link Rot Occurs when B Moves to C

Introduce PID(B)

Link to PID(B); HTTP Redirect from PID(B) to B

When B moves to C: HTTP Redirect from PID(B) to C
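
The PID mechanism amounts to one level of indirection. A minimal sketch, assuming a hypothetical in-memory resolver table and made-up identifiers and locations (real resolvers such as doi.org do this at scale):

```python
# Hypothetical identifier and locations, for illustration only.
resolver = {"pid:B": "http://example.org/B"}


def resolve(pid: str) -> tuple[int, dict[str, str]]:
    """Answer a request for a PID with an HTTP redirect to its current location."""
    return 302, {"Location": resolver[pid]}


print(resolve("pid:B"))  # redirect to B

# B moves to C: only the resolver entry changes; links to the PID stay valid.
resolver["pid:B"] = "http://example.net/C"
print(resolve("pid:B"))  # redirect to C
```

The link in A never has to change, which only helps if A actually links to the PID rather than to B.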

Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102

Core assumption in the PID solution: PIDs will be used to establish links.

But are they?

A Disconcerting Observation

• When classifying links extracted from PMC as linking to articles, we assumed that filtering on http://dx.doi.org/* would do the trick

• But we found a lot of links such as http://link.springer.com/article/*

• For example: http://link.springer.com/article/10.1007%2Fs00799-014-0108-0

• Instead of: http://dx.doi.org/10.1007/s00799-014-0108-0

• We used CrossRef’s Reverse Domain Lookup to classify these extracted links as linking to articles
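
Why a plain filter on http://dx.doi.org/* misses article links can be shown with a small sketch. This is not the CrossRef Reverse Domain Lookup used in the study; it is a simple stand-in heuristic that looks for a DOI-shaped string in the URL path. The function name, the regular expression, and the publisher-style URL (written here with the DOI from the dx.doi.org form) are assumptions made for the example.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlsplit

DOI_RESOLVERS = {"dx.doi.org", "doi.org"}
# Heuristic DOI shape: "10.", a registrant code, a slash, and a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def classify(url: str) -> tuple[str, Optional[str]]:
    """Return ("pid" or "location", DOI found in the URL path or None)."""
    parts = urlsplit(url)
    match = DOI_PATTERN.search(unquote(parts.path))  # %2F becomes "/"
    doi = match.group(0) if match else None
    kind = "pid" if parts.hostname in DOI_RESOLVERS else "location"
    return kind, doi


print(classify("http://dx.doi.org/10.1007/s00799-014-0108-0"))
print(classify("http://link.springer.com/article/10.1007%2Fs00799-014-0108-0"))
```

Both URLs yield the same DOI, but only the first is a link to the persistent identifier; the second is a link to a location that happens to embed it.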

URI References - PMC

Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102

Cartoon by Patrick Hochstenbach

A Proposal to Get PIDs Used: Signposting

http://signposting.org

Signposting: HTTP Link with identifier Relation Type

http://signposting.org/identifier/

Signposting: Use HTTP Link with identifier Relation Type

curl -I http://www.dlib.org/dlib/november15/vandesompel/11vandesompel.html

HTTP/1.1 200 OK
Date: Wed, 26 Oct 2016 12:36:37 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 19 Nov 2015 14:50:19 GMT
ETag: "205a5e-f5ef-524e5e0ab80c0"
Accept-Ranges: bytes
Content-Length: 62959
Content-Type: text/html; charset=UTF-8
Link: <http://doi.org/10.1045/november2015-vandesompel> ; rel="identifier"
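
A client such as a reference manager could pick the PID out of that response header. The following is a simplified sketch, assuming a hypothetical helper `links_with_rel`; it does not handle every corner of the Link header syntax (for example, commas inside quoted parameter values).

```python
import re


def links_with_rel(link_header: str, rel: str) -> list[str]:
    """Return the target URIs in an HTTP Link header whose rel includes `rel`."""
    targets = []
    # Each link-value looks like: <URI> ; param="value" ; ...
    for uri, params in re.findall(r"<([^>]*)>([^,<]*)", link_header):
        match = re.search(r'rel\s*=\s*"?([^";]+)"?', params)
        # A rel value may list several space-separated relation types.
        if match and rel in match.group(1).split():
            targets.append(uri)
    return targets


header = '<http://doi.org/10.1045/november2015-vandesompel> ; rel="identifier"'
print(links_with_rel(header, "identifier"))
```

With the header from the response above, the function returns the DOI URI, which is the URI that should be used for linking instead of the dlib.org location.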

PID Alternative - When B Moves to C: HTTP Redirect from B to C

• Custodian of C needs to hold on to domain of B

• Custodian of C needs to establish redirection patterns, often rather simple rules

• No problem with establishing links to PID(B); the URI in the browser address bar (initially B, later C) is just fine
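
The "rather simple rules" can be as small as one pattern per moved section of a site. A minimal sketch, assuming hypothetical old (B) and new (C) hosts and path layouts; in practice such rules usually live in the web server configuration of the old domain:

```python
import re
from typing import Optional

# Hypothetical rule: old paper pages on B map to article pages on C.
RULES = [
    (
        re.compile(r"^http://old\.example\.org/papers/(\d+)\.html$"),
        r"https://new.example.net/articles/\1",
    ),
]


def redirect(url: str) -> Optional[tuple[int, str]]:
    """Return (status, new location) for a moved URL, or None if no rule applies."""
    for pattern, target in RULES:
        if pattern.match(url):
            return 301, pattern.sub(target, url)
    return None


print(redirect("http://old.example.org/papers/42.html"))
```

This only keeps working for as long as the custodian of C holds on to the domain of B.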

Exploring Link Rot & Content Drift

Content Drift Occurs when B Changes over Time

• Was not really considered an issue because:
  • the objects that receive PIDs were typically static, e.g. scientific papers
  • when a (substantially) new version of an object is published, a new PID is assigned

• But:
  • PID links (typically) lead to landing pages, not the identified objects
  • landing pages are increasingly rich and aggregate comments, discussion, annotations; they do change over time.

Custodian of B Takes Snapshots of B as it Evolves over Time

Custodian of B Ensures Snapshots of B as it Evolves over Time

• This does not happen for PID-identified objects, AFAIK

• Version Control Systems (e.g. Wikipedia) hold on to all versions; snapshots are local.

• Pro-active archiving solutions for web servers that create snapshots when e.g. new content is published/visited or at regular intervals:
  • on-demand archiving of a web server, cf. archiefweb.eu, archive-it.org
  • self-archiving web server, cf. SiteStory

• How to access the snapshots of B? Memento!

SiteStory Transactional Archive & Memento

https://mementoweb.github.io/SiteStory/

SiteStory, Wikipedia, Web Archive, Memento in Action

http://lanlsource.lanl.gov/hello

Exploring Link Rot & Content Drift

Scholarly Context Not Found

Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE. https://doi.org/10.1371/journal.pone.0115253

Link Rot - PMC

Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE. https://doi.org/10.1371/journal.pone.0115253

Exploring Link Rot & Content Drift

Scholarly Context Adrift

Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift: Three out of four URI references lead to changed content. In: PLOS ONE. https://doi.org/10.1371/journal.pone.0167475

How to Assess Content Drift?

Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift: Three out of four URI references lead to changed content. In: PLOS ONE. https://doi.org/10.1371/journal.pone.0167475

Step 1: Find Pre/Post Mementos

Step 2: Select Representative Mementos

Text Similarity Measures

• Compute aggregate text similarity scores (values between 0 and 100) for:
  • Simhash
  • Jaccard
  • Sørensen-Dice
  • Cosine

• If the aggregate score is 100, we decide that the Pre/Post Mementos are representative

• We find 313K URI references with representative Mementos
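
Three of the four measures are easy to sketch on word tokens. This is an illustration only: the tokenization, the omission of Simhash, and the rule that every score must reach 100 are assumptions for the example, not the study's exact procedure.

```python
import math
from collections import Counter


def tokens(text: str) -> list[str]:
    return text.lower().split()


def jaccard(a: str, b: str) -> float:
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a: str, b: str) -> float:
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a: str, b: str) -> float:
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    if norm == 0:
        return 100.0 if ca == cb else 0.0
    return 100.0 * dot / norm


def representative(pre: str, post: str) -> bool:
    """Assumed rule: Pre/Post Mementos count as representative only if every score is 100."""
    scores = (jaccard(pre, post), sorensen_dice(pre, post), cosine(pre, post))
    return all(math.isclose(score, 100.0) for score in scores)


print(jaccard("a b c", "a b d"), sorensen_dice("a b c", "a b d"), cosine("a b c", "a b d"))
print(representative("same text", "same text"))
```

If the Memento before and the Memento after the reference date are identical by these measures, the content presumably did not change in between, so either one represents what the author saw.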

URI References without Representative Mementos - PMC

Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift: Three out of four URI references lead to changed content. In: PLOS ONE. https://doi.org/10.1371/journal.pone.0167475

Step 3: Dereference Live Web Version of URI

Step 4: Representative Memento vs. Live Version
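
Steps 3 and 4 combine into a single decision per URI reference. A rough sketch, with several assumptions: the live page has already been fetched (its final HTTP status and text are passed in), `difflib.SequenceMatcher` from the Python standard library stands in for the study's similarity measures, and the `threshold` parameter is hypothetical.

```python
from difflib import SequenceMatcher


def assess(live_status: int, memento_text: str, live_text: str,
           threshold: float = 100.0) -> str:
    """Classify a URI reference as "link rot", "content drift", or "unchanged"."""
    # Link rot: dereferencing the live URI no longer yields the resource.
    if live_status >= 400:
        return "link rot"
    # Content drift: the live content differs from the representative Memento.
    score = 100.0 * SequenceMatcher(None, memento_text, live_text).ratio()
    return "unchanged" if score >= threshold else "content drift"


print(assess(404, "IceCube neutrino observatory", ""))
print(assess(200, "IceCube neutrino observatory", "IceCube neutrino observatory"))
print(assess(200, "IceCube neutrino observatory", "This domain is for sale"))
```

Lowering the threshold tolerates small edits; keeping it at 100 treats any change as drift.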

Content Drift - PMC

Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift: Three out of four URI references lead to changed content. In: PLOS ONE. https://doi.org/10.1371/journal.pone.0167475

Exploring Link Rot & Content Drift

Uncertainty Regarding the Future of B when A Links to It

Custodian of A Takes a Snapshot of B when Linking to It

Taking a Snapshot of B: Automation is Key

• Web archive APIs for on-demand archiving
  • perma.cc, Internet Archive, archive.is, webcitation

• Amber for Wordpress & Drupal archives resources linked in a page
  • http://amberlink.org/

• Hiberlink’s experimental Zotero extension archives bookmarked URLs
  • http://hiberlink.org/zotero.html

• Hiberlink’s experimental HiberActive archives all URLs referenced in a newly submitted paper
  • https://www.slideshare.net/martinklein0815/hiberactive
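
Automation can start with something as small as turning the outlinks of a page into archiving requests. The sketch below only builds the requests and sends nothing. It assumes the Internet Archive's "Save Page Now" URL pattern (https://web.archive.org/save/ followed by the URL to capture); the function name is hypothetical, and other archives have their own APIs, authentication, and rate limits.

```python
from urllib.request import Request

# Assumption: Internet Archive "Save Page Now" URL pattern.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def snapshot_requests(urls: list[str]) -> list[Request]:
    """Build one archiving request per distinct linked URL, in first-seen order."""
    distinct = dict.fromkeys(urls)  # drops duplicates, keeps order
    return [Request(SAVE_ENDPOINT + url, method="GET") for url in distinct]


outlinks = ["http://www.vogin.nl/", "http://signposting.org", "http://www.vogin.nl/"]
for request in snapshot_requests(outlinks):
    print(request.full_url)
```

Passing each request to `urllib.request.urlopen` would trigger the capture; a production tool would also record the resulting snapshot URL and datetime for use in the link.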

Linking to Snapshot of B = Potentially Creating a Rotten Link

• Existing practice for linking to snapshots:

<a href="URL of snapshot of B">

• Problems with existing practice:
  o Impossible to visit the original URI, if desired
  o Requires the permanent existence/uptime of the archive that holds the snapshot
    - One link rot problem replaced by another

http://robustlinks.mementoweb.org/about/

Permanent Existence/Uptime of Archives?

Capture of http://webcitation.org dated July 17 2013
https://archive.today/eAETp

Permanent Existence/Uptime of Archives?

Remnant of discontinued web archive http://mummify.it captured on February 14 2014
https://web.archive.org/web/20140214233752/https://www.mummify.it/

Permanent Existence/Uptime of Archives?

http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html

Permanent Existence/Uptime of Archives?

http://web.archive.org/web/20121101043952/http://vogin.nl on March 6 2017 at 15:59 CET

Link to Snapshot of B and Decorate the Link

• Desired practice for linking to captures is to decorate the link so it provides a variety of options:

<a href="URL of snapshot of B" data-originalurl="B" data-versiondate="datetime of snapshot of B">

• Supports:
  o Revisiting the original URL
  o Finding snapshots in any web archive (original URL)
  o Finding a temporally appropriate snapshot in any web archive (original URL & snapshot datetime)
  o Automatically accessing a temporally appropriate snapshot in any web archive (Memento, original URL & snapshot datetime)

http://robustlinks.mementoweb.org/spec/
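
Generating decorated links is easy to automate once the snapshot URL and its datetime are known. A minimal sketch, assuming a hypothetical helper `robust_link` and a date-only value for data-versiondate; the example values reuse the vogin.nl capture in the Bibliotheca Alexandrina Web Archive shown earlier.

```python
from html import escape


def robust_link(snapshot_url: str, original_url: str, version_date: str, text: str) -> str:
    """Return an HTML anchor that links to a snapshot and carries the original URL and datetime."""
    return '<a href="{}" data-originalurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(snapshot_url, quote=True),
        escape(original_url, quote=True),
        escape(version_date, quote=True),
        escape(text),
    )


print(robust_link(
    "http://web.archive.bibalex.org/web/19990208021257/http://www.vogin.nl/",
    "http://www.vogin.nl/",
    "1999-02-08",
    "vogin.nl in 1999",
))
```

If the archive that holds the snapshot disappears, the two data attributes still carry enough information to look for a temporally close snapshot elsewhere.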

Robust Links: Link Decoration in Action

Van de Sompel, H. & Nelson, M.L. (2015) Reminiscing about 15 years of interoperability efforts. In: D-Lib Magazine. https://doi.org/10.1045/november2015-vandesompel

JavaScript makes the link decorations actionable
