Linked Data
Linked Data (LD) is structured information contained in a database or dataset that follows the Resource Description Framework (RDF), the standard model for sharing data on the web and the de facto standard for LD. Put more simply, RDF is a structured way to represent information about resources.
Triples
RDF represents information in a series of “statements” called triples. Triples are called triples because they have three parts: a subject, a predicate, and an object. The subject is the thing being described. The object is the thing that the subject is related to. The predicate describes the relationship between the subject and object.
Let’s look at an example. Imagine you want to express the statement “Margaret Laurence is the author of The Stone Angel.” If you were to use RDF, this statement would be expressed as follows:
Margaret Laurence becomes the subject (the thing being described). The Stone Angel becomes the object (the thing related to the subject). “Is the author of” becomes the predicate (the relationship between the subject and the object).
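As a rough illustration, a triple can be modelled in Python as a record with three fields. The `Triple` class below is a hypothetical name chosen for this sketch, not part of RDF or of any library, and plain strings stand in for the identifiers that real LD would use.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    object: str


# "Margaret Laurence is the author of The Stone Angel."
statement = Triple(
    subject="Margaret Laurence",
    predicate="is the author of",
    object="The Stone Angel",
)

print(statement.subject)    # the thing being described
print(statement.predicate)  # the relationship between subject and object
print(statement.object)     # the thing the subject is related to
```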
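It is important to note that the relationship between the subject and object can be read in either direction. A triple itself is directed, running from subject to object, but the statement can usually be “flipped” by swapping the subject and object and using an inverse predicate (for example, “was written by”):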
Here are some more examples:
Now imagine you have many RDF triples and the triples refer to the same things. You can start to link them to create a graph:
At LINCS, subjects and objects are both called entities, while predicates are called properties. Databases that store triples are called triplestores.
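One way to picture how shared entities link triples together is to keep the triples in a list and look up every statement that mentions a given entity. This is a minimal sketch that uses plain strings as entities. The `mentions` helper is an assumption made for illustration, and the triples are built from the examples on this page.

```python
# Each triple is (subject, predicate, object).
triples = [
    ("Margaret Laurence", "is the author of", "The Stone Angel"),
    ("Margaret Laurence", "is the author of", "The Diviners"),
    ("The Stone Angel", "is set in", "Manitoba"),
    ("Manitoba", "is a province of", "Canada"),
]


def mentions(entity, graph):
    """Return every triple in which the entity is the subject or the object."""
    return [t for t in graph if entity in (t[0], t[2])]


# "The Stone Angel" appears in two triples, so it connects
# Margaret Laurence to Manitoba in the graph.
for s, p, o in mentions("The Stone Angel", triples):
    print(f"{s} -- {p} --> {o}")
```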
Take a second to think about this graph. If The Stone Angel is set in Manitoba, and Manitoba is a province of Canada, you can assume that The Stone Angel is also set in Canada:
The process of generating new facts from existing triples is called inferencing. Inferencing helps to improve the quality of data by discovering new relationships.
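The Manitoba example can be sketched as a single hand-written rule: if a work is set in a place, and that place is a province of a larger place, then the work is also set in the larger place. Real inferencing is normally driven by an ontology and a reasoner. The `infer_settings` function and the rule it hard-codes are assumptions for illustration only.

```python
def infer_settings(triples):
    """Apply one rule until no new triples appear:
    (x, 'is set in', y) and (y, 'is a province of', z)
    imply (x, 'is set in', z).
    """
    known = set(triples)
    while True:
        new = {
            (x, "is set in", z)
            for (x, p1, y) in known if p1 == "is set in"
            for (y2, p2, z) in known if p2 == "is a province of" and y2 == y
        }
        if new <= known:  # nothing new was derived, so stop
            return known
        known |= new


triples = [
    ("The Stone Angel", "is set in", "Manitoba"),
    ("Manitoba", "is a province of", "Canada"),
]
inferred = infer_settings(triples) - set(triples)
print(inferred)  # {('The Stone Angel', 'is set in', 'Canada')}
```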
Uniform Resource Identifiers (URIs)
As demonstrated above, if many triples refer to the same entity, they can be linked together. Computers, however, are the ultimate literalists. Words are strings (sequences of characters), but the same string can mean different things. For example, look at the following RDF statement:
To us, it might be obvious that Cookie Monster’s favourite food is a baked good. “Cookies,” however, can also refer to the blocks of data that are created when browsing a website. Since you do not want the computer to think that Cookie Monster likes eating blocks of data, you need to uniquely identify each entity.
Enter the Uniform Resource Identifier (URI): a unique, persistent string that consistently represents a distinct person, place, thing, or piece of information. By using a URI to refer to the same entity in multiple triples and multiple datasets, separate clusters of data can then be connected to one another, linking them all together.
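The difference between strings and URIs can be seen in a small comparison. The `example.org` URIs below are placeholders invented for this sketch, not real identifiers from any authority.

```python
# Two different meanings that share the same label.
food_label = "cookies"
web_label = "cookies"

# Hypothetical URIs (example.org is reserved for documentation).
BAKED_COOKIE = "https://example.org/food/cookie"
HTTP_COOKIE = "https://example.org/computing/http-cookie"

# As bare strings, the two meanings are indistinguishable.
print(food_label == web_label)      # True

# As URIs, each entity has its own identifier.
print(BAKED_COOKIE == HTTP_COOKIE)  # False

# A triple that says unambiguously which kind of cookie is meant.
favourite_food = (
    "https://example.org/people/cookie-monster",
    "https://example.org/vocab/favouriteFood",
    BAKED_COOKIE,
)
```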
But what exactly does a URI look like? One example you are probably familiar with is a Uniform Resource Locator (URL). A URL is a type of URI that describes the location of something on the web. For example, here is a URL for Margaret Laurence that works as a URI:
https://viaf.org/viaf/44317974/
This URI is from the Virtual International Authority File (VIAF). The beginning of this URI (https://viaf.org/viaf/) gives us context by telling us that it is from VIAF. The following numbers (44317974) refer specifically to Margaret Laurence, the Canadian author, in VIAF’s system. VIAF also has URIs for Margaret Laurence’s novels, The Stone Angel and The Diviners. You will notice that these URIs also end in unique strings of numbers:
http://viaf.org/viaf/187423217
http://viaf.org/viaf/187586333
These unique strings of numbers are the final segments of the URIs’ paths. It is important to note that the numbers by themselves are not URIs because they do not uniquely identify the entities. The string “187586333” on its own could refer to many things!
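To see the parts of a URI for yourself, you can split one up with Python's standard `urllib.parse` module. This sketch uses the VIAF URI for Margaret Laurence quoted above. Note that `urlparse` reports the path as everything after the host name, so the numeric identifier is the last segment of that path; the `local_id` variable name is an assumption of this example.

```python
from urllib.parse import urlparse

uri = "https://viaf.org/viaf/44317974/"
parts = urlparse(uri)

print(parts.scheme)  # https
print(parts.netloc)  # viaf.org
print(parts.path)    # /viaf/44317974/

# The identifier VIAF assigned is the last non-empty path segment.
local_id = parts.path.rstrip("/").split("/")[-1]
print(local_id)      # 44317974

# On its own, the number is just a string; only the full URI is unique.
print(local_id == uri)  # False
```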
A few last things to note:
- URLs are always URIs, but URIs are not always URLs. URIs do not need to look like URLs, but they need to be reliably unique.
- Some URIs have the same format as a URL, but they do not actually link to a webpage.
- URIs that do link to a webpage are said to be dereferenceable.
- Even though all URLs are URIs, some are less reliable than others. For example, if you wanted to find a URI for an academic article, it is best practice to use the article’s Digital Object Identifier (DOI) rather than a URL to a PDF on someone’s personal website.
The process of finding and matching a URI from an external authority to an entity in a dataset is called reconciliation.
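A very simplified view of reconciliation is a lookup from the names in a dataset to URIs in an authority. In the sketch below, the `authority` dictionary is a hand-made stand-in for an external service such as VIAF (it only contains the Margaret Laurence URI quoted above), and `reconcile` is a hypothetical helper. Real reconciliation usually involves fuzzy matching and human review of candidate matches.

```python
# Hand-made stand-in for an external authority file.
authority = {
    "margaret laurence": "https://viaf.org/viaf/44317974/",
}


def reconcile(name, authority):
    """Return the authority URI for a name, or None if there is no match."""
    # Normalize case and whitespace before looking the name up.
    key = " ".join(name.lower().split())
    return authority.get(key)


dataset = ["Margaret  Laurence", "MARGARET LAURENCE", "Unknown Author"]
reconciled = {name: reconcile(name, authority) for name in dataset}

for name, uri in reconciled.items():
    print(f"{name!r} -> {uri}")
```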
Serializations
While we now know how RDF triples are structured, what do they actually look like? RDF triples can be “serialized” (i.e., written) in a few different ways, with common syntaxes including Turtle (TTL), RDF/XML (based on Extensible Markup Language), and JSON-LD (based on JSON). The LINCS triplestore ingests data using the TTL format.
For those new to LD, RDF serializations can be intimidating. Examples of triples serialized in TTL, RDF/XML, and JSON-LD are provided below, taken from Meindertsma (2019) “What’s the Best RDF Serialization Format?” Do not worry: you do not need to be able to parse these statements.
The following snippets state that Tim Berners-Lee was born on June 8, 1955, in London, England.
TTL:
@prefix tim: <https://www.w3.org/People/Berners-Lee/>.
@prefix schema: <http://schema.org/>.
@prefix dbpedia: <http://dbpedia.org/resource/>.
tim: schema:birthDate "1955-06-08"^^<http://www.w3.org/2001/XMLSchema#date> .
tim: schema:birthPlace dbpedia:London .
RDF/XML:
<?xml version="1.0"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:schema="http://schema.org/">
  <rdf:Description rdf:about="https://www.w3.org/People/Berners-Lee/">
    <schema:birthDate>1955-06-08</schema:birthDate>
    <schema:birthPlace rdf:resource="http://dbpedia.org/resource/London"/>
  </rdf:Description>
</rdf:RDF>
JSON-LD:
{
  "@context": {
    "dbpedia": "http://dbpedia.org/resource/",
    "schema": "http://schema.org/"
  },
  "@id": "https://www.w3.org/People/Berners-Lee/",
  "schema:birthDate": "1955-06-08",
  "schema:birthPlace": {
    "@id": "dbpedia:London"
  }
}
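All three snippets are meant to describe the same two triples. As an informal illustration, the JSON-LD version can be read with Python's standard `json` module and its prefixed names expanded by hand. The `expand` helper is an assumption of this sketch and only handles simple `prefix:name` forms. It is not a JSON-LD processor, and a dedicated RDF library would normally do this work.

```python
import json

doc = json.loads("""
{
  "@context": {
    "dbpedia": "http://dbpedia.org/resource/",
    "schema": "http://schema.org/"
  },
  "@id": "https://www.w3.org/People/Berners-Lee/",
  "schema:birthDate": "1955-06-08",
  "schema:birthPlace": {"@id": "dbpedia:London"}
}
""")

context = doc["@context"]


def expand(name):
    """Expand 'prefix:local' to a full URI when the prefix is in the context."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return name


subject = doc["@id"]
triples = []
for key, value in doc.items():
    if key.startswith("@"):
        continue  # skip JSON-LD keywords such as @context and @id
    # A nested object with "@id" points at another entity; anything else is a literal.
    obj = expand(value["@id"]) if isinstance(value, dict) else value
    triples.append((subject, expand(key), obj))

for triple in triples:
    print(triple)
```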
Linked Open Data (LOD)
When LD is released under an open license, it is considered to be Linked Open Data (LOD). However, there are some more specific criteria. Data is considered to be true LOD when it:
- Follows a recognized LOD standard format (e.g., RDF)
- Uses URIs to identify entities
- Is published openly or is released under an open license
Tim Berners-Lee, the inventor of the World Wide Web (WWW), put forward the idea of “five-star” LOD that achieves ideal linked-ness and open-ness:
- ★: Your stuff is available on the web in whatever format.
- ★★: Your stuff is available as structured data. For example, your data is available in a spreadsheet, rather than as a screenshot of a spreadsheet.
- ★★★: Your stuff makes use of non-proprietary formats. For example, you use CSV rather than Excel.
- ★★★★: Your stuff uses URIs to identify entities, so people can link to it.
- ★★★★★: Your stuff is linked to other people’s data to provide contextual information.
The more that people create five-star LOD, the richer information on the web will be.
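Because each star builds on the one before it, the scheme can be read as a checklist that stops at the first unmet criterion. The sketch below assumes a hypothetical dataset description with five true/false fields. The field names and the `star_rating` function are invented for illustration and are not part of the five-star scheme itself.

```python
# Criteria in star order; each one only counts if all earlier ones are met.
CRITERIA = [
    "on_the_web",        # one star
    "structured",        # two stars
    "non_proprietary",   # three stars
    "uses_uris",         # four stars
    "linked_to_others",  # five stars
]


def star_rating(dataset):
    """Count how many consecutive criteria the dataset description meets."""
    stars = 0
    for criterion in CRITERIA:
        if not dataset.get(criterion, False):
            break
        stars += 1
    return stars


# A CSV file published on a website: structured and non-proprietary,
# but it does not yet use URIs or link to other data.
csv_on_website = {"on_the_web": True, "structured": True, "non_proprietary": True}
print("★" * star_rating(csv_on_website))
```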
Semantic Web
The Semantic Web is the idea of extending the WWW so computers can make meaningful interpretations of data. For example, imagine that the mother of Tegan Quin is Sonia Clement. The mother of Sara Quin is the same Sonia Clement. Semantic reasoning would suggest that Tegan and Sara are siblings.
Now imagine if a computer were able to do this reasoning. LOD is essential for the realization of the Semantic Web because it creates a web of machine-readable data, converting asemantic “strings” (e.g., “Sonia Clement”) to “things” (e.g., entities) that can be related to one another!
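The sibling example can be written as one explicit rule over triples. This is a toy sketch: the predicate names `hasMother` and `siblingOf` and the plain-string entities are assumptions for illustration, whereas a Semantic Web application would use URIs and an ontology-aware reasoner.

```python
from itertools import permutations

facts = {
    ("Tegan Quin", "hasMother", "Sonia Clement"),
    ("Sara Quin", "hasMother", "Sonia Clement"),
}


def infer_siblings(triples):
    """Infer that two different people with the same mother are siblings."""
    mothers = [(child, mother) for child, p, mother in triples if p == "hasMother"]
    return {
        (a, "siblingOf", b)
        for (a, m1), (b, m2) in permutations(mothers, 2)
        if m1 == m2
    }


siblings = infer_siblings(facts)
for s, p, o in sorted(siblings):
    print(s, p, o)
```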
Benefits
LOD has many benefits. In particular, it promotes interoperability, contextualization, and serendipity.
Interoperability
Since LOD is based on a standard developed by the World Wide Web Consortium (W3C) and is structured in a way that is comprehensible to multiple systems, LOD promotes interoperability. This interoperability allows data to be further shared and discovered, as data hosted on various sites can be connected together and then searched at once.
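A small sketch of why shared identifiers matter: two datasets that use the same URI for Margaret Laurence can be combined with a plain set union and then searched together. The VIAF URIs are the ones quoted earlier on this page. The two datasets, the `example.org` predicate, and the `about` helper are invented for illustration.

```python
LAURENCE = "https://viaf.org/viaf/44317974/"
AUTHOR_OF = "https://example.org/vocab/authorOf"  # hypothetical predicate URI

# Two independently published datasets that happen to use the same author URI.
library_data = {(LAURENCE, AUTHOR_OF, "http://viaf.org/viaf/187423217")}
archive_data = {(LAURENCE, AUTHOR_OF, "http://viaf.org/viaf/187586333")}

# Because the identifiers match exactly, merging is just a set union.
merged = library_data | archive_data


def about(uri, triples):
    """Return all triples whose subject is the given URI, in a stable order."""
    return sorted(t for t in triples if t[0] == uri)


# One search now finds works recorded on both "sites".
for _, _, work in about(LAURENCE, merged):
    print(work)
```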
Contextualization
By connecting data from multiple sites and showing the relationships between different entities, LOD promotes contextualization. Importantly, this contextual information allows for inferencing.
Serendipity
LOD promotes serendipity, a quality that is usually associated with physical libraries. By physically moving through a structured knowledge environment like a library and searching the bookstacks, patrons tend to discover resources that they did not know they were looking for. Replicating serendipitous exploration and discovery online is difficult. Most web tools restrict users by having them type a specific search in a search bar to locate resources. LOD, however, provides a way to reconstruct a structured data environment virtually so relationships between entities can be interacted with and visualized.
Summary
- The Resource Description Framework (RDF) represents information as a series of triples that have a subject, predicate, and object.
- Triples that refer to the same entities can be linked.
- To tell a computer that one entity is the same as another entity, you need to use a Uniform Resource Identifier (URI).
- The process of matching a URI from an external authority to an entity in a dataset is called reconciliation.
- RDF triples can be “serialized” in a few different ways, including Turtle (TTL), RDF/XML, and JSON-LD.
- Data is considered to be true LOD when it follows a recognized LOD standard format, uses URIs to identify entities, and is published openly.
- Five-star LOD achieves ideal linked-ness and open-ness.
- The Semantic Web is the idea of extending the World Wide Web (WWW) so computers can make meaningful interpretations of data.
- LOD promotes interoperability, contextualization, and serendipity.
Resources
To learn more about LD, see the following resources:
- Berners-Lee (2009) “The Next Web” [Video]
- Blaney (2017) “Introduction to the Principles of Linked Open Data”
- Cambridge Semantics (2016) “An Introduction to the Semantic Web” [Video]
- Crompton (2020) “Linked Open Data: Understand it, Use it, Make it!” [Video]
- EuropeanaEU (2012) “Linked Open Data - What is it?” [Video]
- Fullstack Academy (2017) “RDF Tutorial—An Introduction to the Resource Description Framework” [Video]
- Herman (2005) “Tutorial on Semantic Web Technologies” [PowerPoint]
- Jonas (2021) “Introduction to LOD”
- McCrae (2020) “The Linked Open Data Cloud”
- Ontotext (2022) “What is an RDF Triplestore?”
- Ontotext (2022) “What is Inference?”
- Posner (2021) “What is Linked Open Data?” [Video]
- Sanderson (2021) “The Illusion of Grandeur: Trust and Belief in Cultural Heritage Linked Open Data” [Video]
- Uyi Idehen (2019) “What is the Linked Open Data Cloud, and Why is it Important?”
- W3C (2016) “Linked Data”
- 5 ★ Open Data (2015) “5 ★ Open Data”