Technical Development

LINCS leverages existing solutions and expands the Digital Research Alliance of Canada’s infrastructure to mobilize large-scale, heterogeneous datasets as Linked Open Data (LOD) for humanities research.

The source datasets are converted into LOD via LINCS’s creation tools, which mobilize, enrich, and interlink research data. This LOD is then stored in LINCS’s national Linked Data (LD) storage infrastructure, the triplestore.

The LINCS-built exploration tools allow users to filter, query, analyze, visualize, and annotate cultural materials that have been converted into LOD. Users can modify, evaluate, correct, or reject automated semantic enrichments for their own data and for the data of others.

The LINCS System Diagram is an overview of the LINCS technical system. It shows the path of source data as they move through the process of conversion and onward to storage and access.

LINCS provides links to and user documentation of various tools and interfaces for creating and converting LOD, ranging from accessible ones such as CWRC through to ones requiring greater technical knowledge such as Spyral. LINCS also provides links to technical documentation for the tools and related code used on the project, and for the tools it adopts or adapts.

Source Data

LINCS converts four main types of datasets.

The four main types of datasets that LINCS converts are: Canadian, with 5.1 million objects, 40.5 million pages, and 700k triples; Cultural, with 17 million texts; Research, with 80 TB of media, 2 million triples, and 300k texts; and Linked, a global general knowledge base with 3.4 billion triples.

Canadian

There are large bodies of content on the web related to Canadian culture and history: smaller sets of materials created by researchers and large digitized collections held by memory institutions like Canadiana and Library and Archives Canada. We need better ways of accessing this content.

Cultural

There are also millions of books, periodicals, and other materials that have been digitized by groups such as the Internet Archive, HathiTrust, and Project Gutenberg, as well as much native web content relevant to cultural research. LINCS provides new ways of discovering and using these kinds of materials.

Linked

LINCS builds on much existing work that has been done towards an open, semantically structured web, from W3C standards and established ontologies to major community projects such as DBpedia and Wikidata. We aim to strengthen the LOD ecology through high quality open content and open-source tools.

Research

Datasets carefully curated by Canadian researchers are at the core of LINCS, which mobilizes this material and interlinks it to other related content. The source datasets are rich and diverse, as are the research themes that are developed by linking them.

Conversion

LINCS converts existing data into LOD by extracting entities and relationships from heterogeneous datasets. This process involves both significant data conversion and tool adaptation. LINCS supports conversion from the most common formats used by the humanities research community: structured, TEI, and natural language. See our conversion workflows for more information.

The processes involved are:

  • Detect entities at the heart of human history and culture
  • Reconcile or link entities to one or more records of those entities (if available) stored in reference knowledge graphs from the LOD cloud
  • Create relationships based on ontologies between entities, either by mapping from the existing structure within the source materials or by using Natural Language Processing (NLP) and machine learning to detect them
  • Validate the results to ensure sufficient precision, where resources and expertise exist
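The four steps above can be sketched as a small pipeline. This is only an illustrative toy under stated assumptions, not the LINCS implementation: the `REFERENCE_GRAPH` dictionary, the `example.org` URIs, the `bornIn` predicate, and all function names are hypothetical. Substring and regular-expression matching stand in for real NER and NLP, and the `validate` step is a simple sanity check rather than the expert vetting described above.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a reference knowledge graph from the LOD cloud.
# Labels and example.org URIs are invented for illustration.
REFERENCE_GRAPH = {
    "Margaret Atwood": "http://example.org/entity/margaret-atwood",
    "Ottawa": "http://example.org/entity/ottawa",
}
BORN_IN = "http://example.org/ontology/bornIn"  # hypothetical predicate

# Toy pattern standing in for NLP-based relationship detection.
BIRTH_PATTERN = re.compile(
    r"(?P<person>[A-Z][\w ]+?) was born in (?P<place>[A-Z][\w ]+?)\."
)


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


def detect_entities(text):
    """Step 1: detect mentions (plain substring match, not real NER)."""
    return [label for label in REFERENCE_GRAPH if label in text]


def reconcile(labels):
    """Step 2: link each mention to a reference record, if one is available."""
    return {label: REFERENCE_GRAPH[label] for label in labels if label in REFERENCE_GRAPH}


def create_relationships(text, links):
    """Step 3: create triples only between entities that were reconciled."""
    triples = []
    for match in BIRTH_PATTERN.finditer(text):
        person = links.get(match.group("person"))
        place = links.get(match.group("place"))
        if person and place:
            triples.append(Triple(person, BORN_IN, place))
    return triples


def validate(triples):
    """Step 4: drop duplicates and self-referential statements."""
    seen, kept = set(), []
    for triple in triples:
        if triple.subject != triple.obj and triple not in seen:
            seen.add(triple)
            kept.append(triple)
    return kept


def convert(text):
    """Run the four steps in order on one piece of text."""
    links = reconcile(detect_entities(text))
    return validate(create_relationships(text, links))


if __name__ == "__main__":
    for triple in convert("Margaret Atwood was born in Ottawa."):
        print(triple)
```

Keeping each step as a separate function mirrors the idea that detection, reconciliation, relationship creation, and validation can be swapped out or tuned independently.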

Data Conversion

To quickly mobilize a large set of relevant data, LINCS prioritizes dataset conversion, starting with the most LOD-ready content. Core researcher datasets receive the fullest processing, including human vetting. Others are processed automatically, with confidence levels set for high precision to minimize false positives. The resulting millions of triples make these materials immediately accessible, poised for vetting as researchers engage with them.
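As a rough illustration of what setting confidence levels for high precision can mean in practice, the sketch below accepts only candidate links whose score meets a threshold and sets the rest aside for later vetting. The `Candidate` type, the `split_by_confidence` function, the example URIs, and the 0.9 threshold are all assumptions made for this example, not values used by LINCS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    mention: str       # text span found in the source
    uri: str           # proposed entity link (hypothetical example.org URIs below)
    confidence: float  # score between 0 and 1 from a hypothetical matcher


def split_by_confidence(candidates, threshold=0.9):
    """Accept candidates at or above the threshold; defer the rest for vetting.

    A higher threshold trades recall for precision: fewer links are
    accepted automatically, but fewer of them are false positives.
    """
    accepted = [c for c in candidates if c.confidence >= threshold]
    deferred = [c for c in candidates if c.confidence < threshold]
    return accepted, deferred


if __name__ == "__main__":
    candidates = [
        Candidate("Ottawa", "http://example.org/entity/ottawa", 0.97),
        Candidate("Victoria", "http://example.org/entity/victoria-bc", 0.55),
    ]
    accepted, deferred = split_by_confidence(candidates)
    print("accepted:", [c.mention for c in accepted])
    print("deferred:", [c.mention for c in deferred])
```

Deferred candidates are kept rather than discarded, so they remain available for vetting as researchers engage with the data.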

Conversion is of two types. The translation of a relational dataset into the Resource Description Framework (RDF) format of the Semantic Web maps from existing structures and points back to an active or archived source dataset. The extraction of LOD from a dataset composed of natural language creates RDF that points back to the source on the web. In both cases, LINCS data conversion tools track the provenance of the data for scholarly purposes.
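A minimal sketch of the first type, translating one relational row into RDF while pointing back to its source, might look like the following. It writes N-Triples lines using only the Python standard library. The row layout, the `example.org` URIs, and the function names are hypothetical, and the choice of `rdfs:label` and the W3C PROV `wasDerivedFrom` property is an assumption for illustration; the text above does not say which vocabularies LINCS uses to record provenance.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
PROV_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"


def literal(value):
    """Format a value as an N-Triples string literal with basic escaping."""
    escaped = (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def row_to_ntriples(row, base_uri, source_uri):
    """Map one relational row to N-Triples lines.

    The row's primary key becomes part of the subject URI, the name column
    becomes a label, and a provenance triple points back to the source.
    """
    subject = f"<{base_uri}{row['id']}>"
    return [
        f"{subject} <{RDFS_LABEL}> {literal(row['name'])} .",
        f"{subject} <{PROV_DERIVED_FROM}> <{source_uri}> .",
    ]


if __name__ == "__main__":
    # Hypothetical row and URIs, for illustration only.
    row = {"id": 42, "name": "Jane Doe"}
    for line in row_to_ntriples(
        row,
        base_uri="http://example.org/person/",
        source_uri="http://example.org/source/people-table",
    ):
        print(line)
```

Emitting the provenance triple alongside each converted record is one simple way to keep every statement traceable to the dataset it came from.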

LINCS reuses existing ontologies and vocabularies where possible, building on existing LD work and seeking out domain-specific vocabularies to incorporate and link in. It looks to the best practices established by large projects, such as Europeana, the Digital Public Library of America (DPLA), and Linked Data for Production, and by cultural heritage providers in Canada who are experimenting with LOD. LINCS works to ensure that its ontologies can represent non-hegemonic epistemologies and push alternative knowledge representations into the Semantic Web. As such, LINCS ontologies are selected, adopted, and developed with an attention to intersectionality, multiplicity, and difference. For more information, see Ontologies.

Tool Adaptation

LINCS adopts standard algorithms for NLP and entity matching for its conversion processes and tools, and builds on methods used by other large-scale LOD conversion projects including Linked Data for Production. Existing tools and processes are adapted to convert datasets to Semantic Web statements (triples that use RDF).

LINCS builds on award-winning algorithms developed in Alberta that perform Named Entity Recognition (NER) and Named Entity Disambiguation (NED) with respect to one or more project knowledge graphs, incorporate hand-tagged datasets as training data for the models, and provide an interface for tuning parameters. LINCS conversion tools are generic, modular, and work with several open-source algorithms. LINCS also adopts or adapts several existing workflows and tools for data cleanup and vetting, including interfaces suitable for subject matter experts.

Infrastructure

LINCS is building a national LOD store for the dissemination of the data it converts. A triplestore system, able to support billions of triples, houses the large RDF datasets. The storage environment was selected to ensure compatibility with participating datasets and to support integrating ontologies, tuning the inference functionality, and installing the Application Programming Interfaces (APIs) that enable trusted data providers to push in data on a regular basis.
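For readers unfamiliar with triplestores, the toy sketch below shows the basic idea: data is held as subject, predicate, object statements and retrieved by matching patterns in which any position may be left open. It is an in-memory illustration only and says nothing about the product LINCS uses. The `match` function and the `example.org` URIs are hypothetical, and a real triplestore adds indexing, inference, and a query language such as SPARQL.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


if __name__ == "__main__":
    # Hypothetical statements using invented example.org URIs.
    store = [
        ("http://example.org/entity/margaret-atwood",
         "http://example.org/ontology/bornIn",
         "http://example.org/entity/ottawa"),
        ("http://example.org/entity/margaret-atwood",
         "http://example.org/ontology/wrote",
         "http://example.org/work/surfacing"),
    ]
    # Everything stated about one subject:
    for triple in match(store, s="http://example.org/entity/margaret-atwood"):
        print(triple)
```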

LINCS infrastructure is hosted on the Digital Research Alliance of Canada cloud. It contains platforms for high performance computing with Apache Spark, multiple web services deployed on Kubernetes, and data storage with an S3-compatible service. LINCS also consults with national research data preservation initiatives regarding long-term data management. The project has established its code repository on GitLab with a continuous integration/continuous deployment pipeline.

LINCS relies on partners in the digital ecosystem for source data storage, management, and preservation. Source collections may be hosted in a range of ways to achieve stable URLs for LINCS metadata: through institutional repositories, stable research sites, or the Internet Archive.

Access

LINCS provides access to the converted data through its exploration tools. Search results are also available in various list formats as well as a graph visualization.

Where possible, tools are implemented as stand-alone web services, with APIs to support third-party use. Components are modular: able to function individually, integrated within a workflow, or built into another system. This architecture supports further tool development for the wide range of use cases emerging from the academy and beyond. Code is open-source. Open design principles allow others to build interfaces for their own data and plug LINCS tools into other environments.

Access to LINCS Data

LINCS is both generic, in permitting different query types on data converted from very different sources, and precise, in allowing researchers to drill down to specific domain vocabularies and highly specialized subsets of content. LINCS builds its access plans on successful models, such as the Social Networks and Archival Context project for an access interface, the DPLA for a developer interface, and the Humanities Networked Infrastructure interface for engaging users with LD.

Through LINCS, Canadian researchers have unparalleled access to cultural heritage content. This includes copyrighted data. Because our published record is so young, much of it is still protected, and the lack of access to digital collections for analysis has significantly impeded research on Canadian culture. LINCS is able to elucidate, for instance, the massive but protected HathiTrust Digital Library datasets. The exploration tools ensure data mobilization across the full spectrum of mainstream and technical researchers.