sshoc-marketplace-documentation

(data) source ingestion workflow

See https://marketplace.sshopencloud.eu/about/implementation#data-ingestion-workflow for the generic workflow. The SSHOC deliverables D7.3 (p. 8) and D7.2 (p. 25 and following) also provide details about the process. The present documentation describes the ingestion workflow steps and their related documents.

1.1 Suggest and select a new source to harvest

Anyone can suggest a new source to harvest. Suggestions should be sent to the Editorial Board (EB) members and added to this internal waiting list: mappings - sources to MP data model.

Considering the limited resources available for source ingestion, a prioritisation of the sources of interest is performed. Based on the D7.3 report, five criteria are taken into account to evaluate and prioritise the sources:

  1. quality (they should be useful additions),
  2. uniqueness (they should not consist of items mostly already in the Marketplace),
  3. technical interface (they should have metadata that can be relatively easily mapped to the Marketplace data model and should be harvestable via an API),
  4. how much they will enhance the representativity of the various SSH domains within the Marketplace,
  5. how useful they will be to the users of the Marketplace.

Additionally, it is favourable if a new potential source brings with it possible contextualisation, i.e. relations to other items within itself or the Marketplace.

This prioritisation is performed by the Editorial Board, liaising with the PNSC and ACDH-CH teams in charge of running the DACE pipeline and/or ad-hoc ingestion scripts. For specific cases or difficulties in prioritising, the GB can also be contacted.

1.2 Define mappings

A mapping from the source data model to the Marketplace data model has to be devised in each case. This mapping is prepared in a spreadsheet by one of the Moderators and reviewed by at least one other. It represents the prescription for the custom ingestion pipeline. A new tab is opened in this Gsheet - mappings - sources to MP data model - based on the mapping template tab. Creating a mapping requires a good understanding of the MP data model, but also a good understanding of the source data. The moderator preparing the mapping should add as much information as possible in its mapping tab.
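As a purely hypothetical illustration (the source field names are invented and the MP targets are only indicative), a mapping tab pairs each source field with a target in the MP data model:

```python
# Hypothetical excerpt of a source-to-MP mapping, as it might be captured in a
# mapping tab. Source field names are invented; the MP targets are indicative
# and must be checked against the MP data model.
FIELD_MAPPING = {
    "title":        "label",         # source title -> MP item label
    "abstract":     "description",   # free-text description of the item
    "landing_page": "accessibleAt",  # URL where the item can be accessed
    "creator":      "contributors",  # mapped to actors and their roles
    "subject":      "keyword",       # mapped to concepts in a keyword vocabulary
}
```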

1.3 Implement a custom pipeline via DACE or custom script

Based on the conceptual mapping defined in the previous step, a decision is taken by the ingest team to either use the DACE pipeline or to develop/adapt ad-hoc ingest scripts.

DACE

DACE is a custom ETL (extract, transform, load) pipeline: it fetches the data from the source, iteratively processes all the items, maps the structured data from the source format to the MP data model, and ingests the transformed items, expressed as JSON objects, via the Marketplace API.
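The following minimal Python sketch illustrates this extract-transform-load shape (it is not DACE itself): the source endpoint and its field names are invented, the API base URL is an assumption, and the payload covers only a few fields of the MP data model.

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu"  # assumed MP API base URL
SOURCE_API = "https://example.org/api/records"      # hypothetical source endpoint

def transform(record: dict) -> dict:
    """Map one source record (invented field names) to an MP dataset payload."""
    return {
        "label": record["title"],
        "description": record.get("abstract", ""),
        "accessibleAt": [record["landing_page"]],
    }

def run_ingest(token: str) -> None:
    """Extract all records from the source, transform them, load them via the API."""
    headers = {"Authorization": f"Bearer {token}"}
    for record in requests.get(SOURCE_API, timeout=30).json()["records"]:  # extract
        payload = transform(record)                                        # transform
        resp = requests.post(f"{MP_API}/api/datasets",                     # load
                             json=payload, headers=headers, timeout=30)
        resp.raise_for_status()
```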

2 Data Ingestion and Update Workflow for SSHOC Marketplace

This guideline outlines the process for ingesting new data sources and updating existing records. All DACE ingests are designed as continuous ingests, i.e. they are run regularly and are able to modify MP entries.

2.1 Data Source Identification and Ingestion Process

To ensure a streamlined and consistent ingestion process, each new data source must follow these key steps (a sketch of the resulting flow follows the step list):

2.1.1 Search for Existing Entries

2.1.2 If No Entry is Found

2.1.3 If an Entry is Found

2.1.4 Suggestions for Review
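The following minimal sketch illustrates steps 2.1.1 to 2.1.3 for datasets, using the endpoints listed under "Key API Endpoints to Use" below. It assumes the pipeline remembers the MP persistentId of items created in earlier runs; authentication and payload contents are simplified.

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu"  # assumed MP API base URL

def upsert_dataset(payload: dict, persistent_id: str | None, token: str) -> str:
    """Create the dataset if no MP entry is found (2.1.2), else update it (2.1.3).

    persistent_id is the MP identifier remembered from a previous ingest run,
    or None for a record the pipeline has not seen before.
    """
    headers = {"Authorization": f"Bearer {token}"}
    if persistent_id:
        # 2.1.1: check whether the known entry still exists in the MP.
        found = requests.get(f"{MP_API}/api/datasets/{persistent_id}",
                             headers=headers, timeout=30)
        if found.status_code == 200:
            # 2.1.3: an entry is found -> update it in place.
            resp = requests.patch(f"{MP_API}/api/datasets/{persistent_id}",
                                  json=payload, headers=headers, timeout=30)
            resp.raise_for_status()
            return persistent_id
    # 2.1.2: no entry found -> create a new record.
    resp = requests.post(f"{MP_API}/api/datasets",
                         json=payload, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()["persistentId"]
```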

2.2 New Source and Keyword List Requirement

For each new source being ingested, a keyword list must be prepared and submitted before the ingest is first run on production. This list will be reviewed by moderators to ensure that only relevant, accurate, and appropriate keywords are created. The following steps must be followed for keyword creation (a sketch of the API call follows the list):

2.2.1 Keyword List Creation

2.2.2 Review of Keywords

2.2.3 API Integration for Keywords
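A minimal sketch of the keyword creation call (2.2.3), to be run only after the moderators have approved the list (2.2.2). The vocabulary code and the payload shape are assumptions to be checked against the MP API documentation.

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu"  # assumed MP API base URL
VOCABULARY = "sshoc-keyword"                        # assumed keyword vocabulary code

def create_keyword(label: str, token: str) -> None:
    """Register one approved keyword as a new concept (payload shape assumed)."""
    resp = requests.post(
        f"{MP_API}/api/vocabularies/{VOCABULARY}/concepts",
        json={"code": label.lower().replace(" ", "-"), "label": label},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()

# Only run after the keyword list has been reviewed and approved (2.2.2).
for keyword in ["text mining", "topic modelling"]:
    create_keyword(keyword, token="<ingest-token>")
```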

3 Data Validation and Testing

3.1 Initial Testing

3.2 Review and Refinement

3.3 Final Approval

4 Continuous Ingest

For sources that are expected to be updated regularly, configure the ingestion process to run continuously (a sketch of the update check follows the step list):

4.1 Capture source updates

4.2 Updates to Existing Entries

4.3 Ingestion Review and Approval
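A minimal sketch of the update path (4.1 and 4.2) for datasets. It assumes the ingest keeps a local state mapping each source record id to the MP persistentId and the timestamp of the version last ingested, and that the source exposes an ISO modification timestamp; all field names are invented. Records not yet known to the MP are created as described in section 2.1.

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu"  # assumed MP API base URL

def sync_record(record: dict, state: dict, token: str) -> None:
    """Push a source-side update to an existing MP entry (4.1 and 4.2).

    state maps source record ids to {"persistentId": ..., "modified": ...},
    where "modified" is the ISO timestamp of the version last ingested.
    """
    known = state.get(record["id"])
    if known is None or record["modified"] <= known["modified"]:
        return  # record not yet in the MP, or unchanged at the source
    resp = requests.patch(
        f"{MP_API}/api/datasets/{known['persistentId']}",
        json={"label": record["title"], "description": record.get("abstract", "")},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    known["modified"] = record["modified"]  # remember the ingested version
```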


Key API Endpoints to Use

  1. Search for Existing Record:
    • GET /api/workflows/{persistentId}
    • GET /api/publications/{persistentId}
    • GET /api/tools-services/{persistentId}
    • GET /api/training-materials/{persistentId}
    • GET /api/datasets/{persistentId}
  2. Create New Record:
    • POST /api/workflows
    • POST /api/publications
    • POST /api/datasets
    • POST /api/training-materials
    • POST /api/tools-services
  3. Update Existing Record:
    • PATCH /api/workflows/{persistentId}
    • PATCH /api/publications/{persistentId}
    • PATCH /api/datasets/{persistentId}
    • PATCH /api/training-materials/{persistentId}
    • PATCH /api/tools-services/{persistentId}
  4. Keyword Creation and Management:
    • POST /api/vocabularies/{vocabulary-code}/concepts (after approval)
  5. Continuous Ingest and Updates:
    • GET /api/workflows/{persistentId}/versions/{versionId}
    • GET /api/publications/{persistentId}/versions/{versionId}
    • GET /api/training-materials/{persistentId}/versions/{versionId}
    • GET /api/tools-services/{persistentId}/versions/{versionId}
    • GET /api/datasets/{persistentId}/versions/{versionId}
    • PATCH /api/workflows/{persistentId}
    • PATCH /api/publications/{persistentId}
    • PATCH /api/training-materials/{persistentId}
    • PATCH /api/datasets/{persistentId}
    • PATCH /api/tools-services/{persistentId}

Custom scripts

The custom ingestion scripts developed by Michael Kurzmeier follow a Python and Jupyter Notebook approach. They are not primarily meant for continuous ingest, and are well suited for the ingestion of a limited number of records (around 300 records for the sources processed so far). The scripts perform three main operations: source harvesting, transformation, and writing back to the SSH Open MP. Although the transformation and write-back steps have been developed to be as standardised (and reusable) as possible, the heterogeneity of potential new MP sources requires a completely new approach to the harvesting step for each source.
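Schematically (this is not the actual scripts; the endpoint and field names are invented), such a script separates the per-source harvesting step from the largely reusable transformation and write-back steps:

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu"  # assumed MP API base URL

def harvest() -> list[dict]:
    """Source-specific step, rewritten for every new source."""
    resp = requests.get("https://example.org/export.json", timeout=30)  # hypothetical
    resp.raise_for_status()
    return resp.json()["records"]

def transform(record: dict) -> dict:
    """Largely reusable step: apply the mapping spreadsheet to one record."""
    return {"label": record["title"], "description": record.get("abstract", "")}

def write_back(items: list[dict], token: str) -> None:
    """Largely reusable step: push the transformed items to the MP API."""
    headers = {"Authorization": f"Bearer {token}"}
    for item in items:
        requests.post(f"{MP_API}/api/datasets", json=item,
                      headers=headers, timeout=30).raise_for_status()

write_back([transform(r) for r in harvest()], token="<ingest-token>")
```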