sshoc-marketplace-documentation

(data) source ingestion workflow

See https://marketplace.sshopencloud.eu/about/implementation#data-ingestion-workflow for the generic workflow. The SSHOC deliverables D7.3 (p. 8) and D7.2 (p. 25 and following) also provide details about the process. The present documentation describes the ingestion workflow steps and their related documents.

1.1 Suggest and select a new source to harvest

Anyone can suggest a new source to harvest. Suggestions should be sent to the Editorial Board (EB) members and added to this internal waiting list: mappings - sources to MP data model.

Considering the limited resources available for source ingestion, a prioritisation of the sources of interest is performed. Based on the D7.3 report, five criteria are taken into account to evaluate and prioritise the sources:

  1. quality (they should be useful additions),
  2. uniqueness (they should not consist of items mostly already in the Marketplace),
  3. technical interface (they should have metadata that can be relatively easily mapped to the Marketplace data model and should be harvestable via an API),
  4. how much they will enhance the representativity of the various SSH domains within the Marketplace,
  5. how useful they will be to the users of the Marketplace.

Additionally, it is favourable if a new potential source brings with it possible contextualisation, i.e. relations to other items within itself or the Marketplace.

This prioritisation is performed by the Editorial Board, liaising with the PNSC and ACDH-CH teams in charge of running the DACE pipeline and/or ad-hoc ingestion scripts. For specific cases or difficulties in prioritising, the GB can also be contacted.

1.2 Define mappings

A mapping from the source data model to the Marketplace data model has to be devised in each case. This mapping is prepared in a spreadsheet by one of the Moderators and reviewed by at least one other. It represents the prescription for the custom ingestion pipeline. A new tab is opened in this Gsheet - mappings - sources to MP data model - based on the mapping template tab. Creating a mapping requires a good understanding of the MP data model, but also a good understanding of the source data. The moderator preparing the mapping should add as much information as possible in its mapping tab.
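As a purely hypothetical illustration (the source field names are invented and the MP targets are only indicative), a mapping tab pairs each source field with a target in the MP data model:

```python
# Hypothetical excerpt of a source-to-MP mapping, as it might be captured in a
# mapping tab. Source field names are invented; the MP targets are indicative
# and must be checked against the MP data model.
FIELD_MAPPING = {
    "title":        "label",         # source title -> MP item label
    "abstract":     "description",   # free-text description of the item
    "landing_page": "accessibleAt",  # URL where the item can be accessed
    "creator":      "contributors",  # mapped to actors and their roles
    "subject":      "keyword",       # mapped to concepts in a keyword vocabulary
}
```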

1.3 Implement a custom pipeline via DACE or custom script

Based on the conceptual mapping defined in the previous step, a decision is taken by the ingest team to either use the DACE pipeline or to develop/adapt ad-hoc ingest scripts.

DACE

DACE is a custom ETL (extract, transform, load) pipeline: it fetches the data from the source, iteratively processes all the items, maps the structured data from the source format to the MP data model, and ingests the transformed items, expressed as JSON objects, via the Marketplace API.
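The following minimal Python sketch illustrates this extract-transform-load shape (it is not DACE itself): the source endpoint and its field names are invented, the API base URL is an assumption, and the payload covers only a few fields of the MP data model.

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu"  # assumed MP API base URL
SOURCE_API = "https://example.org/api/records"      # hypothetical source endpoint

def transform(record: dict) -> dict:
    """Map one source record (invented field names) to an MP dataset payload."""
    return {
        "label": record["title"],
        "description": record.get("abstract", ""),
        "accessibleAt": [record["landing_page"]],
    }

def run_ingest(token: str) -> None:
    """Extract all records from the source, transform them, load them via the API."""
    headers = {"Authorization": f"Bearer {token}"}
    for record in requests.get(SOURCE_API, timeout=30).json()["records"]:  # extract
        payload = transform(record)                                        # transform
        resp = requests.post(f"{MP_API}/api/datasets",                     # load
                             json=payload, headers=headers, timeout=30)
        resp.raise_for_status()
```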

2 Data Ingestion and Update Workflow for SSHOC Marketplace

This guideline outlines the process for ingesting new data sources and updating existing records. All DACE ingests are designed as continuous ingests, i.e. they are run regularly and are able to modify MP entries.

2.1 Data Source Identification and Ingestion Process

To ensure a streamlined and consistent ingestion process, each new data source must follow these key steps (a sketch of the resulting flow follows the step list):

2.1.1 Search for Existing Entries

2.1.2 If No Entry is Found

2.1.3 If an Entry is Found

2.1.4 Suggestions for Review
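The following minimal sketch illustrates steps 2.1.1 to 2.1.3 for datasets, using the endpoints listed under "Key API Endpoints to Use" below. It assumes the pipeline remembers the MP persistentId of items created in earlier runs; authentication and payload contents are simplified.

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu"  # assumed MP API base URL

def upsert_dataset(payload: dict, persistent_id: str | None, token: str) -> str:
    """Create the dataset if no MP entry is found (2.1.2), else update it (2.1.3).

    persistent_id is the MP identifier remembered from a previous ingest run,
    or None for a record the pipeline has not seen before.
    """
    headers = {"Authorization": f"Bearer {token}"}
    if persistent_id:
        # 2.1.1: check whether the known entry still exists in the MP.
        found = requests.get(f"{MP_API}/api/datasets/{persistent_id}",
                             headers=headers, timeout=30)
        if found.status_code == 200:
            # 2.1.3: an entry is found -> update it in place.
            resp = requests.patch(f"{MP_API}/api/datasets/{persistent_id}",
                                  json=payload, headers=headers, timeout=30)
            resp.raise_for_status()
            return persistent_id
    # 2.1.2: no entry found -> create a new record.
    resp = requests.post(f"{MP_API}/api/datasets",
                         json=payload, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()["persistentId"]
```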

2.2 New Source and Keyword List Requirement

For each new source being ingested, a keyword list must be prepared and submitted before the ingest is first run on production. This list will be reviewed by moderators to ensure that only relevant, accurate, and appropriate keywords are created. The following steps must be followed for keyword creation (a sketch of the API call follows the list):

2.2.1 Keyword List Creation

2.2.2 Review of Keywords

2.2.3 API Integration for Keywords
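A minimal sketch of the keyword creation call (2.2.3), to be run only after the moderators have approved the list (2.2.2). The vocabulary code and the payload shape are assumptions to be checked against the MP API documentation.

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu"  # assumed MP API base URL
VOCABULARY = "sshoc-keyword"                        # assumed keyword vocabulary code

def create_keyword(label: str, token: str) -> None:
    """Register one approved keyword as a new concept (payload shape assumed)."""
    resp = requests.post(
        f"{MP_API}/api/vocabularies/{VOCABULARY}/concepts",
        json={"code": label.lower().replace(" ", "-"), "label": label},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()

# Only run after the keyword list has been reviewed and approved (2.2.2).
for keyword in ["text mining", "topic modelling"]:
    create_keyword(keyword, token="<ingest-token>")
```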

3 Data Validation and Testing

3.1 Initial Testing

3.2 Review and Refinement

3.3 Final Approval

4 Continuous Ingest

For sources that are expected to be updated regularly, configure the ingestion process to run continuously (a sketch of the update check follows the step list):

4.1 Capture source updates

4.2 Updates to Existing Entries

4.3 Ingestion Review and Approval
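A minimal sketch of the update path (4.1 and 4.2) for datasets. It assumes the ingest keeps a local state mapping each source record id to the MP persistentId and the timestamp of the version last ingested, and that the source exposes an ISO modification timestamp; all field names are invented. Records not yet known to the MP are created as described in section 2.1.

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu"  # assumed MP API base URL

def sync_record(record: dict, state: dict, token: str) -> None:
    """Push a source-side update to an existing MP entry (4.1 and 4.2).

    state maps source record ids to {"persistentId": ..., "modified": ...},
    where "modified" is the ISO timestamp of the version last ingested.
    """
    known = state.get(record["id"])
    if known is None or record["modified"] <= known["modified"]:
        return  # record not yet in the MP, or unchanged at the source
    resp = requests.patch(
        f"{MP_API}/api/datasets/{known['persistentId']}",
        json={"label": record["title"], "description": record.get("abstract", "")},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    known["modified"] = record["modified"]  # remember the ingested version
```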


Key API Endpoints to Use

  1. Search for Existing Record:
    • GET /api/workflows/{persistentId}
    • GET /api/publications/{persistentId}
    • GET /api/tools-services/{persistentId}
    • GET /api/training-materials/{persistentId}
    • GET /api/datasets/{persistentId}
  2. Create New Record:
    • POST /api/workflows
    • POST /api/publications
    • POST /api/datasets
    • POST /api/training-materials
    • POST /api/tools-services
  3. Update Existing Record:
    • PATCH /api/workflows/{persistentId}
    • PATCH /api/publications/{persistentId}
    • PATCH /api/datasets/{persistentId}
    • PATCH /api/training-materials/{persistentId}
    • PATCH /api/tools-services/{persistentId}
  4. Keyword Creation and Management:
    • POST /api/vocabularies/{vocabulary-code}/concepts (after approval)
  5. Continuous Ingest and Updates:
    • GET /api/workflows/{persistentId}/versions/{versionId}
    • GET /api/publications/{persistentId}/versions/{versionId}
    • GET /api/training-materials/{persistentId}/versions/{versionId}
    • GET /api/tools-services/{persistentId}/versions/{versionId}
    • GET /api/datasets/{persistentId}/versions/{versionId}
    • PATCH /api/workflows/{persistentId}
    • PATCH /api/publications/{persistentId}
    • PATCH /api/training-materials/{persistentId}
    • PATCH /api/datasets/{persistentId}
    • PATCH /api/tools-services/{persistentId}

Custom scripts

The custom ingestion scripts developed by Michael Kurzmeier follow a Python and Jupyter Notebook approach. They are not primarily meant for continuous ingest, and are well suited for the ingestion of a limited number of records (around 300 records for the sources processed so far). The scripts perform three main operations: source harvesting, transformation, and writing back to the SSH Open MP. Although the transformation and write-back steps have been developed to be as standardised (and reusable) as possible, the heterogeneity of potential new MP sources requires a completely new approach to the harvesting step for each source.
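Schematically (this is not the actual scripts; the endpoint and field names are invented), such a script separates the per-source harvesting step from the largely reusable transformation and write-back steps:

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu"  # assumed MP API base URL

def harvest() -> list[dict]:
    """Source-specific step, rewritten for every new source."""
    resp = requests.get("https://example.org/export.json", timeout=30)  # hypothetical
    resp.raise_for_status()
    return resp.json()["records"]

def transform(record: dict) -> dict:
    """Largely reusable step: apply the mapping spreadsheet to one record."""
    return {"label": record["title"], "description": record.get("abstract", "")}

def write_back(items: list[dict], token: str) -> None:
    """Largely reusable step: push the transformed items to the MP API."""
    headers = {"Authorization": f"Bearer {token}"}
    for item in items:
        requests.post(f"{MP_API}/api/datasets", json=item,
                      headers=headers, timeout=30).raise_for_status()

write_back([transform(r) for r in harvest()], token="<ingest-token>")
```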