See https://marketplace.sshopencloud.eu/about/implementation#data-ingestion-workflow for the generic workflow. SSHOC D7.3 (p. 8) and D7.2 (p. 25 and following) also provide details about the process. The present documentation describes the ingestion workflow steps and their related documents.
Anyone can suggest a new source to harvest. Suggestions should be sent to the EB members and added to this internal waiting list: mappings - sources to MP data model.
Considering the limited resources available for source ingestion, the sources of interest are prioritised. Based on the D7.3 report, five criteria are taken into account to evaluate and prioritise the sources:
This prioritisation is performed by the Editorial Board, liaising with the PSNC and ACDH-CH teams in charge of running the DACE pipeline and/or the ad-hoc ingestion scripts. For specific cases, or when prioritisation proves difficult, the GB can also be contacted.
A mapping from the source data model to the Marketplace data model has to be devised in each case. This mapping is prepared in a spreadsheet by one of the Moderators and reviewed by at least one other; it represents the prescription for the custom ingestion pipeline. A new tab is opened in this Gsheet - mappings - sources to MP data model - based on the mapping template tab. Creating a mapping requires a good understanding of the MP data model; the data model and properties tabs of the mapping spreadsheet are useful resources in that regard. A good understanding of the source data is also needed: the moderator preparing the mapping should add as much information as possible in the source's mapping tab.
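At its core, a conceptual mapping records which source field feeds which MP property. As a purely illustrative sketch (the source field names below are invented; the authoritative columns are those of the mapping template tab):

```python
# Hypothetical conceptual mapping from invented source fields to MP properties.
# The authoritative mapping lives in the spreadsheet; this only illustrates the idea.
FIELD_MAPPING = {
    "title": "label",                # source title    -> MP item label
    "abstract": "description",       # source abstract -> MP item description
    "landing_page": "accessibleAt",  # source URL      -> MP accessibleAt
    "subject": "keyword",            # source subjects -> MP keyword concepts
}
```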
Based on the conceptual mapping defined in the previous step, a decision is taken by the ingest team to either use the DACE pipeline or to develop/adapt ad-hoc ingest scripts.
DACE is a custom ETL (extract, transform, load) pipeline, which fetches the data from the source, iteratively processes all items, maps the structured source data to the MP data model, and ingests the transformed items as JSON objects via the API of the Marketplace.
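A minimal Python sketch of that extract-transform-load pattern (this is not the DACE code itself; the API base URL, authentication header and field names are assumptions):

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu/api"  # assumed base URL

def transform(record: dict) -> dict:
    """Map one harvested source record (invented field names) to an MP JSON object."""
    return {
        "label": record["title"],
        "description": record.get("abstract", ""),
        "accessibleAt": [record["landing_page"]],
    }

def run_ingest(records: list[dict], token: str) -> None:
    """Iterate over all harvested records and push them to the MP API."""
    headers = {"Authorization": f"Bearer {token}"}  # assumed auth scheme
    for record in records:
        item = transform(record)
        # POST to the endpoint matching the item type, e.g. datasets.
        requests.post(f"{MP_API}/datasets", json=item, headers=headers).raise_for_status()
```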
This guideline outlines the process for ingesting new data sources and updating existing records. All DACE ingests are designed as continuous ingests, i.e. they are run regularly and are able to modify MP entries.
To ensure a streamlined and consistent ingestion process, each new data source must follow these key steps:
- Check whether an item already exists: `GET /api/workflows/{persistentId}`, `GET /api/publications/{persistentId}`, or similar endpoints based on the data type.
- Create a new item: `POST /api/workflows`, `POST /api/publications`, or the relevant endpoint based on the data type.
- Update an existing item: `PATCH /api/workflows/{persistentId}`, `PATCH /api/publications/{persistentId}`, or the appropriate endpoint.
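A sketch of this check/create/update logic for publications, assuming the ingest tracks the persistentId of previously ingested items and that the create response returns the new item with its persistentId (base URL and authentication header are assumptions):

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu/api"  # assumed base URL

def upsert_publication(item: dict, persistent_id: str | None, token: str) -> str:
    """Update an existing publication if it is still present, otherwise create it."""
    headers = {"Authorization": f"Bearer {token}"}  # assumed auth scheme
    if persistent_id:
        check = requests.get(f"{MP_API}/publications/{persistent_id}", headers=headers)
        if check.status_code == 200:
            # Item already exists: update it in place.
            resp = requests.patch(f"{MP_API}/publications/{persistent_id}",
                                  json=item, headers=headers)
            resp.raise_for_status()
            return persistent_id
    # Never ingested (or no longer found): create a new item.
    resp = requests.post(f"{MP_API}/publications", json=item, headers=headers)
    resp.raise_for_status()
    return resp.json()["persistentId"]
```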
For each new source being ingested, a keyword list must be prepared and submitted before the ingest is first run on production. This list will be reviewed by moderators to ensure that only relevant, accurate, and appropriate keywords are created. The following steps must be followed for keyword creation:
- Use `GET /api/workflows/{persistentId}`, `GET /api/publications/{persistentId}`, etc., to verify that the correct data is being retrieved.

For sources that are expected to be updated regularly, configure the ingestion process to run continuously:
- Retrieve existing items:
  - `GET /api/workflows/{persistentId}`
  - `GET /api/publications/{persistentId}`
  - `GET /api/tools-services/{persistentId}`
  - `GET /api/training-materials/{persistentId}`
  - `GET /api/datasets/{persistentId}`
- Create new items:
  - `POST /api/workflows`
  - `POST /api/publications`
  - `POST /api/datasets`
  - `POST /api/training-materials`
  - `POST /api/tools-services`
- Update existing items:
  - `PATCH /api/workflows/{persistentId}`
  - `PATCH /api/publications/{persistentId}`
  - `PATCH /api/datasets/{persistentId}`
  - `PATCH /api/training-materials/{persistentId}`
  - `PATCH /api/tools-services/{persistentId}`
- Create new keyword concepts (after approval):
  - `POST /api/vocabularies/{vocabulary-code}/concepts`
- Retrieve a specific version of an item:
  - `GET /api/workflows/{persistentId}/versions/{versionId}`
  - `GET /api/publications/{persistentId}/versions/{versionId}`
  - `GET /api/training-materials/{persistentId}/versions/{versionId}`
  - `GET /api/tools-services/{persistentId}/versions/{versionId}`
  - `GET /api/datasets/{persistentId}/versions/{versionId}`
- Update existing items:
  - `PATCH /api/workflows/{persistentId}`
  - `PATCH /api/publications/{persistentId}`
  - `PATCH /api/training-materials/{persistentId}`
  - `PATCH /api/datasets/{persistentId}`
  - `PATCH /api/tools-services/{persistentId}`
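A condensed sketch of how a continuous ingest might combine these endpoints; the payload fields for concept creation and the change-detection logic are illustrative only and should be checked against the MP API documentation:

```python
import requests

MP_API = "https://marketplace-api.sshopencloud.eu/api"  # assumed base URL

def propose_keyword(vocabulary_code: str, code: str, label: str, token: str) -> dict:
    """Suggest a new concept in the given vocabulary (it only appears after approval)."""
    resp = requests.post(
        f"{MP_API}/vocabularies/{vocabulary_code}/concepts",
        json={"code": code, "label": label},  # illustrative payload fields
        headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
    )
    resp.raise_for_status()
    return resp.json()

def refresh_dataset(persistent_id: str, updated_item: dict, token: str) -> None:
    """Re-fetch a dataset and patch it only if the source record has changed."""
    headers = {"Authorization": f"Bearer {token}"}
    current = requests.get(f"{MP_API}/datasets/{persistent_id}", headers=headers)
    current.raise_for_status()
    if current.json().get("label") != updated_item.get("label"):
        requests.patch(f"{MP_API}/datasets/{persistent_id}",
                       json=updated_item, headers=headers).raise_for_status()
```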
The custom ingestion scripts developed by Michael Kurzmeier follow a Python and Jupyter Notebooks approach. They are not primarily meant for continuous ingest and are well suited to ingesting a limited number of records (i.e. ~300 records for the sources processed so far). The scripts perform three main operations: source harvesting, transformation, and writing back to the SSH Open MP. Although the transformation and write-back steps have been developed to be as standardised (and reusable) as possible, the heterogeneity of potential new MP sources means that a completely new approach to the harvesting step has to be taken for each source.
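In practice, only the harvesting function has to be rewritten for each new source, while the transformation and write-back steps can follow the reusable patterns sketched above. A schematic example with an invented source API:

```python
import requests

def harvest_example_source() -> list[dict]:
    """Source-specific harvesting step, rewritten for every new source.
    The URL and response structure below are invented for illustration."""
    resp = requests.get("https://example.org/api/records?format=json")
    resp.raise_for_status()
    return resp.json().get("records", [])

# The harvested records would then go through the reusable transformation and
# write-back steps (see the transform/upsert sketches earlier in this document).
```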