Paul van Genuchten
DevOps engineer at ISRIC - World Soil Information. We maintain a range of datasets and catalogues related to the global distribution of soil properties (chemical, physical and biological).
Sessions
Metadata, YAML files and pipelines? When I try to convince my colleagues that the approach described in this presentation is fun, they look at me as if I were an alien.
This presentation will highlight the use of pygeometa, mdme and DevOps workflows in two projects from different domains.
Land-Soil-Crop data
ISRIC endorses the pygeometa MCF format, a YAML-based representation originally developed as a subset of ISO 19115 metadata and advertised by the pygeometa community as 'Metadata Creation for the Rest of Us'. YAML reads much better than XML and is well suited to content versioning in Git. But YAML comes with its peculiarities, such as strict indentation and reserved characters.
'Average users should not look at code; they should use shiny (web) interfaces' is an often-heard quote, but we rarely reverse it: "As a DevOps engineer I hate shiny interfaces. I want to look at code, see the history of that code, who changed what and when, and how I can fix it".
This is where the fun part of pygeometa MCF comes in. CI/CD pipelines run on every content change, validate the YAML and report errors to the submitters.
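As a minimal sketch of the kind of check such a pipeline could run, the snippet below flags indentation problems in a YAML text before it reaches a real parser. It is not the pygeometa validator: the function name `lint_yaml_indentation` and the two-space house style are assumptions made for illustration, and a production pipeline would normally rely on a YAML parser and schema validation instead. The check is also deliberately naive, for example it does not treat multi-line block scalars specially.

```python
def lint_yaml_indentation(text, indent=2):
    """Return (line_number, message) pairs for suspicious indentation.

    Hypothetical helper: YAML forbids tabs for indentation; the
    "multiple of `indent` spaces" rule is an assumed house style.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments carry no structure
        leading = line[: len(line) - len(stripped)]
        if "\t" in leading:
            problems.append((number, "tab used for indentation"))
        elif len(leading) % indent != 0:
            problems.append(
                (number, f"indentation is not a multiple of {indent} spaces")
            )
    return problems


# Illustrative input: line 3 uses a tab, line 4 uses three spaces.
sample = "metadata:\n  identifier: demo\n\tlanguage: en\n   charset: utf8\n"

for number, message in lint_yaml_indentation(sample):
    print(f"line {number}: {message}")
```

In a CI job, a non-empty result would typically fail the build and be posted back to the submitter as a comment on the change.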
Should we then fully neglect the basic user? Of course not! So we crafted web-based forms that generate MCF (osgeo.github.io/mdme) and added import options for Excel sheets (every column is a metadata field). Consider that many data scientists (fortunately) are used to placing a README.md in any project folder. We just ask them to structure the content using YAML. We added an inheritance mechanism, so common properties (contact details, usage constraints) are entered only once and inherited by lower levels in the folder hierarchy. Embedded metadata (bounds, projection, format) is also extracted from data files or online sources.
All this metadata is crawled into a central search index (pycsw/pygeoapi/GeoNetwork). To increase the participatory experience we added 'Edit me on Git' links to each of the records, which bring users back to the original MCF file to suggest changes.
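The inheritance idea can be sketched as a recursive merge of parsed metadata, where a lower-level folder only states what differs from its parent. This is an illustration under assumptions, not the actual implementation: the function name `inherit` is hypothetical, and the keys and values in the example dictionaries are made up for the demonstration.

```python
import copy


def inherit(parent, child):
    """Return a new dict in which child values override parent values.

    Nested dicts are merged key by key; any other value in the child
    replaces the parent's value. Neither input is modified.
    """
    merged = copy.deepcopy(parent)
    for key, value in child.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = inherit(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative content: shared properties at project level,
# dataset-specific properties one folder lower.
project_level = {
    "contact": {"pointOfContact": {"organization": "Example Institute"}},
    "identification": {"rights": "CC-BY-4.0"},
}
dataset_level = {"identification": {"title": "Soil pH, 0-5 cm"}}

record = inherit(project_level, dataset_level)
print(record["identification"])
print(record["contact"])
```

The dataset record ends up with its own title plus the rights and contact details stated once at project level.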
Weather/climate/water metadata
The WMO Information System (WIS2) is the next-generation data exchange infrastructure for real-time and archived weather/climate/water data. Discovery metadata is a key component for cataloguing and discovery. In an event-driven architecture, metadata files are managed on GitHub; each change triggers a CI/CD workflow that generates compliant WMO discovery metadata, validates it and publishes it to an MQTT broker.
At ISRIC - World Soil Information we increasingly maintain our data services through CI/CD pipelines configured via Git, from both the service and the content perspective. The starting point is the set of metadata records of our datasets, stored in Git. With every change to a record, the relevant catalogues (pycsw) and web services (MapServer) are updated.
These pipelines are reproducible, and there are never inconsistencies between catalogue content and the services. On top of that, our users can directly report issues (or even suggest improvements) through Git.
The stack is built on proven OSGeo components. A tool, pyGeoDataCrawler, brings the power of GDAL and pygeometa to CI/CD scripting. It crawls the files in a folder and extracts relevant metadata, then prepares a MapServer configuration for that folder, while updating the metadata with the relevant service URLs.
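The crawling step can be pictured with the following standard-library sketch. It is not pyGeoDataCrawler and does not call GDAL: the function name `crawl`, the `FORMATS` table and the record keys are all assumptions chosen for illustration, and only the file name, relative path, extension and size are read. Extracting bounds or projection would require a library such as GDAL.

```python
import os

# Hypothetical mapping from file extension to a format label.
FORMATS = {
    ".tif": "GeoTIFF",
    ".gpkg": "GeoPackage",
    ".shp": "ESRI Shapefile",
    ".csv": "CSV",
}


def crawl(root):
    """Walk `root` and return one skeleton record per recognised data file."""
    records = []
    for folder, _subfolders, files in os.walk(root):
        for name in sorted(files):
            extension = os.path.splitext(name)[1].lower()
            if extension not in FORMATS:
                continue  # skip files that are not known data formats
            path = os.path.join(folder, name)
            records.append(
                {
                    "identifier": os.path.splitext(name)[0],
                    "path": os.path.relpath(path, root),
                    "format": FORMATS[extension],
                    "size_bytes": os.path.getsize(path),
                }
            )
    return records


if __name__ == "__main__":
    # List recognised data files below the current directory.
    for record in crawl("."):
        print(record)
```

Each skeleton record could then be merged with the hand-written metadata found in the same folder before it is pushed to the catalogue.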
Typical use cases for this stack are a search interface on any file-based data repository, or a participatory data catalogue for a project. At the conference we hope to hear from you whether any of these components could be relevant to your cases, or whether there are similar initiatives we can contribute to or benefit from.
What's next? At ISRIC we receive and ingest a lot of soil data from partners. Harmonizing this data is a huge effort. Via automated pipelines and interaction with the submitters through Git comments, we hope to also improve this aspect of the data management cycle.