BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//talks.staging.osgeo.org//foss4g-europe-2025//talk//HNZK3
 7
BEGIN:VTIMEZONE
TZID:CET
BEGIN:STANDARD
DTSTART:20001029T040000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:CET
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:CEST
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-foss4g-europe-2025-HNZK37@talks.staging.osgeo.org
DTSTART;TZID=CET:20250717T120000
DTEND;TZID=CET:20250717T123000
DESCRIPTION:The Open Container Initiative (OCI) 1.1 specification has expan
 ded container registries beyond traditional software images\, enabling the
 m to store and distribute a wide variety of digital artifacts\, from softw
 are build artifacts to machine learning (ML) models and arbitrarily large 
 data blobs. As the volume of Earth Observation (EO) data generated by sate
 llites and remote sensing applications continues to increase\, scalable an
 d efficient distribution methods are becoming essential. OCI registries ar
 e well-suited for end-to-end supply chains due to their built-in capabilit
 ies for integrity verification and attestations\, such as quality assuranc
 e\, allowing for the application of common tooling and best practices acro
 ss various steps in the supply chain. Their layered design allows for sele
 ctive retrieval of specific parts\, and optimizations like compression and
  deduplication can be applied individually\, making them ideal for managin
 g EO data of arbitrary size.\n## Challenges in Optimizing OCI Registries f
 or EO Data\nHowever\, despite these advantages\, significant challenges re
 main in optimizing both client-side parallelization and addressing server-
 side limitations of existing OCI registries. A critical research question 
 arises: How should EO data be structured within OCI registries to maximize
  performance? While OCI registries support multiple storage layers and opt
 imizations\, the practical implications of storing EO data in this format 
 have not been thoroughly explored. Key concerns include whether OCI regist
 ries can effectively support arbitrarily sized EO data and how different s
 torage layouts affect retrieval speed and storage efficiency.\n## Investig
 ating Best Practices for EO Data Storage in OCI Registries\nThis research 
 paper investigates how to structure EO data within OCI registries to optim
 ize performance. By examining various physical data layouts—such as chun
 king data into blocks or organizing data into multiple layers—the goal i
 s to identify best practices for storing and accessing large EO datasets. 
 Benchmarking common OCI client tooling against a variety of OCI-compliant 
 registries\, including public offerings like DockerHub and Quay.io\, manag
 ed services like AWS ECR\, and bespoke cloud-based implementations\, will 
 help evaluate retrieval latency\, throughput\, and parallelization techniq
 ues to enhance the efficiency of EO data distribution at scale. The resear
 ch paper will also examine the impact of different compression\, deduplica
 tion\, and data layout strategies on storage efficiency and retrieval perf
 ormance.\n## Advantages of OCI Registries for EO Data Storage and Distribu
 tion\nAn OCI image\, as specified through the Linux Foundation's Open Cont
 ainer Initiative\, is a collection of multiple components. At the top lev
 el is an index of all the other included components. It references\
 , in a JSON format\, all the other layers with their digest or the cryptog
 raphic hash of the content itself. The OCI distribution spec describes how
  clients pull images from a registry\, which is done layer by layer.\nOCI 
 registries offer several inherent benefits that make them attractive for E
 O data storage and distribution. These include:\n- Layered Storage Model: 
 OCI artifacts utilize a layered approach\, allowing incremental and block-
 wise storage and retrieval\, enabling efficient updates and minimizing red
 undant data transfers.\n- Efficient Distribution: Content-addressable stor
 age allows fetching only changed layers\, which minimizes bandwidth and st
 orage costs and supports incremental updates.\n- Versioning and Tagging: V
 ersion control is inherent in OCI registries\, enabling precise tracking o
 f updates. This is crucial as data moves through various stages of process
 ing\, validation\, and final distribution.\n- Attestation and Integrity: D
 ata integrity is ensured using cryptographic hashes\, verifying the authen
 ticity and trustworthiness of the supply chain\, from raw input to final p
 roducts.\n## Addressing Practical Limitations in OCI Registries\nDespite t
 hese benefits\, the practical limitations of OCI registries for handling l
 arge-scale EO data are not fully understood. Specifically\, the impact of 
 physical data layout on retrieval speed and storage efficiency requires fu
 rther investigation. This research paper will explore several strategies f
 or structuring EO data within OCI registries:\n- Chunking and Layering Str
 ategies: Investigating whether data should be stored in large monolithic l
 ayers or smaller\, granular chunks\, and evaluating the effects of compres
 sion and deduplication on retrieval performance.\n- Client-Side Paralleliz
 ation: Analyzing the impact of parallelized downloads on pull speeds and c
 omparing performance improvements with different concurrent retrieval conf
 igurations.\n- Server-Side Constraints: Assessing registry performance lim
 its\, including bandwidth throttling and API rate limits\, and comparing d
 ifferent OCI registry offerings and implementations.\n## Benchmarking and 
 Evaluation Metrics\nThe research paper will employ a benchmark-based appro
 ach to evaluate different storage layouts and retrieval optimizations. Key
  metrics for evaluation include:\n- Latency: Measuring the time required t
 o pull (and extract) EO datasets from OCI registries.\n- Throughput: Asses
 sing how registry performance scales with concurrent downloads.\n- Storage
  Overhead: Analyzing the efficiency of deduplication and compression techn
 iques.\n\nTest datasets will include EO imagery and EO time-series data st
 ored in cloud-native formats like COGs and Zarrs\, which inherently suppor
 t chunked data structures (compressed and uncompressed). By comparing diff
 erent layouts and access patterns\, insights will be derived into the most
  effective way to structure EO data within OCI registries.\n## Research Qu
 estions and Expected Contributions\nThis research paper seeks to establish
  best practices for storing EO data in OCI registries by answering the fol
 lowing questions:\n- What are the practical limitations of OCI registries 
 for handling arbitrarily large EO datasets?\n- How should EO data be physi
 cally structured within OCI to optimize performance?\n- What are the trade
 -offs between different storage layouts in terms of retrieval speed\, stor
 age efficiency\, and scalability?\n\nBy systematically evaluating these as
 pects\, this research paper will contribute to the broader adoption of OCI
  registries for EO data management\, ensuring efficient\, scalable\, and i
 nteroperable distribution. The findings will also guide future optimizatio
 ns in registry implementations to better support large-scale geospatial da
 tasets.\n## Goal\nOCI registries offer a promising avenue for distributing
  EO data at scale. However\, the performance implications of storing large
  datasets in this format remain underexplored. This research paper will be
 nchmark various OCI registry implementations\, investigating the impact of
  data structuring\, parallelization\, and registry limitations. By identif
 ying best practices for EO data storage in OCI registries\, the efficiency
  of geospatial data distribution can be enhanced while leveraging the robu
 st ecosystem of container registries already in place.
DTSTAMP:20260527T062058Z
LOCATION:PA01 (Quarticle)
SUMMARY:Towards Standardization of the EO Data Product Supply Chain – Are
  OCI Artifacts the Key to Ubiquitous and Scalable EO Data Handling? - Stef
 an Achtsnit
URL:https://talks.staging.osgeo.org/foss4g-europe-2025/talk/HNZK37/
END:VEVENT
END:VCALENDAR
