BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//talks.staging.osgeo.org//foss4g-2022//speaker//3VB9ZB
BEGIN:VTIMEZONE
TZID:CET
BEGIN:STANDARD
DTSTART:20001029T040000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:CET
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:CEST
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-foss4g-2022-DABTGG@talks.staging.osgeo.org
DTSTART;TZID=CET:20220826T100000
DTEND;TZID=CET:20220826T103000
DESCRIPTION:Geospatial datacubes--large\, complex\, interrelated multidimen
 sional arrays with rich metadata--arise in analysis-ready geospatial imag
 ery\, level 3/4 satellite products\, and especially in ocean / weather / c
 limate simulations and [re]analyses\, where they can reach petabytes in si
 ze. The scientific Python community has developed a powerful stack for fle
 xible\, high-performance analytics of datacubes in the cloud. Xarray prov
 ides a core data model and API for analysis of such multidimensional array
  data. Combined with Zarr or TileDB for efficient storage in object stores
  (e.g. S3) and Dask for scaling out compute\, these tools allow organizati
 ons to deploy analytics and machine learning solutions for both explorator
 y research and production in any cloud platform. Within the geosciences\, 
 the Pangeo open science community has advanced this architecture as the 
 “Pangeo platform” (http://pangeo.io/).\n\nHowever\, there is a major b
 arrier preventing the community from easily transitioning to this cloud-na
 tive way of working: the difficulty of bringing existing data into the clo
 ud in analysis-ready\, cloud-optimized (ARCO) format. Typical workflows fo
 r moving data to the cloud currently consist of either bulk transfers of f
 iles into object storage (with a major performance penalty on subsequent a
 nalytics) or bespoke\, case-by-case conversions to cloud-optimized formats
  such as TileDB or Zarr. The high cost of this toil is preventing the scie
 ntific community from realizing the full benefits of cloud computing. More
  generally\, the outputs of the toil of preparing scientific data for effi
 cient analysis are rarely shared in an open\, collaborative way.\n\nTo add
 ress these challenges\, we are building Pangeo Forge (https://pangeo-forg
 e.org/)\, the first open-source cloud-native ETL (extract / transform / lo
 ad) platform focused on multidimensional scientific data. Pangeo Forge con
 sists of two main elements. An open-source Python package--pangeo_forge_re
 cipes--makes it simple for users to define “recipes” for extracting ma
 ny individual files\, combining them along arbitrary dimensions\, and depo
 siting ARCO datasets into object storage. These recipes can be “compiled
 ” to run on many different distributed execution engines\, including Das
 k\, Prefect\, and Apache Beam. The second element of Pangeo Forge is an or
 chestration backend which integrates tightly with GitHub as a continuous-i
 ntegration-style service.\n\nWe are using Pangeo Forge to populate a multi
 -petabyte-scale shared library of open-access\, analysis-ready\, cloud-opt
 imized ocean\, weather\, and climate data spread across a global federatio
 n of public cloud storage–not a “data lake” but a “data ocean”. 
 Inspired directly by the success of Conda Forge\, we aim to leverage the e
 nthusiasm of the open science community to turn data preparation and clean
 ing from a private chore into a shared\, collaborative activity. By only c
 reating ARCO datasets via version-controlled recipe feedstocks (GitHub rep
 os)\, we also maintain perfect provenance tracking for all data in the lib
 rary.\n\nYou will leave this talk with a clear understanding of how to acc
 ess this data library\, craft your own Pangeo Forge recipe\, and become a 
 contributor to our growing collection of community-sourced recipes.
DTSTAMP:20260404T005622Z
LOCATION:Room 9
SUMMARY:Pangeo Forge: Crowdsourcing Open Data in the Cloud - Ryan Abernathe
 y\, Charles Stern
URL:https://talks.staging.osgeo.org/foss4g-2022/talk/DABTGG/
END:VEVENT
END:VCALENDAR
