BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//talks.staging.osgeo.org//foss4g-2024//talk//MWYXS7
BEGIN:VTIMEZONE
TZID:-03
BEGIN:STANDARD
DTSTART:20000101T000000
RRULE:FREQ=YEARLY;BYMONTH=1
TZNAME:-03
TZOFFSETFROM:-0300
TZOFFSETTO:-0300
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-foss4g-2024-MWYXS7@talks.staging.osgeo.org
DTSTART;TZID=-03:20241205T160000
DTEND;TZID=-03:20241205T163000
DESCRIPTION:The IS_Agro project is an initiative focused on the critical ev
 aluation and subsequent adaptation of methodologies designed in global for
 ums\, with a view to their application in the national context based on th
 e development of new agro-socio-environmental metrics and indicators (IASs
 ) that aim to provide a more accurate and authentic representation of the 
 agricultural landscape in the national territory. IASs are measures used t
 o monitor and evaluate agricultural performance related to social\, econom
 ic and environmental aspects\, thus having great importance in guiding mor
 e sustainable political strategies and agricultural practices\, whether by
  public or private entities\, serving “to evaluate the performance of a
 griculture in terms of its environmental\, social and economic performanc
 e\, providing comparative data and information between federative entities
  or countries\, among several other applications” (EMBRAPA SOLOS\, 2023)
 . In this project\, IASs are developed by different teams specialized in t
 he proposed themes\, whose works are previously approved and published in 
 the scientific arena. To automate data collection\, allocation\, calculati
 ons and constant updates of the IASs\, there is a team called the Digital 
 Module\, which develops solutions for each indicator\, transforming them i
 nto digital algorithms. Structured\, semi-structured and unstructured regi
 stration data are collected and stored in a data lakehouse\, requiring a g
 reat deal of organization within the repository so that the data is always
  available and easily accessible. The team adopted the medallion archit
 ecture\, which allocates data to three layers with different purpose
 s\, and used an open-source platform for pipeline management and auto
 mation.\n\nThe conception of this pro
 ject as a digital platform linked to the Brazilian Agricultural Observator
 y aims to publish indicators and parameters derived from well-founded tech
 nical and scientific data\, capable of evaluating the effective performanc
 e of the national agricultural sector at the municipal or state level\, co
 ntributing to sectoral policies and planning and management processes aime
 d at building sustainable agriculture and the correct positioning of the c
 ountry on the international scene. Thus\, the general objective is to deve
 lop an intelligent environment that automates and manages the IAS pipeline
 s\, with data storage organized according to the medallion architecture
 \, to serve as the basis of the data panel that publishes the indicators
 .\n\
 nA data pipeline is a succession of connected phases that enable the colle
 ction\, storage\, modification\, analysis\, and representation of data\, w
 ith the purpose of acquiring meaningful insights and supporting informed c
 hoices (CALANCA\, 2023). A data lakehouse\, the destination of the project
  pipelines\, is “like a modern data platform built from a combination of
  a data lake and a data warehouse” (ORACLE CLOUD INFRASTRUCTURE\, 2023)\
 , using “the flexible storage of unstructured data from a data lake and 
 the management capabilities and tools of data warehouses\, and then strate
 gically deploying them together as a larger system” (ORACLE CLOUD INFRAS
 TRUCTURE\, 2023). The medallion architecture is the sequential structuring
  of data storage that aims to logically organize the data in the lakehouse
 \, aiming to incrementally and progressively improve the structure and qua
 lity of the data as it flows through the three layers of the architecture 
 (ARQUITETURA medallion\, 2024). The terms bronze (raw data from the source
 )\, silver (transformation and validation of the data)\, and gold (refined
  and enriched data for use in projects) describe the quality of the data d
 uring the process (SKAYA et al.\, 2024). Pipeline management is perform
 ed by Apache Airflow (version 2.44)\, an open-source platform for develop
 ing\
 , scheduling\, and monitoring batch-oriented workflows based on the Python
  programming language\, which allows you to create workflows connected to 
 virtually any technology (WHAT is Airflow™?\, 2023). The Airflow executi
 on environment was structured in Docker\, an open-source platform that all
 ows you to create and manage containers as modular virtual machines that c
 ontain the essentials for their execution. The developed image is availabl
 e on GitHub.\nThe routines are expected to run once a month (to be confir
 med). Raw data is downloaded and kept in its original format\, and a ha
 sh of each file is recorded so that a changed file can be detected and d
 ownloaded again. This data is c
 leaned and processed as needed. At the end of the silver phase\, a tabular
  structure will be created with geocode (integer\, IBGE code of municipali
 ties or states)\, date (timestamp\, ISO 8601)\, source (text) and value (f
 loating point\, real number) and will be saved in the data lakehouse as .p
 arquet\, an open-source columnar storage format designed for highly compre
 ssed storage and efficient data retrieval\, providing improved performance
  for handling complex mass data (OVERVIEW\, 2022). The .parquet files save
 d in the data lake are available for use in the gold tier with one-to-many
  cardinality. In this last phase of the architecture\, the necessary calcu
 lations are performed for each indicator source\, although some sources r
 equire no calculation. The final phase exports the gold data to tables i
 n a PostgreSQL project database\, where they are ready for use by an inte
 rnally developed API that supplies data to the data panel\, which is bein
 g developed by another team and will be published to socie
 ty from the project website.\n\nThis model has been adjusted and corrected
  throughout the development of the project in the Digital Module. Flexible
 \, it is now considered ready to receive any indicator developed by oth
 er teams and to support the development of the data panel for public us
 e by society.
DTSTAMP:20260516T153756Z
LOCATION:Room I
SUMMARY:The Digital Module of the IS_Agro Project: Using the medallion arch
 itecture as a basis for automating pipeline execution routines in Apache A
 irflow - Carlos Eduardo Mota
URL:https://talks.staging.osgeo.org/foss4g-2024/talk/MWYXS7/
END:VEVENT
END:VCALENDAR
