BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//talks.staging.osgeo.org//foss4g-2022//speaker//S9CLC8
BEGIN:VTIMEZONE
TZID:CET
BEGIN:STANDARD
DTSTART:20001029T040000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:CET
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:CEST
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-foss4g-2022-WY8NQP@talks.staging.osgeo.org
DTSTART;TZID=CET:20220825T175000
DTEND;TZID=CET:20220825T175500
DESCRIPTION:The primary mission of the Data Operations Systems and Analyti
 cs team at NYC DOT is to support the data analysis and data product needs
  relating to transportation safety for the Agency. The team’s work produ
 cing 
 safety analysis for projects and programs typically involves merging data 
 from a variety of sources with collision data\, asset data\, and/or progra
 m data. The bulk of the analysis is performed in PostgreSQL databases all 
 with a geospatial component. The work necessitates ingesting input data fr
 om other databases\, CSV/Excel files\, and various geospatial data formats
 . It is critical that the analysis be documented and repeatable. \n\nMovin
 g data around\, getting external data into the database\, transforming it\
 , geocoding it\, etc.\, previously occupied the bulk of the team’s time
 \, reducing capacity for the actual analysis. Additionally\, the volume o
 f one-off and exploratory analyses resulted in a cluttered database envir
 onment with multiple versions of datasets with unclear lineage and state o
 f completeness. \nModeled on the infrastructure-as-code idea\, we began bu
 ilding a Python library that would allow us to preserve the entire analysi
 s workflow from data ingestion to analysis and to output generation in a s
 ingle Python file or Jupyter notebook. The library began as a way to redu
 ce the friction and standardize the process of ingesting external data int
 o the various database environments utilized. It has since grown into the 
 primary method to facilitate reproducible data analysis processes that inc
 lude data ingestion\, transformation\, analysis\, and output generati
 on.\n\nThe library includes basic database connections\, and facilitates q
 uick and easy import and export from flat files\, geospatial data files\, 
 and other databases. It provides both inferred and defined schemas\, to al
 low both quick exploration and more thoroughly defined data pipeline proce
 sses.  The library includes standardization of column naming\, comments\, 
 and permissions. There are built-in database cleaning processes\, geocodin
 g processes\, and we have started building simple geospatial data display 
 functions for exploratory analysis. The code is heavily reliant on numpy\,
  pandas\, GDAL/ogr2ogr\, pyodbc\, psycopg2\, shapely\, and basic SQL and P
 ython. The library is not an ORM\, but it occupies a similar role geared
  toward analytic workflows.\n\nThe talk will discuss how the library has
  evolved over time\, the functionality and use cases in the team’s daily
  workflows as well as where we would like to extend the functionality and 
 open it up for contributions.  While the library is not currently open sou
 rce\, we are actively working on creating an open version and migrating to
  Python 3.x. This library has greatly improved the speed and simplicity of
  conducting exploratory analysis and enhanced the quality and completeness
  of the documentation of our more substantial data analytics and research.
 \nThe library should be of interest and utility for anyone working with da
 ta without the support of a dedicated data engineering team to facilitate 
 the collection of multiple datasets from a variety of formats\, as well as
  anyone looking to standardize their data analysis workflows from beginnin
 g to end.
DTSTAMP:20260404T032756Z
LOCATION:Room 4
SUMMARY:Building a data analytics library in Python - seth hosteter
URL:https://talks.staging.osgeo.org/foss4g-2022/talk/WY8NQP/
END:VEVENT
END:VCALENDAR
