2025-11-04 –, Lake Audubon
The h3-indexer is an open-source, cloud-native Python package that converts geospatial geometries (points, lines, polygons) into standardized H3 hexagonal grids using PySpark and Apache Sedona, enabling scalable spatial analysis and data harmonization across massive, disparate geospatial datasets.
With new cloud-native geospatial tools, a growing number of data formats, and ever more ways to store data, the h3-indexer addresses the challenge of efficiently converting diverse spatial data formats into a unified, analysis-ready representation at scale. Originally developed for Amazon's Air Quality initiatives to calculate emissions globally at highly granular levels, this open-source tool fills a significant gap in the cloud-native geospatial technology landscape.
The geospatial community faces a persistent challenge in standardizing spatial data processing workflows, particularly in cloud environments. Traditional GIS tools often lack the scalability needed for modern big data applications, while cloud-native solutions frequently require custom compute environments and builds that are more complex than most use cases warrant. Organizations working with diverse geospatial datasets, in applications ranging from environmental monitoring to urban planning, struggle to harmonize data from multiple sources into a consistent, comparable structure for calculating key metrics.
The h3-indexer leverages Uber's H3 global grid system to transform any number of point, line, and polygon geometries into standardized hexagonal grids. Built on PySpark and Apache Sedona, the tool is designed for cloud-native deployment and can process massive geospatial datasets in distributed computing environments. Through JSON-based configuration, users can version, share, and reproduce spatial processing workflows across different environments and datasets.
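As a rough illustration of why JSON-based configuration helps with versioning and reproducibility, the sketch below loads a configuration, checks the H3 resolution, and derives a stable fingerprint that two environments could compare. It is not the h3-indexer's actual schema or API: every key name (`name`, `h3_resolution`, `inputs`, `source`, `geometry_type`, `method`) and both helper functions are hypothetical, and only the Python standard library is used.

```python
import hashlib
import json

# Hypothetical configuration. Every key and value here is illustrative
# and is NOT taken from the h3-indexer's real configuration schema.
CONFIG_TEXT = """
{
  "name": "example-run",
  "h3_resolution": 8,
  "inputs": [
    {"source": "roads", "geometry_type": "LINE", "method": "PCT_LENGTH"},
    {"source": "parcels", "geometry_type": "POLYGON", "method": "PCT_AREA"}
  ]
}
"""


def load_config(text: str) -> dict:
    """Parse a JSON config and validate the (hypothetical) resolution key."""
    config = json.loads(text)
    res = config.get("h3_resolution")
    # H3 defines resolutions 0 (coarsest) through 15 (finest).
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(res, bool) or not isinstance(res, int) or not 0 <= res <= 15:
        raise ValueError(f"h3_resolution must be an integer in 0..15, got {res!r}")
    return config


def config_fingerprint(config: dict) -> str:
    """Return a SHA-256 hex digest that ignores key order and whitespace."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    cfg = load_config(CONFIG_TEXT)
    print(cfg["name"], config_fingerprint(cfg))
```

Because the fingerprint is computed from a canonical serialization, two copies of a configuration that differ only in key order or formatting produce the same digest, which is one simple way to confirm that a shared workflow is the one that was actually run.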
The h3-indexer's modular architecture makes it extensible for community contributions and customization. The tool is built around a flexible data model system, implemented with Pydantic, that defines clear interfaces for input types (VectorTable, RasterFile), geometry types (POINT, LINE, POLYGON), and processing methods (WITHIN, PCT_LENGTH, PCT_AREA). Developers can add a new geometry processing method by extending the existing enums and implementing the corresponding router function, or introduce an entirely new data source type by creating a new base model class that follows the established patterns. Because the approach is configuration-driven, new attribution methods, data connectors, or processing algorithms can be integrated without modifying the core indexing and resolution logic. This modularity extends to data sources as well: current support for S3, Glue Catalog, and various file formats demonstrates how new input connectors can be integrated into the existing validation and processing framework, making the tool adaptable to diverse organizational needs and emerging geospatial data standards.
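The enum-plus-router pattern described above might look roughly like the following sketch. It is an assumption-laden illustration, not the package's code: the enum and class names mirror terms from this abstract, but the fields, the `route` decorator, the `ROUTES` registry, the handler functions, and the pairing of each geometry type with one method are all hypothetical. Standard-library dataclasses stand in for the Pydantic models so the example runs without extra dependencies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Tuple


class GeometryType(str, Enum):
    POINT = "POINT"
    LINE = "LINE"
    POLYGON = "POLYGON"


class Method(str, Enum):
    WITHIN = "WITHIN"
    PCT_LENGTH = "PCT_LENGTH"
    PCT_AREA = "PCT_AREA"


@dataclass(frozen=True)
class VectorTable:
    """Stand-in for a Pydantic input model; the field names are hypothetical."""
    name: str
    geometry_type: GeometryType
    method: Method


Handler = Callable[[VectorTable], str]

# Hypothetical registry mapping (geometry type, method) to a handler.
ROUTES: Dict[Tuple[GeometryType, Method], Handler] = {}


def route(geometry_type: GeometryType, method: Method) -> Callable[[Handler], Handler]:
    """Register a handler for one (geometry type, method) combination."""
    def decorator(func: Handler) -> Handler:
        ROUTES[(geometry_type, method)] = func
        return func
    return decorator


def describe(table: VectorTable) -> str:
    # Placeholder for real indexing work: only reports what would be run.
    return f"{table.name}: {table.geometry_type.value} via {table.method.value}"


# The geometry/method pairings below are assumed for illustration.
@route(GeometryType.POINT, Method.WITHIN)
def index_points(table: VectorTable) -> str:
    return describe(table)


@route(GeometryType.LINE, Method.PCT_LENGTH)
def index_lines(table: VectorTable) -> str:
    return describe(table)


@route(GeometryType.POLYGON, Method.PCT_AREA)
def index_polygons(table: VectorTable) -> str:
    return describe(table)


def dispatch(table: VectorTable) -> str:
    """Look up and call the handler registered for this table."""
    try:
        handler = ROUTES[(table.geometry_type, table.method)]
    except KeyError:
        raise ValueError(
            f"no handler for {table.geometry_type.value}/{table.method.value}"
        ) from None
    return handler(table)


if __name__ == "__main__":
    print(dispatch(VectorTable("roads", GeometryType.LINE, Method.PCT_LENGTH)))
```

In a layout like this, supporting a new processing method means adding one enum member and one decorated handler; `dispatch` and the existing handlers stay untouched, which is the property the paragraph above attributes to the tool's design.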