FOSS4G 2022 academic track

Andrea Folini

My name is Andrea Folini and I am a recent graduate of Politecnico di Milano with a Master's degree in Computer Science and Engineering. I am currently doing an internship at the Department of Civil and Environmental Engineering (Politecnico di Milano) on the application of blockchain technology to geospatial data.


Sessions

08-24
15:15
30min
Cluster Analysis: a comprehensive and versatile QGIS plugin for pattern recognition in geospatial data
Andrea Folini

As geospatial data continuously grows in complexity and size, the application of Machine Learning and Data Mining techniques to geospatial analysis is increasingly essential for solving real-world problems. Although research in this field has produced innovative methodologies over the last two decades, they are usually applied to specific situations and not automated for general use. Therefore, both the generalization of these methods and their integration with Geographic Information Systems (GIS) are necessary to support researchers and organizations in data exploration, pattern recognition, and prediction across the various applications of geospatial data. The lack of machine learning tools in GIS is especially evident for unsupervised learning and clustering: the most widely used clustering plugins in QGIS [1] offer few functionalities beyond the basic application of a clustering algorithm.

In this work we present Cluster Analysis, a Python plugin that we developed for the open-source software QGIS and that offers functionality for the entire clustering process: from (i) pre-processing, to (ii) feature selection and clustering, and finally (iii) cluster evaluation. Our tool provides several improvements over the solutions currently available in QGIS, as well as in other widespread GIS software. The expanded features of the plugin allow users to deal with some of the most challenging problems of geospatial data, such as high-dimensional spaces, poor data quality, and large data volumes.

In particular, the plugin is composed of three main sections:

  • feature cleaning: This part provides options to reduce the dimensionality of the dataset by removing the attributes that are most likely to harm the clustering process. This is important to achieve better results and faster execution times, avoiding the problems of clustering in high dimensionality. The first filter removes features that are correlated above a user-defined threshold, since highly correlated features usually provide redundant information and can overweight some characteristics. The other two filters identify attributes that are constant across all data points, or near-constant with only a few differing outliers. These types of features provide no valuable information and can worsen clustering performance. To identify quasi-constant features, we use the two parameters introduced by the function nearZeroVar() from the caret package for R [2]: the ratio between the frequencies of the two most frequent values and the number of unique values relative to the number of samples (see the first sketch after this list).

  • clustering: This section performs clustering on the chosen vector layer. First, the user selects the features to use in the process, either manually or automatically. Automatic feature selection uses an entropy-based algorithm [3], available in two versions with different computational complexities. The clustering algorithms currently available are K-Means and Agglomerative Hierarchical, and users can select the one that best suits their needs. Before clustering, the plugin offers the possibility to scale the dataset with standardization or normalization, and to plot two different graphs that facilitate the choice of the number of clusters (see the second sketch after this list).

  • evaluation: This section shows all the experiments carried out in the current session, with a recap of each experiment's settings and performance, and the possibility to save and load experiments as text files. To evaluate the quality of the experiments we calculate two internal indices, the Silhouette coefficient and the Davies-Bouldin index, and we compare experiments run on the same dataset. To directly compare the clusters formed by two or more experiments we compute an agreement score [4], which evaluates how many pairs of data points are grouped together in all of the experiments or in none of them (see the third sketch after this list). Every experiment completed in the current session can be stored in a text file, and experiments saved in previous sessions can be loaded into the plugin, where they are shown in the evaluation section alongside the new ones.
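
As an illustration of the feature-cleaning filters, here is a minimal sketch in pandas. The correlation threshold is an arbitrary illustration, the quasi-constant defaults mirror nearZeroVar()'s freqCut = 95/5 and uniqueCut = 10, and the function names are ours, not the plugin's actual API.

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature of every pair whose absolute correlation exceeds the threshold.
    Assumes a DataFrame of numeric features."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

def drop_quasi_constant(df: pd.DataFrame, freq_ratio: float = 19.0,
                        unique_pct: float = 10.0) -> pd.DataFrame:
    """Drop constant and quasi-constant features using caret's two criteria:
    the frequency ratio of the two most common values and the percentage
    of unique values relative to the number of samples."""
    to_drop = []
    for col in df.columns:
        counts = df[col].value_counts()
        # A constant column has a single value, i.e. an infinite frequency ratio.
        ratio = counts.iloc[0] / counts.iloc[1] if len(counts) > 1 else np.inf
        pct_unique = 100.0 * df[col].nunique() / len(df)
        if ratio > freq_ratio and pct_unique < unique_pct:
            to_drop.append(col)
    return df.drop(columns=to_drop)
```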
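
A minimal sketch of the scaling and clustering step, written against scikit-learn, which implements both algorithms named above; the parameter values are illustrative assumptions, not the plugin's defaults.

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.cluster import KMeans, AgglomerativeClustering

def run_clustering(X, n_clusters=5, algorithm="kmeans", scaling="standardize"):
    # Scale first: the plugin offers both standardization and normalization.
    scaler = StandardScaler() if scaling == "standardize" else MinMaxScaler()
    X_scaled = scaler.fit_transform(X)
    if algorithm == "kmeans":
        model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    else:
        model = AgglomerativeClustering(n_clusters=n_clusters)
    return model.fit_predict(X_scaled), X_scaled
```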
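
The two internal indices, and a pair-counting comparison in the spirit of the agreement score of [4], can be computed with scikit-learn as sketched below. Note that rand_score counts the pairs grouped together in both experiments or in neither (the Rand index), which may differ in detail from the score the plugin actually implements.

```python
from sklearn.metrics import silhouette_score, davies_bouldin_score, rand_score

def evaluate(X_scaled, labels):
    return {
        "silhouette": silhouette_score(X_scaled, labels),          # in [-1, 1], higher is better
        "davies_bouldin": davies_bouldin_score(X_scaled, labels),  # lower is better
    }

def agreement(labels_a, labels_b):
    # Fraction of point pairs grouped together in both clusterings
    # or separated in both.
    return rand_score(labels_a, labels_b)
```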

One of the major challenges during development has been supporting most of the functionality on large datasets as well, in terms of both the number of samples and the number of dimensions. To achieve this, we implemented algorithm variants with favorable time complexity, as in the case of entropy computation with sampling and K-Means (a sketch follows below). Moreover, for all the data storage and manipulation in the system we use the data structures and functions provided by the pandas and NumPy libraries to guarantee high performance.
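
As a sketch of the sampling strategy: the similarity-based entropy below follows the usual formulation of the entropy measure used for feature selection [3], and computing it over all pairs is O(n²) in the number of samples, so a random subsample keeps it tractable. The sample size is an arbitrary illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

def dataset_entropy(X):
    """Similarity-based entropy: low when the points form well-separated clusters."""
    d = pdist(X)                              # all pairwise distances, O(n^2)
    alpha = -np.log(0.5) / d.mean()           # similarity is 0.5 at the mean distance
    s = np.exp(-alpha * d)
    eps = 1e-12                               # avoid log(0)
    return -np.sum(s * np.log(s + eps) + (1 - s) * np.log(1 - s + eps))

def entropy_with_sampling(X, sample_size=1000, seed=0):
    """Estimate the entropy on a random subsample to handle large datasets."""
    rng = np.random.default_rng(seed)
    if len(X) > sample_size:
        X = X[rng.choice(len(X), size=sample_size, replace=False)]
    return dataset_entropy(X)
```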

Another important objective of the research is the accessibility and ease of use of the plugin, since general GIS users often lack a machine learning and computer science background. To this end, the User Interface is simple and self-explanatory, and each section contains a brief guide explaining all the functionalities. Furthermore, some algorithm parameters that cannot be modified through the interface are stored in an external configuration file, where they can still be edited; this keeps the interface from confusing less experienced users.
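
For example, such advanced parameters might be read from a plain-text file at start-up, as in the sketch below; the file name and keys are hypothetical, not the plugin's actual configuration schema.

```python
import configparser

# Hypothetical settings file shipped alongside the plugin code:
#   [kmeans]
#   max_iterations = 300
config = configparser.ConfigParser()
config.read("cluster_analysis_settings.ini")
max_iterations = config.getint("kmeans", "max_iterations", fallback=300)
```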

Along with the implementation, the research includes a considerable experimental phase, carried out both during and after development. This phase is essential to highlight both the potential of the plugin and its limitations in real-world scenarios. The bulk of the experiments is conducted on data about the city of Milan, describing socio-demographic, urban, and climatic characteristics at different granularities (ranging from fewer than 100 data points to almost 70,000, and with up to 109 numerical attributes). Overall, the experimental phase shows good flexibility of the plugin and outlines possibilities for future developments, to which the QGIS community can also contribute given the open-source nature of the project.

The stable version of the plugin is available on the QGIS Python Plugins Repository (https://plugins.qgis.org/plugins/Cluster-Analysis-plugin-main/), while the development version and the documentation are available on GitHub (https://github.com/folini96/Cluster-Analysis-plugin).

Room Modulo 3
08-24
16:15
30min
Collaborative validation of user-contributed data using a geospatial blockchain approach: the SIMILE case study
Andrea Folini, Jesus Rodrigo Cedeno Jimenez

Decentralized applications are a fundamental element of internet development, not only because they are safer but also because they make data accessible to more people than centralized applications. One of the most important architectures for decentralized applications is blockchain, a computing infrastructure capable of sharing data immutably and according to a consensus mechanism. The most popular blockchain applications belong to the financial sector, and developments are still missing in other areas that can take advantage of this technology. One area that can benefit from blockchain characteristics is citizen science, which, as its name specifies, is research activity performed by a community of citizens. Given these requirements, this work studies the feasibility of using a blockchain architecture in citizen science, specifically for ecosystem monitoring. In addition, this work helps to understand the advantages and disadvantages of using this technology in this area.

Current state-of-the-art applications that partially address citizen science are FOAM and CryptoSpatial Coordinates. FOAM [1] is a geospatial web application that builds a consensus-driven map of the globe using the Ethereum blockchain protocol. To achieve network verification, it employs a cryptographic software utility token, with which cartographers verify whether points added to the network are correct or false. This removes the need for a central authority to regulate and verify the points. The voting mechanism uses FOAM tokens to prevent spamming by the participants. The system works by mapping a blockchain address to a physical location, which can be registered with a spatial resolution of 1 m by 1 m. CryptoSpatial Coordinates (CSC) [2] is an Ethereum smart-contract library for developing geospatially enabled decentralized apps; it uses blockchain technology to store, retrieve, and process vector geographic data.

In our approach we took inspiration from these solutions, but decided to develop something new and original. The system is developed in the Solidity programming language, which allows it to run on every blockchain that supports the Ethereum Virtual Machine and guarantees great flexibility; this choice is also justified by the large ecosystem that Ethereum offers. The Smart Contract architecture is completely open source and developed with a focus on the reusability of its components for other applications in the same field. It is based on mapping the cells of a Discrete Global Grid System (DGGS) [3] to Smart Contracts, and its two main parts are the Cell Smart Contracts and the Registry Smart Contracts. As the DGGS we chose S2 [4], an open-source library developed by Google that offers good processing functionality and a grid with fine-grained resolution. Each Smart Contract representing a cell keeps track of the hashes of the observations collected in the application; the hashes are used to locate and retrieve the stored files in the decentralized storage InterPlanetary File System (IPFS). This structure also allows storing metadata about the observations, for example their quality as determined through a peer-voting mechanism or some other system. The Registry Contracts are each linked to one resolution of the DGGS and keep track of the mapping between the DGGS cells of that resolution and their respective Cell Smart Contracts (see the sketch below).
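
To illustrate the mapping between observations and Cell Smart Contracts, the sketch below uses the s2sphere Python bindings of S2 to derive the cell that a Registry Contract would resolve to a contract address; the registry lookup, the resolution level, and the IPFS hash are hypothetical placeholders, not the system's actual values.

```python
import s2sphere

def cell_for(lat, lng, level=16):
    """Return the S2 cell, at the registry's resolution, that contains a point."""
    point = s2sphere.LatLng.from_degrees(lat, lng)
    return s2sphere.CellId.from_lat_lng(point).parent(level)

# Hypothetical registry: maps S2 cell tokens to Cell Smart Contract addresses.
registry = {}

def store_observation(lat, lng, ipfs_hash, level=16):
    token = cell_for(lat, lng, level).to_token()
    # A real Registry Contract would deploy a Cell contract if none exists yet.
    contract_address = registry.get(token)
    # The Cell Smart Contract at contract_address would record ipfs_hash,
    # from which the observation file can later be retrieved on IPFS.
    return token, contract_address, ipfs_hash

# Example: an observation on Lake Como (placeholder hash).
print(store_observation(45.985, 9.257, "Qm-placeholder-hash"))
```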

The prototype platform is deployed on Velas, a blockchain architecture with a strong focus on fast transactions and low fees compared to other blockchains (e.g. Ethereum, Cardano, Solana). The use case for this work is the Informative System for the Integrated Monitoring of Insubric Lakes and their Ecosystems (SIMILE) project. SIMILE is a cross-border Italian-Swiss project that aims to improve the collaboration between public administrations and stakeholders in the management of the Insubric lakes (Lugano, Como, and Maggiore) and their ecosystems, as well as to monitor water resource quality [5]. One of the main sources of data in SIMILE is collected with a Citizen Science approach, meaning that the data is gathered by ordinary citizens through their smartphones. Observations of this type include data about water quality and climatic parameters, and can also include multimedia files such as images. Currently, the collected data can be validated by the public authorities managing the platform, but this requires time that technicians do not always have. In our system, the observations are instead validated through a mixed rating system that allows both users and admins to evaluate each entry (a sketch follows below). Furthermore, the proposed blockchain architecture allows access to the collected data without relying on the existing Web Application.
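
A mixed rating could, for instance, weight administrator votes more heavily than user votes, as in the purely illustrative sketch below; the weighting scheme is our assumption, not the prototype's actual mechanism.

```python
def observation_score(user_votes, admin_votes, admin_weight=3.0):
    """Aggregate up/down votes (+1 / -1); each admin vote counts admin_weight times."""
    total = sum(user_votes) + admin_weight * sum(admin_votes)
    weight = len(user_votes) + admin_weight * len(admin_votes)
    return total / weight if weight else 0.0

# e.g. three users approve, one disapproves, one admin approves:
print(observation_score([1, 1, 1, -1], [1]))  # ~0.71
```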

The practical importance of this work lies in filling the gaps currently present in citizen science applications by proposing an innovative system built on blockchain infrastructure. The results of this work and the technological development performed demonstrate that citizen science applications can indeed be developed as a decentralized infrastructure. The main advantages with respect to other systems are data immutability, security, and the absence of a single point of failure. Future work can include a system to further incentivize data collection through a reward mechanism in the form of a Utility Token. This token could be accepted by the public administrations benefiting from the data, in exchange for some form of compensation such as discounts on public services.

Room Modulo 3