From Points to Predictions: Data Curation for Geospatial Machine Learning

Abstract

The quality of training datasets can have a large impact on machine learning (ML) models, yet this aspect of the pipeline frequently receives less scrutiny than it should. In the context of geospatial mapping from point-scale field data, quality control strategies that remove erroneous or misleading data can be applied before model training to improve performance. However, such strategies and their impact are rarely reported, in contrast to the extensive discussion of model selection and tuning. To investigate the potential for spatial data error correction, we examine the case of peatland mapping from peat core samples. We assess several curation strategies and compare fully automated filters against filters that require monitoring by domain experts. We find that cleaning strategies based on location precision, and on filtering by landcover classification to detect mismatches, can significantly improve performance metrics. We also find that blind reliance on fully automated classification may lead to worse results. Despite the additional effort required, we conclude that manual spatial data quality control is an important component of large-scale spatial modelling, and we discuss recommended approaches for scaling it effectively to large datasets.
