AI & machine learning
Our AI Focus delivers science-driven, practical solutions through three pillars:
ERP-GPT: Data + Knowledge for Users
A domain-grounded AI assistant that connects our soil data and expertise directly to farmers, land managers, and soil scientists, turning complex insights into clear, actionable guidance.
Physics-Informed, Multi-Modal ML
Models that combine wave physics with sensor, spatial, and field data for accuracy, robustness, and interpretability.
ML-Ready Global Soil Data
Clean, harmonised global soil datasets built for reliable, scalable machine learning and Retrieval-Augmented Generation.
Together, these pillars create a unified, user-centred approach to next-generation soil intelligence.
ERP TEAM

Prof Tarje Nissen-Meyer

Dr Kuangdai Leng

Dr Matteo Bagagli

Dr Joe Collins
Deliverables
The LUCAS-MEGA repository provides integrated sample–feature representations in both tabular and dictionary formats, along with metadata and associated asset files. It also includes intermediate outputs from the data standardisation process, as well as the scripts and schema definitions used to construct the dataset.
The SoilFormer repository contains the architecture, pretrained weights, and training scripts for the model. It enables reproduction of the representation learning experiments and supports further development on LUCAS-MEGA.
The ERP-GPT-EU repository provides APIs, prompt templates, and related resources, supporting tool-augmented interaction with the dataset through natural language.
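To picture what "tabular and dictionary formats" of the same sample–feature data could look like, here is a minimal Python sketch. The feature names, values, and the helper functions `to_table` and `to_records` are hypothetical and are not taken from the LUCAS-MEGA repository; the real schema may differ.

```python
from typing import Any

# Hypothetical sample records in dictionary form: one dict per soil sample.
# Feature names and values are illustrative only, not taken from LUCAS-MEGA.
samples = [
    {"sample_id": "S001", "ph": 6.4, "organic_carbon": 21.3},
    {"sample_id": "S002", "ph": 7.1, "clay": 18.0},
]


def to_table(records: list[dict[str, Any]]) -> tuple[list[str], list[list[Any]]]:
    """Flatten dictionary records into a header plus rows, with None for missing features."""
    columns: list[str] = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)  # keep first-seen column order
    rows = [[record.get(col) for col in columns] for record in records]
    return columns, rows


def to_records(columns: list[str], rows: list[list[Any]]) -> list[dict[str, Any]]:
    """Rebuild dictionary records from a table, dropping features that are None."""
    return [
        {col: value for col, value in zip(columns, row) if value is not None}
        for row in rows
    ]


columns, rows = to_table(samples)
round_trip = to_records(columns, rows)
```

The dictionary form stores only the features each sample actually has, while the tabular form makes gaps explicit, which is one plausible reason for shipping both.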
SoilGPT
We unify soil science and seismology through generalised masked language modelling (MLM) on multi-modal tokens, integrating data-driven learning with soil and wave physics.
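As a rough illustration of masked modelling over multi-modal tokens, the Python sketch below hides the values of a random subset of tokens and records what a model would be asked to recover. The token layout, the modality tags, the `MASK` placeholder, and the `mask_tokens` helper are all assumptions made for this example and do not describe the actual SoilGPT tokenisation or training code.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a hidden token value

# Hypothetical multi-modal token sequence for one sample: each token is a
# (modality, value) pair. Modalities and values are illustrative only.
tokens = [
    ("chem", "ph=6.4"),
    ("phys", "clay=18"),
    ("seis", "vs=210"),
    ("geo", "lat=51.7"),
    ("chem", "oc=21.3"),
]


def mask_tokens(sequence, mask_ratio=0.4, seed=0):
    """Hide a random subset of token values and return the targets to predict."""
    rng = random.Random(seed)
    n_mask = max(1, round(mask_ratio * len(sequence)))
    masked_positions = sorted(rng.sample(range(len(sequence)), n_mask))
    masked = list(sequence)
    targets = {}
    for pos in masked_positions:
        modality, value = sequence[pos]
        masked[pos] = (modality, MASK)  # keep the modality tag, hide the value
        targets[pos] = value            # a model would be trained to recover this
    return masked, targets


masked, targets = mask_tokens(tokens)
```

Because any modality can be masked, the same objective could in principle ask a model to infer a seismic quantity from soil chemistry or the reverse, which is the sense in which the two fields share one training task.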
Global Soil Data
We develop a multi-agent system that automatically processes and fuses the vast range of soil datasets available worldwide, covering Europe, Africa, North America, South America, and beyond. This system brings together multidisciplinary information—including physical, chemical, and biological properties; nutrients; soil functions; and contamination or threat indicators—to create a comprehensive global resource.
The result is an ML-ready corpus with a unified, sample-based structure, accessible through a consistent API. All data undergoes strict human sanity checks and is enriched with detailed natural-language annotations, ensuring both technical reliability and clarity for downstream applications.
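A small Python sketch can make the idea of a unified, sample-based structure more concrete. It maps two records with different field names and units onto one shared schema. The source names, field names, values, and the `harmonise` helper are hypothetical; the actual pipeline is described as a multi-agent system with human checks, which this sketch does not attempt to reproduce.

```python
# Hypothetical source records with different field names and units.
source_a = {"id": "EU-17", "pH_H2O": 6.4, "OC_g_per_kg": 21.3}
source_b = {"sample": "AF-03", "ph": 5.9, "soc_percent": 1.8}

# Per-source mapping: source field -> (unified field, conversion function).
# 1 % by mass equals 10 g/kg, hence the factor of 10 for soc_percent.
MAPPINGS = {
    "source_a": {
        "id": ("sample_id", str),
        "pH_H2O": ("ph", float),
        "OC_g_per_kg": ("organic_carbon_g_per_kg", float),
    },
    "source_b": {
        "sample": ("sample_id", str),
        "ph": ("ph", float),
        "soc_percent": ("organic_carbon_g_per_kg", lambda pct: float(pct) * 10.0),
    },
}


def harmonise(record, source):
    """Map one source record onto the shared schema, tagging its origin."""
    unified = {"source": source}
    for field, value in record.items():
        if field in MAPPINGS[source]:  # fields without a mapping are skipped
            target, convert = MAPPINGS[source][field]
            unified[target] = convert(value)
    return unified


corpus = [harmonise(source_a, "source_a"), harmonise(source_b, "source_b")]
```

After harmonisation both records expose the same keys and units, so downstream code can treat samples from different regions uniformly.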



