Geospatial Machine Learning in Python
Geospatial machine learning combines spatial data analysis with predictive modeling. By incorporating geographic features such as coordinates, distances, and environmental variables, machine learning models can uncover spatial patterns and make predictions about geographic phenomena.
Applications include:
- Species distribution modeling
- Land cover classification
- Climate prediction
- Urban growth modeling
- Environmental risk analysis
This guide introduces workflows for building spatial machine learning models using Python.
Key Libraries
Common Python libraries used for geospatial machine learning:
| Library | Purpose |
|---|---|
| geopandas | Vector spatial data |
| rasterio | Raster data processing |
| numpy | Numerical operations |
| pandas | Tabular data manipulation |
| scikit-learn | Machine learning models |
| xarray | Multidimensional geospatial data |
Install required packages:
pip install geopandas rasterio scikit-learn xarray numpy pandas
Import libraries:
import geopandas as gpd
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
Spatial Features in Machine Learning
Spatial models often use features such as:
- Latitude and longitude
- Distance to features
- Elevation
- Climate variables
- Land cover classifications
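As an illustration of a "distance to features" variable, the sketch below computes great-circle distances from observations given in latitude and longitude to a reference point, using the haversine formula. The function name `haversine_km`, the sample coordinates, and the mean Earth radius of 6371 km are assumptions chosen for the example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Hypothetical observation coordinates and a hypothetical reference feature
obs_lat = np.array([40.58, 40.00])
obs_lon = np.array([-105.08, -105.27])
dist_km = haversine_km(obs_lat, obs_lon, 40.58, -105.08)
```

Because the function is vectorized with NumPy, it can be applied to whole coordinate columns at once to produce a distance feature.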
Example dataset:
| latitude | longitude | elevation | temperature | species_presence |
|---|---|---|---|---|
| 40.58 | -105.08 | 1500 | 12.4 | 1 |
Preparing Spatial Data
Load spatial dataset:
gdf = gpd.read_file("species_observations.shp")
Extract coordinates (this assumes point geometries; for lines or polygons, use centroids instead):
gdf["lon"] = gdf.geometry.x
gdf["lat"] = gdf.geometry.y
Convert to machine learning table:
df = pd.DataFrame(gdf.drop(columns="geometry"))
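When observations arrive as a plain table rather than a shapefile, the same steps can be tried on an in-memory GeoDataFrame. This is a minimal sketch: the rows are made-up values, the column names follow the example dataset above, and the coordinates are assumed to be WGS84 (EPSG:4326).

```python
import geopandas as gpd
import pandas as pd

# Hypothetical observations; column names follow the example dataset
table = pd.DataFrame({
    "latitude": [40.58, 40.01],
    "longitude": [-105.08, -105.27],
    "elevation": [1500, 1655],
    "species_presence": [1, 0],
})

# Build point geometries (x = longitude, y = latitude) in WGS84
gdf = gpd.GeoDataFrame(
    table,
    geometry=gpd.points_from_xy(table["longitude"], table["latitude"]),
    crs="EPSG:4326",
)

# Extract coordinates, then drop the geometry column for modeling
gdf["lon"] = gdf.geometry.x
gdf["lat"] = gdf.geometry.y
df = pd.DataFrame(gdf.drop(columns="geometry"))
```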
Feature Engineering
Create spatial features to improve model performance.
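Example: distance to the nearest river. Compute this on `gdf` while it still has its geometry column, and in a projected CRS so that distances are in meters rather than degrees. `river_geometry` is assumed to be a single geometry (for example, the union of all river lines) in the same CRS.
gdf["distance_to_river"] = gdf.geometry.distance(river_geometry)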
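The sketch below shows one way a single river geometry might be built and used. The points and river lines are hypothetical and already expressed in a projected coordinate system (meters); with real latitude/longitude data, both layers would first need reprojecting, for example with `to_crs`.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point
from shapely.ops import unary_union

# Hypothetical data in a projected CRS, so coordinates are already in meters
points = gpd.GeoSeries([Point(0, 0), Point(10, 5)])
rivers = gpd.GeoSeries([
    LineString([(3, -10), (3, 10)]),   # vertical river at x = 3
    LineString([(-20, 9), (20, 9)]),   # horizontal river at y = 9
])

# Merge all rivers into one geometry; distance() then returns the
# distance from each point to the closest part of any river
river_geometry = unary_union(list(rivers))
distance_to_river = points.distance(river_geometry)
```

Merging the rivers first is what makes the result a distance to the nearest river rather than to one particular river.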
Example: spatial clustering features.
from sklearn.cluster import KMeans
coords = df[["lat", "lon"]]
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)  # fixed seed for reproducible labels
df["region_cluster"] = kmeans.fit_predict(coords)
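Cluster IDs are arbitrary labels, so it may help to one-hot encode them before passing them to a model that would otherwise read them as magnitudes. The sketch below uses synthetic coordinates drawn around three hypothetical centers. The centers, spread, and cluster count are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three hypothetical, well-separated groups of observations (degrees)
centers = np.array([[40.0, -105.0], [42.0, -100.0], [38.0, -110.0]])
coords = np.vstack([c + rng.normal(scale=0.05, size=(20, 2)) for c in centers])
df = pd.DataFrame(coords, columns=["lat", "lon"])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
df["region_cluster"] = kmeans.fit_predict(df[["lat", "lon"]])

# One-hot encode so cluster IDs are treated as categories, not magnitudes
cluster_dummies = pd.get_dummies(df["region_cluster"], prefix="region")
```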
Splitting Data
Split dataset into training and testing sets:
X = df[["lat", "lon", "elevation", "temperature"]]
y = df["species_presence"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
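Presence/absence data is often imbalanced, and a purely random split can leave the test set with very few presences. A minimal sketch of a stratified split, assuming a synthetic table with a 20% presence rate, passes `stratify=y` so that both subsets keep the same class proportions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
# Hypothetical feature table with a 20% presence rate
X = pd.DataFrame({
    "lat": rng.uniform(39, 41, n),
    "lon": rng.uniform(-106, -104, n),
})
y = pd.Series([1] * 40 + [0] * 160, name="species_presence")

# stratify=y preserves the presence/absence ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```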
Training a Model
Train a Random Forest classifier:
model = RandomForestClassifier(n_estimators=200, random_state=42)  # fixed seed for reproducibility
model.fit(X_train, y_train)
Make predictions:
predictions = model.predict(X_test)
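Beyond hard 0/1 labels, a fitted Random Forest also exposes class probabilities and impurity-based feature importances. The sketch below uses synthetic data in which presence depends only on elevation. The column names and the 2000 threshold are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 300
# Hypothetical data: presence depends on elevation only; "noise" is irrelevant
X = pd.DataFrame({
    "elevation": rng.uniform(1000, 3000, n),
    "noise": rng.normal(size=n),
})
y = (X["elevation"] > 2000).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X, y)

# Probability of presence (class 1) rather than a hard 0/1 label
presence_prob = model.predict_proba(X)[:, 1]

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
```

Probabilities are often more useful than labels for mapping, since they can be displayed as a continuous suitability surface.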
Evaluating Model Performance
Measure accuracy:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
Accuracy can be misleading when presences are much rarer than absences, so other useful metrics include:
- Precision
- Recall
- F1-score
- ROC-AUC
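These metrics are all available in `sklearn.metrics`. The sketch below evaluates them on a small set of made-up labels, hard predictions, and predicted probabilities. Note that ROC-AUC is computed from scores rather than from hard labels.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical labels, hard predictions, and predicted presence probabilities
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.8, 0.7, 0.4, 0.2]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
roc_auc = roc_auc_score(y_true, y_score)     # uses scores, not hard labels
```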
Spatial Prediction
Use trained models to predict across geographic space.
Example workflow:
- Generate prediction grid
- Calculate spatial features
- Apply trained model
Example:
prediction_grid["predicted_presence"] = model.predict(grid_features)
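The three workflow steps might look like the following when the only features are coordinates. This is a sketch on synthetic data: the study-area bounds, the grid resolution, and the rule that presence occurs north of 40.5 degrees are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical training data: presence in the northern half of the study area
train = pd.DataFrame({
    "lat": rng.uniform(40.0, 41.0, 200),
    "lon": rng.uniform(-106.0, -105.0, 200),
})
y = (train["lat"] > 40.5).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(train, y)

# 1. Generate a regular grid of cell centers over the study area
lats = np.linspace(40.05, 40.95, 10)
lons = np.linspace(-105.95, -105.05, 10)
lon_grid, lat_grid = np.meshgrid(lons, lats)

# 2. Build the feature table with the same columns, in the same order, as training
prediction_grid = pd.DataFrame({"lat": lat_grid.ravel(), "lon": lon_grid.ravel()})
grid_features = prediction_grid[["lat", "lon"]]

# 3. Apply the trained model to every grid cell
prediction_grid["predicted_presence"] = model.predict(grid_features)
```

With additional features such as elevation, step 2 would also need those values for every grid cell, for example by sampling a raster.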
These predictions can be visualized as maps.
Mapping Machine Learning Results
Convert predictions to a GeoDataFrame (this assumes `prediction_grid` already has a `geometry` column):
results = gpd.GeoDataFrame(prediction_grid, geometry="geometry")
Plot prediction map:
results.plot(column="predicted_presence", legend=True)
Spatial Cross Validation
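Random train/test splits tend to give overly optimistic results for spatial data. Because nearby observations are similar (spatial autocorrelation), test points often sit right next to training points, so the model is rarely tested on genuinely new conditions.
Spatial cross-validation splits data by geographic region instead of by random sampling.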
Example strategy:
- Divide study area into spatial blocks
- Train model on some regions
- Test on withheld regions
This produces more realistic estimates of how the model will perform in areas it has not seen.
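One way to sketch the block strategy is to assign each observation to a grid cell and pass the cell IDs as groups to scikit-learn's `GroupKFold`, which keeps each group entirely within either the training or the test fold. The data, the 0.5-degree block size, and the number of folds below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n = 400
# Hypothetical observations over a 2 x 2 degree study area
df = pd.DataFrame({
    "lat": rng.uniform(40.0, 42.0, n),
    "lon": rng.uniform(-106.0, -104.0, n),
    "elevation": rng.uniform(1000, 3000, n),
})
y = (df["elevation"] > 2000).astype(int)
X = df[["lat", "lon", "elevation"]]

# Assign each observation to a 0.5-degree block (block size is an assumption)
block_size = 0.5
row = np.floor((df["lat"] - 40.0) / block_size).astype(int)
col = np.floor((df["lon"] + 106.0) / block_size).astype(int)
blocks = row * 4 + col  # 4 columns of blocks -> IDs 0..15

# GroupKFold keeps every block entirely in either the training or the test fold
cv = GroupKFold(n_splits=4)
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, groups=blocks)
```

Block size matters: blocks smaller than the range of spatial autocorrelation still leak information between folds, so the value chosen here should not be taken as a recommendation.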
Working with Raster Features
Machine learning models often use raster datasets such as:
- Elevation
- Land cover
- Climate variables
- Satellite imagery
Extract raster values at observation locations:
import rasterio
# Observation coordinates must be in the same CRS as the raster
with rasterio.open("elevation.tif") as src:
    gdf["elevation"] = [
        x[0] for x in src.sample(
            [(geom.x, geom.y) for geom in gdf.geometry]
        )
    ]
Example Workflow
A simplified spatial ML workflow:
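import geopandas as gpd
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# Load point observations; "elevation" and "target" are attribute columns in the file
data = gpd.read_file("observations.shp")
# Derive coordinate features from the point geometries
data["lon"] = data.geometry.x
data["lat"] = data.geometry.y
X = data[["lat", "lon", "elevation"]]
y = data["target"]
# Fit the classifier; evaluate on held-out data before relying on it
model = RandomForestClassifier(random_state=42)
model.fit(X, y)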
Summary
Geospatial machine learning combines:
- Geographic data
- Environmental variables
- Machine learning models
These approaches allow researchers to:
- Predict environmental phenomena
- Model spatial processes
- Analyze geographic patterns
Python provides a flexible ecosystem for building advanced geospatial models.
Next Steps
Advanced topics to explore include:
- Spatial deep learning
- Remote sensing classification
- Spatial autocorrelation modeling
- Geostatistics and kriging
- Spatiotemporal forecasting