Preamble¶
Tools: QGIS, Python (Jupyter Notebook), scikit-learn
Data: Sentinel-2 multi-date imagery and spectral indices computed in previous sessions
Goal: learn how to use supervised machine learning to map vegetation water stress from satellite imagery.
In this session we will move from manual interpretation and threshold-based analysis to data-driven classification using machine learning.
Learning Objectives¶
By the end of this session, students will be able to:
Understand the difference between rule-based classification and machine learning classification
Create training data using manually digitized reference polygons in QGIS
Build a machine learning dataset from raster variables
Train a Random Forest classifier
Evaluate model performance using confusion matrices and classification metrics
From spectral indices to Machine Learning¶
During Session 5, you explored Sentinel-2 imagery and computed several spectral indices such as:
NDVI
NDMI
NDWI
additional spectral band combinations
You also experimented with threshold-based classification.
Example:
NDVI > threshold → vegetation
NDVI ≤ threshold → non-vegetation
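As a reminder, such a threshold rule takes only a few lines of NumPy. The NDVI values and the 0.3 cut-off below are purely illustrative; real thresholds are scene- and season-dependent:

```python
import numpy as np

# Illustrative NDVI values (in practice these come from the Sentinel-2 bands)
ndvi = np.array([[0.71, 0.08],
                 [0.45, -0.12]])

threshold = 0.3  # assumed cut-off for this sketch
vegetation_mask = ndvi > threshold  # True = vegetation, False = non-vegetation
```

The mask can then be saved as a binary raster, which is exactly the kind of rule-based classification we will now replace with a learned model.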
Machine Learning workflow in Earth Observation¶
The workflow used in this session follows the standard pipeline used in many remote sensing applications:
Satellite images
↓
Spectral bands & indices
↓
Training polygons (reference data)
↓
Pixel extraction
↓
Machine Learning model
↓
Classification map
↓
Performance assessment
Step 1 — Satellite data¶
We start from Sentinel-2 images and derived variables:
spectral bands
spectral indices
multi-date information
These variables will be used as input features for the model.
Example feature table:
| B4_oct | B8_oct | NDVI_oct | NDVI_feb | NDMI_oct | NDMI_feb |
|---|---|---|---|---|---|
Step 2 — Training polygons¶
Supervised machine learning requires reference data.
These are often called:
training data
ground truth
labelled data
In many real-world applications, reference data come from:
field measurements
agricultural surveys
in-situ sensors
expert interpretation
You will draw training polygons in QGIS corresponding to different land cover/vegetation status classes.
Creating training polygons in QGIS: live session
Training data are created by digitizing polygons representing homogeneous areas in the image.
In QGIS:
Create a new vector layer
Layer → Create Layer → New GeoPackage Layer
Set the parameters
Geometry type: Polygon
CRS: same as the raster (EPSG: 32629)
Add a field:
class and specify its Type as Integer
This attribute will store the land cover class associated with the polygon.
Start editing
Toggle Editing → Add Polygon Feature
Draw polygons over homogeneous areas corresponding to one class (healthy vegetation, stressed vegetation, soil, water, etc.).
When saving a polygon, fill the class attribute with the observed class.
Use numeric class labels: for machine learning it is simpler to work with numeric class codes than with text labels.
The class attribute of the polygon layer is defined as an integer field, so you can use codes such as:
1 = healthy vegetation
2 = stressed vegetation
3 = bare soil
4 = urban
5 = water
▶️ Video tutorial (optional)
Watch this short tutorial if you need a reminder on how to digitize polygons in QGIS:
🎬 https://
Step 3 — Pixel extraction¶
For each training polygon, we extract the value of every input raster variable.
Each pixel inside a polygon becomes one training sample.
To simplify this step, it is recommended to first create a multiband raster stack containing all input variables (spectral bands and indices).
This stack will be used as the input raster from which pixel values are extracted.
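Conceptually, the stack is just the individual layers piled along a new axis. A minimal NumPy sketch with made-up 2 × 2 layers (in practice QGIS or a raster library handles the file reading and writing):

```python
import numpy as np

# Made-up 2x2 layers standing in for bands read from disk
b4_oct = np.array([[0.10, 0.12], [0.11, 0.13]])
b8_oct = np.array([[0.40, 0.42], [0.41, 0.43]])
ndvi_oct = (b8_oct - b4_oct) / (b8_oct + b4_oct)

# Stack all input variables along a new first axis -> (n_layers, rows, cols)
stack = np.stack([b4_oct, b8_oct, ndvi_oct])
print(stack.shape)  # (3, 2, 2)
```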
The extraction is performed using the following notebook:
📓 Notebook to extract raster values from polygons
or open it directly in Google Colab:
After extraction, we obtain a reference dataset where:
X → input variables (features)
y → target variable (class)
Example dataset:
| NDVI_oct | NDVI_feb | NDMI_oct | NDMI_feb | class |
|---|---|---|---|---|
| 0.63 | 0.57 | 0.19 | 0.13 | 1 |
| 0.32 | 0.21 | 0.06 | 0.02 | 2 |
Each row corresponds to one pixel sample extracted from the training polygons.
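Once the training polygons have been rasterized to a label image, the extraction itself reduces to boolean indexing. A sketch with synthetic arrays (the real stack and labels are produced by the notebook):

```python
import numpy as np

rng = np.random.default_rng(42)
stack = rng.random((3, 4, 4))         # (n_layers, rows, cols) raster stack
labels = np.zeros((4, 4), dtype=int)  # 0 = outside any training polygon
labels[0, :2] = 1                     # healthy-vegetation polygon
labels[2, 1:3] = 2                    # stressed-vegetation polygon

mask = labels > 0                     # pixels covered by training polygons
X = stack[:, mask].T                  # one row per pixel, one column per variable
y = labels[mask]                      # class code of each pixel
```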
Step 4 — Train / Test split¶
Once the reference dataset has been created, we need to split it into two subsets:
a training set used to train the model
a test set used to evaluate the model on unseen data
This step is essential to assess whether the model generalizes well to new data.
In practice the dataset is usually split into 70 % training data and 30 % test data.
In Python this can be done using the train_test_split function from scikit-learn.
Documentation:
https://
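A minimal usage sketch (synthetic X and y stand in for the arrays produced by the extraction step):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic dataset: 100 pixel samples, 4 variables, 5 balanced classes
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.repeat(np.arange(1, 6), 20)

# 70 % training / 30 % test; stratify keeps class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (70, 4) (30, 4)
```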
Cross-validation (optional but recommended)
A more robust evaluation method is to use cross-validation.
Cross-validation splits the dataset into several folds and trains the model multiple times using different training/test combinations.
Common strategies include:
k-fold cross-validation
stratified cross-validation (keeps class proportions balanced)
grouped cross-validation (useful when samples are spatially correlated)
More information:
https://
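A sketch of 5-fold stratified cross-validation on the same kind of synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the pixel dataset
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.repeat(np.arange(1, 6), 20)

# Each fold keeps the class proportions of the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
print(scores.mean())
```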
Step 5 — Random Forest classification¶
In this session we use a Random Forest classifier.
Random Forest is an ensemble method composed of many decision trees.
Each tree learns a different decision rule, and the final prediction is obtained by combining the predictions of all trees.
Advantages of Random Forest for remote sensing:
robust to noise
handles nonlinear relationships
works well with many input variables
widely used in Earth Observation applications
Documentation:
https://
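Training a Random Forest in scikit-learn is two lines; the sketch below uses a synthetic dataset from `make_classification` in place of the real pixel samples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the pixel dataset (real X/y come from the extraction step)
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

print(rf.feature_importances_)  # contribution of each input variable
```

The `feature_importances_` attribute is also what the notebook uses for the feature-importance analysis in the last step.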
Step 6 — Hyperparameter tuning (Grid Search)¶
Machine learning models have parameters that control their behavior.
Examples for Random Forest:
number of trees
maximum tree depth
minimum samples per leaf
Instead of choosing these parameters manually, we can use Grid Search.
Grid Search tests several parameter combinations and automatically selects the configuration that produces the best performance.
Documentation:
https://
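With `GridSearchCV`, each parameter combination is evaluated by cross-validation and the best one is kept. A small illustrative grid on synthetic data (real grids are usually wider):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)

# Grid over the three parameters mentioned above
param_grid = {
    "n_estimators": [50, 100],       # number of trees
    "max_depth": [None, 10],         # maximum tree depth
    "min_samples_leaf": [1, 5],      # minimum samples per leaf
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```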
Step 7 — Model evaluation¶
Once the model is trained, we evaluate its performance on the test dataset.
The most common evaluation tool for classification is the confusion matrix.
As a reminder, the confusion matrix compares the true classes versus the predicted classes.
From this matrix we can compute several metrics:
overall accuracy
precision
recall
F1-score
These metrics help quantify how well the model performs for each class.
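All of these metrics are available in scikit-learn. A sketch with hypothetical test labels (1 = healthy vegetation, 2 = stressed vegetation, 3 = bare soil):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical test labels and predictions
y_test = [1, 1, 2, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 2, 3, 3]  # one stressed pixel confused with bare soil

cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))         # overall accuracy
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```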
Confusion matrix interpretation
The confusion matrix allows us to identify:
correctly classified samples
misclassified classes
possible confusion between similar land cover types
For example, the model may confuse:
stressed vegetation with bare soil
sparse vegetation with urban surfaces
Visualizing the confusion matrix helps understand where the model makes mistakes.
Last step¶
Step 8 — Produce a classification map and model confidence¶
Once the model has been trained and evaluated, it can be applied to the entire raster image (all pixels of the satellite image) in order to generate a vegetation stress map.
The trained model predicts the class of each pixel based on its spectral variables.
This produces a classification map, where each pixel is assigned a land cover class.
The model can also produce a confidence score for each prediction.
This value corresponds to the probability estimated by the model for the predicted class.
For example:
| Pixel | Predicted class | Confidence |
|---|---|---|
| pixel 1 | healthy vegetation | 0.92 |
| pixel 2 | stressed vegetation | 0.61 |
| pixel 3 | bare soil | 0.48 |
A confidence raster can therefore be generated alongside the classification map.
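For a Random Forest, the confidence score comes from `predict_proba`: the probability of the predicted class is the maximum of the per-class probabilities. A sketch on synthetic data (in the notebook, `pixels` is the full raster stack reshaped to one row per pixel):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data and the raster pixels
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X, y)

# In practice `pixels` is the raster stack reshaped to (n_pixels, n_layers)
pixels = X[:5]
classes = rf.predict(pixels)         # values of the classification map
proba = rf.predict_proba(pixels)     # per-class probabilities
confidence = proba.max(axis=1)       # probability of the predicted class
```

Reshaping `classes` and `confidence` back to the raster's rows and columns yields the classification map and the confidence raster.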
Running the Machine Learning notebook
All machine learning steps are implemented in the following notebook:
📓 Machine Learning notebook
This notebook performs the following steps:
Load the training dataset
Split training and test data
Train a Random Forest classifier
Tune model parameters using Grid Search
Evaluate the model using a confusion matrix
Analyze feature importance
Use the best model to predict all image pixels and produce a classification map with its confidence map