Preamble¶
Tools: QGIS, Python (Jupyter Notebook), scikit-learn
Data: Sentinel-2 multi-date imagery and spectral indices computed in previous sessions
Goal: learn how to use supervised machine learning to map vegetation water stress from satellite imagery.
In this session we will move from manual interpretation and threshold-based analysis to data-driven classification using machine learning.
Learning Objectives¶
By the end of this session, students will be able to:
Understand the difference between rule-based classification and machine learning classification
Create training data using manually digitized reference polygons in QGIS
Build a machine learning dataset from raster variables
Train a Random Forest classifier
Evaluate model performance using confusion matrices and classification metrics
From spectral indices to Machine Learning¶
During Session 5, you explored Sentinel-2 imagery and computed several spectral indices such as:
NDVI
NDMI
NDWI
additional spectral band combinations
You also experimented with threshold-based classification.
Example:
NDVI > threshold → vegetation
NDVI ≤ threshold → non-vegetation
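As a reminder, such a threshold rule takes only a few lines of NumPy. The NDVI values and the 0.3 cut-off below are purely illustrative; real thresholds are scene- and season-dependent:

```python
import numpy as np

# Illustrative NDVI values (in practice these come from the Sentinel-2 bands)
ndvi = np.array([[0.71, 0.08],
                 [0.45, -0.12]])

threshold = 0.3  # assumed cut-off for this sketch
vegetation_mask = ndvi > threshold  # True = vegetation, False = non-vegetation
```

The mask can then be saved as a binary raster, which is exactly the kind of rule-based classification we will now replace with a learned model.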
Machine Learning workflow in Earth Observation¶
The workflow used in this session follows the standard pipeline used in many remote sensing applications:
Satellite images
↓
Spectral bands & indices
↓
Training polygons (reference data)
↓
Pixel extraction
↓
Machine Learning model
↓
Classification map
↓
Performance assessment
Step 1 — Satellite data¶
We start from Sentinel-2 images and derived variables:
spectral bands
spectral indices
multi-date information
These variables will be used as input features for the model.
Example feature table:
| B4_oct | B8_oct | NDVI_oct | NDVI_feb | NDMI_oct | NDMI_feb |
|---|---|---|---|---|---|
Step 2 — Training polygons¶
Supervised machine learning requires reference data.
These are often called:
training data
ground truth
labelled data
In many real-world applications, reference data come from:
field measurements
agricultural surveys
in-situ sensors
expert interpretation
You will draw training polygons in QGIS corresponding to different land cover/vegetation status classes.
Creating training polygons in QGIS: live session
Training data are created by digitizing polygons representing homogeneous areas in the image.
In QGIS:
Create a new vector layer
Layer → Create Layer → New GeoPackage Layer
Set the parameters
Geometry type: Polygon
CRS: same as the raster (EPSG: 32629)
Add a field:
class and specify its Type as Integer
This attribute will store the land cover class associated with the polygon.
Start editing
Toggle Editing → Add Polygon Feature
Draw polygons over homogeneous areas corresponding to one class (healthy vegetation, stressed vegetation, soil, water, etc.).
When saving a polygon, fill the class attribute with the observed class.
Use numeric class labels: for machine learning it is simpler to work with numeric class codes than with text labels.
The class attribute of the polygon layer is defined as an integer field, so you can use codes such as:
1 = healthy vegetation
2 = stressed vegetation
3 = bare soil
4 = urban
5 = water
▶️ Video tutorial (optional)
Watch this short tutorial if you need a reminder on how to digitize polygons in QGIS:
🎬 https://
Step 3 — Pixel extraction¶
For each training polygon, we extract the value of every input raster variable.
Each pixel inside a polygon becomes one training sample.
To simplify this step, it is recommended to first create a multiband raster stack containing all input variables (spectral bands and indices).
This stack will be used as the input raster from which pixel values are extracted.
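Conceptually, the stack is just the individual layers piled along a new axis. A minimal NumPy sketch with made-up 2 × 2 layers (in practice QGIS or a raster library handles the file reading and writing):

```python
import numpy as np

# Made-up 2x2 layers standing in for bands read from disk
b4_oct = np.array([[0.10, 0.12], [0.11, 0.13]])
b8_oct = np.array([[0.40, 0.42], [0.41, 0.43]])
ndvi_oct = (b8_oct - b4_oct) / (b8_oct + b4_oct)

# Stack all input variables along a new first axis -> (n_layers, rows, cols)
stack = np.stack([b4_oct, b8_oct, ndvi_oct])
print(stack.shape)  # (3, 2, 2)
```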
The extraction is performed using the following notebook:
📓 Notebook to extract raster values from polygons
or open it directly in Google Colab:
After extraction, we obtain a reference dataset where:
X → input variables (features)
y → target variable (class)
Example dataset:
| NDVI_oct | NDVI_feb | NDMI_oct | NDMI_feb | class |
|---|---|---|---|---|
| 0.63 | 0.57 | 0.19 | 0.13 | 1 |
| 0.32 | 0.21 | 0.06 | 0.02 | 2 |
Each row corresponds to one pixel sample extracted from the training polygons.
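Once the training polygons have been rasterized to a label image, the extraction itself reduces to boolean indexing. A sketch with synthetic arrays (the real stack and labels are produced by the notebook):

```python
import numpy as np

rng = np.random.default_rng(42)
stack = rng.random((3, 4, 4))         # (n_layers, rows, cols) raster stack
labels = np.zeros((4, 4), dtype=int)  # 0 = outside any training polygon
labels[0, :2] = 1                     # healthy-vegetation polygon
labels[2, 1:3] = 2                    # stressed-vegetation polygon

mask = labels > 0                     # pixels covered by training polygons
X = stack[:, mask].T                  # one row per pixel, one column per variable
y = labels[mask]                      # class code of each pixel
```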
Step 4 — Train / Test split¶
Once the reference dataset has been created, we need to split it into two subsets:
a training set used to train the model
a test set used to evaluate the model on unseen data
This step is essential to assess whether the model generalizes well to new data.
In practice the dataset is usually split into 70 % training data and 30 % test data.
In Python this can be done using the train_test_split function from scikit-learn.
Documentation:
https://
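A minimal usage sketch (synthetic X and y stand in for the arrays produced by the extraction step):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic dataset: 100 pixel samples, 4 variables, 5 balanced classes
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.repeat(np.arange(1, 6), 20)

# 70 % training / 30 % test; stratify keeps class proportions similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (70, 4) (30, 4)
```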
Cross-validation (optional but recommended)
A more robust evaluation method is to use cross-validation.
Cross-validation splits the dataset into several folds and trains the model multiple times using different training/test combinations.
Common strategies include:
k-fold cross-validation
stratified cross-validation (keeps class proportions balanced)
grouped cross-validation (useful when samples are spatially correlated)
More information:
https://
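A sketch of 5-fold stratified cross-validation on the same kind of synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the pixel dataset
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.repeat(np.arange(1, 6), 20)

# Each fold keeps the class proportions of the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)
print(scores.mean())
```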
Step 5 — Random Forest classification¶
In this session we use a Random Forest classifier.
Random Forest is an ensemble method composed of many decision trees.
Each tree learns a different decision rule, and the final prediction is obtained by combining the predictions of all trees.
Advantages of Random Forest for remote sensing:
robust to noise
handles nonlinear relationships
works well with many input variables
widely used in Earth Observation applications
Documentation:
https://
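Training a Random Forest in scikit-learn is two lines; the sketch below uses a synthetic dataset from `make_classification` in place of the real pixel samples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the pixel dataset (real X/y come from the extraction step)
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

print(rf.feature_importances_)  # contribution of each input variable
```

The `feature_importances_` attribute is also what the notebook uses for the feature-importance analysis in the last step.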
Step 6 — Hyperparameter tuning (Grid Search)¶
Machine learning models have parameters that control their behavior.
Examples for Random Forest:
number of trees
maximum tree depth
minimum samples per leaf
Instead of choosing these parameters manually, we can use Grid Search.
Grid Search tests several parameter combinations and automatically selects the configuration that produces the best performance.
Documentation:
https://
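With `GridSearchCV`, each parameter combination is evaluated by cross-validation and the best one is kept. A small illustrative grid on synthetic data (real grids are usually wider):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)

# Grid over the three parameters mentioned above
param_grid = {
    "n_estimators": [50, 100],       # number of trees
    "max_depth": [None, 10],         # maximum tree depth
    "min_samples_leaf": [1, 5],      # minimum samples per leaf
}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```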
Step 7 — Model evaluation¶
Once the model is trained, we evaluate its performance on the test dataset.
The most common evaluation tool for classification is the confusion matrix.
As a reminder, the confusion matrix compares the true classes versus the predicted classes.
From this matrix we can compute several metrics:
overall accuracy
precision
recall
F1-score
These metrics help quantify how well the model performs for each class.
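All of these metrics are available in scikit-learn. A sketch with hypothetical test labels (1 = healthy vegetation, 2 = stressed vegetation, 3 = bare soil):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical test labels and predictions
y_test = [1, 1, 2, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 2, 3, 3]  # one stressed pixel confused with bare soil

cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))         # overall accuracy
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```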
Confusion matrix interpretation
The confusion matrix allows us to identify:
correctly classified samples
misclassified classes
possible confusion between similar land cover types
For example, the model may confuse:
stressed vegetation with bare soil
sparse vegetation with urban surfaces
Visualizing the confusion matrix helps understand where the model makes mistakes.
Last step¶
Step 8 — Produce a classification map and model confidence¶
Once the model has been trained and evaluated, it can be applied to the entire raster image (all pixels of the satellite image) in order to generate a vegetation stress map.
The trained model predicts the class of each pixel based on its spectral variables.
This produces a classification map, where each pixel is assigned a land cover class.
The model can also produce a confidence score for each prediction.
This value corresponds to the probability estimated by the model for the predicted class.
For example:
| Pixel | Predicted class | Confidence |
|---|---|---|
| pixel 1 | healthy vegetation | 0.92 |
| pixel 2 | stressed vegetation | 0.61 |
| pixel 3 | bare soil | 0.48 |
A confidence raster can therefore be generated alongside the classification map.
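For a Random Forest, the confidence score comes from `predict_proba`: the probability of the predicted class is the maximum of the per-class probabilities. A sketch on synthetic data (in the notebook, `pixels` is the full raster stack reshaped to one row per pixel):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data and the raster pixels
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X, y)

# In practice `pixels` is the raster stack reshaped to (n_pixels, n_layers)
pixels = X[:5]
classes = rf.predict(pixels)         # values of the classification map
proba = rf.predict_proba(pixels)     # per-class probabilities
confidence = proba.max(axis=1)       # probability of the predicted class
```

Reshaping `classes` and `confidence` back to the raster's rows and columns yields the classification map and the confidence raster.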
Running the Machine Learning notebook
All machine learning steps are implemented in the following notebook:
📓 Machine Learning notebook
This notebook performs the following steps:
Load the training dataset
Split training and test data
Train a Random Forest classifier
Tune model parameters using Grid Search
Evaluate the model using a confusion matrix
Analyze feature importance
Use the best model to predict all image pixels and produce a classification map with its confidence map