
Preamble

In this session we will move from manual interpretation and threshold-based analysis to data-driven classification using machine learning.


Learning Objectives

By the end of this session, students will be able to:


From spectral indices to Machine Learning

During Session 5, you explored Sentinel-2 imagery and computed several spectral indices such as:

- NDVI (Normalized Difference Vegetation Index)
- NDMI (Normalized Difference Moisture Index)

You also experimented with threshold-based classification.

Example:

- NDVI > threshold → vegetation
- NDVI < threshold → non-vegetation
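This threshold rule can be sketched with NumPy (the NDVI values below are toy numbers, not real data):

```python
import numpy as np

# Hypothetical NDVI values for six pixels
ndvi = np.array([0.71, 0.05, 0.43, -0.02, 0.58, 0.12])

# Simple threshold rule: NDVI above 0.3 -> vegetation (1), otherwise non-vegetation (0)
threshold = 0.3
classes = np.where(ndvi > threshold, 1, 0)
print(classes)  # [1 0 1 0 1 0]
```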

Machine Learning workflow in Earth Observation

The workflow used in this session follows the standard pipeline used in many remote sensing applications:

Satellite images
      ↓
Spectral bands & indices
      ↓
Training polygons (reference data)
      ↓
Pixel extraction
      ↓
Machine Learning model
      ↓
Classification map
      ↓
Performance assessment

Step 1 — Satellite data

We start from Sentinel-2 images and derived variables:

- spectral bands (e.g. B4, B8)
- spectral indices (e.g. NDVI, NDMI) computed for different dates (October, February)

These variables will be used as input features for the model.

Example feature table:

| B4_oct | B8_oct | NDVI_oct | NDVI_feb | NDMI_oct | NDMI_feb |
| --- | --- | --- | --- | --- | --- |

Step 2 — Training polygons

Supervised machine learning requires reference data.

These are often called:

- training data
- ground truth
- reference samples

In many real-world applications, reference data come from:

- field surveys
- visual interpretation of high-resolution imagery
- existing maps or databases

You will draw training polygons in QGIS corresponding to different land cover/vegetation status classes.

Creating training polygons in QGIS: live session

Training data are created by digitizing polygons representing homogeneous areas in the image.

In QGIS:

  1. Create a new vector layer

Layer → Create Layer → New GeoPackage Layer

  2. Set the parameters

This attribute will store the land cover class associated with the polygon.

  3. Start editing

Toggle Editing → Add Polygon Feature

Draw polygons over homogeneous areas corresponding to one class (healthy vegetation, stressed vegetation, soil, water, etc.).

When saving a polygon, fill the class attribute with the observed class.

Use numeric class labels: for machine learning it is often simpler to work with numeric class codes than with text labels.

The class attribute of the polygon layer is defined as an integer field, so you can use codes such as:

1 = healthy vegetation
2 = stressed vegetation
3 = bare soil
4 = urban
5 = water
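In Python, these numeric codes can later be mapped back to readable labels, for example with a simple dictionary:

```python
# Mapping between the numeric class codes above and readable labels
class_labels = {
    1: "healthy vegetation",
    2: "stressed vegetation",
    3: "bare soil",
    4: "urban",
    5: "water",
}

print(class_labels[2])  # stressed vegetation
```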

▶️ Video tutorial (optional)
Watch this short tutorial if you need a reminder on how to digitize polygons in QGIS:

🎬 https://www.youtube.com/watch?v=ec_E9N_kN5M


Step 3 — Pixel extraction

For each training polygon, we extract the value of every input raster variable.

Each pixel inside a polygon becomes one training sample.

To simplify this step, it is recommended to first create a multiband raster stack containing all input variables (spectral bands and indices).

This stack will be used as the input raster from which pixel values are extracted.
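Conceptually, the per-polygon extraction can be sketched with NumPy (toy data, not the actual notebook): each band of the stack contributes one column, and each pixel inside the polygon mask contributes one row.

```python
import numpy as np

# Toy raster stack: 3 bands, 2x2 pixels (hypothetical values)
stack = np.array([
    [[0.1, 0.2], [0.3, 0.4]],   # band 1
    [[1.1, 1.2], [1.3, 1.4]],   # band 2
    [[2.1, 2.2], [2.3, 2.4]],   # band 3
])

# Boolean mask: True where a pixel falls inside a training polygon
mask = np.array([[True, False], [True, False]])

# Each masked pixel becomes one sample row with one column per band
samples = stack[:, mask].T
print(samples.shape)  # (2, 3)
```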

The extraction is performed using the following notebook:

📓 Notebook to extract raster values from polygons

📥 Download the notebook

or open it directly in Google Colab: Open In Colab


After extraction, we obtain a reference dataset where:

X → input variables (features)
y → target variable (class)

Example dataset:

| NDVI_oct | NDVI_feb | NDMI_oct | NDMI_feb | class |
| --- | --- | --- | --- | --- |
| 0.63 | 0.57 | 0.19 | 0.13 | 1 |
| 0.32 | 0.21 | 0.06 | 0.02 | 2 |

Each row corresponds to one pixel sample extracted from the training polygons.
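In pandas, separating this reference dataset into features X and target y can be sketched as follows (using the two example rows above):

```python
import pandas as pd

# Reference dataset matching the example table above
df = pd.DataFrame({
    "NDVI_oct": [0.63, 0.32],
    "NDVI_feb": [0.57, 0.21],
    "NDMI_oct": [0.19, 0.06],
    "NDMI_feb": [0.13, 0.02],
    "class":    [1, 2],
})

X = df.drop(columns="class")  # features
y = df["class"]               # target
print(X.shape, y.shape)  # (2, 4) (2,)
```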


Step 4 — Train / Test split

Once the reference dataset has been created, we need to split it into two subsets:

- a training set, used to fit the model
- a test set, used to evaluate the model on data it has never seen

This step is essential to assess whether the model generalizes well to new data.

In practice we usually split the dataset into 70 % training data and 30 % test data.

In Python this can be done using the train_test_split function from scikit-learn.

Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
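A minimal sketch of the 70/30 split with scikit-learn, using toy random data in place of the real pixel samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (10 samples, 4 features) and class labels
X = np.random.rand(10, 4)
y = np.array([1, 2] * 5)

# 70/30 split; stratify keeps class proportions, random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```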


Cross-validation (optional but recommended)

A more robust evaluation method is to use cross-validation.

Cross-validation splits the dataset into several folds and trains the model multiple times using different training/test combinations.

Common strategies include:

- k-fold cross-validation
- stratified k-fold cross-validation (preserves class proportions in each fold)

More information:
https://scikit-learn.org/stable/modules/cross_validation.html
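A minimal cross-validation sketch, using a synthetic dataset in place of the real pixel samples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic dataset standing in for the extracted pixel samples
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# 5-fold cross-validation: the model is trained and tested 5 times
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(scores.shape)  # (5,)
```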


Step 5 — Random Forest classification

In this session we use a Random Forest classifier.

Random Forest is an ensemble method composed of many decision trees.

Each tree learns a different decision rule, and the final prediction is obtained by combining the predictions of all trees.

Advantages of Random Forest for remote sensing:

- it handles many input features well
- it is robust to noise and outliers
- it requires little parameter tuning to give reasonable results
- it provides feature importance scores

Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
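A minimal training sketch on synthetic data (in the session, the features come from the raster stack and the labels from the training polygons):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the extracted pixel samples
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# 100 trees; each tree votes and the majority class wins
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

pred = model.predict(X[:5])
print(pred.shape)  # (5,)
```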


Step 6 — Parameter tuning with Grid Search

Machine learning models have parameters that control their behavior.

Examples for Random Forest:

- number of trees (n_estimators)
- maximum tree depth (max_depth)
- minimum samples per leaf (min_samples_leaf)

Instead of choosing these parameters manually, we can use Grid Search.

Grid Search tests several parameter combinations and automatically selects the configuration that produces the best performance.

Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
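A small Grid Search sketch over two hypothetical parameter values (the grid used in the notebook may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Small illustrative grid over two Random Forest parameters
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Each combination is evaluated with 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```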


Step 7 — Model evaluation

Once the model is trained, we evaluate its performance on the test dataset.

The most common evaluation tool for classification is the confusion matrix.

As a reminder, the confusion matrix compares the true classes with the predicted classes.

From this matrix we can compute several metrics:

- overall accuracy
- precision (user's accuracy)
- recall (producer's accuracy)
- F1-score

These metrics help quantify how well the model performs for each class.
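A small worked example with hypothetical true and predicted classes for eight test pixels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted classes for eight test pixels
y_true = [1, 1, 2, 2, 3, 3, 1, 2]
y_pred = [1, 1, 2, 3, 3, 3, 1, 2]

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))  # 0.875
```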


Confusion matrix interpretation

The confusion matrix allows us to identify:

- which classes are classified reliably
- which classes are confused with one another

For example, the model may confuse spectrally similar classes, such as stressed vegetation and bare soil.

Visualizing the confusion matrix helps understand where the model makes mistakes.


Last step

Step 8 — Produce a classification map and model confidence

Once the model has been trained and evaluated, it can be applied to the entire raster image (all pixels of the satellite image) in order to generate a vegetation stress map.

The trained model predicts the class of each pixel based on its spectral variables.

This produces a classification map, where each pixel is assigned a land cover class.

The model can also produce a confidence score for each prediction.

This value corresponds to the probability estimated by the model for the predicted class.

For example:

| Pixel | Predicted class | Confidence |
| --- | --- | --- |
| pixel 1 | healthy vegetation | 0.92 |
| pixel 2 | stressed vegetation | 0.61 |
| pixel 3 | bare soil | 0.48 |

A confidence raster can therefore be generated alongside the classification map.
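In scikit-learn, such confidence scores can be obtained from the classifier's predict_proba method; a sketch on synthetic data (not the real image):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Class probabilities per pixel; the confidence is the highest probability
proba = model.predict_proba(X[:3])
confidence = proba.max(axis=1)
print(confidence.shape)  # (3,)
```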



Running the Machine Learning notebook

All machine learning steps are implemented in the following notebook:

📓 Machine Learning notebook

📥 Download the notebook

This notebook performs the following steps:

  1. Load the training dataset

  2. Split training and test data

  3. Train a Random Forest classifier

  4. Tune model parameters using Grid Search

  5. Evaluate the model using a confusion matrix

  6. Analyze feature importance

  7. Use the best model to predict all image pixels and produce a classification map with its confidence map