
Data Management Sub-system

This sub-system is in charge of producing the datasets used to train and evaluate the project's models, starting from satellite imagery and human annotations of the looting activity. Most machine learning models for computer vision tasks, such as the ones developed for this project, work directly on raster data. The pre-processing needed to go from raw imagery and annotations to such rasters consists of tools executed in a precise sequence; such a sequence is called a pipeline.

Site data directory

For each site, a pre-processing pipeline has been configured. All the site data and the pipeline configuration are stored inside the data/sites/<SITE NAME> folder. The data pipeline's main objective is producing, from the site's images and annotations, all the intermediate artifacts needed for the final dataset.

The setup of a site's data directory starts with the following components:

1. Area of Interest GeoJSON (e.g., area_of_interest.geojson) describing the site's boundaries that the sub-system should process.
2. Annotations GeoJSON (e.g., annotations/pits.geojson) containing the polygons delimiting the looting pits. Each pit should have a "Day_Month_Year" string identifying the date on which the looting pit was visible; the alceo.processing.change_from_annotations script also uses this date.
3. Geo-referenced satellite images. The supported image format is GeoTIFF. Using a folder for each GeoTIFF can help keep a tidy folder structure when other tools (e.g., QGIS) create meta-data files.
4. dvc.yaml file describing the site's data pipeline stages. The following section describes setting it up using the dvc stage command.

The resulting example site data directory follows:

data/
    sites/
        <SITE NAME>/
            annotations/ # contains annotation files (e.g., pits.geojson).
                pits.geojson
            area_of_interest.geojson # boundaries of the areas of interest.
            images/ # contains image products.
                <IMAGE NAME>/
                    image_name.tiff # image file with an example name.
            dvc.yaml # configuration of the site-level pipeline.
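
Before configuring the pipeline, it can be useful to verify that these inputs are what the sub-system expects: the satellite image must be a geo-referenced GeoTIFF, and the GeoJSON files must load correctly. The following is only a minimal sketch, assuming rasterio and geopandas are available and using the example file names adopted later on this page:

# Sanity-check the site inputs (sketch; file names follow the examples on this page).
import geopandas as gpd
import rasterio

site = "data/sites/SITE"
with rasterio.open(f"{site}/images/DATE_A/DATE_A_QB_NN_diffuse_geo.tif") as src:
    print(src.crs, src.res, src.bounds)  # CRS, pixel size, and footprint of the GeoTIFF

aoi = gpd.read_file(f"{site}/area_of_interest.geojson")
pits = gpd.read_file(f"{site}/annotations/pits.geojson")
print(aoi.crs, pits.crs)      # ideally consistent with the image CRS
print(pits.columns.tolist())  # the annotations should carry the Day/Month/Year date strings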

Site data pipeline

The site data pipeline is in charge of splitting the areas of interest into geo-referenced tiles, computing from the site annotations the change in looting pits (appearance and disappearance) between two dates, and cutting all rasters (images and binary change masks) into tiles of the same resolution. DVC is the tool we've chosen to implement this pipeline.

Producing a vectorial representation of tiles

The script alceo.processing.produce_tiles produces a GeoJSON containing the polygons geo-referencing all the vectorial tiles covering a site's areas of interest. The produced GeoJSON includes metadata about the resulting raster resolution and a generated identifier for each tile.
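
To make the notion of a vectorial tile concrete, the sketch below builds 256x256-pixel windows over a reference image, geo-references them through the image transform, and keeps the ones intersecting the area of interest. This is a hypothetical illustration, not the actual produce_tiles implementation: the paths, the identifier format, and the use of geopandas/shapely are assumptions.

# Sketch of vectorial tile generation (hypothetical, not alceo.processing.produce_tiles).
import geopandas as gpd
import rasterio
from rasterio.windows import Window, bounds as window_bounds
from shapely.geometry import box

tile_w, tile_h = 256, 256  # matches the -tw/-th options of the stage below
with rasterio.open("data/sites/SITE/images/DATE_A/DATE_A_QB_NN_diffuse_geo.tif") as src:
    aoi = gpd.read_file("data/sites/SITE/area_of_interest.geojson").to_crs(src.crs)
    records = []
    for row in range(0, src.height, tile_h):
        for col in range(0, src.width, tile_w):
            # Geo-reference the pixel window through the image transform.
            geom = box(*window_bounds(Window(col, row, tile_w, tile_h), src.transform))
            if aoi.intersects(geom).any():
                records.append({"tile_id": f"tile_{row}_{col}", "geometry": geom})  # identifier format is assumed
    tiles = gpd.GeoDataFrame(records, geometry="geometry", crs=src.crs)
    tiles.to_file("tiles_sketch.geojson", driver="GeoJSON")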

Creating the stage using the dvc stage add command:

cd data/sites/${site_name} # Get into the site data directory to generate the dvc.yaml file. 

# The -w flag sets the project root as the stage's working directory.
dvc stage add -n produce_vectorial_tiles \
-w ../../../. \
-d src/alceo/processing/produce_tiles.py \
-d data/sites/${site_name}/images/${image_path} \
-d data/sites/${site_name}/area_of_interest.geojson \
-o data/sites/${site_name}/tiles.geojson \
python src/alceo/processing/produce_tiles.py \
      -i data/sites/${site_name}/images/${image_path} \
      -p ${tile_prefix} \
      -a data/sites/${site_name}/area_of_interest.geojson \
      -tw 256 \
      -th 256 \
      -o data/sites/${site_name}/tiles.geojson

Executing this command generates the following dvc.yaml file:

stages:
  produce_vectorial_tiles:
    wdir: ../../../.
    cmd: python src/alceo/processing/produce_tiles.py
      -i data/sites/${site_name}/images/${image_path}
      -p ${tile_prefix}
      -a data/sites/${site_name}/area_of_interest.geojson
      -tw 256
      -th 256
      -o data/sites/${site_name}/tiles.geojson
    deps:
      - src/alceo/processing/produce_tiles.py
      - data/sites/${site_name}/images/${image_path}
      - data/sites/${site_name}/area_of_interest.geojson
    outs:
      - data/sites/${site_name}/tiles.geojson

The outs field of a stage definition represents the files generated by the stage, whereas deps are the files on which the command's output depends. In this way, DVC can compute how the stages depend on each other in the pipeline and avoid re-computing parts for which the dependencies did not change (and thus, the outputs should remain the same).

DVC has templating functionalities that allow the parametric definition of stages. In the previous stage, ${site_name}, ${image_path}, and ${tile_prefix} are examples of DVC templating; their values are resolved from the vars section of the site's dvc.yaml (or from a params.yaml file), as shown in the examples below.

Computing change from site annotations

A crucial step in the pipeline is the computation of the appearance and the disappearance of looting pits from the annotations of images taken on two dates. The script alceo.processing.change_from_annotations takes the annotations GeoJSON for a site and the two dates of interest. The script then computes features corresponding to the looting pits' appearance, disappearance, and permanence and saves them into a target folder as three GeoJSON files.
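
The underlying idea can be sketched as follows. This is a simplification, not the actual script: the date attribute name, the permanence output file name, and the geometric matching of pits across dates are assumptions made for illustration.

# Sketch of date-based change computation (hypothetical, not
# alceo.processing.change_from_annotations).
import geopandas as gpd

pits = gpd.read_file("data/sites/SITE/annotations/pits.geojson")
first, second = "04/08/2005", "05/10/2015"  # Day/Month/Year strings from the annotations

at_first = pits[pits["Day_Month_Year"] == first]    # attribute name is assumed
at_second = pits[pits["Day_Month_Year"] == second]

appeared = at_second[~at_second.intersects(at_first.unary_union)]    # visible only at the second date
disappeared = at_first[~at_first.intersects(at_second.unary_union)]  # visible only at the first date
permanence = at_second[at_second.intersects(at_first.unary_union)]   # visible at both dates

appeared.to_file("pits.appeared.geojson", driver="GeoJSON")
disappeared.to_file("pits.disappeared.geojson", driver="GeoJSON")
permanence.to_file("pits.permanence.geojson", driver="GeoJSON")  # permanence file name is assumed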

In the site's dvc.yaml file this stage is defined as:

stages:
    ...
    produce_vectorial_change:
        wdir: ../../../. # workdir is the project root directory.
        cmd: python src/alceo/processing/change_from_annotations.py
            -i data/sites/${site_name}/annotations/pits.geojson
            -f ${first.date} -s ${second.date}
            -o data/sites/${site_name}/change/${first.image_folder}/${second.image_folder}/vectorial # arbitrary folder path
            --crs ${crs}
        deps:
            - src/alceo/processing/change_from_annotations.py
            - data/sites/${site_name}/annotations/pits.geojson
        outs:
            - data/sites/${site_name}/change/${first.image_folder}/${second.image_folder}/vectorial

If the site's images form a time series spanning more than two dates, multiple stages of this kind can be defined using DVC's foreach stages construct.

vars:
  - site_name: SITE
  - images:
      - name: DATE_A
        path: DATE_A/DATE_A_QB_NN_diffuse_geo.tif
      - name: DATE_B
        path: DATE_B/DATE_B_NN_diffuse_geo.tif
      - name: DATE_C
        path: DATE_C/DATE_C_NN_diffuse.tif
  - change_steps:
      - change: DATE_A/DATE_B
        crs: "EPSG:32636"
        first:
          date: "04/08/2005"
          image_folder: DATE_A
        second:
          date: "05/10/2015"
          image_folder: DATE_B
      - change: DATE_B/DATE_C
        crs: "EPSG:32636"
        first:
          date: "05/10/2015"
          image_folder: DATE_B
        second:
          date: "20/03/2018"
          image_folder: DATE_C
      - change: DATE_A/DATE_C
        crs: "EPSG:32636"
        first:
          date: "04/08/2005"
          image_folder: DATE_A
        second:
          date: "20/03/2018"
          image_folder: DATE_C
stages:
    ...
    produce_vectorial_change:
        foreach: ${change_steps}
        do:
            wdir: ../../../.
            cmd: python src/alceo/processing/change_from_annotations.py
                -i data/sites/${site_name}/annotations/pits.geojson
                -f ${item.first.date} -s ${item.second.date}
                -o data/sites/${site_name}/change/${item.first.image_folder}/${item.second.image_folder}/vectorial
                --crs ${item.crs}
            deps:
                - src/alceo/processing/change_from_annotations.py
                - data/sites/${site_name}/annotations/pits.geojson
            outs:
                - data/sites/${site_name}/change/${item.first.image_folder}/${item.second.image_folder}/vectorial

Rasterizing vectorial change representations

The next step of the pipeline is obtaining rasters from the vectorial change features just computed. Rasterio's rio rasterize command rasterizes GeoJSON features. In compliance with the SECOND structure, a stage is defined to rasterize each change feature (appeared and disappeared pits) into a binary mask.
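
For reference, the following sketch shows the equivalent operation with rasterio's Python API (rasterio.features.rasterize): the change polygons are burned into a uint8 mask that shares the grid (shape, transform, and CRS) of the base image, which is what the --like option achieves. Paths follow the example site layout; this snippet is not part of the pipeline itself.

# Sketch of what `rio rasterize --like <base image>` does, via the rasterio Python API.
import geopandas as gpd
import rasterio
from rasterio.features import rasterize

base = "data/sites/SITE/images/DATE_A/DATE_A_QB_NN_diffuse_geo.tif"
change = gpd.read_file("data/sites/SITE/change/DATE_A/DATE_B/vectorial/pits.appeared.geojson")

with rasterio.open(base) as src:
    mask = rasterize(
        ((geom, 1) for geom in change.to_crs(src.crs).geometry),  # burn 1 inside each pit polygon
        out_shape=(src.height, src.width),
        transform=src.transform,
        fill=0,
        dtype="uint8",
    )
    profile = src.profile
    profile.update(count=1, dtype="uint8", nodata=None)

with rasterio.open("pits.appeared.tif", "w", **profile) as dst:
    dst.write(mask, 1)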

vars:
  - change_kinds:
      - change: DATE_A/DATE_B
        base_image: DATE_A/DATE_A_QB_NN_diffuse_geo
        kind: pits.appeared
      - change: DATE_A/DATE_B
        base_image: DATE_A/DATE_A_QB_NN_diffuse_geo
        kind: pits.disappeared
      - change: DATE_B/DATE_C
        base_image: DATE_B/DATE_B_NN_diffuse_geo
        kind: pits.appeared
      - change: DATE_B/DATE_C
        base_image: DATE_B/DATE_B_NN_diffuse_geo
        kind: pits.disappeared
      - change: DATE_A/DATE_C
        base_image: DATE_A/DATE_A_QB_NN_diffuse_geo
        kind: pits.appeared
      - change: DATE_A/DATE_C
        base_image: DATE_A/DATE_A_QB_NN_diffuse_geo
        kind: pits.disappeared
stages:
  ...
  rasterize_change:
    foreach: ${change_kinds}
    do:
      wdir: ../../../.
      cmd:
        - mkdir -p data/sites/${site_name}/change/${item.change}/raster/
        - rio rasterize
          data/sites/${site_name}/change/${item.change}/vectorial/${item.kind}.geojson
          --like data/sites/${site_name}/images/${item.base_image}.tif
          --output data/sites/${site_name}/change/${item.change}/raster/${item.kind}.tif
      deps:
        - data/sites/${site_name}/change/${item.change}/vectorial/${item.kind}.geojson
        - data/sites/${site_name}/images/${item.base_image}.tif
      outs:
        - data/sites/${site_name}/change/${item.change}/raster/${item.kind}.tif

Tilizing rasters

The last step in the site data pipeline is splitting all the rasters (images and rasterized change features) into geo-referenced tiles. The alceo.processing.rasterize_tiles script tilizes and resamples an input raster following the vectorial representation of the tiles of the site's areas of interest.
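
The sketch below illustrates the tilization idea with the rasterio API: each tile polygon selects a window of the input raster, which is read and written out as an independent geo-referenced GeoTIFF. It is a simplification of the actual script: the tile naming scheme is assumed, and resampling is omitted.

# Sketch of tilization (hypothetical, not alceo.processing.rasterize_tiles).
from pathlib import Path

import geopandas as gpd
import rasterio
from rasterio.windows import from_bounds

tiles = gpd.read_file("data/sites/SITE/tiles.geojson")
out_dir = Path("tiles_sketch")
out_dir.mkdir(exist_ok=True)

with rasterio.open("data/sites/SITE/images/DATE_A/DATE_A_QB_NN_diffuse_geo.tif") as src:
    for idx, geom in enumerate(tiles.to_crs(src.crs).geometry):
        window = from_bounds(*geom.bounds, transform=src.transform)
        data = src.read(window=window, boundless=True)  # read the raster area covered by the tile
        with rasterio.open(
            out_dir / f"tile_{idx}.tif", "w", driver="GTiff",  # the real script names tiles after their identifier
            height=data.shape[1], width=data.shape[2], count=src.count,
            dtype=data.dtype, crs=src.crs, transform=src.window_transform(window),
        ) as dst:
            dst.write(data)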

In the site's dvc.yaml, this is done with the following stages:

vars:
  ...
stages:
  ...
  tilize_image:
    foreach: ${images}
    do:
      wdir: ../../../.
      cmd: python src/alceo/processing/rasterize_tiles.py
        -t data/sites/${site_name}/tiles.geojson
        -i data/sites/${site_name}/images/${item.path}
        -o data/sites/${site_name}/tiles/${item.name}
      deps:
        - src/alceo/processing/rasterize_tiles.py
        - data/sites/${site_name}/tiles.geojson
        - data/sites/${site_name}/images/${item.path}
      outs:
        - data/sites/${site_name}/tiles/${item.name}/
  tilize_change:
    foreach: ${change_kinds}
    do:
      wdir: ../../../.
      cmd:
        - mkdir -p data/sites/${site_name}/change/${item.change}/raster/${item.kind}/tiles
        - python src/alceo/processing/rasterize_tiles.py
          -t data/sites/${site_name}/tiles.geojson
          -i data/sites/${site_name}/change/${item.change}/raster/${item.kind}.tif
          -o data/sites/${site_name}/change/${item.change}/raster/${item.kind}/tiles
      deps:
        - src/alceo/processing/rasterize_tiles.py
        - data/sites/${site_name}/tiles.geojson
        - data/sites/${site_name}/change/${item.change}/raster/${item.kind}.tif
      outs:
        - data/sites/${site_name}/change/${item.change}/raster/${item.kind}/tiles

Dataset compilation

The last stage in the Data Management Sub-system pipeline is compiling the various change detection datasets. The alceo.processing.pits_site_dataset script takes all the pre-processed data in the data directory and outputs a dataset compliant with the SECOND structure, with the addition of a GeoJSON holding the vectorial representation of the change features and a tiles_meta.csv file containing metadata about the resulting change detection tiles. The repository's dataset/pits folder contains a dvc.yaml file with the configuration of the change detection datasets for some of the studied sites.

For example, a geographical train/test split of a site's data is configured as:

stages:
  build_train_SITE:
    wdir: ../../.
    cmd:
      - rm -rf dataset/pits/train_SITE
      - mkdir -p dataset/pits/train_SITE
      - python src/alceo/processing/pits_site_dataset.py
        -s data/sites/SITE
        -o dataset/pits/train_SITE
        -a data/sites/SITE/train_area.geojson
    deps:
      - src/alceo/processing/pits_site_dataset.py
      - data/sites/SITE/tiles
      - data/sites/SITE/change
      - data/sites/SITE/train_area.geojson
    outs:
      - dataset/pits/train_SITE
  build_test_SITE:
    wdir: ../../.
    cmd:
      - rm -rf dataset/pits/test_SITE
      - mkdir -p dataset/pits/test_SITE
      - python src/alceo/processing/pits_site_dataset.py
        -s data/sites/SITE
        -o dataset/pits/test_SITE
        -a data/sites/SITE/test_area.geojson
    deps:
      - src/alceo/processing/pits_site_dataset.py
      - data/sites/SITE/tiles
      - data/sites/SITE/change
      - data/sites/SITE/test_area.geojson
    outs:
      - dataset/pits/test_SITE
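
The two stages above implement a geographical split: a tile ends up in the training or test dataset depending on the area (train_area.geojson or test_area.geojson) its footprint falls into, so the two datasets never overlap spatially as long as the two areas do not overlap. The idea can be sketched as follows; this is a simplification of what pits_site_dataset does with the -a option, not the script itself.

# Sketch of a geographical train/test split of the site's tiles (hypothetical).
import geopandas as gpd

tiles = gpd.read_file("data/sites/SITE/tiles.geojson")
train_area = gpd.read_file("data/sites/SITE/train_area.geojson").to_crs(tiles.crs)
test_area = gpd.read_file("data/sites/SITE/test_area.geojson").to_crs(tiles.crs)

train_tiles = tiles[tiles.within(train_area.unary_union)]
test_tiles = tiles[tiles.within(test_area.unary_union)]
assert set(train_tiles.index).isdisjoint(test_tiles.index)  # no tile belongs to both splits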