Data Management Sub-system
This sub-system is in charge of producing the datasets used to train and evaluate the project's models, starting from satellite imagery and human annotations of the looting activity. Most machine learning models for computer vision tasks, such as the ones developed for this project, work directly on raster data, so the imagery and annotations must be pre-processed into that form. The pre-processing procedure consists of tools executed in a precise sequence. Such a sequence is called a pipeline.
Site data directory
For each site, a pre-processing pipeline has been configured. All the site data and the pipeline configuration are stored inside the data/sites/<SITE NAME> folder.
The data pipeline's main objective is to produce, from the site's images and annotations, all the intermediate artifacts needed to build the final dataset.
The setup of a site's data directory starts with the following components:
1. Area of Interest GeoJSON (e.g., area_of_interest.geojson) describing the site's boundaries, i.e. the area the sub-system should process.
2. Annotations GeoJSON (e.g., annotations/pits.geojson) containing the polygons delimiting the looting pits. Each pit should have a "Day_Month_Year" string identifying the date on which it was visible; the alceo.processing.change_from_annotations script uses this date to compute change (see the example feature after the directory listing below).
3. Geo-referenced satellite images. The supported image format is GeoTIFF. Using a folder for each GeoTIFF can help keep a tidy folder structure when other tools (e.g., QGIS) create metadata files.
4. A dvc.yaml file describing the site's data pipeline stages. The following section describes setting it up using the dvc stage add command.
An example of the resulting site data directory follows:
data/
  sites/
    <SITE NAME>/
      area_of_interest.geojson  # boundaries of the areas of interest.
      annotations/              # contains annotation files (i.e. pits.geojson).
        pits.geojson
      images/                   # contains image products.
        <IMAGE NAME>/
          image_name.tiff       # image file with an example name.
      dvc.yaml                  # configuration of the site-level pipeline.
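The annotations file is a plain GeoJSON FeatureCollection of pit polygons. As a rough illustration, a single feature could look like the one below; the property name and the exact date formatting are assumptions based on the "Day_Month_Year" convention described above, not a schema extracted from the scripts:

{
  "type": "Feature",
  "properties": {
    "Day_Month_Year": "05/10/2015"
  },
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [[437810.0, 3952540.0], [437816.0, 3952540.0],
       [437816.0, 3952546.0], [437810.0, 3952546.0],
       [437810.0, 3952540.0]]
    ]
  }
}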
Site data pipeline
The site data pipeline is in charge of splitting the areas of interest into geo-referenced tiles, computing from the site annotations the change in looting pits (appearance and disappearance) between two dates, and cutting all rasters (images and binary change masks) into tiles of the same resolution. DVC is the tool we've chosen to build this pipeline.
Producing a vectorial representation of tiles
The script alceo.processing.produce_tiles produces a GeoJSON containing the polygons that georeference all the vectorial tiles of a site's areas of interest. The produced GeoJSON also includes metadata about the resulting raster resolution and a generated identifier for each tile.
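As a rough illustration, a single tile feature in the produced GeoJSON could look like the following; the property names are assumptions made for this example, the actual schema is the one defined by the script:

{
  "type": "Feature",
  "properties": { "tile_id": "SITE_tile-0-0", "width": 256, "height": 256 },
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [[437760.0, 3952512.0], [437888.0, 3952512.0],
       [437888.0, 3952640.0], [437760.0, 3952640.0],
       [437760.0, 3952512.0]]
    ]
  }
}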
Creating the stage using the dvc stage add command:
cd data/sites/${site_name} # Get into the site data directory to generate the dvc.yaml file there.
# The -w flag sets the stage's working directory to the project root.
dvc stage add -n produce_vectorial_tiles \
  -w ../../../. \
  -d src/alceo/processing/produce_tiles.py \
  -d data/sites/${site_name}/images/${images_path} \
  -d data/sites/${site_name}/area_of_interest.geojson \
  -o data/sites/${site_name}/tiles.geojson \
  python src/alceo/processing/produce_tiles.py \
    -i data/sites/${site_name}/images/${images_path} \
    -p ${tile_prefix} \
    -a data/sites/${site_name}/area_of_interest.geojson \
    -tw 256 \
    -th 256 \
    -o data/sites/${site_name}/tiles.geojson
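Note that the ${...} placeholders are meant to reach dvc.yaml verbatim so that DVC's templating can resolve them later; when the command above is typed in an interactive shell, they should be single-quoted (or the dollar signs escaped), otherwise the shell will try to expand them first. A quick way to see the difference:

# Single quotes keep the placeholder literal (what we want written into dvc.yaml):
echo 'data/sites/${site_name}/tiles.geojson'
# Double quotes (or no quotes) let the shell expand ${site_name} before DVC sees it:
echo "data/sites/${site_name}/tiles.geojson"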
Executing this command will result in the generation of the following dvc.yaml file:
stages:
produce_vectorial_tiles:
wdir: ../../../.
cmd: python src/alceo/processing/produce_tiles.py
-i data/sites/${site_name}/images/${images_path}
-p ${tile_prefix}
-a data/sites/${site_name}/area_of_interest.geojson
-tw 256
-th 256
-o data/sites/${site_name}/tiles.geojson
deps:
- src/alceo/processing/produce_tiles.py
- data/sites/${site_name}/images/${images_path}
- data/sites/${site_name}/area_of_interest.geojson
outs:
- data/sites/${site_name}/tiles.geojson
The outs field of a stage definition represents the files generated by the stage, whereas deps are the files on which the command's output depends. In this way, DVC can compute how the stages of the pipeline depend on each other and avoid re-computing parts for which the dependencies did not change (and whose outputs should therefore remain the same).
DVC has templating functionalities that allow the parametric definition of stages. In the previous stage, ${site_name}, ${images_path}, and ${tile_prefix} are examples of DVC templating.
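DVC resolves these placeholders from variables declared in a params.yaml file or in a vars section of the dvc.yaml itself; the vars approach is the one used in the rest of this section. A minimal sketch of such a section for the stage above, with placeholder values, would be:

vars:
  - site_name: SITE
  - images_path: DATE_A/DATE_A_QB_NN_diffuse_geo.tif
  - tile_prefix: SITE_tile

The full vars block shown later declares the images as a list instead, so that foreach stages can iterate over them.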
Computing change from site annotations
A crucial step in the pipeline is the computation of the appearance and the disappearance of looting pits from the annotations of images taken on two dates.
The script alceo.processing.change_from_annotations takes the annotations GeoJSON for a site and the two dates of interest. The script then computes the features corresponding to the looting pits' appearance, disappearance, and permanence and saves them into a target folder as three GeoJSON files.
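Conceptually, the three outputs boil down to set operations between the pit geometries annotated at the two dates. The following sketch, based on geopandas and an assumed date column name, only illustrates the idea and is not the script's actual implementation:

import geopandas as gpd

def change_features(pits_path, first_date, second_date, date_column="Day_Month_Year"):
    """Illustrative only: appearance, disappearance and permanence of pits
    between two annotation dates (date_column is an assumed field name)."""
    pits = gpd.read_file(pits_path)
    first = pits[pits[date_column] == first_date]
    second = pits[pits[date_column] == second_date]
    appeared = gpd.overlay(second, first, how="difference")      # in second but not in first
    disappeared = gpd.overlay(first, second, how="difference")   # in first but not in second
    permanence = gpd.overlay(first, second, how="intersection")  # present at both dates
    return appeared, disappeared, permanence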
In the site's dvc.yaml file this stage is defined as:
stages:
...
produce_vectorial_change:
wdir: ../../../. # workdir is the project root directory.
cmd: python src/alceo/processing/change_from_annotations.py
-i data/sites/${site_name}/annotations/pits.geojson
-f ${first.date} -s ${second.date}
-o data/sites/${site_name}/change/${first.image_folder}/${second.image_folder}/vectorial # arbitrary folder path
--crs ${crs}
deps:
- src/alceo/processing/change_from_annotations.py
- data/sites/${site_name}/annotations/pits.geojson
outs:
- data/sites/${site_name}/change/${first.image_folder}/${second.image_folder}/vectorial
If the site's images form a time series spanning more than two dates, multiple pipeline stages of this kind can be defined using DVC templating's foreach stages construct.
vars:
- site_name: SITE
- images:
- name: DATE_A
path: DATE_A/DATE_A_QB_NN_diffuse_geo.tif
- name: DATE_B
path: DATE_B/DATE_B_NN_diffuse_geo.tif
- name: DATE_C
path: DATE_C/DATE_C_NN_diffuse.tif
- change_steps:
- change: DATE_A/DATE_B
crs: "EPSG:32636"
first:
date: "04/08/2005"
image_folder: DATE_A
second:
date: "05/10/2015"
image_folder: DATE_B
- change: DATE_B/DATE_C
crs: "EPSG:32636"
first:
date: "05/10/2015"
image_folder: DATE_B
second:
date: "20/03/2018"
image_folder: DATE_C
- change: DATE_A/DATE_C
crs: "EPSG:32636"
first:
date: "04/08/2005"
image_folder: DATE_A
second:
date: "20/03/2018"
image_folder: DATE_C
stages:
...
produce_vectorial_change:
foreach: ${change_steps}
do:
wdir: ../../../.
cmd: python src/alceo/processing/change_from_annotations.py
-i data/sites/${site_name}/annotations/pits.geojson
-f ${item.first.date} -s ${item.second.date}
-o data/sites/${site_name}/change/${item.first.image_folder}/${item.second.image_folder}/vectorial
--crs ${item.crs}
deps:
- src/alceo/processing/change_from_annotations.py
- data/sites/${site_name}/annotations/pits.geojson
outs:
- data/sites/${site_name}/change/${item.first.image_folder}/${item.second.image_folder}/vectorial
Rasterizing vectorial change representations
The next stage of the pipeline obtains a raster from the vectorial change features just computed. Rasterio's rio rasterize command allows for the rasterization of GeoJSON features. In compliance with the SECOND dataset structure, a stage is defined to rasterize each change feature into a binary mask.
vars:
- change_kinds:
- change: DATE_A/DATE_B
base_image: DATE_A/DATE_A_QB_NN_diffuse_geo
kind: pits.appeared
- change: DATE_A/DATE_B
base_image: DATE_A/DATE_A_QB_NN_diffuse_geo
kind: pits.disappeared
- change: DATE_B/DATE_C
base_image: DATE_B/DATE_B_NN_diffuse_geo
kind: pits.appeared
- change: DATE_B/DATE_C
base_image: DATE_B/DATE_B_NN_diffuse_geo
kind: pits.disappeared
- change: DATE_A/DATE_C
base_image: DATE_A/DATE_A_QB_NN_diffuse_geo
kind: pits.appeared
- change: DATE_A/DATE_C
base_image: DATE_A/DATE_A_QB_NN_diffuse_geo
kind: pits.disappeared
stages:
...
rasterize_change:
foreach: ${change_kinds}
do:
wdir: ../../../.
cmd:
- mkdir -p data/sites/${site_name}/change/${item.change}/raster/
- rio rasterize
data/sites/${site_name}/change/${item.change}/vectorial/${item.kind}.geojson
--like data/sites/${site_name}/images/${item.base_image}.tif
--output data/sites/${site_name}/change/${item.change}/raster/${item.kind}.tif
deps:
- data/sites/${site_name}/change/${item.change}/vectorial/${item.kind}.geojson
- data/sites/${site_name}/images/${item.base_image}.tif
outs:
- data/sites/${site_name}/change/${item.change}/raster/${item.kind}.tif
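The --like flag makes the output raster share the grid (extent, resolution, and CRS) of the given base image, so each binary mask overlays its image pixel for pixel. For reference only, an equivalent operation written with rasterio's Python API could look roughly like this (function and argument choices are illustrative, not taken from the project):

import geopandas as gpd
import rasterio
from rasterio import features

def rasterize_like(vector_path, like_raster_path, output_path):
    """Burn vector features into a binary (0/1) mask on the grid of a reference raster."""
    geometries = gpd.read_file(vector_path).geometry
    with rasterio.open(like_raster_path) as ref:
        profile = ref.profile.copy()
        profile.update(count=1, dtype="uint8", nodata=0)
        mask = features.rasterize(
            ((geom, 1) for geom in geometries),  # burn value 1 inside every polygon
            out_shape=(ref.height, ref.width),
            transform=ref.transform,
            fill=0,
            dtype="uint8",
        )
    with rasterio.open(output_path, "w", **profile) as dst:
        dst.write(mask, 1)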
Raster tilization
The last step in the data pre-processing pipeline is to split all the rasters (images and annotated features) into georeferenced tiles. The alceo.processing.rasterize_tiles script tilizes and resamples an input raster following the vectorial representation of the tiles of the site's areas of interest.
This is done with the following stages:
vars:
...
stages:
...
tilize_image:
foreach: ${images}
do:
wdir: ../../../.
cmd: python src/alceo/processing/rasterize_tiles.py
-t data/sites/${site_name}/tiles.geojson
-i data/sites/${site_name}/images/${item.path}
-o data/sites/${site_name}/tiles/${item.name}
deps:
- src/alceo/processing/rasterize_tiles.py
- data/sites/${site_name}/tiles.geojson
- data/sites/${site_name}/images/${item.path}
outs:
- data/sites/${site_name}/tiles/${item.name}/
tilize_change:
foreach: ${change_kinds}
do:
wdir: ../../../.
cmd:
- mkdir -p data/sites/${site_name}/change/${item.change}/raster/${item.kind}/tiles
- python src/alceo/processing/rasterize_tiles.py
-t data/sites/${site_name}/tiles.geojson
-i data/sites/${site_name}/change/${item.change}/raster/${item.kind}.tif
-o data/sites/${site_name}/change/${item.change}/raster/${item.kind}/tiles
deps:
- src/alceo/processing/rasterize_tiles.py
- data/sites/${site_name}/tiles.geojson
- data/sites/${site_name}/change/${item.change}/raster/${item.kind}.tif
outs:
- data/sites/${site_name}/change/${item.change}/raster/${item.kind}/tiles
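The core of the tilization itself amounts to reading, for every tile polygon, the corresponding window of the input raster and writing it out as a small GeoTIFF (the actual script also resamples to the tile's target resolution, which this sketch omits). The snippet below, using rasterio and geopandas with an assumed tile identifier field, is an illustration of that idea rather than the script's real code:

import os
import geopandas as gpd
import rasterio
from rasterio.mask import mask as crop_to_geometry

def tilize_raster(tiles_path, raster_path, output_dir, tile_id_field="tile_id"):
    """Cut a raster into one GeoTIFF per tile polygon (tile_id_field is an assumed name)."""
    os.makedirs(output_dir, exist_ok=True)
    tiles = gpd.read_file(tiles_path)
    with rasterio.open(raster_path) as src:
        tiles = tiles.to_crs(src.crs)  # align tile polygons with the raster CRS
        for _, tile in tiles.iterrows():
            data, transform = crop_to_geometry(src, [tile.geometry], crop=True)
            profile = src.profile.copy()
            profile.update(height=data.shape[1], width=data.shape[2], transform=transform)
            out_path = os.path.join(output_dir, f"{tile[tile_id_field]}.tif")
            with rasterio.open(out_path, "w", **profile) as dst:
                dst.write(data)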
Dataset compilation
The last stage in the Data Management Sub-system pipeline is compiling the various datasets for change detection. The alceo.processing.pits_site_dataset script takes all the pre-processed data in the data directory and outputs a dataset compliant with the SECOND structure, with the addition of a GeoJSON holding the vectorial representation of the change features and a tiles_meta.csv file containing metadata about the resulting change detection tiles. The repository's dataset/pits folder contains a dvc.yaml file with the configuration of the change detection datasets for the studied sites.
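As a point of reference, assuming the usual SECOND-style layout (the folder names below are an assumption, not taken from the script), a compiled dataset folder would look roughly like this:

dataset/pits/train_SITE/
  im1/            # tiles of the first-date image
  im2/            # tiles of the second-date image
  label1/         # binary change masks (e.g. pits appeared)
  label2/         # binary change masks (e.g. pits disappeared)
  change.geojson  # vectorial representation of the change features (name assumed)
  tiles_meta.csv  # metadata about the resulting change detection tiles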
For example, a geographical train/test split of a site's data is configured as:
stages:
build_train_SITE:
wdir: ../../.
cmd:
- rm -rf dataset/pits/train_SITE
- mkdir -p dataset/pits/train_SITE
- python src/alceo/processing/pits_site_dataset.py
-s data/sites/SITE
-o dataset/pits/train_SITE
-a data/sites/SITE/train_area.geojson
deps:
- src/alceo/processing/pits_site_dataset.py
- data/sites/SITE/tiles
- data/sites/SITE/change
- data/sites/SITE/train_area.geojson
outs:
- dataset/pits/train_SITE
build_test_SITE:
wdir: ../../.
cmd:
- rm -rf dataset/pits/test_SITE
- mkdir -p dataset/pits/test_SITE
- python src/alceo/processing/pits_site_dataset.py
-s data/sites/SITE
-o dataset/pits/test_SITE
-a data/sites/SITE/test_area.geojson
deps:
- src/alceo/processing/pits_site_dataset.py
- data/sites/SITE/tiles
- data/sites/SITE/change
- data/sites/SITE/test_area.geojson
outs:
- dataset/pits/test_SITE
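Once the site-level and dataset-level dvc.yaml files are in place, the pipelines are run with the standard DVC commands; DVC re-executes only the stages whose dependencies changed. For example, from the repository root:

dvc repro data/sites/SITE/dvc.yaml   # run the pre-processing pipeline of a site
dvc repro dataset/pits/dvc.yaml      # compile the change detection datasets
dvc dag                              # inspect the resulting stage dependency graph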