Add methods for extracting true footprint for sampling valid data only #1881

adriantre · 2024-02-14T13:36:54Z

A RasterDataset may contain nodata regions, both because all files are reprojected to the same CRS and because the images themselves may contain nodata regions.
When IntersectionDataset joins it with a VectorDataset, this may yield:

false positive samples (bad for learning)
empty negative samples (may be bad for learning)

The solution can be summarised as:

In RasterDataset, when opening each file, extract the footprint and add it to the rtree index object.
In IntersectionDataset._merge_dataset_indices, copy the footprint over to the new rtree index.
In the same method, we could optimise by shrinking the bbox to cover only the actual intersection of valid data.
In RandomGeoSampler.__iter__, use this footprint to validate that the sample bbox actually overlaps valid data, and don't yield until a valid box is found.
Enable the same for GridGeoSampler (probably another PR).
Remove the label mask over any nodata regions that remain inside the sample. (Because the criterion above is overlaps rather than contains, corners of the resulting sample may still contain nodata that the label mask still covers.) (probably another PR)
Add the ability to balance positive and negative samples. The vector data can be intersected with the raster valid-data footprint in the GeoSampler to facilitate this. Right now torchgeo gives the user no control over it. (probably another PR)
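
The sampler step above amounts to rejection sampling: draw a random box, keep it only if it overlaps the valid-data footprint, and give up after a fixed number of tries. Below is a minimal sketch of that idea, not the code in this PR. It assumes the footprint is simplified to a list of axis-aligned rectangles so that it runs with the standard library only; `Box`, `boxes_overlap` and `sample_valid_box` are hypothetical names, and `max_retries` merely mirrors the parameter named in the commit list.

```python
import random
from typing import List, Optional, Tuple

# (minx, miny, maxx, maxy); a hypothetical stand-in for a bounding box type.
Box = Tuple[float, float, float, float]


def boxes_overlap(a: Box, b: Box) -> bool:
    """Return True if the interiors of two axis-aligned boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sample_valid_box(
    bounds: Box,
    footprint: List[Box],
    size: float,
    max_retries: int = 50,
    rng: Optional[random.Random] = None,
) -> Optional[Box]:
    """Draw random size x size boxes inside bounds until one overlaps the footprint.

    Returns None if no overlapping box is found within max_retries attempts.
    """
    rng = rng or random.Random()
    minx, miny, maxx, maxy = bounds
    for _ in range(max_retries):
        x = rng.uniform(minx, maxx - size)
        y = rng.uniform(miny, maxy - size)
        candidate = (x, y, x + size, y + size)
        # "Overlaps", not "contains": corners of the box may still be nodata.
        if any(boxes_overlap(candidate, part) for part in footprint):
            return candidate
    return None
```

Returning None after `max_retries` leaves it to the caller to decide whether to skip the hit or raise, which seems safer than looping forever on a file that is almost entirely nodata.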

Useful resources:
Rasterio nodata masks:
https://rasterio.readthedocs.io/en/latest/topics/masks.html#nodata-masks

Extract valid data footprint as vector
https://gist.github.com/sgillies/9713809

Reproject valid data footprint with rasterio
https://geopandas.org/en/stable/docs/user_guide/reproject_fiona.html#rasterio-example

johnnv1 · 2024-02-14T23:16:06Z

torchgeo/datasets/utils.py

+    # Read valid/nodata-mask
+    mask = src.read_masks()
+    # Close holes
+    sieved_mask = sieve(mask, 500)


We can't know in advance the minimum size that suits every raster. Maybe the threshold could be computed with a factor based on the mask shape?

The target here is a polygon with no holes. 500 is probably never too big (about 22 x 22 pixels). We could increase it, too.

If there are more (bigger) holes left, we could close them using shapely after converting to vector. What do you think?

Something I thought of, if it is possible, was to use the window size to compute the size threshold for closing the polygons.

Hmm, you are probably on to something. I'm struggling to decide what effect it might have if we set the size too big or too small.

I was thinking of something like this: if we close holes that are bigger than the window size, we can still get samples that contain only nodata ... considering the multi-polygon case here.

One example: if we mask Sentinel-2 clouds as nodata beforehand and then close holes larger than the window size, we can still get random samples that fall entirely within those nodata regions.

Sounds smart!

One issue is that the patch_size the sampler will use is not available at this point in the code. Footprint extraction happens in RasterDataset init, separate from the sampler init.
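
To make the trade-off in this thread concrete, here is a rough sketch of closing only those nodata holes that are smaller than a threshold derived from the patch size. It is not `rasterio.features.sieve`: it swaps in a plain breadth-first flood fill so it runs without dependencies. `fill_small_holes`, `hole_threshold` and the `factor` parameter are hypothetical names for illustration.

```python
from collections import deque
from typing import List

Mask = List[List[int]]  # 1 = valid data, 0 = nodata


def fill_small_holes(mask: Mask, max_hole_size: int) -> Mask:
    """Mark 4-connected nodata regions smaller than max_hole_size pixels as valid."""
    rows, cols = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 0 or seen[r][c]:
                continue
            # Collect one connected nodata region with a breadth-first search.
            region = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 0 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Regions at or above the threshold stay nodata.
            if len(region) < max_hole_size:
                for y, x in region:
                    out[y][x] = 1
    return out


def hole_threshold(patch_size_px: int, factor: float = 1.0) -> int:
    """Hypothetical rule: scale the threshold with the sampler's patch area."""
    return int(factor * patch_size_px * patch_size_px)
```

With a rule like this, a 22-pixel patch gives a threshold of 484 pixels, close to the hard-coded 500. Holes larger than a patch stay in the footprint, so the sampler can still reject boxes that land entirely inside them.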

adriantre · 2024-02-15T14:14:31Z

torchgeo/datasets/geo.py

+                # Get the first valid nodata value.
+                # Usually the same value for all bands
+                nodata = valid_nodatavals[0]
+            vrt = WarpedVRT(src, nodata=nodata, crs=self.crs)


As far as I can see, Sentinel-2 has no nodata value set. I looked everywhere: even after enabling the alpha layer in the Sentinel-2 GDAL driver and looking through the MSK_QUALIT-file, I found nothing.

This change will set the nodata value. Some datasets have other nodata values, and we should probably let the user override this, for example in their subclass of RasterDataset.

Currently, nodata is only overridden for the warped data sources. The non-warped ones (already in the correct CRS) are opened as is, but would also need to have their nodata overridden.
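
One way to express the precedence discussed here is a small helper that prefers a user override, then the first nodata value declared on the file, then a fallback. This is only a sketch: `resolve_nodata`, `override` and `fallback` are hypothetical names, and the fallback of 0 is an assumption, not necessarily the value this PR uses. The input is meant to look like a per-band tuple such as rasterio's `src.nodatavals`.

```python
from typing import Optional, Sequence


def resolve_nodata(
    nodatavals: Sequence[Optional[float]],
    override: Optional[float] = None,
    fallback: float = 0,
) -> float:
    """Pick the nodata value to use when opening a raster.

    Precedence: explicit override, first declared per-band value, fallback.
    """
    if override is not None:
        return override
    # Keep only the bands that actually declare a nodata value.
    valid = [value for value in nodatavals if value is not None]
    if valid:
        return valid[0]
    return fallback
```

Applying the same helper on both the warped and the non-warped code paths would keep the two consistent, which is the gap noted above.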

adriantre · 2024-02-15T14:51:13Z

torchgeo/samplers/single.py

            hit = self.hits[idx]
            bounds = BoundingBox(*hit.bounds)


Only the first hit (file) is chosen. Currently I only use the footprint previously extracted for this file. But the sample is read from the merged raster, and the footprint of this one file might not cover the others.

Is my understanding correct?

In that case we would need to fetch all hits that overlap the randomly chosen hit's bounds, combine their footprints, crop the result to the bounds, and pass the resulting footprint to get_random_bounding_box_check_valid_overlap.

The footprint for a hit is static, so it could be joined and cropped during init, in IntersectionDataset._merge_dataset_indices.

I assumed my understanding was correct and added this functionality to _merge_dataset_indices.
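
The combine-and-crop step described above could look roughly like the following. It is a simplified sketch that again treats footprints as lists of axis-aligned rectangles; a real implementation would presumably use polygon union and intersection instead. `Box`, `clip` and `merged_footprint` are hypothetical names.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def clip(box: Box, bounds: Box) -> Optional[Box]:
    """Return the part of box inside bounds, or None if they do not overlap."""
    minx, miny = max(box[0], bounds[0]), max(box[1], bounds[1])
    maxx, maxy = min(box[2], bounds[2]), min(box[3], bounds[3])
    if minx >= maxx or miny >= maxy:
        return None
    return (minx, miny, maxx, maxy)


def merged_footprint(bounds: Box, hit_footprints: List[List[Box]]) -> List[Box]:
    """Collect the footprint parts of every hit, cropped to bounds."""
    merged = []
    for footprint in hit_footprints:
        for part in footprint:
            clipped = clip(part, bounds)
            if clipped is not None:
                merged.append(clipped)
    return merged
```

Because each hit's footprint is static, this can run once while the indices are merged rather than on every sample.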

adriantre · 2024-02-16T06:47:26Z

torchgeo/datasets/utils.py

+    # Read valid/nodata-mask
+    mask = src.read_masks()
+    # Close holes
+    sieved_mask = sieve(mask, 500)


Sounds smart!

Add methods for extracting true footprint for sampling valid data only

9d2a497

github-actions bot added the datasets and samplers labels Feb 14, 2024

adamjstewart added this to the 0.6.0 milestone Feb 14, 2024

Add max_retries to get_random_bounding_box_check_valid_overlap

1e43522

johnnv1 reviewed Feb 14, 2024

adriantre added 2 commits February 15, 2024 15:09

Set nodata value for raster if None

4432803

This is required for the footprint-extraction to work

Handle spatial shift between band and multipolygonal footprint

580c6c2

adriantre commented Feb 15, 2024

adriantre mentioned this pull request Feb 15, 2024

Add BalancedRandomGeoSampler balancing positives and negatives #1883

Open

When merging dataset indices, also merge raster footprints

92bb008

adriantre commented Feb 16, 2024

adriantre added 2 commits February 16, 2024 09:30

Use correct bounds in merge_indices

d04451d

Add missing check for masks dimension

bbb21d5

adamjstewart mentioned this pull request Apr 1, 2024

I/O Bench: add new dataset #1972

Merged

5 tasks
