Add methods for extracting true footprint for sampling valid data only #1881
base: main
# Read valid/nodata-mask
mask = src.read_masks()
# Close holes
sieved_mask = sieve(mask, 500)
We can't know a minimum size that works for every raster. Maybe the size could be computed from the mask shape using a factor?
The target here is a polygon with no holes. 500 is probably never too big (about 22 x 22 pixels), and we could increase it, too.
If there are more (bigger) holes left, we could close them using shapely after converting to vector. What do you think?
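A minimal sketch of closing holes after the mask has been converted to vector form. With shapely, rebuilding a polygon from its exterior ring alone (`Polygon(poly.exterior)`) drops every hole. The library-free version below shows a more selective variant that only drops holes up to a given area. The names `ring_area`, `drop_small_holes`, and `max_hole_area` are hypothetical and not part of the PR.

```python
from typing import List, Sequence, Tuple

Ring = Sequence[Tuple[float, float]]


def ring_area(ring: Ring) -> float:
    """Absolute area of a ring of (x, y) vertices via the shoelace formula."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def drop_small_holes(holes: List[Ring], max_hole_area: float) -> List[Ring]:
    """Return only the holes whose area exceeds max_hole_area (hypothetical helper)."""
    return [hole for hole in holes if ring_area(hole) > max_hole_area]


# A 100 x 100 footprint with one small and one large hole.
exterior = [(0, 0), (100, 0), (100, 100), (0, 100)]
small_hole = [(10, 10), (20, 10), (20, 20), (10, 20)]  # 10 x 10
large_hole = [(40, 40), (90, 40), (90, 90), (40, 90)]  # 50 x 50
kept = drop_small_holes([small_hole, large_hole], max_hole_area=500)
```

Here only `large_hole` survives. Passing a very large `max_hole_area` gives the same result as keeping the exterior ring alone.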
Something I thought of, if it is possible, was to use the size of the window to compute the size of the holes to close in the polygons.
Hmm, you are probably on to something. I'm struggling to decide what effect it might have if we set the size too big or too small.
I thought of something like this: if we close holes that are bigger than the window size, we can still end up retrieving samples that contain only nodata ... considering the multi-polygon thing here.
One example: if we mask Sentinel-2 clouds as nodata beforehand and then close holes bigger than the window size, we can still get random samples that fall entirely within these nodata regions.
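A rough sketch of that idea, assuming the sampler's `patch_size` (in pixels) is known when the mask is read, which is not the case in the PR as it stands. It closes only nodata regions with fewer pixels than one patch, since such a region can never contain a whole patch. The function `fill_small_holes` is hypothetical. It works on a plain nested list and, as a simplification, treats small nodata regions at the raster edge the same way as interior holes. The same threshold could presumably replace the fixed 500 passed to `sieve`.

```python
from collections import deque
from typing import List


def fill_small_holes(mask: List[List[int]], patch_size: int) -> List[List[int]]:
    """Mark nodata regions (0) smaller than patch_size**2 pixels as valid (1).

    Regions with at least patch_size**2 pixels are left open, because a
    patch could fall entirely inside them.
    """
    rows, cols = len(mask), len(mask[0])
    out = [row[:] for row in mask]  # do not modify the input
    seen = [[False] * cols for _ in range(rows)]
    threshold = patch_size * patch_size
    for r in range(rows):
        for c in range(cols):
            if out[r][c] != 0 or seen[r][c]:
                continue
            # Collect the 4-connected nodata region containing (r, c).
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and out[ny][nx] == 0:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) < threshold:
                for y, x in region:
                    out[y][x] = 1
    return out


# 6 x 6 mask with a single-pixel hole and a 3 x 3 nodata block.
mask = [[1] * 6 for _ in range(6)]
mask[1][1] = 0
for y in range(3, 6):
    for x in range(3, 6):
        mask[y][x] = 0
filled = fill_small_holes(mask, patch_size=2)
```

With `patch_size=2` the single-pixel hole is closed, while the 3 x 3 block stays nodata because a 2 x 2 patch fits inside it.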
Sounds smart!
One issue is that the patch_size the sampler will use is not available at this point in the code. Footprint extraction happens during RasterDataset init, which is separate from the Sampler init.
This is required for the footprint-extraction to work
# Get the first valid nodata value.
# Usually the same value for all bands
nodata = valid_nodatavals[0]
vrt = WarpedVRT(src, nodata=nodata, crs=self.crs)
As far as I can see, Sentinel-2 has no value set for nodata. I looked everywhere. Even after enabling the alpha layer in the Sentinel-2 GDAL driver and looking through the MSK_QUALIT file, I found nothing.
This change will set the nodata value. Some datasets have other nodata values, and we should probably let the user override this, for example in their subclass of RasterDataset.
Currently, the nodata is only overridden for the warped datasources. The non-warped (already correct CRS) are opened as is, but would also need to have the nodata overridden.
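One way to keep the warped and non-warped paths consistent would be to resolve the nodata value in a single place before opening either kind of source. The helper below is a hypothetical sketch, not code from the PR: `resolve_nodata` and its `override` argument are assumed names, and `nodatavals` stands for the per-band tuple that rasterio exposes as `src.nodatavals`.

```python
from typing import Optional, Sequence


def resolve_nodata(
    nodatavals: Sequence[Optional[float]],
    override: Optional[float] = None,
) -> Optional[float]:
    """Pick the nodata value to apply when opening a file.

    A user-supplied override (for example set on a RasterDataset
    subclass) wins. Otherwise the first band-level value that is set
    is used, or None if the file declares no nodata at all.
    """
    if override is not None:
        return override
    valid = [value for value in nodatavals if value is not None]
    return valid[0] if valid else None


# A file that declares no nodata (as reported for Sentinel-2 above).
undeclared = resolve_nodata((None, None, None))
# The same file with a user override.
overridden = resolve_nodata((None, None, None), override=0)
# A file that declares the same value on every band.
declared = resolve_nodata((255, 255, 255))
```

The resolved value could then be passed both to `WarpedVRT(..., nodata=...)` and to whatever mechanism ends up overriding nodata for sources already in the target CRS.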
hit = self.hits[idx]
bounds = BoundingBox(*hit.bounds)
Only the first hit (file) is chosen. Currently I only use the footprint previously extracted for this file. But the sample is read from the merged raster, and the footprint for this one file might not cover the others.
Is my understanding correct?
In that case we would need to fetch all hits that overlap the randomly chosen hit's bounds, combine their footprints, crop the result to the bounds, and pass the resulting footprint to get_random_bounding_box_check_valid_overlap.
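The combine-and-crop step described above could look roughly like the sketch below. It is a simplification under stated assumptions: footprints are modelled as lists of axis-aligned rectangles rather than real polygons, and `Box`, `Hit`, `intersect`, and `combined_footprint` are hypothetical names, not torchgeo or rtree API. With real geometries the union and crop would be polygon operations instead.

```python
from typing import List, NamedTuple, Optional


class Box(NamedTuple):
    minx: float
    miny: float
    maxx: float
    maxy: float


class Hit(NamedTuple):
    bounds: Box
    footprint: List[Box]  # valid-data area, simplified to rectangles


def intersect(a: Box, b: Box) -> Optional[Box]:
    """Return the overlap of two boxes, or None if they share no area."""
    minx, miny = max(a.minx, b.minx), max(a.miny, b.miny)
    maxx, maxy = min(a.maxx, b.maxx), min(a.maxy, b.maxy)
    if minx >= maxx or miny >= maxy:
        return None
    return Box(minx, miny, maxx, maxy)


def combined_footprint(chosen: Hit, hits: List[Hit]) -> List[Box]:
    """Footprint parts of every hit overlapping chosen.bounds, cropped to it."""
    pieces = []
    for hit in hits:
        if intersect(hit.bounds, chosen.bounds) is None:
            continue  # this file does not touch the chosen bounds
        for part in hit.footprint:
            cropped = intersect(part, chosen.bounds)
            if cropped is not None:
                pieces.append(cropped)
    return pieces


a = Hit(Box(0, 0, 10, 10), [Box(0, 0, 6, 10)])
b = Hit(Box(5, 0, 15, 10), [Box(7, 0, 15, 10)])
c = Hit(Box(20, 20, 30, 30), [Box(20, 20, 30, 30)])
merged = combined_footprint(a, [a, b, c])
```

For the chosen hit `a`, the result holds its own footprint plus the part of `b`'s footprint that lies inside `a`'s bounds, while `c` is ignored because it does not overlap.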
The footprint for a hit is static, so the footprints could be joined and cropped once during init, in IntersectionDataset._merge_dataset_indices.
I assume I have understood this correctly, and have added this functionality in _merge_dataset_indices.
Fix #1330
RasterDatasets may contain nodata regions, both because all files are projected to the same CRS and because the images themselves may have inherent nodata regions.
When IntersectionDataset joins this with VectorDataset, this may yield
The solution can be summarised as:
- RasterDataset: when opening each file, extract the footprint and add it to the rtree index object.
- IntersectionDataset._merge_dataset_indices: copy the footprint over to the new rtree index.
- RandomGeoSampler.__iter__: use this footprint to validate that the sample bbox actually overlaps, and don't yield until a valid box is found.
- GridGeoSampler (probably other PR)
- Since the check uses overlaps and not contains, corners of the resulting sample may still contain nodata, while the label mask may still cover this. (probably other PR)
Useful resources:
Rasterio nodata masks:
https://rasterio.readthedocs.io/en/latest/topics/masks.html#nodata-masks
Extract valid data footprint as vector
https://gist.github.com/sgillies/9713809
Reproject valid data footprint with rasterio
https://geopandas.org/en/stable/docs/user_guide/reproject_fiona.html#rasterio-example
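The RandomGeoSampler.__iter__ step in the summary (retry until the sample box overlaps the footprint) amounts to rejection sampling. The sketch below illustrates it under simplifying assumptions: the footprint is a list of axis-aligned rectangles instead of a polygon, and `sample_valid_box`, `require_contained`, and `max_tries` are hypothetical names, not the PR's implementation. The `require_contained` flag shows the stricter "contains" check, which would avoid nodata in the sample corners.

```python
import random
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def contains(outer: Box, inner: Box) -> bool:
    """True if inner lies completely within outer."""
    return (
        outer[0] <= inner[0] and outer[1] <= inner[1]
        and inner[2] <= outer[2] and inner[3] <= outer[3]
    )


def sample_valid_box(
    bounds: Box,
    footprint: List[Box],
    size: float,
    rng: random.Random,
    require_contained: bool = False,
    max_tries: int = 1000,
) -> Box:
    """Draw random boxes inside bounds until one passes the footprint check."""
    check = contains if require_contained else overlaps
    for _ in range(max_tries):
        minx = rng.uniform(bounds[0], bounds[2] - size)
        miny = rng.uniform(bounds[1], bounds[3] - size)
        box = (minx, miny, minx + size, miny + size)
        # With contains, the footprint part is the outer box.
        if any(check(part, box) for part in footprint):
            return box
    raise RuntimeError("no valid sample found within max_tries")


bounds = (0.0, 0.0, 100.0, 100.0)
footprint = [(60.0, 60.0, 100.0, 100.0)]  # valid data in one corner only
box = sample_valid_box(bounds, footprint, size=10.0, rng=random.Random(0))
```

A cap such as `max_tries` is one way to avoid looping forever when the footprint is tiny or empty. Whether the sampler should raise or skip the hit in that case is a design decision left open here.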