Xarray Dataset Support #1490

nilsleh · 2023-07-20T14:04:13Z

This PR aims to add support for data loading based on xarray datasets, following the discussion that began in #1486.

My use case is loading CMIP6 data from separate files, each of which defines one climate variable, and extracting sampled time series from them. The sampling can be handled by dedicated samplers like the ones I started working on in #877.

The current draft is based on the closed #509, but instead of a single xr.DataArray it takes a directory and gathers files from it like RasterDataset does.

The climate data, for example, does not have a CRS, but it does come in different resolutions. Additionally, some data may be time series while other data is not, so the two need to be merged correctly. Other climate data may also have a "depth" component.

Since there is a lot to consider, I would definitely appreciate input from more experienced users who have a better sense of commonly required workflows.

Closes #1486

adamjstewart · 2023-07-22T04:58:39Z

We discussed this over Zoom, but my basic comments were along these lines:

- This base class should be generic enough to support any NetCDF-based dataset.
- Most NetCDF files won't have a default CRS, so we should default to lat/long (EPSG:4326, I believe).
- I'm not sure whether files have a way of expressing their bounds; if they don't, we should default to the entire Earth.
- Subclasses can override the defaults to change to a different CRS or extent. This may require a custom __getitem__ that parses the extent from a particular layer.
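The defaulting behaviour described in the list above could be sketched roughly as follows. This is only an illustration in plain Python: `GeoDefaults`, `RegionalDefaults`, `resolve`, and the regional extent are hypothetical names and values, not part of this PR or of TorchGeo.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Bounds = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy)


@dataclass
class GeoDefaults:
    """Hypothetical helper: fallbacks for files without CRS or bounds metadata."""

    crs: str = "EPSG:4326"  # lat/long default
    bounds: Bounds = (-180.0, -90.0, 180.0, 90.0)  # the entire Earth, in degrees

    def resolve(
        self,
        file_crs: Optional[str] = None,
        file_bounds: Optional[Bounds] = None,
    ) -> Tuple[str, Bounds]:
        """Prefer metadata found in the file, otherwise fall back to the defaults."""
        crs = file_crs if file_crs is not None else self.crs
        bounds = file_bounds if file_bounds is not None else self.bounds
        return crs, bounds


@dataclass
class RegionalDefaults(GeoDefaults):
    """Hypothetical subclass overriding the default extent (illustrative values)."""

    bounds: Bounds = (-25.0, 34.0, 45.0, 72.0)


print(GeoDefaults().resolve())
print(GeoDefaults().resolve(file_crs="EPSG:32633"))
print(RegionalDefaults().resolve())
```

A subclass could override `crs` in the same way. Parsing the extent from a particular layer, as mentioned above, would replace the `file_bounds` argument with whatever the dataset-specific reader extracts.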

Other than that, it's generally difficult to design a useful base class without lots of examples. I didn't write RasterDataset until we already had several examples of the same pattern in TorchGeo. The base class was an attempt to reduce code duplication and share some of the more complicated data access stuff. I think it's fine to design this base class with the few example datasets you have in mind and change it to add new features later.

nilsleh · 2023-07-28T08:36:08Z

I tested my xarray use case with the time-series indexing from #877, and I think this requires a discussion about date-format assumptions, which is a bit tricky. Currently, the assumption is that the index and the sampler work with datetime.timestamp() values and that the xarray datasets are then indexed with a datetime object.
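To make that assumption concrete, here is a minimal sketch of the round trip between POSIX timestamps (for the index and sampler) and datetime objects (for label-based indexing). The helper names are hypothetical, and treating naive datetimes as UTC is an assumption of this sketch rather than something the PR specifies.

```python
from datetime import datetime, timezone


def to_timestamp(dt: datetime) -> float:
    """Convert a datetime to POSIX seconds, treating naive values as UTC.

    Calling .timestamp() directly on a naive datetime interprets it in the
    local time zone, which would make the index machine-dependent.
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


def to_datetime(ts: float) -> datetime:
    """Convert POSIX seconds back to a naive UTC datetime for indexing."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).replace(tzinfo=None)


step = datetime(2000, 1, 16, 12)
mint = to_timestamp(step)  # value stored in the spatiotemporal index
print(to_datetime(mint) == step)
```

Datasets that use non-standard calendars are not covered by this sketch and would need separate handling.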

adamjstewart

Difficult to review without several examples of what these datasets look like in real life. Maybe do a quick literature review and find a few good examples of popular NetCDF datasets from different sources? I want to see the diversity of file metadata. Like, do all datasets encode CRS/res/time/bounds in the same way or are they all different? This greatly influences how much customizability the base class requires.

requirements/datasets.txt

adamjstewart · 2023-08-16T16:23:13Z

requirements/min-reqs.old

@@ -31,6 +31,7 @@ radiant-mlhub==0.3.0
 rarfile==4.0
 scikit-image==0.18.0
 scipy==1.6.2
+xarray


Will need to determine the minimum version that works before merging

torchgeo/datasets/rioxarray.py

tests/data/rioxarray/data.py

adamjstewart · 2023-08-16T16:25:17Z

torchgeo/datasets/rioxarray.py

+from datetime import datetime
+from typing import Any, Callable, Optional, cast
+
+import netCDF4  # noqa: F401


This isn't used at the moment and could probably be removed

I was running into this issue.

I'd rather ignore that warning in pyproject.toml than add a fake import

adamjstewart · 2023-08-16T16:36:00Z

torchgeo/datasets/rioxarray.py

+                f"query: {query} not found in index with bounds: {self.bounds}"
+            )
+
+        data_arrays: list["np.typing.NDArray[np.float32]"] = []


Will they always be float32?

I suppose there could be cases where you also have integers but I would expect most datasets to have float values.

Probably best to keep it dynamic if we can't predict 100% of the time.
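One way to keep the dtype dynamic, sketched here with plain NumPy, is to let NumPy pick the common promoted dtype of the collected arrays instead of annotating them as float32. The function name `stack_preserving_dtype` is hypothetical and this is not the code in the PR.

```python
from typing import Sequence

import numpy as np


def stack_preserving_dtype(arrays: Sequence[np.ndarray]) -> np.ndarray:
    """Stack arrays along a new leading axis using their common promoted dtype."""
    common = np.result_type(*arrays)
    return np.stack([a.astype(common, copy=False) for a in arrays])


ints = np.zeros((2, 2), dtype=np.int16)
floats = np.ones((2, 2), dtype=np.float32)

print(stack_preserving_dtype([ints, ints]).dtype)    # integer inputs stay integer
print(stack_preserving_dtype([ints, floats]).dtype)  # mixed inputs are promoted
```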

adamjstewart · 2023-08-16T16:37:50Z

torchgeo/datasets/rioxarray.py

+        for item in items:
+            with xr.open_dataset(item, decode_cf=True) as ds:
+                if not ds.rio.crs:
+                    ds.rio.write_crs(self._crs, inplace=True)


Does rioxarray automatically reproject to the right CRS if files are in multiple CRSs, or if the user chooses a different CRS than the default (or if IntersectionDataset changes it)?

Good point, I think it does not reproject automatically. So far I have only used climate data, which doesn't explicitly encode or use a CRS. I should check with some MODIS files.

So far I am only doing this with the climate data, where there is no explicit CRS. In addition, ds.rio.reproject() only works for 2D and 3D arrays, whereas the CMIP data I have has more dimensions, so I get rioxarray.exceptions.TooManyDimensions: Only 2D and 3D data arrays supported. Data variable: tos.

Actually, trying it with MODIS files, which are .hdf files, I can only open them with rioxarray.open_rasterio() and not with xr.open_dataset(engine="rasterio"), so maybe a single base class that supports climate and satellite data at once is too ambitious and would get ugly.

adamjstewart · 2023-08-16T16:38:41Z

torchgeo/datasets/rioxarray.py

+                    if hasattr(clipped, variable):
+                        data_arrays.append(clipped[variable].data.squeeze())
+
+        sample = {"image": torch.from_numpy(np.stack(data_arrays)), "bbox": query}


We should support both images and masks. See the is_image attribute in RasterDataset for how we do this there. You can also copy the dtype property to automatically choose what dtype to cast to.

I'm not convinced this works correctly for multiple overlapping files. We shouldn't be stacking, we should be merging.

You are right. Maybe I have to rethink the base class idea and have two: one for xarray climate-type data that does not use a CRS explicitly (class XarrayDataset(GeoDataset)), and one intended for CRS-dependent data sources like MODIS, which would be more similar to RasterDataset and use rioxarray. The latter would keep the current name, class RioXarrayDataset(GeoDataset), but gain the functionality required to handle overlapping files etc.
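The difference between stacking and merging overlapping files can be illustrated with a small NumPy sketch. It assumes float tiles that already share the same grid and use NaN for pixels a file does not cover; `merge_first_valid` is a hypothetical helper, not the implementation in this PR.

```python
from typing import Sequence

import numpy as np


def merge_first_valid(tiles: Sequence[np.ndarray]) -> np.ndarray:
    """Mosaic same-shaped tiles: each pixel takes the first non-NaN value."""
    merged = np.full(tiles[0].shape, np.nan)
    for tile in tiles:
        # Only fill pixels that no earlier tile has covered yet.
        merged = np.where(np.isnan(merged), tile, merged)
    return merged


# Two overlapping files on the same 1x3 grid; they overlap in the middle pixel.
left = np.array([[1.0, 2.0, np.nan]])
right = np.array([[np.nan, 9.0, 3.0]])

stacked = np.stack([left, right])          # files become an extra leading axis
merged = merge_first_valid([left, right])  # files are combined into one mosaic

print(stacked.shape, merged.shape)
print(merged)
```

Stacking treats each file as a separate channel and keeps the gaps, while merging yields a single array whose shape does not depend on how many files intersect the query.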

adamjstewart · 2023-08-16T16:41:10Z

tests/data/rioxarray/data.py

+    os.makedirs(DIR)
+    for var_name in VAR_NAMES:
+        for lats, lons, cf_time in zip(LATS, LONS, CF_TIME):
+            path = os.path.join(DIR, f"{var_name}_{lats}_{lons}.nc")


Never thought about the possibility of the filename containing the bounds/res/crs. Have you seen this in the wild? If so, we could add a check for this in the regex and extract it from the filename like we do for time.

No, this was just a dummy naming scheme for the test data. Maybe we should explicitly create test data for some of the different data cases like CMIP, ERA5, MODIS, and others, and write test cases for those.

We can do that once we have subclasses for each of those datasets
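If bounds ever did turn up in filenames, extracting them with a regex with named groups, in the same spirit as the time extraction mentioned above, might look roughly like this. The naming scheme `{variable}_{minx}_{miny}_{maxx}_{maxy}.nc` and the helper `parse_bounds` are purely hypothetical; they do not match the test data in this PR or any known dataset.

```python
import os
import re
from typing import Optional, Tuple

NUMBER = r"-?\d+(?:\.\d+)?"

# Hypothetical scheme: {variable}_{minx}_{miny}_{maxx}_{maxy}.nc
FILENAME_REGEX = re.compile(
    rf"^(?P<variable>[A-Za-z0-9]+)"
    rf"_(?P<minx>{NUMBER})_(?P<miny>{NUMBER})"
    rf"_(?P<maxx>{NUMBER})_(?P<maxy>{NUMBER})\.nc$"
)


def parse_bounds(path: str) -> Optional[Tuple[str, Tuple[float, float, float, float]]]:
    """Return (variable, (minx, miny, maxx, maxy)), or None if the name does not match."""
    match = FILENAME_REGEX.match(os.path.basename(path))
    if match is None:
        return None
    bounds = tuple(float(match.group(k)) for k in ("minx", "miny", "maxx", "maxy"))
    return match.group("variable"), bounds


print(parse_bounds("data/tos_-10.0_35.0_5.0_45.0.nc"))
print(parse_bounds("data/tos.nc"))
```

Files that do not match simply return None, so a dataset could fall back to reading bounds from the file metadata or to the defaults.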

adamjstewart · 2023-08-16T16:42:30Z

torchgeo/datasets/rioxarray.py

I would recommend renaming this file. Otherwise the following become very different things:

import rioxarray
import .rioxarray

Maybe call it rioxr.py? Or just throw it in geo.py with the other base classes.

nilsleh · 2023-08-17T08:07:13Z

> Difficult to review without several examples of what these datasets look like in real life. Maybe do a quick literature review and find a few good examples of popular NetCDF datasets from different sources? I want to see the diversity of file metadata. Like, do all datasets encode CRS/res/time/bounds in the same way or are they all different? This greatly influences how much customizability the base class requires.

This is probably not exhaustive but I tried finding some sources from climate data and satellite optical data to cover some bases.

- Xarray supports these conventions for weather and climate data, as suggested in their docs.
- For a tutorial that works with the climate data I am also using, maybe this notebook gives insight into the metadata.
- This is a notebook working with ERA reanalysis data, which is also very popular.
- This and this are examples of xarray data from MODIS.

I suppose it could be useful to have subclasses for these specific modalities, meaning a class CMIPData(), a class ERA5Data(), etc. MODIS has a PR open but it should be refactored. That would be similar to how RasterDataset has Landsat, Sentinel, etc. as subclasses.

nilsleh · 2023-08-29T13:39:53Z

I found one approach that merges the data back again, but I think it could be quite slow. Might want to explore that.

adamjstewart

Tests are failing. Also curious if we can add a dep on only xarray or rioxarray but not both.

torchgeo/datasets/rioxr.py

nilsleh · 2023-09-11T08:35:51Z

Currently getting a rasterio.errors.WindowError: Bounds and transform are inconsistent error with the synthetic test data and trying to figure out why that is to fix the failing test. There should also be more diverse test cases given the variety of available datasets.

nilsleh · 2024-02-02T12:31:05Z

Potentially interesting utility for data loading

first draft

bf1cd30

nilsleh marked this pull request as draft July 20, 2023 14:04

github-actions bot added the datasets Geospatial or benchmark datasets label Jul 20, 2023

adamjstewart added this to the 0.5.0 milestone Jul 20, 2023

nilsleh changed the title ~~Xarray Dataset~~ Xarray Dataset Support Jul 21, 2023

nilsleh added 2 commits July 28, 2023 09:04

progress

9cb8efe

test with time-step indexing

064cad6

add basic unit tests

28766a4

github-actions bot added the testing Continuous integration testing label Jul 28, 2023

nilsleh marked this pull request as ready for review July 28, 2023 12:02

nilsleh added 2 commits July 28, 2023 14:05

time step optional

5ffd160

xarray as dependency

1459b12

github-actions bot added the dependencies Packaging and dependencies label Jul 28, 2023

try to get tests with deps to work

8f93e07

adamjstewart reviewed Aug 16, 2023

nilsleh added 2 commits August 17, 2023 10:49

some requested changes

dc4cc8c

rioxarray merge arrays

c58972f

nilsleh added 2 commits August 29, 2023 15:41

resolve conflicts

6d0dc39

make data_variables optional and infer spatial coordinates automatically

0b7e0b5

adamjstewart reviewed Sep 7, 2023

torchgeo/datasets/rioxr.py

torchgeo/datasets/rioxr.py (outdated)

nilsleh added 4 commits September 7, 2023 20:52

working tests locally

27999f3

tests remote

ff49477

merge main

1e774fb

bounds and transform inconsisten error for synthetic data

a06ab61

latest attempt

702dfef

netcdf4 req

67954d7

adamjstewart removed this from the 0.5.0 milestone Sep 28, 2023

merge main

5c99a2f

adamjstewart mentioned this pull request Nov 26, 2023

Add PRISMA dataset #1743

Merged

4 tasks

nilsleh added 2 commits December 15, 2023 15:35

store changes

48cdbc8

store changes

f025a84
