Datasets
A dataset represents a group of one or more related STAC Collections. They group together any source imagery Collections with the associated label Collections to provide a convenient mechanism for accessing all of these data together. For instance, the bigearthnet_v1_source Collection contains the source imagery for the BigEarthNet training dataset and, likewise, the bigearthnet_v1_labels Collection contains the annotations for that same dataset. These two Collections are grouped together into the bigearthnet_v1 dataset.
The Radiant MLHub Training Data Registry provides an overview of the datasets available through the Radiant MLHub API along with dataset metadata and a listing of the associated Collections.
To discover and fetch datasets you can either use the low-level client methods from radiant_mlhub.client or the Dataset class. Using the Dataset class is the recommended approach, but both methods are described below.
Note
The objects returned by the Radiant MLHub API dataset endpoints are not STAC-compliant objects, and therefore the Dataset class described below is not a PySTAC object.
Discovering Datasets
The Radiant MLHub /datasets endpoint returns a list of objects describing the available datasets and their associated collections. You can use the low-level list_datasets() function to work with these responses as native Python data types (list and dict). This function is a generator that yields a dict for each dataset.
>>> from radiant_mlhub.client import list_datasets
>>> from pprint import pprint
>>> datasets = list_datasets()
>>> first_dataset = next(datasets)
>>> pprint(first_dataset)
{'collections': [{'id': 'bigearthnet_v1_source', 'types': ['source_imagery']},
{'id': 'bigearthnet_v1_labels', 'types': ['labels']}],
'id': 'bigearthnet_v1',
'title': 'BigEarthNet V1'}
You can also discover datasets using the Dataset.list method. This is the recommended way of listing datasets. This method returns a list of Dataset instances.
>>> from radiant_mlhub import Dataset
>>> datasets = Dataset.list()
>>> first_dataset = datasets[0]
>>> first_dataset.id
'bigearthnet_v1'
>>> first_dataset.title
'BigEarthNet V1'
Each of these functions/methods also accepts tags and text arguments that can be used to filter datasets by their tags or by a free text search, respectively. The tags argument may be either a single string or a list of strings. Only datasets that contain all of the provided tags will be returned, and these tags must be an exact match. The text argument may, similarly, be either a string or a list of strings. These will be used to search all of the text-based metadata fields for a dataset (e.g. description, title, citation, etc.). Each argument is treated as a phrase by the text search engine and only datasets with matches for all of the provided phrases will be returned. So, for instance, text=["land", "cover"] will return all datasets with both "land" and "cover" somewhere in their text metadata, while text="land cover" will return only datasets with the phrase "land cover" in their text metadata.
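The all-tags/all-phrases semantics described above can be illustrated with a local sketch. Note that match_dataset and the sample metadata below are hypothetical, purely for illustration; the actual filtering happens server-side in the API's search engine:

```python
# Hypothetical local sketch of the tags/text filtering semantics described
# above: every tag must match exactly, and every text phrase must appear
# somewhere in the dataset's text-based metadata.

def match_dataset(dataset, tags=None, text=None):
    """Return True if the dataset matches ALL tags and ALL text phrases."""
    tags = [tags] if isinstance(tags, str) else (tags or [])
    text = [text] if isinstance(text, str) else (text or [])

    if not all(tag in dataset["tags"] for tag in tags):
        return False

    # Concatenate the text-based metadata fields for phrase matching.
    haystack = " ".join(
        dataset.get(field, "") for field in ("title", "description", "citation")
    ).lower()
    return all(phrase.lower() in haystack for phrase in text)

# Hypothetical dataset metadata, for illustration only.
dataset = {
    "tags": ["segmentation", "sentinel-2"],
    "title": "BigEarthNet V1",
    "description": "A large-scale land cover benchmark archive.",
}

print(match_dataset(dataset, tags="sentinel-2"))       # True
print(match_dataset(dataset, text=["land", "cover"]))  # True: both words appear
print(match_dataset(dataset, text="cover land"))       # False: phrase not present
```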
Fetching a Dataset
The Radiant MLHub /datasets/{dataset_id} endpoint returns an object representing a single dataset. You can use the low-level get_dataset_by_id() function to work with this response as a dict.
>>> from radiant_mlhub.client import get_dataset_by_id
>>> dataset = get_dataset_by_id('bigearthnet_v1')
>>> pprint(dataset)
{'collections': [{'id': 'bigearthnet_v1_source', 'types': ['source_imagery']},
{'id': 'bigearthnet_v1_labels', 'types': ['labels']}],
'id': 'bigearthnet_v1',
'title': 'BigEarthNet V1'}
You can also fetch a dataset from the Radiant MLHub API based on the dataset ID using the Dataset.fetch_by_id method. This is the recommended way of fetching a dataset. This method returns a Dataset instance.
>>> dataset = Dataset.fetch_by_id('bigearthnet_v1')
>>> dataset.id
'bigearthnet_v1'
If you would rather fetch the dataset using its DOI you can do so as well:
>>> from radiant_mlhub.client import get_dataset_by_doi
>>> # Using the client...
>>> dataset = get_dataset_by_doi("10.6084/m9.figshare.12047478.v2")
>>> # Using the model classes...
>>> dataset = Dataset.fetch_by_doi("10.6084/m9.figshare.12047478.v2")
You can also use the more general get_dataset() and Dataset.fetch methods to get a dataset using either its ID or its DOI:
>>> from radiant_mlhub.client import get_dataset
>>> # These will all return the same dataset
>>> dataset = get_dataset("ref_african_crops_kenya_02")
>>> dataset = get_dataset("10.6084/m9.figshare.12047478.v2")
>>> dataset = Dataset.fetch("ref_african_crops_kenya_02")
>>> dataset = Dataset.fetch("10.6084/m9.figshare.12047478.v2")
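One plausible way such a general fetch method can distinguish a DOI from a dataset ID (a sketch, not the library's actual implementation) is to check for the DOI prefix pattern, since every DOI begins with the "10." directory indicator followed by a registrant code and a slash:

```python
import re

# Sketch of how a general fetch function might dispatch on its argument.
# This is NOT the radiant_mlhub implementation, just an illustration.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(identifier: str) -> bool:
    """Return True if the identifier looks like a DOI rather than a dataset ID."""
    return bool(DOI_PATTERN.match(identifier))

print(looks_like_doi("10.6084/m9.figshare.12047478.v2"))  # True
print(looks_like_doi("ref_african_crops_kenya_02"))       # False
```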
Dataset Collections
If you are using the Dataset class, you can list the Collections associated with the dataset using the Dataset.collections property. This property returns a modified list that has 2 additional attributes: source_imagery and labels. You can use these attributes to list only the collections of the associated type. All elements of these lists are instances of Collection. See the Collections documentation for details on how to work with these instances.
>>> len(first_dataset.collections)
2
>>> len(first_dataset.collections.source_imagery)
1
>>> first_dataset.collections.source_imagery[0].id
'bigearthnet_v1_source'
>>> len(first_dataset.collections.labels)
1
>>> first_dataset.collections.labels[0].id
'bigearthnet_v1_labels'
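A modified list of this kind can be sketched as a list subclass whose extra attributes filter by collection type. The CollectionList class and the dict-based collection records below are illustrative assumptions, not the library's actual internals:

```python
# Illustrative sketch (not the library's actual internals) of a list
# subclass with source_imagery and labels attributes that filter the
# collections by their type.

class CollectionList(list):
    @property
    def source_imagery(self):
        return [c for c in self if "source_imagery" in c["types"]]

    @property
    def labels(self):
        return [c for c in self if "labels" in c["types"]]

collections = CollectionList([
    {"id": "bigearthnet_v1_source", "types": ["source_imagery"]},
    {"id": "bigearthnet_v1_labels", "types": ["labels"]},
])

print(len(collections))                      # 2
print(collections.source_imagery[0]["id"])   # bigearthnet_v1_source
print(collections.labels[0]["id"])           # bigearthnet_v1_labels
```

Note that a collection tagged with both types would appear once in the main list but in both filtered sublists, which matches the behavior described in the warning below.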
Warning
There are rare cases of collections that contain both source_imagery and labels items (e.g. the SpaceNet collections). In these cases, the collection will be listed in both the dataset.collections.labels and dataset.collections.source_imagery lists, but will only appear once in the main dataset.collections list. This may cause what appears to be a mismatch in list lengths:
>>> len(dataset.collections.source_imagery) + len(dataset.collections.labels) == len(dataset.collections)
False
Note
Both the low-level client functions and the class methods also accept keyword arguments that are passed directly to get_session() to create a session. See the Authentication documentation for details on how to use these arguments or configure the client to read your API key automatically.
Downloading a Dataset
The Radiant MLHub /archive/{archive_id} endpoint allows you to download an archive of all assets associated with a given collection. The Dataset.download method provides a convenient way of using this endpoint to download the archives for all collections associated with a given dataset. This method downloads the archives for all associated collections into the given output directory and returns a list of the paths to these archives.
If a file of the same name already exists for any of the archives, this method will check whether the downloaded file is complete by comparing its size against the size of the remote file. If they are the same size, the download is skipped; otherwise, the download will be resumed from the point where it stopped. You can control this behavior using the if_exists argument. Setting this to "skip" will skip the download for existing files without checking for completeness (a bit faster since it doesn't require a network request), and setting this to "overwrite" will overwrite any existing file.
>>> dataset = Dataset.fetch('bigearthnet_v1')
>>> archive_paths = dataset.download('~/Downloads')
>>> len(archive_paths)
2
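The resume/skip branching described above can be sketched locally. The decide_download function below is hypothetical; the actual Dataset.download method also issues HTTP range requests and handles many other details:

```python
# Hypothetical sketch of the if_exists decision logic described above.
# This only illustrates the branching, not the real implementation.

def decide_download(local_size, remote_size, if_exists="resume"):
    """Return the action to take for an archive file.

    local_size is None when no file exists yet.
    """
    if local_size is None:
        return "download"
    if if_exists == "skip":
        return "skip"            # no network request needed
    if if_exists == "overwrite":
        return "overwrite"
    # Default behavior: compare sizes against the remote file.
    if local_size == remote_size:
        return "skip"            # file is already complete
    return "resume"              # continue from where the download stopped

print(decide_download(None, 1000))          # download
print(decide_download(1000, 1000))          # skip
print(decide_download(400, 1000))           # resume
print(decide_download(400, 1000, "skip"))   # skip
```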
To check the total size of the download archives for all collections in the dataset without actually downloading them, you can use the Dataset.total_archive_size property.
>>> dataset.total_archive_size
71311240007
Collection archives are gzipped tarballs. You can read more about the structure of these archives in this Medium post.
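Since the archives are gzipped tarballs, a downloaded archive can be extracted with Python's standard tarfile module. The sketch below builds a tiny archive in a temporary directory purely for illustration; the member paths are made up, and the real archive layout is described in the post mentioned above:

```python
import tarfile
import tempfile
from pathlib import Path

# Build a tiny gzipped tarball purely for illustration; an archive
# downloaded by Dataset.download would be handled the same way.
workdir = Path(tempfile.mkdtemp())
src = workdir / "bigearthnet_v1_labels"          # made-up member layout
src.mkdir()
(src / "labels.json").write_text('{"example": true}')

archive = workdir / "bigearthnet_v1_labels.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src, arcname="bigearthnet_v1_labels")

# Extracting an archive and listing its members:
extract_dir = workdir / "extracted"
with tarfile.open(archive, "r:gz") as tar:
    members = tar.getnames()
    tar.extractall(extract_dir)

print(members)
print((extract_dir / "bigearthnet_v1_labels" / "labels.json").exists())  # True
```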