Datasets
A dataset represents a group of one or more related STAC Collections. They group together any source imagery Collections with the associated label Collections to provide a convenient mechanism for accessing all of these data together. For instance, the bigearthnet_v1_source Collection contains the source imagery for the BigEarthNet training dataset and, likewise, the bigearthnet_v1_labels Collection contains the annotations for that same dataset. These two Collections are grouped together into the bigearthnet_v1 dataset.
The Radiant MLHub Training Data Registry provides an overview of the datasets available through the Radiant MLHub API along with dataset metadata and a listing of the associated Collections.
To discover and fetch datasets you can either use the low-level client methods from radiant_mlhub.client or the Dataset class. Using the Dataset class is the recommended approach, but both methods are described below.
Note
The objects returned by the Radiant MLHub API dataset endpoints are not STAC-compliant objects, and therefore the Dataset class described below is not a PySTAC object.
Discovering Datasets
The Radiant MLHub /datasets endpoint returns a list of objects describing the available datasets and their associated collections. You can use the low-level list_datasets() function to work with these responses as native Python data types (list and dict). This function is a generator that yields a dict for each dataset.
>>> from radiant_mlhub.client import list_datasets
>>> from pprint import pprint
>>> datasets = list_datasets()
>>> first_dataset = next(datasets)
>>> pprint(first_dataset)
{'collections': [{'id': 'bigearthnet_v1_source', 'types': ['source_imagery']},
{'id': 'bigearthnet_v1_labels', 'types': ['labels']}],
'id': 'bigearthnet_v1',
'title': 'BigEarthNet V1'}
You can also discover datasets using the Dataset.list method. This is the recommended way of listing datasets. This method returns a list of Dataset instances.
>>> from radiant_mlhub import Dataset
>>> datasets = Dataset.list()
>>> first_dataset = datasets[0]
>>> first_dataset.id
'bigearthnet_v1'
>>> first_dataset.title
'BigEarthNet V1'
Each of these functions/methods also accepts tags and text arguments that can be used to filter datasets by their tags or by a free text search, respectively. The tags argument may be either a single string or a list of strings. Only datasets that contain all of the provided tags will be returned, and these tags must be an exact match. The text argument may, similarly, be either a string or a list of strings. These will be used to search all of the text-based metadata fields for a dataset (e.g. description, title, citation, etc.). Each argument is treated as a phrase by the text search engine and only datasets with matches for all of the provided phrases will be returned. So, for instance, text=["land", "cover"] will return all datasets with both "land" and "cover" somewhere in their text metadata, while text="land cover" will return only datasets with the phrase "land cover" in their text metadata.
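The all-tags/all-phrases semantics described above can be illustrated with a local sketch. Note that match_dataset and the sample metadata below are hypothetical, purely for illustration; the actual filtering happens server-side in the API's search engine:

```python
# Hypothetical local sketch of the tags/text filtering semantics described
# above: every tag must match exactly, and every text phrase must appear
# somewhere in the dataset's text-based metadata.

def match_dataset(dataset, tags=None, text=None):
    """Return True if the dataset matches ALL tags and ALL text phrases."""
    tags = [tags] if isinstance(tags, str) else (tags or [])
    text = [text] if isinstance(text, str) else (text or [])

    if not all(tag in dataset["tags"] for tag in tags):
        return False

    # Concatenate the text-based metadata fields for phrase matching.
    haystack = " ".join(
        dataset.get(field, "") for field in ("title", "description", "citation")
    ).lower()
    return all(phrase.lower() in haystack for phrase in text)

# Hypothetical dataset metadata, for illustration only.
dataset = {
    "tags": ["segmentation", "sentinel-2"],
    "title": "BigEarthNet V1",
    "description": "A large-scale land cover benchmark archive.",
}

print(match_dataset(dataset, tags="sentinel-2"))       # True
print(match_dataset(dataset, text=["land", "cover"]))  # True: both words appear
print(match_dataset(dataset, text="cover land"))       # False: phrase not present
```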
Fetching a Dataset
The Radiant MLHub /datasets/{dataset_id} endpoint returns an object representing a single dataset. You can use the low-level get_dataset_by_id() function to work with this response as a dict.
>>> from radiant_mlhub.client import get_dataset_by_id
>>> dataset = get_dataset_by_id('bigearthnet_v1')
>>> pprint(dataset)
{'collections': [{'id': 'bigearthnet_v1_source', 'types': ['source_imagery']},
{'id': 'bigearthnet_v1_labels', 'types': ['labels']}],
'id': 'bigearthnet_v1',
'title': 'BigEarthNet V1'}
You can also fetch a dataset from the Radiant MLHub API based on the dataset ID using the Dataset.fetch_by_id method. This is the recommended way of fetching a dataset. This method returns a Dataset instance.
>>> dataset = Dataset.fetch_by_id('bigearthnet_v1')
>>> dataset.id
'bigearthnet_v1'
If you would rather fetch the dataset using its DOI you can do so as well:
>>> from radiant_mlhub.client import get_dataset_by_doi
>>> # Using the client...
>>> dataset = get_dataset_by_doi("10.6084/m9.figshare.12047478.v2")
>>> # Using the model classes...
>>> dataset = Dataset.fetch_by_doi("10.6084/m9.figshare.12047478.v2")
You can also use the more general get_dataset() and Dataset.fetch methods to get a dataset using either its ID or its DOI:
>>> from radiant_mlhub.client import get_dataset
>>> # These will all return the same dataset
>>> dataset = get_dataset("ref_african_crops_kenya_02")
>>> dataset = get_dataset("10.6084/m9.figshare.12047478.v2")
>>> dataset = Dataset.fetch("ref_african_crops_kenya_02")
>>> dataset = Dataset.fetch("10.6084/m9.figshare.12047478.v2")
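One plausible way such a general fetch method can distinguish a DOI from a dataset ID (a sketch, not the library's actual implementation) is to check for the DOI prefix pattern, since every DOI begins with the "10." directory indicator followed by a registrant code and a slash:

```python
import re

# Sketch of how a general fetch function might dispatch on its argument.
# This is NOT the radiant_mlhub implementation, just an illustration.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(identifier: str) -> bool:
    """Return True if the identifier looks like a DOI rather than a dataset ID."""
    return bool(DOI_PATTERN.match(identifier))

print(looks_like_doi("10.6084/m9.figshare.12047478.v2"))  # True
print(looks_like_doi("ref_african_crops_kenya_02"))       # False
```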
Dataset Collections
If you are using the Dataset class, you can list the Collections associated with the dataset using the Dataset.collections property. This property returns a modified list that has 2 additional attributes: source_imagery and labels. You can use these attributes to list only the collections of the associated type. All elements of these lists are instances of Collection. See the Collections documentation for details on how to work with these instances.
>>> len(first_dataset.collections)
2
>>> len(first_dataset.collections.source_imagery)
1
>>> first_dataset.collections.source_imagery[0].id
'bigearthnet_v1_source'
>>> len(first_dataset.collections.labels)
1
>>> first_dataset.collections.labels[0].id
'bigearthnet_v1_labels'
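A modified list of this kind can be sketched as a list subclass whose extra attributes filter by collection type. The CollectionList class and the dict-based collection records below are illustrative assumptions, not the library's actual internals:

```python
# Illustrative sketch (not the library's actual internals) of a list
# subclass with source_imagery and labels attributes that filter the
# collections by their type.

class CollectionList(list):
    @property
    def source_imagery(self):
        return [c for c in self if "source_imagery" in c["types"]]

    @property
    def labels(self):
        return [c for c in self if "labels" in c["types"]]

collections = CollectionList([
    {"id": "bigearthnet_v1_source", "types": ["source_imagery"]},
    {"id": "bigearthnet_v1_labels", "types": ["labels"]},
])

print(len(collections))                      # 2
print(collections.source_imagery[0]["id"])   # bigearthnet_v1_source
print(collections.labels[0]["id"])           # bigearthnet_v1_labels
```

Note that a collection tagged with both types would appear once in the main list but in both filtered sublists, which matches the behavior described in the warning below.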
Warning
There are rare cases of collections that contain both source_imagery and labels items (e.g. the SpaceNet collections). In these cases, the collection will be listed in both the dataset.collections.labels and dataset.collections.source_imagery lists, but will only appear once in the main dataset.collections list. This may cause what appears to be a mismatch in list lengths:
>>> len(dataset.collections.source_imagery) + len(dataset.collections.labels) == len(dataset.collections)
False
Note
Both the low-level client functions and the class methods also accept keyword arguments that are passed directly to get_session() to create a session. See the Authentication documentation for details on how to use these arguments or configure the client to read your API key automatically.
Downloading a Dataset
The Radiant MLHub /archive/{archive_id} endpoint allows you to download an archive of all assets associated with a given collection. The Dataset.download method provides a convenient way of using this endpoint to download the archives for all collections associated with a given dataset. This method downloads the archives for all associated collections into the given output directory and returns a list of the paths to these archives.
If a file of the same name already exists for any of the archives, this method will check whether the downloaded file is complete by comparing its size against the size of the remote file. If they are the same size, the download is skipped; otherwise, the download will be resumed from the point where it stopped. You can control this behavior using the if_exists argument. Setting this to "skip" will skip the download for existing files without checking for completeness (a bit faster since it doesn't require a network request), and setting this to "overwrite" will overwrite any existing file.
>>> dataset = Dataset.fetch('bigearthnet_v1')
>>> archive_paths = dataset.download('~/Downloads')
>>> len(archive_paths)
2
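The resume/skip branching described above can be sketched locally. The decide_download function below is hypothetical; the actual Dataset.download method also issues HTTP range requests and handles many other details:

```python
# Hypothetical sketch of the if_exists decision logic described above.
# This only illustrates the branching, not the real implementation.

def decide_download(local_size, remote_size, if_exists="resume"):
    """Return the action to take for an archive file.

    local_size is None when no file exists yet.
    """
    if local_size is None:
        return "download"
    if if_exists == "skip":
        return "skip"            # no network request needed
    if if_exists == "overwrite":
        return "overwrite"
    # Default behavior: compare sizes against the remote file.
    if local_size == remote_size:
        return "skip"            # file is already complete
    return "resume"              # continue from where the download stopped

print(decide_download(None, 1000))          # download
print(decide_download(1000, 1000))          # skip
print(decide_download(400, 1000))           # resume
print(decide_download(400, 1000, "skip"))   # skip
```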
To check the total size of the download archives for all collections in the dataset without actually downloading them, you can use the Dataset.total_archive_size property.
>>> dataset.total_archive_size
71311240007
Collection archives are gzipped tarballs. You can read more about the structure of these archives in this Medium post.
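Since the archives are gzipped tarballs, a downloaded archive can be extracted with Python's standard tarfile module. The sketch below builds a tiny archive in a temporary directory purely for illustration; the member paths are made up, and the real archive layout is described in the post mentioned above:

```python
import tarfile
import tempfile
from pathlib import Path

# Build a tiny gzipped tarball purely for illustration; an archive
# downloaded by Dataset.download would be handled the same way.
workdir = Path(tempfile.mkdtemp())
src = workdir / "bigearthnet_v1_labels"          # made-up member layout
src.mkdir()
(src / "labels.json").write_text('{"example": true}')

archive = workdir / "bigearthnet_v1_labels.tar.gz"
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src, arcname="bigearthnet_v1_labels")

# Extracting an archive and listing its members:
extract_dir = workdir / "extracted"
with tarfile.open(archive, "r:gz") as tar:
    members = tar.getnames()
    tar.extractall(extract_dir)

print(members)
print((extract_dir / "bigearthnet_v1_labels" / "labels.json").exists())  # True
```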