Datasets
A dataset represents a group of one or more related STAC Collections. They group together any source imagery collections with the associated label collections to provide a convenient mechanism for accessing all of these data together. For instance, the bigearthnet_v1_source collection contains the source imagery for the BigEarthNet training dataset and, likewise, the bigearthnet_v1_labels collection contains the annotations for that same dataset. These two collections are grouped together into the bigearthnet_v1 dataset.
Radiant MLHub provides an overview of the datasets available through the Radiant MLHub API, along with dataset metadata and a listing of the associated collections.
To list and fetch datasets, the Dataset class is the recommended approach, but there are also low-level client functions in radiant_mlhub.client. Both approaches are described below.
Hint
The Radiant MLHub web application provides an overview of all the datasets and collections available through the Radiant MLHub API.
Note
The objects returned by the Radiant MLHub API Dataset endpoints are not STAC-compliant objects, and therefore the Dataset class described below is not a PySTAC object.
Discovering Datasets
You can discover datasets using the Dataset.list method. This method returns a list of Dataset instances.
>>> from radiant_mlhub import Dataset
>>> datasets = Dataset.list()
>>> for dataset in datasets[0:5]:  # print the first 5 datasets, for example
...     print(dataset)
umd_mali_crop_type: 2019 Mali CropType Training Data
idiv_asia_crop_type: A crop type dataset for consistent land cover classification in Central Asia
dlr_fusion_competition_germany: A Fusion Dataset for Crop Type Classification in Germany
ref_fusion_competition_south_africa: A Fusion Dataset for Crop Type Classification in Western Cape, South Africa
bigearthnet_v1: BigEarthNet
The list() method also accepts tags and text arguments that can be used to filter datasets by their tags or by a free-text search, respectively. The tags argument may be either a single string or a list of strings. Only datasets that contain all of the provided tags will be returned, and these tags must be an exact match. The text argument may, similarly, be either a string or a list of strings. These will be used to search all of the text-based metadata fields for a dataset (e.g. description, title, citation, etc.). Each argument is treated as a phrase by the text search engine and only datasets with matches for all of the provided phrases will be returned. So, for instance, text=["maize", "rice"] will return all datasets with both "maize" and "rice" somewhere in their text metadata, while text=["maize rice"] searches for the exact phrase "maize rice" and will not match any datasets. The search text="land cover" will return all datasets with the phrase "land cover" in their text metadata.
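For example, a brief sketch of these filters (the tag values shown here are illustrative and may not exist in the registry):
>>> # only datasets carrying every listed tag are returned
>>> tagged = Dataset.list(tags=['crop-type', 'segmentation'])
>>> # phrase search across the text-based metadata fields
>>> results = Dataset.list(text='land cover')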
Low-level client
The Radiant MLHub /datasets endpoint returns a list of objects describing the available datasets and their associated collections. You can use the low-level list_datasets() function to work with these responses as native Python data types (list and dict).
>>> from radiant_mlhub.client import list_datasets
>>> from pprint import pprint
>>> datasets = list_datasets()
>>> first_dataset = datasets[0]
>>> pprint(first_dataset)
{'id': 'umd_mali_crop_type',
'title': '2019 Mali CropType Training Data',
...
Fetching Dataset Metadata
You can fetch a dataset from the Radiant MLHub API based on the dataset ID using the Dataset.fetch_by_id method. This method returns a Dataset instance. Fetching returns the metadata but does not download assets.
>>> dataset = Dataset.fetch_by_id('bigearthnet_v1')
>>> print(dataset)
bigearthnet_v1: BigEarthNet
If you would rather fetch the dataset using its DOI, you can do so as well:
dataset = Dataset.fetch_by_doi("10.6084/m9.figshare.12047478.v2")
You can also use the more general Dataset.fetch method to get a dataset using either an ID or a DOI.
# Both calls return the same dataset
dataset = Dataset.fetch("ref_african_crops_kenya_02")
dataset = Dataset.fetch("10.6084/m9.figshare.12047478.v2")
Low-level client
The Radiant MLHub /datasets/{dataset_id} endpoint returns an object representing a single dataset. You can use the low-level get_dataset_by_id() function to work with this response as a dict.
>>> from radiant_mlhub.client import get_dataset_by_id
>>> dataset = get_dataset_by_id('bigearthnet_v1')
>>> pprint(dataset)
{'collections': [{'id': 'bigearthnet_v1_source', 'types': ['source_imagery']},
{'id': 'bigearthnet_v1_labels', 'types': ['labels']}],
'id': 'bigearthnet_v1',
'title': 'BigEarthNet V1'}
Dataset Collections
If you are using the Dataset class, you can list the Collections associated with the dataset using the Dataset.collections property. This property returns a modified list that has 2 additional attributes: source_imagery and labels. You can use these attributes to list only the collections of the associated type. All elements of these lists are instances of Collection. See the Collections documentation for details on how to work with these instances.
>>> first_dataset = Dataset.list()[0]  # umd_mali_crop_type, per the listing above
>>> len(first_dataset.collections)
2
>>> len(first_dataset.collections.source_imagery)
1
>>> first_dataset.collections.source_imagery[0].id
'umd_mali_crop_type_source'
>>> len(first_dataset.collections.labels)
1
>>> first_dataset.collections.labels[0].id
'umd_mali_crop_type_labels'
Warning
There are rare cases of collections that contain both source_imagery and labels items (e.g. the SpaceNet collections). In these cases, the collection will be listed in both the dataset.collections.labels and dataset.collections.source_imagery lists, but will only appear once in the main dataset.collections list. This may cause what appears to be a mismatch in list lengths:
>>> len(dataset.collections.source_imagery) + len(dataset.collections.labels) == len(dataset.collections)
False
Note
Both the class methods and the low-level client functions accept keyword arguments that are passed directly to get_session() to create a session. See the Authentication documentation for details on how to use these arguments or configure the client to read your API key automatically.
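For instance, a sketch passing an API key explicitly (the api_key keyword is described in the Authentication documentation):
>>> from radiant_mlhub import Dataset
>>> # the key is forwarded to get_session() rather than read from config
>>> datasets = Dataset.list(api_key='<your-api-key>')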
Downloading Datasets
The dataset downloader supports downloading STAC catalog archives and their linked dataset assets, as well as partial downloads with filtering options.
- Robustness
  - Asset download resuming.
  - Retry and backoff for HTTP error conditions.
  - Error reporting for unrecoverable download errors.
- Performance
  - Scales to millions of assets.
  - Multithreaded workers: parallel downloads.
- Convenience
  - STAC collection_id and item asset key filter.
  - Temporal filter.
  - Bounding box filter.
  - GeoJSON intersection filter.
Download All Assets
The most basic usage is to fetch a dataset and then call its download method. By default, the output directory is the current working directory.
>>> from radiant_mlhub import Dataset
>>> nasa_marine_debris = Dataset.fetch_by_id('nasa_marine_debris')
>>> print(nasa_marine_debris)
nasa_marine_debris: Marine Debris Dataset for Object Detection in Planetscope Imagery
>>> nasa_marine_debris.download()
nasa_marine_debris: fetch stac catalog: 258KB [00:00, 412.53KB/s]
INFO:radiant_mlhub.client.catalog_downloader:unarchive nasa_marine_debris.tar.gz ...
unarchive nasa_marine_debris.tar.gz: 100%|████████████████████| 2830/2830 [00:00<00:00, 5772.09it/s]
INFO:radiant_mlhub.client.catalog_downloader:create stac asset list (please wait) ...
INFO:radiant_mlhub.client.catalog_downloader:2825 unique assets in stac catalog.
download assets: 100%|██████████████████████| 2825/2825 [03:27<00:00, 13.62it/s]
INFO:radiant_mlhub.client.catalog_downloader:assets saved to nasa_marine_debris
Download STAC Catalog Archive Only
If you want to inspect the STAC catalog or write your own download client for the assets, pass the catalog_only option to the download method:
>>> sen12floods = Dataset.fetch_by_id('sen12floods')
>>> sen12floods.download(catalog_only=True)
sen12floods: fetch stac catalog: 2060KB [00:00, 127903.52KB/s]
INFO:radiant_mlhub.client.catalog_downloader:unarchive sen12floods.tar.gz...
unarchive sen12floods.tar.gz: 100%|█████████████████████████████████████████| 22278/22278 [00:01<00:00, 14284.65it/s]
INFO:radiant_mlhub.client.catalog_downloader:catalog saved to /home/user/sen12floods
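Once the archive is unpacked, one way to inspect the catalog is with PySTAC (a sketch, assuming pystac is installed; the catalog path follows the default layout described in the appendix below):
>>> import pystac
>>> catalog = pystac.Catalog.from_file('sen12floods/catalog.json')
>>> for item in list(catalog.get_all_items())[:5]:  # print a few item ids
...     print(item.id)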
Logging
The Python logging module can be used to control the verbosity of the downloader. The default log level is INFO. Set the level to WARNING to see fewer log messages, or to DEBUG to see more, including verbose HTTP-level log messages.
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>> nasa_marine_debris.download()
...
DEBUG:radiant_mlhub.client.catalog_downloader:(thread id: 123145809592320) https://radiantearth.blob.core.windows.net/mlhub/nasa-marine-debris/labels/20170326_153234_0e26_17069-29758-16.npy -> .../nasa_marine_debris/nasa_marine_debris_labels/nasa_marine_debris_labels_20170326_153234_0e26_17069-29758-16/pixel_bounds.npy
...
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): radiantearth.blob.core.windows.net:443
DEBUG:urllib3.connectionpool:https://radiantearth.blob.core.windows.net:443 "HEAD /mlhub/nasa-marine-debris/labels/20181031_095925_103b_32713-31765-16.npy HTTP/1.1" 200 0
...
(omitted many log messages here)
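Conversely, to quiet the downloader, raise the level on its logger (a sketch; the logger name comes from the log messages shown above):
>>> import logging
>>> logging.getLogger('radiant_mlhub').setLevel(logging.WARNING)  # warnings and errors only
>>> nasa_marine_debris.download()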
Output Directory
The output directory is, by default, the current working directory. The output_dir parameter takes a str or a pathlib.Path. The directory will be created if it does not exist.
# output_dir as string
nasa_marine_debris.download(output_dir='/tmp')
# output_dir as Path object
from pathlib import Path
nasa_marine_debris.download(output_dir=Path.home() / 'my_projects' / 'ml_datasets')
Large Dataset Performance
Let's try a somewhat larger dataset (tens of thousands of assets). After downloading the complete dataset, we'll explore all of the options for filtering assets. Filtering lets you limit the items and assets to those you are interested in prior to downloading.
This download example was run on a compute-optimized 16-core virtual machine in the MS Azure West-Europe region. You will likely experience slower download performance on your own machine, depending on the number of cores and your network bandwidth.
>>> sen12floods = Dataset.fetch_by_id('sen12floods')
>>> %time sen12floods.download()
sen12floods: fetch stac catalog: 2060KB [00:00, 127699.36KB/s]
INFO:radiant_mlhub.client.catalog_downloader:unarchive sen12floods.tar.gz...
unarchive sen12floods.tar.gz: 100%|█████████████████████████████████████████| 22278/22278 [00:01<00:00, 14239.53it/s]
INFO:radiant_mlhub.client.catalog_downloader:create stac asset list...
INFO:radiant_mlhub.client.catalog_downloader:39063 unique assets in stac catalog.
download assets: 100%|███████████████████████████████████████████████████████████| 39063/39063 [06:26<00:00, 101.06it/s]
INFO:radiant_mlhub.client.catalog_downloader:assets saved to /home/user/sen12floods
CPU times: user 11min 44s, sys: 2min 15s, total: 14min
Wall time: 6min 40s
15 GB of assets were downloaded into the sen12floods/ directory.
You may not necessarily want to download all of the assets in a dataset. The following sections explain all of the filtering options.
Hint
Download filters may be freely combined, except bbox and intersects, which are mutually exclusive (use one or the other).
Checking Dataset Size
Consider checking the dataset size before downloading.
>>> dataset = Dataset.fetch('nasa_marine_debris')
>>> print(dataset)
nasa_marine_debris: Marine Debris Dataset for Object Detection in Planetscope Imagery
>>> print(dataset.stac_catalog_size) # OK the STAC catalog archive is only ~260KB
263582
>>> print(dataset.estimated_dataset_size) # OK the total dataset assets are ~77MB
77207762
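For example, a sketch that skips the download when the estimated size exceeds a threshold (the threshold value here is arbitrary):
MAX_BYTES = 1 * 1024**3  # 1 GiB
if dataset.estimated_dataset_size < MAX_BYTES:
    dataset.download()
else:
    print(f'skipping: ~{dataset.estimated_dataset_size / 1e9:.1f} GB of assets')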
Filter by Collection and Asset Keys
To download only the specified STAC collection ids and STAC item asset keys, create a dictionary in this format and pass it to the collection_filter parameter:
{ collection_id1: [ asset_key1, asset_key2, ...], collection_id2: [asset_key1, asset_key2, ...] , ... }
For example, using the sen12floods dataset, if we only wanted to download four bands of the source imagery along with the label assets:
my_filter = dict(
    sen12floods_s2_source=['B02', 'B03', 'B04', 'B08'],  # Blue, Green, Red, NIR
    sen12floods_s2_labels=['labels', 'documentation'],
)
sen12floods.download(collection_filter=my_filter)
Filter by Temporal Range
To download only STAC assets within a temporal range, use the datetime parameter to specify either a datetime range (a tuple of datetime objects) or a single datetime object.
from dateutil.parser import parse
my_start_date = parse("2019-04-01T00:00:00+0000")
my_end_date = parse("2019-04-07T00:00:00+0000")
sen12floods.download(datetime=(my_start_date, my_end_date))
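As noted above, a single datetime object may also be passed in place of a range:
sen12floods.download(datetime=my_start_date)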
Filter by Bounding Box
To download only STAC assets within a spatial bounding box, use the bbox parameter to specify a bounding box in EPSG:4326 (lng/lat) coordinates. This performs a spatial intersection test with each STAC item's bounding box.
my_bbox = [-13.278254, 8.447033, -13.231551, 8.493532]  # [west, south, east, north]
sen12floods.download(bbox=my_bbox)
Hint
The bbox filter may not be used with the intersects filter (use one or the other).
Filter by GeoJSON Area of Interest
To download only STAC assets within an area of interest, use the intersects parameter. This performs a spatial intersection test with each STAC item's bounding box.
import json
my_geojson = json.loads(
    """
    {
      "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [-13.278048, 8.493532],
            [-13.278254, 8.447241],
            [-13.231762, 8.447033],
            [-13.231551, 8.493323],
            [-13.278048, 8.493532]
          ]
        ]
      }
    }
    """
)
sen12floods.download(intersects=my_geojson)
Hint
The intersects filter may not be used with the bbox filter (use one or the other).
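Putting it together, a sketch combining several of the filters described above (bbox and intersects still may not be mixed):
from dateutil.parser import parse
sen12floods.download(
    collection_filter={'sen12floods_s2_source': ['B02', 'B03', 'B04']},
    datetime=(parse("2019-04-01T00:00:00+0000"), parse("2019-04-07T00:00:00+0000")),
    bbox=[-13.278254, 8.447033, -13.231551, 8.493532],
)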
Error Reporting
Any unrecoverable download errors will be logged to {output_dir}/{dataset_id}/err_report.csv and a Python exception will be raised.
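One way to review the report afterwards is with the standard csv module (a sketch using the default output layout; the report's column layout is not documented here):
import csv
from pathlib import Path
report = Path('sen12floods') / 'err_report.csv'
if report.exists():
    with report.open(newline='') as f:
        for row in csv.reader(f):
            print(row)  # inspect each recorded download error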
Appendix: Default Filesystem Layout of Downloads
STAC archive file:
{output_dir}/{dataset_id}.tar.gz
Unarchived STAC catalog:
{output_dir}/{dataset_id}/catalog.json
Collection, Item and Asset layout:
{output_dir}/{dataset_id}/{collection_id}/{item_id}/{asset_key}.{ext}
Common assets (e.g. documentation.pdf) are saved into a _common directory instead of being duplicated for many items:
{output_dir}/{dataset_id}/_common/{asset_key}.{ext}
Asset Database:
{output_dir}/{dataset_id}/mlhub_stac_assets.db
Error Report:
{output_dir}/{dataset_id}/err_report.csv
Hint
The mlhub_stac_assets.db file is an artifact of the download process and may be safely deleted to free up disk space.
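For example, using the default layout above:
from pathlib import Path
# remove the asset database after a successful download to reclaim disk space
(Path('sen12floods') / 'mlhub_stac_assets.db').unlink(missing_ok=True)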