Moving Data#

The main way to use ufs2arco is by creating a yaml “recipe” file, which describes

the data source
transforms to the data
the target data layout
directories specifying where to store the results
whether or not to use MPI

Once that recipe is created, the following command is used to run the workflow serially:

ufs2arco recipe.yaml

In order to use MPI to parallelize the data transfer, for example over 64 processes, one would use:

mpirun -n 64 ufs2arco recipe.yaml

Note that some machines may require different commands here. For example NERSC’s Perlmutter requires users to use srun not mpirun. Note also that the yaml recipe was inspired by anemoi-datasets in spirit, but the actual format and capabilities are a bit different.

This page is still a work in progress, and will describe the nuts and bolts of moving data. In the meantime, feel free to raise an issue on the repo with questions, and check out example recipe files in the ufs2arco integration tests directory for some examples to help you get started.

Missing Data#

There are a number of reasons for data to be missing from the target dataset.

The file could be missing from the data source
The file could be corrupted
The transfer could have failed for some reason (e.g., local disk failure, node failure, SLURM timeout, etc)

When ufs2arco cannot find data for any of these reasons, it produces a warning and moves on. Missing data samples can be found in the following ways:

Look for a yaml file written in the same directory as the target zarr store, prefixed with missing and suffixed with .yaml. This will list all the missing dates, forecast hours, ensemble members, etc.
Check the result dataset attributes for missing_data (base target) or missing_dates (anemoi target). Note that for anemoi datasets only keeps track of missing dates, so we could have 100 ensemble members present for one date, but if one ensemble member is missing, that date is ignored.

Look for the following warning in the root logfile (this example comes from preparing GFS data):

[1306 s] [WARNING] ⚠️  Some data are missing.
[1306 s] [WARNING] ⚠️  The missing dimension combos, i.e., ('t0', 'fhr')
[1306 s] [WARNING] ⚠️  were written to: /pscratch/sd/t/timothys/nested-eagle/v0/data/missing.gfs.analysis.zarr.yaml
[1306 s] [WARNING] You can try running
[1306 s] [WARNING]      python -c 'import ufs2arco; ufs2arco.Driver(''/path/to/your/original/recipe.yaml'').patch()'
[1306 s] [WARNING] to try getting those data again

Note that this shows where the yaml file noted above is located.

Run grep -A 1 WARNING inside of the log directory:

log.0000.0256.out:[1306 s] [WARNING] ⚠️  Some data are missing.
log.0000.0256.out:[1306 s] [WARNING] ⚠️  The missing dimension combos, i.e., ('t0', 'fhr')
log.0000.0256.out:[1306 s] [WARNING] ⚠️  were written to: /pscratch/sd/t/timothys/nested-eagle/v0/data/missing.gfs.analysis.zarr.yaml
log.0000.0256.out:[1306 s] [WARNING] You can try running
log.0000.0256.out:[1306 s] [WARNING]    python -c 'import ufs2arco; ufs2arco.Driver(''/path/to/your/original/recipe.yaml'').patch()'
log.0000.0256.out:[1306 s] [WARNING] to try getting those data again
log.0000.0256.out-
--
log.0067.0256.out:[730 s] [WARNING] GFSArchive: Trouble finding the file: filecache::s3://noaa-gfs-bdp-pds/gfs.20210202/00/gfs.t00z.pgrb2.0p25.f000
log.0067.0256.out-      dims = {'t0': Timestamp('2021-02-02 00:00:00'), 'fhr': np.int64(0)}, file_suffix =
--
log.0112.0256.out:[176 s] [WARNING] GFSArchive: Could not find sp, will stop reading variables for this sample
log.0112.0256.out-      dims = {'t0': Timestamp('2016-01-15 06:00:00'), 'fhr': np.int64(0)}, file_suffixes = ['']

Note that this shows two missing data instances: one where the file couldn’t be found (i.e., on 2021-02-02T00 forecast hour (fhr) 0), and one where the file was found, but it’s corrupted (could not find sp).

Moving Data

Contents

Moving Data#

Missing Data#