Moving Data
The main way to use ufs2arco is to create a yaml “recipe” file, which describes:
the data source
transforms to the data
the target data layout
directories specifying where to store the results
whether or not to use MPI
Once that recipe is created, the following command is used to run the workflow serially:
ufs2arco recipe.yaml
In order to use MPI to parallelize the data transfer, for example over 64 processes, one would use:
mpirun -n 64 ufs2arco recipe.yaml
Note that some machines may require different commands here. For example, NERSC’s Perlmutter requires users to use srun, not mpirun.
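As a rough illustration of the launcher choice described above, the following Python sketch assembles the command line for a serial or parallel run. The helper name `build_launch_command` and the auto-detection through `shutil.which` are assumptions made for this example and are not part of ufs2arco. Passing the process count to `srun` with `-n` is also an assumption to check against your machine's documentation.

```python
import shutil


def build_launch_command(recipe, n_procs=1, launcher=None):
    """Return the argument list for running ufs2arco on a recipe.

    Hypothetical helper for illustration; not part of ufs2arco.
    """
    base = ["ufs2arco", str(recipe)]
    if n_procs <= 1:
        # Serial run: no MPI launcher is needed.
        return base
    if launcher is None:
        # Assumption: prefer srun when it is on PATH, otherwise use mpirun.
        launcher = "srun" if shutil.which("srun") else "mpirun"
    return [launcher, "-n", str(n_procs)] + base


if __name__ == "__main__":
    cmd = build_launch_command("recipe.yaml", n_procs=64, launcher="mpirun")
    print(" ".join(cmd))
    # To actually launch the workflow, one could pass cmd to
    # subprocess.run(cmd, check=True) from the standard library.
```

Returning a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting.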
Note also that the yaml recipe was inspired by anemoi-datasets in spirit, but the actual format and capabilities are a bit different.
This page is still a work in progress, and will describe the nuts and bolts of moving data. In the meantime, feel free to raise an issue on the repo with questions, and check out the example recipe files in the ufs2arco integration tests directory to help you get started.
Missing Data
There are a number of reasons for data to be missing from the target dataset.
The file could be missing from the data source
The file could be corrupted
The transfer could have failed for some reason (e.g., local disk failure, node failure, or SLURM timeout)
When ufs2arco cannot find data for any of these reasons, it produces a warning and moves on. Missing data samples can be found in the following ways:
Look for a yaml file written in the same directory as the target zarr store, prefixed with missing and suffixed with .yaml. This will list all the missing dates, forecast hours, ensemble members, etc.
Check the result dataset attributes for missing_data (base target) or missing_dates (anemoi target). Note that anemoi datasets only keep track of missing dates, so even if only one ensemble member out of 100 is missing for a date, that whole date is ignored.
Look for the following warning in the root logfile (this example comes from preparing GFS data):
    [1306 s] [WARNING] ⚠️ Some data are missing.
    [1306 s] [WARNING] ⚠️ The missing dimension combos, i.e., ('t0', 'fhr')
    [1306 s] [WARNING] ⚠️ were written to: /pscratch/sd/t/timothys/nested-eagle/v0/data/missing.gfs.analysis.zarr.yaml
    [1306 s] [WARNING] You can try running
    [1306 s] [WARNING] python -c 'import ufs2arco; ufs2arco.Driver(''/path/to/your/original/recipe.yaml'').patch()'
    [1306 s] [WARNING] to try getting those data again
Note that this shows where the yaml file noted above is located.
Run grep -A 1 WARNING inside of the log directory:
    log.0000.0256.out:[1306 s] [WARNING] ⚠️ Some data are missing.
    log.0000.0256.out:[1306 s] [WARNING] ⚠️ The missing dimension combos, i.e., ('t0', 'fhr')
    log.0000.0256.out:[1306 s] [WARNING] ⚠️ were written to: /pscratch/sd/t/timothys/nested-eagle/v0/data/missing.gfs.analysis.zarr.yaml
    log.0000.0256.out:[1306 s] [WARNING] You can try running
    log.0000.0256.out:[1306 s] [WARNING] python -c 'import ufs2arco; ufs2arco.Driver(''/path/to/your/original/recipe.yaml'').patch()'
    log.0000.0256.out:[1306 s] [WARNING] to try getting those data again
    log.0000.0256.out-
    --
    log.0067.0256.out:[730 s] [WARNING] GFSArchive: Trouble finding the file: filecache::s3://noaa-gfs-bdp-pds/gfs.20210202/00/gfs.t00z.pgrb2.0p25.f000
    log.0067.0256.out- dims = {'t0': Timestamp('2021-02-02 00:00:00'), 'fhr': np.int64(0)}, file_suffix =
    --
    log.0112.0256.out:[176 s] [WARNING] GFSArchive: Could not find sp, will stop reading variables for this sample
    log.0112.0256.out- dims = {'t0': Timestamp('2016-01-15 06:00:00'), 'fhr': np.int64(0)}, file_suffixes = ['']
Note that this shows two missing data instances: one where the file couldn’t be found (i.e., on 2021-02-02T00, forecast hour (fhr) 0), and one where the file was found but is corrupted (could not find sp).
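The checks described in this section can also be scripted. The sketch below is a minimal illustration rather than part of the ufs2arco API: `find_missing_reports` looks next to the zarr store for files named `missing*.yaml`, `collect_warnings` gathers `[WARNING]` lines from the log files (similar in spirit to the grep call, but without the trailing context line), and `retry_missing` wraps the call suggested by the warning message. All three function names, the `*.out` log filename pattern, and the directory layout are assumptions based on the example output shown here.

```python
from pathlib import Path


def find_missing_reports(zarr_path):
    """List yaml files named missing*.yaml next to the target zarr store."""
    parent = Path(zarr_path).parent
    return sorted(parent.glob("missing*.yaml"))


def collect_warnings(log_dir, pattern="*.out"):
    """Map each log filename to the lines in it that contain [WARNING]."""
    warnings = {}
    for log_file in sorted(Path(log_dir).glob(pattern)):
        with open(log_file, encoding="utf-8", errors="replace") as f:
            hits = [line.rstrip("\n") for line in f if "[WARNING]" in line]
        if hits:
            warnings[log_file.name] = hits
    return warnings


def retry_missing(recipe_path):
    """Re-attempt the missing samples, as suggested by the warning message."""
    # Imported lazily so the two helpers above work without ufs2arco installed.
    import ufs2arco

    ufs2arco.Driver(recipe_path).patch()
```

For instance, `find_missing_reports("/path/to/target.zarr")` would return any matching report paths, which could then be read with a yaml parser of your choice.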