Consistent data repositories across HPCs #36

kasra-keshavarz · 2024-02-08T22:48:57Z

The data repository needs to be made consistent so that it can be shared across various HPCs. The data repository itself is not included in this repository; however, the path names here need to be updated soon.

Some recommendations:

lower case directory names,
download workflows,
TBD
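
As an illustration of the first recommendation, here is a minimal sketch that lists directories whose names contain upper-case characters, along with the lower-case name each would get. The function name `list_non_lowercase_dirs` and the example directory names are hypothetical and not part of datatool; the function only prints suggestions and renames nothing.

```shell
#!/bin/bash

# Hypothetical helper (not part of datatool): list directories under a root
# whose names contain upper-case characters, with a lower-case suggestion.
# This is a dry run; nothing is renamed.
function list_non_lowercase_dirs () {
  local root="$1"
  local dir base lower

  # -mindepth 1 skips the root itself; -depth prints children before
  # parents, which is the order an actual rename would need.
  find "$root" -mindepth 1 -depth -type d -name '*[[:upper:]]*' |
    while IFS= read -r dir; do
      base="$(basename "$dir")"
      lower="$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]')"
      printf '%s -> %s\n' "$dir" "$(dirname "$dir")/$lower"
    done
}
```

Calling `list_non_lowercase_dirs /path/to/data-repository` on each HPC would give a list to review before any renaming is done; the path is a placeholder.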

In this commit, the following are addressed:

* Correcting paths for the local scripts,
* Renaming scripts to reflect the owner of the script for further clarification,
* Adding parallelization schemes based on model, ensemble, and scenario,
* Adding gcc/9.3.0 as the reference clib for the modules loaded to prevent mismatch between various environments defined on the HPCs,
* Ensuring EPSG:4326 is assumed for the input shapefile if there is no CRS defined,
* Getting rid of \t characters in the help messages,
* Correcting the short help message to be more informative,
* Adding function declarations to follow Google’s shell scripting guidelines,
* Ensuring --account=STR is described in the help message.

Signed-off-by: Kasra Keshavarz <[email protected]>

* Ouranos ESPO-G6-R2 script + new capabilities

  This script introduces new features to the tool, including the capability to process climate datasets, including those consisting of multiple models, submodels (those with specific configuration sets), ensemble members, and multiple scenarios (SSPs). The parent calling script is in charge of the parallelization scheme, if needed. With this script, a few issues related to the current deficiencies of datatool could be resolved simultaneously.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Fixing short usage and comments

* Adding new parallelization schemes

  Multiple parallelization schemes are added, so the package not only submits array jobs based on the given date range and the chunk schemes, but also considers submitting jobs based on various models, ensemble members, and scenarios. These new parallelization schemes mostly apply to climate datasets, but not exclusively. This commit aims to save time for the user and speed up the processing of datasets. This commit resolves issue #25 on the remote GitHub repository. Furthermore, it adds the ESPO dataset to the list of datasets. Moreover, a new option is implemented to show the list of currently available datasets to the users.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Separating dataset information from the main extract-dataset script

  This is meant to clearly organize the information provided inside the package. The new file lists all the available datasets and the keywords that users can pass to the `--dataset` option. Previously, this information was part of the main usage message (`--help`) of the main script.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Adding GDDP-NEX-CMIP6 info

* Fixing DOI value for ab-gov dataset

* Adding NASA GDDP-NEX-CMIP6 script address

* ESPO-G6-R2 data processing example

* Multiple minor modifications

  1. the "function" keywords are added to make the style compatible with Google's recommendations,
  2. required arguments and options are revised alongside the relevant comments,
  3. typos are fixed.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* AB Government Climate Dataset Script

  The script deals with the Climate Dataset produced by the Alberta Government. The dataset is not public yet, and is planned to be available soon.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Adding variable list for various elevation levels

  Since some hydrological models can use near-surface level or 40m level data, the necessary lists of variables for both levels are added. Furthermore, a link to the official website for the dataset is added for further clarity.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Path to the dataset for rpp-kshook allocation is updated

  Since multiple HPCs are now used for the workflows, it is important to have consistent datasets that are synchronized regularly. Therefore, this commit attempts to reflect these efforts by creating consistent paths for various HPCs/allocations.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Bumping version to v0.5.0

* Addressing issues #39, #37, #36, #35, #34, and #25

  In this commit, the following are addressed:

  * Correcting paths for the local scripts,
  * Renaming scripts to reflect the owner of the script for further clarification,
  * Adding parallelization schemes based on model, ensemble, and scenario,
  * Adding gcc/9.3.0 as the reference clib for the modules loaded to prevent mismatch between various environments defined on the HPCs,
  * Ensuring EPSG:4326 is assumed for the input shapefile if there is no CRS defined,
  * Getting rid of \t characters in the help messages,
  * Correcting the short help message to be more informative,
  * Adding function declarations to follow Google’s shell scripting guidelines,
  * Ensuring --account=STR is described in the help message.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Assuring compatibility of the style with Google's shell scripting guidelines

* Organizing the assets directory

  Various files within this directory are categorized to be more informative for the users/devs.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* README file for ab-gov dataset

  The README file for this dataset is added, offering necessary information for the users.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Minor structural changes

  This commit ensures all dataset scripts follow the convention of <institute>-<dataset-name> under the `scripts` path. Furthermore, necessary adjustments to the style of the scripts have been implemented, including:

  * adding `--model`, `--scenario`, and `--ensemble` options, if missing, for compatibility with the main caller script, as these options are passed to the script by the `extract-dataset.sh` script,
  * ensuring the scripting style follows Google's shell scripting guidelines,
  * properly adjusting the paths to the externally called scripts, after modifications to the structure of datatool's `assets` directory, and
  * minor changes to the source code to ensure compatibility with v0.5.0 of datatool.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Tracking LICENSE of eccc-rdrs

* Tracking eccc-rdrs script

* Tracking GWF-NCAR CONUS-I script

* Documentation for NASA's NEX-GDDP-CMIP6 dataset

  This commit addresses issue #27 by describing NASA's NEX-GDDP-CMIP6 dataset and the relevant scripts for it. Furthermore, it provides the information users need to use `datatool` for extracting subsets of the dataset for any temporal and spatial extents.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Script for NASA's NEX-GDDP-CMIP6 dataset

  This commit addresses issue #27 and provides scripts to extract subsets from NASA's NEX-GDDP-CMIP6 dataset. This script can work with the various models, scenarios, ensemble members, and variables offered by this dataset.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Adding Ouranos ESPO-G6-R2 Dataset Script

  This commit addresses issue #34 and processes this dataset, which contains multiple GCM model outputs, including various sub-models, scenarios, ensemble members, and variables.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Documenting Ouranos ESPO-G6-R2 Dataset script

  Necessary information to use `datatool` for this script is provided to the user via the README.md file.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Updating changelog for v0.5.0

* Adding a section for WIP directories

* Restructuring script directory

  With the growing number of scripts, this commit tries to restructure this directory to provide more clarity and organization for the users.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Updates to the documentation

  The help message has been trimmed to provide more information to the users. This includes the values provided to `--lon-lims`, which must be within the [-180, +180] limits. This had not been mentioned to the users before and could have caused confusion, as there are multiple conventions for describing longitudes. Furthermore, the list of datasets on the main page of the repository has been updated to reflect the most up-to-date list.

  Signed-off-by: Kasra Keshavarz <[email protected]>

* Upgrading style of warning message

* Upgrading style of warning message

* Updating link addresses for CONUS I & II

* Updating link address to ERA5 dataset

* Removing dead link for the Ouranos MRCC5 dataset for now

---------

Signed-off-by: Kasra Keshavarz <[email protected]>
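
The idea behind parallelizing over models, scenarios, and ensemble members can be sketched as follows. This is only an illustration under stated assumptions, not datatool's actual implementation: the function name `select_combination`, its argument order, and the comma-separated input format are hypothetical. It enumerates every (model, scenario, ensemble) combination and returns the one matching a zero-based task index, such as the index of an array job.

```shell
#!/bin/bash

# Hypothetical sketch (not datatool's code): map a zero-based task index
# onto one (model, scenario, ensemble) combination.
# Usage: select_combination INDEX MODELS SCENARIOS ENSEMBLES
# where the last three arguments are comma-separated lists.
function select_combination () {
  local index="$1"
  local -a models scenarios ensembles
  local -a combos=()
  local m s e

  # reject anything that is not a non-negative integer
  [[ "$index" =~ ^[0-9]+$ ]] || return 1

  IFS=',' read -r -a models <<< "$2"
  IFS=',' read -r -a scenarios <<< "$3"
  IFS=',' read -r -a ensembles <<< "$4"

  # build the Cartesian product in a fixed, reproducible order
  for m in "${models[@]}"; do
    for s in "${scenarios[@]}"; do
      for e in "${ensembles[@]}"; do
        combos+=("$m $s $e")
      done
    done
  done

  # an index beyond the number of combinations is an error
  if (( index >= ${#combos[@]} )); then
    echo "index out of range: $index" >&2
    return 1
  fi

  echo "${combos[index]}"
}
```

Under a scheduler such as Slurm, each array task could call `select_combination "${SLURM_ARRAY_TASK_ID}" "model-a,model-b" "ssp245,ssp585" "r1i1p1f1"` to pick its own combination; the model and ensemble names here are placeholders.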

kasra-keshavarz · 2024-03-06T00:03:22Z

Partially resolved with #43

kasra-keshavarz added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels Feb 8, 2024

kasra-keshavarz self-assigned this Feb 8, 2024

kasra-keshavarz changed the title ~~Consistent data repository~~ Consistent data repositories across HPCs Feb 23, 2024

kasra-keshavarz mentioned this issue Mar 5, 2024

Iss25 #43

Merged
