ACCESS Gadi Transition
Gadi is NCI's new high performance computer, expected to come online in November-December 2019
Information from NCI
NCI Information and Transition Timeline
Important Changes from Raijin
- New node size - each node has 48 CPUs, large jobs must have a cpu count that's a multiple of this
- No /short space - Persistent storage is only available on /g/data
- New /scratch space - Very large (!) quotas, but files will be automatically deleted after 90 days
- Only latest /apps modules - Old versions of modules currently on Raijin will not be moved to Gadi, see above link for list
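As a minimal sketch of the node-size rule above, the following shell function rounds a requested CPU count up to the next multiple of 48. The names round_up_to_node and CORES_PER_NODE are hypothetical; only the 48-CPU node size comes from the notes above.

```shell
# Gadi node size, taken from the list above.
CORES_PER_NODE=48

# Hypothetical helper: round a requested CPU count up to whole nodes.
round_up_to_node() {
    echo $(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE * CORES_PER_NODE ))
}

round_up_to_node 100   # prints 144 (three whole nodes)
```

The result could then be used as the ncpus value of a large PBS job.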
Important Notes from Gadi
- You may update the default $PROJECT and $SHELL at $HOME/.config/gadi-login.conf
- You must use the full name accessdev.nci.org.au to log into accessdev from Gadi, because the short name accessdev is not recognised there. On accessdev, you may use either gadi or the full name gadi.nci.org.au.
- When generating keys on Gadi for passwordless communication, use ssh-keygen -t rsa; do not use the dsa option.
- Check your project carefully as it may not have copied correctly from Raijin to Gadi
1) Your gadi:/home/<#>/<username> group ownership
Use chgrp to change this: chgrp <default_gid> ~<username>, e.g. chgrp dp9 ~rmb548. You may want to add -R to include all the sub-directories and files, e.g. chgrp -R dp9 ~rmb548.
2) Your gadi:~<username>/.config/gadi-login.conf default login group
Edit this file to contain the default project you require. This also contains your default SHELL so you can reset it if you want. Note resetting your SHELL is not advised as the use of tcsh or ksh may not be fully supported.
3) Also check and correct your /g/data/<project> files as well if you have created any from Gadi this week
Similarly for any directories and files on gadi:/scratch.
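To spot files that still carry the wrong group, one option is to list everything under a directory that is not owned by the expected project group. This is only a sketch: the helper name list_wrong_group is hypothetical, and dp9 is the example group used above.

```shell
# Hypothetical helper: list files under a directory whose group is not the expected one.
# Usage: list_wrong_group <directory> <group>
list_wrong_group() {
    find "$1" ! -group "$2" -print
}

# Example (run on Gadi): check the home directory against project dp9.
# list_wrong_group "$HOME" dp9
```

An empty result suggests nothing under that directory needs a chgrp.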
SSH Setup
At the moment the remote-job-submission system that was on raijin is not working on gadi. Therefore you need to set up passphraseless SSH between accessdev and gadi in both directions.
Cylc connection gadi to accessdev
You will need to set up a passphraseless SSH key for Cylc jobs running on Gadi to communicate back to Accessdev
On Gadi, run
$ ssh-keygen -f ~/.ssh/id_rsa.accessdev
Just press 'enter' when prompted for a passphrase
Copy the public key to Accessdev with
$ ssh-copy-id -i ~/.ssh/id_rsa.accessdev.pub accessdev.nci.org.au
Configure Gadi to use the key when connecting to Accessdev by adding a new section to '~/.ssh/config':
    Host accessdev.nci.org.au
        IdentityFile ~/.ssh/id_rsa.accessdev
Cylc connection accessdev to gadi
On accessdev, run
ssh-keygen -f ~/.ssh/id_rsa.gadi
Just press 'enter' when prompted for a passphrase
Copy the public key to gadi with
ssh-copy-id -i ~/.ssh/id_rsa.gadi.pub gadi.nci.org.au
Configure accessdev to use the key when connecting to gadi by adding a new section to '~/.ssh/config':
    Host gadi.nci.org.au gadi
        IdentityFile ~/.ssh/id_rsa.gadi
        IdentitiesOnly yes
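Once both keys are in place, the connection can be checked without risking a hanging password prompt. The sketch below relies on the standard OpenSSH BatchMode and ConnectTimeout options; the function name check_passwordless is hypothetical.

```shell
# Hypothetical helper: succeed only if key-based login to the given host works.
# BatchMode=yes makes ssh fail instead of prompting for a password or passphrase.
check_passwordless() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}

# On accessdev: check_passwordless gadi.nci.org.au && echo "accessdev -> gadi OK"
# On Gadi:      check_passwordless accessdev.nci.org.au && echo "gadi -> accessdev OK"
```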
Configuring ACCESS Jobs for Gadi
The ACCESS support team will be providing instructions for running ACCESS on Gadi once we have more information. You can contact us by emailing cws_help@…
NOTE - cylc problems may occur if you define CYLC_VERSION in your login scripts or rose-suite.conf files. To use a supported version of cylc, please either set CYLC_VERSION=7.8.3 or remove the definition from your files.
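A quick way to find stray definitions is to search the usual login scripts and the suite configuration. The following is a sketch under assumptions: the helper name find_cylc_version and the list of login scripts are illustrative, so adjust the list to the files you actually use.

```shell
# Hypothetical helper: print every line mentioning CYLC_VERSION in the given files.
# Files that do not exist are skipped; /dev/null forces grep to print file names.
find_cylc_version() {
    for f in "$@"; do
        if [ -f "$f" ]; then
            grep -n 'CYLC_VERSION' "$f" /dev/null || true
        fi
    done
}

# Example: common login scripts plus the configuration of the suite in the current directory.
find_cylc_version ~/.bashrc ~/.bash_profile ~/.profile ~/.login ~/.cshrc rose-suite.conf
```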
Rose/Cylc (UM vn10 / ACCESS 2 or later)
The exact settings required to run a Rose/Cylc suite on Gadi will depend on the initial setup of the suite. There are two things you must do. First, set up Rose to run suites on Gadi's /scratch disk, which only needs to be done once. Second, set up the Cylc configuration of each suite you want to run so that it talks to Gadi.
We should in time have a list of configurations pre-configured to run on Gadi
- Set up Rose to run jobs from /scratch/$PROJECT/$USER/cylc-run by default:
Add to ~/.metomi/rose.conf (replace a12/abc123 with your project / user id):
    [rose-suite-run]
    root-dir=gadi*=/scratch/a12/abc123
    root-dir{share/cycle}=gadi*=/scratch/a12/abc123
    root-dir{share}=gadi*=/scratch/a12/abc123
    root-dir{work}=gadi*=/scratch/a12/abc123
You may wish to send the 'share' outputs, which include the model output files, to /g/data instead of /scratch.
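To avoid substituting the project and user id by hand, the section can be generated from a small shell function. This is a sketch, not an official tool: the helper name rose_root_dirs is hypothetical, and it assumes $PROJECT and $USER hold your default project and user id on Gadi.

```shell
# Hypothetical helper: print a [rose-suite-run] section for a given project and user.
# Usage: rose_root_dirs <project> <user>
rose_root_dirs() {
    root="/scratch/$1/$2"
    cat <<EOF
[rose-suite-run]
root-dir=gadi*=$root
root-dir{share/cycle}=gadi*=$root
root-dir{share}=gadi*=$root
root-dir{work}=gadi*=$root
EOF
}

# Preview the section before copying it into ~/.metomi/rose.conf.
# The fallbacks a12 and abc123 are the placeholder values used above.
rose_root_dirs "${PROJECT:-a12}" "${USER:-abc123}"
```

Review the output before adding it to the file, particularly if you want the share directories on /g/data rather than /scratch.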
- Set up the HPC task in your Rose suite. This will vary depending on the suite you're using, but should be something like this for a GA job
Copy site/nci_raijin.rc to site/nci_gadi.rc, and edit to set up HPC and UMBUILD_RESOURCE, then set SITE in your rose configuration to 'nci_gadi'
Some GA suites check that the site is in a pre-approved list. If you run into errors, you can instead edit the site/nci_raijin.rc file directly and leave SITE unchanged in your rose configuration.
    [runtime]
        # Add any projects you need to access here - scratch/$PROJECT is included by default
        {% set storage_projects = ['scratch/access', 'gdata/access', 'gdata/'+environ['PROJECT']] %}
        [[HPC]]
            init-script = """
                module purge
                export PATH=~access/bin:$PATH
                module use ~access/modules
                module load openmpi/4.0.1
                ulimit -s unlimited
                """
            [[[remote]]]
                host = gadi
            [[[job submission]]]
                method = pbs
            [[[directives]]]
                -q = normal
                -l ncpus = 1
                -l walltime = 1:00:00
                -l mem = 4gb
                -l storage = {{ storage_projects | join('+') }}
                -W umask = 0022
            [[[environment]]]
                UMDIR = ~access/umdir
                ROSE_TASK_N_JOBS = ${PBS_NCPUS:-1}
                UM_SIGNALS=''
        [[UMBUILD_RESOURCE]]
            inherit = HPC
            init-script = """
                module purge
                export PATH=~access/bin:$PATH
                module use ~access/modules
                module load intel-compiler/2019.3.199
                module load openmpi/4.0.1
                module load gcom/7.0_ompi.4.0.1
                module load fcm
                module load netcdf
                module load drhook
                module load eccodes
                ulimit -s unlimited
                """
            [[[directives]]]
                -q = express
                -l ncpus = 6
                -l mem = 12gb
            [[[environment]]]
                ROSE_TASK_OPTIONS = -f fcm-make2.cfg
        # Change the existing value of ROSE_LAUNCHER_PREOPTS in UM tasks to use OpenMP
        [[ATMOS_RESOURCE]]
            [[[environment]]]
                ROSE_LAUNCHER_PREOPTS = -n {{ cpus(MAIN_ATM_PROCX, MAIN_ATM_PROCY, MAIN_IOS_NPROC, 1) }} --map-by node:PE={{MAIN_OMPTHR_ATM}} --rank-by core
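The storage directive in the configuration above is assembled by the Jinja2 join('+') filter from the storage_projects list. As a rough illustration of the string PBS ends up seeing, the shell sketch below performs the same join; the helper name join_storage is hypothetical and a12 is the placeholder project used earlier.

```shell
# Hypothetical helper: join all arguments with '+', mirroring the Jinja2 join('+') filter.
# The function body runs in a subshell so the IFS change does not leak to the caller.
join_storage() (
    IFS=+
    echo "$*"
)

join_storage scratch/access gdata/access "gdata/${PROJECT:-a12}"
```

With PROJECT unset this prints scratch/access+gdata/access+gdata/a12.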
Build configurations
NCI configurations before vn11.2 used -openmp, which is no longer supported, so this needs to be changed to -qopenmp. Branches with this change have been created for several versions and more will be done on request. See https://code.metoffice.gov.uk/trac/um/ticket/5299. To use this, modify the suite's app/fcm_make_um/rose-app.conf to use (with vn10.7 for example)
    config_revision=@81627
    config_root_path=fcm:um.xm_br/dev/martindix/vn10.7_nci_gadi
Example AMIP suites running on gadi
vn10.7 | u-ak889 |
vn11.4 | u-bo915, u-bp440 |
2019.5.281 and later compilers
Builds with Intel compiler 2019.5.281 and later fail because an intrinsic function name is used as a variable in getobs.F90. See https://code.metoffice.gov.uk/trac/um/ticket/5328 for Ilia's fix. This fix has been backported to the various vnX.Y_nci_gadi branches.
UMUI (UM vn7 / ACCESS 1)
A hand edit script will apply most of the required changes to central data paths and modules needed to run a UMUI job on Gadi; however, you will need to change the run output paths manually.
Configuration Modifications:
Under User Information and Target Machine -> Target Machine, set:
- Number of processors E-W and N-S so that their product is a multiple of 48 (the number of cores per node on Gadi)
- Machine Name as 'gadi'
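A candidate decomposition can be checked before it is entered in the UMUI. The function below is a small sketch (the name fills_whole_nodes is hypothetical) that tests whether the E-W by N-S product is a multiple of 48.

```shell
# Hypothetical helper: succeed if E-W x N-S processors fill whole 48-core Gadi nodes.
# Usage: fills_whole_nodes <procs_ew> <procs_ns>
fills_whole_nodes() {
    [ $(( ($1 * $2) % 48 )) -eq 0 ]
}

if fills_whole_nodes 16 12; then
    echo "16 x 12 = 192 cores fills whole nodes"
fi
```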
Under Input/Output Control and Resources -> Time Convention and SCRIPT Environment Variables, set:
- DATAM to the output directory for model outputs (e.g. '/g/data/$PROJECT/$USER/umui/$RUNID')
- DATAW to the output directory for log files & namelists (e.g. '/scratch/$PROJECT/$USER/umui/$RUNID')
These can be set to the same path if desired.
Under Input/Output Control and Resources -> User hand edit files, add a new entry to the end '~access/gadi/handedits/um7.3' and put a 'Y' in the second column to enable it
Under FCM Configuration -> FCM Extract and Build directories and Output levels, set:
- Target machine root extract directory (UM_ROUTDIR) to a path on /scratch (e.g. '/scratch/$PROJECT/$USER/um_builds'). Note that the build system adds '$USER/$RUNID' to this path automatically.
Configuring PBS Jobs for Python for Gadi
New Improved snippet (Thanks Scott):
    qsub << QSUB
    module use /g/data/hh5/public/modules
    module load conda/analysis27-18.10
    python --version
    python << EOF
    import numpy as np
    print np.version
    EOF
    QSUB
module load conda/analysis27-18.10 can be replaced with module load conda/analysis3-19.10 or a conda environment of your choice. Under a Python 3 environment the print statement needs parentheses, e.g. print(np.version).
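For reference, a Python 3 variant of the snippet might look like the sketch below. It is an adaptation rather than a tested NCI recipe: the function name python3_job_script is hypothetical, and it prints np.__version__ (NumPy's version string) instead of the np.version module.

```shell
# Hypothetical helper: print the job script; pipe the output to qsub to submit it.
# The quoted 'QSUB' delimiter stops the calling shell from expanding anything in the script.
python3_job_script() {
    cat <<'QSUB'
module use /g/data/hh5/public/modules
module load conda/analysis3-19.10
python --version
python << EOF
import numpy as np
print(np.__version__)
EOF
QSUB
}

# On Gadi: python3_job_script | qsub
python3_job_script
```

Keeping the script in a function makes it easy to inspect what will be submitted before piping it to qsub.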
Python version mismatch on conda/analysis3-19.10 Explained
I have noticed that the Python version reported in the conda information does not match the version of the python interpreter in the environment.
Scott has confirmed that the Python version conda reports is the version that conda itself was built under, so the two are always going to differ.
Transition work progress
gadi/transition/access modules
Connections to external hosts/systems
gadi datamovers
gadi-dm.nci.org.au or if needing to reference a specific one gadi-dm-[01-06].nci.org.au
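The bracket notation above is shorthand for six individual hosts. A minimal sketch that expands it (the function name list_datamovers is hypothetical):

```shell
# Expand the gadi-dm-[01-06] shorthand into the six individual host names.
list_datamovers() {
    for n in 01 02 03 04 05 06; do
        echo "gadi-dm-$n.nci.org.au"
    done
}

list_datamovers
```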
UKMO / JASMIN
Jasmin logins are rejected from gadi-login nodes, due to problems with the Gadi DNS records. This problem is being investigated by NCI. Jasmin transfers may work from gadi-dm nodes, but I haven't tested this. [Milton Woods - 2019-12-18]
I have no information at this time and I will be discussing the issue with Milton Woods. I will populate this section when I get it. I have sent a friendly email to Joao Teixeira as an initial step. Griff. 2019-11-21.
Joao Teixeira has responded with an invitation to review the "set of instructions to help UM Partners into joining JASMIN and access MASS ahead of the GC4 assessment.".
These are not yet finalised (they need reviewing) but since you are going through this process at the moment they could possibly be useful for you? Feel free to point out anything that I may have miss and that you find important for people at BoM 😊 https://code.metoffice.gov.uk/trac/jumps/wiki/JASMIN
[Jin LEE] Some tasks of some Rose suites need to access the main MOSRS repositories or the mirrors of the MOSRS repositories when they run on Gadi. To allow this access there are a few things that need to be done:
- Most tasks access the main MOSRS repositories or their mirrors using FCM keywords. These keywords need to be set up on Gadi. [Done - Milton Woods]
- Mirrors of the MOSRS repositories need to be set up on Gadi. The easiest is to mirror the mirrors on accessdev. [Selected mirrors have been created; please request others via cws_help@…].
OpenMP
To use OpenMP and OpenMPI together:
mpirun -np 48 --map-by node:PE=$OMP_NUM_THREADS --rank-by core $PROGRAM
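Because PE=N binds N cores to each MPI rank, the job needs ranks times threads cores in total. The sketch below works through that arithmetic for the command above; the function name cores_needed and the thread count of 2 are illustrative assumptions.

```shell
# Hypothetical helper: total cores = MPI ranks x OpenMP threads per rank.
# Usage: cores_needed <mpi_ranks> <omp_threads>
cores_needed() {
    echo $(( $1 * $2 ))
}

OMP_NUM_THREADS=2
cores_needed 48 "$OMP_NUM_THREADS"   # prints 96, i.e. two 48-core nodes
```

The PBS ncpus request should cover this total.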
Module commands
modulecmd is available on gadi. It was not available on raijin. This is used in some Met Office scripts.
KSH scripts from BASH
When a KSH script is executed from a bash environment, the module command is not available. (This didn't occur on raijin, but it is not clear what the difference is.) Possible solutions (some from Dale Roberts):
- You can have your ksh script source the ksh startup files, which will import the ksh version of the modules function, by changing the shebang line to #!/usr/bin/env -S ksh -l
- You can also run them using ksh -l script
- Also, you can switch to a ksh environment, which will load the ksh definition of modules, before running your script, by running ksh before ./script
- (From Milton) source /etc/profile.d/modules.sh in the ksh script
- (From Yue Sun, NCI help) set FPATH=/opt/Modules/v4.3.0/init/ksh-functions within the KSH script
Note. Dale Roberts advised Jin Lee that this problem is now resolved.
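If a script still needs to defend against the missing command, a guard along the following lines could be placed near its top. This is only a sketch built on the suggestion above to source /etc/profile.d/modules.sh; the function name ensure_module_cmd is hypothetical.

```shell
# Hypothetical helper: make sure the 'module' command exists before it is used.
# If it is missing, fall back to sourcing the modules init file suggested above.
ensure_module_cmd() {
    if command -v module >/dev/null 2>&1; then
        return 0
    fi
    if [ -r /etc/profile.d/modules.sh ]; then
        . /etc/profile.d/modules.sh
    fi
    command -v module >/dev/null 2>&1
}

ensure_module_cmd || echo "module command still unavailable" >&2
```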
Common Problems and Solutions
- Cylc communication problems: remove the CYLC_VERSION definition from login scripts and rose-suite.conf (or set the value to '7.8.3').
- Currently recommended MPI is openmpi; investigations of intel-mpi are continuing.
    module load openmpi/4.0.2
    export KMP_AFFINITY=compact
    mpirun -np $NP --map-by node:PE=$OMP_NUM_THREADS --rank-by core $EXEC
- Using rose mpi-launch allows software-agnostic UM command rendering only if the variable ROSE_LAUNCHER_ULIMIT_OPTS is unset.
- Fortran problems with hdf5: try different versions of the hdf5 module.
- Missing rpc/rpc.h: use -I/usr/include/tirpc for compilation and -ltirpc for linking.
- Missing module function in ksh scripts: recent NCI changes should have fixed this, otherwise please report to cws_help@….
- UM11.4 no longer seems to support grib_api (change actually with vn11.2, see https://code.metoffice.gov.uk/trac/um/ticket/4163). Suggest switching to eccodes:
    module use ~access/modules
    module load eccodes/2.8.0