Using the Totalview debugger from cylc

The method previously described here for raijin no longer works on gadi because PBS now only allows setting the DISPLAY in interactive jobs.

Instead, use reverse connections.

Using Totalview on gadi (using reverse connections)

Job configuration

In your suite or job, you can bypass rose mpi-launch and use an explicit launcher instead. Most calling scripts (in UM, OPS, VAR, SURF) have a variable, e.g. RECON_LAUNCHER or OPS_LAUNCHER, that provides this alternative to rose mpi-launch.

Make sure that anything rose mpi-launch would normally do on the basis of its variables, e.g. setting ulimits, is set up some other way.
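As a minimal sketch of setting a limit by hand, assuming the task needs an unlimited stack (an assumption; check which limits your suite actually sets, for example through ROSE_LAUNCHER_ULIMIT_OPTS), something like the following could go in the task script before the launcher runs:

```shell
# Assumption: the task needs an unlimited stack size.
# With an explicit launcher, rose mpi-launch no longer applies limits for you,
# so try to set the limit in the task script itself.
if ulimit -s unlimited 2>/dev/null; then
    stack_status="stack limit raised"
else
    stack_status="stack limit unchanged"
fi

# Print the limit that is now in effect so it appears in the job log.
echo "$stack_status: $(ulimit -s)"
```

Printing the resulting limit makes it easier to confirm from the job log that the setting took effect.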

For intel-mpi, set the launcher variable to e.g. tvconnect $(which mpiexec.hydra) --tv --debug -n $NPROC, using whichever options you normally have in $ROSE_LAUNCHER_PREOPTS. The --debug option may not be necessary. tvconnect sets up a reverse connection, so that the job connects back to totalview once it begins.

Note that for intel-mpi the mpirun wrapper may not pass --tv properly, so mpiexec.hydra needs to be specified explicitly.

For openmpi, neither --tv nor --debug is needed (they are synonyms in openmpi), and the mpirun wrapper should work, so just use tvconnect $(which mpirun) -n $NPROC
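The two launcher settings above can be sketched as one small shell helper. The function name tv_launcher and its arguments are hypothetical, introduced only for illustration; the commands it prints are the ones given above. It only builds the launcher string and does not start anything.

```shell
# Hypothetical helper: print a TotalView reverse-connect launcher prefix.
# $1 is the MPI flavour ("intel" or "openmpi"); $2 is the number of ranks.
tv_launcher() {
    flavour=$1
    nproc=$2
    case "$flavour" in
        intel)
            # Call mpiexec.hydra directly, since the mpirun wrapper may not pass --tv.
            # Fall back to the bare name if it is not on PATH.
            mpiexec=$(command -v mpiexec.hydra || echo mpiexec.hydra)
            echo "tvconnect $mpiexec --tv --debug -n $nproc"
            ;;
        openmpi)
            # No --tv or --debug needed; the mpirun wrapper should work.
            mpirun=$(command -v mpirun || echo mpirun)
            echo "tvconnect $mpirun -n $nproc"
            ;;
        *)
            echo "unknown MPI flavour: $flavour" >&2
            return 1
            ;;
    esac
}

# Example: set the launcher variable that the calling script reads.
# NPROC defaults to 4 here purely for illustration.
OPS_LAUNCHER=$(tv_launcher openmpi "${NPROC:-4}")
export OPS_LAUNCHER
echo "$OPS_LAUNCHER"
```

Use whichever launcher variable your calling script reads (RECON_LAUNCHER, OPS_LAUNCHER, and so on) in place of OPS_LAUNCHER.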

Also ensure the totalview module is loaded in your PBS job, and give the job a longer walltime than usual, since debugging takes time.
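A rough sketch of the corresponding job-script lines, assuming a plain PBS script; the walltime value is an arbitrary placeholder, and in a cylc suite the walltime would normally be set in the task's directives rather than in a #PBS line:

```shell
#!/bin/bash
#PBS -l walltime=02:00:00   # assumed value; allow time for interactive debugging

# Load TotalView inside the job so that tvconnect is available to the launcher.
# The guard lets the script continue on systems without environment modules.
if command -v module >/dev/null 2>&1; then
    module load totalview
fi

# Report whether tvconnect can be found before the launcher tries to use it.
if command -v tvconnect >/dev/null 2>&1; then
    tv_status="tvconnect found"
else
    tv_status="tvconnect not found"
fi
echo "$tv_status"
```

The final check is optional, but it gives a clear message in the job log if the module was not loaded.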

Running Totalview

On gadi, load the totalview module, and launch totalview. Check under the file menu that it is looking for reverse connections.

Once your job begins on gadi, totalview should prompt you to connect to the job. Once you have done so, hit "go" (green play button). It should then ask what you want to do about starting a parallel job. Say No, and it will stop. You can then search for source files and set breakpoints, then hit play again, or just let it run until it fails.

Using DDT

A similar approach with reverse connections also works with the DDT debugger. In the suite set

ROSE_LAUNCHER_PREOPTS = --connect mpirun -n $NPROC

and load the arm-forge module.

Start ddt on gadi and wait for the reverse connection message.
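To illustrate how the setting above is expected to expand, here is a small sketch. It assumes that ROSE_LAUNCHER is set to ddt (not stated above) and uses a hypothetical executable name; it only prints the resulting command line and does not run it.

```shell
# NPROC defaults to 4 here purely for illustration.
NPROC=${NPROC:-4}

# Assumption: the launcher itself is ddt; only ROSE_LAUNCHER_PREOPTS is given above.
ROSE_LAUNCHER=ddt
ROSE_LAUNCHER_PREOPTS="--connect mpirun -n $NPROC"

# Hypothetical executable name, for illustration only.
exe=./model.exe

# rose mpi-launch places the pre-options between the launcher and the command.
echo "$ROSE_LAUNCHER $ROSE_LAUNCHER_PREOPTS $exe"
```

With ddt already running on gadi, a command of this form is what should trigger the reverse connection request.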

Last modified on Jul 16, 2021 9:39:55 AM