
Using the Totalview debugger from cylc

It's quite simple to modify a suite to run the model with the debugger. See suite u-bj446 for an example that builds and runs a simple test program. The GA6 N48 example suite u-aa124 has an option TOTALVIEW set in rose-suite.conf which runs the model under the debugger. The use of this flag in suite.rc should show what needs to be changed.

In the suite.rc file, add

        pre-script = """
           .... 
           module load totalview

(Note: in older suites this may be called pre-command scripting.)

If the ROSE_LAUNCHER_PREOPTS environment variable is already set (likely in a job using OpenMP), add --debug to the existing options; otherwise set the variable

[[[environment]]]
          ROSE_LAUNCHER_PREOPTS = --debug

If you're using intel-mpi rather than openmpi, set

[[[environment]]]
          ROSE_LAUNCHER_PREOPTS = --tv
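
The "add to the existing options, otherwise set the variable" rule above can be sketched in shell, for example when trying the option out by hand. This is a minimal sketch and not part of rose or cylc: the helper name add_launcher_preopt is hypothetical, and in a real suite the value would normally be written directly in the [[[environment]]] section as shown above.

```shell
# Hypothetical helper: add one option to ROSE_LAUNCHER_PREOPTS without
# discarding options that are already there.
add_launcher_preopt() {
    opt=$1
    case " ${ROSE_LAUNCHER_PREOPTS:-} " in
        *" $opt "*)
            # Option already present: leave the variable unchanged.
            ;;
        *)
            # Append; a separating space is added only if the variable is non-empty.
            ROSE_LAUNCHER_PREOPTS="${ROSE_LAUNCHER_PREOPTS:+$ROSE_LAUNCHER_PREOPTS }$opt"
            ;;
    esac
    export ROSE_LAUNCHER_PREOPTS
}

# openmpi case from the text above
add_launcher_preopt --debug
```

Calling the helper twice with the same option leaves the variable unchanged, so it is safe to use in a script that may be re-run.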

Add

[[[job submission]]]
     command template = "qsub -v DISPLAY,PROJECT '%(job)s'"

(for cylc 7 use [[[job]]] with batch submit command template in place of command template). Also add software=totalview to the job directives, e.g.

[[[directives]]]
            -l walltime=1000
            -l ncpus=16
            -l mem=24gb
            -l software=totalview

For totalview to be able to open windows back on accessdev the ssh communication channel must be kept open. At the moment this requires an alternate wrapper script ~access/bin/cylc_totalview which creates a persistent xmessage window before launching cylc on raijin. To use this for a debug job, on accessdev create $HOME/.cylc/global.rc with

[hosts]
     [[raijin.*]]
        cylc executable = /projects/access/bin/totalview_cylc

This script checks for the line

ROSE_LAUNCHER_PREOPTS="--debug"

in the job file before running xmessage so other non-debug suites should keep running without being affected.
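
The wrapper script itself is not reproduced on this page, but the check it is described as making could look roughly like the following. This is a sketch under assumptions: the function name is_debug_job and the way the job file path is passed in are hypothetical, and the real script may test for the line differently.

```shell
# Hypothetical sketch of the debug-job test: succeed only if the job file
# contains the string ROSE_LAUNCHER_PREOPTS="--debug".
is_debug_job() {
    job_file=$1
    # -F: treat the pattern as a fixed string, -q: print nothing, just set the exit status.
    grep -qF 'ROSE_LAUNCHER_PREOPTS="--debug"' "$job_file"
}

# Usage (path is a placeholder):
#   if is_debug_job /path/to/job; then echo "debug job"; fi
```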

Cylc shows the debug job in the "ready submitting now" state rather than recognising that it has actually been submitted. However, it does detect when the job starts to run, and everything seems to work.

cylc 7.7 and later versions

Cylc 7.7 added a new configuration item, process pool timeout, with a default of 10 minutes.

Because of the way the message window is set up, cylc keeps the task in the ready state rather than submitted, so this limit applies and kills the X connection to raijin. Unfortunately it's a user/site configuration item rather than something you can set in the suite.

In $HOME/.cylc/global.rc on accessdev, add

process pool timeout = PT60M

at the top.
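
If you would rather script this change, the following sketch prepends the setting to a global.rc file, but only when the file does not already contain a process pool timeout line. The function name ensure_pool_timeout is hypothetical; the file location and the PT60M value are the ones given above.

```shell
# Hypothetical helper: put "process pool timeout = PT60M" at the top of the
# given global.rc unless the file already sets a process pool timeout.
ensure_pool_timeout() {
    rc_file=$1
    if [ -f "$rc_file" ] && grep -q '^[[:space:]]*process pool timeout' "$rc_file"; then
        return 0    # already configured; leave the file alone
    fi
    mkdir -p "$(dirname "$rc_file")"
    tmp_file=$(mktemp "$rc_file.XXXXXX")
    printf 'process pool timeout = PT60M\n' > "$tmp_file"
    if [ -f "$rc_file" ]; then
        cat "$rc_file" >> "$tmp_file"    # keep existing settings below the new line
    fi
    mv "$tmp_file" "$rc_file"
}

# Usage on accessdev:
#   ensure_pool_timeout "$HOME/.cylc/global.rc"
```

Prepending rather than appending matters here because the item has to appear before any section header such as [hosts].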

Issues

There is one unfortunate complication from the way rose implements the ulimit option. Many suites have

[[[environment]]]
   ROSE_LAUNCHER_ULIMIT_OPTS = -s unlimited

to pass a ulimit -s unlimited setting to the job. The way this is implemented in rose-mpi-launch interferes with starting the debugger, leading to an error message like

At least one program wasn't a valid executable: /projects/access/apps/rose/2016.06.1/bin/rose-mpi-launch

Instead, remove this option and add ulimit -s unlimited to raijin:$HOME/.bashrc.

In the past, with some versions of openmpi, the mpirun wrapper script interfered with launching the debugger. However this doesn't seem to be an issue now so please report any problems with mpirun to access_help.

For reference, the work-around was to use

[[[environment]]]
   ROSE_LAUNCHER = mpiexec

This means that you lose mpirun's capability of choosing the correct version so you must make sure that the runtime job loads the exact same version of openmpi as the build job.

Intel MPI

With Intel MPI use

ROSE_LAUNCHER_PREOPTS="-tv"
ROSE_LAUNCHER = mpirun

Other settings as for OpenMPI.

Trouble-shooting

  • On rare occasions the xmessage window may fail to come up. You might see the following message in stdout or stderr:
[INFO] exec /apps/intel-mpi/5.1.3.210/intel64/bin/mpirun -n 128 -tv /home/548/jtl548/cylc-run/u-am568/share/fcm_make_ops/build/bin/OpsProg_CreateODB.exe
Unable to open X display. Please check your $DISPLAY environment
variable to ensure that it is defined correctly and that you are
authorized to connect to this X server.

A workaround is to shut down the suite and restart it. This seems to fix the problem.
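
Since the error points at $DISPLAY, one option is to have the job stop early with a clearer message when the variable has not been passed through. The sketch below is an assumption about how this might be done in the pre-script; the function name require_display is hypothetical, and it only checks that DISPLAY is set, not that the X server can actually be reached.

```shell
# Hypothetical pre-flight check: fail with a readable message if DISPLAY
# was not passed to the job (e.g. by qsub -v DISPLAY).
require_display() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set; totalview will not be able to open a window" >&2
        return 1
    fi
    echo "DISPLAY is set to $DISPLAY"
}

# Usage in the pre-script:
#   require_display || exit 1
```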


Using Totalview on gadi

Job configuration

In your suite, or job, you could avoid using rose mpi-launch and instead use an explicit launcher. Most calling scripts (in UM, OPS, VAR, SURF) have a variable, e.g. RECON_LAUNCHER or OPS_LAUNCHER, which provides the alternative to using rose mpi-launch.

Make sure that anything rose-mpi-launch would normally set up from its variables, e.g. ulimits, is instead set up some other way.

For intel-mpi set the launcher variable to e.g. tvconnect $(which mpiexec.hydra) --tv --debug -n $NPROC, or whatever your $ROSE_LAUNCHER_PREOPTS normally are. tvconnect sets up a reverse connection so that the job can connect to totalview once it begins.

Note that for intel-mpi the mpirun wrapper may not pass --tv properly, so mpiexec.hydra must be specified explicitly.

For openmpi, --tv is not needed, and the mpirun wrapper should work.
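
The two cases can be put side by side in a small sketch that builds the launcher string. The function name totalview_launcher and the flavour labels are hypothetical. The intel-mpi command is the one quoted above; the openmpi command is an inference from the text (the same command with --tv dropped and the mpirun wrapper used), not a tested recipe.

```shell
# Hypothetical helper: print a launcher command for use in e.g.
# RECON_LAUNCHER or OPS_LAUNCHER, given the MPI flavour and process count.
totalview_launcher() {
    mpi_flavour=$1
    nproc=$2
    case "$mpi_flavour" in
        intel-mpi)
            # Name mpiexec.hydra explicitly; fall back to the bare name if it
            # is not on PATH (e.g. the MPI module is not loaded yet).
            hydra=$(command -v mpiexec.hydra || echo mpiexec.hydra)
            echo "tvconnect $hydra --tv --debug -n $nproc"
            ;;
        openmpi)
            # Assumption: same command without --tv, using the mpirun wrapper.
            echo "tvconnect mpirun --debug -n $nproc"
            ;;
        *)
            echo "unknown MPI flavour: $mpi_flavour" >&2
            return 1
            ;;
    esac
}

# Usage:
#   OPS_LAUNCHER=$(totalview_launcher intel-mpi "$NPROC")
```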

Also ensure the totalview module is loaded in your PBS job, and give your job a longer walltime.

Running Totalview

On gadi, load the totalview module, and launch totalview. Check under the file menu that it is looking for reverse connections.

Once your job begins on gadi, totalview should give you a prompt to connect to the job. Once you have done so, hit "go" (green play button). It should then ask you what you want to do about starting a parallel job.


Last modified on Feb 16, 2021 12:02:15 PM