
Using the Totalview debugger from cylc

It's quite simple to modify a suite to run the model with the debugger. See suite u-bj446 for an example that builds and runs a simple test program. The GA6 N48 example suite u-aa124 has an option TOTALVIEW set in rose-suite.conf which runs the model under the debugger. The use of this flag in suite.rc should show what needs to be changed.

In the suite.rc file, add

        pre-script = """
           .... 
           module load totalview

(Note: in older suites this may be called pre-command scripting.)

If the ROSE_LAUNCHER_PREOPTS environment variable is already set (likely in a job using OpenMP), add --debug to its options; otherwise set the variable:

[[[environment]]]
          ROSE_LAUNCHER_PREOPTS = --debug
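
The "append if already set, otherwise set" rule can be sketched in shell. This is only an illustration: the function name add_debug_preopt and the option string --example-opt are hypothetical and not part of rose or cylc. In a real suite you would normally just edit the value in suite.rc by hand.

```shell
# Hypothetical helper (not part of rose or cylc): return a
# ROSE_LAUNCHER_PREOPTS value that contains --debug exactly once.
add_debug_preopt() {
    # $1 is the current value of the variable, possibly empty.
    case " $1 " in
        *" --debug "*) printf '%s\n' "$1" ;;                # already present, keep as is
        *)             printf '%s\n' "${1:+$1 }--debug" ;;  # append, or set if empty
    esac
}

# Variable was unset or empty.
add_debug_preopt ""
# Variable already carried other options (--example-opt is a placeholder).
add_debug_preopt "--example-opt 2"
```

The second call keeps the existing options and puts --debug after them, which is the form to use when OpenMP settings are already passed through this variable.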

If you're using intel-mpi rather than openmpi, set

[[[environment]]]
          ROSE_LAUNCHER_PREOPTS = -tv

Add

[[[job submission]]]
     command template = "qsub -v DISPLAY,PROJECT '%(job)s'"

(for cylc 7 use [[[job]]]). Also add software=totalview to the job directives, e.g.

[[[directives]]]
            -l walltime=1000
            -l ncpus=16
            -l mem=24gb
            -l software=totalview

For totalview to be able to open windows back on accessdev the ssh communication channel must be kept open. At the moment this requires an alternate wrapper script ~access/bin/cylc_totalview which creates a persistent xmessage window before launching cylc on raijin. To use this for a debug job, on accessdev create $HOME/.cylc/global.rc with

[hosts]
     [[raijin.*]]
        cylc executable = /projects/access/bin/totalview_cylc

This script checks for the line

ROSE_LAUNCHER_PREOPTS="--debug"

in the job file before running xmessage so other non-debug suites should keep running without being affected.
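
The wrapper's source is not reproduced on this page, so the following is only a guess at what such a check could look like. The function name is_debug_job is hypothetical, and the match is assumed to be a literal string search for the line quoted above.

```shell
# Hypothetical sketch; the real wrapper script may work differently.
# Succeeds (exit status 0) if the job file contains the literal debug setting.
is_debug_job() {
    # $1 is the path to the cylc job file.
    grep -qF 'ROSE_LAUNCHER_PREOPTS="--debug"' "$1"
}

# Demonstration against a scratch file rather than a real job file.
job=$(mktemp)
printf 'export ROSE_LAUNCHER_PREOPTS="--debug"\n' > "$job"
if is_debug_job "$job"; then
    echo "debug job: would start xmessage first"
else
    echo "normal job: would launch cylc directly"
fi
rm -f "$job"
```

A literal match of this kind would not fire when the variable holds extra options alongside --debug, so if the window never appears it is worth checking how the line is actually written in the job file.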

cylc shows the debug job in the "ready submitting now" state rather than picking up that it has actually been submitted. However, it does detect when the job starts to run, and everything else seems to work.

cylc 7.7 and later versions

Cylc 7.7 added a new configuration item, process pool timeout, with a default of 10 minutes.

Because of the way the message window gets set up, cylc keeps the task in the ready state rather than submitted, so this limit applies and kills the X connection to raijin once the timeout expires. Unfortunately it’s a user/site configuration item rather than something you can set in the suite.

In $HOME/.cylc/global.rc on accessdev, add

process pool timeout = PT60M

at the top.
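
To show where "at the top" falls relative to the [hosts] entry described earlier, here is a sketch that prints a combined global.rc. The function name print_global_rc is hypothetical. It only writes to standard output, so it does not touch an existing $HOME/.cylc/global.rc.

```shell
# Hypothetical helper: print a global.rc that combines the two settings
# from this page. The timeout line comes before any [section] header.
print_global_rc() {
cat <<'EOF'
process pool timeout = PT60M

[hosts]
     [[raijin.*]]
        cylc executable = /projects/access/bin/totalview_cylc
EOF
}

print_global_rc
```

If you already have a global.rc, merge these lines into it by hand rather than redirecting the output over the file.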

Issues

There is one unfortunate complication from the way rose implements the ulimit option. Many suites have

[[[environment]]]
   ROSE_LAUNCHER_ULIMIT_OPTS = -s unlimited

to pass a ulimit -s unlimited setting to the job. The way this is implemented in rose-mpi-launch interferes with starting the debugger, leading to an error message like

At least one program wasn't a valid executable: /projects/access/apps/rose/2016.06.1/bin/rose-mpi-launch

Instead, remove this option and add ulimit -s unlimited to raijin:$HOME/.bashrc.
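
If you prefer to script the .bashrc change, a minimal sketch follows. The function name ensure_line is hypothetical. It appends the line only when an identical line is not already in the file, so running it twice is harmless. The demonstration uses a scratch file; on raijin you would pass "$HOME/.bashrc" instead.

```shell
# Hypothetical helper: append a line to a file unless an identical
# line is already there.
ensure_line() {
    # $1 is the file, $2 is the exact line to add.
    touch "$1"
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Demonstration against a scratch file, not the real .bashrc.
demo=$(mktemp)
ensure_line "$demo" 'ulimit -s unlimited'
ensure_line "$demo" 'ulimit -s unlimited'   # second call changes nothing
cat "$demo"
rm -f "$demo"
```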

In the past, with some versions of openmpi, the mpirun wrapper script interfered with launching the debugger. This no longer seems to be an issue, so please report any problems with mpirun to access_help.

For reference, the work-around was to use

[[[environment]]]
   ROSE_LAUNCHER = mpiexec

This means that you lose mpirun's ability to choose the correct version, so you must make sure that the runtime job loads exactly the same version of openmpi as the build job.

Intel MPI

With Intel MPI use

[[[environment]]]
   ROSE_LAUNCHER_PREOPTS = -tv
   ROSE_LAUNCHER = mpirun

Other settings as for OpenMPI.

Trouble-shooting

  • On rare occasions the xmessage window may fail to come up. You might see the following message in stdout or stderr:
[INFO] exec /apps/intel-mpi/5.1.3.210/intel64/bin/mpirun -n 128 -tv /home/548/jtl548/cylc-run/u-am568/share/fcm_make_ops/build/bin/OpsProg_CreateODB.exe
Unable to open X display. Please check your $DISPLAY environment
variable to ensure that it is defined correctly and that you are
authorized to connect to this X server.

A workaround is to shut down the suite and restart it, which seems to fix the problem.

Last modified on Jun 7, 2019 2:08:23 PM