Changes between Version 22 and Version 23 of access/TotalviewCylc


Timestamp:
Jul 16, 2021 9:39:55 AM (21 months ago)
Author:
Martin Dix
Comment:

--

  • access/TotalviewCylc

    v22 v23  
    11= Using the Totalview debugger from cylc =
    22
    3 It's quite simple to modify a suite to run the model with the debugger. See suite u-bj446 for an example that builds and runs a simple test program.  The GA6 N48 example suite u-aa124 has an option TOTALVIEW set in rose-suite.conf which runs the model under the debugger. The use of this flag in suite.rc should show what needs to be changed.
     3The method previously described here for raijin no longer works on gadi because PBS now only allows setting the DISPLAY in interactive jobs.
    44
    5 In the suite.rc file, add
    6 {{{
    7         pre-script = """
    8            ....
    9            module load totalview
    10 }}}
    11 (note this might be {{{pre-command scripting}}} in older suites).
    12 
    13 If the {{{ROSE_LAUNCHER_PREOPTS}}} environment variable is already set (likely in a job using OpenMP) add {{{--debug}}} to the options, otherwise set the variable
    14 {{{
    15 [[[environment]]]
    16           ROSE_LAUNCHER_PREOPTS = --debug
    17 }}}
    18 If you're using intel-mpi rather than openmpi, set
    19 {{{
    20 [[[environment]]]
    21           ROSE_LAUNCHER_PREOPTS = -tv
    22 }}}
    23 Add
    24 {{{
    25 [[[job submission]]]
    26      command template = "qsub -v DISPLAY,PROJECT '%(job)s'"
    27 }}}
    28 (for cylc 7 use {{{[[[job]]]}}}). Also add software=totalview to the job directives, e.g.
    29 {{{
    30 [[[directives]]]
    31             -l walltime=1000
    32             -l ncpus=16
    33             -l mem=24gb
    34             -l software=totalview
    35 }}}
    36 
    37 
    38 For totalview to be able to open windows back on accessdev the ssh communication channel must be kept open. At the moment this requires an alternate wrapper script {{{~access/bin/cylc_totalview}}} which creates a persistent xmessage window before launching cylc on raijin. To use this for a debug job, on accessdev create {{{$HOME/.cylc/global.rc}}} with
    39 {{{
    40 [hosts]
    41      [[raijin.*]]
    42         cylc executable = /projects/access/bin/totalview_cylc
    43 }}}
    44 
    45 This script checks for the line
    46 {{{
    47 ROSE_LAUNCHER_PREOPTS="--debug"
    48 }}}
    49 in the job file before running xmessage so other non-debug suites should keep running without being affected.
    50 
    51 cylc shows the debug job in the "ready submitting now" state rather than picking up that it is actually submitted. However it does detect when it starts to run and everything seems to work.
    52 
    53 == cylc 7.7 and later versions ==
    54 Cylc 7.7 added a new configuration item, {{{process pool timeout}}} with a default of 10 minutes.
    55 
    56 Something about the way the message window gets set up means that, for cylc, the task stays in the ready state rather than submitted so this limit applies and kills the X connection to raijin. Unfortunately it’s a user/site configuration item rather than something you can set in the suite.
    57 
    58 In {{{$HOME/.cylc/global.rc}}} on accessdev, add
    59 {{{
    60 process pool timeout = PT60M
    61 }}}
    62 at the top.
    63 
    64 == Issues ==
    65 There is one unfortunate complication from the way rose implements the ulimit option. Many suites have
    66 {{{
    67 [[[environment]]]
    68    ROSE_LAUNCHER_ULIMIT_OPTS = -s unlimited
    69 }}}
    70 to pass a `ulimit -s unlimited` setting to the job. The way this is implemented in {{{rose-mpi-launch}}} interferes with starting the debugger, leading to an error message like
    71 {{{
    72 At least one program wasn't a valid executable: /projects/access/apps/rose/2016.06.1/bin/rose-mpi-launch
    73 }}}
    74 Instead, remove this option and add `ulimit -s unlimited` to `raijin:$HOME/.bashrc`.
    75 
    76 In the past, with some versions of openmpi, the mpirun wrapper script interfered with launching the debugger. However this doesn't seem to be an issue now so please report any problems with mpirun to access_help.
    77 
    78 For reference, the work-around was to use
    79 {{{
    80 [[[environment]]]
    81    ROSE_LAUNCHER = mpiexec
    82 }}}
    83 
    84 This means that you lose mpirun's capability of choosing the correct version so you must make sure that the runtime job loads the exact same version of openmpi as the build job.
    85 
    86 == Intel MPI ==
    87 
    88 With Intel MPI use
    89 {{{
    90 ROSE_LAUNCHER_PREOPTS="-tv"
    91 ROSE_LAUNCHER = mpirun
    92 }}}
    93 
    94 Other settings as for OpenMPI.
    95 
    96 == Trouble-shooting ==
    97 
    98 * On rare occasions the xmessage window may fail to come up. You might see the following message in stdout or stderr:
    99 
    100 {{{
    101 [INFO] exec /apps/intel-mpi/5.1.3.210/intel64/bin/mpirun -n 128 -tv /home/548/jtl548/cylc-run/u-am568/share/fcm_make_ops/build/bin/OpsProg_CreateODB.exe
    102 Unable to open X display. Please check your $DISPLAY environment
    103 variable to ensure that it is defined correctly and that you are
    104 authorized to connect to this X server.
    105 }}}
    106 
    107 A workaround is to shut down the suite and restart it. This seems to fix the problem.
     5Instead use reverse connections.
    1086
    1097-------------------
    110 = Using Totalview on gadi (using reverse connections)
     8= Using Totalview on gadi (using reverse connections) =
    1119
    11210== Job configuration
     
    13230Useful pages
    13331- https://wikis.uni-paderborn.de/pc2doc/Noctua-Software-TotalView (someone else's wiki on using Totalview)
    134 - [https://help.totalview.io/current/HTML/index.html#page/TotalView/totalviewlhug-reverse-connect.16.01.html# Totalview help on reverse connections]
     32- [https://help.totalview.io/current/HTML/index.html#page/TotalView/totalviewlhug-reverse-connect.15.01.html# Totalview help on reverse connections]
    13533- https://opus.nci.org.au/display/Help/Totalview (NCI help on Totalview)
     34
     35= Using DDT =
     36A similar approach with reverse connections also works with the DDT debugger. In the suite set
     37{{{
     38ROSE_LAUNCHER = ddt
     39ROSE_LAUNCHER_PREOPTS = --connect mpirun -n $NPROC
     40}}}
     41and load the arm-forge module.
     42
     43Start ddt on gadi and wait for the reverse connection message.
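
A minimal sketch of how the DDT settings above might sit in a suite.rc task definition, following the {{{[[[environment]]]}}} layout used elsewhere on this page. The task name {{{atmos}}} is hypothetical, loading the module from {{{pre-script}}} is an assumption, and {{{NPROC}}} is assumed to be defined elsewhere in the suite.
{{{
[runtime]
    # "atmos" is a hypothetical task name; use the task that launches the model
    [[atmos]]
        # Assumption: the arm-forge module is loaded in the task's pre-script
        pre-script = """
            module load arm-forge
        """
        [[[environment]]]
            ROSE_LAUNCHER = ddt
            ROSE_LAUNCHER_PREOPTS = --connect mpirun -n $NPROC
}}}
With settings along these lines the job should request a reverse connection from the ddt session already running on gadi.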