5 | | In the suite.rc file, add |
6 | | {{{ |
7 | | pre-script = """ |
8 | | .... |
9 | | module load totalview |
10 | | }}} |
11 | | (Note: this might be {{{pre-command scripting}}} in older suites.)
12 | | |
13 | | If the {{{ROSE_LAUNCHER_PREOPTS}}} environment variable is already set (likely in a job using OpenMP), add {{{--debug}}} to its existing options; otherwise set the variable:
14 | | {{{ |
15 | | [[[environment]]] |
16 | | ROSE_LAUNCHER_PREOPTS = --debug |
17 | | }}} |
18 | | If you're using intel-mpi rather than openmpi, set |
19 | | {{{ |
20 | | [[[environment]]] |
21 | | ROSE_LAUNCHER_PREOPTS = -tv
22 | | }}} |
23 | | Add |
24 | | {{{ |
25 | | [[[job submission]]] |
26 | | command template = "qsub -v DISPLAY,PROJECT '%(job)s'" |
27 | | }}} |
28 | | (For cylc 7, use {{{[[[job]]]}}}.) Also add {{{software=totalview}}} to the job directives, e.g.
29 | | {{{ |
30 | | [[[directives]]] |
31 | | -l walltime=1000 |
32 | | -l ncpus=16 |
33 | | -l mem=24gb |
34 | | -l software=totalview |
35 | | }}} |
36 | | |
37 | | |
38 | | For totalview to open windows back on accessdev, the ssh communication channel must be kept open. At the moment this requires an alternate wrapper script, {{{~access/bin/cylc_totalview}}}, which creates a persistent xmessage window before launching cylc on raijin. To use this for a debug job, create {{{$HOME/.cylc/global.rc}}} on accessdev with
39 | | {{{ |
40 | | [hosts] |
41 | | [[raijin.*]] |
42 | | cylc executable = /projects/access/bin/totalview_cylc |
43 | | }}} |
44 | | |
45 | | This script checks for the line |
46 | | {{{ |
47 | | ROSE_LAUNCHER_PREOPTS="--debug" |
48 | | }}} |
49 | | in the job file before running xmessage, so non-debug suites should keep running unaffected.
50 | | |
51 | | cylc shows the debug job in the "ready submitting now" state rather than detecting that it has actually been submitted. However, it does detect when the job starts to run, and everything seems to work.
52 | | |
53 | | == cylc 7.7 and later versions == |
54 | | Cylc 7.7 added a new configuration item, {{{process pool timeout}}} with a default of 10 minutes. |
55 | | |
56 | | Something about the way the message window gets set up means that, as far as cylc is concerned, the task stays in the ready state rather than moving to submitted, so this limit applies and kills the X connection to raijin. Unfortunately it’s a user/site configuration item rather than something you can set in the suite.
57 | | |
58 | | In {{{$HOME/.cylc/global.rc}}} on accessdev, add |
59 | | {{{ |
60 | | process pool timeout = PT60M |
61 | | }}} |
62 | | at the top. |
63 | | |
64 | | == Issues == |
65 | | There is one unfortunate complication from the way rose implements the ulimit option. Many suites have |
66 | | {{{ |
67 | | [[[environment]]]
68 | | ROSE_LAUNCHER_ULIMIT_OPTS = -s unlimited |
69 | | }}} |
70 | | to pass a `ulimit -s unlimited` setting to the job. The way this is implemented in {{{rose-mpi-launch}}} interferes with starting the debugger, leading to an error message like |
71 | | {{{ |
72 | | At least one program wasn't a valid executable: /projects/access/apps/rose/2016.06.1/bin/rose-mpi-launch |
73 | | }}} |
74 | | Instead, remove this option and add `ulimit -s unlimited` to `raijin:$HOME/.bashrc`. |
75 | | |
76 | | In the past, with some versions of openmpi, the mpirun wrapper script interfered with launching the debugger. However, this doesn't seem to be an issue now, so please report any problems with mpirun to access_help.
77 | | |
78 | | For reference, the work-around was to use |
79 | | {{{ |
80 | | [[[environment]]]
81 | | ROSE_LAUNCHER = mpiexec |
82 | | }}} |
83 | | |
84 | | This means that you lose mpirun's ability to choose the correct version, so you must make sure that the runtime job loads exactly the same version of openmpi as the build job.
85 | | |
86 | | == Intel MPI == |
87 | | |
88 | | With Intel MPI use |
89 | | {{{ |
90 | | ROSE_LAUNCHER_PREOPTS="-tv" |
91 | | ROSE_LAUNCHER = mpirun |
92 | | }}} |
93 | | |
94 | | Other settings are the same as for OpenMPI.
95 | | |
96 | | == Trouble-shooting == |
97 | | |
98 | | * On rare occasions the xmessage window may fail to come up. You might see the following message in stdout or stderr:
99 | | |
100 | | {{{ |
101 | | [INFO] exec /apps/intel-mpi/5.1.3.210/intel64/bin/mpirun -n 128 -tv /home/548/jtl548/cylc-run/u-am568/share/fcm_make_ops/build/bin/OpsProg_CreateODB.exe |
102 | | Unable to open X display. Please check your $DISPLAY environment |
103 | | variable to ensure that it is defined correctly and that you are |
104 | | authorized to connect to this X server. |
105 | | }}} |
106 | | |
107 | | A workaround is to shut down the suite and restart it. This seems to fix the problem. |
| 5 | Instead use reverse connections. |