Frequently Asked Questions about Rose and Cylc
- Committing Suites into Rosie
- Viewing Suites
- Running Suites
  - Q: How do I validate a suite.rc?
  - Q: How do I view log files from running a suite?
  - Q: I made changes to my suite that is running, how do I load these …
  - Q: How do I allow other users to monitor my suite that is running?
  - Q: How do I run the model under the totalview debugger?
  - Q: How do I reduce my suite's disk usage?
- Rose and cylc version updates
- Restarting after an accessdev reboot
- Error Messages
  - Q: When getting authentication failure at any pre-build task in Cylc …
  - Q: Getting the error message ...Killed rose $* when running a suite
  - Q: Cylc is able to submit jobs to raijin but gets an error message like
  - Q: Rose gives error message about a suite still running when you start …
  - Q: Rose gives error message "[FAIL] [Errno 16] Device or resource …
  - Q: My log files from raijin don't appear on accessdev
  - Q: I'm receiving an error about not finding the rose or cylc command
Committing Suites into Rosie
Q: How do I commit a suite "A" into rosie on accessdev?
A:
1. Run rosie create. You will be prompted with an editor in which to enter the suite information. After exiting the editor, type y for yes. You will see a message like
[INFO] au-aa153: created at svn+ssh://accessdev.nci.org.au/home/access-svn/roses_au_svn/a/a/1/5/3
[INFO] au-aa153: local copy created at /home/548/wml548/roses/au-aa153
2. Copy everything in suite "A" to suite au-aa153.
3. Run fcm status to check the status of all files, and run fcm add file to add the files you are committing to the repository.
4. Run fcm commit to put the suite into svn.
5. Other users can then run rosie checkout au-aa153 to check out the suite.
Q: I get a "gvim -f" error when running rosie copy or rosie create
rose.popen.RosePopenError: gvim\ -f /local/dp9/zxs548/tmp/tmpDOlOS_ # return-code=1, stderr= [Errno 2] No such file or directory: 'gvim -f'
A: Please edit the following lines in your $HOME/.metomi/rose.conf, changing
[external]
editor="gvim -f"
geditor="gvim -f"
to
[external]
editor=gvim -f
geditor=gvim -f
or just delete your $HOME/.metomi/rose.conf to use the system default /usr/local/rose/etc/rose.conf.
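The quote removal described above can also be done non-interactively. The following is a minimal sketch, assuming the file has the layout shown in the answer; the function name strip_editor_quotes is hypothetical and not part of Rose. It prints the corrected file to standard output instead of editing it in place, so the result can be reviewed first.

```shell
# Hypothetical helper (not part of Rose): print a rose.conf with the
# surrounding double quotes removed from the editor= and geditor= values.
# All other lines are printed unchanged.
strip_editor_quotes() {
    # $1: path to the config file, e.g. $HOME/.metomi/rose.conf
    sed -E 's/^(g?editor)="(.*)"[[:space:]]*$/\1=\2/' "$1"
}
```

For example, strip_editor_quotes "$HOME/.metomi/rose.conf" > rose.conf.new writes a corrected copy that can be compared with the original before replacing it.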
Viewing Suites
Q: How do I view a list of Rose suites on accessdev?
A: Run the command rosie go. By default this shows the suites you have checked out. You can also search by username, or search with the string "au" to show everything in the local repository or "u" for the MOSRS repository.
Q: How do I view a list of my running Cylc suites on accessdev?
A: Run the command cylc scan for a text list or cylc gsummary for a GUI. By default this shows the suites you are currently running. Double-clicking on one will pop up a separate Cylc window for that suite.
Running Suites
Q: How do I validate a suite.rc?
A: When designing or testing a new suite, you will often need to validate the definition file suite.rc to remove any errors before testing jobs on the HPC system. The best way to validate a suite.rc that is still under development is to run the suite in simulation mode, which does not actually send jobs to the supercomputer:
rose suite-run -- --mode=simulation
Q: How do I view log files from running a suite?
A: Log files are located on accessdev in $HOME/cylc-run/SUITE/log/, where SUITE is the name of the suite. Alternatively, log files can be viewed using the Rose Bush web interface at https://accessdev.nci.org.au/rose-bush/
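As a small illustration of the directory layout above, the sketch below builds the log path for a suite and lists its contents. The function names suite_log_dir and list_suite_logs are hypothetical helpers, not Rose or Cylc commands.

```shell
# Hypothetical helper: print the log directory for a suite, following the
# $HOME/cylc-run/SUITE/log/ layout described above.
suite_log_dir() {
    # $1: suite name, e.g. au-aa153
    printf '%s/cylc-run/%s/log\n' "$HOME" "$1"
}

# Hypothetical helper: list the entries in a suite's log directory,
# most recently modified first.
list_suite_logs() {
    ls -t "$(suite_log_dir "$1")"
}
```

Running list_suite_logs SUITE on accessdev would then show the newest entries at the top.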
Q: I made changes to my suite that is running, how do I load these changes without stopping the suite?
A: Run the command rose suite-run --reload
Q: How do I allow other users to monitor my suite that is running?
A: While the suite is running, run the command rose monitor --allow SUITE, where SUITE is the name of the suite. Other users can then monitor this suite by running rose monitor --user USER SUITE, where USER is the login id of the user who is running the suite.
Q: How do I run the model under the totalview debugger?
A: See access/TotalviewCylc
Q: How do I reduce my suite's disk usage?
A: See access/MinimisingDiskUse
Rose and cylc version updates
See access/RoseCylcVersions for information on what happens when versions are updated and how to run with specific versions.
Restarting after an accessdev reboot
accessdev may be rebooted as part of the periodic NCI maintenance. This will be announced in advance but will interrupt any running rose/cylc suites.
The simplest solution is to stop your suites in advance with cylc stop SUITEID and restart them afterwards.
If you don't stop your suite beforehand, the reboot will kill the controlling cylc process of any suites you have running. When raijin restarts, held jobs will complete but suites will then stop because they can't communicate with the cylc process on accessdev.
To continue the run, use cylc restart SUITEID. This will first check the status of jobs on raijin, so it won't rerun anything unnecessarily. With cylc6 suites you may also need to remove the file ~/.cylc/ports/SUITEID on accessdev.
If you have a long running suite that was started with a version of rose/cylc that is no longer the current default, you should specify the original versions, e.g.
CYLC_VERSION=6.7.2 ROSE_VERSION=2015.11.0 cylc restart SUITEID
If you're not sure, you can check the CYLC_VERSION and ROSE_VERSION in the processed suite.rc file, ~/cylc-run/SUITEID/suite.rc. Note this is the processed version in cylc-run, not the original one in ~/roses.
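One way to look up the two versions is to search the processed file for the variable names. This is a minimal sketch: the function name show_suite_versions is hypothetical, and it makes no assumption about how the lines are formatted beyond containing CYLC_VERSION or ROSE_VERSION.

```shell
# Hypothetical helper: print the lines of a processed suite.rc that
# mention CYLC_VERSION or ROSE_VERSION.
show_suite_versions() {
    # $1: suite ID; reads the processed copy under ~/cylc-run, not ~/roses
    grep -E 'CYLC_VERSION|ROSE_VERSION' "$HOME/cylc-run/$1/suite.rc"
}
```

The values printed by show_suite_versions SUITEID can then be passed to cylc restart as in the example above.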
Error Messages
Q: Getting an authentication failure at any pre-build task in Cylc on any machine $HPC (could be raijin or accessdev)
reason: Username: svn: OPTIONS of 'https://access-svn.nci.org.au/svn/cmip5/trunk/bin': authorisation failed:…
A: On $HPC, please try
svn ls https://access-svn.nci.org.au/svn/cmip5
You will need to enter your password. Then run the following to make sure the stored password is readable only by you:
`chmod 600 ~/.subversion/auth/svn.simple/*`
Please change the svn location to suit your case.
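To confirm that the chmod took effect, the cached credential files can be checked for any mode other than 600. This is a sketch only; list_loose_svn_auth is a hypothetical helper name, and the default directory is the one used in the chmod command above.

```shell
# Hypothetical helper: list cached Subversion credential files whose
# mode is anything other than 600 (owner read/write only).
list_loose_svn_auth() {
    # $1: auth directory; defaults to the path used in the chmod command above
    find "${1:-$HOME/.subversion/auth/svn.simple}" -type f ! -perm 600
}
```

If list_loose_svn_auth prints nothing, every cached credential file already has mode 600.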
Q: Getting the error message ...Killed rose $* when running a suite
[FAIL] ssh -oBatchMode=yes raijin.nci.org.au bash --login -c \'ROSE_VERSION=2014-05\ /projects/access/bin/rose\ suite-run\ -v\ -v\ --name=au-aa147\ --run=run\ --remote=uuid=909dcfca-27af-45a7-8e4f-b838ab69fff8,root-dir-share=/short/$PROJECT/$USER,root-dir-work=/short/$PROJECT/$USER\' # return-code=137, stderr=
[FAIL] /projects/access/bin/rose: line 10: 18963 Killed rose $*
A: Please check your quota on raijin with lquota to make sure you have not exceeded your limit for $HOME on raijin.
Q: Cylc is able to submit jobs to raijin but gets an error message like
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
ERROR: remote command failed 255
Received signal ERR
Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).
ERROR: remote command failed 255
A: Please check the ssh communication between raijin-raijin, raijin-accessdev, accessdev-accessdev, and accessdev-raijin.
To check this, run on accessdev
ssh raijin ls
The communication is fine if the contents of your raijin $HOME are printed out. Then try to run
ssh raijin cylc
If you are asked to input your password, quit the test and run on accessdev
remote-job-submission
This will set up Cylc to be able to send jobs to the raijin compute nodes.
Q: Rose gives error message about a suite still running when you start a suite
E.g.,
[FAIL] Suite "access_x_vn7.6_4km" may still be running.
[FAIL] Host "localhost" has port-file:
[FAIL] ~wml548/.cylc/ports/access_x_vn7.6_4km
[FAIL] Try "rose suite-shutdown access_x_vn7.6_4km" first?
A: On accessdev, go to ~wml548/.cylc/ports/ and delete the file access_x_vn7.6_4km, then rerun rose suite-run in /home/548/wml548/roses/access_x_vn7.6_4km.
If this still does not work, try the solutions given in the next answer.
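Before deleting anything, it can help to confirm that the port file named in the error message actually exists. The sketch below assumes the ~/.cylc/ports/SUITE location shown in the error; find_port_file is a hypothetical helper, not a Cylc command, and it does not check whether the suite is really still running (cylc scan can be used for that).

```shell
# Hypothetical helper: report whether a cylc port file exists for a suite.
# Prints the path and returns 0 if ~/.cylc/ports/SUITE exists, returns 1 otherwise.
find_port_file() {
    # $1: suite name, e.g. access_x_vn7.6_4km
    port_file="$HOME/.cylc/ports/$1"
    if [ -e "$port_file" ]; then
        printf '%s\n' "$port_file"
        return 0
    fi
    return 1
}
```

If find_port_file SUITE prints a path and the suite is not listed by cylc scan, the file is the stale one to remove.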
Q: Rose gives error message "[FAIL] [Errno 16] Device or resource busy: 'log.20140415T053044Z/suite/.nfs0000000000e458190000057e'" when running rose suite-run
Note: The solutions here may also apply in the general case when a suite cannot be run or restarted.
A: This message usually appears when a process, such as an editor, has one of the log files or job scripts from the suite open.
- Close these processes and re-run rose suite-run.
- If the previous step does not work, try running rose suite-shutdown or cylc control stop $suite.
- Run ps -ef | grep $suite and kill the process related to $suite with kill -9 $JOBNO.
Q: My log files from raijin don't appear on accessdev
A: There may be a delay in the generation of STDOUT and STDERR log files created from a PBS job. Cylc is configured to retry several times after a delay. Note that files larger than 2 MB will not be automatically retrieved, though this can be overridden in your suite. If you're using a version of cylc earlier than 6.9.1, event hooks should call the wrapper script rose task-hook2 instead of rose task-hook to ensure log files are pulled back from raijin to accessdev.
Q: I'm receiving an error about not finding the rose or cylc command
A: Check to see if your suite.rc file contains the following lines in the initial scripting section
module use ~access/modules
module load rose
module load cylc
You may also need to check your .bashrc file to see if there are any conflicts with modules there or your environment variable $PATH.
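The check for the three module lines can be scripted. The sketch below is a rough aid under a stated assumption: it only tests whether each line appears somewhere in the file, not whether it sits in the initial scripting section. The name check_module_lines is hypothetical.

```shell
# Hypothetical helper: check that a suite.rc contains the three module
# lines listed above. Prints each missing line; returns 1 if any is missing.
check_module_lines() {
    # $1: path to a suite.rc file
    missing=0
    for line in 'module use ~access/modules' 'module load rose' 'module load cylc'; do
        # -F treats the line as a fixed string rather than a regular expression
        if ! grep -qF -- "$line" "$1"; then
            printf 'missing: %s\n' "$line"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_module_lines ~/roses/SUITE/suite.rc prints nothing and returns 0 when all three lines are present.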