Opened 4 years ago

Closed 4 years ago

#190 closed (fixed)

Add cylc 6.4.1 and rose 2015.04.1

Reported by: Martin Dix
Owned by: Martin Dix
Priority: major
Component: ACCESS model
Keywords:
Cc:

Description (last modified by Martin Dix)

Install latest versions of rose and cylc.

Investigate setting defaults to these versions.

Change History (11)

comment:1 Changed 4 years ago by Martin Dix

Description: modified (diff)
Owner: set to Martin Dix
Status: new → assigned

comment:2 Changed 4 years ago by Martin Dix

When communicating status back to accessdev, cylc 6.3.0 effectively does

ssh accessdev /usr/local/cylc/bin/cylc started

This gets handled by the remote-job-submission command.

However cylc 6.4.1 adds an environment variable before the command, effectively

ssh accessdev CYLC_VERSION=6.4.1 /usr/local/cylc/bin/cylc started

This fails before even getting as far as the remote-job-submission command.

As a temporary workaround, I removed the line

command += ["CYLC_VERSION=%s" % CYLC_VERSION]

from remote.py
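
The difference between the two command forms can be sketched as below. This is only an illustration, not the actual code in cylc's remote.py: the function name build_status_command and its parameters are made up for this ticket, and only the CYLC_VERSION=... element mirrors the line quoted above.

```python
def build_status_command(host, cylc_path, message, cylc_version=None):
    """Return the argv list used to report a task status over ssh.

    Hypothetical helper for illustration; not part of cylc.
    """
    command = ["ssh", host]
    if cylc_version is not None:
        # This mirrors the line that cylc 6.4.1 adds in remote.py.
        command += ["CYLC_VERSION=%s" % cylc_version]
    command += [cylc_path, message]
    return command


# cylc 6.3.0 style: no environment variable before the command.
old_style = build_status_command(
    "accessdev", "/usr/local/cylc/bin/cylc", "started")

# cylc 6.4.1 style: CYLC_VERSION precedes the command.
new_style = build_status_command(
    "accessdev", "/usr/local/cylc/bin/cylc", "started", cylc_version="6.4.1")

print(" ".join(old_style))
print(" ".join(new_style))
```

Because the first word seen by the remote end is now the variable assignment rather than the cylc path, anything on accessdev that dispatches on the first word of the command will not recognise it.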

comment:3 Changed 4 years ago by Martin Dix

Changed raijin:~access/bin/cylc to do

if [[ $CYLC_VERSION = 6.4.1 ]]; then
    module load rose/2015.04.1
elif [[ $CYLC_VERSION = 6* ]]; then
    module load rose/2015.02.0
else
    module load rose/2014-04
fi

We need something cleverer than this in the longer term.
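
One possible direction, sketched here in Python rather than in the wrapper's own shell, is to keep the pairing in a single ordered table so that a new release only needs a new row. The names ROSE_FOR_CYLC, DEFAULT_ROSE and rose_module_for are invented for this sketch; the version pairs are the ones in the wrapper above.

```python
from fnmatch import fnmatchcase

# Ordered (cylc version pattern, rose module) pairs; first match wins.
# Patterns use shell-style globbing, like the [[ ... = 6* ]] test above.
ROSE_FOR_CYLC = [
    ("6.4.1", "rose/2015.04.1"),
    ("6*", "rose/2015.02.0"),
]

# Fallback used when no pattern matches (including an unset version).
DEFAULT_ROSE = "rose/2014-04"


def rose_module_for(cylc_version):
    """Return the rose module name to load for a given cylc version."""
    for pattern, module in ROSE_FOR_CYLC:
        if fnmatchcase(cylc_version, pattern):
            return module
    return DEFAULT_ROSE


for version in ("6.4.1", "6.3.0", "5.4.14", ""):
    print(repr(version), "->", rose_module_for(version))
```

This does not remove the need to maintain the table by hand; it only puts the mapping in one place instead of in a growing if/elif chain.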

After updating the default version of cylc on accessdev-test to 6.4.1, the cylc started command failed:

raijin% ssh accessdev-test /usr/local/cylc/6.4.1/bin/cylc started
Failed to get version number!

Cylc gets the version number by calling git describe. On accessdev
/opt/remote-job-submission/bin/git has

parent="$(/bin/ps -o command= -p $PPID)"
#echo "$parent" 1>&2

if [ "$parent" = "/bin/bash /usr/local/cylc/admin/get-repo-version" ]; then
        exec /usr/bin/git "$@"
fi

and otherwise fails. In this case the parent path is under the default version directory cylc/cylc-6.4.1 rather than cylc, so the exact string comparison does not match. This has been broken since we installed cylc-6.3.0, but it did not cause a problem because cylc 5.4.14 returned an empty version number rather than raising a fatal error.

Matching on a regular expression like

cylc_regexp="/bin/bash /usr/local/cylc/cylc-[.0-9]*/admin/get-repo-version"
if [[ $parent =~ $cylc_regexp ]]; then
        exec /usr/bin/git "$@"
fi

would work.
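
To check what the proposed pattern accepts, the same expression can be tried in Python. This is only a test harness for the regexp, not a replacement for the git wrapper; parent_is_allowed is a made-up name. Like the bash =~ operator, re.search is unanchored.

```python
import re

# Same pattern as the proposed cylc_regexp.
CYLC_REGEXP = re.compile(
    r"/bin/bash /usr/local/cylc/cylc-[.0-9]*/admin/get-repo-version")


def parent_is_allowed(parent):
    """Return True if the parent command line matches the pattern."""
    return CYLC_REGEXP.search(parent) is not None


candidates = [
    "/bin/bash /usr/local/cylc/cylc-6.4.1/admin/get-repo-version",
    "/bin/bash /usr/local/cylc/cylc-6.3.0/admin/get-repo-version",
    "/bin/bash /usr/local/cylc/admin/get-repo-version",
    "/bin/bash /tmp/some-other-script",
]
for parent in candidates:
    print(parent_is_allowed(parent), parent)
```

Note that the unversioned path, which the current exact-match test accepts, is not matched by this pattern, so both tests would be needed if that path is still in use.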

comment:4 Changed 4 years ago by Martin Dix

Also need to add a link to the new default cylc version in /opt/remote-job-submission/rbin.d

This is presently hardwired in modules/accessdevnode/manifests/init.pp, but ideally it should pick up the cylc default_version set in project.yaml.

comment:5 Changed 4 years ago by Martin Dix

If the cylc status commands on raijin can't include CYLC_VERSION in the command line then they'll always communicate with the default version on accessdev. In turn this creates a message that gets sent to the controlling cylc process on accessdev via pyro.

cylc-5.4.14 created messages like

model.2001010100 started at 2015-05-23T04:10:49

but cylc-6 includes a timezone, e.g.

model.2001010100 started at 2015-05-23T04:10:49Z

cylc uses a regexp to strip off the time part and get the status (e.g. started). The regexp used by cylc5 doesn't handle the timezone and so the messages aren't interpreted properly and the suite hangs. Using the cylc6 regexp works ok.
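
The failure mode can be reproduced with two simplified patterns. These are not the regexps from the cylc source; NO_TZ and WITH_TZ are stand-ins written for this ticket, assuming the message has the form "task status at timestamp" as in the examples above.

```python
import re

TIMESTAMP = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"

# Stand-in for a cylc5-style pattern: timestamp with no timezone.
NO_TZ = re.compile(r"^(?P<task>\S+) (?P<status>.+) at " + TIMESTAMP + r"$")

# Stand-in for a cylc6-style pattern: optional Z or numeric UTC offset.
WITH_TZ = re.compile(
    r"^(?P<task>\S+) (?P<status>.+) at " + TIMESTAMP
    + r"(?:Z|[+-]\d{2}(?::?\d{2})?)?$")


def status_of(message, pattern):
    """Return the status word(s) from a message, or None if it doesn't parse."""
    match = pattern.match(message)
    return match.group("status") if match else None


cylc5_message = "model.2001010100 started at 2015-05-23T04:10:49"
cylc6_message = "model.2001010100 started at 2015-05-23T04:10:49Z"

print(status_of(cylc5_message, NO_TZ))    # parses
print(status_of(cylc6_message, NO_TZ))    # fails to parse
print(status_of(cylc6_message, WITH_TZ))  # parses
```

With the timezone-unaware pattern the cylc6 message yields no status at all, which is consistent with the suite never seeing the task as started.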

comment:6 Changed 4 years ago by Martin Dix

Chris Allen has patched the remote-job-submission script so that the environment variable is passed properly, which makes the previous comment irrelevant. The remote-job-submission git script has also been patched to use the regexp.

comment:7 Changed 4 years ago by Martin Dix

Cylc6 jobs now work properly. However a cylc 5 suite will communicate back with a command like

ssh accessdev /usr/local/cylc/bin/cylc started

The lack of a version number means it will get the default version of cylc and the timezone issue above recurs.

Possibilities are

  • Modify cylc-5.4.14 on raijin to include the version number when communicating
  • Patch cylc on accessdev to use a new regexp
  • Modify the cylc wrapper on accessdev to default to the older versions for commands like started but the normal default otherwise

The first seems the simplest and least likely to cause problems with later updates of accessdev.

Created ~access/apps/cylc-5.4.14.sendversion with a change to remote.py to add the version number. Checked that this worked properly with existing setup on accessdev and new setup on accessdev-test and then modified ~access/app/cylc/5.4.14 to point to this version.

comment:8 Changed 4 years ago by Martin Dix

Tested updating accessdev-test. Simple cylc5 suite kept running through the entire process without a problem.

rose bush now shows job.out and job.err from cylc6 suites properly.

rosie go shows the MOSRS repository properly. Just need $HOME/.metomi/rose.conf with

[rosie-id]
prefix-username.u = name

comment:9 Changed 4 years ago by Martin Dix

Resolution: fixed
Status: assigned → closed

comment:10 Changed 4 years ago by Martin Dix

Resolution: fixed
Status: closed → reopened

rose-bush on accessdev isn't showing the job list of a suite or job output files, though it does on accessdev-test.

Compare
https://accessdev.nci.org.au/rose-bush/jobs/mrd599/simple_cycle6
and
https://accessdev-test.nci.org.au/rose-bush/jobs/mrd599/simple_cycle6

Both are running 2014.05.1

accessdev is showing a job list for a cylc5 suite so somewhere it's still using the old directory structure. Some default not updated somewhere?

comment:11 Changed 4 years ago by Martin Dix

Resolution: fixed
Status: reopened → closed

Error messages suggested it was using the new version of rose bush but still using an old version of some other rose code, presumably cached somewhere. Probably didn't happen on accessdev-test because old version of rose-bush wasn't used before the update.

Fixed by restarting apache on accessdev.