Opened 4 years ago

Last modified 3 years ago

#167 accepted

Move prebuilds out of ~access

Reported by: Scott Wales
Owned by: Scott Wales
Priority: major
Component: Accessdev Server
Keywords: TIWG
Cc: Martin Dix, Jin Lee

Description

The ~access directory is currently managed using a Subversion repository, which makes it difficult to add new prebuilds.

Look into moving prebuilds to a different location (/g/data1/access?)

Change History (23)

comment:1 Changed 4 years ago by Scott Wales

Owner: set to Scott Wales
Status: new → accepted

comment:2 Changed 4 years ago by Scott Wales

Blocked by #166

comment:3 Changed 4 years ago by Scott Wales

Cc: Martin Dix added

Now that /g/data1/access is mounted, I think that's a good location for the prebuilds, since it can be seen from both machines. Probably with a directory structure like:

/g/data1/access/prebuilds/vn10.1/vn10.1_safe_noomp

I will run some tests first to check that this will work.

comment:5 Changed 4 years ago by Martin Dix

See https://accessdev.nci.org.au/trac/wiki/access/RoseSuitePrebuilds for some progress and issues.

Creating prebuilds from a suite is going to use the normal $HOME/cylc-run/SUITE/share structure, though perhaps the suite could override this. Moving a prebuild directory is tricky at the moment; see https://github.com/metomi/fcm/issues/185.

Also, we'll likely need separate directories for the extract and build steps, otherwise fcm clobbers its own state; see https://github.com/metomi/fcm/issues/126.

comment:6 Changed 4 years ago by Scott Wales

FCM lets you configure the build directory using the --directory flag, which can be added to the ROSE_TASK_OPTIONS environment variable in the Cylc task.

Trying this out, however, I get errors like

[FAIL] /g/data/access/prebuild/vn10.1/safe_noomp/.fcm-make/cache/extract/jules/0: cannot create
[FAIL] {'e' => 'svn: E000116: Can\'t create temporary file from template \'/g/data/access/prebuild/vn10.1/safe_noomp/.fcm-make/cache/extract/jules/0/vm/svn-XXXXXX\': Stale file handle

It seems that Subversion doesn't like running on the NFS mount.

The failing command is

svn export https://130.56.244.76/svn/jules/main/trunk@692 /g/data/access/prebuild/vn10.1/safe_noomp/.fcm-make/cache/extract/jules/0

comment:7 Changed 4 years ago by Martin Dix

Trying to check out rose-meta to /g/data/access also failed in the same way:

% fcm checkout fcm:um.xm/trunk/rose-meta         
A    rose-meta/um-fcm-make
.....
A    rose-meta/um-fcm-make/versions.py
svn: E000116: Can't create temporary file from template '/g/data1/access/rose-meta/.svn/tmp/svn-XXXXXX': Stale file handle
[FAIL] svn checkout https://130.56.244.76/svn/um/main/trunk/rose-meta # rc=1

comment:8 Changed 4 years ago by Chris Allen

I tried:

svn checkout svn+ssh://accessdev.nci.org.au/home/access-svn/roses_test_svn test

It fails in random places either with the same error or with other filesystem metadata related problems. For example:

21532 11:43:07 open("/g/data1/access.dev/tmp/cma900/test3/.svn/tmp/svn-oUeTOX", O_RDWR|O_CREAT|O_EXCL, 0600 <unfinished ...>
21532 11:43:07 <... open resumed> )     = -1 ESTALE (Stale file handle)

And this is a sign that things aren't well on /g/data:

$ mkdir tmp
$ cd tmp
bash: cd: tmp: Permission denied
(a few seconds later)
$ cd tmp
$

Interestingly, so far I've got the checkout to work every time if I do it in a directory tree which doesn't have any ACLs.

Anyway, there do seem to be some problems with /g/data at the moment, so until those are sorted out I wouldn't rely on anything you're seeing. I'll report what I've observed to the storage team.
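The mkdir/cd sequence shown above can also be checked programmatically. The following is only a minimal sketch under assumptions: the function name, retry count and delay are hypothetical and not taken from this ticket. It creates a directory, immediately tries to change into it, and reports how many attempts were refused before one succeeded.

```python
import os
import time


def probe_mkdir_then_chdir(parent, name="tmp", retries=10, delay=0.5):
    """mkdir `name` under `parent`, then chdir into it.

    Returns the number of chdir attempts that raised PermissionError
    (0 on a healthy filesystem, `retries` if every attempt failed).
    """
    target = os.path.join(parent, name)
    original = os.getcwd()
    os.mkdir(target)
    failed = 0
    try:
        for _ in range(retries):
            try:
                os.chdir(target)  # the step that failed with "Permission denied"
                break
            except PermissionError:
                failed += 1
                time.sleep(delay)  # wait before trying again
    finally:
        # Restore the working directory and clean up the probe directory.
        os.chdir(original)
        os.rmdir(target)
    return failed
```

A non-zero return value from a directory on the filesystem under test would correspond to the delayed `cd` seen in the shell session.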

comment:9 Changed 4 years ago by Scott Wales

Keywords: TIWG added

comment:10 Changed 4 years ago by Scott Wales

It looks like today's maintenance hasn't fixed this issue; I'm still getting stale file handle errors.

comment:11 Changed 4 years ago by Chris Allen

OK, I'll chase it up.

comment:12 Changed 4 years ago by Chris Allen

I managed to reproduce the problem with a simple Python script (it turns out that we also see other errors without ACLs). The storage team believe they have located a potential bug and are now waiting on some Lustre patches from upstream.
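For illustration, a probe along the following lines exercises the same exclusive-create call that appears in the strace output in comment:8 (O_RDWR|O_CREAT|O_EXCL with mode 0600). It is a hypothetical sketch, not the script referred to above: the function name, probe file names and attempt counts are all assumptions.

```python
import collections
import errno
import os
import tempfile


def stress_tempfile_creation(directory, attempts=1000):
    """Create and delete `attempts` files in `directory`.

    Returns a Counter mapping errno names (e.g. 'ESTALE', 'EACCES')
    to the number of times that error was raised.
    """
    failures = collections.Counter()
    # Same flags as the failing open() call in the strace output.
    flags = os.O_RDWR | os.O_CREAT | os.O_EXCL
    for i in range(attempts):
        path = os.path.join(directory, "probe-%d-%d" % (os.getpid(), i))
        try:
            fd = os.open(path, flags, 0o600)
            os.close(fd)
            os.unlink(path)
        except OSError as exc:
            failures[errno.errorcode.get(exc.errno, str(exc.errno))] += 1
    return failures


if __name__ == "__main__":
    # Replace `target` with a directory on the filesystem under test.
    # A local temporary directory is used here so the sketch runs anywhere.
    target = tempfile.mkdtemp()
    print(dict(stress_tempfile_creation(target, attempts=100)))
    os.rmdir(target)
```

Running it against a directory with ACLs and one without would be one way to compare the two cases mentioned in this thread.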

comment:13 Changed 4 years ago by Scott Wales

Great, thanks for the update

comment:14 Changed 4 years ago by Chris Allen

I'm still seeing errors after today's maintenance - reported to storage team.

comment:15 Changed 4 years ago by Martin Dix

FCM 2015.05.0 adds named builds so that the extract and build steps can run in the same directory without interfering with each other.

There are now two possible ways of setting up the prebuilds in /g/data/access.

1. With the new fcm, the prebuild creation suite.rc has

{% set prebuild_path = '/g/data/access/prebuilds/vn10.2/fieldcalc' %}

    [[fcm_make]]
        [[[environment]]]
            ROSE_TASK_OPTIONS = --directory={{prebuild_path}} mirror.target={{prebuild_path}}

    [[fcm_make2]]
        [[[environment]]]
            ROSE_TASK_OPTIONS = --directory={{prebuild_path}} --name=2

and app/fcm_make/file/fcm_make.cfg has

mirror.prop{config-file.name} = 2

The job using the prebuild has

    [[fcm_make]]
        [[[environment]]]
            PREBUILD = {{prebuild_path}}

    [[fcm_make2]]
        [[[environment]]]
            PREBUILD = {{prebuild_path}}
            ROSE_TASK_OPTIONS = --name=2

Suites au-aa360 and au-aa361 are an example that builds just the UM fieldcalc utility.

2. Alternatively, use separate sub-directories. The creation suite uses

{% set prebuild_path = '/g/data/access/prebuilds/vn10.2/fieldcalc_alt' %}
{% set extract_prebuild = prebuild_path + '/extract' %}
{% set build_prebuild   = prebuild_path + '/build'   %}

    [[fcm_make]]
        [[[environment]]]
            ROSE_TASK_OPTIONS = --directory={{extract_prebuild}} mirror.target={{build_prebuild}}

    [[fcm_make2]]
        [[[environment]]]
            ROSE_TASK_OPTIONS = --directory={{build_prebuild}}

and the job using the prebuild has

    [[fcm_make]]
        [[[environment]]]
            PREBUILD = {{extract_prebuild}}

    [[fcm_make2]]
        [[[environment]]]
            PREBUILD = {{build_prebuild}}

Suites au-aa362 and au-aa363 are an example of this.

I don't have a strong preference for which style we use. In each case the job using the prebuild has to set something different in the two fcm_make tasks.
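To make the difference between the two styles easier to compare, here is a small sketch that reproduces the ROSE_TASK_OPTIONS and PREBUILD values listed above for each layout. The function names and the layout labels "named" and "split" are hypothetical, introduced only for this illustration; the option strings themselves are copied from the snippets above.

```python
def creation_options(prebuild_path, layout):
    """Return ROSE_TASK_OPTIONS for the prebuild creation suite, keyed by task."""
    if layout == "named":
        # Option 1: both steps share one directory; the second config is named "2".
        return {
            "fcm_make": "--directory={0} mirror.target={0}".format(prebuild_path),
            "fcm_make2": "--directory={0} --name=2".format(prebuild_path),
        }
    if layout == "split":
        # Option 2: separate extract/ and build/ sub-directories.
        extract = prebuild_path + "/extract"
        build = prebuild_path + "/build"
        return {
            "fcm_make": "--directory={0} mirror.target={1}".format(extract, build),
            "fcm_make2": "--directory={0}".format(build),
        }
    raise ValueError("unknown layout: %r" % (layout,))


def prebuild_locations(prebuild_path, layout):
    """Return the PREBUILD value a job using the prebuild sets for each task."""
    if layout == "named":
        # With this layout fcm_make2 additionally sets ROSE_TASK_OPTIONS = --name=2.
        return {"fcm_make": prebuild_path, "fcm_make2": prebuild_path}
    if layout == "split":
        return {
            "fcm_make": prebuild_path + "/extract",
            "fcm_make2": prebuild_path + "/build",
        }
    raise ValueError("unknown layout: %r" % (layout,))
```

For example, `creation_options('/g/data/access/prebuilds/vn10.2/fieldcalc', 'named')` gives the two option strings used in the first creation suite.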

PS I had a couple of errors doing the extraction to /g/data/access, but it seems better than it was a few months ago.

Last edited 3 years ago by Martin Dix

comment:16 Changed 4 years ago by Jin Lee

The problem reported earlier (see comment:7) with checkouts to /g/data/ does not seem to be completely fixed:

accessdev:/g/data/dp9/jtl548/source/ops> fcm co fcm:ops.x/branches/dev/jinlee/r269_ops32.0.0_nci
...
...
A    r269_ops32.0.0_nci/src/public/OpsMod_SignalHandler/OpsMod_SignalHandler.f90
A    r269_ops32.0.0_nci/src/public/OpsMod_SignalHandler/sigtrap.c
svn: E155009: Failed to run the WC DB work queue associated with '/g/data1/dp9/jtl548/source/ops/r269_ops32.0.0_nci/src/public/OpsMod_SignalHandler', work item 3559 (file-install src/public/OpsMod_SignalHandler/OpsMod_SignalHandler.f90 1 0 1 1)
svn: E000013: Can't move '/g/data1/dp9/jtl548/source/ops/r269_ops32.0.0_nci/.svn/tmp/svn-DWuBzq' to '/g/data1/dp9/jtl548/source/ops/r269_ops32.0.0_nci/src/public/OpsMod_SignalHandler/OpsMod_SignalHandler.f90': Permission denied
[FAIL] svn checkout https://code.metoffice.gov.uk/svn/ops/main/branches/dev/jinlee/r269_ops32.0.0_nci # rc=1

Is someone able to revisit this problem?

comment:17 Changed 4 years ago by Jin Lee

Cc: Jin Lee added

comment:18 Changed 4 years ago by Chris Allen

Storage team notified that we're still running into these filesystem errors.

comment:19 Changed 4 years ago by Scott Wales

According to Jin, checkouts to /g/data work on raijin but not on accessdev.

comment:20 Changed 4 years ago by Chris Allen

At the moment we're being advised that the upstream vendor expects to have a patch available before the end of this month.

comment:21 Changed 3 years ago by Chris Allen

It appears that the /g/data1 filesystem errors seen from VMs should be resolved now. Could you please try again?

comment:22 Changed 3 years ago by Martin Dix

Rose now automatically does --name=2 for fcm_make2 tasks (since 2015.05.0), https://github.com/metomi/rose/pull/1604. Now option 1 above is clearly preferable because the job using the prebuild doesn't have to do anything special.

See access/RoseSuitePrebuilds for instructions on creating prebuilds from a rose stem job.

Last edited 3 years ago by Martin Dix

comment:23 Changed 3 years ago by Martin Dix

An fcm issue meant that regular access group members couldn't use the prebuilds. See https://github.com/metomi/fcm/issues/226.

fcm/2016.02.0 has been patched on raijin to work around this.

It's fixed in fcm/2016.03.0.

Last edited 3 years ago by Martin Dix