Opened 3 years ago

Last modified 3 years ago

#271 assigned

Fix apparent random failure of "cycle_generate_bufr.ksh" which copies Bufr files to Raijin

Reported by: Jin Lee
Owned by: Robin Bowen
Priority: major
Component: ACCESS model
Keywords: APS2 APS3 Bufr remote copy sam MARS
Cc: Vinod Kumar, Susan Rennie, David Smith

Description

Currently, Bufr files used in the APS2 and APS3 global suites are copied from Ngamai to Raijin on demand by a set of scripts whose top-level script is "cycle_generate_bufr.ksh" (see #240 for details). Occasionally the script fails to copy one or more Bufr files to the destination directory on Raijin. The failures appear to be random.

Tracking down the cause of this intermittent problem definitively will be time-consuming, so Vinod and I decided to add some diagnostic echoes and tighter checks to the scripts. These are described in this ticket.
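
As an illustration only, one kind of check that could be added is a test that every expected file has arrived and is non-empty. The sketch below is not taken from the actual scripts: the function name check_copied_files, the "MISSING:" message format, and the demo file names are all hypothetical, and it assumes the check runs on a host where the destination directory is visible.

```shell
#!/bin/sh
# Hypothetical sketch: function, message format and file names are
# placeholders, not taken from the real ScpBufrToRaijin scripts.

# Print a diagnostic for each expected file that is missing or empty in
# the destination directory; return non-zero if any file fails the check.
check_copied_files() {
    dest_dir=$1
    shift
    status=0
    for name in "$@"; do
        # -s is true only if the file exists and has a size greater than zero.
        if [ ! -s "$dest_dir/$name" ]; then
            echo "MISSING: $dest_dir/$name" >&2
            status=1
        fi
    done
    return $status
}

# Small demonstration using a throwaway directory.
demo_dir="${TMPDIR:-/tmp}/bufr_check_demo.$$"
mkdir -p "$demo_dir"
echo "data" > "$demo_dir/a.bufr"
if check_copied_files "$demo_dir" a.bufr b.bufr; then
    echo "all files present"
else
    echo "copy incomplete"
fi
rm -rf "$demo_dir"
```

Returning a non-zero status, rather than only echoing, lets the calling script stop or retry instead of carrying on with an incomplete set of files.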

Change History (7)

comment:1 Changed 3 years ago by Jin Lee

After thinking about it a little longer, I think the random failure may be due to two "cycle_generate_bufr.ksh" processes running at the same time. The scripts use a fixed temporary file location, so two concurrent processes would share it and could interfere with each other. I think this is what happened.

A permanent fix is to add the process ID ("$$") to the name of the temporary file location used by the scripts.
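
A minimal sketch of this idea is shown below. It is not the code committed for this ticket: the function name make_private_tmpdir and the directory prefix "scp_bufr_tmp" are hypothetical, and the base directory is assumed to default to $TMPDIR or /tmp.

```shell
#!/bin/sh
# Hypothetical sketch: names are illustrative, not taken from the real scripts.

# Create a scratch directory whose name ends in the shell's process ID ($$),
# so that two concurrent runs never share the same location.
make_private_tmpdir() {
    base=${1:-${TMPDIR:-/tmp}}
    dir="$base/scp_bufr_tmp.$$"
    # mkdir without -p fails if the directory already exists, so a stale or
    # colliding directory is reported rather than silently shared.
    mkdir "$dir" || return 1
    echo "$dir"
}

work_dir=$(make_private_tmpdir) || exit 1
# Remove the scratch directory when the script exits.
trap 'rm -rf "$work_dir"' EXIT

echo "scratch directory: $work_dir"
```

Because "$$" expands to the process ID of the invoking shell, each run of the top-level script gets its own directory, and the trap cleans it up on exit.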

comment:2 in reply to:  1 Changed 3 years ago by Jin Lee

Replying to jtl548:

Another problem that has come to light is that the log files containing messages about missing Bufr files can be misleading. The scripts do not create new log files; instead, new messages are appended (using '>>') each time the scripts run. If a previous run resulted in one or more missing Bufr files and logged a message about it, a later run over the same time period reuses the same log file, so the old message is still present. This log file is then copied to Raijin.
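
One possible way to avoid stale messages, sketched here under assumptions, is to truncate the log with '>' at the start of each run and append with '>>' only after that. The helper names start_log and log_msg, the log file path, and the message text are hypothetical and are not taken from the actual scripts.

```shell
#!/bin/sh
# Hypothetical sketch: log path, helper names and messages are illustrative only.
log_file="${TMPDIR:-/tmp}/missing_bufr.$$.log"

# Truncate (or create) the log at the start of a run, so that messages
# from an earlier run cannot carry over.
start_log() {
    : > "$1"
}

# Append one message to the log.
log_msg() {
    echo "$2" >> "$1"
}

# First run: one file is reported missing.
start_log "$log_file"
log_msg "$log_file" "run 1: Bufr file missing"

# Rerun over the same period: the log is reset before any new message.
start_log "$log_file"
log_msg "$log_file" "run 2: all Bufr files copied"

cat "$log_file"
```

After the second run the log contains only the "run 2" message, so the copy sent to Raijin reflects the current run alone.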

comment:3 Changed 3 years ago by Jin Lee

Cc: Susan Rennie, David Smith added

I made a code change to the working copy at:

raijin2:/projects/access/da/utilities/odb/scripts/ScpBufrToRaijin

I then committed the change as r522: https://access-svn.nci.org.au/trac/nwp/browser/da/utilities/odb/scripts/ScpBufrToRaijin?rev=522

The change will be rsync'ed to Ngamai shortly.

comment:4 in reply to:  3 Changed 3 years ago by Jin Lee

Replying to jtl548:

Note that because the rsync from Raijin to Ngamai does not copy the hidden subdirectory ".svn", I had to manually run "svn update" on Ngamai to bring the Ngamai copy to the correct revision, r522.

comment:5 Changed 3 years ago by Jin Lee

Owner: changed from Jin Lee to Robin Bowen
Status: changed from new to assigned

Robin,

Can you update your rsync script so that it copies the hidden directory, ".svn"? When the change is made I will close the ticket.

Jin

comment:6 Changed 3 years ago by Robin Bowen

hi Jin

the .svn directories are the same on Raijin and Ngamai, except for some extra ones on Ngamai ?

Were they all missing before you did svn update ?

cheers

Robin

comment:7 in reply to:  6 Changed 3 years ago by Jin Lee

Replying to rmb548:

Yes, I think the ".svn" directory on Ngamai was missing entirely. I think I had to do a fresh check-out (svn co) on Ngamai to make sure that ".svn" was created.

Please note that "svn update" would not have worked as SVN would not have recognised "ngamai01:/g/sc/ophome/access/da/utilities/odb/scripts/ScpBufrToRaijin" as a valid working copy of an SVN project since there was no ".svn" directory. That's why I had to do "svn co".
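
The decision described above can be sketched as a small check. This is an illustration only: the function names is_working_copy and choose_svn_action are hypothetical, and the test for a ".svn" subdirectory assumes either that the directory is the root of the working copy or that the Subversion client keeps ".svn" in every directory (as clients older than 1.7 do).

```shell
#!/bin/sh
# Hypothetical sketch: function names are illustrative only.

# Treat a directory as an SVN working copy only if it contains a ".svn"
# subdirectory (assumption: the directory is the working copy root, or the
# client stores ".svn" in every directory).
is_working_copy() {
    [ -d "$1/.svn" ]
}

# Report which SVN operation would be needed to bring the directory to a
# given revision: "update" for an existing working copy, else "checkout".
choose_svn_action() {
    if is_working_copy "$1"; then
        echo "update"
    else
        echo "checkout"
    fi
}

# Small demonstration using throwaway directories.
demo_root="${TMPDIR:-/tmp}/svn_wc_demo.$$"
mkdir -p "$demo_root/with_svn/.svn" "$demo_root/without_svn"
echo "with_svn: $(choose_svn_action "$demo_root/with_svn")"
echo "without_svn: $(choose_svn_action "$demo_root/without_svn")"
rm -rf "$demo_root"
```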
