Opened 2 years ago

Last modified 3 months ago

#318 accepted

OASIS error handling

Reported by: Martin Dix
Owned by: Martin Dix
Priority: minor
Component: ACCESS-CM2
Keywords:
Cc:

Description (last modified by Martin Dix)

While running individual components of the coupled suite explicitly from the cylc GUI, I accidentally started a model run in which some of the OASIS initial files (e.g. a2i.nc) were missing.

The job.out file reported:

MPI_ABORT was invoked on rank 576 in communicator MPI_COMM_WORLD 
with errorcode 0.

job.err also had stack traces from the UM abort, but cylc reported that the run had succeeded. In this case PE576 is the first PE used by CICE. Running on a smaller atmospheric decomposition gave the ABORT from PE0, so the aborting PE depends on which component is first to find that a required file is missing.

Change History (6)

comment:1 Changed 2 years ago by Martin Dix

Description: modified
Summary: changed from "CICE error handling" to "OASIS error handling"

comment:2 Changed 2 years ago by Martin Dix

In OASIS, psmile/src/mod_oasis_io.F90 has several nf90_open calls with the same structure:

      inquire(file=trim(filename),exist=exists)
      if (exists) then
         status = nf90_open(trim(filename),NF90_NOWRITE,ncid)
         IF (status /= nf90_noerr) WRITE(nulprt,*) subname,' model :',compid,' proc :',&
                                                   mpi_rank_local,':',TRIM(nf90_strerror(status))
      else
         write(nulprt,*) subname,' ERROR: file missing ',trim(filename)
         WRITE(nulprt,*) subname,' abort by  model :',compid,' proc :',mpi_rank_local
         CALL oasis_flush(nulprt)
         call oasis_abort_noarg()
      endif

If the file is missing, this calls oasis_abort_noarg, which has

CALL MPI_ABORT (mpi_comm_global, 0, ierror)

as does oasis_abort. There's also an oasis_mpi_abort routine which does set a non-zero error status.

The "file missing" message is written to ATM_RUNDIR/debug.root.01 (when the failure occurs on PE0).

In OASIS3-MCT3.0 the same file handling code calls oasis_abort(), which calls MPI_ABORT with a default errorcode of 1 unless a different value is passed as an argument.
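
To make the "default errorcode of 1" behaviour concrete, here is a minimal sketch of an abort helper with an optional errorcode argument. This is not the OASIS source: the module name abort_sketch, the routine sketch_abort, and the demo program are hypothetical, and forcing a zero code to 1 is my own assumption about what is wanted. How the errorcode becomes the job's exit status is up to the MPI implementation and launcher.

```fortran
! Hypothetical sketch, not the OASIS source. All names are illustrative.
module abort_sketch
   use mpi
   implicit none
   private
   public :: sketch_abort
contains
   subroutine sketch_abort(comm, errorcode)
      integer, intent(in)           :: comm       ! communicator to abort
      integer, intent(in), optional :: errorcode  ! caller-supplied code, if any
      integer :: code, ierror

      code = 1                                    ! default: non-zero, i.e. failure
      if (present(errorcode)) code = errorcode
      if (code == 0) code = 1                     ! assumption: an abort never reports success

      call MPI_Abort(comm, code, ierror)
   end subroutine sketch_abort
end module abort_sketch

program demo
   use mpi
   use abort_sketch, only: sketch_abort
   implicit none
   integer :: ierror
   logical :: exists

   call MPI_Init(ierror)
   ! Same pattern as the file check above: abort if the input file is absent.
   inquire(file='a2i.nc', exist=exists)
   if (.not. exists) call sketch_abort(MPI_COMM_WORLD)
   call MPI_Finalize(ierror)
end program demo
```

Run under mpirun in a directory without a2i.nc, this should abort with errorcode 1 rather than 0, which is what a scheduler such as cylc needs in order to see the task as failed.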

Last edited 2 years ago by Martin Dix

comment:3 Changed 2 years ago by Martin Dix

Description: modified
Owner: set to Martin Dix
Status: changed from assigned to accepted

comment:4 Changed 2 years ago by Martin Dix

I initially (and incorrectly) blamed CICE, and noticed that the CICE routine mpi/ice_exit.F90 has

      write (ice_stderr,*) error_message
      call flush_fileunit(ice_stderr)

      call MPI_ABORT(MPI_COMM_WORLD, ierr)

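The MPI_ABORT call is missing the errorcode argument. The Fortran interface is MPI_ABORT(comm, errorcode, ierror), so ierr ends up in the errorcode position and supplies a value of 0. This is used as the final program exit status, so cylc thinks the run succeeded.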

Routine definition at https://www.open-mpi.org/doc/v1.10/man3/MPI_Abort.3.php

Also, ice_fileunits.F90 has

       ice_stderr =  6    ! reserved unit for standard error

Unit 6 is conventionally standard output, so the error message is written to job.out rather than job.err. For example, this happens when running CICE with a decomposition inconsistent with the executable.
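
The two issues in this comment (the missing errorcode argument and unit 6 being used for ice_stderr) could be addressed along the following lines. This is only a sketch and not the CICE source or the change that was eventually committed: the routine name sketch_abort_ice and the constant abort_code are hypothetical, and it takes the standard error unit from the intrinsic iso_fortran_env module rather than hard-coding a unit number.

```fortran
! Hypothetical sketch, not the CICE source. Names are illustrative.
program abort_to_stderr
   use, intrinsic :: iso_fortran_env, only: error_unit
   use mpi
   implicit none
   integer :: ierr

   call MPI_Init(ierr)
   call sketch_abort_ice('example error message')
   call MPI_Finalize(ierr)   ! not reached: the call above aborts

contains

   subroutine sketch_abort_ice(error_message)
      character(len=*), intent(in) :: error_message
      integer, parameter :: abort_code = 1   ! assumption: any non-zero value would do
      integer :: ierr

      ! error_unit is the processor's standard error unit, so the message
      ! should land in job.err rather than job.out.
      write (error_unit, *) trim(error_message)
      flush (error_unit)

      ! All three arguments given: communicator, errorcode, ierror.
      call MPI_Abort(MPI_COMM_WORLD, abort_code, ierr)
   end subroutine sketch_abort_ice
end program abort_to_stderr
```

Using `use mpi` instead of `include 'mpif.h'` also gives the compiler an explicit interface where the MPI library provides one, so a call with a missing argument like the one above is more likely to be caught at compile time.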

Last edited 2 years ago by Martin Dix

comment:5 Changed 2 years ago by Martin Dix

The OASIS branch https://access-svn.nci.org.au/trac/oasis/browser/branches/dev/mrd599/oasis3-mct-errorhandling duplicates the error messages to stderr and also ensures a non-zero exit status from an abort. A run without a2i.nc now has this message in job.err:

 oasis_io_read_avfile ERROR: file missing a2i.nc
 oasis_io_read_avfile model :           1  proc :           0
 oasis_io_read_avfile abort by model :           1  proc :           0

and cylc recognises that it's failed.

comment:6 Changed 3 months ago by Martin Dix

The CICE changes mentioned in comment 4 above are now included in the standard CMIP6 branch: https://access-svn.nci.org.au/trac/cice/changeset/403
