
Version 12 (modified by Jin Lee, 2 months ago)


Problems encountered when OPS reads observational data from obstore files

There are two ways to read data from obstores. One is to let Ops_ExtractAndProcess work out all the settings from the obstore and read the observations itself. The other is to let Ops_CreateODB read the obstore and write out ODB1, then let Ops_ExtractAndProcess read that ODB1 and write the results back to it.

Run Ops_ExtractAndProcess to read obstore

Here's a list of things to keep in mind when running Ops_ExtractAndProcess this way:

  • It's safer to let OPS determine various parameters - e.g. batch numbers, buffer sizes, etc. - rather than setting them in the extractcontrolnl namelist. This means removing the entire extractcontrolnl namelist from the OPS app config file, as well as the file that normally holds the namelist.
  • To stop the extractcontrolnl namelist being used at all, it is not enough to remove the namelist and the file that contains it from the app config file; you must also delete the file "<obstype>.nl" from the Rose working directory,
<top-level Cylc directory on remotehost>/work/<cycletime>/glu_ops_process_background_<obstype>/ops_extract_control
  • Estimate the amount of memory required to read the observations and allocate space within the OPS program, then make the PBS resource request just large enough to finish processing. My hunch is that the failures stem from some PEs receiving too few observations, so that OPS allocates memory for certain variables during CX creation which may be empty (again, this is only a hunch).
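The namelist clean-up above can be sketched as a small shell step. All of the variable values here are illustrative assumptions (suite name, cycle time, obstype) - substitute your own; only the directory layout follows the path quoted above.

```shell
# Remove the leftover "<obstype>.nl" file so OPS derives its own extract
# settings instead of reading a stale extractcontrolnl namelist.
# SUITE_DIR, CYCLE and OBSTYPE are placeholders -- adjust for your suite.
SUITE_DIR=${SUITE_DIR:-$HOME/cylc-run/my-suite}
CYCLE=${CYCLE:-20240101T0000Z}
OBSTYPE=${OBSTYPE:-amsr2}
CTRL_DIR="$SUITE_DIR/work/$CYCLE/glu_ops_process_background_${OBSTYPE}/ops_extract_control"

# -f makes this a no-op if the file has already been removed
rm -f "$CTRL_DIR/${OBSTYPE}.nl"
```

Remember this file is regenerated from the app config, so the namelist must also be removed there or the deletion will not stick across cycles.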

Run Ops_CreateODB and Ops_ExtractAndProcess to read obstores and write out ODB1

Outside the UKMO this method should be used, as it produces an updated ODB1 which can be used by VER. Here's a list of things to keep in mind when running Ops_CreateODB and Ops_ExtractAndProcess this way:

  • Make sure maxbatchessubtype is set high enough to read all the batches in an obstore file. If not all the data are read in, you will see in stdout/stderr a message like,
More batches of AMSR2 data are available but batch 21 is the Final batch.
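A quick way to catch this is to scan the task's stdout for that message after each run. The function name and log path below are illustrative assumptions; the matched text is the message quoted above.

```shell
# Warn if OPS stopped reading batches before the obstore was exhausted.
# Pass the path to the task's stdout (e.g. the Cylc job.out file).
check_batches() {
  if grep -q "is the Final batch" "$1"; then
    echo "WARNING: maxbatchessubtype too low -- not all batches were read"
  fi
}

# Example (log path is hypothetical):
# check_batches ~/cylc-run/my-suite/log/job/20240101T0000Z/glu_ops_process_background_amsr2/NN/job.out
```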

One difficulty with this second method is that it is hard to know whether OPS retrieved all the data from an obstore file correctly. To make sure that all data are retrieved, run the first method first, then put together the app config file for the second method while comparing the log output of the two.

  • Depending on the size of each batch in the obstore, the job may run out of memory:
    • this does not depend on the number assigned to maxbatchessubtype
    • increasing maxbatchessubtype beyond the number of batches in the obstore doesn't seem to have any effect
    • for a certain type of memory error, the failure appears to occur when the program is distributing observations to other PEs; the solution for this type of error is to increase the PBS core request
  • For obstore files which have large numbers of batches, failures can occur in either Ops_CreateODB or Ops_ExtractAndProcess:
    • Ops_CreateODB fails - decrease buffersize (roughly the number of observations in each batch) in inverse proportion to the larger number of batches
    • Ops_ExtractAndProcess fails - if the failure happens towards the end of processing, where the updating of ODB1 takes place, then increasing the number of nodes and the memory can fix the problem
    • for some obsgroups - e.g. sonde - the number of batches used in the obstore may be unusually large. This is fixed by using nodes with larger memory.
    • for the satwind and surface obstypes, no amount of fine-tuning allows the tasks to read all observations. It's possible that the number of observations reported by print-obstore is incorrect
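The inverse-proportion rule for buffersize can be sketched with simple shell arithmetic. The numbers are purely illustrative, not taken from a real obstore; the point is only that a fourfold increase in batches suggests roughly a fourfold decrease in buffersize.

```shell
# Illustrative inverse-proportion scaling of buffersize.
# If this obstore has 4x the usual number of batches, cut buffersize by ~4x.
usual_batches=20        # assumed typical batch count
usual_buffersize=10000  # assumed working buffersize at that count
actual_batches=80       # batch count reported for the problem obstore

buffersize=$(( usual_buffersize * usual_batches / actual_batches ))
echo "$buffersize"   # prints 2500 for these illustrative numbers
```

Treat the result as a starting point only; the exact value that lets Ops_CreateODB succeed still has to be found by trial runs.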

Resources

  • Susan opened an OPS ticket where a related issue is discussed.