
Problems encountered when OPS reads observational data from obstore files

There are two ways to read data from obstores. The first is to let Ops_ExtractAndProcess work out all settings from the obstore and read the observations directly. The second is to let Ops_CreateODB read the obstore and write out ODB1, then let Ops_ExtractAndProcess read that ODB1 and write the results back to it.

Run OpsProg_ExtractAndProcess.exe to read obstore

Here's a list of things to keep in mind when running OpsProg_ExtractAndProcess.exe this way:

  • It's safer to let OPS determine the various parameters - e.g. batch numbers, buffer sizes, etc. - rather than setting them in the extractcontrolnl namelist. This means removing the entire extractcontrolnl namelist, as well as the file that normally holds it, from the OPS app config file
  • Removing the namelist and its file from the app config file is not enough to stop the extractcontrolnl namelist being used: you also need to delete the file "<obstype>.nl" from the Rose working directory,
<top-level Cylc directory on remotehost>/work/<cycletime>/glu_ops_process_background_<obstype>/ops_extract_control
  • Estimate the amount of memory required to read the observations and allocate space within the OPS program, then request just enough PBS resources to finish processing. This is based on my hunch that the failure occurs because some PEs may not receive enough observations, so that during CX creation OPS allocates memory to variables which may be empty (again, this is only my hunch).
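
The clean-up of the leftover "<obstype>.nl" file described in the list above could be scripted roughly as follows. This is only a sketch: the function names and arguments are hypothetical, the directory layout is taken from the path quoted above, and the top-level Cylc directory, cycle time and obstype must be supplied by the user.

```python
from pathlib import Path


def extract_control_namelist(cylc_top: str, cycletime: str, obstype: str) -> Path:
    """Build the path of the <obstype>.nl file in the Rose working directory.

    The layout follows the path quoted in the text; the function itself is
    a hypothetical helper, not part of OPS, Rose or Cylc.
    """
    return (
        Path(cylc_top)
        / "work"
        / cycletime
        / f"glu_ops_process_background_{obstype}"
        / "ops_extract_control"
        / f"{obstype}.nl"
    )


def remove_namelist(cylc_top: str, cycletime: str, obstype: str) -> bool:
    """Delete the namelist file if it exists; return True if a file was removed."""
    path = extract_control_namelist(cylc_top, cycletime, obstype)
    if path.is_file():
        path.unlink()
        return True
    return False
```

Calling remove_namelist a second time for the same task returns False, so it is safe to run before every resubmission.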

Run OpsProg_CreateODB.exe and OpsProg_ExtractAndProcess.exe to read obstores and write out ODB1

Outside of UKMO this method should be used, as it produces an updated ODB1 which can be used by VER. Here's a list of things to keep in mind when running OpsProg_CreateODB.exe and OpsProg_ExtractAndProcess.exe this way:

  • Make sure maxbatchessubtype is set high enough to read all the batches in an obstore file. If not all the data are read in, you will see in the stdout/stderr from OpsProg_CreateODB.exe a message like,
More batches of AMSR2 data are available but batch 21 is the Final batch.

One difficulty with this second method is that it's hard to know whether OPS retrieved all the data from an obstore file correctly. To make sure that all data are retrieved, run the first method first, then put together the app config file for the second method while comparing the log output of the two.
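
One way to spot an incomplete read is to scan the OpsProg_CreateODB.exe stdout/stderr for the message quoted above. The following is a minimal sketch: the regular expression is built from that single example message, so the exact wording for other subtypes or OPS versions is an assumption, and find_truncated_reads is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# Pattern derived from the one message quoted in the text; other OPS versions
# may word it differently (assumption).
TRUNCATED = re.compile(
    r"More batches of (?P<subtype>\S+) data are available "
    r"but batch (?P<batch>\d+) is the Final batch"
)


def find_truncated_reads(log_text: str) -> List[Tuple[str, int]]:
    """Return a (subtype, final batch number) pair for each truncation message."""
    return [
        (match.group("subtype"), int(match.group("batch")))
        for match in TRUNCATED.finditer(log_text)
    ]
```

An empty result only means the message was not found; it does not prove that every observation was retrieved, so comparing against the log from the first method is still needed.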

  • Depending on the size of each batch in the obstore, the job may run out of memory,
    • this does not depend on the number assigned to maxbatchessubtype
    • increasing maxbatchessubtype beyond the number of batches in the obstore doesn't seem to have any effect
    • one type of memory error appears to occur when the program is trying to distribute observations to other PEs; the solution for this type of error is to increase the PBS core request
  • For obstore files which have a large number of batches, failures can occur with either OpsProg_CreateODB.exe or OpsProg_ExtractAndProcess.exe:
    • OpsProg_CreateODB.exe fails - decrease buffersize (which is roughly the number of observations in each batch) in inverse proportion to the increase in the number of batches
    • OpsProg_ExtractAndProcess.exe fails - if the failure happens towards the end of processing, where ODB1 is updated, then increasing the number of nodes and the memory can fix the problem
    • for some obsgroups - e.g. sonde - the number of batches in the obstore may be unusually large. Using nodes which have more memory fixes this.
    • for the satwind and surface obstypes no amount of fine-tuning allows the tasks to read all observations. It's possible that the number of observations reported by print-obstore is not correct
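
The inverse-proportion adjustment of buffersize suggested above for OpsProg_CreateODB.exe failures can be written as a small calculation. This is a sketch only: scaled_buffersize is a hypothetical helper, and the reference values (a buffersize and batch count from a configuration known to work) are assumptions the user must supply.

```python
def scaled_buffersize(
    reference_buffersize: int, reference_batches: int, actual_batches: int
) -> int:
    """Shrink buffersize in inverse proportion to the growth in batch count.

    reference_* describe a configuration that is known to work (assumption);
    actual_batches is the batch count of the obstore that is failing.
    The result is never larger than the reference and never below 1.
    """
    if reference_batches <= 0 or actual_batches <= 0:
        raise ValueError("batch counts must be positive")
    scaled = reference_buffersize * reference_batches // actual_batches
    return max(1, min(reference_buffersize, scaled))
```

For example, with purely illustrative numbers, a reference buffersize of 100000 that works for 20 batches scales to 25000 for an obstore with 80 batches.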

Resources

  • Susan opened an OPS ticket where a related issue is discussed
Last modified on Oct 4, 2019 11:40:29 AM