wiki:ticket/298/TicketSummary

Assimilating Himawari-8/AHI clear-sky radiances in APS3 global suite

Aim

  1. Single observation test - look at vertical and horizontal spread of a CSR radiance
  2. Effect of AHI CSR radiances on 4DVAR convergence
  3. "OSDP 6 - SatRad processing of geostationary clear-sky radiances" reports,
    1. "Improvement in analysed winds and humidities between 400 and 800 hPa has been observed."
    2. "how well it [analysis] is fitted by moisture-sensitive channels on other instruments: this has been found to be improved in more cases than not."

These two findings will be tested in our configuration.

Experimental Set-up

Overview

Suites used

The following suites were used:

Type | Suite | Summary of changes | Comment
Control | u-aj730 | A copy of the standard APS3 global suite, u-ag312@24892 (equivalent to UKMO PS38 suite). | Note. At the particular revision when the suite was copied the trial period was 20160515T06 - 20160731T12. This suits my purpose as H8/AHI CSR data are available from marsdev starting from May 2016 (???? when????)
Single-obs trial | u-aj977 | A copy of u-aj730 with modifications to make OPS tasks for AHI CSR work; further modifications to assimilate only a single obs |
Longer trial | u-am137 | A copy of u-aj730 with modifications to make OPS tasks for AHI CSR work; copy of u-aj977 at an earlier revision plus additional mods |

Note 1. u-ag312 differs from PS38 suite, u-ad365: see rose-suite.conf

Note 2. An error was introduced to u-ag312, which propagated to all child suites including u-aj730 and u-aj977. The error was the use of the PS37 initial VarBC file instead of the PS38 one, which meant the coefficients for FY-3B and Himawari-8 were missing. The 2 suites were updated to use the PS38 initial VarBC file.

OPS build used

My development OPS branch is r3192_810_ahicsr_bom_bufr

  • Before making any changes I built this and did a quick test to see whether it produces the same Aircraft varobs. The Aircraft varobs files from this build are identical to those from the build '/projects/access/nwpdir/share/APS3/OPS/ops-2016.03.0'.
  • See here for OPS code changes to make Ops_CreateODB correctly do the Bufr-to-ODB conversion of Bufr files received from MSC

Turning on gl_ops_process_ahiclear app and related apps

An NCI optional app config was created to replace data extraction from MetDB (used by UKMO) with a direct conversion of HIMCSR Bufr files to ODB databases.

Impact of a Single Observation

Selection of clear AHI CSR segments

To obtain complete information about every "FOV" (JMA/MSC uses the term "segment" because each channel observation that OPS processes is an aggregation of 16x16 pixels), the SatRad NetCDF writefile was turned on.

To reduce the amount of data read in by OPS, only a single split HIMCSR Bufr file for 20160515T06 was used.

From the writefile a single clear FOV was selected based on its QC flags, and its lat/lon was noted:

idx=6860
lon[idx]=108.1577
lat[idx]=-28.04669

Then filtering was applied using the "extractcontrolnl{ahiclear}" namelist of the "gl_ops_process_ahiclear" app config, specifying geographic bounds in order to reduce the number of FOVs input to OPS to about a dozen. The following setting allowed 4 FOVs to be passed to the varobs file:

[namelist:extractcontrolnl{ahiclear}(1)]
...
NorthBound=-28.0
SouthBound=-29.0
EastBound=109.0
WestBound=108.0

I refined the extractcontrolnl namelist to allow a single FOV to be passed to varobs using the following:

[namelist:extractcontrolnl{ahiclear}(1)]
...
NorthBound=-28.04
SouthBound=-28.05
EastBound=108.16
WestBound=108.15
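
The effect of the two sets of bounds can be checked with a short sketch. It assumes the bounds are applied as a simple inclusive latitude/longitude box, which may not match exactly how OPS compares them; the `Bounds` class and its `contains` method are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Hypothetical stand-in for the NorthBound/SouthBound/EastBound/WestBound settings."""
    north: float
    south: float
    east: float
    west: float

    def contains(self, lat: float, lon: float) -> bool:
        # Assumption: inclusive comparison, no wrap-around at the date line.
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Values copied from the two namelist settings above.
coarse = Bounds(north=-28.0, south=-29.0, east=109.0, west=108.0)
fine = Bounds(north=-28.04, south=-28.05, east=108.16, west=108.15)

# Selected clear FOV (idx=6860 in the writefile).
lat, lon = -28.04669, 108.1577

print("inside coarse box:", coarse.contains(lat, lon))
print("inside fine box:  ", fine.contains(lat, lon))
```

The fine box is 0.01 degrees on each side, so only FOVs whose centres lie very close to the selected one can pass.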

Next I modified the channel selection file to allow only a single channel observation to be written out to varobs????

For comparison I did a single-observation test using IASI (see here)

Longer trial

  • I modified u-am137 to archive ODB2 and SatRad NetCDF writefiles for ahiclear.
  • Trial period is from 20160515T06 till 20160619T00 (5 weeks; 1 week of spin-up followed by 4 weeks of clean trial)
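
The trial period can be enumerated with a few lines of Python. This is a sketch only: it assumes 6-hourly cycling (consistent with the T00/T06/T12/T18 cycle times in the diary) and that spin-up is counted as exactly 7 days from the first cycle.

```python
from datetime import datetime, timedelta

FMT = "%Y%m%dT%H"
start = datetime.strptime("20160515T06", FMT)
end = datetime.strptime("20160619T00", FMT)
step = timedelta(hours=6)      # assumed cycling interval
spinup = timedelta(weeks=1)    # assumed spin-up length

# Build the list of cycle times, both end points included.
cycles = []
t = start
while t <= end:
    cycles.append(t)
    t += step

# Cycles after the spin-up week form the clean part of the trial.
clean = [c for c in cycles if c >= start + spinup]

print("total cycles:", len(cycles))
print("first clean cycle:", clean[0].strftime(FMT))
print("trial length in days:", (end - start) / timedelta(days=1))
```

Under these assumptions the first cycle of the clean period is 20160522T06.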

Results

Impact of a Single Observation

The vertical levels where the analysis increments are largest do not coincide with the peak heights of the weighting functions. To better understand the relationship between the two, try the following experiments:

  • 3DVAR - this eliminates the effect of PF and its adjoint
  • non-hybrid - this tests the spreading of observational information by static covariance
  • Try assimilating channel obs from IASI - weighting functions of IASI channels are sharper so may be easier to interpret

Note 1. ensemble covariance may not be used at higher levels in UKMO hybrid VAR
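
As background for the experiments listed above: for a single observation with a linear observation operator h (one row, i.e. the weighting function), background error covariance B, observation error variance r and innovation d, the static (3DVAR-like) analysis increment is B h^T d / (h B h^T + r). The increment therefore has the vertical shape of B h^T rather than of h itself. The toy sketch below illustrates this on a 1-D column. Every number in it (number of levels, weighting function, error standard deviations, correlation length) is made up for illustration and is not taken from the UKMO covariances or from AHI.

```python
import math

n = 50                       # number of toy model levels
levels = range(n)

# Hypothetical weighting function (one row of H), peaking at level 20.
h = [math.exp(-0.5 * ((k - 20) / 4.0) ** 2) for k in levels]

# Hypothetical background error: standard deviation grows with level,
# with Gaussian vertical correlations (length scale of 5 levels).
sigma = [math.exp(k / 10.0) for k in levels]


def b(i, j):
    """Element (i, j) of the toy background error covariance matrix B."""
    corr = math.exp(-0.5 * ((i - j) / 5.0) ** 2)
    return sigma[i] * corr * sigma[j]


# B h^T: the weighting function mapped through the background covariance.
bht = [sum(b(i, j) * h[j] for j in levels) for i in levels]

hbht = sum(h[i] * bht[i] for i in levels)   # scalar h B h^T
r = 1.0                                     # assumed observation error variance
d = 1.0                                     # assumed innovation (y - H x_b)

# Single-observation analysis increment.
dx = [v * d / (hbht + r) for v in bht]

peak_h = max(levels, key=lambda k: h[k])
peak_dx = max(levels, key=lambda k: abs(dx[k]))
print("weighting function peaks at level", peak_h)
print("analysis increment peaks at level", peak_dx)
```

In this toy case the increment peak is pulled towards the levels with larger background error variance, which is one way the increment maximum and the weighting-function peak can separate even without the PF model or ensemble covariances.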

Longer trial

Diary

u-aj730
Cycle time Failed task Reason for failure Action taken
20160523T0000Z glm_var_anal_n216 stdout and stderr
20160524T1200Z glm_um_recon_em_n108, glm_um_recon_em_n216 My disk quota was exceeded and the suite was in a strange state; when I reran the tasks, the ensemble forecast files from the previous cycle had been cleaned out and these tasks failed. Looking at the archive, it appears that all the cycles up to 20160524T18 ran and the current cycle is 20160525T00. Reset all cycles up to 20160524T18 to succeeded and continued with 20160525T00
20160526T06 engl_ens_addssts_031 stdout has a message,

WARNING: Merging SST Perts with ETKF perts has not worked for member 031.

stderr has,

/home/548/jtl548/cylc-run/u-aj730/share/fcm_make_var-opt/build-serial/bin/VarScr_UMFileUtils: line 296: 1100: Bus error

perturbed SST field seems to be in the analysis perturbation so triggered task again and it succeeds (problem with hardware?)
20160527T0000Z glu_var_anal_n216 Looking at stderr, some processes were interrupted while writing out analysis increment file: Var_WriteAnalPFUM.f90 -> Var_WriteModel.f90; while others Var_WriteAnalPFUM.f90 -> Var_PFexner.f90 -> Var_SwapBounds.f90 -> mpl_waitall_ Reran the task and it succeeded
20160528T0000Z engl_um_fcst_long_009 It appears there was an MPI-related problem which forced the job to get stuck and then PBS killed the job as it exceeded walltime request. stderr has following trace: for one process, um_main.F90 -> um_shell.F90 -> gc_init_thread.F90 -> mpl_init_thread.F90; for another process, um_main.F90 -> um_shell.F90 -> um_config.F90 -> umprintmgr.F90 -> gc_ibcast.F90 -> mpl_bcast.F90 Reran and the task worked
20160528T0000Z engl_um_fcst_long_ss_009 It looks like the UM history file from the long timestep was deleted. stderr has Cannot read history file /home/548/jtl548/cylc-run/u-aj730/share/cycle/20160528T0000Z/engl_um_009/engla.xhist As engl_um_fcst_long_009 succeeded this task didn't need to run
20160528T1200Z and 20160528T1800Z On average once per cycle an engl_ens_addssts_* task seems to be stuck in the submitted state although the job is not in the queue. The 'qstat -f -x' command reports that the jobs failed, cause unknown. Reset the task to the failed state and then trigger it
20160606T0000Z glm_ops_bge_atmos job.err has following message:

ERROR: task messaging failure.
unsupported operand type(s) for +: 'NoneType' and 'str'
Reran the task and it completes successfully. Other tasks failed with identical message in job.err. Was there problem with PBS or software that handles messaging?
20160605T1800Z engl_ens_smcperts job.err has:

/home/548/jtl548/cylc-run/u-aj730/share/fcm_make_um_utils/utilities/bin/um-fieldcalc: line 338: 22520 Killed $fieldcalc_exec [FAIL] Problem with Fieldcalc program
Reran task and it succeeds. Cause of the failure unknown.
20160607T1200Z engl_ver_hk_ard stdout has:

Job 8677782.r-man2 killed due to exceeding jobfs quota. Quota: 200.0MB, Used: 255.92MB, Host: r2760
Reran task and it succeeds with the size of ARD_EG bigger than before
20160608T1200Z I'm experiencing a number of what appears to be random failures with error message (stderr):

ERROR: task messaging failure.
unsupported operand type(s) for +: 'NoneType' and 'str'
Received signal ERR
ERROR: task messaging failure.
unsupported operand type(s) for +: 'NoneType' and 'str'

It turned out that I'm using CYLC_VERSION=7.1.0 and ROSE_VERSION=2017.05.0 (the latest installed versions on Accessdev are 7.4.0 and 2017.05.0). Thinking that this mixing of versions was the cause of the problem, I decided to restart the suite using the latest Cylc and Rose versions.
Allowed all tasks of cycletime=20160608T12 to finish, then warm-started the suite afresh from 20160608T18
20160614T1800Z engl_ens_smcperts stderr has:

Job 9630791.r-man2 has exceeded memory allocation on node r38

Reran task
20160616T0600Z engl_ver_hk_ard stderr has:

Received signal TERM

stdout has:

Job 9705892.r-man2 killed due to exceeding jobfs quota. Quota: 200.0MB, Used: 325.95MB, Host: r2759
In the PBS resource request the jobfs requested is 200MB, which in this case was exceeded. The PBS jobfs resource request was increased to 500 MB (???? also in u-am137????)
20160617T12 glu_ops_process_background_satwind and glm_ops_process_background_satwind stderr has following message:

*** glibc detected *** /home/548/jtl548/cylc-run/u-am137/share/fcm_make_ops/build/bin/OpsProg_CreateODB.exe: double free or corruption (out): 0x00002ab50c2dd010 ***

This failure is the same as the failure at the same cycle for u-am137. Repeated runs seem to give the same failure, but occasionally other errors occur. I ran the glu_ops_process_background_satwind task for 20160617T18 as a test and it succeeded, so it looks like there is a problem with the AMV Bufr files for cycle 20160617T12. N.B. it looks like the Bufr-to-ODB conversion succeeded and there is an ODB for AMVs
Decided to reset glu_ops_process_background_satwind and glm_ops_process_background_satwind to succeeded and let the suite continue
20160617T12 gl_ver_hk_ard stdout has following message,

Job 9880142.r-man2 killed due to exceeding jobfs quota. Quota: 100.0MB, Used: 130.72MB, Host: r47
In the family, [GL_VER_HK_FDB_AND_ARD] added PBS resource request for jobfs of 500 MB; also added the same to the family, [GL_VER_HK_FDB_AND_ARD_EC]
20160618T1800Z engl_ens_smcperts stderr has:

Error from routine: portio2a:flush_unit_buffer

/home/548/jtl548/cylc-run/u-aj730/share/fcm_make_um_utils/utilities/bin/um-fieldcalc: line 338: 10767 Aborted $fieldcalc_exec
Reran and the task succeeds
20160618T18 glu_ops_odb_to_odb2_satwind stdout has:

*** Fatal error; aborting (SIGABRT) ...

stderr has:

Reset task to succeeded and let the suite continue
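
The jobfs failures recorded above all share the same PBS message format, so the quota and actual usage can be pulled out mechanically when deciding how large a jobfs request to make. The sketch below parses the three messages quoted in this diary; the regular expression and the `parse` helper are illustrative and assume the message layout is always the one shown.

```python
import re

# Pattern matching the PBS jobfs message as it appears in the task stdout.
PATTERN = re.compile(
    r"Job (?P<job>\S+) killed due to exceeding jobfs quota\. "
    r"Quota: (?P<quota>[\d.]+)MB, Used: (?P<used>[\d.]+)MB, Host: (?P<host>\S+)"
)

# Messages copied from the u-aj730 diary entries.
messages = [
    "Job 8677782.r-man2 killed due to exceeding jobfs quota. Quota: 200.0MB, Used: 255.92MB, Host: r2760",
    "Job 9705892.r-man2 killed due to exceeding jobfs quota. Quota: 200.0MB, Used: 325.95MB, Host: r2759",
    "Job 9880142.r-man2 killed due to exceeding jobfs quota. Quota: 100.0MB, Used: 130.72MB, Host: r47",
]


def parse(message):
    """Return job id, quota, usage and host from one message, or None if it does not match."""
    m = PATTERN.search(message)
    if m is None:
        return None
    return {
        "job": m["job"],
        "quota_mb": float(m["quota"]),
        "used_mb": float(m["used"]),
        "host": m["host"],
    }


records = [parse(msg) for msg in messages]
peak_used = max(rec["used_mb"] for rec in records)

print("largest jobfs use seen (MB):", peak_used)
print("covered by a 500 MB request:", peak_used < 500.0)
```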
u-am137
Cycle time Failed task Reason for failure Action taken
20160516T18 glu_var_anal_n108 stderr log says PF_bdy_lyr.f90 failed; something went wrong at niter=25. For this cycle mu seems to be negative more often than in other cycles during the inner-loop iterations, but I am not sure whether this is the cause of the failure. Reran the task and it succeeded; compared the stdout logs from the failed job and from the successful run and the numbers are exactly the same until niter=25
20160605T1200Z engl_ens_addssts_014 No obvious message in stdout/stderr; perturbed SST field seems to have been merged with analysis perturbation Reset task to succeeded
20160614T0000Z and 20160614T1200Z gl_ver_obs_satwind stderr has following lines:

/home/548/jtl548/cylc-run/u-am137/share/fcm_make_ver/build/bin/VerScr_VerifVsObs: line 443: 24422: Memory fault
VerScr_VerifVsObs: VerProg_VerifVsObs.exe failed with rc 267
It turns out that the default size of available stack is small on Raijin's compute nodes. So occasionally when obstore files are slightly bigger than usual VerProg_VerifVsObs.exe runs out of stack. The workaround is that in VerScr_VerifVsObs of VER source I added 'ulimit -s unlimited'. See https://code.metoffice.gov.uk/trac/ver/ticket/25
20160612T1200Z - 20160614T1200Z All OPS tasks to do with ahiclear failed to run. Even stranger, they do not appear on gcylc (!) May need to start from 20160612T06 to generate warm-running files for the analysis at 20160612T12.

N.B. it looks like during testing of FASTRUN I have inadvertently overwritten atmanl files from 20160612T00 and 20160612T06
FASTRUN using the atmanl file is more involved than I thought. Here are more details about how to modify the glu_um_fcst task to do FASTRUN using atmanl. As FASTRUN using the atmanl file requires changes to the time profile of the output STASH, I decided not to rerun from an earlier cycle. Instead I started from the cycle for which enough warm-running files are available: 20160614T12 seems to have enough warm-running background files for 20160614T18, so I warm-started the suite from 20160614T18. AHI CSR is therefore not used for cycles 20160612T1200Z - 20160614T1200Z
20160617T12 gl[mu]_ops_process_background_satwind stderr has following message:

*** glibc detected *** /home/548/jtl548/cylc-run/u-am137/share/fcm_make_ops/build/bin/OpsProg_CreateODB.exe: double free or corruption (out): 0x00002ab50c2dd010 ***
It looks like there's a problem with the AMV Bufr files (see diary for u-aj730). Reset gl[mu]_ops_process_background_satwind tasks to succeeded and let the suite continue
20160617T12 gl_ver_obs_satwind stderr has:

Zero observations found in ODB: "/home/548/jtl548/cylc-run/u-am137/share/data/ver/user/ODB_GM/ODB_20160617T1200Z_Satwind.obstore"
Decided to reset the task to succeeded and let the suite proceed
20160618T18 glu_ops_odb_to_odb2_satwind See same failure in u-aj730 See above
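
The gl_ver_obs_satwind entry above traces a memory fault to the small default stack size on the compute nodes and works around it with 'ulimit -s unlimited'. A quick way to see what limit a job actually runs with is to query it from Python's standard `resource` module (Unix only). This is only a diagnostic sketch; the `describe` and `raise_soft_stack_limit` helpers are hypothetical names and are not part of the VER scripts.

```python
import resource


def describe(limit):
    """Format an rlimit value (bytes) for printing."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit // 1024} KiB"


def raise_soft_stack_limit():
    """Try to raise the soft stack limit to the hard limit.

    Returns True if the soft limit now equals the hard limit. When the hard
    limit is unlimited this mirrors 'ulimit -s unlimited' for this process
    and for any child processes it starts afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == hard:
        return True
    try:
        resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    except (ValueError, OSError):
        # Some platforms refuse the change; report rather than fail.
        return False
    return True


soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
print("soft stack limit:", describe(soft))
print("hard stack limit:", describe(hard))
print("soft limit raised to hard limit:", raise_soft_stack_limit())
```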

Useful information

  • In the suite the OPS tasks for H-8 AHI CSR use the label, 'ahiclear'
  • In the OPS source code the obsgroup used for H-8 AHI CSR is 'ObsGroupAHIClr' - see 'OpsMod_ObsInfo/OpsMod_ObsGroupInfo.f90'
  • In the OPS source code the MetDB subtype used for AHI CSR is 'HIMCSR' - see '../public/Ops_Constants/Ops_SubTypeNameToNum.inc'

ToDo

  1. In raijin4:/g/data/dp9/da/access-g/ops/bufr add "ahiclear.*.bufr" to ECMA tarballs
  2. In "gl[um]_ops_process_background_ahiclear" tasks may need to modify MetDB elements file
    • the repetition may not match what's in ODB
Last modified on Nov 14, 2017 10:26:16 AM