
CAWCR-BoM ACCESS NWP Ngamai Migration Working Group


CAWCR-BoM ACCESS-NWP Ngamai Porting Working Group Meeting Notes

Meeting 11: Wednesday 2nd October 2013, 9E Meeting Room
Present: Joerg Henrichs, Martin Dix, Jim Fraser, Wenming Lu, Chris Tingwell, Michael Naughton, Robin Bowen, Asri Sulaiman, Joan Fernon (phone)
Apologies: Ed Habjan, Ilia Bermous, Zhihong Li, Yi Xiao


Agenda

  • List from previous meeting notes
  • Task List
  • AOB

AG1

  • Ran 3 months to real time, but a problem was discovered (now fixed).
    • Verification was not run because of MARS unavailability.
    • Problem was noticed by Xiao in ACCESS-TC verification.
    • Troubleshooting was aided by Gary's difference plot.
    • Tracked down to an abort in the SST data update task.
      • SSTs were not updated from the July values at the start of the trial period.
      • Failure was due to a required directory not having been created; the error was not trapped by SMS.
    • Not noticeable over a short timeframe.
    • Impact on result is negligible over Australia.
    • Impact shows up clearly in NH.
    • Overall the problem was quickly spotted and solved.
  • Restart run commenced from 1st Sept with problem fixed.
    • Expect to catch up to real time in ~ 1 week.
  • Gary's Diagnostics - Robin to follow up
    • Python program which goes through all output fields and produces statistics, difference plots, etc.
    • Can include other programs from various people.
    • Plan to document for general use.
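
The kind of per-field statistics such a diagnostics program produces can be sketched as follows. This is a minimal illustration only, not Gary's program: the function name, the dict-of-lists data layout and the field names `sst` and `mslp` are all assumptions made for the example, and real model output would be read from files rather than typed in.

```python
import math


def field_difference_stats(control, trial):
    """Return simple difference statistics for each field in `control`.

    `control` and `trial` are hypothetical dicts mapping a field name to a
    flat list of grid-point values from two runs of the same configuration.
    """
    stats = {}
    for name in sorted(control):
        if name not in trial:
            # Field missing from the trial run: flag it instead of guessing.
            stats[name] = None
            continue
        a, b = control[name], trial[name]
        if len(a) != len(b) or not a:
            raise ValueError("field %r is empty or has mismatched sizes" % name)
        diffs = [y - x for x, y in zip(a, b)]
        n = len(diffs)
        stats[name] = {
            "mean_diff": sum(diffs) / n,
            "rms_diff": math.sqrt(sum(d * d for d in diffs) / n),
            "max_abs_diff": max(abs(d) for d in diffs),
        }
    return stats


# Illustrative values only; "sst" and "mslp" are made-up field names.
control = {"sst": [290.0, 291.0, 292.0], "mslp": [1010.0, 1012.0]}
trial = {"sst": [290.0, 292.0, 294.0], "mslp": [1010.0, 1012.0]}
for name, result in field_difference_stats(control, trial).items():
    print(name, result)
```

A field whose statistics are all zero is unchanged between the runs; a field such as SST that was never updated would stand out with a steadily growing difference against a correct run.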

AR1

  • Will re-run from 1st September.
  • Ivor working on post-processing tasks.
  • Ingestion to MARS to start soon.
    • Limited fields/levels and timesteps for limited verification to reduce load on MARS-1.
  • MARS-7 still being debugged.
    • Data from ACCESS-TC with 3-digit MARS IDs have not been flushed, clogging the cache.
    • This was discovered by Arn and Tan -- flushing will improve MARS-1 capacity by 20%.
  • Verification using files on disk being tried out by Xiaoxi Wu.

MARS / SAM

  • Tan Le to analyse users' use of MARS fields.
    • Identify fields which can be archived to tape more quickly.
    • Email CAWCR users to survey usage (rab).
  • Copying of output from the Daily run to NCI (as part of the RDSI project) may further reduce MARS load, as some users can access the data at NCI.
    • rab, jrf and Joan to follow up.
  • Recompilation of MARS-7 being done.
  • Crucial meeting on the future of MARS-7 to be held Thursday 3rd October. May decide to revert to the old MARS in the interim instead of MARS-7 at SDC.
  • Arn will be on leave in November which will impact MARS development work.

AC1

  • No new updates from Wenming; all is well.
  • NMOC - Joan about to get started on operational ACCESS-C version.

ATC1

  • Joan may be able to start looking at ATC later in October, continuing to Nov.
  • For now, ACCESS-C is higher priority.
  • Xiao will be away for 2.5 weeks from 18 Oct.
    • Will work on new ACCESS-TC & APS2 ACCESS-G when she returns.
  • Improvement proposed for reconfiguration step for various ACCESS-TC domains:
    • Run re-config over whole domain, then use subset as needed.
    • Solution to problem with no land points in ACCESS-TC domains.
    • Martin to supply the job.
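
The "whole domain first, subset later" idea can be illustrated with plain index selection on a regular latitude/longitude grid. This is only a sketch of the subsetting step and says nothing about how the UM reconfiguration itself works; the function name, the list-of-rows layout and the coordinate values are assumptions made for the example.

```python
def subset_field(field, lats, lons, lat_range, lon_range):
    """Cut a rectangular sub-domain out of a field on a regular grid.

    `field` is a list of rows (one row per latitude in `lats`), each row
    holding one value per longitude in `lons`. The ranges are inclusive
    (low, high) pairs. Returns the sub-field with its own coordinate lists.
    """
    lat_lo, lat_hi = lat_range
    lon_lo, lon_hi = lon_range
    rows = [i for i, lat in enumerate(lats) if lat_lo <= lat <= lat_hi]
    cols = [j for j, lon in enumerate(lons) if lon_lo <= lon <= lon_hi]
    if not rows or not cols:
        raise ValueError("requested sub-domain lies outside the parent domain")
    sub = [[field[i][j] for j in cols] for i in rows]
    return sub, [lats[i] for i in rows], [lons[j] for j in cols]


# Hypothetical parent domain: 3 latitudes by 4 longitudes.
lats = [-30, -20, -10]
lons = [140, 150, 160, 170]
parent = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12]]
sub, sub_lats, sub_lons = subset_field(parent, lats, lons, (-20, -10), (150, 160))
print(sub, sub_lats, sub_lons)
```

Because the expensive step is done once on the parent domain, any number of smaller domains can be cut out afterwards, including ones that contain no land points.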

NGAMAI ISSUES

  • Xiao's problem with the obs task on Ngamai has been fixed.
  • More monitoring is being done on Ngamai to spot node problems more quickly.
  • James Mandilas / Rob Jukic preparing for operational support to allow mid-November NMOC operational switchover.
  • NMOC plan to be ready for operational switchover to Ngamai from 1st November.

Run time variation / Ngamai performance

  • Kernel tuning changes to disable defragmentation of "Transparent Huge Pages" to be applied to all Ngamai compute nodes.
  • The changes should fix Xiao's problem with slow reconfiguration execution, reducing the run time from 120-1200 s to around 103 s.
  • The problem of the 2nd UM run being substantially slower than the 1st run in a 2-run test job has been tracked down to a Lustre problem which occurs when the 2nd job simply overwrites the files created by the 1st job. If a fresh directory is used for the 2nd job, the elapsed time variation disappears.
  • A meeting between BoM/ORACLE and NCI sysadmins is being scheduled.
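
The fresh-directory workaround for the Lustre overwrite slowdown can be sketched as below. This is an illustration under stated assumptions, not the test job's actual script: the function name, the `um_run_` prefix and the base directory are hypothetical.

```python
import os
import tempfile


def make_fresh_run_dir(base_dir, prefix="um_run_"):
    """Create and return a new, empty directory under `base_dir`.

    Each call yields a uniquely named directory, so a later run never
    overwrites files left behind by an earlier one.
    """
    os.makedirs(base_dir, exist_ok=True)
    return tempfile.mkdtemp(prefix=prefix, dir=base_dir)


# Hypothetical base location; a real job would use its own output area.
base = os.path.join(tempfile.gettempdir(), "um_test_output")
first_run_dir = make_fresh_run_dir(base)
second_run_dir = make_fresh_run_dir(base)
print(first_run_dir)
print(second_run_dir)
```

The trade-off is that old run directories accumulate and need separate housekeeping, which the overwrite-in-place approach avoided.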

Executable Build procedures and documentation

  • Dan Cook of Oracle is looking at ksh issue on Ngamai.
  • SCI cgi monitor not working due to Perl issue.
    • rab following up -- work in progress.

UMUI / SVN / TRAC

  • Email announcing migration of ACCESS applications from Solar to Ngamai was sent to all Solar users; migration scheduled for the period 4:00pm Friday 4th October to 7:00am Monday 7th October.
  • All main components were tested, but testing did not cover everything.

AOB and items from Task List

  • Rose/Cylc set up on Ngamai
    • rab looking at arranging prerequisite Python package installations.
    • Xiao to install Rose-Cylc packages and try out SREP implementation after her leave.
  • Work to allow compilation on Ngamai compute nodes being addressed.
    • gcc has been installed.
    • "make" required, not yet installed.
  • /apps installation should now be complete.
  • UM Small execs
    • Copy from Raijin now being used.
    • Documentation on their build is requested.
    • Build on ngamai is still desirable but not urgent.
  • CAP program on Ngamai: Wenming and Martin Dix to follow up.
  • Verify is working on Ngamai; still to be done for Raijin.
  • rab sending weekly emails re Solar de-commissioning.
  • NMOC NWP Configuration Management
    • "Station list" information now in sync with DA group.
      • To be kept in SVN after design/structure is completed.
    • Executable Build procedures being developed (see "Executable Build procedures and documentation" section in these notes).
      • Work continuing; NMOC to try out when ready.
  • Fortnightly meetings of this ACCESS NWP Ngamai Migration Working group to be continued until shutdown of Solar.

Next Meeting


* * * 11am Wed 16th October 2013, 9E Meeting Room * * *


[azs, Fri 11/10/2013] First draft.
[azs, Mon 14/10/13] Updates with feedback from Robin and Joerg.
[mjn, Mon 14/10/2013] Minor editing.

Last modified on Oct 20, 2016 2:50:07 PM