
CAWCR-BoM ACCESS NWP Ngamai Migration Working Group


CAWCR-BoM ACCESS-NWP Ngamai Porting Working Group Meeting Notes

Meeting 12: Wednesday 16th October 2013, 9E Meeting Room
Present: Martin Dix, Jim Fraser, Ed Habjan, Chris Tingwell, Michael Naughton, Robin Bowen, Asri Sulaiman, Joan Fernon, Yi Xiao, Ilia Bermous
Apologies: Joerg Henrichs, Wenming Lu, Zhihong Li


Agenda

  • Update on MARS status
  • List from previous meeting notes
  • Task List
  • AOB

MARS / SAM

  • Robin reported that MARS-7 will not be ready, and the project has been put on the back burner.
  • MARS-1 will be used for ngamai operations.
    • "Forensic" work on MARS-1 has resulted in substantial improvement.
  • Joan has streamlined the flow of critical data by cutting down the fields, and levels of fields, from AG1 and AR1 that are archived to MARS.
    • Ivor will be testing the new scripts.
  • Improvements to MARS-1 may in fact make it possible to try out full ingestion of data, with fall-back to the reduced set if needed.
  • Three more tape drives are expected, which should improve MARS-1 even further.
  • New MARS support staff seconded from ITOPS have done good work in improving MARS-1 operations.
  • ACTION: Continue monitoring AG1 MARS ingestion.
  • Refer to Jim Fraser's updates sent via email.



AG1



AR1

  • The improvement in AR1 runtime on ngamai has been "sensational".
    • This time-critical application now takes 69 minutes to run on average, a 50-minute improvement.
      • MPI 1.6.5 and tuning of "Transparent Huge Pages" have contributed to the improved and more consistent run times.
      • Joerg's SSP jumped from 400 to 500 points
  • Mis-config NMOC var fix applied.
  • Verify Broadscore over Australia looks good.
  • Moving towards Surface Verifications when time permits -- this is a system issue, not directly related to porting.
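
A minimal sketch of the arithmetic behind the AR1 figures above. The previous runtime of 119 minutes is implied by the two reported numbers (69 minutes now, 50 minutes saved) and is not stated in the notes; the helper name percent_change is hypothetical, and the SSP values are treated as plain numbers because the notes do not define the scale.

```python
def percent_change(old, new):
    """Signed percentage change from old to new."""
    return (new - old) / old * 100.0


current_runtime_min = 69   # average AR1 runtime reported in the notes
improvement_min = 50       # improvement reported in the notes
# Implied previous runtime (an inference, not a figure from the notes).
previous_runtime_min = current_runtime_min + improvement_min

runtime_change = percent_change(previous_runtime_min, current_runtime_min)
ssp_change = percent_change(400, 500)  # SSP figures as reported

print(f"Implied previous runtime: {previous_runtime_min} min")
print(f"Runtime change: {runtime_change:.1f}%")
print(f"SSP change: {ssp_change:+.1f}%")
```

On these numbers the runtime falls by roughly 42% and the SSP figure rises by 25%.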



AC1

  • Joan has started running NMOC's ACCESS-C version.
    • Some script fixes required after incorporating Milton's tidy-ups
    • Stricter script criteria implemented.
  • Will start cycling once MARS issues are settled.
  • No timing information yet on ngamai ACCESS-C
  • Expected to start cycling with data from 10/10/2013.
  • Wenming's AC1 accidentally reverted to using the old STASHMASTER file; this has been fixed.



Turning off solar's APS-0

  • APS-0 systems are being turned off
    • AC0 already turned off
    • AG0 and AR0 to follow



ATC1

  • Work on NMOC's ATC1 starting.
    • Xiao to provide handover notes.
  • Martin has run the reconfiguration for the big domain on ngamai
    • Extraction of subset is ok.
  • Work done using UM vn7.6.
  • Few structural changes required.
  • Joan and Robin to work on replacing existing reconfig in ATC.
    • Some complications are expected.
    • Do not implement until Xiao gets back from leave.
  • Robin to provide other miscellaneous program updates, including tcbgs program.



New SMS Server

  • A virtual server within a VMware installation at SDC will be available shortly
    • Will be used for SMS scheduler controlling ngamai operational systems.
    • This will replace current SMS system on RTDS4.
    • Expected to have improved stability.
    • Fail-over capability.
  • New system will have stdout and stderr available in real-time
  • Work to "harmonise" NMOC's SMS implementation is very desirable and is scheduled for 1Q 2014.



Disk Space quota

  • Disk space quotas are now implemented on ngamai
  • rto quota on ngamai is set to 5 TB
  • Use "lfs quota" command to check usage status
    • e.g. lfs quota -u <user> $DATADIR
  • Ngamai quota policies and link to current usage available from: http://ngamai.bom.gov.au/~jjm/quotas.html
  • Keeping ngamai HOME area to manageable size is crucial
    • Currently a backup of solar's HOME can take up to 36 hours.
    • A backup time of 4 to 5 hours is more reasonable.
    • ngamai HOME usage is already at 1.5 TB (about 1/4 of solar's).
  • mjn's and Xiao's jobs were hit by ngamai's quota implementation; now fixed.
    • Due to an Oracle implementation mistake, the quota size was set an order of magnitude too low.
  • Standard ngamai users will be set with Tier-1 quotas. A few people with specific needs are given Tier-2 quotas.
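
The quota check quoted above can be wrapped in a small script. This is a sketch only: the function names are hypothetical, it assumes the USER and DATADIR environment variables are set, and it simply prints whatever "lfs quota" returns rather than assuming any particular output layout.

```python
import os
import shutil
import subprocess


def build_quota_command(user, datadir):
    """Return the argument list for: lfs quota -u <user> <datadir>."""
    return ["lfs", "quota", "-u", user, datadir]


def show_quota(user, datadir):
    """Run the quota check if the lfs tool is installed; return True if it was run."""
    if shutil.which("lfs") is None:
        print("lfs not found on this host; skipping quota check")
        return False
    result = subprocess.run(
        build_quota_command(user, datadir),
        capture_output=True,
        text=True,
        check=False,
    )
    # Print the tool's own report unmodified (stderr if stdout is empty).
    print(result.stdout or result.stderr)
    return True


if __name__ == "__main__":
    user = os.environ.get("USER")
    datadir = os.environ.get("DATADIR")
    if user and datadir:
        show_quota(user, datadir)
    else:
        print("USER and DATADIR must both be set")
```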



Run-time variation

  • Ilia has verified that runtimes are improved and consistent following implementation of the Lustre bug workaround and kernel tuning of "Transparent Huge Pages" (from email sent after the meeting).
  • Information on the kernel tuning has been supplied to NCI, who have also turned off "Transparent Huge Pages" on raijin.
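
A minimal sketch for checking the "Transparent Huge Pages" setting on a Linux node. The first sysfs path is the standard kernel location; the second, vendor-prefixed path is an assumption about older enterprise kernels and may not apply to ngamai or raijin. The function names are hypothetical. The kernel reports the active mode in square brackets, e.g. "always madvise [never]".

```python
from pathlib import Path

# Candidate sysfs locations; the second entry is an assumption for older
# vendor kernels and may not exist on a given system.
THP_CANDIDATES = (
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
)


def parse_thp_setting(text):
    """Return the bracketed (active) mode, or None if none is marked."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_setting():
    """Return the active mode from the first readable candidate, else None."""
    for candidate in THP_CANDIDATES:
        path = Path(candidate)
        if path.is_file():
            try:
                return parse_thp_setting(path.read_text())
            except OSError:
                continue
    return None


print("Transparent Huge Pages mode:", read_thp_setting())
```

A result of "never" would indicate that the feature is turned off on that node.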



Access Application Migration (UIs, SVN, TRAC, DOCS )

  • Migration has been conducted successfully, with a few issues emerging.
  • Most of those issues have been addressed.
  • Email to be sent to access_nwp_users asking users to test all their usage and identify as many issues as possible before solar becomes unavailable.
  • Remind users to use $DATADIR for UM Builds.
    • All sample UMUI UM build jobs will use $DATADIR for intermediate files.
  • Some users will need to migrate to Solar due to continued use of assimilation infrastructure.



rose/cylc on ngamai

  • Implementation of dependent libraries has been added to /apps tasks
  • Robin to follow up with the aim to have it available by the time Xiao gets back from leave
  • SREP demo planned for JAN 2014



Compilation on compute nodes

  • S/W stack update was made and tested successfully.



Small execs

  • Raijin's version in use
  • Documentation and ngamai build -- work in progress



CAP Program

  • Martin and Wenming following up -- work in progress



Verify

  • Ngamai version now working
  • Work-in-progress for Raijin's version.



solar shutdown

  • Robin implementing weekly emails
  • Impact on research?
    • Noel and Maree (visiting scientist) will use solar to the end for ATC
      • Migrating is not feasible.
    • AGREPS suites
    • APS-2
    • Other D/A suites?
  • Some nodes to be switched off during November
    • To reduce power and air-conditioning requirements.
    • Usage is expected to be decreasing.
  • NMOC looking at mid-November for the date of switchover of operational systems.
    • May be done earlier
  • Ngamai to be declared fully operational by end of week beginning 21/10/13.



NWP Build Documentation

  • Work on documentation page continuing (https://trac.nci.org.au/trac/access/wiki/Access_NWP_Build_Procedures)
  • Need peer verifications of the steps outlined in the documentation.
    • A test build of AG1 generated an executable of a different size (but does it give identical results?)
  • Simple test run jobs required for each build
    • Currently need to test within NWP suite.

  • Ilia building VAR and OPS on raijin
    • Need to perform source extraction on ngamai (repository not available from raijin)
    • May build on ngamai, and transfer executable.
    • A solution to the unavailability of OPS and VAR source outside of BoM needs attention.
  • Ilia's test of running UM7.5 R12 Ngamai executable on Raijin gave bit-identical results
    • 186 time steps, 8x16 decomposition, 45 minutes runtime
  • Previous test of vn8.4 executable built on raijin did not run on ngamai.
    • Missing dynamic library
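
Two checks mentioned in the list above can be sketched as follows: a byte-for-byte comparison of two output files (one way to confirm "bit-identical" results) and a scan of "ldd" output for libraries reported as "not found". Function names and the sample file and library names are hypothetical; the notes do not say which tools were actually used.

```python
import filecmp
import os
import tempfile


def files_bit_identical(path_a, path_b):
    """Compare file contents byte for byte (shallow=False skips the stat shortcut)."""
    return filecmp.cmp(path_a, path_b, shallow=False)


def missing_libraries(ldd_output):
    """Return library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


# Small self-contained demonstration with made-up data.
with tempfile.TemporaryDirectory() as tmp:
    run_a = os.path.join(tmp, "run_a.out")
    run_b = os.path.join(tmp, "run_b.out")
    for path in (run_a, run_b):
        with open(path, "wb") as handle:
            handle.write(b"\x00\x01\x02 sample model output")
    print("bit-identical:", files_bit_identical(run_a, run_b))

sample_ldd = (
    "\tlibm.so.6 => /lib64/libm.so.6 (0x00007f0000000000)\n"
    "\tlibexample.so.1 => not found\n"
)
print("missing:", missing_libraries(sample_ldd))
```

In practice the ldd text would come from running "ldd <executable>" on the target machine before attempting the job.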

  • Porting of VAR 27.2 and 29.2
    • Executable crashed; vectorization had to be disabled.
      • Problem with Intel compiler reported to compiler developers.

  • Job for building UM VN8.2 on ngamai available
    • To report on performance

  • Tests of Intel compiler 14.0.0 showed problems with optimisation at "-O2" or "-O3"; builds only work with "-O0".



ksh bug

  • ksh problem was narrowed down to a ksh bug
    • Fix is available - to be applied
    • In the meantime a workaround is being used.




Next Meeting


* * * 11am Wed 30th October 2013, 10E Meeting Room * * *

NOTE: NEW MEETING VENUE






[azs, Mon 28/10/2013] First draft.




Last modified on Feb 3, 2015 4:33:56 PM