Changes between Initial Version and Version 1 of access/UMResubmission

Oct 10, 2014 2:28:32 PM (7 years ago)



=== Climate model automatic resubmission and restarting after errors ===
The UM has a facility for automatically resubmitting another job after successful completion. This can be used to do long climate simulations in manageable chunks. In the vn7.3 UMUI the chunk size is set in the follow-on panel to the "Job submission, resources and re-submission" panel (via the NEXT button). In vn8.0 and later there's a separate "Re-submission pattern" panel.
A normal job processed by the UMUI has TYPE=NRUN in the SUBMIT file. This forces a new run, starting from the dump file and date specified in the UMUI. A continuation run has TYPE=CRUN. From UM vn8.1 on this can be set as a UMUI option, but in earlier versions it must be set via a hand-edit or by directly editing the SUBMIT file after processing and before submitting.
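For the pre-vn8.1 case, where the SUBMIT file is edited directly, the change can be sketched as a small text substitution. This is only an illustration: the function name `set_run_type` is hypothetical, and it assumes the setting appears literally as `TYPE=NRUN` or `TYPE=CRUN`, as written above.

```python
import re


def set_run_type(submit_text, run_type="CRUN"):
    """Return SUBMIT file text with the TYPE setting switched to run_type.

    Assumes the setting appears literally as TYPE=NRUN or TYPE=CRUN.
    """
    if run_type not in ("NRUN", "CRUN"):
        raise ValueError("run_type must be 'NRUN' or 'CRUN'")
    # Replace only the value after TYPE=, leaving the rest of the line intact.
    new_text, count = re.subn(r"\bTYPE=(?:NRUN|CRUN)\b",
                              "TYPE=" + run_type, submit_text)
    if count == 0:
        raise ValueError("no TYPE=NRUN or TYPE=CRUN setting found")
    return new_text
```

The result would be written back to the SUBMIT file after processing and before submitting the job.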
After each dump file is written (i.e. at each potential restart point), the model writes a temporary history file, RUNID.thist, in the run directory (where RUNID is the name of the experiment). On successful completion of a model job, the utility script qspickup adds the temporary history to a permanent history file, RUNID.phist, and then deletes the temporary file. Both the thist and phist files have the same format, containing 5 Fortran namelists.
The phist file has a set of these for each job restart, with the most recent at the top.
Relevant parts of this from a run that has completed a 3-month block are:
 RUN_RESUBMIT_TARGET = 30, 9, 1, 3*0,
End time of last completed run: nyears, nmonths, ndays, etc.
 H_STEPIM        =                539088, 3*0,
Number of model time steps completed (first value is atmospheric model, others are components of the HadGEM2 coupled model and so not relevant here).
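As a rough check on a step count like this, it can be converted to elapsed model days. The sketch below assumes 48 atmospheric timesteps per day (a 30-minute timestep); the timestep is not stated on this page, so treat that value and the function name `steps_to_days` as assumptions to be replaced with the job's actual settings.

```python
def steps_to_days(steps, steps_per_day=48):
    """Convert a completed-timestep count to elapsed model days.

    steps_per_day=48 is an assumed value (30-minute timestep),
    not something taken from the history file.
    """
    return steps / steps_per_day


print(steps_to_days(539088))  # 11231.0
```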
 END_DUMPIM = 'xahnea.dak8a10', 3*'              ',
 ASTART   = 'ASTART  : $DATAM/xahnea.dak8710                                                 ',
 ARESTART = 'ARESTART: $DATAM/xahnea.dak8a10                                                 ',
This specifies the start and end files for the previous block. The next run will start with the ARESTART file.
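To see which restart dumps a history file records, the ARESTART entries can be pulled out with a regular expression. This is a minimal sketch that assumes the entries look like the excerpt above; the function name `restart_dumps` is hypothetical.

```python
import re

# Matches lines such as:
#  ARESTART = 'ARESTART: $DATAM/xahnea.dak8a10      ',
_ARESTART = re.compile(
    r"^[ \t]*ARESTART\s*=\s*'ARESTART\s*:\s*([^\s']+)", re.MULTILINE
)


def restart_dumps(phist_text):
    """Return the dump path from every ARESTART entry, in file order."""
    return _ARESTART.findall(phist_text)
```

Because the phist file lists the most recent restart first, the first element of the returned list is the dump the next continuation run would start from.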
When a CRUN job starts it first checks for the existence of a RUNID.thist file. If this exists (indicating that the previous run did not complete properly) it restarts using the information in this. Otherwise it uses the information in the RUNID.phist file.
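The selection rule can be written out as a short check, which is handy for confirming which file a resubmitted job would pick up. This mirrors the description above rather than the UM scripts themselves; the function name `history_file_for_crun` is hypothetical.

```python
from pathlib import Path


def history_file_for_crun(run_dir, runid):
    """Return the history file a CRUN would read, per the rule described above."""
    run_dir = Path(run_dir)
    thist = run_dir / (runid + ".thist")
    phist = run_dir / (runid + ".phist")
    # A leftover temporary history means the previous job did not finish cleanly.
    if thist.exists():
        return thist
    if phist.exists():
        return phist
    raise FileNotFoundError("no history file for " + runid + " in " + str(run_dir))
```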
Sometimes it's necessary to intervene in this process, for example, to restart after a crash.
== Restarting after crashes ==
The ACCESS climate runs occasionally crash due to large vertical velocities developing over the Himalayas (on the order of once per decade). These incidents could likely be prevented by running with a shorter timestep, but it's more economical to simply restart the model from a perturbed dump file. Unless the crash is very close in time to the dump, this is usually sufficient to avoid the problem. The normal Met Office approach is to rerun the last month with an increased number of convection calls per timestep. This effectively perturbs the model evolution through a change in the physics and has the same effect as perturbing the initial condition.
To perturb the dump file on raijin:
% module use ~access/modules
% module load pythonlib/umfile_utils
% python ~access/apps/pythonlib/umfile_utils/ dumpfile
Note that this modifies the file in place. Resubmitting the job will restart from this last dump file.
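Since the dump is modified in place, it may be worth keeping a copy of the original first. The helper below is a suggested precaution, not part of umfile_utils; the name `backup_dump` and the `.orig` suffix are arbitrary choices.

```python
import shutil
from pathlib import Path


def backup_dump(dump_path, suffix=".orig"):
    """Copy a dump file alongside itself before it is perturbed in place."""
    src = Path(dump_path)
    dst = src.with_name(src.name + suffix)
    # Refuse to overwrite an earlier backup, which may be the only clean copy.
    if dst.exists():
        raise FileExistsError(str(dst) + " already exists")
    shutil.copy2(src, dst)
    return dst
```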
The default is a perturbation of amplitude 0.01 K applied to the potential temperature. If the restarted model still crashes then this can be increased using the -a argument, e.g.
python ~access/apps/pythonlib/umfile_utils/ -a 0.1 dumpfile
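To illustrate what a perturbation of a given amplitude means, the toy sketch below adds a small bounded random offset to each value in a list. It is not the umfile_utils code: the uniform distribution, the function name `perturb` and the use of a plain list in place of the potential temperature field are all assumptions made for illustration.

```python
import random


def perturb(theta, amplitude=0.01, seed=None):
    """Return a copy of theta with a random offset in [-amplitude, amplitude] added.

    The uniform distribution is an assumption; the real script may differ.
    """
    rng = random.Random(seed)
    return [t + rng.uniform(-amplitude, amplitude) for t in theta]


# Example: three potential temperature values in K, perturbed by at most 0.1 K.
print(perturb([300.0, 301.5, 299.2], amplitude=0.1, seed=1))
```

A larger amplitude moves the restarted run further from the trajectory that crashed, which is why increasing it can help when the default is not enough.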
== Restarting from an earlier dump file ==
The cleanest way to do this is perhaps an NRUN from the required file, followed by a switch back to the usual CRUN process. However, it's also possible to do it more directly by manipulating the history files.
To restart from the end of the last successfully completed run (rather than from an intermediate dump file), just remove the RUNID.thist file. To restart from an even earlier point you can change the RUNID.phist file. Remove sets of the 5 namelists to get back to the desired starting position (i.e. the first occurrence of ARESTART has the name of the file you want to start from). Be careful that the namelists are still in the correct order. The first one should be &NLIHISTO.
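The trimming of the phist file can be sketched as follows. It assumes each set of namelists begins with a line starting `&NLIHISTO` and that ARESTART entries look like the excerpt earlier on this page; the function name `rewind_phist` is hypothetical. Work on a copy of the phist file, and check the result by eye before resubmitting.

```python
import re


def rewind_phist(phist_text, dump_name):
    """Drop leading namelist sets until the first ARESTART names dump_name."""
    # Split just before each &NLIHISTO line, so each piece is one set of namelists.
    pieces = re.split(r"(?m)^(?=[ \t]*&NLIHISTO\b)", phist_text)
    sets = [p for p in pieces if "&NLIHISTO" in p]
    for i, block in enumerate(sets):
        m = re.search(r"(?m)^[ \t]*ARESTART\s*=\s*'ARESTART\s*:\s*([^\s']+)", block)
        if m and m.group(1).endswith(dump_name):
            # Keep this set and all older ones, in their original order.
            return "".join(sets[i:])
    raise ValueError("no namelist set restarts from " + dump_name)
```

Splitting on `&NLIHISTO` keeps each set whole, so the remaining namelists stay in the correct order.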
== Changing ancillary files ==
Again, not really recommended behaviour but it can be useful.
When runs are resubmitted automatically the model gets the names of the ancillary files from the RUNID.phist file. E.g.
 SULPEMIS = 'SULPEMIS : $CMIP5ANCIL/scycle_1850_2000_IPCCf',
If the model run fails because you've run off the end of an ancillary file, you might want to switch to another one to continue the run, e.g. in an AMIP run from 1979-2010 that crosses the end of the historical emissions files. Again, the correct approach is to change the files in the UMUI job and resubmit as an NRUN and then a CRUN (even better would be to create a new set of ancillary files that covers the whole of the required period). The quick fix is to change the names of the files directly in the RUNID.phist file. It's only necessary to change the most recent (top) namelist. E.g. one could set
 SULPEMIS = 'SULPEMIS : $CMIP5ANCIL/sulp_RCP45_2000_2100f.N96',
and then resubmit the job.
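The same quick fix can be scripted. The sketch below changes only the first (most recent) entry for a given ancillary name and leaves older entries untouched. It assumes entries are formatted like the SULPEMIS examples above; the function name `set_ancil` is hypothetical.

```python
import re


def set_ancil(phist_text, name, new_path):
    """Point the top-most entry for ancillary `name` at new_path."""
    key = re.escape(name)
    # Group 1 is everything up to the path; the path runs to a space or quote.
    pattern = re.compile(
        r"(?m)^([ \t]*" + key + r"\s*=\s*'" + key + r"\s*:\s*)[^\s']+"
    )
    # count=1 limits the change to the first match, i.e. the most recent set.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + new_path, phist_text, count=1
    )
    if count == 0:
        raise ValueError("no entry found for " + name)
    return new_text
```

For the example above this would be called as `set_ancil(text, "SULPEMIS", "$CMIP5ANCIL/sulp_RCP45_2000_2100f.N96")`, with the result written back to RUNID.phist before resubmitting.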