
suite_audit

Program to rip information from rose/cylc suite logfiles and compile statistics.

There are two programs.

readlogs.py

Provides a summary of information from the suite log files.

To write Excel output rather than text files, the xlsxwriter module is needed. Load the python modules first, then run "easy_install --user xlsxwriter".

suite_profile.py

Creates a graph of the typical run time of a whole suite cycle, using submit and run times relative to the housekeep task (it assumes there is a housekeep task). It uses readlogs to pull the time information from the log files. This is most useful if you have a couple of clean (failure-free) cycles. The resulting graph is very large and is best viewed on a large portrait-oriented screen.

NOTE: This program was created for PBS Pro 13, and cylc 6.7.2. It may not work for other versions.

Module suite_audit
Run $ module load suite_audit to be able to access the program.

Development
The suite_audit program is available under MOSRS at https://code.metoffice.gov.uk/svn/utils/access/trunk/suite_audit. For developing new features, create a branch from here.

Recommended software versions:
python/2.7.6
pythonlib/pandas/0.18.1

To run interactively, import readlogs inside python.

Usage:
usage: readlogs.py [-h] [-a] [-o [OUTPATH]] [-l [LOGARCHIVE]]
                   [-m [MODE [MODE ...]]]
                   suite

positional arguments:
  suite                 path to log files for suite, typically
                        $HOME/cylc-run/$SUITE_ID

optional arguments:
  -h, --help            show this help message and exit
  -a, --doall           Flag to go through all (remote) logs including
                        job-{date} dirs by untarring
  -o [OUTPATH], --outpath [OUTPATH]
                        specify path for output (defaults to PBS workdir,
                        i.e. $HOME)
  -l [LOGARCHIVE], --logarchive [LOGARCHIVE]
                        path to log archive, defaults to
                        accessdev:<suite>/log
  -m [MODE [MODE ...]], --mode [MODE [MODE ...]]
                        Determines which aspect to audit. Options:
                          'r': to audit the resource usage and errors (default)
                          'u': to analyse UM tasks for stability by maximum
                               vertical wind
                          't': to audit real run time including queue times
                          'v': to analyse var tasks for convergence
                        If no entry is given, 'r' will be used, otherwise
                        only the provided options will be audited
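
The usage text above has the shape of output generated by Python's argparse module. As an illustration only, the following is a minimal sketch of a parser that accepts the same arguments. It is not the actual readlogs.py source: the help strings, the home-directory default for the output path and the parse helper are assumptions made for the example.

```python
import argparse
import os


def build_parser():
    """Build a parser that accepts the arguments listed in the usage text."""
    parser = argparse.ArgumentParser(prog="readlogs.py")
    parser.add_argument("suite", help="path to log files for suite")
    parser.add_argument("-a", "--doall", action="store_true",
                        help="also read logs in job-{date} dirs by untarring")
    # Assumption: the default output path is the user's home directory.
    parser.add_argument("-o", "--outpath", nargs="?",
                        default=os.path.expanduser("~"),
                        help="path for output")
    parser.add_argument("-l", "--logarchive", nargs="?", default=None,
                        help="path to log archive")
    parser.add_argument("-m", "--mode", nargs="*", default=["r"],
                        choices=["r", "u", "t", "v"],
                        help="aspects to audit")
    return parser


def parse(argv):
    """Parse argv and apply the 'no entry means r' rule."""
    args = build_parser().parse_args(argv)
    # "-m" given with no values yields an empty list; fall back to 'r'.
    if not args.mode:
        args.mode = ["r"]
    return args


args = parse(["/home/user/cylc-run/u-aa670", "-a", "-m", "r", "t", "u", "v"])
print("suite=%s doall=%s mode=%s" % (args.suite, args.doall, args.mode))
```

Because -m takes a variable number of values, placing the suite path before the options, as in the examples on this page, avoids the path being read as a mode.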

Examples:
Read all files on raijin, resource stats only:
readlogs.py $HOME/cylc-run/u-aa670 -o $HOME/suitestats

Read all files on accessdev, maximum output:
readlogs.py $HOME/cylc-run/u-aa670 -o $HOME/suitestats -a -m r t u v

To analyse a selection of cycles, manually extract them to a location and supply that location with the --logarchive /path/to/archive option. The log files on raijin are still analysed as well.

Functionality:
This program extracts resource usage from job.out files in the suite's log directory. It calls the script grepout.ksh from python to search for strings.

Resource option=

This includes memory and walltime requests and usage, exit status, NCPUs and service units. It also determines job success or failure from various cylc or task strings in the job.out, job.err, and job.status files.
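
To illustrate the kind of string matching involved, the sketch below pulls the resource fields out of a job.out footer with Python regular expressions. This swaps in the re module for the grepout.ksh script the program actually uses, and the footer text, field labels and dictionary keys are assumptions modelled on a typical PBS resource summary, not the exact strings the program searches for.

```python
import re

# Hypothetical job.out footer; the real layout depends on the PBS version.
SAMPLE_JOB_OUT = """\
   Exit Status:        0
   Service Units:      0.53
   NCPUs Requested:    16                     NCPUs Used: 16
   Memory Requested:   32.0GB                Memory Used: 4.5GB
   Walltime requested: 00:30:00            Walltime Used: 00:02:10
"""

# One pattern per field; each captures the value that follows the label.
PATTERNS = {
    "exit_status": r"Exit Status:\s+(-?\d+)",
    "service_units": r"Service Units:\s+([\d.]+)",
    "ncpus": r"NCPUs Requested:\s+(\d+)",
    "mem_requested": r"Memory Requested:\s+(\S+)",
    "mem_used": r"Memory Used:\s+(\S+)",
    "walltime_requested": r"Walltime requested:\s+(\S+)",
    "walltime_used": r"Walltime Used:\s+(\S+)",
}


def parse_resources(text):
    """Return a dict of resource fields; missing fields map to None."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        record[name] = match.group(1) if match else None
    return record


record = parse_resources(SAMPLE_JOB_OUT)
print("exit status %s, memory used %s" % (record["exit_status"],
                                          record["mem_used"]))
```

Fields that are absent come back as None rather than raising an error, which keeps one malformed log from stopping a whole audit.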

UM option=

This includes the maximum vertical wind and its location (which can be obtained from the $CYLC_TASK_WORK_DIR/maxwind.dat file on raijin for recent cycles), and error stats for UM forecast tasks.

VAR option=

This retrieves the initial and final cost function values from VAR tasks, both low and high.

Time option=

This retrieves each task's submit, start and exit times, and calculates typical queue times, total cycle times, and typical cycle intervals.
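
The arithmetic behind these figures can be sketched as follows. The task names, timestamps and timestamp format are hypothetical, and the definitions used here (queue time as start minus submit, cycle time as last exit minus first submit) are assumptions about what the program reports rather than a copy of its logic.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S"

# Hypothetical events for one cycle: (task, submit, start, exit).
EVENTS = [
    ("um_fcst", "2016-06-01T00:00:10", "2016-06-01T00:05:10",
     "2016-06-01T00:35:10"),
    ("housekeep", "2016-06-01T00:36:00", "2016-06-01T00:36:30",
     "2016-06-01T00:37:00"),
]


def task_times(events):
    """Return per-task queue and run times in seconds."""
    rows = []
    for task, submit, start, end in events:
        t_submit = datetime.strptime(submit, FMT)
        t_start = datetime.strptime(start, FMT)
        t_end = datetime.strptime(end, FMT)
        rows.append({
            "task": task,
            "queue_s": (t_start - t_submit).total_seconds(),
            "run_s": (t_end - t_start).total_seconds(),
        })
    return rows


def cycle_time(events):
    """Return seconds from the first submit to the last exit in a cycle."""
    first_submit = min(datetime.strptime(e[1], FMT) for e in events)
    last_exit = max(datetime.strptime(e[3], FMT) for e in events)
    return (last_exit - first_submit).total_seconds()


for row in task_times(EVENTS):
    print("%s: queued %.0f s, ran %.0f s" % (row["task"], row["queue_s"],
                                             row["run_s"]))
print("cycle took %.0f s" % cycle_time(EVENTS))
```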

It is also possible to read the logs from any previous runs that have been gzipped, by using the -a or --doall flag. This is useful when log archiving is switched off, since archiving deletes the log files.

The information is written to a temporary file and then read into a pandas dataframe. Python compiles information for each task, each cycle, and different error types. The output includes: the number of each type of exit status; a list of the tasks that failed and what their exit status was; the total service units for each cycle; the walltime request and usage for each task; and the memory request and usage for each task.
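
The first three of these summaries can be sketched as below. The program itself uses a pandas dataframe; this example substitutes standard-library containers so that it runs without pandas, and the records, field names and the rule that a non-zero exit status means failure are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical per-task records, one per task per cycle.
RECORDS = [
    {"cycle": "20160601T0000Z", "task": "um_fcst",
     "exit_status": 0, "service_units": 12.5},
    {"cycle": "20160601T0000Z", "task": "var_anal",
     "exit_status": 271, "service_units": 3.0},
    {"cycle": "20160601T0600Z", "task": "um_fcst",
     "exit_status": 0, "service_units": 12.0},
]


def summarise(records):
    """Return exit status counts, failed tasks and service units per cycle."""
    # Number of each type of exit status.
    status_counts = Counter(r["exit_status"] for r in records)
    # Tasks that failed, assuming a non-zero exit status means failure.
    failed = [(r["cycle"], r["task"], r["exit_status"])
              for r in records if r["exit_status"] != 0]
    # Total service units for each cycle.
    su_per_cycle = defaultdict(float)
    for r in records:
        su_per_cycle[r["cycle"]] += r["service_units"]
    return status_counts, failed, dict(su_per_cycle)


counts, failed, su = summarise(RECORDS)
print("exit status counts: %s" % dict(counts))
print("failed tasks: %s" % failed)
print("service units per cycle: %s" % su)
```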

Output:
If possible, the output will be written to an excel file, with separate worksheets. Otherwise, the smaller tables (Exit status table, and service units) are written to standard output, and the rest are written to separate CSV text files.

To output to Excel, you need the xlsxwriter engine. To install it, load the appropriate python modules and then run e.g. $ easy_install --user xlsxwriter.
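
The fallback between Excel and CSV output can be sketched as follows. This is a simplified illustration rather than the program's code: it calls xlsxwriter directly instead of going through pandas, writes every table to CSV in the fallback case instead of printing the smaller ones, and the write_tables function and the suite_audit.xlsx file name are hypothetical.

```python
import csv
import os
import tempfile


def excel_available():
    """Return True if the xlsxwriter module can be imported."""
    try:
        import xlsxwriter  # noqa: F401
    except ImportError:
        return False
    return True


def write_tables(tables, outpath):
    """Write tables to outpath; tables maps a name to (header, rows)."""
    written = []
    if excel_available():
        import xlsxwriter
        # One workbook with a separate worksheet per table.
        path = os.path.join(outpath, "suite_audit.xlsx")
        workbook = xlsxwriter.Workbook(path)
        for name, (header, rows) in tables.items():
            sheet = workbook.add_worksheet(name)
            for i, row in enumerate([header] + rows):
                sheet.write_row(i, 0, row)
        workbook.close()
        written.append(path)
    else:
        # Fallback: one CSV text file per table.
        for name, (header, rows) in tables.items():
            path = os.path.join(outpath, name + ".csv")
            with open(path, "w", newline="") as handle:
                writer = csv.writer(handle)
                writer.writerow(header)
                writer.writerows(rows)
            written.append(path)
    return written


tables = {"service_units": (["cycle", "su"], [["20160601T0000Z", 15.5]])}
paths = write_tables(tables, tempfile.mkdtemp())
print("wrote %d file(s)" % len(paths))
```

The open call with newline="" is Python 3 syntax; under the python/2.7.6 module the CSV file would need to be opened in "wb" mode instead.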

Recommended application:
By analysing the output it is possible to check that resource requirements are appropriate for each task. Tasks that frequently fail will also be notable. The total service units might be useful for estimating the cost of an experiment. Analysis of data is best done manually with the extracted data, e.g. plotting in excel or other tool.
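
As one example of such a manual check, the sketch below compares memory used against memory requested and labels each task. It is a user-side illustration and not part of suite_audit: the size format (e.g. "4.5GB"), the function names and the 50% and 90% thresholds are all assumptions to be adjusted to taste.

```python
import re

# Conversion factors to megabytes for the unit suffixes assumed here.
UNITS = {
    "B": 1.0 / (1024 * 1024),
    "KB": 1.0 / 1024,
    "MB": 1.0,
    "GB": 1024.0,
    "TB": 1024.0 * 1024,
}


def to_mb(size):
    """Convert a size string such as '4.5GB' to megabytes."""
    match = re.match(r"^([\d.]+)\s*([KMGT]?B)$", size.strip(), re.IGNORECASE)
    if not match:
        raise ValueError("unrecognised size: %r" % size)
    return float(match.group(1)) * UNITS[match.group(2).upper()]


def classify(requested, used, low=0.5, high=0.9):
    """Label a request as 'tight', 'over-requested' or 'ok'."""
    ratio = to_mb(used) / to_mb(requested)
    if ratio > high:
        return "tight"
    if ratio < low:
        return "over-requested"
    return "ok"


print("um_fcst memory request is %s" % classify("32.0GB", "4.5GB"))
```

The same ratio test applies to walltime once the hh:mm:ss strings are converted to seconds.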

Limitations:
Most suites archive logs and housekeeping deletes them from the suite directory. Therefore only a few cycles may be available to the program. As APS3G does not yet archive logs to another location, the program doesn't yet read archived logs. A change to the file name or path specification may be required to change this.

This program only finds the information; it does not provide any analysis (which is best done by the user in e.g. Excel). The user must investigate the cause of any task failure that is not immediately obvious (an obvious case being, for example, memory exceeded). This program does not search for Fortran error messages, for example.

The program relies on fixed formatting, which makes it not portable or functional across different software versions. It may be possible to generalise it for wider usage.

Writing dates to excel appears to write the times incorrectly when using pandas v0.15.2.

Adding functionality:
To modify, the strings in grepout.ksh and readlogs.py's readit function need to be modified and matched to each other. There is also some sensitivity to the order in which lines are grepped; it is assumed that job.log is grepped first, and these lines will be read first by python.


Command to run from the command line (or put in a PBS script)

module load suite_audit
readlogs.py $HOME/cylc-run/u-aa670 -a -m t r u v

Sample script to run for development of a branch

#!/bin/bash
#PBS -l mem=500M
#PBS -l walltime=01:00:00
#PBS -q express

module load python/2.7.6
module load pythonlib/pandas/0.18.1

export PYTHONPATH=$PYTHONPATH:$HOME/utils/access/<branch>/suite_audit

python $HOME/utils/access/<branch>/suite_audit/readlogs.py $HOME/cylc-run/u-aa670 -a -m t r u v
Last modified on Jun 1, 2016 11:01:04 AM