Opened 3 years ago

Closed 3 years ago

#264 closed (fixed)

Testing accessdev capacity before training

Reported by: Martin Dix
Owned by: Martin Dix
Priority: major
Component: Accessdev Server
Keywords:
Cc: Michael Naughton

Description

The accessdev configuration is 4 CPUs and 16 GB of memory.

Currently there are about 25 suites running on accessdev and about 15 instances of cylc-gui. The system load average is around 0.2. These processes each have of the order of 100 MB RSS memory usage.
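For reference, figures like these can be gathered with standard tools (a sketch; the exact pattern matched depends on how the cylc and rose processes appear in ps):

uptime                                                               # load average
ps -eo pid,user,rss,args --sort=-rss | egrep 'cylc|rose' | head -30  # RSS is in kB
free -m                                                              # overall memory and swap headroom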

During training we may need to support of the order of 100 active suites and GUIs. Will the CPU and memory be adequate for this?

Use accessdev-test to test the effect of running lots of processes (it would need to be rebooted with the same specs as accessdev; it is currently 1 processor, 2 GB).

If necessary we could create extra instances (test1, test2, etc.) just for the training.

Change History (4)

comment:1 Changed 3 years ago by Martin Dix

The largest standard configuration available is 8 CPUs and 16 GB of memory:

[mrd599@cloudlogin ~]$ nova flavor-list
+----+----------------+-----------+------+-----------+------+-------+-------------+-----------+-------------+
| ID | Name           | Memory_MB | Disk | Ephemeral | Swap | VCPUs | RXTX_Factor | Is_Public | extra_specs |
+----+----------------+-----------+------+-----------+------+-------+-------------+-----------+-------------+
| 1  | m1.tiny        | 512       | 0    | 0         |      | 1     | 1.0         | True      | {}          |
| 10 | m1.large.2     | 8192      | 40   | 80        |      | 4     | 1.0         | True      | {}          |
| 11 | m1.medium.3    | 16384     | 80   | 0         |      | 2     | 1.0         | True      | {}          |
| 12 | m1.medium.4    | 16384     | 10   | 0         |      | 2     | 1.0         | True      | {}          |
| 13 | m1.xlarge2     | 16384     | 10   | 160       |      | 2     | 1.0         | True      | {}          |
| 14 | m1.large.2.r16 | 16384     | 40   | 80        |      | 4     | 1.0         | True      | {}          |
| 15 | m1.xlarge3     | 16384     | 160  | 0         |      | 2     | 1.0         | True      | {}          |
| 2  | m1.small       | 2048      | 20   | 0         |      | 1     | 1.0         | True      | {}          |
| 3  | m1.medium      | 4096      | 40   | 0         |      | 2     | 1.0         | True      | {}          |
| 4  | m1.large       | 8192      | 80   | 0         |      | 4     | 1.0         | True      | {}          |
| 5  | m1.xlarge      | 16384     | 160  | 0         |      | 8     | 1.0         | True      | {}          |
| 6  | win.medium     | 4096      | 40   | 20        |      | 2     | 1.0         | True      | {}          |
| 7  | win.large      | 8192      | 40   | 40        |      | 4     | 1.0         | True      | {}          |
| 8  | m1.small.2     | 2048      | 30   | 40        |      | 1     | 1.0         | True      | {}          |
| 9  | m1.medium.2    | 4096      | 40   | 40        |      | 2     | 1.0         | True      | {}          |
+----+----------------+-----------+------+-----------+------+-------+-------------+-----------+-------------+

comment:2 Changed 3 years ago by Martin Dix

Owner: set to Martin Dix
Status: new → assigned

Rebooted accessdev-test as m1.medium (2 cores and 4 GB).
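The usual way to move an existing instance to a different flavor is a nova resize (a sketch only; this comment doesn't record whether a resize or a fresh boot was used):

nova resize accessdev-test.nci.org.au m1.medium
nova resize-confirm accessdev-test.nci.org.au    # confirm once the instance is back up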

~mrd599/cylc_tests/local_cycle is a simple suite with 2 apps: a "model" that just sleeps, and a housekeeping task that cleans up the log files. Both run locally on accessdev (or accessdev-test). This should be more demanding than a simple suite that runs its tasks on raijin, because the tasks themselves also take some resources on the server (from rose and cylc rather than from the trivial task itself) in addition to the communication and control.
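In outline such a suite looks something like the following suite.rc (a sketch only, not the actual local_cycle source; the cycling interval, cycle points and scripts are assumptions, and in the real suite the two tasks are rose apps rather than inline scripts):

[cylc]
    UTC mode = True
[scheduling]
    initial cycle point = 20160101T0000Z
    final cycle point   = 20160103T1200Z    # roughly 60 hourly cycles
    [[dependencies]]
        [[[PT1H]]]
            # No clock triggers, so cycles run back to back as fast as
            # "model" finishes, i.e. one cycle per sleep period.
            graph = "model[-PT1H] => model => housekeep"
[runtime]
    [[model]]
        # Stand-in model: all the cost is in the cylc/rose job wrapper and
        # the suite communication, not the task itself.
        script = sleep 300
    [[housekeep]]
        # Illustrative clean-up of old job logs so the suite can keep cycling.
        script = """
            find "$HOME/cylc-run/$CYLC_SUITE_NAME/log/job" \
                -mindepth 1 -maxdepth 1 -type d -mmin +120 -exec rm -rf {} +
        """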

Decreasing the sleep time increases the CPU load.
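The copies can be launched in a loop along these lines (an assumed workflow, not the exact commands used; it relies on the --name and --no-gcontrol options of rose suite-run):

cd ~mrd599/cylc_tests/local_cycle
for i in $(seq 1 90); do
    # Each copy gets its own registration so the instances run independently.
    rose suite-run --name=local_cycle_$i --no-gcontrol -q
done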

accessdev-test was OK with 90 instances of this suite running, each using a 300 s sleep (note: no cylc GUIs). CPU usage was around 30-50%. However, with 100 suites things started to go wrong. Starting suites gave errors like

bash: fork: retry Resource temporarily unavailable

Task connections to the server also started failing with timeouts and resource errors.
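When errors like this appear, the obvious things to check are the per-user process limit and overall memory (standard commands; a sketch, since the actual limits on accessdev-test aren't recorded here):

ulimit -u                             # max user processes; fork failures often hit this first
ps -u $USER --no-headers | wc -l      # processes currently owned by this user
free -m                               # memory and swap headroom
dmesg | tail                          # any OOM-killer messages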

The states of some of these failed suites weren't shown properly in the cylc gsummary GUI at this stage.

Failed suites can be found by searching for job.err files bigger than one 512-byte block:

find ~/cylc-run -name job.err -size +1 -exec ls -l {} \;

90 suites with a 100 s sleep time gave a consistent near-100% CPU load (80% user, 20% system). With each suite running a couple of tasks per 100 s cycle, that is a task starting roughly every 0.5 s on the server. A few suites got stuck with tasks in the ready state, but most completed 60 steps correctly.

This suggests that the problem with 100 suites is memory rather than CPU.

My VNC session on accessdev hung with about 10 suites + GUIs, but this is a VNC problem rather than a server capacity problem.

Using a VNC session on the CSIRO machine ruby, I started suites with GUIs. Up to 40 suites things seemed OK and the GUIs were still responsive. With 50 suites I started to get memory allocation errors. CPU usage was near 100% but the system still felt reasonably responsive.

Summary: accessdev with 16 GB should have no trouble with 100 suites and GUIs.

Last edited 3 years ago by Martin Dix

comment:3 Changed 3 years ago by Martin Dix

During the hands-on training sessions, memory usage approached 16 GB and some processes failed to start or were killed.

There is now a new flavor, m1.large.2.r32, with 32 GB and 4 cores.

A new test instance was booted with:

./tools/nova-boot \
    --name accessdev-test.nci.org.au --ip 130.56.244.73 \
    --repo git@repos.nci.org.au:p/access.dev/puppet --branch master --install-updates \
    -- \
    --image centos-6.7-20160321 --flavor m1.large.2.r32 \
    --key-name mrd599 \
    --security-groups ssh,http,umui,ping

comment:4 Changed 3 years ago by Martin Dix

Resolution: fixed
Status: assigned → closed

accessdev now has 32 GB.
