From: Renfro, Michael (Renfro_at_tntech.edu)
Date: Mon Oct 16 2017 - 08:18:27 CDT
Two things I’ve found influenced benchmarking:
- model size: smaller models don’t provide enough compute work per step before results have to be communicated across cores and nodes
- network interconnect: on a modern Xeon system, gigabit Ethernet is a bottleneck, at least on large models (possibly all models)
I benchmarked a relatively similar system starting in July (Dell 730 and 6320 nodes, Infiniband, K80 GPUs in the 730 nodes). Results are at [1]. If I wasn’t using an ibverbs-smp build of NAMD and was instead using the regular TCP version, 2 nodes gave slower run times than 1. 20k-atom models topped out at around five 28-core nodes, while 3M-atom models kept getting better run times even out to 34 28-core nodes.
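For reference, the ibverbs-smp launch I mean looks something like the sketch below. The nodelist file name, binary path, and config file name are placeholders, and the core counts just assume your six 24-core CPU nodes; adjust for your actual setup:

    # nodelist.txt has one "host <nodename>" line per compute node
    # +p = total worker threads, ++ppn = worker threads per node process
    charmrun ++nodelist nodelist.txt +p 138 ++ppn 23 \
      /path/to/NAMD-ibverbs-smp/namd2 +setcpuaffinity mysim.conf > mysim.log

With an smp build, one core per node goes to the communication thread, which is why that sketch uses 23 workers per 24-core node.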
A 73k-atom system should certainly show a consistent speedup across your 6 nodes, though. A CUDA-enabled build showed a 3-5x speedup over a non-CUDA run in our tests, so 1-2 of your GPU nodes could run as fast as all your non-GPU nodes combined.
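If you want to try the GPU nodes, a CUDA-enabled build is launched the same way, with the GPUs named explicitly. The device numbers below just assume a two-GPU node, and the paths are placeholders:

    # one GPU node: bind worker threads and list the CUDA devices to use
    charmrun ++nodelist gpunodes.txt +p 23 ++ppn 23 \
      /path/to/NAMD-ibverbs-smp-CUDA/namd2 +setcpuaffinity +devices 0,1 mysim.conf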
So check your NAMD build features for ibverbs, and verify that your Infiniband fabric is working correctly; I used [2] for checking Infiniband, even though I’m not running Debian on my cluster.
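Two quick checks, assuming a stock OFED-style install (file names here are placeholders, and exact command availability can vary by distribution):

    # the NAMD startup log names the build platform; look for "verbs" here,
    # e.g. "Info: NAMD 2.12 for Linux-x86_64-verbs-smp"
    grep "^Info: NAMD" mysim.log

    # basic InfiniBand health: the port should be Active at the expected rate
    ibstat
    ibv_devinfo | grep -e state -e active_width -e active_speed

    # point-to-point bandwidth test between two nodes (perftest package):
    #   on node1:  ib_write_bw
    #   on node2:  ib_write_bw node1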
[1] https://its.tntech.edu/display/MON/HPC+Sample+Job%3A+NAMD
[2] https://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html
--
Mike Renfro / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Oct 16, 2017, at 1:20 AM, Rik Chakraborty <rik.chakraborty01_at_gmail.com> wrote:
>
> Dear NAMD experts,
>
> Recently, we have installed a new cluster with the following configuration:
>
> 1. Master node with storage node - DELL PowerEdge R730xd Server
> 2. CPU-only node - DELL PowerEdge R430 Server (6 nos.)
> 3. GPU node - DELL PowerEdge R730 Server (3 nos.)
> 4. 18-port Infiniband switch - Mellanox SX6015
> 5. 24-port Gigabit Ethernet switch - D-Link
>
> We have run a NAMD job on this cluster to check the efficiency in time with an increasing number of CPU nodes. Each CPU node has 24 processors. The details of the given system and the outcomes are listed below:
>
> 1. No. of atoms used: 73310
> 2. Total simulation time: 1 ns
> 3. Time step: 2 fs
>
> No. of nodes    Wall Clock Time (s)
> 1               27568.892578
> 2               28083.976562
> 3               30725.347656
> 4               33117.160156
> 5               35750.988281
> 6               39922.492188
>
> As we can see, wall clock time increases with the number of CPU nodes, which is not expected.
>
> So, this is my kind request to check this out and let me know about the problem.
>
> Thanking you,
>
> Rik Chakraborty
> Junior Research Fellow (Project)
> Dept. of Biological Sciences
> Indian Institute of Science Education and Research, Kolkata
> Mohanpur, Dist. Nadia
> Pin 721246
> West Bengal, India