The simulation performance obtained from NAMD depends on many factors.
The particular simulation protocol being run is one of the largest single
factors, since different simulation methods invoke different code paths
that can have substantially different performance costs, degrees of
parallel scalability, amounts of message passing activity, and use of
hardware acceleration through GPUs or CPU vectorization, all of which
contribute to overall NAMD performance.
When NAMD first starts running, it does significant I/O, FFT tuning,
GPU context setup, and other work that is unrelated to normal
simulation activity, so it is important to measure performance only
when NAMD has completed startup and all of the processing units are
running at full speed.
To measure NAMD performance accurately, run NAMD for at least 500 to
1,000 steps of normal dynamics (not minimization), so that load balancing
has a chance to take place several times and all of the CPUs and GPUs
have ramped up to 100% clock rate. NAMD provides ``Benchmark time:'' and
``TIMING:'' measurements in its output, which can be used for this purpose.
Here, we are only interested in the so-called wall clock time.
Modern GPUs are now so fast, especially in GPU-resident mode, that a run
much longer than 1,000 steps is needed for the hardware to reach a steady
thermal state and power utilization reflecting true long-term simulation
performance. The following configuration parameter can be used to run
dynamics for a specified number of seconds.
- benchmarkTime < perform benchmark for indicated number of seconds >
Acceptable Values: positive integer
Description: Use for benchmarking dynamics. Terminate simulation after
running for the indicated number of seconds, where the termination
condition is checked at the next ``TIMING:'' output.
A reasonable benchmark length might be in the range of 120 to 180 seconds
(two to three minutes) to provide enough time for the hardware to fully
warm up. If outputPerformance is left enabled (the default setting),
the final ``PERFORMANCE:'' line will show the average ns/day performance.
A good value for outputTiming might be 100, 500, or even 1000, depending
on the system size and hardware capability; as a rule of thumb,
performance monitoring should not be done more than about once per second
of wall time.
These practices are now the preferred way to benchmark GPU-accelerated
builds of NAMD, because the measurements shown by the
``Benchmark time:'' output lines are gathered too early in the simulation
to accurately reflect true performance.
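For example, a minimal benchmarking fragment of a NAMD configuration file
might contain settings along the following lines; the specific values
shown here are illustrative choices rather than recommendations for
every system:

    # run dynamics for about three minutes of wall time, then stop
    benchmarkTime      180
    # check timing (and the benchmark termination condition) every 500 steps
    outputTiming       500
    # keep energy output infrequent so it does not distort the measurement
    outputEnergies     500
    # leave enabled (default) so the final PERFORMANCE: line reports average ns/day
    outputPerformance  on
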
Aside from the choice of major simulation protocol and associated
methods in use, it is also important to consider the performance impacts
associated with routine NAMD configuration parameters such as those
that control the frequency of simulation informational outputs and
various types of I/O.
Simulation outputs such as energy information may require NAMD to perform
additional computations beyond the standard force evaluation.
We advise choosing simulation configuration parameters so that energies
(controlled by the outputEnergies parameter) are output only as often as
strictly necessary, since more frequent energy output slows down the
simulation due to the extra calculations it requires.
NAMD writes ``restart'' files so that simulations that were terminated
unexpectedly (for any reason) can be conveniently restarted from the
most recently written restart file. While it is desirable
to have a relatively recent restart point to continue from, writing
restart information costs NAMD extra network communication and disk I/O.
If restart files are written too frequently, this extra activity and I/O
will slow down the simulation. A reasonable approach is to choose the
restart frequency such that NAMD writes restart files
about once every ten minutes of wall clock time.
At such a rate, the extra work and I/O associated with writing
the restart files should remain an insignificant factor in NAMD performance.
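As a hedged worked example (the simulation rate and timestep below are
assumptions chosen only for illustration): a run achieving roughly
100 ns/day with a 2 fs timestep advances about 50 million steps per day,
or roughly 580 steps per second, so ten minutes of wall clock time
corresponds to approximately 350,000 steps, suggesting output settings
such as:

    # illustrative production-run settings, assuming ~100 ns/day at 2 fs/step
    outputEnergies    500
    # about one set of restart files every ten minutes of wall time at this rate
    restartfreq       350000
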
NAMD is provided in a variety of builds that support platform-specific
techniques such as CPU vectorization and GPU acceleration
to achieve higher arithmetic performance, thereby increasing
NAMD simulation throughput.
Whenever possible NAMD builds should be compiled such that
CPU vector instructions are enabled, and highly tuned
platform-specific NAMD code is employed for performance-critical
force computations.
The so-called ``SMP'' builds of NAMD benefit from reduced memory use
and can in many cases perform better overall, but one trade-off
is that the communication thread is unavailable for simulation work.
NAMD performance can be improved by explicitly setting CPU affinity
using the appropriate Charm++ command line flags, e.g.,
++ppn 7 +commap 0,8 +pemap 1-7,9-15.
It is often beneficial to reserve one CPU core for the
operating system, to prevent harmful operating system noise or ``jitter'',
particularly when running NAMD on large scale clusters or supercomputers.
The Cray aprun -r 1 command reserves the last CPU core and forces the
operating system to run on it.
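As a hedged illustration of such a launch on a single node with two
8-core sockets (the binary name, launcher, configuration file name, and
core numbering are assumptions that will vary by platform and build):
one SMP process per socket, with each communication thread pinned to the
first core of its socket and the worker threads to the remaining cores:

    # two processes, 7 worker (PE) threads each; comm threads on cores 0 and 8
    ./charmrun ++local ++n 2 ++ppn 7 ./namd3 +pemap 1-7,9-15 +commap 0,8 myconfig.namd
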
State-of-the-art compute-optimized GPU accelerators
can provide NAMD with simulation performance equivalent to
several CPU sockets (on the order of 100 CPU cores) when used to
greatest effect, e.g., when each GPU has sufficient work.
In general, effective GPU acceleration currently requires on the order
of 10,000 to 100,000 atoms per GPU, assuming a fast network interconnect.
NAMD GPU-offload mode requires several CPU cores to drive each GPU effectively,
ensuring that there is always work ready and available for the GPU.
For contemporary CPU and GPU hardware, the most productive ratios of
CPU core counts per GPU might range from 8:1 to 64:1 or higher depending on
the details of the hardware involved.
GPU-offload mode running on modern GPU hardware will be
bottlenecked by the CPU, which means that it is especially important
to follow NUMA domain mapping considerations and to generally avoid
running more than one hardware thread per core
(i.e., SMT 1, avoiding use of ``hyper-threading'').
An advantageous hardware configuration might have one NUMA domain
per GPU, which is used most effectively by scheduling one
SMP rank per GPU device / NUMA domain, again leaving at least
one core per NUMA domain available for the communication thread.
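A hedged sketch of such a mapping (the node topology, binary name,
launcher, and file name are assumed for illustration): on a node with two
16-core NUMA domains and one GPU attached to each, one might run one SMP
rank per GPU and NUMA domain, reserving the first core of each domain for
the communication thread:

    # one rank per NUMA domain/GPU: 15 worker threads each, comm threads on cores 0 and 16
    ./charmrun ++local ++n 2 ++ppn 15 ./namd3 +pemap 1-15,17-31 +commap 0,16 +devices 0,1 myconfig.namd
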
GPU-resident mode shifts almost all computational work to the GPU,
so it generally requires less CPU core support per device than
GPU-offload mode: the CPU is used primarily for GPU kernel management,
with infrequent bursts of activity related to atom migration and file I/O,
or possibly frequent bursts of activity if some host-based
computation such as Colvars is used.
GPU-resident mode uses shared-memory parallelism
but is capable of scaling across multiple GPUs on the same physical node
when the GPUs are connected by a high-speed, direct GPU-to-GPU interconnect,
such as NVLink or NVSwitch for NVIDIA or Infinity Fabric for AMD.
For standard simulations, CPU involvement is low enough
to be unaffected by NUMA domain issues.
However, heavier per-step use of the CPU cores,
such as using Colvars with GPU-resident mode,
might benefit from mapping device PEs to a shared NUMA domain.
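A hedged launch sketch for GPU-resident mode on a single node with four
GPUs (the binary name, thread count, device numbering, and file name are
assumptions, and the configuration file is assumed to enable GPU-resident
mode, e.g., with GPUresident on): a single-process multicore build is
used, with a modest number of CPU threads driving all four devices:

    # single process, 8 CPU threads, scaling across four GPUs on one node
    ./namd3 +p8 +setcpuaffinity +devices 0,1,2,3 myconfig.namd
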
When running NAMD on more than a single node, it is important to
use a NAMD version that is optimal for the underlying network hardware
and software you intend to run on. The Charm++ runtime system on which
NAMD is based supports a variety of underlying networks, so be sure to
select a NAMD/Charm++ build that is most directly suited for your
hardware platform. In general, we advise users to avoid MPI-based NAMD
builds, as they typically underperform compared with builds that target
a native network layer
such as InfiniBand IB verbs (often referred to as ``verbs''),
the UCX (Unified Communication X) framework (``ucx''),
the Cray-specific ``ofi-crayshasta'' (or ``ofi-crayshasta cxi'') layer
for Slingshot-11,
or the IBM PAMI message passing layer.
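As a hedged example of a multi-node launch with a verbs-based SMP build
(hostnames, core counts, binary and file names are placeholders): a
charmrun nodelist file listing the allocated hosts is used, with one
process per node:

    # nodelist file contents (one host per allocated node):
    #   group main
    #   host node001
    #   host node002
    ./charmrun ++nodelist nodelist ++n 2 ++ppn 15 ./namd3 +pemap 1-15 +commap 0 myconfig.namd

Sites using a batch scheduler typically generate such a nodelist from the
scheduler's environment or use the scheduler's own launcher (for example,
srun on Slurm-based systems), so the exact launch procedure should follow
your site's documentation.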