GPU Acceleration (NAMD 3.0 User's Guide)

Next: Xeon and Zen4 Acceleration Up: Running NAMD Previous: CPU Affinity Contents Index

Subsections

GPU Acceleration

NAMD supports GPU-accelerated calculations for NVIDIA GPUs using CUDA and AMD GPUs using HIP/ROCm. The ``classic'' mode of running NAMD with GPU acceleration offloads most of the force calculations to GPU devices (GPU-offload mode) and runs the other calculations (integration, rigid bond constraints, etc.) on the CPU. Version 3.0 of NAMD introduces a GPU-resident mode which calculates the entire dynamics calculations on GPU.

A port of NAMD's CUDA kernels to SYCL/oneAPI is in development to support Intel GPUs (e.g., ALCF Aurora). A source code preview release (version 2.15 alpha 2) providing SYCL support for GPU-offload mode is available on the NAMD website.

GPU-Offload Mode

For GPU-offload mode, NAMD does not offload the entire calculation to the GPU, and performance may therefore be limited by the CPU. In general all available CPU cores should be used, with CPU affinity set as described above.

Energy evaluation is slower than calculating forces alone, and the loss is much greater in CUDA-accelerated builds. Therefore you should set outputEnergies to 100 or higher in the simulation config file. Forces evaluated on the GPU differ slightly from a CPU-only calculation, an effect more visible in reported scalar pressure values than in energies.

NAMD now has the entire force calculation offloaded to GPU for conventional MD simulation options. However, not all advanced features are compatible with CUDA-accelerated NAMD builds, in particular, any simulation option that requires modification to the functional form of the nonbonded forces. Note that QM/MM simulation is also disabled for CUDA-accelerated NAMD, because the calculation is bottlenecked by the QM calculation rather than the MM force calculation, so can benefit from CUDA acceleration of the QM part when available. Table 1 lists the parts of NAMD that are accelerated with CUDA-capable GPUs, and Table 2 lists the advanced simulation options that are disabled within a CUDA-accelerated NAMD build.

**Table 1:** NAMD GPU-offload mode: What is accelerated?
Accelerated	Not Accelerated
short-range nonbonded	integration
PME reciprocal sum	rigid bonds
bonded terms	grid forces
implicit solvent	collective variables
alchemical (FEP and TI)

**Table 2:** NAMD GPU: What features are disabled?
Disabled	Not Disabled
Locally enhanced sampling	Memory optimized builds
Tabulated energies	Conformational free energy
Drude (nonbonded Thole)	Collective variables
Go forces	Grid forces
Pairwise interaction	Steering forces
Pressure profile	Almost everything else
QM/MM

GPU-Resident Mode

GPU-resident mode for MD simulation is a feature new to version 3.0. Unlike GPU-offload mode that offloads the force calculation to GPU devices while performing integration and rigid bond constraints on the host CPU, GPU-resident mode also performs integration and rigid bond constraints on the GPU device and, most importantly, maintains the simulation data on the GPU device between time steps. By removing the performance bottleneck resulting from host-device memory transfers and CPU kernel calculations performed at every time step, MD simulation performance on modern GPU hardware is more than doubled.

The new GPU-resident mode is also capable of scaling a simulation across multiple GPUs within a single node as long as the GPUs have peer access to directly read and write each others' local memories, typically configured when all devices are mutually connected by a high-speed networking fabric such as NVLink/NVSwitch for NVIDIA GPUs or Infinity Fabric for AMD GPUs. Representative configurations are found in desktop workstations with an NVLinked pair of NVIDIA GPUs, NVIDIA DGX computers, and the nodes of many supercomputers, e.g., OLCF Frontier (AMD), OLCF Summit (NVIDIA), ALCF Polaris (NVIDIA), and NERSC Perlmutter (NVIDIA).

GPU-resident mode for now exploits shared-memory parallelism, which means that it is limited to multicore and netlrts-smp builds, where the latter supports multiple replica GPU-resident simulations with each replica running as a single process.

Unlike GPU-offload mode which generally benefits from a larger number of CPU cores, GPU-resident mode offers limited host-side work requiring fewer CPU cores. In fact, attempting to run too many CPU cores with GPU-resident mode can slow down performance, due to the increased overhead of synchronizing those cores without enough work to reasonably utilize them. For GPU-resident mode, the number of CPU cores (PEs) to use depends on the size of the system, the features being used, the number of GPU devices used, and the relative performance of the CPU cores compared to the GPU devices. It is recommended to run benchmarks to determine the optimal core count for your hardware. The command benchmarkTime can be used with outputTiming to easily benchmark a system from the command line without needing to modify the config file. For example,

  ./namd3 +p4 +setcpuaffinity --outputTiming 500 --benchmarkTime 180 <configfile>

will terminate the simulation after running for three minutes (180 seconds), detected at the next output from outputTiming.

Since GPU-resident mode performs all calculation on the GPU device, advanced features must generally be supported by porting a given feature to the GPU. Several features have been ported to GPU-resident mode but many others still need to be ported. Some of these advanced features are available for multi-GPU scaling and others are single-GPU only. There are also some features that are now provided as GPU-resident-only high-performance alternatives to already existing GPU-offload features. Table 3 lists the features available to GPU-resident mode, indicating support for multi-GPU scaling and what methodologies are replaced, if any.

**Table 3:** NAMD GPU-resident mode: What features are supported?
Feature	Multi-GPU	Replacing
Essential dynamics	yes
4-site water models (TIP4P and OPC)	yes
Alchemical (FEP and TI)	yes
Multi-replica	yes
Replica exchange solute scaling (REST2)	yes
Harmonic restraints	no	Fixed atoms
External electric field	no
Monte Carlo barostat	no	Langevin piston
Group position restraints	no	Colvars distance restraints
Colvars	yes
TCL forces	yes

Essential dynamics includes the standard ensembles (constant energy, constant temperature with Langevin damping or stochastic velocity rescaling, and constant pressure and temperature with Langevin piston) together with rigid bond constraints, multiple time stepping, and PME electrostatics. Note that fixed atoms are not yet supported by GPU-resident, with harmonic restraints recommended as a workaround until support for fixed atoms is finished. The Monte Carlo barostat offers a faster pressure control method than Langevin piston, by avoiding calculation of the pressure virial at every step. Group position restraints is a NAMD-native GPU-resident implementation of Colvars distance restraints, providing much higher performance than Colvars. Both Colvars and TCL forces can be used with GPU-resident mode, but their use might significantly impact performance since either one requires host-device data transfer and CPU host calculations every step. The impact to performance depends on what collective variables have been defined and the number of atoms affected.

Whether or not to use multi-GPU scaling for a simulation depends on the size of the system and the capabilities of the GPU. For example, the 1M-atom STMV benchmark system gets reasonably good scaling efficiency across an 8-GPU NVIDIA DGX-A100. A reasonable rule of thumb seems to be around 100k atoms per GPU for the Ampere series of GPUs and twice that per GPU for Hopper.

Multi-GPU scaling performance for GPU-resident mode is significantly impacted by PME. The issue is that due to the difficulty of scaling the 3D FFT calculations, the long-range (gridded) part of PME is delegated to a single GPU, and NAMD's default work decomposition scheme, to evenly distribute patches to CPU cores and evenly distribute CPU cores to devices will naturally overload the PME device. The workaround is to use task-based parallelism to restrict the amount of ``standard'' work to the PME device. The approach exploits the existing load balancing performed by NAMD during its startup by simply reducing the number of PEs assigned to the PME device through the new ``+pmepes'' command line argument. Note that good load balancing should maintain the same number of PEs on the non-PME devices, which means that the overall number of PEs set by ``+p'' will necessarily be reduced. Setting this argument is best determined by benchmarking the given system on the intended hardware platform, which was done to determine optimal settings for the 1M-atom STMV benchmark system running on DGX-A100, using 8 PEs per device:

  ./namd3 +p8 +setcpuaffinity +devices 0 stmv.namd
  ./namd3 +p15 +pmepes 7 +setcpuaffinity +devices 0,1 stmv.namd
  ./namd3 +p29 +pmepes 5 +setcpuaffinity +devices 0,1,2,3 stmv.namd
  ./namd3 +p57 +pmepes 1 +setcpuaffinity +devices 0,1,2,3,4,5,6,7 stmv.namd

Since performance for MD exhibits predominantly linear scaling (up to reasonable size and resource utilization limits), the ratios shown above for STMV can be applied as a starting rule-of-thumb for other systems.

GPU-resident mode can also provide very fast simulation for smaller systems. For example, the AMBER DHFR (23.6k atoms) benchmark, using AMBER force field parameters with 9Å cutoff, PME, rigid bond constraints, and hydrogen mass repartitioning with 4fs time step, can be simulated on A100 with over 1 microsecond/day performance. When simulating smaller systems like DHFR, performance is improved by using twoAwayZ on to double the patch count, producing a greater number of work units to schedule across the GPU processing units.

Small systems will not achieve good scaling across multiple GPUs. Instead, the most effective way to use multi-GPU architectures is to simulate multi-copy ensembles. Depending on the size of the system and the hardware capability, GPU resources are often most efficiently used by running multiple simulations per GPU, in which performance can be measured as the aggregate number of simulated nanoseconds per day achieved. NVIDIA provides technologies MPS (Multi-Process Service) and MIG (Multi-Instance GPU) that can facilitate running multiple simultaneous NAMD jobs on a single GPU.

GPU Hardware Requirements

To benefit from GPU acceleration using NVIDIA GPU hardware you will need a CUDA build of NAMD and a recent NVIDIA video card. CUDA builds will not function without a CUDA-capable GPU and a driver that supports CUDA 9.1. If the installed driver is too old NAMD will exit on startup with the error ``CUDA driver version is insufficient for CUDA runtime version.'' GPUs of compute capability 5.0 are no longer supported and are ignored. GPUs with two or fewer SMs are ignored unless specifically requested with +devices.

Finally, if NAMD was not statically linked against the CUDA runtime then the libcudart.so file included with the binary (copied from the version of CUDA it was built with) must be in a directory in your LD_LIBRARY_PATH before any other libcudart.so libraries. For example, when running a multicore binary (recommended for a single machine):

  setenv LD_LIBRARY_PATH ".:$LD_LIBRARY_PATH"
  (or LD_LIBRARY_PATH=".:$LD_LIBRARY_PATH"; export LD_LIBRARY_PATH)
  ./namd3 +p8 +setcpuaffinity <configfile>

NAMD can be built to run with GPU acceleration on HIP-compatible AMD GPUs. Build instructions can be found in the NAMD distribution notes.txt file. For HIP builds, NAMD has been tested with ROCm 5.4.2 and 5.7.0, and the HIP builds maintain feature parity with CUDA builds.

Each NAMD thread can use only one GPU. Therefore you will need to run at least one thread for each GPU you want to use. Multiple threads in an SMP build of NAMD can share a single GPU, usually with an increase in performance. NAMD will automatically distribute threads equally among the GPUs on a node. Specific GPU device IDs can be requested via the +devices argument on the namd3 command line, for example:

  ./namd3 +p8 +setcpuaffinity +devices 0,2 <configfile>

Devices are shared by consecutive threads in a process, so in the above example threads 0-3 will share device 0 and threads 4-7 will share device 2. Repeating a device will cause it to be assigned to multiple master threads, which is allowed only for different processes and is advised against in general but may be faster in certain cases. When running on multiple nodes the +devices specification is applied to each physical node separately and there is no way to provide a unique list for each node.

When running a multi-node parallel job it is recommended to have one process per device to maximize the number of communication threads. If the job launch system enforces device segregation such that not all devices are visible to each process then the +ignoresharing argument must be used to disable the shared-device error message.

When running a multi-copy simulation with both multiple replicas and multiple devices per physical node, the +devicesperreplica n argument must be used to prevent each replica from binding all of the devices. For example, for 2 replicas per 6-device host use +devicesperreplica 3.

While charmrun with ++local will preserve LD_LIBRARY_PATH, normal charmrun does not. You can use charmrun ++runscript to add the namd3 directory to LD_LIBRARY_PATH with the following executable runscript:

  #!/bin/csh
  setenv LD_LIBRARY_PATH "${1:h}:$LD_LIBRARY_PATH"
  $*

For example:

  ./charmrun ++runscript ./runscript ++n 4 ./namd3 ++ppn 15 <configfile>

An InfiniBand network is highly recommended when running GPU-accelerated NAMD across multiple nodes. You will need either a verbs NAMD binary (available for download) or an MPI NAMD binary (must build Charm++ and NAMD as described above) to make use of the InfiniBand network. The use of SMP binaries is also recommended when running on multiple nodes, with one process per GPU and as many threads as available cores, reserving one core per process for the communication thread.

The CUDA (NVIDIA's graphics processor programming platform) code in NAMD is completely self-contained and does not use any of the CUDA support features in Charm++. When building NAMD with CUDA support you should use the same Charm++ you would use for a non-CUDA build. Do NOT add the cuda option to the Charm++ build command line. The only changes to the build process needed are to add -with-cuda for GPU-offload support or -with-single-node-cuda for GPU-resident support and possibly -cuda-prefix ... to the NAMD config command line.

NAMD can also be built with HIP/ROCm to support compatible AMD GPUs, otherwise matching all features available with CUDA builds. The build configuration options for HIP/ROCm are changed to -with-hip for GPU-offload support or -with-single-node-hip for GPU-resident support and -rocm-prefix ... to specify the library path.

For now, NAMD does not support all available features on GPU. Some keywords have been introduced to give the user better control over GPU calculation. These keywords are relevant only for GPU builds and are ignored if the user is running a CPU-only build.

Keywords

GPUresident Run dynamics calculations entirely on GPU
Acceptable Values: ``on'' or ``off''
Default Value: off
Description: GPU-resident enabled builds also require setting the GPUresident keyword in order to run dynamics calculations entirely on GPU. Without setting this keyword, GPU-enabled builds of NAMD will run in GPU-offload mode by default. Replaces the deprecated CUDASOAintegrate keyword.
GPUAtomMigration Perform atom migration on the device
Acceptable Values: ``on'' or ``off''
Default Value: off
Description: An experimental optimization for GPU-resident mode simulation that performs atom migration on the GPU device rather than the CPU host. Using this can give faster performance, especially for smaller systems, but this feature is still considered experimental. For smaller systems, performance can be further improved by also setting ``twoAwayZ on'' to double the number of patches, improving GPU utilization by increasing the work units to be scheduled across the device. With GPUAtomMigration enabled, the maximum patch size is limited to 2048 atoms; exceeding this limit will cause NAMD to terminate with an error message. If patch sizes are close to this limit, then restarting the simulation with ``twoAwayZ on'' will reduce the sizes by around one-half. Replaces the deprecated DeviceMigration keyword.
GPUForceTable Always use force table interpolation for nonbonded GPU kernel
Acceptable Values: ``on'' or ``off''
Default Value: on
Description: An experimental optimization for GPU-resident mode simulation. Setting ``GPUForceTable off'' will use direct math calculation for non-PME steps (i.e., for multiple time stepping when the error function is not being computed). For recent GPUs direct calculation is faster for certain cases than force table lookup and interpolation. Not all nonbonded kernel variants are supported, for example, the standard van der Waals switching function is implemented, but the ``vdwForceSwitching'' function is not. Similarly, alchemical free energy calculations with FEP and TI are not supported. Replaces the deprecated CUDAForceTable keyword.
bondedGPU 0 to 255
Acceptable Values: Integer value between 0 and 255
Default Value: 255
Description: NAMD provides GPU kernels for calculating six different bonded force terms. This parameter is irrelevant to GPU-resident mode, since all forces are calculated on the GPU. For GPU-offload mode, the bondedGPU parameter acts as a bit mask that can disable particular kernels. Any partial sum of the following values can be used to enable only the specified bonded terms:
- bonds 1
- angles 2
- dihedrals 4
- impropers 8
- exclusions 16
- crossterms 32
Replaces the deprecated bondedCUDA keyword.
usePMEGPU Offload entire PME reciprocal sum to GPU?
Acceptable Values: ``on" or ``off"
Default Value: on
Description: For GPU-resident mode, the entire PME reciprocal sum is automatically run on a single GPU, so it is not necessary to set this parameter. For GPU-offload mode, whether or not usePMEGPU is enabled depends on how NAMD is run. It is automatically enabled when running on a single node and is disabled when running across multiple nodes. Replaces the deprecated usePMECUDA keyword.
PMEoffload Offload PME gridding/ungridding procedures to GPU?
Acceptable Values: ``on" or ``off"
Default Value: off
Description: This parameter is irrelevant for GPU-resident mode. For GPU-offload mode, the gridding and ungridding procedures for calculating the PME reciprocal sum is offloaded to GPUs, with the FFT calculation still performed by CPUs. PMEoffload is enabled by default only for PMEinterpOrder 4.
For huge systems (10 million atoms and above) where the parallel FFT limits performance, it is desirable to use PMEoffload in conjunction with increased order interpolation and increased grid spacing, in order to decrease the overall communication latency by decreasing the overall grid size by a factor of 8 while maintaining the same accuracy for the calculation.

Exemplary use:
```
PME on 


PMEinterpOrder 8 


PMEgridSpacing 2.0 


PMEoffload on   ;# enabled by default for these PME settings
```

Next: Xeon and Zen4 Acceleration Up: Running NAMD Previous: CPU Affinity Contents Index

http://www.ks.uiuc.edu/Research/namd/