Re: Running NAMD on Forge (CUDA)

From: Gianluca Interlandi (gianluca_at_u.washington.edu)
Date: Thu Jul 12 2012 - 15:58:42 CDT

> What are your simulation parameters:
>
> timestep (and also any multistepping values)
2 fs, SHAKE, no multistepping

> cutoff (and also the pairlist and PME grid spacing)
8-10-12; PME grid spacing ~1 A
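
Spelled out, that corresponds roughly to the following NAMD config lines,
assuming the usual reading of 8-10-12 as switchdist/cutoff/pairlistdist
(these lines restate the parameters above; they are not a copy of my actual
input file):

    # 2 fs timestep, bonds involving hydrogen constrained, no multistepping
    timestep            2.0
    rigidBonds          all
    nonbondedFreq       1
    fullElectFrequency  1

    # 8-10-12 switching/cutoff/pairlist distances
    switching           on
    switchdist          8.0
    cutoff              10.0
    pairlistdist        12.0

    # PME with ~1 A grid spacing
    PME                 on
    PMEGridSpacing      1.0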

> Have you tried giving it just 1 or 2 GPUs alone (using the +devices)?

Yes, these are the benchmark times:

np 1: 0.48615 s/step
np 2: 0.26105 s/step
np 4: 0.14542 s/step
np 6: 0.10167 s/step
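
For what it's worth, that is a speedup of roughly 1.9x on 2 GPUs, 3.3x on 4
and 4.8x on 6, relative to a single GPU. The runs were launched with one
process per device, with commands of roughly this form (the config file name
here is just a placeholder):

    charmrun +p2 -machinefile $PBS_NODEFILE namd2 +idlepoll +devices 0,1 run.namd
    charmrun +p6 -machinefile $PBS_NODEFILE namd2 +idlepoll +devices 0,1,2,3,4,5 run.namd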

I am also posting part of the log from the run on 6 devices below (in case
it helps localize the problem):

Pe 4 has 57 local and 64 remote patches and 1066 local and 473 remote computes.
Pe 1 has 57 local and 65 remote patches and 1057 local and 482 remote computes.
Pe 5 has 57 local and 56 remote patches and 1150 local and 389 remote computes.
Pe 2 has 57 local and 57 remote patches and 1052 local and 487 remote computes.
Pe 3 has 58 local and 57 remote patches and 1079 local and 487 remote computes.
Pe 0 has 57 local and 57 remote patches and 1144 local and 395 remote computes.

Gianluca

> Gianluca
>
> On Thu, 12 Jul 2012, Aron Broom wrote:
>
> have you tried the multicore build?  I wonder if the prebuilt smp one is
> just not working for you.
>
> On Thu, Jul 12, 2012 at 3:21 PM, Gianluca Interlandi
> <gianluca_at_u.washington.edu> wrote:
>             are other people also using those GPUs?
>
>
> I don't think so since I reserved the entire node.
>
>       What are the benchmark timings that you are given after ~1000 steps?
>
>
> The benchmark time with 6 processes is 101 sec for 1000 steps. This is only
> slightly faster than Trestles where I get 109 sec for 1000 steps running on
> 16 CPUs. So, yes 6 GPUs on Forge are much faster than 6 cores on Trestles,
> but in terms of SUs it makes no difference, since on Forge I still have to
> reserve the entire node (16 cores).
>
> Gianluca
>
>       is some setup time.
>
>       I often run a system of ~100,000 atoms, and I generally see an order
>       of magnitude improvement in speed compared to the same number of
>       cores without the GPUs.  I would test the non-CUDA precompiled code
>       on your Forge system and see how that compares; it might be the
>       fault of something other than CUDA.
>
>       ~Aron
>
>       On Thu, Jul 12, 2012 at 2:41 PM, Gianluca Interlandi
>       <gianluca_at_u.washington.edu> wrote:
>             Hi Aron,
>
>             Thanks for the explanations. I don't know whether I'm doing
>             everything right. I don't see any speed advantage running on
>             the CUDA cluster (Forge) versus running on a non-CUDA cluster.
>
>             I did the following benchmarks on Forge (the system has
>             127,000 atoms and ran for 1000 steps):
>
>             np 1:  506 sec
>             np 2:  281 sec
>             np 4:  163 sec
>             np 6:  136 sec
>             np 12: 218 sec
>
>             On the other hand, running the same system on 16 cores of
>             Trestles (AMD Magny Cours) takes 129 sec. It seems that I'm
>             not really making good use of SUs by running on the CUDA
>             cluster. Or, maybe I'm doing something wrong? I'm using the
>             ibverbs-smp-CUDA pre-compiled version of NAMD 2.9.
>
>             Thanks,
>
>                  Gianluca
>
>             On Tue, 10 Jul 2012, Aron Broom wrote:
>
>                   if it is truly just one node, you can use the
>                   multicore-CUDA version and avoid the MPI charmrun
>                   stuff.  Still, it boils down to much the same thing I
>                   think.  If you do what you've done below, you are
>                   running one job with 12 CPU cores and all GPUs.  If you
>                   don't specify the +devices, NAMD will automatically
>                   find the available GPUs, so I think the main benefit of
>                   specifying them is when you are running more than one
>                   job and don't want the jobs sharing GPUs.
>
>                   I'm not sure you'll see great scaling across 6 GPUs for
>                   a single job, but that would be great if you did.
>
>                   ~Aron
>
>                   On Tue, Jul 10, 2012 at 1:14 PM, Gianluca Interlandi
>                   <gianluca_at_u.washington.edu> wrote:
>                         Hi,
>
>                         I have a question concerning running NAMD on a
>                         CUDA cluster.
>
>                         NCSA Forge has for example 6 CUDA devices and 16
>                         CPU cores per node. If I want to use all 6 CUDA
>                         devices in a node, how many processes is it
>                         recommended to spawn? Do I need to specify
>                         "+devices"?
>
>                         So, if for example I want to spawn 12 processes,
>                         do I need to specify:
>
>                         charmrun +p12 -machinefile $PBS_NODEFILE
>                         +devices 0,1,2,3,4,5 namd2 +idlepoll
>
>                         Thanks,
>
>                              Gianluca

-----------------------------------------------------
Gianluca Interlandi, PhD gianluca_at_u.washington.edu
                     +1 (206) 685 4435
                     http://artemide.bioeng.washington.edu/

Research Scientist at the Department of Bioengineering
at the University of Washington, Seattle WA U.S.A.
-----------------------------------------------------
