Re: Multi node run causes "CUDA error cudaStreamCreate"

From: Sergei (mce2000_at_mail.ru)
Date: Tue Jun 19 2012 - 08:32:39 CDT

Hi All!

I have the same problem as the one discussed about a year ago: an MPI+CUDA
NAMD build fails on multiple nodes. Everything works when all processes are
running on the same node, but something like

  CUDA error cudaStreamCreate on Pe 10 (node6-173-08 device 1):
  all CUDA-capable devices are busy or unavailable

appears as soon as more than one node is used (all nodes have the same
configuration, with two Tesla X2070 cards each). Processes are started via
slurm like

  sbatch -p gpu -n 8 ompi namd2 test.namd +idlepoll +devices 0,1

slurm is configured to run 8 processes per node, so specifying -n greater
than 8 (or, for example, -n 2 -N 2) triggers the error. It does not seem to
be an issue with CUDA itself (4.2 is used), since the same binary works fine
on a single node.
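
For completeness, I am going to re-check every node in the allocation (as was
suggested in the earlier thread) before blaming the binary. Something like the
following is what I have in mind; sbatch --wrap and srun are just my way of
reaching the compute nodes, since sbatch is my only entry point:

  # one task per node; -l prefixes each output line with the task rank
  sbatch -p gpu -N 2 --wrap "srun -N 2 -l nvidia-smi -L"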

Another (maybe related) strange fact is that a similar error arises if the
+devices option is omitted or set to 'all':

  CUDA error on Pe 0 (node6-170-15 device 0): All CUDA devices are in
  prohibited mode, of compute capability 1.0, or otherwise unusable.
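
Since that message mentions prohibited mode, I will also look at the compute
mode nvidia-smi reports on each node (again only a sketch, and the exact
field names differ a bit between driver versions):

  # pull the per-GPU compute mode out of the full query on both nodes
  sbatch -p gpu -N 2 --wrap "srun -N 2 -l nvidia-smi -q | grep -i 'compute mode'"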

The last suggestion in last year's thread was:

> You might want to try one of the released ibverbs-CUDA binaries
> (charmrun can use mpiexec to launch non-MPI binaries now). If that
> works then the problem is with your binary somehow.

Is there any way to run the ibverbs-CUDA NAMD binary without charmrun? Or
can I somehow 'mate' charmrun with slurm (the only way I can access the
cluster is through slurm's sbatch)?
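
Concretely, I imagine a job script along these lines (untested; the release
directory name, the exact task counts, and my reading that charmrun's
++mpiexec option will pick up mpiexec inside the slurm allocation are all
assumptions on my part):

  #!/bin/bash
  #SBATCH -p gpu
  #SBATCH -N 2
  #SBATCH --ntasks-per-node=8
  # charmrun from an ibverbs-CUDA release; ++mpiexec should make it launch
  # the non-MPI namd2 processes through mpiexec, which in turn takes its
  # node list from the slurm allocation
  DIR=~/NAMD_2.9_Linux-x86_64-ibverbs-CUDA
  $DIR/charmrun ++mpiexec +p16 $DIR/namd2 +idlepoll +devices 0,1 test.namd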

Thanks!

-- 
Sincerely,
  Sergei
> Hi Mike, 
> I see that this is an MPI build. When you say it works on one node is 
> that the same binary launched the same way with mpiexec? Did you run the 
> test on all three nodes you're trying to run on? The same goes for the 
> nvidia-smi tests Axel suggested - you need to test all of the nodes. 
> Since you're getting errors from multiple nodes it's also possible that 
> the LD_LIBRARY_PATH isn't being set or passed through mpiexec correctly 
> and you're getting a different cuda runtime library. 
> It looks like you're using charmrun to launch an MPI binary. I'm going to 
> assume that charmrun is a script that is calling mpiexec more or less 
> correctly since it appears to be launching correctly, but you might want 
> to try just using the mpiexec command directly as you would for any other 
> MPI program on your cluster. 
> Since the call that's triggering the error is actually the first CUDA 
> library call from a .cu file rather than a .C file it's also possible that 
> your nvcc, -I, and -L options are mismatched. This could happen if, for 
> example, you did a partial build, edited the arch/Linux-x86_64.cuda file, 
> and then finished the build without doing a make clean. 
> You might want to try one of the released ibverbs-CUDA binaries (charmrun 
> can use mpiexec to launch non-MPI binaries now). If that works then the 
> problem is with your binary somehow. 
> -Jim 
> On Fri, 1 Apr 2011, Michael S. Sellers (Cont, ARL/WMRD) wrote: 
>>>>>> All, 
>>>>>> 
>>>>>> I am receiving a "FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 
>>>>>> device 1): no CUDA-capable device is available" when NAMD starts up and 
>>>>>> is optimizing FFT steps, for a job running on 3 nodes, 4ppn, 2 Teslas per 
>>>>>> node. 
>>>>>> 
>>>>>> The command I'm executing within a PBS script is: 
>>>>>> ~/software/bin/charmrun +p12 ~/software/bin/namd2 +idlepoll sim1.conf > 
>>>>>> $PBS_JOBNAME.out 
>>>>>> 
>>>>>> NAMD CUDA does not give this error on 1 node, 8ppn, 2 Teslas. Please 
>>>>>> see output below. 
>>>>>> 
>>>>>> Might this be a situation where I need to use the +devices flag? It 
>>>>>> seems as though the PEs are binding to CUDA devices on other nodes. 
>>>>>> 
>>>>>> Thanks, 
>>>>>> 
>>>>>> Mike 
>>>>>> 
>>>>>> 
>>>>>> Charm++> Running on 3 unique compute nodes (8-way SMP). 
>>>>>> Charm++> cpu topology info is gathered in 0.203 seconds. 
>>>>>> Info: NAMD CVS-2011-03-22 for Linux-x86_64-MPI-CUDA 
>>>>>> Info: 
>>>>>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/ 
>>>>>> Info: for updates, documentation, and support information. 
>>>>>> Info: 
>>>>>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005) 
>>>>>> Info: in all publications reporting results obtained with NAMD. 
>>>>>> Info: 
>>>>>> Info: Based on Charm++/Converse 60303 for mpi-linux-x86_64 
>>>>>> Info: 1 NAMD CVS-2011-03-22 Linux-x86_64-MPI-CUDA 
>>>>>> Info: Running on 12 processors, 12 nodes, 3 physical nodes. 
>>>>>> Info: CPU topology information available. 
>>>>>> Info: Charm++/Converse parallel runtime startup completed at 0.204571 s 
>>>>>> Pe 2 sharing CUDA device 0 first 0 next 0 
>>>>>> Did not find +devices i,j,k,... argument, using all 
>>>>>> Pe 2 physical rank 2 binding to CUDA device 0 on n2: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Pe 3 sharing CUDA device 1 first 1 next 1 
>>>>>> Pe 3 physical rank 3 binding to CUDA device 1 on n2: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Pe 0 sharing CUDA device 0 first 0 next 2 
>>>>>> Pe 0 physical rank 0 binding to CUDA device 0 on n2: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Pe 9 sharing CUDA device 1 first 9 next 11 
>>>>>> Pe 7 sharing CUDA device 1 first 5 next 5 
>>>>>> Pe 5 sharing CUDA device 1 first 5 next 7 
>>>>>> Pe 9 physical rank 1 binding to CUDA device 1 on n0: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Pe 7 physical rank 3 binding to CUDA device 1 on n1: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Pe 5 physical rank 1 binding to CUDA device 1 on n1: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Pe 10 sharing CUDA device 0 first 8 next 8 
>>>>>> Pe 11 sharing CUDA device 1 first 9 next 9 
>>>>>> Pe 8 sharing CUDA device 0 first 8 next 10 
>>>>>> Pe 11 physical rank 3 binding to CUDA device 1 on n0: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Pe 10 physical rank 2 binding to CUDA device 0 on n0: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Pe 8 physical rank 0 binding to CUDA device 0 on n0: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Pe 6 sharing CUDA device 0 first 4 next 4 
>>>>>> Pe 6 physical rank 2 binding to CUDA device 0 on n1: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Pe 1 sharing CUDA device 1 first 1 next 3 
>>>>>> Pe 1 physical rank 1 binding to CUDA device 1 on n2: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Pe 4 sharing CUDA device 0 first 4 next 6 
>>>>>> Pe 4 physical rank 0 binding to CUDA device 0 on n1: 'Tesla T10 Processor' Mem: 4095MB Rev: 1.3 
>>>>>> Info: 51.4492 MB of memory in use based on /proc/self/stat 
>>>>>> ... 
>>>>>> ... 
>>>>>> Info: PME MAXIMUM GRID SPACING 1.5 
>>>>>> Info: Attempting to read FFTW data from 
>>>>>> FFTW_NAMD_CVS-2011-03-22_Linux-x86_64-MPI-CUDA.txt 
>>>>>> Info: Optimizing 6 FFT steps. 1...FATAL ERROR: CUDA error cudaStreamCreate 
>>>>>> on Pe 7 (n1 device 1): no CUDA-capable device is available 
>>>>>> ------------- Processor 7 Exiting: Called CmiAbort ------------ 
>>>>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 7 (n1 device 1): 
>>>>>> no CUDA-capable device is available 
>>>>>> 
>>>>>> [7] Stack Traceback: 
>>>>>> [7:0] CmiAbort+0x59 [0x907f64] 
>>>>>> [7:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba] 
>>>>>> [7:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f] 
>>>>>> [7:3] _Z15cuda_initializev+0x2a7 [0x624e27] 
>>>>>> [7:4] _Z11master_initiPPc+0x1a1 [0x500a11] 
>>>>>> [7:5] main+0x19 [0x4fd489] 
>>>>>> [7:6] __libc_start_main+0xf4 [0x32ca41d994] 
>>>>>> [7:7] cos+0x1d1 [0x4f9d99] 
>>>>>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): no 
>>>>>> CUDA-capable device is available 
>>>>>> ------------- Processor 9 Exiting: Called CmiAbort ------------ 
>>>>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 9 (n0 device 1): 
>>>>>> no CUDA-capable device is available 
>>>>>> 
>>>>>> [9] Stack Traceback: 
>>>>>> [9:0] CmiAbort+0x59 [0x907f64] 
>>>>>> [9:1] _Z8NAMD_diePKc+0x4a [0x4fa7ba] 
>>>>>> [9:2] _Z13cuda_errcheckPKc+0xdf [0x624b5f] 
>>>>>> [9:3] _Z15cuda_initializev+0x2a7 [0x624e27] 
>>>>>> [9:4] _Z11master_initiPPc+0x1a1 [0x500a11] 
>>>>>> [9:5] main+0x19 [0x4fd489] 
>>>>>> [9:6] __libc_start_main+0xf4 [0x32ca41d994] 
>>>>>> [9:7] cos+0x1d1 [0x4f9d99] 
>>>>>> FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): no 
>>>>>> CUDA-capable device is available 
>>>>>> ------------- Processor 5 Exiting: Called CmiAbort ------------ 
>>>>>> Reason: FATAL ERROR: CUDA error cudaStreamCreate on Pe 5 (n1 device 1): 
>>>>>> no CUDA-capable device is available 
