Fwd: CUDA error cudaStreamCreate. SOLVED (probably)

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Wed Jun 15 2011 - 02:45:26 CDT

THIS MAY BE OF INTEREST TO NAMD/DEBIAN USERS. HOWEVER, THE QUESTION AT
THE END (IN UPPERCASE) IS DIRECTED SPECIFICALLY TO NAMD

Following suggestions by Lennart Sorensen at "amd64_at_lists.debian.org",
my problem was a leftover nvidia driver module at
/lib/modules/2.6.38-2-amd64/updates/dkms/, which prevented DKMS from
rebuilding it. With the two commands below, the correct driver, dated
15 June 2011, was built against my Linux headers.

# apt-get remove nvidia-kernel-dkms    (which also removes nvidia.ko)

# apt-get install nvidia-kernel-dkms
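
To verify that DKMS really rebuilt the module for the running kernel,
the vermagic string reported by modinfo should match uname -r:

# dkms status
# modinfo nvidia | grep vermagic
$ uname -r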

Debian amd64 wheezy packages installed were:

gcc-4.4, gcc-4.5, gcc-4.6
libcuda1 270.41.19-1
libgl1-nvidia-glx 270.41.19-1
libnvidia-ml1 270.41.19-1
linux-headers-2.6-amd64 (2.6.38+34)
linux-headers-2.6.38-2-amd64 (2.6.38-5)
linux-headers-2.6.38-2-common (2.6.38-5)
linux-image-2.6-amd64 (2.6.38+34)
linux-image-2.6.38-2-amd64 (2.6.38-5)
linux-kbuild-2.6.38 (2.6.38-1)
nvidia-cuda-dev 3.2.16-2
nvidia-cuda-toolkit 3.2.16-2
nvidia-glx 270.41.19-1
nvidia-installer-cleanup 20110515+1
nvidia-kernel-common 20110515+1
nvidia-kernel-dkms 270.41.19-1
nvidia-smi 270.41.19-1
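
A list like this can be reproduced with a dpkg query along these
lines:

$ dpkg -l | egrep 'nvidia|cuda|linux-(headers|image|kbuild)'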

Now:

$ nvidia-smi -L
GPU 0: GeForce GTX 470 (UUID: N/A)
GPU 1: GeForce GTX 470 (UUID: N/A)

# modinfo nvidia
filename: /lib/modules/2.6.38-2-amd64/updates/dkms/nvidia.ko
alias: char-major-195-*
supported: external
license: NVIDIA
alias: pci:v000010DEd00000E00sv*sd*bc04sc80i00*
alias: pci:v000010DEd00000AA3sv*sd*bc0Bsc40i00*
alias: pci:v000010DEd*sv*sd*bc03sc02i00*
alias: pci:v000010DEd*sv*sd*bc03sc00i00*
depends: i2c-core
vermagic: 2.6.38-2-amd64 SMP mod_unload modversions
parm: NVreg_EnableVia4x:int
parm: NVreg_EnableALiAGP:int
parm: NVreg_ReqAGPRate:int
parm: NVreg_EnableAGPSBA:int
parm: NVreg_EnableAGPFW:int
parm: NVreg_Mobile:int
parm: NVreg_ResmanDebugLevel:int
parm: NVreg_RmLogonRC:int
parm: NVreg_ModifyDeviceFiles:int
parm: NVreg_DeviceFileUID:int
parm: NVreg_DeviceFileGID:int
parm: NVreg_DeviceFileMode:int
parm: NVreg_RemapLimit:int
parm: NVreg_UpdateMemoryTypes:int
parm: NVreg_InitializeSystemMemoryAllocations:int
parm: NVreg_UseVBios:int
parm: NVreg_RMEdgeIntrCheck:int
parm: NVreg_UsePageAttributeTable:int
parm: NVreg_EnableMSI:int
parm: NVreg_MapRegistersEarly:int
parm: NVreg_RegisterForACPIEvents:int
parm: NVreg_RegistryDwords:charp
parm: NVreg_RmMsg:charp
parm: NVreg_NvAGP:int

With these settings, the NAMD simulation

charmrun $NAMD_HOME/bin/namd2 ++local +p6 +idlepoll ++verbose \
filename.conf 2>&1 | tee filename.log

(binary NAMD_CVS-2011-06-04_Linux-x86_64-CUDA.tar.gz) started
correctly, used both GTX 470 cards, and ran overnight.
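
For future runs, the log below shows that an explicit +devices
argument is accepted, so the two cards could presumably be pinned
with something like:

charmrun $NAMD_HOME/bin/namd2 ++local +p6 +idlepoll ++verbose \
+devices 0,1 filename.conf 2>&1 | tee filename.log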

This morning, a second run to continue the previous pressure
equilibration (launched with commands recalled from the console
history; the machine has only an X server, no desktop, and the X
server had not been started) failed to start, with this log:

Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
Info: 1 NAMD CVS-2011-06-04 Linux-x86_64-CUDA 6 gig64 francesco
Info: Running on 6 processors, 6 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.00989103 s
Pe 2 sharing CUDA device 0 first 0 next 3
Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'Device
Emulation (CPU)' Mem: 0MB Rev: 9999.9999
FATAL ERROR: CUDA error cudaStreamCreate on Pe 2 (gig64 device 0): no
CUDA-capable device is available

where 'Device Emulation (CPU)', instead of GeForce GTX 470, indicates
the failure. After running some informational commands (nvidia-smi and
modinfo, as above), a second attempt at the NAMD simulation started
regularly:

Info: Based on Charm++/Converse 60303 for net-linux-x86_64-iccstatic
Info: Built Sat Jun 4 02:22:51 CDT 2011 by jim on lisboa.ks.uiuc.edu
Info: 1 NAMD CVS-2011-06-04 Linux-x86_64-CUDA 6 gig64 francesco
Info: Running on 6 processors, 6 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.00345588 s
Did not find +devices i,j,k,... argument, using all
Pe 0 sharing CUDA device 0 first 0 next 2
Pe 1 sharing CUDA device 1 first 1 next 3
Pe 1 physical rank 1 binding to CUDA device 1 on gig64: 'GeForce GTX
470' Mem: 1279MB Rev: 2.0
Pe 0 physical rank 0 binding to CUDA device 0 on gig64: 'GeForce GTX
470' Mem: 1279MB Rev: 2.0
Pe 3 sharing CUDA device 1 first 1 next 5
Pe 2 sharing CUDA device 0 first 0 next 4
Pe 3 physical rank 3 binding to CUDA device 1 on gig64: 'GeForce GTX
470' Mem: 1279MB Rev: 2.0
Pe 5 sharing CUDA device 1 first 1 next 1
Pe 2 physical rank 2 binding to CUDA device 0 on gig64: 'GeForce GTX
470' Mem: 1279MB Rev: 2.0
Pe 5 physical rank 5 binding to CUDA device 1 on gig64: 'GeForce GTX
470' Mem: 1279MB Rev: 2.0
Pe 4 sharing CUDA device 0 first 0 next 0
Pe 4 physical rank 4 binding to CUDA device 0 on gig64: 'GeForce GTX
470' Mem: 1279MB Rev: 2.0
Info: 1.64104 MB of memory in use based on CmiMemoryUsage
Info: Configuration file is press-04.conf
Info: Working in the current directory
/home/francesco/3b.complex_press04_NAF++/mod1.4
TCL: Suspending until startup complete.
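
A plausible explanation for the recovery: the /dev/nvidia* device
files are created when the driver is initialized, e.g. by nvidia-smi
run as root (see also the NVreg_ModifyDeviceFiles parameter in the
modinfo output above), and on this headless box nothing had
initialized the driver before the first run. NVIDIA's Linux driver
README carries a boot-time snippet for exactly this case; a sketch
adapted from it, untested here:

#!/bin/sh
# Load the module, then create one device node per GPU plus the
# control node; major number 195 per the char-major-195-* alias above.
/sbin/modprobe nvidia || exit 1
NVDEVS=$(lspci | grep -i nvidia)
N3D=$(echo "$NVDEVS" | grep -c "3D controller")
NVGA=$(echo "$NVDEVS" | grep -c "VGA compatible controller")
N=$(expr $N3D + $NVGA - 1)
for i in $(seq 0 $N); do
    mknod -m 666 /dev/nvidia$i c 195 $i
done
mknod -m 666 /dev/nvidiactl c 195 255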

QUESTION TO NAMD:
what does 'Device Emulation (CPU)' in the log output "Pe 2 physical
rank 2 binding to CUDA device 0 on gig64: 'Device Emulation (CPU)'
Mem: 0MB Rev: 9999.9999" mean? I don't understand what is going wrong
there.
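
The only hint I can offer myself is that the CUDA runtime seems to
fall back to this placeholder (0MB, revision 9999.9999) when it cannot
open a real device, which would again point at the /dev/nvidia* files;
their presence can be checked before launching:

$ ls -l /dev/nvidia0 /dev/nvidia1 /dev/nvidiactl

(all three should exist as character devices with major number 195)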

Thanks a lot
francesco pietra

---------- Forwarded message ----------
From: Francesco Pietra <chiendarret_at_gmail.com>
Date: Tue, Jun 14, 2011 at 6:45 PM
Subject: Re: namd-l: cuda error cudastreamcreate
To: Jim Phillips <jim_at_ks.uiuc.edu>

On Tue, Jun 14, 2011 at 6:02 PM, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
> On Tue, 14 Jun 2011, Francesco Pietra wrote:
>
>> nvidia-smi -r (or nvidia-smi -a)
>> NVIDIA: could not open the device file /dev/nvidia1 (no such file)
>> Failed to initialize NVML: unknown error.
>>
>> If "nvidia-smi" is for Tesla only, how to check GTX 470?
>
> It's not Tesla-only (see tests below).  -Jim
>
> jim_at_lisboa>nvidia-smi -L
> GPU 0: GeForce GTX 285 (UUID: N/A)
>
> jim_at_aberdeen>nvidia-smi -L
> GPU 0: Tesla C870 (UUID:
> GPU-798dee8502c5e13c-7dd72cfe-6069e259-8fd36a96-5163bf00fbbcb8e9f61eda54)
> GPU 1: Tesla C870 (UUID:
> GPU-ed96e9c4afb70d35-694f6869-981de52a-23e64327-917becef3aa20bfd0d66432c)
> GPU 2: GeForce 9800 GTX/9800 GTX+ (UUID: N/A)

It does not work with my installation:

$ which nvidia-smi
/usr/bin/nvidia-smi

$ nvidia-smi -L (or any other option of this command)
could not open device file /dev/nvidiactl (no such device or address).

I am using the Debian installation of nvidia.ko. I wonder whether it
would be better for me to switch to the NVIDIA installation directions
suggested by Ajasja. However, Debian Linux is not mentioned there;
Ubuntu is similar, but only as far as the commands are concerned.

Well, it is becoming painful.

francesco
