From: John Stone (johns_at_ks.uiuc.edu)
Date: Wed May 04 2022 - 23:16:21 CDT

Hi Chris,
  It looks to me like your batch system is launching VMD as a normal
executable rather than via 'mpirun', 'srun', or a similar launcher, which is
normally required for VMD's MPI initialization to complete correctly.

What specific MPI distribution are you using on your system(s)?
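
As a quick sanity check, you could also verify how SLURM is handing MPI rank
information to the tasks it launches. A rough sketch (the exact --mpi plugin
name depends on how your MPICH and SLURM were built, so treat these as
illustrative rather than definitive):

   # list the PMI plugins your SLURM installation supports
   srun --mpi=list

   # launch VMD with an explicit PMI plugin, e.g. pmi2
   srun --mpi=pmi2 vmd -dispdev text -e noderank-parallel-render.tcl

   # or bypass srun and use the launcher that shipped with your MPI
   mpirun -np $SLURM_NTASKS vmd -dispdev text -e noderank-parallel-render.tcl

If each of these still reports a node count of 1 per process, the MPI library
VMD was linked against and the one SLURM integrates with are likely mismatched.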

I can tell that something is very wrong because the node count is reported
as 1 for every process, and you're getting duplicate startup banners.
When VMD is launched correctly for MPI, you should see startup messages
like the example below (note that the banner is printed only once, and the
MPI startup message lists the correct node count and node indices). This is
an old log file I had sitting around, but nothing should have changed in the
MPI startup code since then:

Info) VMD for BLUEWATERS, version 1.9.4a10 (November 9, 2017)
Info) http://www.ks.uiuc.edu/Research/vmd/
Info) Email questions and bug reports to vmd_at_ks.uiuc.edu
Info) Please include this reference in published work using VMD:
Info) Humphrey, W., Dalke, A. and Schulten, K., `VMD - Visual
Info) Molecular Dynamics', J. Molec. Graphics 1996, 14.1, 33-38.
Info) -------------------------------------------------------------
Info) Creating CUDA device pool and initializing hardware...
Info) Initializing parallel VMD instances via MPI...
Info) Found 256 VMD MPI nodes containing a total of 4096 CPUs and 256 GPUs:
Info) 0: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid11518
Info) 1: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid11519
Info) 2: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid11516
Info) 3: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid11517
Info) 4: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid11514

[....]

Info) 249: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid07021
Info) 250: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid07022
Info) 251: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid07023
Info) 252: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid06814
Info) 253: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid06815
Info) 254: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid06812
Info) 255: 16 CPUs, 28.0GB (88%) free mem, 1 GPUs, Name: nid06813
Info) Using plugin js for structure file cone-protein.js
Info) Using plugin js for coordinates from file cone-protein.js
Info) Finished with coordinate file cone-protein.js.
Info) Analyzing structure ...
Info) Atoms: 2440800
Info) Bonds: 2497752
Info) Angles: 3392712 Dihedrals: 4036812 Impropers: 452904 Cross-terms: 310524
Info) Bondtypes: 0 Angletypes: 0 Dihedraltypes: 0 Impropertypes: 0
Info) Residues: 313236
Info) Waters: 0
Info) Segments: 1356

I can double-check a current build on ORNL Summit, but it uses the IBM
scheduler with 'jsrun', which is different from SLURM and 'srun'.
I may also try CSCS Piz Daint, which is a SLURM system, just to
validate that it at least works for me.
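
For reference, the two launch styles look roughly like this (the resource
flags below are placeholders, not values taken from your job):

   # SLURM (Piz Daint style): one task per node
   srun -N 4 --ntasks-per-node=1 vmd -dispdev text -e script.tcl

   # IBM LSF (Summit style): one resource set per host, 1 GPU each
   jsrun -n 4 -r 1 -a 1 -c 16 -g 1 vmd -dispdev text -e script.tcl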

Best,
  John

On Wed, May 04, 2022 at 08:46:28PM -0700, Chris Taylor wrote:
>
> > On May 4, 2022 1:11 PM John Stone <johns_at_ks.uiuc.edu> wrote:
> >
> >
> > Chris,
> > Can you show me your VMD startup messages? It should help clarify
> > if your binary was properly compiled and how MPI startup proceeded.
>
> Thank you. I hope this isn't too hard to read. I have this sbatch script:
>
> $ cat vmd-text.sbatch
> #!/bin/bash
>
> #SBATCH --nodes=3
> #SBATCH --ntasks-per-node=1
> #SBATCH --cpus-per-task=4
>
> module load ffmpeg
> module load ospray
> module load mpich
> module load vmd-mpi
>
> srun vmd -dispdev text -e noderank-parallel-render.tcl
>
> When I run it:
>
> $ cat slurm-101.out
> Info) VMD for LINUXAMD64, version 1.9.4a55 (April 12, 2022)
> Info) http://www.ks.uiuc.edu/Research/vmd/
> Info) Email questions and bug reports to vmd_at_ks.uiuc.edu
> Info) Please include this reference in published work using VMD:
> Info) Humphrey, W., Dalke, A. and Schulten, K., `VMD - Visual
> Info) Molecular Dynamics', J. Molec. Graphics 1996, 14.1, 33-38.
> Info) -------------------------------------------------------------
> Info) VMD for LINUXAMD64, version 1.9.4a55 (April 12, 2022)
> Info) http://www.ks.uiuc.edu/Research/vmd/
> Info) Email questions and bug reports to vmd_at_ks.uiuc.edu
> Info) Please include this reference in published work using VMD:
> Info) Humphrey, W., Dalke, A. and Schulten, K., `VMD - Visual
> Info) Molecular Dynamics', J. Molec. Graphics 1996, 14.1, 33-38.
> Info) -------------------------------------------------------------
> Info) Initializing parallel VMD instances via MPI...
> Info) Found 1 VMD MPI node containing a total of 4 CPUs and 0 GPUs:
> Info) 0: 4 CPUs, 15.0GB (96%) free mem, 0 GPUs, Name: node001
> Info) VMD for LINUXAMD64, version 1.9.4a55 (April 12, 2022)
> Info) http://www.ks.uiuc.edu/Research/vmd/
> Info) Email questions and bug reports to vmd_at_ks.uiuc.edu
> Info) Please include this reference in published work using VMD:
> Info) Humphrey, W., Dalke, A. and Schulten, K., `VMD - Visual
> Info) Molecular Dynamics', J. Molec. Graphics 1996, 14.1, 33-38.
> Info) -------------------------------------------------------------
> Info) Initializing parallel VMD instances via MPI...
> Info) Found 1 VMD MPI node containing a total of 4 CPUs and 0 GPUs:
> Info) 0: 4 CPUs, 15.0GB (96%) free mem, 0 GPUs, Name: node002
> Info) Initializing parallel VMD instances via MPI...
> Info) Found 1 VMD MPI node containing a total of 4 CPUs and 0 GPUs:
> Info) 0: 4 CPUs, 15.0GB (96%) free mem, 0 GPUs, Name: node003
> 0
> 0
> 1
> Starting, nodecount is 1
> 1
> Starting, nodecount is 1
> node 0 is running ...
> node 0 is running ...
> 0
> 1
> Starting, nodecount is 1
> node 0 is running ...
> Info) Using plugin pdb for structure file 5ire_merged.pdb
> Info) Using plugin pdb for structure file 5ire_merged.pdb
> Info) Using plugin pdb for structure file 5ire_merged.pdb
> Info) Using plugin pdb for coordinates from file 5ire_merged.pdb
> Info) Using plugin pdb for coordinates from file 5ire_merged.pdb
> Info) Using plugin pdb for coordinates from file 5ire_merged.pdb
> Info) Determining bond structure from distance search ...
> Info) Determining bond structure from distance search ...
> Info) Determining bond structure from distance search ...
> Info) Finished with coordinate file 5ire_merged.pdb.
> Info) Analyzing structure ...
> Info) Atoms: 1578060
> Info) Bonds: 1690676
> Info) Angles: 0 Dihedrals: 0 Impropers: 0 Cross-terms: 0
> Info) Bondtypes: 0 Angletypes: 0 Dihedraltypes: 0 Impropertypes: 0
> Info) Residues: 104040
> Info) Waters: 0
> Info) Finished with coordinate file 5ire_merged.pdb.
> Info) Analyzing structure ...
> Info) Atoms: 1578060
> Info) Finished with coordinate file 5ire_merged.pdb.
> Info) Analyzing structure ...
> Info) Atoms: 1578060
> Info) Bonds: 1690676
> Info) Angles: 0 Dihedrals: 0 Impropers: 0 Cross-terms: 0
> Info) Bondtypes: 0 Angletypes: 0 Dihedraltypes: 0 Impropertypes: 0
> Info) Bonds: 1690676
> Info) Angles: 0 Dihedrals: 0 Impropers: 0 Cross-terms: 0
> Info) Bondtypes: 0 Angletypes: 0 Dihedraltypes: 0 Impropertypes: 0
> Info) Segments: 540
> Info) Fragments: 720 Protein: 360 Nucleic: 0
> Info) Residues: 104040
> Info) Waters: 0
> Info) Residues: 104040
> Info) Waters: 0
> Info) Segments: 540
> Info) Segments: 540
> Info) Fragments: 720 Protein: 360 Nucleic: 0
> Info) Fragments: 720 Protein: 360 Nucleic: 0
> 0
> node 0 has loaded data
> Info) Rendering current scene to 'test_node_0.ppm' ...
> 0
> node 0 has loaded data
> 0
> node 0 has loaded data
> Info) Rendering current scene to 'test_node_0.ppm' ...
> Info) Rendering current scene to 'test_node_0.ppm' ...
> OSPRayDisplayDevice) Total rendering time: 1.01 sec
> Info) Rendering complete.
> node 0 has rendered a frame
> 1
> Ending, nodecount is 1
> OSPRayDisplayDevice) Total rendering time: 1.01 sec
> Info) Rendering complete.
> node 0 has rendered a frame
> 1
> Ending, nodecount is 1
> OSPRayDisplayDevice) Total rendering time: 1.02 sec
> Info) Rendering complete.
> node 0 has rendered a frame
> 1
> Ending, nodecount is 1
> Info) VMD for LINUXAMD64, version 1.9.4a55 (April 12, 2022)
> Info) Exiting normally.
> Info) All nodes have reached the MPI shutdown barrier.
> Info) VMD for LINUXAMD64, version 1.9.4a55 (April 12, 2022)
> Info) Exiting normally.
> Info) VMD for LINUXAMD64, version 1.9.4a55 (April 12, 2022)
> Info) Exiting normally.
> Info) All nodes have reached the MPI shutdown barrier.
> Info) All nodes have reached the MPI shutdown barrier.
>
>
> Also I have tried:
>
> $ sbatch --export=VMDOSPRAYMPI vmd-text.sbatch
>
> And I get:
>
> $ cat slurm-102.out
> srun: error: node001: task 0: Exited with exit code 2
> srun: error: node003: task 2: Exited with exit code 2
> srun: error: node002: task 1: Exited with exit code 2
> slurmstepd: error: execve(): vmd: No such file or directory
> slurmstepd: error: execve(): vmd: No such file or directory
> slurmstepd: error: execve(): vmd: No such file or directory
>
>
> If I look at the executable, I see (highlights follow):
>
> [cht_at_node001 VMD]$ ldd /cm/shared/apps/vmd-mpi/1.9.4/vmd/vmd_LINUXAMD64
> linux-vdso.so.1 (0x0000155555551000)
> libGL.so.1 => /lib64/libGL.so.1 (0x00001555550a1000)
> libGLU.so.1 => /lib64/libGLU.so.1 (0x00001555554cd000)
> libfltk_gl.so.1.3 => /lib64/libfltk_gl.so.1.3 (0x0000155554e83000)
> libmpi.so.12 => /cm/shared/apps/mpich/ge/gcc/64/3.3.2/lib/libmpi.so.12 (0x0000155554937000)
> libospray.so.0 => /cm/shared/apps/ospray/1.8.0/lib/libospray.so.0 (0x0000155555459000)
> libospray_common.so.0 => /cm/shared/apps/ospray/1.8.0/lib/libospray_common.so.0 (0x00001555553f7000)
> libembree3.so.3 => /cm/shared/apps/ospray/1.8.0/lib/libembree3.so.3 (0x00001555520a5000)
> libtbb.so.2 => /cm/shared/apps/ospray/1.8.0/lib/libtbb.so.2 (0x0000155551e47000)
> libtbbmalloc.so.2 => /cm/shared/apps/ospray/1.8.0/lib/libtbbmalloc.so.2 (0x0000155551bf0000)
> ..
> ..
> etc etc

-- 
NIH Center for Macromolecular Modeling and Bioinformatics
Beckman Institute for Advanced Science and Technology
University of Illinois, 405 N. Mathews Ave, Urbana, IL 61801
http://www.ks.uiuc.edu/~johns/           Phone: 217-244-3349
http://www.ks.uiuc.edu/Research/vmd/