Re: Trouble with NAMD on Myrinet

From: Edward Patrick Obrien (edobrien_at_Glue.umd.edu)
Date: Wed Mar 16 2005 - 10:04:20 CST

Hi Gengbin,
   Thanks, the recommendations on the NAMD Wiki page helped. NAMD on
Myrinet now seems to work.

   BUT, a strange thing is occurring: there is no speed-up in the simulation
when going from parallel runs over Ethernet to parallel runs over Myrinet.
For example:

                      Myrinet    Ethernet
Seconds per Step       0.266      0.267

The above data is for a system of ~18,000 atoms, run in parallel on 2
nodes with 2 processors each. The exact same compute nodes were used for
both runs.

   We checked the traffic between the compute nodes during these
calculations: the Myrinet job was communicating over Myrinet and the
Ethernet job over Ethernet.
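
One way to do that kind of check (just a sketch, with eth0 as a placeholder
interface name) is to watch the Ethernet byte counters on a compute node
while each job runs:

  # sample the Ethernet byte counters a few seconds apart
  grep eth0 /proc/net/dev ; sleep 10 ; grep eth0 /proc/net/dev

If the counters barely move during the Myrinet run but climb quickly during
the Ethernet run, each job is using the interconnect it is supposed to
(native GM traffic bypasses the kernel IP stack, so it does not show up in
/proc/net/dev).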

   I use charmrun for the Ethernet run and mpirun for the Myrinet run; the
two launch lines are sketched below. Some info on the installation of
Myrinet NAMD is given further down.
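
Roughly, the two launch lines look like this (paths, file names, and the
+p/-np counts are placeholders, not the exact commands from our scripts):

  # Ethernet build, launched with charmrun
  ./charmrun ++nodelist nodelist +p4 ./namd2 ethernet_test.namd > ethernet_test.log

  # Myrinet (MPICH-GM) build, launched with mpirun
  mpirun -machinefile nodelist -np 4 ./namd2 myrinet_test.namd > myrinet_test.log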

Any ideas what could be going on?
Thanks,
Ed

On Tue, 15 Mar 2005, Edward Patrick Obrien wrote:

> Hi All,
> We compiled NAMD for Myrinet, but it only seems to work some of the time, and
> not correctly; other times it dies completely (the output errors are listed
> at the end of this message). Has anyone gotten NAMD-2.5 to work with Myrinet?
> Here's some info:
>
> I build NAMD as follows (once I've set up all the files describing where the
> plugins, TCL, etc. are and edited conv-mach.sh with the correct MPICH
> compilers; a quick way to verify the compilers is sketched after the build
> commands):
>
> cd charm
> ./build charm++ mpi-linux -O -DCMK_OPTIMIZE=1
> cd ..
> ./config tcl fftw plugins Linux-i686-MPI
>
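> (As a sanity check on the compiler setup mentioned above -- this part is just
> a suggestion, not something from the build notes -- one can confirm which MPI
> wrappers the charm build picks up:
>
>   which mpicc
>   mpicc -show
>
> For the Myrinet build, mpicc should come from the mpich-gm install and the
> compile/link line it prints should reference the GM library.)
>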
> This creates the namd2 executable in the Linux-i686-MPI directory. I then run
> it via qsub. One problem seems to be in the src/arch/mpi/machine.c
> file: the assertion on line 815 seems to get triggered. I broke it into two
> assertions to test, and the problem is that (startpe<Cmi_numpes) is FALSE. I had
> the program print out startpe (which, based on skimming the source, seems to be
> the MPI ID of the root process -- Cmi_numpes is the total number of
> processes), and startpe is HUGE (16535), which suggests that the value is
> getting corrupted somewhere. I tried just resetting it to 0 when that occurs,
> just to test, and that didn't help (as I more or less figured).
>
> This may or may not have to do with the errors I have outlined below, but
> maybe someone on the NAMD list knows better.
>
> Here are the 2 types of errors I get:
>
> Error type 1:
>
> FATAL ERROR 17 on MPI node 2 (n13): the GM port on MPI node 0 (n12) is
> closed, i.e. the process has not started, has exited or is dead
> Small/Ctrl message completion error!
> FATAL ERROR 17 on MPI node 3 (n13): the GM port on MPI node 0 (n12) is
> closed, i.e. the process has not started, has exited or is dead
>
> Error type 2:
>
> CCS: Unknown CCS handler name '' requested. Ignoring...
> CCS: Unknown CCS handler name '' requested. Ignoring...
>
> These errors appear after finishing the startup phase:
>
> "Info: Finished startup with 50830 kB of memory in use."
>
>
> System info:
>
> linux cluster, myrinet connections.
>
>
> pbs file:
>
> #!/bin/csh
> #PBS -r n
> #PBS -m b
> #PBS -m e
> #PBS -k eo
> #PBS -l nodes=2:ppn=2
> echo Running on host `hostname`
> echo Time is `date`
> echo Directory is `pwd`
> echo This job runs on the following processors:
> echo `cat $PBS_NODEFILE`
> cp $PBS_NODEFILE pbs_nodefile
> set NPROCS = `wc -l < $PBS_NODEFILE`
> echo This job has allocated $NPROCS processors
>
> set dir1 = /v/apps/mpich-1.2.5..12_04_01_2004/bin
> #set dir1 = /v/apps/mpich-gm-gnu/bin
> set dir2 = /v/estor3/home/edobrien/NAMD-tim
> set nodelist = /v/estor3/home/edobrien/Projects/nodelist
>
> cd $PBS_O_WORKDIR
>
> $dir1/mpirun -machinefile $nodelist -np 4 $dir2/namd2 myrinet_test.namd >&
> myrinet_test_13.log
>
>
> Thanks,
> Ed
>
