problems with parallelism

From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Thu Nov 23 2006 - 12:49:39 CST

Dear Charm++ and NAMD developers,
We have been fighting for some months now with a new cluster
of AMD64 dual-core processors running Fedora 5.0. Our cluster
is composed of 9 machines: 8 diskless nodes and 1 master.
We already have an Opteron cluster with a similar architecture
which works fine and runs NAMD with Charm++ very
efficiently.

However, we have been unable to run NAMD in parallel on our new
cluster.

Our observations are:

1. If I try to run in parallel, starting the simulation from the master and
using the nodes, the simulation does not start: it hangs before the
simulation begins
and returns the error message: Timeout waiting for node-program to connect
(more details on this message at the end of the email).

2. If I try to run in parallel starting from one node, even when also using
the master cpu, the simulation eventually hangs: a process keeps running
at 100% cpu on the first machine of the node list, but the simulation does
not continue.

3. If I try to run in parallel without the master node, the simulation runs
for a day or two, but eventually hangs with the same problem as in 2.

4. One time we ran a simulation starting from one node with the master
node at the end of the nodelist file; the simulation hung and we got:
Warning: 1 processors are overloaded due to high background load.

5. We have tried different versions of Charm++ and namd2, and have
recompiled Charm++ with the options suggested by Jim Phillips in
http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnAMD64
but we observe the same results.
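For reference, our launch setup follows the standard Charm++ net-linux
conventions; the sketch below shows the shape of the nodelist file and
charmrun command line we use (the hostnames, process count, and paths
here are illustrative, not our exact configuration):

```
# nodelist file: one "host" line per machine; charmrun assigns
# processes round-robin over these entries.
group main
host 192.168.0.101
host 192.168.0.102

# launch command: ++p sets the number of processes, ++nodelist
# points at the file above, and ++verbose prints the startup
# handshake, which is where our runs time out.
charmrun ++p 4 ++nodelist ./nodelist ++verbose ./namd2 sim.conf
```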

We have run out of clues as to what the problem could be. Apparently we have
some problem with load balancing. We have also updated to the latest
kernels and tested all connections and network interfaces, and we cannot
find any hardware problem in the machines.

We would strongly appreciate any insight into what could be the problem.
We would also appreciate it if someone with a similar cluster configuration
shared his/her experiences, so we can rule out (or confirm) the possibility
of some hardware incompatibility with Charm++ or NAMD. If this is a problem
you have some interest in for any reason, we can certainly give you access
to the machines.

Thank you very much,
Leandro Martinez
State University of Campinas
Brazil

Error message when starting from the master and trying to use the nodes:

Charmrun> charmrun started...
Charmrun> using ./nodelist2 as nodesfile
Charmrun> adding client 0: " 192.168.0.101", IP:192.168.0.101
Charmrun> adding client 1: "192.168.0.101", IP: 192.168.0.101
Charmrun> Charmrun = alehpo.iqm.unicamp.br, port = 42645
Charmrun> Sending "0 alehpo.iqm.unicamp.br 42645 17029 0" to client 0.
Charmrun> find the node program
"/home/lmartinez/./NAMD_2.6b2_Linux-amd64/namd2" at "/home/lmartinez" for 0.
Charmrun> Starting rsh 192.168.0.101 -l lmartinez /bin/sh -f
Charmrun> rsh (192.168.0.101:0) started
Charmrun> Sending "1 alehpo.iqm.unicamp.br 42645 17029 0" to client 1.
Charmrun> find the node program
"/home/lmartinez/./NAMD_2.6b2_Linux-amd64/namd2" at "/home/lmartinez" for 1.

Charmrun> Starting rsh 192.168.0.101 -l lmartinez /bin/sh -f
Charmrun> rsh (192.168.0.101:1) started
Charmrun> node programs all started
Charmrun> waiting for rsh (192.168.0.101:0), pid 17030
Charmrun rsh(192.168.0.101.0)> remote responding...
Charmrun rsh(192.168.0.101.1)> remote responding...
Charmrun rsh( 192.168.0.101.0)> starting node-program...
Charmrun rsh(192.168.0.101.0)> rsh phase successful.
Charmrun rsh(192.168.0.101.1)> starting node-program...
Charmrun rsh(192.168.0.101.1)> rsh phase successful.
Charmrun> waiting for rsh (192.168.0.101:1), pid 17031
Charmrun> Waiting for 0-th client to connect.
Timeout waiting for node-program to connect

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:42:51 CST