[Bioclusters] RSH fails on seaguest cluster
Jeff Thomas
2008-05-13 21:48:12 UTC
Hello All,
We have a cluster with a Windows 2003 master node and 8 Linux (Fedora 4)
slave nodes. Everything was working properly, but now rsh fails to connect
to nodes 1-7.

pvm> add node1
add node1
0 successful
HOST DTID
node1 Can't start pvmd

Auto-Diagnosing Failed Hosts...
node1...
Verifying Local Path to "rsh"...

Error - File /usr/ucb/rsh Not Found!
Determine the path to the "rsh" command on your
system, and edit PVM_ROOT\conf\WIN32.def
to adjust the path for the -DRSHCOMMAND=\"\"
flag. Then recompile PVM and your applications.
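
The diagnostic above says the compiled-in rsh path (/usr/ucb/rsh) does not exist on the master. Before recompiling, it may help to confirm where an rsh client actually lives. The following is a minimal sketch, not part of PVM; the function name `find_rsh` and any candidate paths passed to it are hypothetical and should be replaced with the locations used on your system.

```python
import os
import shutil


def find_rsh(candidates):
    """Return the first candidate that exists as a file.

    Falls back to searching PATH for "rsh"; returns None if nothing is found.
    The candidate list is supplied by the caller (hypothetical paths).
    """
    for path in candidates:
        if os.path.isfile(path):
            return path
    # shutil.which returns None when "rsh" is not on PATH
    return shutil.which("rsh")
```

For example, `find_rsh(["/usr/ucb/rsh", "/usr/bin/rsh"])` returns whichever of the two exists first. PVM 3.4 is also documented as honoring a PVM_RSH environment variable that overrides the compiled-in path; whether the WIN32 build on this cluster does so is an assumption worth verifying before relying on it.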


I have restarted the entire cluster and cleaned the /tmp pvm*.* files on
each node multiple times. I have also restarted the BsdRshd service.

I cannot rsh from a slave to the master:

[root@node1 ~]# rsh master "c:\cluster\wrshd\id.exe"
connect to address 192.168.66.250: Connection refused
Trying krb4 rsh...
connect to address 192.168.66.250: Connection refused
trying normal rsh (/usr/bin/rsh)
Access denied.
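
"Connection refused" generally means the host answered but nothing was listening on the port that was tried. Classic rsh uses TCP port 514, so one way to separate "the rsh daemon is down" from "the network is down" is to probe that port directly. This is a minimal sketch under that assumption; `probe_tcp` is a hypothetical helper, not part of rsh or PVM.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome.

    Returns "open" if something accepted the connection, "refused" if the
    host actively rejected it, and "unreachable" for timeouts, DNS failures
    and other network errors.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Covers timeouts, name resolution failures, no route to host, etc.
        return "unreachable"
```

Running `probe_tcp("192.168.66.250", 514)` from node1 would show whether the master's rsh daemon is accepting connections at all. A result of "refused" points at the daemon on the master, while "unreachable" points at the network path.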

WRSHD in debug mode yields this:

C:\Cluster\wrshd>rshd -d
(/5/9 10:48:58) Checking WinSockets Version... (/5/9 10:48:58) done.
(/5/9 10:48:58) Loading Equivalence List...(/5/9 10:48:58) Getting Information from Trustbase
(/5/9 10:48:58) done.
(/5/9 10:48:58) Binding main socket.
(/5/9 10:48:58) cannot bind to the rshd daemon port.Debugging BsdRshd
In StartServiceCtrlDispatcher
Error number: 1063
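
A "cannot bind" message usually means another process already holds the port, for instance a BsdRshd service instance still running while rshd -d is started by hand; that cause is a guess, not something the log confirms. Error 1063 is the Win32 code ERROR_FAILED_SERVICE_CONTROLLER_CONNECT, which StartServiceCtrlDispatcher returns when the program is launched from a console rather than by the service manager, so it is expected in debug mode. A minimal sketch for checking whether a port is free (`can_bind` is a hypothetical helper):

```python
import socket


def can_bind(port, host="0.0.0.0"):
    """Try to bind a TCP socket to host:port.

    Returns (True, "") if the bind succeeds, otherwise (False, reason).
    The socket is closed again immediately in both cases.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True, ""
    except OSError as exc:
        # Typical reasons: address already in use, or insufficient privileges
        return False, str(exc)
    finally:
        sock.close()
```

Calling `can_bind(514)` on the master with the service stopped should succeed if nothing else is holding the rsh port. On Linux, binding ports below 1024 also requires root, so a failure there can simply mean missing privileges.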

The pvml Log file

[t80040000] master (192.168.66.250:1036) WIN32 3.4.3
[t80040000] ready Fri May 09 10:46:05 2008
[t80040000] netinput() bogus pkt from 192.168.66.1:32774
[t80040000] netinput() bogus pkt from 192.168.66.2:32771
[t80040000] netinput() bogus pkt from 192.168.66.3:32771
[t80040000] netinput() bogus pkt from 192.168.66.5:32771
[t80040000] netinput() bogus pkt from 192.168.66.6:32771
[t80040000] netinput() bogus pkt from 192.168.66.7:32771
[t80040000] netinput() bogus pkt from 192.168.66.8:32770
[t80040000] startack() host node1 expected version, got "PvmCantStart"
[t80040000] startack() host node2 expected version, got "PvmCantStart"
[t80040000] startack() host node3 expected version, got "PvmCantStart"
[t80040000] startack() host node4 expected version, got ""
[t80040000] startack() host node5 expected version, got "PvmCantStart"
[t80040000] startack() host node6 expected version, got "PvmCantStart"
[t80040000] startack() host node7 expected version, got "PvmCantStart"
[t80040000] startack() host node8 expected version, got "PvmCantStart"
[t80040000] netinput() bogus pkt from 192.168.66.1:32775
[t80040000] startack() host node1 expected version, got "PvmCantStart"
[t80040000] netinput() bogus pkt from 192.168.66.8:32771
[t80040000] startack(
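
In the log above, every node answers the start request with "PvmCantStart" except node4, which returns an empty string, and "bogus pkt" lines appear for every address from 192.168.66.1 to 192.168.66.8 except 192.168.66.4. To spot patterns like that in a longer log, a small parser can help. This is a minimal sketch that assumes the two line formats shown above; `summarize` is a hypothetical helper, and mapping nodeN to 192.168.66.N is an assumption.

```python
import re
from collections import Counter

# Patterns match the two line shapes seen in the pvml log above.
STARTACK = re.compile(r'startack\(\) host (\S+) expected version, got "([^"]*)"')
BOGUS = re.compile(r'netinput\(\) bogus pkt from (\d+\.\d+\.\d+\.\d+):\d+')


def summarize(log_text):
    """Return (replies, bogus) for a pvml log.

    replies maps each host name to the set of strings it answered with;
    bogus counts "bogus pkt" lines per source IP address.
    """
    replies = {}
    bogus = Counter()
    for line in log_text.splitlines():
        match = STARTACK.search(line)
        if match:
            replies.setdefault(match.group(1), set()).add(match.group(2))
            continue
        match = BOGUS.search(line)
        if match:
            bogus[match.group(1)] += 1
    return replies, bogus
```

Truncated lines such as the final `startack(` match neither pattern and are skipped.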

I know it must be something simple because it was working fine before
this. Any suggestions would be greatly appreciated.

Thanks
Jeff Thomas
Michael Edwards
2008-05-14 19:38:32 UTC
Did your switch break? Try swapping it out for another one, especially if
all the broken nodes are on the same switch (and, to a lesser extent, if the
nodes that still work are on a different one).
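
To see whether the failures line up with one switch, it can help to ping every node and compare the unreachable ones against the cabling. This is a minimal sketch; `ping_command` and `sweep` are hypothetical helpers, the host names are whatever your cluster uses, and the flags assume the stock ping on Linux or Windows.

```python
import platform
import subprocess


def ping_command(host, system=None):
    """Build a single-packet ping command for the given platform."""
    system = system or platform.system()
    if system == "Windows":
        # -n: packet count, -w: timeout in milliseconds
        return ["ping", "-n", "1", "-w", "2000", host]
    # Linux: -c: packet count, -W: timeout in seconds
    return ["ping", "-c", "1", "-W", "2", host]


def sweep(hosts, run=subprocess.call):
    """Ping each host once; return {host: True if ping exited with 0}.

    The run parameter exists so the command runner can be replaced in tests.
    """
    results = {}
    for host in hosts:
        rc = run(ping_command(host),
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        results[host] = (rc == 0)
    return results
```

Something like `sweep(["node%d" % i for i in range(1, 9)])` from the master gives a quick reachable/unreachable map. A node that answers ping can still have a broken rsh or pvmd setup, so this only rules the network in or out.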
_______________________________________________
Bioclusters maillist - Bioclusters at bioinformatics.org
http://www.bioinformatics.org/mailman/listinfo/bioclusters