Discussion:
[Bioclusters] MPI-HMMER segmentation fault mpi error
AbDdU!
2009-02-12 05:33:19 UTC
Dear all

I recently compiled mpi-hmmer with OpenMPI (if I'm not mistaken). When
I run it on a small dataset it works fine, but when I give it a big
dataset to search against, about 200MB, it crashes on the first node
and exits with "segmentation fault" and an MPI-related error,
"errno=111", which is a "connection refused" type of error. Any clue
on this? I don't think it's MPI-related, nor is it an ssh issue ... or
is it? I've tried running the same problem on a compute node and
it worked fine, with no errors.


What I'm sure of is that MPI is working fine, as other software that
uses it works as expected ...

Any experience with such an error?


Thank you guys if you ever read this ... :) and hopefully you can guide
me to the solution
Joe Landman
2009-02-12 13:45:26 UTC
Greetings AbDdU

You probably want to subscribe to the mpihmmer list and post
questions there, as this is where the developers tend to hang out. You
can find it here ...

http://lists.scalableinformatics.com/mailman/listinfo/mpihmmer
Post by AbDdU!
Dear all
I recently compiled mpi-hmmer with OpenMPI (if I'm not mistaken). When
I run it on a small dataset it works fine, but when I give it a big
dataset to search against, about 200MB, it crashes on the first node
and exits with "segmentation fault" and an MPI-related error,
"errno=111", which is a "connection refused" type of error. Any clue
on this? I don't think it's MPI-related, nor is it an ssh issue ... or
is it? I've tried running the same problem on a compute node and
it worked fine, with no errors.
A segmentation fault is usually what you get when you run a program that
tries to access memory it doesn't have a right to access. Could you let
us know:

a) what you used for a command line

b) what database you used for your search

c) how much memory and what CPU type you have on the node that crashed.

Since it worked on a compute node, this suggests either library
differences, out of memory issues, or similar problems on the machine
where the run crashed.
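
One way to gather the memory, CPU, and library details asked for above is a small report function run on both the node that crashed and the compute node that worked. This is only a sketch, assuming Linux nodes with /proc available; the function name node_report and the binary path you pass to it are hypothetical, not part of mpi-hmmer.

```shell
#!/bin/sh
# Hypothetical helper: print memory, CPU and shared-library details for one
# node so that two nodes can be compared. Assumes Linux (/proc filesystem).
node_report() {
    binary=$1
    echo "== host =="
    uname -n
    echo "== memory =="
    # Total/free RAM and swap as reported by the kernel.
    grep -E '^(MemTotal|MemFree|SwapTotal)' /proc/meminfo 2>/dev/null \
        || echo "(no /proc/meminfo)"
    echo "== cpu =="
    # The first "model name" line is enough to identify the CPU type.
    grep 'model name' /proc/cpuinfo 2>/dev/null | head -n 1
    echo "== shared libraries for $binary =="
    # ldd lists the shared libraries the binary resolves to on this node.
    if command -v ldd >/dev/null 2>&1; then
        ldd "$binary" 2>&1 || true
    else
        echo "(ldd not available)"
    fi
    return 0
}

# When run as a script, report on the binary given as the first argument.
if [ "$#" -gt 0 ]; then
    node_report "$1"
fi
```

Saving the output on each machine and running diff on the two files should make library or memory differences easy to spot.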

To see if this is an ssh issue try this for each machine in your
machines file

ssh machinename hostname

where machinename is the name of the machine in the machines file. So,
for example, if your machines file has

compute-1
compute-2
compute-3
compute-4

then your test would look like this

ssh compute-1 hostname
ssh compute-2 hostname
ssh compute-3 hostname
ssh compute-4 hostname

If these work quickly and without a password prompt, it is unlikely
that ssh is the problem. If you didn't use a machines file, then mpi
will often try to do this to the local host, so you should include

ssh localhost hostname

as a test, and it should work, just like the others.
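
The per-machine test above can be scripted as a loop over the machines file. A minimal sketch, assuming one host name per line; the function name check_ssh_hosts is hypothetical, and the 5 second timeout is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helper: try a passwordless "ssh <host> hostname" for every
# host listed in a machines file (one host name per line).
check_ssh_hosts() {
    machines_file=$1
    failures=0
    while read -r host rest || [ -n "$host" ]; do
        # Skip blank lines and comment lines.
        case "$host" in
            ''|'#'*) continue ;;
        esac
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        # -n stops ssh from reading the rest of the machines file on stdin.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" hostname \
            >/dev/null 2>&1; then
            echo "ok      $host"
        else
            echo "FAILED  $host"
            failures=$((failures + 1))
        fi
    done < "$machines_file"
    # Succeed only if every host answered.
    [ "$failures" -eq 0 ]
}

# When run as a script, check the machines file given as the first argument.
if [ "$#" -gt 0 ]; then
    check_ssh_hosts "$1"
fi
```

Any host reported as FAILED either prompted for a password, timed out, or refused the connection, and is worth testing by hand.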
Post by AbDdU!
What I'm sure of is that MPI is working fine, as other software that
uses it works as expected ...
Still not enough information to provide a meaningful answer.
Post by AbDdU!
Any experience with such an error?
111 errors? Yes. Usually the result of one or more of the mpi
processes crashing.
Post by AbDdU!
Thank you guys if you ever read this ... :) and hopefully guide me to
the solution
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
John Paul Walters
2009-02-12 15:13:39 UTC
AbDdU,

In addition to Joe's suggestions, I would also suggest that you grab the
latest Mercurial snapshot, as it contains some fixes that aren't in the
two releases that are posted. The Mercurial link is directly beneath
the release links, and has both .zip and .gz options.

best,
JP