[Bioclusters] Parallel Sequence Alignment tool

Discussion:

slitster

2009-07-17 14:41:53 UTC

Does anyone have recommendations for a parallel sequence alignment tool?

User investigation so far has turned up ClustalW-MPI, but it seems to be using an older version of ClustalW.

Any input much appreciated.

Cheers

Steve

Juan Carlos Perin

2009-07-22 13:25:14 UTC

Are you looking to align short reads from ngs, or other data?

~ juan

_______________________________________________
Bioclusters maillist - Bioclusters at bioinformatics.org
http://www.bioinformatics.org/mailman/listinfo/bioclusters

Nick Holway

2009-07-30 16:19:13 UTC

Hello,

Steve actually posted this on behalf of me, so to cut out the middle
man I'll answer.

I'm trying to assist a scientist with a bioinformatics project. He's
trying to align 16S rDNA sequences to identify the bacterial species.
I launched a Muscle job on his behalf which took ~5.5 days to run (on
3GHz "Harpertown" Xeons). The file the scientist gave me had ~5000
sequences, most of which were 1000-1500 bases long.

I'm trying to persuade the scientist to see if he can reduce the
number of sequences that he needs to align, and also to check whether
his data really needs Muscle to run to completion rather than just the
first two iterations.

My reason for wanting to know if there are any good parallel sequence
alignment tools is that we've seen some excellent speed increases with
our MD code. Knowing this scientist I imagine he'll need the entire
data set to be aligned :)

If you need me to find out any more information from the scientist
please let me know.

Thanks

Nick

Antony P Joseph

2009-08-01 17:29:33 UTC

Hi Nick,
Did you try the -profile option in MUSCLE as a divide-and-conquer
strategy on the data, assuming that you are not able to find a
parallelized version of MUSCLE?
number of files = number of CPUs
number of sequences in each file = 5000 / number of CPUs
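
A minimal sketch of the split described above, in Python. It deals the FASTA records into one group per CPU and then lists the command lines one might run: one ordinary alignment per chunk, followed by pairwise profile merges. The file names (`chunk_0.fa`, `merged_1.afa`, ...) are hypothetical, the round-robin split is just one possible choice, and the `-in`, `-out`, `-profile`, `-in1` and `-in2` options are given as I understand them from MUSCLE 3.x, so check them against the help output of your version. Nothing is executed here.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records: List[Record], n_cpus: int) -> List[List[Record]]:
    """Deal records round-robin into at most n_cpus groups of near-equal size."""
    groups: List[List[Record]] = [[] for _ in range(n_cpus)]
    for i, rec in enumerate(records):
        groups[i % n_cpus].append(rec)
    return [g for g in groups if g]  # drop empty groups if there are few records


def to_fasta(records: List[Record]) -> str:
    """Serialize records back to FASTA text, one sequence per line."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


def muscle_commands(n_chunks: int) -> List[List[str]]:
    """Build (but do not run) the per-chunk and merge command lines.

    File names are hypothetical; flags are assumed from MUSCLE 3.x.
    """
    cmds = [
        ["muscle", "-in", f"chunk_{i}.fa", "-out", f"chunk_{i}.afa"]
        for i in range(n_chunks)
    ]
    merged = "chunk_0.afa"
    for i in range(1, n_chunks):
        out = f"merged_{i}.afa"
        cmds.append(["muscle", "-profile", "-in1", merged,
                     "-in2", f"chunk_{i}.afa", "-out", out])
        merged = out
    return cmds
```

The per-chunk alignments are independent and can run on separate CPUs; the merge steps as written here are sequential. How the sequences are grouped is likely to affect the quality of the merged alignment, so a random or round-robin split should be treated as a starting point only.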

With regards
Antony

Paulo Nuin

2009-08-03 15:50:07 UTC

Hi

Just my two cents. Aligning rRNA is not a straightforward process and
it shouldn't be attempted automatically. Muscle, MAFFT and other fast
algorithms will generate very low quality alignments if they are run
blindly. Based on the number of sequences you have, and their nature,
you would be OK wrapping a script around ClustalW or ClustalW-MPI.

A good protocol to align rRNA is as follows:

- align two sequences
- add a third sequence to it by using the first two as a profile
- add a fourth sequence using the first three as a profile
- add a fifth sequence ...
- at some point you will have a good enough profile that would allow
you to use the aligned sequences as a model for the ones still to be
added to the alignment
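
The protocol above can be sketched as a small driver loop. This is only an outline under stated assumptions: `align_pair` and `add_to_profile` are hypothetical callbacks that would wrap whatever aligner is used (for example a script around ClustalW), and the `pad_pair` / `pad_add` functions below are toy stand-ins that merely pad with gaps so the loop can be run; they are not real aligners.

```python
from typing import Callable, List

Alignment = List[str]  # rows of equal length, '-' marks a gap


def grow_profile(
    seqs: List[str],
    align_pair: Callable[[str, str], Alignment],
    add_to_profile: Callable[[Alignment, str], Alignment],
) -> Alignment:
    """Align the first two sequences, then add the rest one at a time,
    each time using everything aligned so far as the profile."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    profile = align_pair(seqs[0], seqs[1])
    for seq in seqs[2:]:
        profile = add_to_profile(profile, seq)
    return profile


# Toy stand-ins so the driver can be exercised. NOT real aligners:
# they only right-pad rows with gaps to a common width.
def pad_pair(a: str, b: str) -> Alignment:
    width = max(len(a), len(b))
    return [a.ljust(width, "-"), b.ljust(width, "-")]


def pad_add(profile: Alignment, seq: str) -> Alignment:
    width = max(len(profile[0]), len(seq))
    return [row.ljust(width, "-") for row in profile] + [seq.ljust(width, "-")]


result = grow_profile(["ACGU", "ACG", "ACGUA"], pad_pair, pad_add)
```

The order in which sequences are added matters in this scheme, since early choices shape the profile that later sequences are aligned against.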

The reason is that rRNA has a secondary (and tertiary) structure that
contains stems and loops. Stems are short segments that are somewhat
"duplicated" along the flat sequence and attach to each other when
forming the secondary structure. This pairing doesn't always follow
the usual A-T(U), C-G rule. Because of the stems there is a pattern in
the primary structure that has to be followed to generate a good (but
not excellent) alignment.
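
To make the stem idea concrete, here is a small illustrative check of whether two segments could pair as a stem. It is a sketch, not a structure predictor: it only tests base-by-base pairing of one segment against the other read in reverse, and it models a single exception to the usual rule, the G-U wobble pair. Other non-canonical pairs exist and are not covered. The function name and the `allow_wobble` parameter are made up for this example.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}  # one common non-canonical pair


def can_form_stem(left: str, right: str, allow_wobble: bool = True) -> bool:
    """Return True if `left` pairs base-by-base with `right` read backwards.

    Input may be DNA or RNA; T is treated as U.
    """
    left = left.upper().replace("T", "U")
    right = right.upper().replace("T", "U")
    if len(left) != len(right):
        return False
    allowed = (WATSON_CRICK | WOBBLE) if allow_wobble else WATSON_CRICK
    return all((a, b) in allowed for a, b in zip(left, reversed(right)))
```

For instance, `can_form_stem("GGAC", "GUCC")` holds under strict pairing, while `can_form_stem("GGGC", "GUCC")` only holds when the G-U wobble is allowed. An aligner that scores columns independently has no notion of this kind of long-range constraint.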

I guess dedicated rRNA alignment software would be too slow for your
requirements, but by using ClustalW-MPI with some sequences as a
profile you would get a fairly good alignment in maybe a couple of
days.

Hope that helps
Paulo

Liu,Li

2009-08-14 01:12:59 UTC

You may want to try this (http://www.biotech.ufl.edu/people/sun/esprit.html)

Sent from Li's iPhone

Kevin M. Carr

2009-08-01 15:32:47 UTC

Nick,

Have you checked out this resource:

http://rdp.cme.msu.edu/

(Note, while I work at MSU I am not affiliated with the Ribosomal Database
Project.)

They have a fully developed pipeline for classifying 16S rRNA sequences.
You can create your own account to process sequences through their
pipeline.

Kevin M. Carr

**************************
Bioinformatics Specialist
Research Technology
Support Facility
S20-A Plant Biology Lab
Michigan State University
East Lansing, MI 48824

Ph: (517) 355-6759 x102
Fax:(517) 355-6758
**************************

Ognen Duzlevski

2009-08-24 12:25:02 UTC

I once wrote a multi-threaded version of clustalw - you can get it here:
http://naniteworld.com/clustalw_smp-0.99-9.tar.gz

Ognen

Abhishek Pratap

2009-08-24 18:34:12 UTC

Hey Ognen

Thanks for sharing. Just wondering if you could recommend anything
good on how to convert single-threaded programs into multi-threaded
ones, assuming of course that the base algorithm is compatible. Any
good resource that you would recommend?

Thanks,
-Abhi

jgans

2009-08-25 15:04:35 UTC

Hello,

There is a nice paper from SGI on parallelizing the Clustal program
using OpenMP
(http://www.cs.umd.edu/class/spring2003/cmsc838t/papers2/bio-SGI-parallel-clustal-01.pdf).

Even though this paper refers to an earlier version of Clustal (circa
2001), it is a very useful guide. Using this paper as a reference, it
was straightforward to add the required OpenMP code to the most recent
version of Clustal (I only modified the first-stage pairwise alignment
portion of the code).
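
As a rough illustration of why the first stage parallelizes so easily, here is a Python sketch (not the OpenMP change itself, and not Clustal code). Every pair of sequences can be scored independently of every other pair, so the pairs can simply be handed out to workers. `toy_distance` is a placeholder that counts mismatches position by position; a real first stage would run a proper pairwise alignment there. All names are made up for the example.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from itertools import combinations
from typing import Dict, List, Tuple


def toy_distance(a: str, b: str) -> float:
    """Placeholder score: mismatch fraction over the shorter length.
    This is NOT a pairwise alignment; it only stands in for one."""
    n = min(len(a), len(b))
    if n == 0:
        return 1.0
    return sum(x != y for x, y in zip(a, b)) / n


def _score_pair(pair: Tuple[str, str]) -> float:
    # Top-level function so it can also be used with a process pool.
    return toy_distance(pair[0], pair[1])


def pairwise_distances(
    seqs: List[str], executor: Executor
) -> Dict[Tuple[int, int], float]:
    """Score every unordered pair (i, j), i < j, using the given executor."""
    index_pairs = list(combinations(range(len(seqs)), 2))
    work = [(seqs[i], seqs[j]) for i, j in index_pairs]
    scores = executor.map(_score_pair, work)  # results come back in order
    return dict(zip(index_pairs, scores))


with ThreadPoolExecutor(max_workers=4) as pool:
    distances = pairwise_distances(["ACGT", "ACGA", "TCGA"], pool)
```

A thread pool is used here only to keep the example simple; for CPU-bound pure-Python scoring, a `ProcessPoolExecutor` (or compiled code, as with OpenMP in C) would be needed to see a real speedup. For roughly 5000 sequences this stage is about 12.5 million independent pairs.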

Regards,

Jason Gans

Bioscience Division, B-7
Los Alamos National Laboratory

Ognen Duzlevski

2009-08-27 19:04:26 UTC

It would have been nice to be aware of this paper when I parallelized
ClustalW back in 2001/2. The program is not that complicated to
parallelize - it is your basic search for what takes up most time and how
much that portion lends itself to being parallelized. I don't remember
that well - it has been a while - but if I remember correctly ClustalW had
three phases - 1st one was sequence-to-sequence alignment which was very
easy to parallelize, the second phase was irrelevant time-wise to consider
and the third one was where significant time was spent but it was more
difficult to parallelize...
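
The "what takes up most time and how much of it can be parallelized" reasoning is what Amdahl's law captures. A small sketch follows; the fractions used in the usage note are purely illustrative and are not measurements of ClustalW.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the
    single-threaded run time can be spread across `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

If, hypothetically, the easily parallelized phase were half of the run time, `amdahl_speedup(0.5, 8)` would give a bound of less than 2x on eight cores, which is why the harder-to-parallelize phase ends up mattering.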

Regards,
Ognen

--
Ognen Duzlevski
Independent IT consultant
(561)452-5653
U.S.A.
--

Joe Landman

2009-08-31 14:18:45 UTC

Post by Ognen Duzlevski
It would have been nice to be aware of this paper when I parallelized
ClustalW back in 2001/2. The program is not that complicated to
parallelize - it is your basic search for what takes up most time and

As I remember, Haruna and Dmitri were building something called
HT-Clustal around that time, using shared memory rather than a
cluster-ized version. We had previously done the cluster-ized blast
(CT-BLAST aka SGI GenomeCluster).

Post by Ognen Duzlevski
how much that portion lends itself to being parallelized. I don't
remember that well - it has been a while - but if I remember correctly
ClustalW had three phases - 1st one was sequence-to-sequence alignment
which was very easy to parallelize, the second phase was irrelevant
time-wise to consider and the third one was where significant time was
spent but it was more difficult to parallelize...

The HT-Clustal paper and this paper detail the steps needed to
parallelize it. It's made somewhat easier by shared memory (much less
development), but in those days, multiple gigabytes of shared memory
were still pretty expensive.

Someone did an MPI-Clustal as well in 2003. Anyone using that?
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615

Mikhail Fursov

2009-12-24 06:59:31 UTC

Hi!

A parallel version of the MUSCLE algorithm is available in this open-source
package: http://ugene.unipro.ru . You can also run it remotely on any PC with
UGENE installed and configured to serve remote requests.

Note that even the non-parallel (original) MUSCLE code can do large alignments
a LOT faster than Clustal and will produce results of comparable quality.

Mikhail.

--
Mikhail Fursov