Million Sequence Clustering

Million Sequence Clustering

Find clusters for large sequence sets with around million sequences

The goal of million sequence clustering was to find clusters for large sequence sets with around million sequences.

Introduction

The goal of million sequence clustering was to find clusters for large sequence sets with around million sequences. We classified two data sets in this regard of which the details are given below.

16S rRNA Sequences

-- This set contained a total of 1160946 16S sRNA sequences where 684769 of them were unique

Fungi Sequences

-- This set contained a total of 957387 fungi sequences where 482158 of them were unique and we considered the ones with lengths greater than 200 resulting 446041 sequences.

Fungi Phylogenetic Tree

-- We proposed a displaying method called Spherical Phylogram to display the phylogenetic tree in 3D space. The sequences includes the representative sequences found by clustering fungi sequences, and sequences from GenBank.

Process

The outline of the clustering process for a given sequence set is as follows (for more details see 16S rRNA Process or fungi process)

Pick a sub set of sequences - called sample sequence set. We used 100K sequences in sample
Perform pairwise alignment on 1. and produce a distance matrix
Perform pairwise clustering and multidimensional scaling on 2. to produce a 3 dimensional view of the sample sequences with different coloring on clusters found from pairwise clustering program
Refine the clusters found in 2. to form spatially compact regions - called Mega-regions aiming to get at most ~100K sequences in each Mega-region as processing time for larger samples was too great.
Assign each remaining (the ones not in the sample sequence set) sequence approximately to a Mega-region based on the pairwise distance between the sequence and the sequence representing the centers of nodes of decomposed trees (see figure 1 below) in a Mega-region. This is extremely fast as can use hierarchical algorithm in 3D mapped space. The result is illustrated in figure 2 below.
For each Mega-region of 5.
Run pairwise alignment and produce distance matrix
Run multidimensional scaling on 6.1
Run pairwise clustering on 6.1
Produce 3 dimensional view based on 6.2 and 6.
If refinements are necessary for clustering repeat from step 6.3 for such clusters The visual depiction of clusters (mapped to 3D) makes it much easier to decide on reliability of results and decide when refinement is warranted). The final result of this is illustrated in figure 3 below
Use Interpolative Joining algorithm to generate Spherical Phylogram as a phylogenetic tree in 3D dimension.

Note each step was run in parallel on a number of cores that depended on problem size but was up to 1000 cores in number.

Dependence of results on sequence length

We also studied the effect of sequence length in final clustering results as presented under following links

16S rRNA length dependency explanation and gallery
Fungi length dependency explanation and gallery

Sequence Alignment

We used the local alignment algorithm, Smith-Waterman to align sequences. We also tried using global alignment with Needleman-Wunsch, but discontinued due to the reasons discussed here.

3D Visualizations

The pictures are screen snapshots of data displayed in a 3D visualizer. You can see the full 3D versions by downloading PlotViz found at download page and using it to open .pviz files labeled as "plot" of download sections.

References

Yang Ruan, Geoffrey House, Saliya Ekanayake, Ursel Schütte, James D. Bever, Haixu Tang, Geoffrey Fox. Integration of Clustering and Multidimensional Scaling to Determine Phylogenetic Trees as Spherical Phylograms Visualized in 3 Dimensions. Proceedings of C4Bio 2014 of IEEE/ACM CCGrid 2014, Chicago, USA, May 26-29, 2014.

Yang Ruan, Geoffrey Fox. A Robust and Scalable Solution for Interpolative Multidimensional Scaling with Weighting. Proceedings of IEEE eScience 2013, Beijing, China, Oct. 22-Oct. 25, 2013. (Best Student Innovation Award)

Yang Ruan, Saliya Ekanayake, Mina Rho, Haixu Tang, Seung-Hee Bae, Judy Qiu, Geoffrey Fox. DACIDR: Deterministic Annealed Clustering with Interpolative Dimension Reduction using a Large Collection of 16S rRNA Sequences. Proceedings of ACM-BCB 2012, Orlando, Florida, ACM, Oct. 7-Oct. 10, 2012.

Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox. HyMR: a Hybrid MapReduce Workflow System. Proceedings of ECMLS’12 of ACM HPDC 2012, Delft, Netherlands, ACM, Jun. 18-Jun. 22, 2012

Adam Hughes, Yang Ruan, Saliya Ekanayake, Seung-Hee Bae, Qunfeng Dong, Mina Rho, Judy Qiu, Geoffrey Fox. Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets, BMC Bioinformatics 2012, 13(Suppl 2):S9.

Technologies Used

Twister
MPI.NET
Dimension Reduction with Deterministic Annealing SMACOF
Dimension Reduction by Interpolation
Clustering by Deterministic Annealing Pairwise Techniques
Smith Waterman Gotoh Distance Computation
NET Bio (formerly Microsoft Biology Foundation)

Work supported in part by the National Science Foundation under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Partners in the FutureGrid project include U. Chicago, U. Florida, San Diego Supercomputer Center - UC San Diego, U. Southern California, U. Texas at Austin, U. Tennessee at Knoxville, U. of Virginia, Purdue I., and T-U. Dresden. Work supported in part by Microsoft Research