A Rank-based Classification Method Applied to Biological Sequence Data

Alex Grossmann This is a report on joint work with C.Devauchelle, A. Dress, S. Gruenewald and A. Henaut.


The methods to be described were largely motivated by problems in the classification and evolution of proteins. It can happen there that a natural "dissimilarity" matrix is not symmetric, and that there are pairs of proteins that are "infinitely distant" from each other. It is then useful to develop clustering methods associated to an arbitrary square "dissimilarity" matrix D.

There is a straightforward construction - based on rank methods- that associates a hierarchy of subsets to every such matrix. This construction is "global",i.e.it does not operate by successive merging of small clusters.

In the examples that we have studied, the family of sets produced by this construction is often too small to be interesting.

There is, however, a way -again inspired by methods of rank statistics-to uncover much richer families. Given a "distance" between rank vectors, one can define a finite family of matrices dD, d^2D,...obtained by iterating a tranformation d which will be described in the talk. Our understanding of this dynamical system is very incomplete, but the methods based on the consideration of this family have often produced surprisingly "good" large scale clusters.

The real-life examples include families of mitochodrial and of plant proteins.