Istituto di Informatica e Telematica     
Pellegrini M., Genovese L. M., Geraci F. Protein families comparisons using repeatome-based profiling. In: BITS - Bioinformatics Italian Society Annual Meeting 2014 (Roma, Italy, 26-28/02/2014). Abstract, pp. 1-2, 2014.
Motivation: Protein architectures form a complex multilayered hierarchy. The primary linear sequence of amino acid residues arranges itself in 3-dimensional space so as to form local structures (secondary and super-secondary structures) and extends up to fully functional folded proteins (tertiary and quaternary structures) with their functional characterization. For the majority of proteins only the primary AA sequence is known reliably, while the most valuable characterization in structural and/or functional terms is routinely obtained with prediction tools that try to find matching homologous proteins within databases of validated structural/functional hierarchies (e.g. SCOP, CATH). As remarked in [Simossis and Heringa 2006], no systematic analysis has yet been done of how incorporating repetitive features of the primary sequence might improve alignment quality when matching homologous proteins (and protein families). Here we report initial findings in the direction of repeatome-based profiling of protein families, with the aim of improving current alignment/matching technologies and classification methods.
Methods: PTRStalker [Pellegrini et al. 2012] is an algorithm designed to detect fuzzy tandem repeats (FTR) in protein sequences (20-AA alphabet). Using PTRStalker as a black box, we compute an FTR profile for a protein P as follows: (a) detect the set FTR(P) of FTR in P; (b) compute the mean length of the FTR found over ten random shufflings of P; (c) remove from FTR(P) all FTR of length smaller than the mean computed in (b). The statistically filtered FTR are then turned into a feature vector that includes the length of P, the lengths of all FTR that survive the statistical filtering, in order of appearance along the protein, and the features of the background FTR distribution obtained by random shuffling (mean and max values). This FTR descriptor for the protein P can be used in different ways.
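Steps (a) to (c) and the resulting feature vector can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: PTRStalker is not called here, and `toy_detector` is a hypothetical stand-in that merely reports runs of one repeated residue. The function names, the `seed` parameter, and the exact ordering of the vector entries are assumptions made for the example.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

Repeat = Tuple[int, int]  # (start position, length)


def toy_detector(seq: str) -> List[Repeat]:
    """Hypothetical stand-in for PTRStalker (NOT the real tool):
    reports maximal runs of a single repeated residue of length >= 2."""
    repeats = []
    i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1
        if j - i >= 2:
            repeats.append((i, j - i))
        i = j
    return repeats


def ftr_profile(seq: str,
                detect: Callable[[str], List[Repeat]] = toy_detector,
                n_shuffles: int = 10,
                seed: int = 0) -> list:
    """Build a descriptor: [len(P), filtered repeat lengths..., bg mean, bg max]."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility

    # (a) detect repeats in the original sequence
    observed = detect(seq)

    # (b) collect repeat lengths over random shufflings of the sequence
    background = []
    for _ in range(n_shuffles):
        residues = list(seq)
        rng.shuffle(residues)
        background.extend(length for _, length in detect("".join(residues)))
    bg_mean = mean(background) if background else 0.0
    bg_max = max(background) if background else 0

    # (c) drop repeats shorter than the background mean,
    #     keeping the survivors in order of appearance
    kept = [length for _, length in sorted(observed) if length >= bg_mean]

    return [len(seq), *kept, bg_mean, bg_max]


print(ftr_profile("MKAAAAAAGHKLLLLLLLLPQRS"))
```

The detector is passed in as a parameter so that the filtering logic stays independent of the repeat finder, mirroring the black-box use of PTRStalker described above. Note that the vector length varies with the number of surviving repeats; how such vectors are aligned before a distance is computed is not specified in the abstract.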
In the next section we report good performance of this descriptor in a direct characterization of structured and unstructured proteins. We have also used this descriptor together with the Euclidean metric to perform unsupervised learning (clustering) of SCOP protein families, obtaining highly homogeneous clusters. As a next step we plan to apply this new protein descriptor in conjunction with other descriptors (primary sequence, secondary structure, etc.) in the framework of [Chung and Yona 2004], in order to improve the prediction of distant homologies among protein families by augmenting family profiles with FTR descriptors.
Results: We have tested the approach on three validated data sets. The first data set (DS1) is a collection of 92 sequences covering 54037 bps from [Walsh et al. 2012], of which 18725 bps belong to validated secondary structures (mostly solenoids). This benchmark is intended to measure the capability of PTRStalker to detect existing secondary structures. After statistical filtering, PTRStalker returns 95 fuzzy tandem repeats, of which 67 overlap known SS in DS1. In terms of base counts, the reported FTR cover 17544 bases, of which 11594 cover known SS in DS1 (recall: 0.62, precision: 0.66). The second data set (DS2) is a collection of 105 proteins from the database DisProt classified as 100% disordered. The rationale of the experiment is that disordered proteins should be relatively free of long tandem repeats. We split the data into three groups of 35 proteins each, with length ranges [45-110], [111-208], and [>209], and in each class we tested the hypothesis that the FTR of the disordered proteins in that length class are statistically indistinguishable from those of randomly shuffled proteins. The p-values of the Wilcoxon signed-rank test on the length of the longest FTR found in each of the three classes are, respectively, 0.199, 0.135 and 0.008. This result implies that such unstructured proteins are indeed free of significant FTR at least up to length 200.
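As a quick check on the DS1 figures quoted above, the reported recall and precision are consistent with reading them as position-level ratios: covered known-SS positions over all known-SS positions, and over all positions reported as FTR. That reading is an assumption, since the abstract does not spell out the formulas; the function name below is likewise chosen only for this sketch.

```python
def coverage_precision_recall(covered_true: int,
                              reported_total: int,
                              known_total: int) -> tuple:
    """Position-level precision and recall (assumed definitions).

    covered_true   -- reported positions that fall inside known structures
    reported_total -- all positions covered by reported repeats
    known_total    -- all positions inside known structures
    """
    precision = covered_true / reported_total
    recall = covered_true / known_total
    return precision, recall


# Counts taken from the DS1 paragraph: 11594 of the 17544 reported
# positions fall inside the 18725 validated secondary-structure positions.
precision, recall = coverage_precision_recall(11594, 17544, 18725)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.66 recall=0.62"
```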
The DS2 result is in line with the findings of the experiment on DS1. The third data set (DS3) is composed of 507 non-redundant proteins in 6 SCOP superfamilies from [Paccanaro et al. 2006], selected as a challenge for clustering algorithms. Within any superfamily, protein pairs have high sequence divergence but high structural similarity. For each protein we build a descriptor of its FTR profile (see Methods), including also the background as measured by randomly shuffling the protein sequence. Clustering with the tool AMIC@ [Geraci et al. 2008], using Euclidean distance and a target of 30 clusters, has produced 26 clusters that are highly homogeneous at the superfamily level (hypergeometric test p-value < 0.004 after BHY FDR adjustment for multiple testing), covering 90% of the input set. This experiment implies that FTR characterization of proteins is a promising new feature that can be used in novel clustering and classification tasks.
Availability: http://
Contact E-Mail: marco.pellegrini@iit.cnr.it
Info:
Simossis VA, Heringa J. (2006). Local structure prediction of proteins. In: Computational Methods for Protein Structure Prediction and Modeling (Xu Y, Xu D, Liang J, Eds.), Springer-Verlag, GmbH.
Chung R, Yona G. (2004). Protein family comparison using statistical models and predicted structural information. BMC Bioinformatics. Nov 25;5:183.
Walsh I, Sirocco FG, Minervini G, Di Domenico T, Ferrari C, Tosatto SCE. (2012). RAPHAEL: Recognition, periodicity and insertion assignment of solenoid protein structures. Bioinformatics. doi:10.1093/bioinformatics/bts550.
Pellegrini M, Renda ME, Vecchio A. (2012). Ab Initio Detection of Fuzzy Amino Acid Tandem Repeats in Protein Sequences. BMC Bioinformatics, Vol. 13(Suppl 3):S8, March 2012. doi:10.1186/1471-2105-13-S3-S8.
Sickmeier M, Hamilton JA, LeGall T, Vacic V, Cortese MS, Tantos A, Szabo B, Tompa P, Chen J, Uversky VN, Obradovic Z, Dunker AK. (2007). DisProt: the Database of Disordered Proteins. Nucleic Acids Res. Jan;35(Database issue):D786-93.
Paccanaro A, Casbon JA, Saqi MAS. (2006). Spectral clustering of protein sequences. Nucleic Acids Research 34(5), 1571-1580.
Geraci F, Pellegrini M, Renda E. (2008). AMIC@: All MIcroarray Clusterings @ once. Nucleic Acids Research, Vol. 36, Web Server Issue, W315-W319.
Subject: algorithm, Biology, Protein architectures
J.3 LIFE AND MEDICAL SCIENCES: Biology and genetics
