Use of variable selection in modeling the secondary structural content of proteins from their composition of amino acid residues

Piližota, Teuta; Lučić, Bono; Trinajstić, Nenad

Data source: CROSBI

Use of variable selection in modeling the secondary structural content of proteins from their composition of amino acid residues (CROSBI ID 105202)

Journal contribution | original scientific paper | international peer review

Piližota, Teuta ; Lučić, Bono ; Trinajstić, Nenad Use of variable selection in modeling the secondary structural content of proteins from their composition of amino acid residues // Journal of chemical information and computer sciences, 44 (2004), 1; 113-121

Statement of responsibility

Authors

Piližota, Teuta ; Lučić, Bono ; Trinajstić, Nenad

Basic data in the original language
Basic data in other languages

Language

English

Title

Use of variable selection in modeling the secondary structural content of proteins from their composition of amino acid residues

Abstract

Predicting the secondary structure content of proteins from the composition of their amino acid residues can help bridge the gap for proteins of known primary sequence but unknown secondary structure. Almost all recently published models of the relationship between the composition (frequency of occurrence) of amino acid residues and the secondary structure content of proteins involve the composition of all 20 amino acid residues. However, it is well known that many amino acid residues are mutually similar in their physicochemical properties (hydrophobicity, hydrophilicity, charge, size, etc.). We were therefore motivated to investigate whether the total number of terms (frequencies of amino acid residues) can be reduced in models describing the relation between the composition of amino acid residues and the percentage of residues belonging to alpha, beta, and coil secondary structure. For this purpose we used the CROMRsel algorithm (J. Chem. Inf. Comput. Sci. 1999, 39, 121-132), which selects a small subset of the most important variables/descriptors (here, frequencies of occurrence of amino acid residues in proteins) for multiregression (MR) models. The analysis was performed on a data set containing 475 proteins, taken from Proteins 1996, 25, 157-168. The complete data set was partitioned into a 317-protein training set and a 158-protein test set. The best possible linear models containing I = 1, ..., 20 frequencies were selected from all 20 frequencies of occurrence of amino acid residues on the 317-protein training set and were used to predict the corresponding percentage of secondary structure content on the 158-protein test set. For the 317-protein data set, the best selected concise models for the alpha, beta, and coil secondary structure contain only 9, 5, and 8 frequencies, respectively. The selected concise models have fitted, cross-validated, and predictive statistical parameters that are as good as or better than those of the models containing all 20 frequencies. Additionally, for each I (I = 1, ..., 20) the 30 best possible random models were selected. In each case, the best possible real models are much better than each of the best possible random models, showing clearly that there is no risk of a chance correlation (which one might otherwise expect, given the exhaustive search for the best model having I frequencies among all 20!/(I!(20-I)!) possible models). Finally, the best selected models on the complete 475-protein data set for the alpha, beta, and coil secondary structure contain only 7, 4, and 7 frequencies of amino acid residues, respectively. These models are much simpler and have better fitted and cross-validated errors than the corresponding models from the literature, which were obtained without a procedure for selecting the most important frequencies of amino acid residues in proteins.
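
The exhaustive search and the random-model check described in the abstract can be illustrated with a minimal Python sketch. This is not the CROMRsel algorithm itself; it is a plain best-subset least-squares search run on synthetic data. The function names (`fit_rss`, `best_subset`), the toy sizes (120 "proteins", 8 "residue frequencies"), the dependence of the response on columns 1 and 4, and the use of uniformly random descriptors as the control are all assumptions made for illustration, since the abstract does not say how the random models were generated.

```python
from itertools import combinations
from math import comb

import numpy as np


def fit_rss(X, y):
    """Fit y ~ intercept + X by least squares; return the residual sum of squares."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def best_subset(X, y, size):
    """Try every subset of `size` columns and keep the one with the lowest RSS."""
    best_cols, best_rss = None, np.inf
    for cols in combinations(range(X.shape[1]), size):
        rss = fit_rss(X[:, list(cols)], y)
        if rss < best_rss:
            best_cols, best_rss = cols, rss
    return best_cols, best_rss


rng = np.random.default_rng(0)
n_proteins, n_residues = 120, 8  # toy sizes (assumption), not the paper's 317 x 20

# Synthetic "compositions": each row sums to 1, like residue frequencies.
X = rng.dirichlet(np.ones(n_residues), size=n_proteins)
# Synthetic "helix content" that depends on columns 1 and 4 only (assumption).
y = 0.8 * X[:, 1] - 0.5 * X[:, 4] + rng.normal(0.0, 0.01, n_proteins)

real_cols, real_rss = best_subset(X, y, size=2)

# Random control: repeat the same search on 30 matrices of random descriptors.
random_rss = [
    best_subset(rng.random((n_proteins, n_residues)), y, size=2)[1]
    for _ in range(30)
]

print("models with I = 9 out of 20 frequencies:", comb(20, 9))
print("best real subset:", real_cols, "RSS:", real_rss)
print("lowest RSS among random controls:", min(random_rss))
```

If the best real model fits far better than every best random model, as it does in this toy setup, the selected subset is unlikely to be a product of the search alone. The number of candidate models of size I is 20!/(I!(20-I)!), which is what `comb(20, I)` computes.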

Keywords

globular-proteins; selection of frequencies; composition of amino acid residues; secondary structure; helix/strand/coil content; concise models; prediction; multivariate regression; variable selection

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Publication data

Journal

Journal of chemical information and computer sciences

Volume (issue)

44 (1)

Year

2004

Pages

113-121

Publication status

published

ISSN

0095-2338

Related records

Related persons

Nenad Trinajstić (author)

Bono Lučić (author)

Related institutions

Institut Ruđer Bošković (098) (author's institution)

Field

Chemistry

Indexed in

Scopus

Current Contents Connect (CCC)

Medline

Web of Science Core Collection, Science Citation Index Expanded (WoSCC-SCI-Exp)

Web of Science Core Collection, SCI-Exp, SSCI & A&HCI (WoSCC-SCI-Exp, SSCI, A&HCI)