
Bibliographic record number: 520232

Functional classification of Adenylation domains by Latent Semantic Indexing (LSI)


Baranašić, Damir
Functional classification of Adenylation domains by Latent Semantic Indexing (LSI), 2011, graduate thesis, graduate level, Prehrambeno-biotehnološki fakultet, Zagreb


Title
Functional classification of Adenylation domains by Latent Semantic Indexing (LSI)

Authors
Baranašić, Damir

Type, subtype and category of work
Theses, graduate thesis, graduate level

Faculty
Prehrambeno-biotehnološki fakultet

Place
Zagreb

Date
01.07

Year
2011

Pages
37

Mentor
Starčević, Antonio

Direct supervisor
Žučko, Jurica

Keywords
LSI; A-domains; protein tokenization; protein clustering; SVD; dimension reduction; specificity prediction

Abstract
Latent semantic indexing (LSI) is an information retrieval method that has only relatively recently been introduced into computational biology. In this work, LSI was adapted to predict the amino acid substrates activated by adenylation domains (A-domains). A-domains are obligatory subunits of non-ribosomal peptide synthetase (NRPS) modules; each recognises and activates the amino acid to be incorporated into the final product, a non-ribosomally synthesised peptide. Knowing the specific substrate of every sequenced A-domain would make it possible to predict the final product of a linear NRPS and perhaps to design novel biologically active natural products. Two methods were used to vectorize A-domain protein sequences and construct the resulting term-document matrix: the “n-grams” method and a novel “tokenization” method. The “n-grams” method finds n-peptides in the protein sequence, whereas the “tokenization” method creates specific “tokens” that couple amino acid residues with their positions in the multiple sequence alignment. LSI uses a mathematical method called singular value decomposition (SVD) to reduce the unreliable information in the term-document matrix. The number of dimensions used in the analysis was obtained computationally and agreed with the empirically determined optimal number of dimensions. Both “n-grams” and “tokenization” gave satisfactory predictions as vectorization methods, with “tokenization” generally showing better precision and robustness. A novel clustering method based on LSI was also developed. It gave satisfactory clustering results without requiring the number of clusters to be guessed in advance, as methods such as k-means clustering do.
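
The two vectorization methods named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the token format (residue letter followed by its 1-based alignment column), the use of raw counts, and the toy alignment are all assumptions made for illustration. The sketch only builds the term-document matrices; the SVD step that LSI applies afterwards is not shown.

```python
from collections import Counter


def ngram_terms(sequence, n=3):
    """Return all overlapping n-peptides (n-grams) of an ungapped sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def token_terms(aligned_sequence, gap="-"):
    """Return position-specific tokens such as 'D1' from one alignment row.

    Each token couples a residue with its 1-based alignment column
    (an assumed format); gap characters are skipped.
    """
    return [f"{res}{col}"
            for col, res in enumerate(aligned_sequence, start=1)
            if res != gap]


def term_document_matrix(term_lists):
    """Build a count matrix: one row per term, one column per sequence."""
    vocabulary = sorted({term for terms in term_lists for term in terms})
    counts = [Counter(terms) for terms in term_lists]
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Toy alignment rows, made up for illustration (not real A-domains).
alignment = ["DAW-TV", "DAWFTV", "DLY-NA"]

# n-grams are taken from the ungapped sequences, tokens from the aligned rows.
ngram_lists = [ngram_terms(row.replace("-", ""), n=3) for row in alignment]
token_lists = [token_terms(row) for row in alignment]

ngram_vocab, ngram_matrix = term_document_matrix(ngram_lists)
token_vocab, token_matrix = term_document_matrix(token_lists)
```

In this sketch the n-gram terms ignore where a peptide occurs, while the tokens depend on alignment columns, so two sequences share a token only when they carry the same residue at the same aligned position.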

Original language
English

Scientific fields
Biotechnology



WORK AFFILIATIONS


Project / topic
058-0000000-3475 - Generating potential drugs in silico (Daslav Hranueli)
0982560

Institutions
Prehrambeno-biotehnološki fakultet, Zagreb