Bibliographic record number: 520232
Functional classification of Adenylation domains by Latent Semantic Indexing (LSI)
Functional classification of Adenylation domains by Latent Semantic Indexing (LSI), 2011, graduate thesis, graduate, Prehrambeno-biotehnološki fakultet, Zagreb
CROSBI ID: 520232
Title
Functional classification of Adenylation domains by Latent Semantic Indexing (LSI)
Authors
Baranašić, Damir
Type, subtype and category of work
Theses, graduate thesis, graduate
Faculty
Prehrambeno-biotehnološki fakultet
Place
Zagreb
Date
01.07
Year
2011
Pages
37
Mentor
Starčević, Antonio
Direct supervisor
Žučko, Jurica
Keywords
LSI; A-domains; protein tokenization; protein clustering; SVD; dimension reduction; specificity prediction
Abstract
Latent semantic indexing (LSI) is an information retrieval method that has only recently been introduced into computational biology. In this work, LSI was adapted to predict the amino acid substrates activated by adenylation domains (A-domains). A-domains are obligatory subunits of non-ribosomal peptide synthetase (NRPS) modules; they recognise and activate the amino acid that is to be incorporated into the final product, a non-ribosomally synthesised peptide. Knowing the specific substrate of every sequenced A-domain would make it possible to predict the final product of a linear NRPS and perhaps to design novel biologically active natural products. Two methods were used to vectorize A-domain protein sequences and to construct the resulting term-document matrix: the "n-grams" method and a novel "tokenization" method. The "n-grams" method finds n-peptides in the protein sequence, whereas the "tokenization" method creates specific "tokens" that couple amino acid residues with their corresponding positions in the multiple sequence alignment. LSI uses a mathematical method called singular value decomposition (SVD) to remove unreliable information from the term-document matrix. The number of dimensions used in the analysis was obtained computationally and was found to agree with the empirically determined optimal number of dimensions. The predictions were satisfactory with both "n-grams" and "tokenization" as vectorization methods, although the "tokenization" method generally showed better precision and robustness. A novel clustering method based on LSI was also developed. It gave satisfactory clustering results without the need to guess the number of clusters in advance, as methods such as k-means clustering require.
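The two vectorization steps named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the token format (alignment column number followed by the residue letter), the gap character and the toy sequences are all assumptions made for illustration only.

```python
from collections import Counter


def ngram_terms(sequence, n=3):
    """Return all overlapping n-peptides (n-grams) of an unaligned sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def position_tokens(aligned_sequence, gap="-"):
    """Couple each residue with its alignment column (1-based), skipping gaps.

    The "<column><residue>" token format is an assumption, not the thesis format.
    """
    return [
        f"{column}{residue}"
        for column, residue in enumerate(aligned_sequence, start=1)
        if residue != gap
    ]


def term_document_matrix(documents):
    """Build a count matrix with one row per term and one column per document."""
    counts = [Counter(document) for document in documents]
    terms = sorted(set().union(*counts))
    matrix = [[count[term] for count in counts] for term in terms]
    return terms, matrix


# Hypothetical toy alignment; real A-domain sequences are far longer.
alignment = ["AK-D", "AKGD", "SKGE"]
terms, matrix = term_document_matrix([position_tokens(s) for s in alignment])
for term, row in zip(terms, matrix):
    print(term, row)
```

In an LSI workflow, a matrix like this would then be factorised by SVD and truncated to a small number of dimensions before sequences are compared; that step is omitted here.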
Original language
English
Scientific fields
Biotechnology
AFFILIATIONS OF THE WORK
Projects:
0982560
058-0000000-3475 - Generation of potential drugs in silico (Generiranje potencijalnih lijekova u uvjetima in silico) (Hranueli/Jurica Žučko, Daslav, MZOS) (CroRIS)
Institutions:
Prehrambeno-biotehnološki fakultet, Zagreb