An XML format for peptide datasets

Repar, Jelena; Škunca, Nives; Supek, Fran; Šmuc, Tomislav

Bibliographic record number: 373878

An XML format for peptide datasets // ECCB'08 European Conference on Computational Biology
Cagliari, Italy, 2008. (poster, international peer review, abstract, scientific)

CROSBI ID: 373878

Title
An XML format for peptide datasets

Authors
Repar, Jelena; Škunca, Nives; Supek, Fran; Šmuc, Tomislav

Type, subtype and category of work
Conference abstracts, abstract, scientific

Source
ECCB'08 European Conference on Computational Biology / - , 2008

Conference
ECCB'08 European Conference on Computational Biology

Place and date
Cagliari, Italy, 22-26 September 2008

Type of participation
Poster

Type of review
International peer review

Keywords
XML; peptide; datasets; standardization

Abstract
A number of papers in the recent scientific literature deal with a set of seemingly dissimilar bioinformatics problems. Their connecting motif is that they all focus on sequences of peptides, or fragments of proteins. Generally, the peptide sequences in question have first been characterized by wet-lab experiments. The results have then been published in a paper and possibly stored in a highly specialized database. To use the sequences for bioinformatics analyses of the processes that generated them, a researcher needs to extract peptides from the database, analyze them and form a dataset. As this last step is time-consuming and arduous, other scientists tend to re-use such datasets in trying to arrive at better bioinformatics models and, hopefully, a better understanding of the processes. The peptide datasets we have collected so far can be roughly divided into four categories, with the possibility of adding more as the need arises: a) posttranslational modification of proteins (e.g. phosphorylation); b) cleavage by broad-specificity proteases (e.g. proteasome, HIV-1 protease); c) determination of protein secondary structure; d) epitope recognition (e.g. T-cell epitope recognition). As many of these processes are of medical as well as biological importance, there is great interest in their further study and explanation. The peptide datasets can all generally be viewed as sets of amino acid sequences to which a class label has been assigned experimentally. However unconnected the underlying problems may seem at first glance (e.g. prediction of protein secondary structure, prediction of phosphorylation sites in proteins), they often all serve for the construction of classification models by similar supervised machine learning approaches.
The final aim of the computational approaches is to develop a model that will best serve for in silico predictions of events occurring in live cells. Most attempts at modelling these problems have run into the question of how to represent amino acid sequences numerically, a numerical representation being necessary for many classification algorithms. Various approaches to finding the best representation have been taken, and they commonly compare amino acid representations on the same problem using the same classification method. Only rarely have researchers compared optimal amino acid representations between different problems. In light of the similarity between various peptide classification problems, one cannot help wondering whether there is an optimal amino acid representation that would work best across different problems and that would reveal the amino acid features most prominent in shaping the natural processes in question. It is the aim of our future research to address this question in more detail. Despite the striking similarity between the peptide modelling problems, there is no established flow of information. Quite a few field-specialized databases are readily available, but they need extensive pre-processing to yield a dataset appropriate for modelling. Additionally, because of frequent database updates and vague descriptions of how datasets were produced, it is hard for different scientists to arrive at exactly the same dataset, and even slight variations in the datasets would invalidate a comparison of modelling approaches. Although some datasets are available on request, there is no established format for dataset exchange, so more time is wasted on managing different data formats. Therefore, easing the process of data exchange by standardizing peptide dataset formats is a fundamental requirement for better research in protein structural biology.
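To make the notion of a numerical representation of amino acid sequences concrete, the following is a minimal sketch of one widely used option, orthogonal (one-hot) encoding. It is an illustration only: the abstract does not prescribe any particular representation, and the function name `one_hot_encode` and the two labelled peptides are hypothetical placeholders.

```python
# Illustrative sketch only: one-hot encoding is one of many possible
# amino acid representations; the abstract does not prescribe any.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide):
    """Return a flat list of 0/1 values, 20 per residue (hypothetical helper)."""
    vector = []
    for residue in peptide:
        if residue not in INDEX:
            raise ValueError(f"unknown residue: {residue!r}")
        bits = [0] * len(AMINO_ACIDS)
        bits[INDEX[residue]] = 1
        vector.extend(bits)
    return vector


# Made-up (sequence, class label) pairs, not real experimental data.
dataset = [("ACD", 1), ("WYA", 0)]
features = [one_hot_encode(seq) for seq, _ in dataset]
labels = [label for _, label in dataset]
```

Each peptide of length n becomes a vector of 20·n values that a standard classifier can consume; other representations (for example, physicochemical property scales) would replace only the per-residue lookup.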
We have chosen Extensible Markup Language (XML) for such a data format. XML has been shown to be highly efficient in storing data in an orderly, researcher-comprehensible manner. It is both robust and extensible, which ensures not only correct data distribution but also adaptability in cases where foresight fails. We hereby propose an XML format for peptide sequence datasets, which we hope will lead to more efficient information exchange within this specific area of protein structural biology. In the future, we hope to further stimulate the information flow by building a peptide dataset repository.
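The abstract does not reproduce the proposed schema, so the following is only a sketch of how a labelled peptide dataset might be expressed in XML and read back with Python's standard library. All element and attribute names (`peptideDataset`, `peptide`, `sequence`, `label`, `category`) and both sequences are assumptions invented for illustration; they are not taken from the authors' format.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are invented for
# illustration and are NOT taken from the proposed format. The sequences
# and labels are placeholders, not experimental data.
EXAMPLE_XML = """\
<peptideDataset category="cleavage">
  <peptide id="p1" label="cleaved">
    <sequence>ACDEFGHI</sequence>
  </peptide>
  <peptide id="p2" label="not_cleaved">
    <sequence>KLMNPQRS</sequence>
  </peptide>
</peptideDataset>
"""


def load_peptides(xml_text):
    """Parse the hypothetical format into (sequence, label) pairs."""
    root = ET.fromstring(xml_text)
    pairs = []
    for peptide in root.findall("peptide"):
        sequence = peptide.findtext("sequence", default="").strip()
        pairs.append((sequence, peptide.get("label")))
    return pairs


pairs = load_peptides(EXAMPLE_XML)
```

Keeping the class label next to each sequence mirrors the description of peptide datasets as sets of amino acid sequences with experimentally assigned class labels; a real schema would likely also record provenance such as the source database and its version.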

Original language
English

Scientific fields
Biology, Computer science, Biotechnology

RELATED PROJECTS, INSTITUTIONS AND PROFILES

Projects:
098-0000000-3168 - Machine learning of predictive models in computational biology (Šmuc, Tomislav, MZOS) (CroRIS)
098-0982913-2862 - Molecular mechanisms of DNA recombination and repair (Zahradka, Davor, MZOS) (CroRIS)

Institutions:
Ruđer Bošković Institute, Zagreb

Profiles:

Nives Škunca (author)

Fran Supek (author)

Jelena Repar (author)

Tomislav Šmuc (author)

www.eccb08.org
