Bibliographic record no. 1031950

Supervised learning approach to long read classification


Vrček, Lovro; Šikić, Mile
Supervised learning approach to long read classification // Fourth International Workshop on Data Science Abstract Book
Zagreb, Croatia, 2019, pp. 71-72 (poster, international peer review, extended abstract, scientific)


CROSBI ID: 1031950

Title
Supervised learning approach to long read classification

Authors
Vrček, Lovro ; Šikić, Mile

Type, subtype and category of work
Conference abstracts, extended abstract, scientific

Source
Fourth International Workshop on Data Science Abstract Book, 2019, pp. 71-72

Conference
Fourth International Workshop on Data Science

Place and date
Zagreb, Croatia, 15 October 2019

Type of participation
Poster

Type of review
International peer review

Keywords
Supervised learning, long reads, pile-o-grams

Abstract
Determining the complete genetic material of an organism is a central task of genomics, as it would enable various applications in medicine and biotechnology. To make that possible, several techniques for DNA sequencing have been developed. The most recent approach, called third-generation sequencing technology, provides DNA sequences whose length can exceed 100,000 bases, but at the cost of a high sequencing error rate, often greater than 15%. The most popular approach for assembling the genome from these sequences, also called reads, is based on the OLC paradigm, which consists of three steps: overlap, layout and consensus. In the overlap phase, all the reads are mutually aligned, which yields an overlap graph. In the layout phase, this graph has to be simplified in order to obtain the Hamiltonian path that defines the assembled genome. In the final phase, consensus, the assembled genome is polished by comparing it with the reads obtained from the sequencer.
However, not only is finding a Hamiltonian path in a graph an NP-hard problem, but due to the high error rate of third-generation sequencers the overlap graph can also be overly complex. To avoid that, reads are first analyzed and divided into three groups: regular, chimeric and repeat. When mapped to the reference genome, regular reads have uniform coverage, since they have a unique position in the genome. Chimeric reads usually arise from a sequencer flaw that joins two distant regions into a single read; they are characterized by a sudden drop in coverage. Repeat reads have significantly higher coverage at either end of the read, originating from the overlap of bases at that end with multiple positions on the reference genome.
The first assembly tool to perform such analysis was HINGE [1]. It observes the pile-o-gram of each read, which is a plot of coverage versus base index. Another tool that uses a similar method is Ra [2]. It stores signals from pile-o-grams in vectors of unsigned short integers and calculates the coverage slope at each position by keeping a sliding window on both sides of the observed position. Identifying chimeric and repeat reads is important because they can produce a complex overlap graph or introduce errors into the assembly. Therefore, each type is dealt with in its own manner: chimeric reads are cut and only the longest non-chimeric region is retained, while ridges are removed from repeat reads.
In this work, we introduce a supervised learning approach for solving this problem. Since the main goal is to improve de novo genome assembly, reads are not mapped onto the reference, but onto each other. This introduces noise into the pile-o-grams and makes the classification more difficult. To deal with that, reads are also mapped onto the reference using the ratlesnake tool [3], which helps classify dubious reads during training. The training is performed on grayscale images of pile-o-grams, on a dataset of 5,000 reads of each class, using convolutional neural networks. The accuracy obtained for regular reads is 88.93%, for chimeric reads 88.17%, and for repeat reads 86.99%. Code can be found at https://github.com/lvrcek/LongReadClassification under the MIT license.
REFERENCES:
[1] G. M. Kamath, I. Shomorony, F. Xia, T. A. Courtade, and D. N. Tse, “HINGE: Long-read assembly achieves optimal repeat resolution”, Genome Research, 2017.
[2] R. Vaser and M. Šikić, “Yet another de novo genome assembler”, 2019.
[3] ratlesnake, https://github.com/lbcb-sci/ratlesnake
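
The pile-o-gram and the two-sided sliding-window slope described in the abstract can be illustrated with a minimal Python sketch. It is not the implementation used by HINGE, Ra or the authors. The function names `pile_o_gram` and `window_slopes`, the half-open overlap intervals and the window size are all assumptions made for illustration. The slope here is taken to be the mean coverage to the right of a position minus the mean coverage to its left.

```python
from typing import List, Tuple


def pile_o_gram(read_length: int, overlaps: List[Tuple[int, int]]) -> List[int]:
    """Per-base coverage of one read from half-open overlap intervals [begin, end)."""
    # Difference array: +1 where an overlap starts, -1 where it ends.
    diff = [0] * (read_length + 1)
    for begin, end in overlaps:
        begin = max(0, begin)
        end = min(read_length, end)
        if begin < end:
            diff[begin] += 1
            diff[end] -= 1
    # A running sum of the difference array gives the coverage at each base.
    coverage, running = [], 0
    for delta in diff[:read_length]:
        running += delta
        coverage.append(running)
    return coverage


def window_slopes(coverage: List[int], window: int) -> List[float]:
    """Hypothetical slope: mean coverage right of a position minus mean left of it."""
    slopes = []
    for i in range(len(coverage)):
        left = coverage[max(0, i - window):i]
        right = coverage[i + 1:i + 1 + window]
        if not left or not right:
            # No window on one side (read boundary): report a flat slope.
            slopes.append(0.0)
            continue
        slopes.append(sum(right) / len(right) - sum(left) / len(left))
    return slopes


# Toy read of 10 bases whose overlaps leave base 5 uncovered (a chimeric-like dip).
coverage = pile_o_gram(10, [(0, 5), (0, 5), (6, 10), (6, 10)])
slopes = window_slopes(coverage, window=2)
print(coverage)  # [2, 2, 2, 2, 2, 0, 2, 2, 2, 2]
print(slopes[4], slopes[6])  # -1.0 1.0
```

In this toy case, a negative slope immediately followed by a positive one brackets the coverage drop, which is the kind of signal the abstract associates with chimeric reads. Thresholds for real, noisy pile-o-grams would have to be tuned and are not given in the source.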

Original language
English

Scientific fields
Biology, Computer science



WORK AFFILIATIONS


Institutions:
Faculty of Electrical Engineering and Computing, Zagreb

Profiles:

Mile Šikić (author)

Lovro Vrček (author)

Cite this publication

Vrček, L. & Šikić, M. (2019) Supervised learning approach to long read classification. In: Fourth International Workshop on Data Science Abstract Book.
@article{article, year = {2019}, pages = {71-72}, keywords = {Supervised learning, long reads, pile-o-grams}, title = {Supervised learning approach to long read classification}, keyword = {Supervised learning, long reads, pile-o-grams}, publisherplace = {Zagreb, Croatia} }



