Supervised learning approach to long read classification

Vrček, Lovro; Šikić, Mile

Data source: CROSBI

Supervised learning approach to long read classification (CROSBI ID 683590)

Conference contribution in proceedings | extended abstract of a conference presentation | international peer review

Vrček, Lovro ; Šikić, Mile. Supervised learning approach to long read classification // Fourth International Workshop on Data Science Abstract Book. 2019. pp. 71-72

Responsibility details

Authors

Vrček, Lovro ; Šikić, Mile

Basic data in the original language
Basic data in other languages

Language

English

Title

Supervised learning approach to long read classification

Abstract

Determining the complete genetic material of an organism is a central task of genomics because it enables various applications in medicine and biotechnology. To make that possible, several techniques for DNA sequencing have been developed. The most recent approach, called third generation sequencing technology, provides DNA sequences whose length can surpass 100,000 bases, but this comes at the cost of a high sequencing error rate, which is often greater than 15%. The most popular approach for assembling the genome from these sequences, also called reads, is based on the OLC paradigm, which consists of three steps: overlap, layout and consensus. In the overlap phase, all the reads are mutually aligned, which produces an overlap graph. In the layout phase, this graph has to be simplified in order to obtain a Hamiltonian path, which defines the assembled genome. In the final phase, consensus, the assembly is polished by comparing it with the reads obtained from the sequencer.

However, not only is finding a Hamiltonian path in a graph an NP-hard problem, but due to the high error rate of third generation sequencers the overlap graph can also be overly complex. To avoid that, reads are first analyzed and divided into three groups: regular, chimeric and repeat. When mapped to the reference genome, regular reads have uniform coverage, since they have a unique position in the genome. Chimeric reads are usually created by a flaw of the sequencer that connects two distant regions into a single read; they are characterized by a sudden drop in coverage. Repeat reads have significantly higher coverage at either end of the read, originating from the overlap of bases at that end with multiple positions on the reference genome. The first assembly tool to perform such analysis was HINGE [1]. It observes the pile-o-gram of each read, which is a plot of coverage versus base index. Another tool that uses a similar method is Ra [2]. It stores signals from pile-o-grams in vectors of unsigned short integers and calculates the coverage slope at each position by keeping a sliding window on both sides of the observed position. Identifying chimeric and repeat reads is important because they can produce a complex overlap graph or introduce errors into the assembly. Therefore, each type is dealt with in its own manner: chimeric reads are cut and only the longest non-chimeric region is retained, while ridges are removed from repeat reads.

In this work, we introduce a supervised learning approach for solving this problem. Since the main goal is to improve de novo genome assembly, reads are not mapped onto the reference, but onto each other. This introduces noise into the pile-o-grams and makes the classification more difficult. To deal with that, reads are also mapped onto the reference using the ratlesnake tool [3], which helps classify dubious reads during training. The training is performed on grayscale images of pile-o-grams, on a dataset of 5,000 reads of each class, using convolutional neural networks. The accuracy obtained is 88.93% for regular reads, 88.17% for chimeric reads and 86.99% for repeat reads. Code can be found at https://github.com/lvrcek/LongReadClassification under the MIT license.

REFERENCES:
[1] G. M. Kamath, I. Shomorony, F. Xia, T. A. Courtade, and D. N. Tse, “HINGE: Long-read assembly achieves optimal repeat resolution”, Genome Research, 2017.
[2] R. Vaser and M. Šikić, “Yet another de novo genome assembler”, 2019.
[3] ratlesnake, https://github.com/lbcb-sci/ratlesnake
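The pile-o-gram and sliding-window slope described in the abstract can be illustrated with a minimal Python sketch. It is not the code of HINGE, Ra, or the authors' repository. The function names `build_pileogram` and `coverage_slopes`, the half-open `(start, end)` interval format, and the definition of the slope as the mean coverage in the right window minus the mean coverage in the left window are all assumptions made for illustration.

```python
from array import array


def build_pileogram(read_length, overlaps):
    """Return per-base coverage of one read.

    `overlaps` is a list of half-open (start, end) intervals on the read
    (a hypothetical input format). Coverage is stored as unsigned shorts,
    mirroring the storage type mentioned in the abstract.
    """
    diff = [0] * (read_length + 1)
    for start, end in overlaps:
        start = max(0, start)
        end = min(read_length, end)
        if start < end:
            diff[start] += 1   # coverage rises where an overlap begins
            diff[end] -= 1     # and falls where it ends
    coverage = array("H")
    running = 0
    for i in range(read_length):
        running += diff[i]
        coverage.append(min(running, 65535))  # clamp to unsigned short range
    return coverage


def coverage_slopes(coverage, window):
    """Slope at each position: mean coverage of the `window` bases to the
    right minus mean coverage of the `window` bases to the left.
    At the read ends, a missing window falls back to the base's own coverage.
    This slope definition is an assumption, not the one used by Ra."""
    n = len(coverage)
    prefix = [0]
    for c in coverage:
        prefix.append(prefix[-1] + c)
    slopes = []
    for i in range(n):
        lo = max(0, i - window)
        hi = min(n, i + 1 + window)
        left = (prefix[i] - prefix[lo]) / (i - lo) if i > lo else float(coverage[i])
        right = (prefix[hi] - prefix[i + 1]) / (hi - i - 1) if hi > i + 1 else float(coverage[i])
        slopes.append(right - left)
    return slopes


# Toy chimeric-like read: no overlap spans position 5, so coverage drops there.
toy = build_pileogram(10, [(0, 5), (0, 5), (6, 10), (6, 10)])
toy_slopes = coverage_slopes(toy, window=2)
print(list(toy))  # [2, 2, 2, 2, 2, 0, 2, 2, 2, 2]
```

In this toy example the slope is negative just before the gap and positive just after it, which is the kind of signal a slope-based method could use to flag a possible chimeric junction. Thresholds for calling a read chimeric or repeat are not given in the abstract and are deliberately left out.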

Keywords

Supervised learning, long reads, pile-o-grams

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Contribution details

Pages

71-72.

Year of publication

2019.

Publication status

published

Host publication details

Title

Fourth International Workshop on Data Science Abstract Book

Conference details

Conference

4th International Workshop on Data Science (IWDS 2019)

Type of participation

poster

Conference date

15.10.2019-15.10.2019

Conference location

Zagreb, Croatia

Related entities

Related persons

Mile Šikić (author)

Lovro Vrček (author)

Related institutions

Fakultet elektrotehnike i računarstva (036) (author's institution)

Field

Biology, Computer Science