RNA Splice Aware Mapper

Marić, Josip; Križanović, Krešimir; Šikić, Mile

Data source: CROSBI

RNA Splice Aware Mapper (CROSBI ID 683545)

Conference contribution in proceedings | extended abstract of a conference presentation

Marić, Josip ; Križanović, Krešimir ; Šikić, Mile. RNA Splice Aware Mapper. 2019. pp. 72-74

Responsibility details

Authors

Marić, Josip ; Križanović, Krešimir ; Šikić, Mile

Basic data in the original language
Basic data in other languages

Language

English

Title

RNA Splice Aware Mapper

Abstract

Advances in sequencing technology by companies such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) have made it possible to produce long reads of more than 10 kbp. Initially, such long reads had a high error rate, which has steadily improved, and the latest generation of PacBio protocols produces reads comparable in accuracy to Illumina short reads. However, most long-read technologies in use still produce reads with an error rate of up to 10%. Although short reads are still predominantly used in RNA-seq analysis, longer reads help in the detection and quantification of isoforms and in better annotation of new genomes. Algorithmically, mapping RNA-seq reads to known transcripts is equivalent to mapping DNA reads. Mapping these reads to eukaryotic genomes, however, is more complex because of RNA splicing. In this paper we present a new splice-aware mapping method, built upon our previously developed DNA mapping method, which is tailored for long reads produced by Pacific Biosciences and Oxford Nanopore devices. It uses several newly developed algorithms that enable higher precision and recall in transcript and exon boundary detection. The method maps RNA reads to the reference genome in four steps: (1) candidate position (anchor) selection, (2) anchor filtering and alignment, (3) exon extension and boundary adjustment, and (4) exon grouping and consensus. The first step, candidate position (anchor) selection, finds candidate positions on the reference genome using short seeds: for every read in the input dataset, a set of approximate matches between parts of the read and parts of the reference is produced. These matches are represented by anchors, where every anchor consists of the start and end locations of the match on the read and its start and end locations on the reference.
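The anchor representation described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes exact k-mer seeds (the abstract speaks of approximate matches and does not specify the seeding scheme), and the names `Anchor`, `build_seed_index`, `find_anchors` and the parameter `k` are hypothetical. Seed hits that lie on the same diagonal and touch or overlap are merged into one anchor.

```python
from collections import defaultdict
from typing import NamedTuple


class Anchor(NamedTuple):
    """One read/reference match; all end coordinates are exclusive."""
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int


def build_seed_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_anchors(read: str, reference: str, k: int = 4) -> list:
    """Collect exact seed hits and merge co-diagonal, touching hits into anchors."""
    index = build_seed_index(reference, k)
    # A hit is (diagonal, read_pos), where diagonal = ref_pos - read_pos.
    hits = sorted(
        (ref_pos - read_pos, read_pos)
        for read_pos in range(len(read) - k + 1)
        for ref_pos in index.get(read[read_pos:read_pos + k], ())
    )
    anchors = []
    for diagonal, read_pos in hits:
        last = anchors[-1] if anchors else None
        same_diagonal = (
            last is not None and last.ref_start - last.read_start == diagonal
        )
        if same_diagonal and read_pos <= last.read_end:
            # Extend the previous anchor along its diagonal.
            anchors[-1] = last._replace(
                read_end=read_pos + k, ref_end=read_pos + diagonal + k
            )
        else:
            anchors.append(
                Anchor(read_pos, read_pos + k,
                       read_pos + diagonal, read_pos + diagonal + k)
            )
    return anchors
```

For a read that spans two exons, such a sketch would be expected to yield one anchor per exon, with the gap between their reference coordinates corresponding to the intron.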
In the second step, anchor filtering and alignment, a variation of the knapsack algorithm is used to find the optimal set of anchors, and piecewise affine gapped alignment is performed between these anchors to produce unpolished alignments. These alignments are then processed in the third step, exon extension and boundary adjustment, where the quality of the unpolished alignments is improved by modifying exon boundaries using information about donor-acceptor splice sites. Finally, in the fourth step, exon grouping and consensus, exon boundaries are refined based on information from multiple read alignments that align to the same location, and the final alignments are produced. The idea is to extract the predominant group of alignments and to try to realign the remaining reads using this group as a template. Exons in alignments from the group whose start or end location is not identical to the group's start or end location are corrected by modifying that location. Several splice-aware mapping tools have been developed for long reads produced by third-generation sequencing technologies. A previously reported evaluation of these tools found GMAP [1] and Minimap2 [2] to be the best performing. The developed mapper was therefore compared with these two tools. All three tools were evaluated on 7 different datasets containing reads representative of third-generation sequencing data. RNA-seq splice alignment has two main goals: (1) all exons of the RNA read need to be correctly identified, and (2) the generated alignment needs to be precise to a single nucleotide base. Failure to meet the first goal leads to misinterpretation of the gene expressed by the mapped RNA molecule, while failure to meet the second, even by a single nucleotide, can result in a frame shift.
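The fourth step described above, exon grouping and consensus, can be sketched roughly as follows. This is an illustration under stated assumptions, not the published algorithm: the predominant group is taken to be the alignments with the most common exon count, the template uses the most frequent start and end per exon, and boundaries are snapped to the template rather than realigned. The names `build_template`, `snap_to_template` and the tolerance `max_shift` are hypothetical.

```python
from collections import Counter

# An exon is a (start, end) pair on the reference, end exclusive.


def build_template(alignments: list) -> list:
    """Derive template exon boundaries from the predominant group of alignments.

    Assumption: the predominant group is the set of alignments that share
    the most common number of exons.
    """
    exon_count = Counter(len(a) for a in alignments).most_common(1)[0][0]
    group = [a for a in alignments if len(a) == exon_count]
    template = []
    for i in range(exon_count):
        # Most frequent start and end of the i-th exon within the group.
        start = Counter(a[i][0] for a in group).most_common(1)[0][0]
        end = Counter(a[i][1] for a in group).most_common(1)[0][0]
        template.append((start, end))
    return template


def snap_to_template(alignment: list, template: list, max_shift: int = 3) -> list:
    """Move exon boundaries onto the template when they differ by at most max_shift."""
    if len(alignment) != len(template):
        # Different exon structure: left unchanged in this sketch.
        return list(alignment)
    corrected = []
    for (start, end), (t_start, t_end) in zip(alignment, template):
        if abs(start - t_start) <= max_shift:
            start = t_start
        if abs(end - t_end) <= max_shift:
            end = t_end
        corrected.append((start, end))
    return corrected
```

Snapping only within a small tolerance is a design choice of this sketch: it keeps a genuinely different isoform from being forced onto the template, at the cost of leaving larger boundary errors uncorrected.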
In this paper we present an evaluation of the tools using two measures: (1) the correct measure evaluates how well the tools find reads whose alignment is perfectly correct, down to a single nucleotide, while (2) the hit-all measure evaluates whether a read's alignment has the same number of exons as the corresponding annotation and whether every one of its exons overlaps the corresponding exon in the annotation by at least one nucleotide. Our mapper is the only tool that consistently achieves good results on the correct-0 and hit-all measures at the same time; that is, it most consistently maps reads both correctly and with all exons identified.
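The two measures can be expressed compactly in code. The sketch below is one reading of the definitions above, not the evaluation scripts used in the paper: it assumes exons are half-open (start, end) intervals on the reference, that exons are compared position by position, and that "correct" means every exon boundary matches the annotation exactly. The function names are hypothetical.

```python
def is_correct(alignment: list, annotation: list) -> bool:
    """True when every exon boundary matches the annotation exactly."""
    return list(alignment) == list(annotation)


def is_hit_all(alignment: list, annotation: list) -> bool:
    """True when exon counts agree and each exon overlaps its annotated
    counterpart by at least one nucleotide (half-open intervals)."""
    if len(alignment) != len(annotation):
        return False
    return all(
        min(a_end, t_end) - max(a_start, t_start) >= 1
        for (a_start, a_end), (t_start, t_end) in zip(alignment, annotation)
    )
```

Under these definitions every correct alignment is also a hit-all alignment, but not the other way round: an alignment whose exon boundary is off by a few bases still counts for hit-all while failing the correct measure.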

Keywords

RNA ; mapper

Note

not recorded

Language

not recorded

Title

not recorded

Abstract

not recorded

Keywords

not recorded

Note

not recorded

Contribution details

Pages

72-74.

Year of publication

2019.

Publication status

published

Parent publication details

Conference details

Conference

4th International Workshop on Data Science (IWDS 2019)

Type of participation

poster

Conference date

15.10.2019-15.10.2019

Conference location

Zagreb, Croatia

Related records

Related persons

Mile Šikić (author)

Krešimir Križanović (author)

Josip Marić (author)

Field

not recorded