Bibliographic record number: 1031806

RNA Splice Aware Mapper

Marić, Josip; Križanović, Krešimir; Šikić, Mile
RNA Splice Aware Mapper // Fourth International Workshop on Data Science
Zagreb, Croatia, 2019, pp. 72-74 (poster, peer-reviewed, extended abstract, scientific)

CROSBI ID: 1031806

Type, subtype and category of work
Conference abstracts, extended abstract, scientific

Fourth International Workshop on Data Science

Place and date
Zagreb, Croatia, 15 October 2019

Participation type

Review type

Keywords
RNA ; mapper

Advances in sequencing technology achieved by companies such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) have resulted in the production of long reads that are over 10 kbp in length. Initially, such long reads had a high error rate, but this has steadily improved, and the latest generation of PacBio protocols produces reads comparable in accuracy to Illumina short reads. However, most long-read technologies in use still produce reads with error rates of up to 10%. Although short reads are still predominantly used in RNA-seq analysis, longer reads help in the detection and quantification of isoforms and in better annotation of new genomes. Algorithmically, mapping RNA-seq reads to known transcripts is equivalent to mapping DNA reads. Mapping these reads to eukaryotic genomes is more complex because of RNA splicing: a single read may span several exons that are separated by introns on the genome.
In this paper we present a new splice-aware mapping method, built upon our previously developed DNA mapping method and tailored for long reads produced by Pacific Biosciences and Oxford Nanopore devices. It uses several newly developed algorithms which enable higher precision and recall for transcript and exon boundary detection. The method maps RNA reads to the reference genome in four steps: (1) candidate position (anchor) selection, (2) anchor filtering and alignment, (3) exon extension and boundary adjustment, and (4) exon grouping and consensus.
In the first step, candidate position (anchor) selection, candidate positions on the reference genome are found using short seeds: for every read in the input dataset, a set of approximate matches between parts of the read and parts of the reference is produced. These matches are represented by anchors, where every anchor consists of the start and end locations of the match on the read and on the reference.
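The anchor representation described for the first step can be illustrated with a minimal sketch. It assumes exact k-mer seeds and merges hits that lie on the same diagonal; the abstract does not specify the seeding scheme, so the names `Anchor`, `build_index` and `find_anchors`, the seed length `k`, and the exact-match simplification are all illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Anchor(NamedTuple):
    """One match between a read and the reference (half-open coordinates)."""
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_anchors(read, reference, k=5):
    """Collect exact k-mer hits and merge consecutive hits on one diagonal.

    Assumption: exact seeds only. A real long-read mapper would use
    error-tolerant seeds; this sketch only shows the anchor data structure.
    """
    index = build_index(reference, k)
    # Hits belonging to one ungapped match share the same diagonal
    # (reference position minus read position).
    by_diagonal = defaultdict(list)
    for read_pos in range(len(read) - k + 1):
        for ref_pos in index.get(read[read_pos:read_pos + k], ()):
            by_diagonal[ref_pos - read_pos].append(read_pos)

    anchors = []
    for diagonal, read_positions in by_diagonal.items():
        start = prev = read_positions[0]
        for pos in read_positions[1:]:
            if pos > prev + 1:  # run of consecutive seeds ended
                anchors.append(Anchor(start, prev + k,
                                      start + diagonal, prev + k + diagonal))
                start = pos
            prev = pos
        anchors.append(Anchor(start, prev + k,
                              start + diagonal, prev + k + diagonal))
    return sorted(anchors)
```

For a toy reference made of two exons separated by an intron, and a read that contains only the two exons, the sketch yields one anchor per exon, with the second anchor shifted on the reference by the intron length.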
In the second step, anchor filtering and alignment, a variation of the knapsack algorithm is used to find the optimal set of anchors, and piecewise affine gapped alignment is performed between these anchors to produce unpolished alignments. In the third step, exon extension and boundary adjustment, the quality of the unpolished alignments is improved by modifying exon boundaries using information about donor-acceptor splice sites. Finally, in the fourth step, exon grouping and consensus, exon boundaries are refined using information from multiple reads that align to the same location, and the final alignments are produced. The idea is to extract the predominant group of alignments and to try to realign the remaining reads using this group as a template. Within the group, exons whose alignments do not have a start or end location identical to the group's start or end location are corrected by modifying those locations.
Several splice-aware mapping tools have been developed for long reads produced by third-generation sequencing technologies. A previously reported evaluation of these tools found GMAP [1] and Minimap2 [2] to be the best performing, so the developed mapper was compared with these two tools. All three tools were evaluated on seven different datasets of reads representative of third-generation sequencing data.
RNA-seq splice alignment has two main goals: (1) all exons of the RNA read need to be correctly identified, and (2) the generated alignment needs to be precise to a single nucleotide. Failure to meet the first goal leads to misinterpretation of the gene expressed by the mapped RNA molecule, while failure to meet the second, even by a single missed nucleotide, can result in a frame shift.
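The anchor filtering in the second step can be pictured as choosing a mutually compatible subset of anchors with the highest total weight. The abstract only states that a variation of the knapsack algorithm is used, so the sketch below swaps in a plain weighted chaining dynamic program as a stand-in: it keeps anchors that are colinear and non-overlapping on both the read and the reference, and maximises the number of covered read bases. The function name `best_chain` and the scoring are assumptions made for illustration.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    """One match between a read and the reference (half-open coordinates)."""
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int


def best_chain(anchors):
    """Return the colinear, non-overlapping anchors covering the most read bases.

    Stand-in for the paper's knapsack variation (details not given there):
    a quadratic-time chaining DP over anchors sorted by read position.
    """
    if not anchors:
        return []
    ordered = sorted(anchors)
    # score[i]: best total read coverage of a chain that ends at ordered[i]
    score = [a.read_end - a.read_start for a in ordered]
    prev = [-1] * len(ordered)
    for i, cur in enumerate(ordered):
        length = cur.read_end - cur.read_start
        for j in range(i):
            cand = ordered[j]
            # A predecessor must end before cur starts on read AND reference;
            # the reference gap may be large, since it can be an intron.
            compatible = (cand.read_end <= cur.read_start
                          and cand.ref_end <= cur.ref_start)
            if compatible and score[j] + length > score[i]:
                score[i] = score[j] + length
                prev[i] = j
    # Backtrack from the best-scoring chain end.
    i = max(range(len(ordered)), key=score.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]
```

Given two true exon anchors and two spurious ones (one overlapping on the read, one pointing backwards on the reference), the chain keeps only the two exon anchors; the retained anchors would then be joined by gapped alignment.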
In this paper we evaluate the tools using two measures: (1) the correct measure evaluates how well a tool produces alignments that are perfectly correct, up to a single nucleotide, while (2) the hit-all measure evaluates whether a read's alignment has the same number of exons as the corresponding annotation and whether each of its exons overlaps the corresponding annotated exon by at least one nucleotide. Our mapper is the only tool that consistently achieves good results in the correct-0 and hit-all measures at the same time; that is, it most consistently maps reads both correctly and with all exons identified.
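The two measures can be written down as simple predicates over exon intervals. The sketch below assumes exons are given as half-open `(start, end)` pairs in genomic order and that aligned and annotated exons correspond by position in the list. The `tolerance` parameter is a hypothetical addition: its default of 0 reflects one reading of "correct-0" as an exact boundary match, which the abstract does not spell out.

```python
def hit_all(aligned_exons, annotated_exons):
    """True if the exon counts match and every aligned exon overlaps its
    annotated counterpart by at least one nucleotide."""
    if len(aligned_exons) != len(annotated_exons):
        return False
    return all(
        min(a_end, b_end) > max(a_start, b_start)
        for (a_start, a_end), (b_start, b_end)
        in zip(aligned_exons, annotated_exons)
    )


def correct(aligned_exons, annotated_exons, tolerance=0):
    """True if the exon counts match and every exon boundary is within
    `tolerance` nucleotides of the annotation (assumed parameter;
    0 means the boundaries must be identical)."""
    if len(aligned_exons) != len(annotated_exons):
        return False
    return all(
        abs(a_start - b_start) <= tolerance and abs(a_end - b_end) <= tolerance
        for (a_start, a_end), (b_start, b_end)
        in zip(aligned_exons, annotated_exons)
    )
```

Under these definitions an alignment whose second exon starts one base late still counts for hit-all but not for correct with zero tolerance, which is why a tool can score well on one measure and poorly on the other.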

Original language



Josip Marić (author)

Krešimir Križanović (author)

Mile Šikić (author)

Cite this publication

Marić, J., Križanović, K. & Šikić, M. (2019) RNA Splice Aware Mapper. In: Fourth International Workshop on Data Science.
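@inproceedings{article,
  author = {Marić, Josip and Križanović, Krešimir and Šikić, Mile},
  title = {RNA Splice Aware Mapper},
  booktitle = {Fourth International Workshop on Data Science},
  address = {Zagreb, Croatia},
  year = {2019},
  pages = {72--74},
  keywords = {RNA, mapper}
}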