ALGORITHMS FOR DE NOVO GENOME ASSEMBLY FROM THIRD GENERATION SEQUENCING DATA

Sović, Ivan

Bibliographic record number: 853388

ALGORITHMS FOR DE NOVO GENOME ASSEMBLY FROM THIRD GENERATION SEQUENCING DATA, 2016, doctoral dissertation, Fakultet elektrotehnike i računarstva, Zagreb

CROSBI ID: 853388

Title
ALGORITHMS FOR DE NOVO GENOME ASSEMBLY FROM THIRD GENERATION SEQUENCING DATA

Authors
Sović, Ivan

Type, subtype and category of work
Theses, doctoral dissertation

Faculty
Fakultet elektrotehnike i računarstva

Place
Zagreb

Date
04.10

Year
2016

Pages
185

Mentor
Šikić, Mile

Keywords
de novo; assembly; PacBio; nanopore; NanoMark; GraphMap; Racon; Aracon

Abstract
Over the past ten years, genome sequencing has been an extremely active research topic, and it is gaining particular momentum right now. New and more affordable technologies have been released, requiring the rapid development of new algorithmic methods to cope with the data. Affordable commercial sequencing technology, together with algorithmic methods that can leverage the data, could open the door to a vast number of important applications, such as diagnosis and treatment of chronic diseases through personalized medicine, or identification of pathogenic microorganisms from soil, water, food or tissue samples. Sequencing the entire genome of an organism is a difficult problem, because all sequencing technologies to date are limited in the length of the molecule they can read (much shorter than the genomes of the vast majority of organisms). To obtain the sequence of an entire genome, reads need to be either stitched together (assembled) in a de novo fashion when the genome of the organism is not known in advance, or mapped and aligned to a reference genome if one exists (reference assembly or mapping). The main problem in both approaches stems from repeating regions in the genome which, if longer than the reads, prevent complete assembly of the genome. The need for technologies that produce longer reads, which could resolve repeating regions, has led to the advent of new sequencing approaches: the so-called third generation sequencing technologies, which currently include two representatives, Pacific Biosciences (PacBio) and Oxford Nanopore. Besides long reads, both technologies are characterized by high error rates, which the assembly algorithms existing at the time were not capable of handling. This led to the development of time-consuming read error correction methods, which were applied as a pre-processing step prior to assembly.
Instead, the work conducted in the scope of this thesis focuses on developing novel methods for de novo DNA assembly from third generation sequencing data that provide enough sensitivity and precision to omit the error-correction phase completely. Strong focus is put on nanopore data. In the scope of this thesis, four new methods were developed: (I) NanoMark, an evaluation framework for comparison of assembly methods for nanopore sequencing data; (II) GraphMap, a fast and sensitive mapper for long error-prone reads; (III) Owler, a sensitive overlapper for third generation sequencing; and (IV) Racon, a rapid consensus module for correcting raw assemblies. Owler and Racon were used as modules in the development of a novel de novo genome assembler, Aracon. The results show that Aracon reduces the overall assembly time by at least 3x, and by up to an order of magnitude, compared to state-of-the-art methods, while retaining comparable or better assembly quality.

Original language
English

Scientific fields
Computing

AFFILIATIONS OF THE WORK

Institutions:
Fakultet elektrotehnike i računarstva, Zagreb

Profiles:

Ivan Sović (author)