Authors: Tomislav Šebrek, Jan Tomljanović, Josip Krapac, Mile Šikić
ArXiv: 1904.10353
Abstract URL: http://arxiv.org/abs/1904.10353v1
In this paper, we propose a semi-supervised deep learning method for
detecting the specific types of reads that impede the de novo genome assembly
process. Instead of dealing directly with sequenced reads, we analyze their
coverage graphs converted to 1D signals. We noticed that specific signal
patterns occur in each relevant class of reads. A semi-supervised approach was
chosen because manually labeling the data is a very slow and tedious process,
so our goal was to facilitate the assembly process with as little labeled data
as possible. We tested two models to learn patterns in the coverage graphs:
M1+M2 and semi-GAN. We evaluated each model on a manually labeled dataset
comprising reads from multiple reference genomes, measuring performance as a
function of the number of labeled examples used during training. In addition,
we embedded our detection step in the assembly process, which improved the
quality of the assemblies.
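
To make the idea of a coverage graph treated as a 1D signal concrete, here is a minimal sketch in Python. It is not the authors' code. It assumes that a read's coverage graph is the number of other reads overlapping each base position, and that overlaps are given as half-open (start, end) intervals on the read; the function name coverage_signal and this input format are hypothetical.

```python
from itertools import accumulate


def coverage_signal(read_length, overlaps):
    """Return per-base coverage of one read as a list of integers.

    Assumption (not from the paper): `overlaps` holds half-open
    (start, end) intervals, one per overlapping read, in read coordinates.
    """
    # Difference array: +1 where an overlap begins, -1 where it ends.
    diff = [0] * (read_length + 1)
    for start, end in overlaps:
        # Clip intervals that extend past the ends of the read.
        start = max(0, start)
        end = min(read_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    # The running sum turns the difference array into coverage per base.
    return list(accumulate(diff[:-1]))


signal = coverage_signal(10, [(0, 6), (4, 10), (5, 8)])
print(signal)  # [1, 1, 1, 1, 2, 3, 2, 2, 1, 1]
```

A signal like this could then be resampled to a fixed length and passed to a classifier; how the paper preprocesses the signal is not stated in the abstract.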