Milko stat seq_toulouse

Milko Krachunov2, Ivan Popov1, Valeria Simeonova2, Irena Avdjieva1, Paweł Szczęsny3, Urszula Zelenkiewicz3, Piotr Zelenkiewicz3, Dimitar Vassilev1

1 Bioinformatics group, AgroBioInstitute, Bulgaria

2 Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski”, Bulgaria

3 Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland

Detection and correction of errors in metagenomic 16S rRNA parallel sequencing

NGS errors – common problems

Errors are introduced into the assembled reads by imperfections of both biological and mathematical origin;

The same sample cannot be re-sequenced in metagenomic studies;

The error rate tends to increase at every step of the process;

No easy way to differentiate between “sequencing error” and “rare variant”;

Many existing methods and algorithms address different aspects of the problem, but no unified solution is available;

Large amounts of data are difficult to process with common software.

Significance of 16S rRNA sequencing

Highly conserved between different species of bacteria and archaea;

Sequence analysis is done with universal PCR primers;

Contains hypervariable regions that can provide species-specific signature sequences;

Suitable for phylogenetic studies;

Suitable for metagenomic studies.

General approach in metagenomic biodiversity studies

454 Sequencing

Filtering / Denoising

Multiple alignment

Distance matrix

OTU clusters with abundance counts
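
The last two steps of this pipeline (distance matrix, then OTU clusters with abundance counts) can be illustrated with a minimal Python sketch. It is not the tooling used in this project: the p-distance measure, the greedy seed-based clustering, the function names and the default cut-off of 0.05 are all assumptions chosen for illustration, and the reads are assumed to be already aligned to equal length with "-" marking gaps.

```python
from typing import List, Tuple


def p_distance(a: str, b: str) -> float:
    """Fraction of mismatching columns between two aligned reads.

    Columns where either read has a gap ("-") are skipped.
    """
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            mismatches += 1
    # Reads with no comparable column are treated as maximally distant.
    return mismatches / compared if compared else 1.0


def distance_matrix(aligned: List[str]) -> List[List[float]]:
    """Symmetric matrix of pairwise p-distances."""
    n = len(aligned)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = p_distance(aligned[i], aligned[j])
    return d


def greedy_otus(aligned: List[str], cutoff: float = 0.05) -> List[Tuple[int, int]]:
    """Greedy clustering (an assumed, simplified scheme).

    Each read joins the first OTU whose seed read lies within `cutoff`;
    otherwise it becomes the seed of a new OTU.
    Returns (seed read index, abundance count) pairs.
    """
    d = distance_matrix(aligned)
    seeds: List[int] = []
    counts: List[int] = []
    for i in range(len(aligned)):
        for k, seed in enumerate(seeds):
            if d[i][seed] <= cutoff:
                counts[k] += 1
                break
        else:
            seeds.append(i)
            counts.append(1)
    return list(zip(seeds, counts))
```

Because every uncorrected sequencing error raises pairwise distances, reads from one organism can be split into separate OTUs, which is why filtering and denoising come before this stage.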

Our approach:

A. Raw data characteristics and processing

Two separate runs of metagenomic 16S rRNA fragments, sequenced on the 454 platform and converted to FASTA format:

run 02 – 46429 short reads

run 04 – 41386 short reads

Our task – extract, denoise and correct only the high-quality reads.

Raw data length histograms for Run 02 and Run 04 (figures not reproduced)
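
A read-length histogram of this kind can be computed directly from the FASTA files. The Python sketch below is only an illustration: the function names and the bin width of 50 bases are assumptions, not the settings used for the original plots.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_histogram(lines: Iterable[str], bin_width: int = 50) -> Dict[int, int]:
    """Count reads per length bin; keys are the lower edge of each bin."""
    counts: Counter = Counter()
    for _, seq in read_fasta(lines):
        counts[(len(seq) // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))
```

An open file handle can be passed as `lines`, for example `length_histogram(open("run02.fasta"))`, where the file name is hypothetical.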

B. Correction with SHREC

C. Correction with our method:

Classification and performance evaluation

ClaMS parameters:

Distance cut-off: 0.05

Signature type: DBC

k-mer length: 3

Existing taxonomy: 4th Level

Aim of the method – idea outline

To deal with the heterogeneous nature of the data, similar or related sequences are given more importance in the error evaluation

The naïve approach: If a base is less common than the sequencer error rate, assume it is likely an error and replace it with the most common base
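
The naïve approach can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the reads are assumed to be aligned to equal length with "-" for gaps, and the function name and the default error rate of 0.01 are placeholders.

```python
from collections import Counter
from typing import List


def naive_correct(aligned: List[str], error_rate: float = 0.01) -> List[str]:
    """Column-wise correction over ALL reads.

    In each alignment column, a base whose relative frequency is below
    `error_rate` is replaced by the most common base in that column.
    Gaps are neither counted nor modified.
    """
    if not aligned:
        return []
    corrected = [list(read) for read in aligned]
    for col in range(len(aligned[0])):
        counts = Counter(read[col] for read in aligned if read[col] != "-")
        total = sum(counts.values())
        if total == 0:
            continue  # column contains only gaps
        consensus = counts.most_common(1)[0][0]
        for row in corrected:
            base = row[col]
            if base != "-" and counts[base] / total < error_rate:
                row[col] = consensus
    return ["".join(row) for row in corrected]
```

In a mixed community the weakness is visible immediately: a base that is an error in one organism's reads may be the correct base in another organism, so a global column frequency can both miss real errors and erase rare variants.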

Our modification: Calculate the occurrence of the base only among reads that are similar in the given region – give those reads larger weights, or use them exclusively

Progress so far

Calculate occurrence rates of every base in reads that are identical to the evaluated read within a window of radius n bases
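
A minimal Python sketch of this window-based variant is given below. It is an interpretation rather than the project's code: the reads are assumed to be aligned to equal length with "-" for gaps, the evaluated position itself is assumed to be excluded from the identity test, only identical reads are used (the "exclusive" option, with no weighting), and the function name and default values are hypothetical.

```python
from collections import Counter
from typing import List


def window_correct(aligned: List[str], radius: int = 3,
                   error_rate: float = 0.01) -> List[str]:
    """Correct each base using only reads identical in the surrounding window.

    For a read and a column, the comparison set is every read whose bases
    match the evaluated read in the `radius` columns on each side (the
    column itself is excluded from the match). A base rarer than
    `error_rate` within that set is replaced by the set's most common base.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    result = []
    for read in aligned:
        new = list(read)
        for col in range(length):
            base = read[col]
            if base == "-":
                continue
            lo, hi = max(0, col - radius), min(length, col + radius + 1)
            left, right = read[lo:col], read[col + 1:hi]
            counts = Counter(
                other[col] for other in aligned
                if other[col] != "-"
                and other[lo:col] == left
                and other[col + 1:hi] == right
            )
            # The evaluated read always matches itself, so total >= 1.
            total = sum(counts.values())
            if counts[base] / total < error_rate:
                new[col] = counts.most_common(1)[0][0]
        result.append("".join(new))
    return result
```

With nine copies of `ACGTACGTA`, one read `ACGTTCGTA` and five reads `TTGCTGCAA`, calling `window_correct(reads, radius=2, error_rate=0.2)` restores the odd read to `ACGTACGTA` and leaves the other group untouched, because the `T` is rare only among reads sharing its flanking bases. This sketch compares every read with every other read at each column, so it is quadratic in the number of reads and would need indexing to scale to tens of thousands of reads.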

Preliminary results: The first basic implementation leads to an increase in the number of OTUs found with ClaMS

Under development

Choosing a suitable approach (or approaches) for aligning the reads

Empirical evaluation of the parameters

Comparative evaluation of the variants of the approach

Software used in this project:

Python: http://www.python.org/

Cython: http://cython.org/

MEGA (Molecular Evolutionary Genetics Analysis): http://www.megasoftware.net/

Muscle: http://www.drive5.com/muscle/

SHREC (SHort Read Error Correction method): http://ww2.cs.mu.oz.au/~schroder/shrec_www/

ClaMS (Classifier for Metagenomic Sequences): http://clams.jgi-psf.org/

NINJA (modified): http://nimbletwist.com/software/ninja/index.html

R: http://www.r-project.org/

Thank you
