Milko stat seq_toulouse


Transcript of Milko stat seq_toulouse

Page 1: Milko stat seq_toulouse

Milko Krachunov (2), Ivan Popov (1), Valeria Simeonova (2), Irena Avdjieva (1), Paweł Szczęsny (3), Urszula Zelenkiewicz (3), Piotr Zelenkiewicz (3), Dimitar Vassilev (1)

(1) Bioinformatics group, AgroBioInstitute, Bulgaria

(2) Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski”, Bulgaria

(3) Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland

Detection and correction of errors in metagenomic 16S RNA parallel sequencing

Page 2: Milko stat seq_toulouse

NGS errors – common problems

Errors are introduced into the assembled reads by imperfections of both biological and mathematical origin;

The same sample cannot be re-sequenced in metagenomic studies;

The error rate tends to increase at every step of the process;

No easy way to differentiate between “sequencing error” and “rare variant”;

Many existing methods and algorithms address different aspects of the problem, but no unified solution is available;

Large amounts of data are difficult to process with common software.

Page 3: Milko stat seq_toulouse

Significance of 16S rRNA sequencing

Highly conserved between different species of bacteria and archaea;

Sequence analysis is done with universal PCR primers;

Contains hypervariable regions that can provide species-specific signature sequences;

Suitable for phylogenetic studies;

Suitable for metagenomic studies.

Page 4: Milko stat seq_toulouse

General approach in metagenomic biodiversity studies

454 Sequencing

Filtering / Denoising

Multiple alignment

Distance matrix

OTU clusters with abundance count
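The last two steps above (distance matrix, OTU clusters with abundance count) can be illustrated with a minimal Python sketch. It is not the pipeline used in this project: the p-distance measure, the greedy seed-based clustering, the function names and the 0.05 default cut-off are all assumptions chosen for illustration, and the reads are assumed to be already aligned to equal length with "-" for gaps.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of mismatches over columns where neither read has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # no comparable columns: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Symmetric matrix of pairwise p-distances between aligned reads."""
    n = len(aligned)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = p_distance(aligned[i], aligned[j])
    return d


def greedy_otus(aligned, cutoff=0.05):
    """Put each read into the first cluster whose seed is within `cutoff`.

    Returns a list of clusters, each a list of read indices; the length of
    a cluster is its abundance count. (Hypothetical helper, illustration only.)
    """
    d = distance_matrix(aligned)
    seeds, clusters = [], []
    for i in range(len(aligned)):
        for k, seed in enumerate(seeds):
            if d[i][seed] <= cutoff:
                clusters[k].append(i)
                break
        else:
            # no existing seed is close enough: start a new OTU
            seeds.append(i)
            clusters.append([i])
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
    for cluster in greedy_otus(reads, cutoff=0.05):
        print(len(cluster), cluster)
```

Greedy seeding depends on the input order of the reads; hierarchical clustering on the same distance matrix would be an equally valid choice.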

Page 5: Milko stat seq_toulouse

Our approach:

Page 6: Milko stat seq_toulouse

A. Raw data characteristics and processing

Two separate runs of metagenomic 16S rRNA fragments, sequenced on the 454 platform and converted to FASTA format:

run 02 – 46429 short reads

run 04 – 41386 short reads

Our task – extract, denoise and correct only the quality reads.

Page 7: Milko stat seq_toulouse

Raw data length histogram

Run 02 Run 04
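A read length histogram of this kind can be computed directly from the FASTA files. The sketch below is only an illustration under stated assumptions: the function names and the bin width of 50 bases are hypothetical, and the demo input is a made-up in-memory FASTA string rather than the actual run data.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_histogram(handle, bin_width=50):
    """Count reads per length bin; keys are the lower edge of each bin."""
    counts = Counter()
    for _, seq in read_fasta(handle):
        counts[(len(seq) // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up example input; a real run would be opened with open(path).
    demo = io.StringIO(">read1\nACGTACGT\n>read2\n" + "A" * 120 + "\n")
    print(length_histogram(demo))
```

The same per-bin counts can be used to choose a length filter before denoising.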

Page 8: Milko stat seq_toulouse

B. Correction with SHREC

Page 9: Milko stat seq_toulouse

C. Correction with our method:

Page 10: Milko stat seq_toulouse

Classification and performance evaluation

ClaMS parameters:

Distance cut-off: 0.05

Signature type: DBC

k-mer length: 3

Existing taxonomy: 4th Level

Page 11: Milko stat seq_toulouse

Aim of the method – idea outline

To deal with the heterogeneous nature of the data, similar or related sequences are considered more important in the error evaluation

The naïve approach: If a base occurs at a position less often than the sequencer error rate, assume it is likely an error and replace it with the most common base

Our modification: Calculate the occurrence of the base among reads that are similar in the given region, either assigning them larger weights or using them exclusively
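The naïve approach can be written down in a few lines. This is a minimal sketch, not the project's implementation: it assumes the reads are already aligned to equal length, treats "-" as a gap that is neither counted nor corrected, and uses an illustrative error rate of 0.01; the function name is hypothetical.

```python
from collections import Counter


def naive_correct(aligned, error_rate=0.01):
    """Replace bases rarer than `error_rate` in their column with the
    most common base of that column (gaps are ignored)."""
    corrected = [list(read) for read in aligned]
    for col in range(len(aligned[0])):
        counts = Counter(read[col] for read in aligned if read[col] != "-")
        total = sum(counts.values())
        if total == 0:
            continue  # column contains only gaps
        consensus = counts.most_common(1)[0][0]
        for row in corrected:
            base = row[col]
            if base != "-" and counts[base] / total < error_rate:
                row[col] = consensus
    return ["".join(row) for row in corrected]


if __name__ == "__main__":
    reads = ["ACGT"] * 200 + ["ACTT"]
    print(naive_correct(reads)[-1])
```

Because every column is judged over all reads, a genuinely rare organism is treated exactly like a sequencing error, which is the weakness the modification targets.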

Page 12: Milko stat seq_toulouse

Progress so far

Calculate the occurrence rates of every base among reads that are identical to the evaluated read within a window with a radius of n bases

Preliminary results: The first basic implementation leads to an increase in the number of OTUs found with ClaMS
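The windowed occurrence rates can be sketched as follows. This is an illustration under several assumptions, not the authors' code: the reads are pre-aligned to equal length, the evaluated position itself is excluded when testing window identity, only reads identical in the window are used (the "exclusive" variant, no weighting), and the function name, the radius and the error rate are hypothetical.

```python
from collections import Counter


def windowed_correct(aligned, radius=2, error_rate=0.01):
    """Correct a base only if it is rare among reads that match the
    evaluated read in the surrounding window of `radius` bases."""
    length = len(aligned[0])
    corrected = []
    for read in aligned:
        new = list(read)
        for pos in range(length):
            if read[pos] == "-":
                continue  # gaps are not evaluated
            lo, hi = max(0, pos - radius), min(length, pos + radius + 1)
            left, right = read[lo:pos], read[pos + 1:hi]
            # Bases at `pos` in reads identical to `read` around `pos`;
            # the read itself always matches, so the total is at least 1.
            counts = Counter(
                other[pos]
                for other in aligned
                if other[lo:pos] == left
                and other[pos + 1:hi] == right
                and other[pos] != "-"
            )
            total = sum(counts.values())
            if counts[read[pos]] / total < error_rate:
                new[pos] = counts.most_common(1)[0][0]
        corrected.append("".join(new))
    return corrected


if __name__ == "__main__":
    # 300 copies of an abundant read, one copy with a single error,
    # and two copies of a rare but self-consistent variant.
    reads = ["ACGTACG"] * 300 + ["ACGAACG"] + ["TGCATGC"] * 2
    result = windowed_correct(reads)
    print(result[300], result[301])
```

The sketch compares every read with every other read at every position, so it is quadratic in the number of reads; indexing reads by their window content would be needed for full-size runs.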

Under development

Good choice(s) of approach for alignment of the reads

Empirical evaluation of the parameters

Comparative evaluation of the variants of the approach

Page 13: Milko stat seq_toulouse

Software used in this project:

Python: http://www.python.org/

Cython: http://cython.org/

MEGA (Molecular Evolutionary Genetics Analysis): http://www.megasoftware.net/

Muscle: http://www.drive5.com/muscle/

SHREC (SHort Read Error Correction method): http://ww2.cs.mu.oz.au/~schroder/shrec_www/

ClaMS (Classifier for Metagenomic Sequences): http://clams.jgi-psf.org/

NINJA (modified): http://nimbletwist.com/software/ninja/index.html

R-package: http://www.r-project.org/

Page 14: Milko stat seq_toulouse

[email protected]

Thank you