Milko stat seq_toulouse
Milko Krachunov (2), Ivan Popov (1), Valeria Simeonova (2), Irena Avdjieva (1), Paweł Szczęsny (3), Urszula Zelenkiewicz (3), Piotr Zelenkiewicz (3), Dimitar Vassilev (1)
(1) Bioinformatics group, AgroBioInstitute, Bulgaria
(2) Faculty of Mathematics and Informatics, Sofia University “St. Kliment Ohridski”, Bulgaria
(3) Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, Poland
Detection and correction of errors in metagenomic 16S RNA parallel sequencing
NGS errors – common problems
Errors are introduced into the assembled reads by imperfections of both biological and mathematical origin;
In metagenomic studies it is impossible to re-sequence the same sample;
The error rate tends to increase with every step of the process;
There is no easy way to differentiate between a “sequencing error” and a “rare variant”;
Many existing methods and algorithms address different aspects of the problem, but no unified solution is available;
Large amounts of data are difficult to process with common software.
Significance of 16S RNA sequencing
Highly conserved between different species of bacteria and archaea;
Sequence analysis is done with universal PCR primers;
Contains hypervariable regions that can provide species-specific signature sequences;
Suitable for phylogenetic studies;
Suitable for metagenomic studies.
General approach in metagenomic biodiversity studies
454 Sequencing
Filtering / Denoising
Multiple alignment
Distance matrix
OTU clusters with abundance count
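The last two steps of the pipeline above (distance matrix, then OTU clusters with abundance counts) can be illustrated with a minimal Python sketch. It is not the software used in this project: the p-distance measure, the single-linkage rule and the function names `p_distance`, `distance_matrix` and `otu_clusters` are assumptions chosen for illustration, and the reads are assumed to be already aligned to equal length with `-` as the gap character.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing columns between two aligned reads.

    Columns where both reads have a gap are ignored (an assumption).
    """
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Symmetric matrix of pairwise p-distances."""
    n = len(aligned)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = p_distance(aligned[i], aligned[j])
    return dist


def otu_clusters(dist, cutoff):
    """Single-linkage clustering: reads closer than `cutoff` share an OTU.

    Returns lists of read indices, largest cluster first; the length of
    each list is the abundance count of that OTU.
    """
    parent = list(range(len(dist)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(dist)), 2):
        if dist[i][j] <= cutoff:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(dist)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=len, reverse=True)
```

Single linkage is only one possible clustering rule; the cut-off passed to `otu_clusters` is a free parameter of this sketch and is not taken from the slides.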
Our approach:
A. Raw data characteristics and processing
Two separate runs of metagenomic 16S RNA fragments, sequenced on the 454 platform and converted to FASTA format:
run 02: 46429 short reads
run 04: 41386 short reads
Our task: extract, denoise and correct only the high-quality reads.
Raw data length histogram (figure; panels: Run 02, Run 04)
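A read-length histogram of this kind can be computed directly from the FASTA files. The following is a minimal sketch in plain Python, not the script used for the slides; the function names `read_fasta` and `length_histogram` and the bin width are assumptions made for illustration.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_histogram(records, bin_width=50):
    """Count reads per length bin; keys are the lower bounds of the bins."""
    counts = Counter((len(seq) // bin_width) * bin_width for _, seq in records)
    return dict(sorted(counts.items()))
```

With a real run, the lines would come from an open file handle, for example `length_histogram(read_fasta(open(path)))`, where `path` is a placeholder for the FASTA file of one run.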
B. Correction with SHREC
C. Correction with our method:
Classification and performance evaluation
ClaMS parameters:
Distance cut-off: 0.05
Signature type: DBC
k-mer length: 3
Existing taxonomy: 4th Level
Aim of the method – idea outline
To deal with the heterogeneous nature of the data, similar or related sequences are given more weight in the error evaluation
The naïve approach: If a base occurs at a position less often than the sequencer error rate, assume it is likely an error and replace it with the most common base at that position
Our modification: Calculate the occurrence of the base only among reads that are similar in the given region; assign those reads larger weights or use them exclusively
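The naïve approach described above can be written down in a few lines. This is a minimal sketch under stated assumptions, not the project's implementation: the reads are assumed to be aligned to equal length with `-` as the gap character, gaps are left untouched, and the name `naive_correct` and its parameters are hypothetical.

```python
from collections import Counter


def naive_correct(aligned_reads, error_rate):
    """Replace bases rarer than `error_rate` with the column's most common base."""
    corrected = [list(read) for read in aligned_reads]
    for pos, column in enumerate(zip(*aligned_reads)):
        # Base counts in this alignment column, gaps excluded.
        counts = Counter(base for base in column if base != "-")
        total = sum(counts.values())
        if total == 0:
            continue
        consensus = counts.most_common(1)[0][0]
        for idx, base in enumerate(column):
            if base != "-" and counts[base] / total < error_rate:
                corrected[idx][pos] = consensus
    return ["".join(read) for read in corrected]
```

Because the frequencies are taken over all reads in the column, a genuine variant from a low-abundance organism is treated exactly like a sequencing error, which is the weakness the modification is meant to address.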
Progress so far
Calculate the occurrence rate of every base among the reads that are identical to the evaluated read within a window with a radius of n bases around the evaluated position
Preliminary results: The first basic implementation leads to an increase in the number of OTUs found with ClaMS
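The window-based step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the reads are assumed to be aligned to equal length, "identical in a window" is taken to mean an exact match of the `radius` bases on each side of the evaluated position (the position itself excluded), gaps are not corrected, and the name `window_correct` is hypothetical.

```python
from collections import Counter


def window_correct(aligned_reads, error_rate, radius):
    """Correct each base using only reads that match its surrounding window."""
    corrected = []
    for read in aligned_reads:
        new_read = list(read)
        for pos, base in enumerate(read):
            if base == "-":
                continue
            lo = max(0, pos - radius)
            hi = min(len(read), pos + radius + 1)
            left, right = read[lo:pos], read[pos + 1:hi]
            # Count bases at `pos` among reads identical to this read in the
            # window; the read itself always matches, so the total is >= 1.
            counts = Counter(
                other[pos]
                for other in aligned_reads
                if other[pos] != "-"
                and other[lo:pos] == left
                and other[pos + 1:hi] == right
            )
            total = sum(counts.values())
            if counts[base] / total < error_rate:
                new_read[pos] = counts.most_common(1)[0][0]
        corrected.append("".join(new_read))
    return corrected
```

All frequencies are computed from the original reads, so a correction made in one read does not influence the evaluation of another. In a toy input with 40 copies of one sequence, one copy carrying a single substitution and 3 copies of an unrelated sequence, this sketch repairs the substitution while leaving the low-abundance reads unchanged, since they never share a window with the majority.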
Under development
Choosing a suitable approach (or approaches) for aligning the reads
Empirical evaluation of the parameters
Comparative evaluation of the variants of the approach
Software used in this project:
Python: http://www.python.org/
Cython: http://cython.org/
MEGA (Molecular Evolutionary Genetics Analysis): http://www.megasoftware.net/
Muscle: http://www.drive5.com/muscle/
SHREC (SHort Read Error Correction method): http://ww2.cs.mu.oz.au/~schroder/shrec_www/
ClaMS (Classifier for Metagenomic Sequences): http://clams.jgi-psf.org/
NINJA (modified): http://nimbletwist.com/software/ninja/index.html
R: http://www.r-project.org/
Thank you