Conserved introns reveal novel transcripts in eukaryotic genomes

Michael Hiller (1,2), Sven Findeiß (3,4), Sandro Lein (5), Manja Marz (3), Claudia Nickel (5), Dominic Rose (3), Christine Schulz (6), Rolf Backofen (1), Sonja J. Prohaska (3,4,7), Gunter Reuter (5) and Peter F. Stadler (3,4,6,7,8)

1) Bioinformatics Group, Albert-Ludwigs-University Freiburg, Germany
2) Department of Developmental Biology, Stanford University, USA
3) Bioinformatics Group, University of Leipzig, Germany
4) Interdisciplinary Center of Bioinformatics, University of Leipzig, Germany
5) Institute of Genetics, Martin Luther University Halle-Wittenberg, Germany
6) RNomics Group, Fraunhofer Institut für Zelltherapie und Immunologie, Germany
7) Institut für Theoretische Chemie und Molekulare Strukturbiologie, University of Vienna, Austria
8) Santa Fe Institute, Santa Fe, USA

email: [email protected] – www: http://www.bioinf.uni-leipzig.de

1. Introduction & Outline

Introduction

• Most of a eukaryotic genome is transcribed, producing large numbers of non-coding RNAs (ncRNAs), a heterogeneous class of essential transcripts that exert their function at the RNA level without ever being translated into protein.

• A subclass of them is, like mRNAs, spliced, capped, and polyadenylated and is therefore called messenger-like non-coding RNAs (mlncRNAs); examples are Xist and H19 (gene regulators).

• In contrast to protein-coding genes, finding ncRNA genes from sequence data alone is a challenging problem: there are no start/stop codons, no discernible open reading frames, poor sequence conservation, and, for long ncRNAs, usually not even structure conservation.

Outline

• Novel genome-wide comparative genomics approach.

• Search for conserved introns in eukaryotic genomes.

• Capable of identifying novel transcripts/genes.

• Idea: Gene-finding based on intron prediction.

• Intron detection makes it possible to

– extend or revise existing annotation.

– identify novel protein-coding genes.

– identify novel mlncRNAs.

2. Overview

The idea

A functional pair of donor (5') and acceptor (3') splice sites will be retained over long evolutionary time scales only if

• the locus is transcribed into a functional transcript and

• accurate intron removal is necessary to produce a functional transcript.

The data

Two screens; all input data are available from the UCSC Genome Browser:

• 15 insects, already published, see [1] (12 Drosophila genomes, mosquito, beetle, honeybee)

• 44 vertebrates (human → teleosts, lamprey)

The plan

Insects: focus on short conserved introns (40-81 nt)

• Apply intronscan (preliminary filter) → build alignments → evaluate characteristic intron evolution → train support vector machine (SVM) → classify candidate set of novel introns

Vertebrates: focus on general independent splice-site prediction (they have only few short introns)

• Apply MaxEntScan (preliminary filter) → compile set of real (positive) and "pseudo" (false) donor/acceptor splice sites → evaluate characteristic splice-site evolution → train SVM → classify candidate set of novel splice sites
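To make the "preliminary filter" step of the insect screen concrete, here is a minimal toy sketch of candidate enumeration. It is not intronscan and does not reproduce its scoring; it only assumes that a candidate starts with GT, ends with AG, and has a length in the 40-81 nt range named above. The function name and the half-open coordinate convention are illustrative choices.

```python
import re


def scan_candidate_introns(seq, min_len=40, max_len=81):
    """Toy stand-in for a preliminary intron filter (assumption, not intronscan).

    Returns half-open (start, end) intervals on the given strand that begin
    with GT, end with AG, and whose length lies in [min_len, max_len].
    """
    seq = seq.upper()
    # Lookahead patterns so that overlapping dinucleotides are all reported.
    donor_starts = [m.start() for m in re.finditer(r"(?=GT)", seq)]
    acceptor_ends = [m.start() + 2 for m in re.finditer(r"(?=AG)", seq)]
    hits = []
    for d in donor_starts:
        for a in acceptor_ends:
            if min_len <= a - d <= max_len:
                hits.append((d, a))
    return hits
```

A real filter would additionally score each splice site with a trained model; this sketch only shows how the length window restricts the candidate set before alignment and classification.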

3. Insects – Methods

A) Predict introns in individual insect genomes using intronscan: 1,398,939 predicted introns for D. melanogaster.

B) Retain orthologous intronscan predictions (D.ere, D.mel, D.moj + 12 insects; + strand and − strand introns): 498,231 predictions with orthologs.

C) Evaluate characteristic intron evolution from the distributions of positive and negative training samples: donor score variation, acceptor score variation, intron length variation, splice-site substitution scores, conservation scores.

• Train an SVM with these 5 discriminative features.

• Apply to 342,785 predictions that overlap no protein-coding gene: 369 conserved introns predicted.

[Figure: ROC curve of the independent test set (x-axis: False Positive Rate, y-axis: True Positive Rate), AUC = 0.983]

4. Insects – Splice-site evolution

Nucleotide frequencies differ at splice-site positions.

[Figure: differences in nucleotide (G, T, C, A) frequencies, in percent, compared to D.mel, at donor (5') splice-site positions +3 to +6 and acceptor (3') splice-site positions −7 to −3, for D.sim, D.sec, D.yak, D.ere, D.ana, D.pse, D.per, D.wil, D.moj, D.vir, D.gri, A.gam, T.cas and A.mel; less frequent ← | → more frequent (compared to D.mel)]

e.g. Apis prefers A over G (donor +3) and T over C (acceptor -3)

Learn log-odds substitution scores: $\forall x, y \in \{A, T, C, G\},\ x \neq y$:

$$\log_2\left(\frac{\mathrm{freq}_{pos}(x \to y)}{\mathrm{freq}_{neg}(x \to y)}\right)$$

→ substitution matrix
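The log-odds substitution matrix above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: substitutions are given as (reference base, other base) pairs, frequencies are relative to all substitutions in a set, and a pseudocount (an addition not mentioned on the poster) avoids division by zero for unseen substitutions.

```python
import math
from collections import Counter

NUCLEOTIDES = "ATCG"


def substitution_freqs(pairs, pseudocount=1.0):
    """Relative frequency of each substitution x->y (x != y).

    `pairs` are (reference_base, other_base) tuples; matches and gap
    characters are ignored. The pseudocount is an assumption of this sketch.
    """
    counts = Counter(
        (x, y) for x, y in pairs
        if x != y and x in NUCLEOTIDES and y in NUCLEOTIDES
    )
    keys = [(x, y) for x in NUCLEOTIDES for y in NUCLEOTIDES if x != y]
    total = sum(counts.values()) + pseudocount * len(keys)
    return {k: (counts[k] + pseudocount) / total for k in keys}


def log_odds_matrix(pos_pairs, neg_pairs, pseudocount=1.0):
    """log2(freq_pos(x->y) / freq_neg(x->y)) for all x != y."""
    f_pos = substitution_freqs(pos_pairs, pseudocount)
    f_neg = substitution_freqs(neg_pairs, pseudocount)
    return {k: math.log2(f_pos[k] / f_neg[k]) for k in f_pos}


def substitution_score_sum(pairs, matrix):
    """Sum of log-odds scores over all substitutions among the given pairs."""
    return sum(matrix[(x, y)] for x, y in pairs if (x, y) in matrix)
```

A positive sum means the observed substitutions are more typical of the positive (real intron) training set than of the negative set.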

[Figure: density plots of two features for the positive and negative training samples: the sum of substitution scores, and the average PhastCons conservation score for the region 8...20 and −20...−8. Two example alignments (panels A and B) are shown with their substitution scores and PhastCons conservation scores (0 to 1) over positions +8 to +20 and −20 to −8. Annotated values: average = 0.002 with sum = 31.6, and average = 0.92 with sum = −3.7. One example is classified as a real intron (SVM probability 0.999), the other as a false prediction (SVM probability 0.001).]
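The average conservation score for the region 8...20 and −20...−8 can be computed as in the following sketch. It assumes one score per intron base in 5' to 3' order, with position +1 being the first intron base and position −1 the last; this numbering convention and the function name are assumptions made for illustration.

```python
def flank_window_mean(scores, lo=8, hi=20):
    """Mean per-base score over intron positions +lo..+hi and -hi..-lo.

    `scores` holds one value per intron base (for example PhastCons scores),
    ordered 5' to 3'. Position +1 is the first base, -1 the last base
    (assumed convention).
    """
    n = len(scores)
    if n < 2 * hi:
        raise ValueError("intron too short for two non-overlapping windows")
    donor_window = list(scores[lo - 1:hi])              # positions +lo .. +hi
    acceptor_window = list(scores[n - hi:n - lo + 1])   # positions -hi .. -lo
    window = donor_window + acceptor_window
    return sum(window) / len(window)
```

With the default bounds, each window covers 13 bases, so the shortest intron considered in the insect screen (40 nt) still yields two non-overlapping windows.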

5. Insects – Results

[Figure: five UCSC Genome Browser views (panels A to E; regions on chr3R, chr2R, chr3L, chr3L and chrX) showing the predicted introns together with the tracks Conservation, FlyBase Protein-Coding Genes, FlyBase Noncoding Genes, D. melanogaster mRNAs from GenBank, and D. melanogaster ESTs That Have Been Spliced. Labeled features include CG14614, dally and pncr009:3L-RA.]

A) Predicted intron located in a 5' UTR (− strand). B) Predicted intron belonging to an antisense transcript of dally. C) EST-confirmed intron prediction. D) Predicted EST-confirmed intron revising the current FlyBase annotation. E) Clustered predictions at a putative novel protein-coding gene (blastx hits in several species).

• Area under the ROC curve: 0.983; at p > 0.95: 80% TP at 0.12% FP

• 369 predictions outside of known protein-coding genes (p > 0.95)

• 131 EST/FlyBase-transcript confirmed introns, 238 unconfirmed

• Discard novel protein-coding ones: 129 novel mlncRNAs
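For readers unfamiliar with the reported metrics, the sketch below shows how an area under the ROC curve and a true-positive/false-positive rate at a probability cutoff (such as p > 0.95) can be computed from classifier scores on a labeled test set. The function names and the toy scores in the usage note are illustrative assumptions; they are not the poster's data.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random
    negative; ties count as one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def rates_at_threshold(pos_scores, neg_scores, threshold):
    """(true positive rate, false positive rate) for predictions scoring
    strictly above `threshold`."""
    tpr = sum(s > threshold for s in pos_scores) / len(pos_scores)
    fpr = sum(s > threshold for s in neg_scores) / len(neg_scores)
    return tpr, fpr
```

For example, `rates_at_threshold(pos, neg, 0.95)` applied to SVM probabilities of known real and false introns yields the kind of "TP at FP" pair quoted above.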

6. Insects – Experimental verification

• RT-PCR, 5 different developmental stages of D. melanogaster: embryo, larva, pupa, male, female

• 18/29 (62%) experimentally validated: mlncRNAs: 7/12, introns in putative coding transcripts: 11/17

7. Vertebrates – Refinements

Meet increasing requirements

• Vertebrate introns ≠ insect introns (2% vs 54% short introns)

• Rather than predicting complete introns, we switch to individual splice-site prediction → new (SVM) features are needed to distinguish real from false splice sites. We propose:

(1) The human MaxEntScan splice-site score.

(2-4) Three log-odds substitution score variants s_tree, s_pair, s_median.

(5) The total number of species in an alignment.

(6) The total number of species with conserved GT/AG dinucleotides and a MaxEntScan score >= 0.

(7) The slope of a regression line fitted to the splice site's PhastCons sequence conservation profile over [-20,+20].

(8) The average GC content.

(9) The mean pairwise identity.
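Several of the alignment-derived features listed above can be illustrated with a short sketch. It covers the species count (5), a simplified version of (6) that checks only the conserved dinucleotide and omits the MaxEntScan score, the regression slope (7), the GC content (8), and the mean pairwise identity (9). All function names, the gap handling, and the column index argument are assumptions made for illustration.

```python
from itertools import combinations


def regression_slope(ys, xs=None):
    """Least-squares slope of ys against xs (default: 0, 1, 2, ...)."""
    if xs is None:
        xs = list(range(len(ys)))
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def gc_content(seq):
    """Fraction of G/C among the A/C/G/T characters of seq (gaps ignored)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    return sum(b in "GC" for b in bases) / len(bases) if bases else 0.0


def pairwise_identity(a, b):
    """Fraction of identical columns among columns without a gap."""
    cols = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    return sum(x == y for x, y in cols) / len(cols) if cols else 0.0


def alignment_features(rows, dinuc_start, dinuc="GT"):
    """Features of an alignment given as equal-length rows, one per species.

    `dinuc_start` is the alignment column where the splice-site dinucleotide
    begins (hypothetical parameter of this sketch).
    """
    seqs = [r.upper() for r in rows]
    pairs = list(combinations(seqs, 2))
    identity = (
        sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
    )
    return {
        "n_species": len(seqs),
        "n_conserved_dinuc": sum(s[dinuc_start:dinuc_start + 2] == dinuc for s in seqs),
        "mean_gc": sum(gc_content(s) for s in seqs) / len(seqs),
        "mean_pairwise_identity": identity,
    }
```

The slope in (7) would be obtained by passing the 41 PhastCons values of the [-20,+20] window to `regression_slope`.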

Improve log odd substitution scores

• Reconstruct ancestral sequences for each splice-site region using prequel and learn splice-site substitution patterns for each edge e of the 44-species tree:

$$s_{tree} = \sum_{e \in E} \log_2\left(\frac{f_{pos}(x \to y) \,/\, \sum_{n \in A} f_{pos}(x \to n)}{f_{neg}(x \to y) \,/\, \sum_{n \in A} f_{neg}(x \to n)}\right)$$
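The tree-based score can be sketched as follows for a single alignment column. The sketch assumes that E is the set of tree edges, that A is the nucleotide alphabet (including n = x, i.e. no change along the edge), that x and y are the ancestral and descendant bases observed on an edge, and that substitution counts per edge are available for the positive and negative training sets. The data layout and names are hypothetical, and all required counts are assumed to be non-zero.

```python
import math

ALPHABET = "ACGT"


def conditional_freq(counts, x, y):
    """f(x->y) / sum over n in ALPHABET of f(x->n), from counts keyed by (x, n)."""
    total = sum(counts.get((x, n), 0) for n in ALPHABET)
    return counts.get((x, y), 0) / total


def s_tree(edge_observations, pos_counts, neg_counts):
    """Sum over tree edges of the log2 ratio of conditional frequencies.

    edge_observations: {edge: (ancestral_base, descendant_base)}
    pos_counts, neg_counts: {edge: {(x, n): count}} learned from the positive
    and negative training sets (hypothetical layout).
    """
    score = 0.0
    for edge, (x, y) in edge_observations.items():
        p = conditional_freq(pos_counts[edge], x, y)
        q = conditional_freq(neg_counts[edge], x, y)
        score += math.log2(p / q)
    return score
```

Normalizing by the outgoing counts of x makes the score for an edge a ratio of conditional probabilities, so edges with different amounts of training data contribute on the same scale.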

8. Vertebrates – Results

• 2 models, AUC: ∼0.93 donor, ∼0.94 acceptor

• Intron candidates: arbitrarily defined as adjacent donor/acceptor pairs at a distance <= 5000 nt on the same strand

• chr21: 886 pairs (p > 0.9), 105 with typical PhastCons profile/basin, 16 manually chosen for experimental validation (ongoing work)
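The pairing rule for intron candidates can be sketched as below. The sketch interprets "adjacent" as a predicted donor directly followed, in the direction of transcription, by a predicted acceptor with no other predicted site in between; this interpretation, the tuple layout, and the function name are assumptions.

```python
def pair_adjacent_sites(sites, max_dist=5000):
    """Pair each donor with the directly following acceptor on the same strand.

    sites: iterable of (position, strand, kind) with strand '+' or '-' and
    kind 'donor' or 'acceptor'. Returns (donor_pos, acceptor_pos, strand)
    tuples whose distance is at most max_dist.
    """
    sites = list(sites)
    pairs = []
    for strand in "+-":
        # Walk each strand in its direction of transcription.
        ordered = sorted(
            (s for s in sites if s[1] == strand),
            key=lambda s: s[0],
            reverse=(strand == "-"),
        )
        for first, second in zip(ordered, ordered[1:]):
            if first[2] == "donor" and second[2] == "acceptor":
                if abs(second[0] - first[0]) <= max_dist:
                    pairs.append((first[0], second[0], strand))
    return pairs
```

On the − strand the donor has the larger coordinate, which is why that strand is traversed in descending order.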

[Figure: four UCSC Genome Browser views on chr22 showing predicted splice sites (tracks "chr22 ACC" and "chr22 DO") and "chr22 phast-filtered introns". (1) intron:chr22:18111410-18112940 (1 kb scale) with ENCODE Affymetrix/CSHL Subcellular RNA Localization by Tiling Array tracks for GM128 and K562 cells. (2) intron:chr22:22066024-22066128 (100-base scale) with CONTRAST, SGP, Geneid and Genscan gene predictions. (3, 4) Close-ups of both ends of intron:chr22:22659660-22661320 (5-base and 10-base scales) with ENCODE Gencode, Ensembl, Swiss Institute of Bioinformatics, SGP, Geneid and Genscan annotations, including ENST00000405781. All views include the Vertebrate Multiz Alignment & Conservation (44 Species) track, Primate, Placental Mammal and Vertebrate Conservation by PhastCons, and Repeating Elements by RepeatMasker.]

Acknowledgements

for contributing:

• Micha, Sven, Sandro, Manja, Claudia, Christine, Rolf, Sonja, Gunter and Peter

for funding:

• German Research Foundation (STA 850/7-1 and Hi 1423/2-1)

• Graduiertenkolleg Wissensrepräsentation of the University of Leipzig

• European Network of Excellence “The Epigenome”

• 6th Framework Programme of the European Union (SYNLET)

References

[1] M. Hiller, S. Findeiß, S. Lein, M. Marz, C. Nickel, D. Rose, C. Schulz, R. Backofen, S. J. Prohaska, G. Reuter, P. F. Stadler, Conserved introns reveal novel transcripts in Drosophila melanogaster, Genome Res. 19 (2009) 1289–1300.

Printed by Universitätsrechenzentrum Leipzig