Conserved introns reveal novel transcripts in eukaryotic genomes
Michael Hiller1,2, Sven Findeiß3,4, Sandro Lein5, Manja Marz3, Claudia Nickel5, Dominic Rose3,
Christine Schulz6, Rolf Backofen1, Sonja J. Prohaska3,4,7, Gunter Reuter5 and Peter F. Stadler3,4,6,7,8
1) Bioinformatics Group, Albert-Ludwigs-University Freiburg, Germany
2) Department of Developmental Biology, Stanford University, USA
3) Bioinformatics Group, University of Leipzig, Germany
4) Interdisciplinary Center of Bioinformatics, University of Leipzig, Germany
5) Institute of Genetics, Martin Luther University Halle-Wittenberg, Germany
6) RNomics Group, Fraunhofer Institut für Zelltherapie und Immunologie, Germany
7) Institut für Theoretische Chemie und Molekulare Strukturbiologie, University of Vienna, Austria
8) Santa Fe Institute, Santa Fe, USA
email: [email protected] – www: http://www.bioinf.uni-leipzig.de
1. Introduction & Outline
Introduction
• Most of a eukaryotic genome is transcribed, producing large numbers of non-coding RNAs (ncRNAs), a heterogeneous class of essential transcripts that exert their function at the RNA level without ever being translated into protein.
• A subclass of them is spliced, capped, and polyadenylated like mRNAs and is therefore called messenger-like non-coding RNAs (mlncRNAs). Examples: Xist, H19 (gene regulators).
• In contrast to protein-coding genes, finding ncRNA genes from sequence data alone is a challenging problem (no start/stop codons, no discernible open reading frames, poor sequence conservation and, for long ncRNAs, usually not even structure conservation).
Outline
• Novel genome-wide comparative genomics approach.
• Search for conserved introns in eukaryotic genomes.
• Capable of identifying novel transcripts/genes.
• Idea: gene finding based on intron prediction.
• Intron detection makes it possible to
– extend or revise existing annotation.
– identify novel protein-coding genes.
– identify novel mlncRNAs.
2. Overview
The idea
A functional pair of donor (5’) and acceptor (3’) splice sites will be retained over long evolutionary time scales only if
• the locus is transcribed into a functional transcript and
• accurate intron removal is necessary to produce a functional transcript.
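The retention criterion can be illustrated with a minimal sketch that measures how many orthologous copies of a candidate intron still carry the canonical GT...AG dinucleotides. The function names and the toy sequences are hypothetical and only illustrate the idea; this is not the classifier used in the screens.

```python
def has_canonical_sites(intron):
    """True if the (possibly gapped) intron starts with GT and ends with AG."""
    seq = intron.upper().replace("-", "")
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def conserved_fraction(orthologs):
    """Fraction of species whose orthologous intron keeps both splice sites.

    orthologs: dict mapping a species name to its intron sequence.
    """
    if not orthologs:
        return 0.0
    hits = sum(has_canonical_sites(seq) for seq in orthologs.values())
    return hits / len(orthologs)


# Toy example with made-up sequences (species labels are placeholders).
toy = {
    "species_a": "GTAAGATTTTTAG",
    "species_b": "GTGAGT--TTACAG",
    "species_c": "GCAAGTTTTCAA",
}
print(conserved_fraction(toy))
```

A candidate whose splice-site pair survives in most species is more likely to be functional, which is the signal the later SVM features try to quantify.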
The data
2 screens, all input data available from the UCSC Genome Browser:
• 15 insects, already published, see [1] (12 Drosophila genomes, mosquito, beetle, honeybee)
• 44 vertebrates (human → teleosts, lamprey)
The plan
Insects: focus on short conserved introns (40-81 nt)
• Apply intronscan (preliminary filter) → build alignments → evaluate characteristic intron evolution → train support vector machine (SVM) → classify candidate set of novel introns
Vertebrates: focus on general, independent splice-site prediction (they have only few short introns)
• Apply MaxEntScan (preliminary filter) → compile set of real (positive) and “pseudo” (false) donor/acceptor splice sites → evaluate characteristic splice-site evolution → train SVM → classify candidate set of novel splice sites
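To make the size of the candidate space concrete, the following sketch enumerates every GT...AG interval of 40-81 nt on one strand of a sequence. It is a naive stand-in and not intronscan or MaxEntScan: no splice-site scoring is done, the function name is hypothetical, and treating the length bounds as inclusive is an assumption.

```python
def candidate_short_introns(seq, min_len=40, max_len=81):
    """Return (start, end) half-open intervals that begin with GT, end with AG
    and have a length between min_len and max_len (inclusive, assumed)."""
    seq = seq.upper()
    candidates = []
    for start in range(len(seq) - min_len + 1):
        if seq[start:start + 2] != "GT":
            continue
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(seq):
                break
            if seq[end - 2:end] == "AG":
                candidates.append((start, end))
    return candidates


# Toy sequence: a single 40-nt GT...AG interval flanked by CC on both sides.
toy_seq = "CC" + "GT" + "A" * 36 + "AG" + "CC"
print(candidate_short_introns(toy_seq))
```

Because GT and AG are so frequent, such an unfiltered enumeration yields many false candidates, which is why a preliminary filter and a comparative classifier follow in the pipeline.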
3. Insects – Methods
[Figure: workflow of the insect screen]
A) Predict introns in individual insect genomes using intronscan: 1,398,939 predicted introns for D. melanogaster.
B) Retain orthologous intronscan predictions (D.mel, D.ere, D.moj + 12 insects): 498,231 predictions with orthologs.
C) Evaluate characteristic intron evolution: distributions of positive and negative training samples for donor score variation, acceptor score variation, intron length variation, splice-site substitution scores and conservation scores.
• Train an SVM with these 5 discriminative features. ROC curve of the independent test set (true positive rate vs. false positive rate): AUC = 0.983.
• Apply to 342,785 predictions that overlap no protein-coding gene: 369 conserved introns predicted.
4. Insects – Splice-site evolution
Nucleotide frequencies differ at splice-site positions.
[Figure: difference in nucleotide frequency (percent, scale −20 to +20) relative to D.mel at donor (5’) splice-site positions +3 to +6 and acceptor (3’) splice-site positions −7 to −3, for D.sim, D.sec, D.yak, D.ere, D.ana, D.pse, D.per, D.wil, D.moj, D.vir, D.gri, A.gam, T.cas and A.mel; rows G, T, C, A.]
Less frequent ← | → more frequent (compared to D.mel), e.g. Apis prefers A over G (donor +3) and T over C (acceptor −3).
Learn log-odds substitution scores: for all $x, y \in \{A, T, C, G\}$ with $x \neq y$,

$$\log_2\left(\frac{\mathrm{freq}_{pos}(x \to y)}{\mathrm{freq}_{neg}(x \to y)}\right)$$

→ substitution matrix
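A minimal sketch of how such a log-odds substitution matrix could be estimated from observed substitutions in positive (real) and negative (false) splice-site alignments. The pseudocount, the function names and the toy counts are assumptions added for illustration; the poster does not state how zero counts are handled.

```python
import math
from collections import Counter
from itertools import permutations

NUCS = "ACGT"


def substitution_freqs(pairs, pseudocount=1.0):
    """Relative frequency of each substitution x->y (x != y) among the pairs."""
    counts = Counter((x, y) for x, y in pairs
                     if x != y and x in NUCS and y in NUCS)
    keys = list(permutations(NUCS, 2))
    total = sum(counts[k] + pseudocount for k in keys)
    return {k: (counts[k] + pseudocount) / total for k in keys}


def log_odds_matrix(pos_pairs, neg_pairs, pseudocount=1.0):
    """log2(freq_pos(x->y) / freq_neg(x->y)) for every x != y."""
    fpos = substitution_freqs(pos_pairs, pseudocount)
    fneg = substitution_freqs(neg_pairs, pseudocount)
    return {k: math.log2(fpos[k] / fneg[k]) for k in fpos}


def sum_of_scores(pairs, matrix):
    """Sum of substitution scores over observed substitutions (matches skipped)."""
    return sum(matrix[(x, y)] for x, y in pairs if (x, y) in matrix)


# Toy data: G->A dominates the positive set, C->T the negative set.
matrix = log_odds_matrix([("G", "A")] * 10, [("C", "T")] * 10)
print(matrix[("G", "A")], matrix[("C", "T")])
```

Substitutions typical of real splice sites receive positive scores and those typical of false sites negative scores, so the sum over an alignment separates the two classes.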
[Figure: two example alignments of orthologous intron predictions across Drosophila species, one classified as real intron (SVM probability 0.999) and one classified as false prediction (SVM probability 0.001), shown against the distributions (density) of positive and negative training samples.]
A) Substitution scores: sum of substitution scores (annotated values: sum = 31.6 and sum = −3.7).
B) Conservation scores (PhastCons): average conservation scores for regions +8...+20 and −20...−8 (annotated values: average = 0.002 and average = 0.92).
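The conservation feature (average PhastCons score over regions 8...20 and −20...−8) could be computed as sketched below. This assumes that positions are counted within the intron, 1-based from the donor end for the positive range and from the acceptor end for the negative range; that reading of the figure labels, and the function name, are assumptions.

```python
def intron_flank_conservation(scores, lo=8, hi=20):
    """Average per-base conservation over intron positions +lo..+hi
    (counted from the 5' end) and -hi..-lo (counted from the 3' end).

    scores: per-base conservation values of one intron in 5'->3' order.
    """
    n = len(scores)
    if n < 2 * hi:
        raise ValueError("intron too short for two non-overlapping windows")
    donor_side = scores[lo - 1:hi]             # positions +8 .. +20
    acceptor_side = scores[n - hi:n - lo + 1]  # positions -20 .. -8
    window = list(donor_side) + list(acceptor_side)
    return sum(window) / len(window)


# Toy profile for a 40-nt intron: splice-site cores conserved, interior less so.
toy_profile = [1.0] * 7 + [0.5] * 26 + [1.0] * 7
print(intron_flank_conservation(toy_profile))
```

Skipping the first and last seven positions keeps the conserved splice-site consensus itself out of the average, so the feature reflects the conservation of the intron interior.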
5. Insects – Results
[Figure: UCSC Genome Browser views of five D. melanogaster loci (chr3R, chr2R, chr3L, chr3L, chrX; panels A to E) showing the predicted introns together with conservation, FlyBase protein-coding genes, FlyBase noncoding genes, D. melanogaster mRNAs from GenBank and spliced D. melanogaster ESTs. Annotated features include CG14614, dally and pncr009:3L-RA.]
A) Predicted intron located in a 5’UTR (− strand). B) Predicted intron belonging to an antisense transcript of dally. C) EST-confirmed intron prediction. D) Predicted EST-confirmed intron revising the current FlyBase annotation. E) Clustered predictions at a putative novel protein-coding gene (blastx hits in several species).
• Area under ROC curve: 0.983; at p > 0.95: 80% TP at 0.12% FP
• 369 predictions outside of known protein-coding genes (p > 0.95)
• 131 introns confirmed by ESTs/FlyBase transcripts, 238 unconfirmed
• After discarding novel protein-coding ones: 129 novel mlncRNAs
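For readers who want to reproduce this kind of evaluation, the sketch below computes the area under the ROC curve as the probability that a positive example scores higher than a negative one, plus the true/false positive rates at a probability cutoff. The scores are invented toy values, not the poster's data, and the function names are hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def rates_at_threshold(pos_scores, neg_scores, threshold=0.95):
    """True positive rate and false positive rate for scores > threshold."""
    tpr = sum(s > threshold for s in pos_scores) / len(pos_scores)
    fpr = sum(s > threshold for s in neg_scores) / len(neg_scores)
    return tpr, fpr


# Toy SVM probabilities for a held-out test set (made up).
toy_pos = [0.99, 0.98, 0.96, 0.50]
toy_neg = [0.01, 0.10, 0.60, 0.97]
print(roc_auc(toy_pos, toy_neg), rates_at_threshold(toy_pos, toy_neg))
```

A strict cutoff such as p > 0.95 trades sensitivity for a very low false positive rate, which matters when hundreds of thousands of candidates are classified.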
6. Insects – Exp. verification
• RT-PCR, 5 different developmental stages of D.mel: embryo, larva, pupa, male, female
• 18/29 (62%) experimentally validated: mlncRNAs: 7/12, introns in putative coding transcripts: 11/17
7. Vertebrates - Refinements
Meet increasing requirements
• Vertebrate introns ≠ insect introns (2% vs. 54% short introns)
• Rather than predicting complete introns, we switch to individual splice-site prediction → new (SVM) features are needed to distinguish real from false splice sites. We propose:
(1) The human MaxEntScan splice-site score.
(2-4) Three log-odds substitution score variants $s_{tree}$, $s_{pair}$, $s_{median}$.
(5) The total number of species in an alignment.
(6) The total number of species with conserved GT/AG dinucleotides and a MaxEntScan score >= 0.
(7) The slope of a regression line fitted to the splice site’s PhastCons sequence conservation profile over [-20,+20].
(8) The average GC content.
(9) The mean pairwise identity.
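Features (7) to (9) are simple alignment statistics. The sketch below shows one plausible way to compute them with the standard library only; the gap handling (columns where both sequences have a gap are skipped, a gap against a base counts as a mismatch) and all function names are assumptions, since the poster does not specify these details.

```python
from itertools import combinations


def regression_slope(ys, xs=None):
    """Least-squares slope of ys against xs (default: 0, 1, 2, ...)."""
    n = len(ys)
    if xs is None:
        xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def gc_content(seqs):
    """Fraction of G/C among all ungapped A/C/G/T characters."""
    bases = [c for s in seqs for c in s.upper() if c in "ACGT"]
    return sum(c in "GC" for c in bases) / len(bases)


def pairwise_identity(a, b):
    """Identity of two aligned rows, ignoring columns gapped in both."""
    cols = [(x, y) for x, y in zip(a.upper(), b.upper())
            if not (x == "-" and y == "-")]
    return sum(x == y for x, y in cols) / len(cols)


def mean_pairwise_identity(seqs):
    """Average identity over all pairs of aligned rows."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment and toy conservation profile (made up).
toy_rows = ["ACGT", "ACGA", "ACGT"]
print(regression_slope([0.0, 0.5, 1.0]), gc_content(toy_rows),
      mean_pairwise_identity(toy_rows))
```

For feature (7) the profile would be passed together with `xs=list(range(-20, 21))`; a real donor site is expected to show conservation falling from the exon into the intron, i.e. a clearly non-zero slope.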
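Improve log-odds substitution scores
• Reconstruct ancestral sequences for each splice-site region using prequel and learn splice-site substitution patterns for each edge e of the 44-species tree:

$$s_{tree} = \sum_{e \in E} \log_2\left(\frac{f_{pos}(x \to y) \,/\, \sum_{n \in A} f_{pos}(x \to n)}{f_{neg}(x \to y) \,/\, \sum_{n \in A} f_{neg}(x \to n)}\right)$$

with E the edges of the tree and A the nucleotide alphabet.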
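The tree-based score can be sketched as follows: for every tree edge, the observed parent-to-child nucleotide change x→y is scored by the ratio of its conditional frequency (normalised over all outcomes n of x) in the positive and in the negative training set, and the log-ratios are summed over edges. The data layout, the pseudocount and the function names are assumptions made for this illustration; ancestral states would come from a reconstruction tool such as prequel.

```python
import math

NUCS = "ACGT"


def conditional_freq(counts, x, y, pseudocount=1.0):
    """f(x->y) / sum_n f(x->n), estimated from a dict of (x, y) counts."""
    row_total = sum(counts.get((x, n), 0) + pseudocount for n in NUCS)
    return (counts.get((x, y), 0) + pseudocount) / row_total


def s_tree(observations, pos_counts, neg_counts, pseudocount=1.0):
    """Sum of per-edge log-odds.

    observations: iterable of (edge, x, y), the parent and child nucleotide
                  on each tree edge for the site being scored.
    pos_counts / neg_counts: dict edge -> {(x, y): count} learned from real
                  and false splice sites, respectively.
    """
    score = 0.0
    for edge, x, y in observations:
        p = conditional_freq(pos_counts[edge], x, y, pseudocount)
        q = conditional_freq(neg_counts[edge], x, y, pseudocount)
        score += math.log2(p / q)
    return score


# Toy model with two edges: G is preferentially kept in the positive set.
toy_pos = {"e1": {("G", "G"): 2}, "e2": {("G", "G"): 2}}
toy_neg = {"e1": {}, "e2": {}}
toy_obs = [("e1", "G", "G"), ("e2", "G", "G")]
print(s_tree(toy_obs, toy_pos, toy_neg))
```

Normalising per source nucleotide x means that conserved positions (x→x) also contribute, so the score rewards both retention and substitution patterns that are typical of real splice sites on each lineage.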
8. Vertebrates – Results
• 2 models, AUC: ∼0.93 (donor), ∼0.94 (acceptor)
• Intron candidates: arbitrarily defined as adjacent donor/acceptor pairs at a distance <= 5000 nt on the same strand
• chr21: 886 pairs (p > 0.9), 105 with a typical PhastCons profile/basin, 16 manually chosen for experimental validation (ongoing work)
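The pairing rule for intron candidates can be sketched as below. Here "adjacent" is interpreted as a donor immediately followed, in transcription direction, by an acceptor with no other predicted site in between; that interpretation, the tuple layout and the function name are assumptions.

```python
from collections import defaultdict


def pair_adjacent_sites(sites, max_dist=5000):
    """Pair each predicted donor with the directly following acceptor.

    sites: iterable of (chrom, strand, pos, kind), kind being "donor" or
           "acceptor". Returns (chrom, strand, start, end) with start <= end.
    """
    by_strand = defaultdict(list)
    for chrom, strand, pos, kind in sites:
        by_strand[(chrom, strand)].append((pos, kind))

    introns = []
    for (chrom, strand), items in sorted(by_strand.items()):
        # Walk in transcription direction: ascending on "+", descending on "-".
        items.sort(reverse=(strand == "-"))
        for (p1, k1), (p2, k2) in zip(items, items[1:]):
            if k1 == "donor" and k2 == "acceptor" and abs(p2 - p1) <= max_dist:
                introns.append((chrom, strand, min(p1, p2), max(p1, p2)))
    return introns


# Toy predictions (coordinates are made up).
toy_sites = [
    ("chr21", "+", 100, "donor"), ("chr21", "+", 1600, "acceptor"),
    ("chr21", "+", 9000, "donor"), ("chr21", "+", 20000, "acceptor"),
    ("chr21", "-", 5000, "donor"), ("chr21", "-", 4000, "acceptor"),
    ("chr21", "-", 3000, "donor"),
]
print(pair_adjacent_sites(toy_sites))
```

The 5000 nt limit is the arbitrary cutoff named above; longer introns are simply not proposed by this pairing step.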
[Figure: UCSC Genome Browser views of chr22 with the custom tracks chr22 ACC (acceptor predictions), chr22 DO (donor predictions) and chr22 phast-filtered introns, shown alongside ENCODE Affymetrix/CSHL Subcellular RNA Localization by Tiling Array (GM128 and K562 fractions), gene predictions (ENCODE Gencode, Ensembl, SIB, CONTRAST, SGP, Geneid, Genscan), Vertebrate Multiz Alignment & Conservation (44 Species), primate, placental mammal and vertebrate PhastCons conservation, and RepeatMasker. Displayed predictions: intron:chr22:18111410-18112940, intron:chr22:22066024-22066128 and intron:chr22:22659660-22661320 (both splice sites shown, with ENST00000405781).]
Acknowledgements
for contributing:
• Micha, Sven, Sandro, Manja, Claudia, Christine, Rolf, Sonja, Gunter and Peter
for funding:
• German Research Foundation (STA 850/7-1 and Hi 1423/2-1)
• Graduiertenkolleg Wissensrepräsentation of the University of Leipzig
• European Network of Excellence “The Epigenome”
• 6th Framework Programme of the European Union (SYNLET)
References
[1] M. Hiller, S. Findeiß, S. Lein, M. Marz, C. Nickel, D. Rose, C. Schulz, R. Backofen, S. J. Prohaska, G. Reuter, P. F. Stadler, Conserved introns reveal novel transcripts in Drosophila melanogaster, Genome Res. 19 (2009) 1289–1300.