Conserved introns reveal novel transcripts in eukaryotic genomes

Michael Hiller (1,2), Sven Findeiß (3,4), Sandro Lein (5), Manja Marz (3), Claudia Nickel (5), Dominic Rose (3), Christine Schulz (6), Rolf Backofen (1), Sonja J. Prohaska (3,4,7), Gunter Reuter (5) and Peter F. Stadler (3,4,6,7,8)

1) Bioinformatics Group, Albert-Ludwigs-University Freiburg, Germany
2) Department of Developmental Biology, Stanford University, USA
3) Bioinformatics Group, University of Leipzig, Germany
4) Interdisciplinary Center of Bioinformatics, University of Leipzig, Germany
5) Institute of Genetics, Martin Luther University Halle-Wittenberg, Germany
6) RNomics Group, Fraunhofer Institut für Zelltherapie und Immunologie, Germany
7) Institut für Theoretische Chemie und Molekulare Strukturbiologie, University of Vienna, Austria
8) Santa Fe Institute, Santa Fe, USA

www: http://www.bioinf.uni-leipzig.de

1. Introduction & Outline

Introduction

• Most of the eukaryotic genome is transcribed, producing large numbers of non-coding RNAs (ncRNAs), a heterogeneous class of essential transcripts that exert their function at the RNA level without ever being translated into protein.

• A subclass of them is spliced, capped, and polyadenylated like mRNAs and is therefore called messenger-like non-coding RNAs (mlncRNAs). Examples: Xist, H19 (gene regulators).

• In contrast to protein-coding genes, finding ncRNA genes from sequence data alone is a challenging problem: there are no start/stop codons, no discernible open reading frames, poor sequence conservation and, for long ncRNAs, usually not even structure conservation.

Outline

• Novel genome-wide comparative genomics approach.

• Search for conserved introns in eukaryotic genomes.

• Capable of identifying novel transcripts/genes.

• Idea: gene finding based on intron prediction.

• Intron detection makes it possible to

– extend or revise existing annotation.

– identify novel protein-coding genes.

– identify novel mlncRNAs.

2. Overview

The idea

A functional pair of donor (5') and acceptor (3') splice sites will be retained over long evolutionary time scales only if

• the locus is transcribed into a functional transcript and

• accurate intron removal is necessary to produce a functional transcript.
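The retention argument above can be illustrated with a minimal sketch that counts the species in an alignment whose intron still begins with GT and ends with AG. The input layout (a dict from species name to gapped sequence), the function name and the toy sequences are assumptions made for illustration; they are not taken from the poster.

```python
def species_with_canonical_sites(alignment):
    """Return species whose aligned intron starts with GT and ends with AG.

    `alignment` maps a species name to its aligned intron sequence,
    with gaps written as '-'. This layout is a hypothetical convention.
    """
    retained = []
    for species, row in alignment.items():
        seq = row.replace("-", "").upper()  # drop alignment gaps
        if len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG"):
            retained.append(species)
    return retained


# Toy alignment (invented sequences, not data from the screen).
toy = {
    "D.mel": "GTAAGAT-TATTTTTAG",
    "D.sim": "GTAAGCC--ATTTTTAG",
    "D.moj": "GTGAGTG-TAACCGCAG",
    "A.gam": "GAAAGTG-TAACCGCAG",  # donor GT lost in this toy row
}
print(species_with_canonical_sites(toy))
```

A locus where most species keep both dinucleotides is the kind of candidate the screens look for; the actual classification relies on the SVM features described in the later sections.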

The data

Two screens; all input data are available at the UCSC Genome Browser:

• 15 insects (12 Drosophila genomes, mosquito, beetle, honeybee), already published, see [1]

• 44 vertebrates (human → teleosts, lamprey)

The plan

Insects: focus on short conserved introns (40-81 nt)

• Apply intronscan (preliminary filter) → build alignments → evaluate characteristic intron evolution → train support vector machine (SVM) → classify candidate set of novel introns

Vertebrates: focus on general independent splice-site prediction (they have only few short introns)

• Apply MaxEntScan (preliminary filter) → compile set of real (positive) and "pseudo" (false) donor/acceptor splice sites → evaluate characteristic splice-site evolution → train SVM → classify candidate set of novel splice sites
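To make the first step of the insect plan concrete, here is a deliberately crude stand-in for a preliminary filter: it enumerates GT...AG spans of 40 to 81 nt on the plus strand of a sequence. It is not intronscan, which scores splice sites rather than just matching dinucleotides; the function name and coordinate convention are assumptions.

```python
def short_intron_candidates(seq, min_len=40, max_len=81):
    """Yield (start, end) of GT...AG spans with min_len <= length <= max_len.

    Coordinates are 0-based and end-exclusive, plus strand only.
    Only the canonical GT/AG dinucleotides are checked (an assumption).
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    for start in donors:
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(seq):
                break  # longer spans would run past the sequence
            if seq[end - 2:end] == "AG":
                yield (start, end)


# Toy sequence: a 44-nt GT...AG span flanked by two bases on each side.
toy_seq = "CC" + "GT" + "A" * 40 + "AG" + "CC"
print(list(short_intron_candidates(toy_seq)))
```

In a real screen such raw candidates would be far too numerous, which is why the plan continues with alignments, evolutionary features and an SVM.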

3. Insects – Methods

[Figure: workflow of the insect screen]

A) Predict introns in individual insect genomes using intronscan: 1,398,939 predicted introns for D. melanogaster.

B) Retain orthologous intronscan predictions (D.mel, D.ere, D.moj + 12 insects; + strand and − strand introns): 498,231 predictions with orthologs.

C) Evaluate characteristic intron evolution. Distributions of positive and negative training samples are shown for five features: donor score variation, acceptor score variation, intron length variation, conservation scores, and splice-site substitution scores.

Train an SVM with these 5 discriminative features and apply it to the 342,785 predictions that overlap no protein-coding gene: 369 conserved introns predicted.

[Figure: ROC curve of the independent test set (true positive rate vs. false positive rate), AUC = 0.983]
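For readers unfamiliar with the AUC value reported for the ROC curve, the following sketch computes it directly from classifier scores as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties count one half). The scores in the example are invented and the function name is an assumption.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equivalent to the normalized Mann-Whitney U statistic; quadratic
    in the number of samples, which is fine for a small illustration.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Invented SVM probabilities for three positive and three negative samples.
auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
print(round(auc, 3))
```

An AUC of 1 means every positive outranks every negative; 0.5 corresponds to random ranking.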

4. Insects – Splice-site evolution

Nucleotide frequencies differ at splice-site positions.

[Figure: percent change in the frequency of A, C, G and T (scale −20 to +20 percent) at donor (5') splice-site positions +3 to +6 and acceptor (3') splice-site positions −7 to −3, for D.sim, D.sec, D.yak, D.ere, D.ana, D.pse, D.per, D.wil, D.moj, D.vir, D.gri, A.gam, T.cas and A.mel, each compared to D.mel]

Less frequent ← | → more frequent (compared to D.mel); e.g. Apis prefers A over G (donor +3) and T over C (acceptor −3).

Learn log-odds substitution scores for all x, y ∈ {A, T, C, G} with x ≠ y:

\[ \log_2 \left( \frac{\mathrm{freq}_{pos}(x \to y)}{\mathrm{freq}_{neg}(x \to y)} \right) \]

→ substitution matrix
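A minimal sketch of how such a log-odds substitution matrix could be estimated from observed substitutions in positive and negative training samples. The input format (lists of base pairs relative to a reference species), the pseudocount and the function names are assumptions added for illustration; the poster does not specify them.

```python
import math
from collections import Counter
from itertools import permutations


def substitution_counts(pairs):
    """Count x->y substitutions (x != y) from (reference_base, other_base) pairs."""
    return Counter(
        (x, y) for x, y in pairs if x != y and x in "ACGT" and y in "ACGT"
    )


def log_odds_matrix(pos_pairs, neg_pairs, pseudocount=1.0):
    """Return {(x, y): log2(freq_pos / freq_neg)} for all 12 ordered pairs.

    The pseudocount (an assumption) avoids division by zero for
    substitutions that were never observed in one of the two sets.
    """
    pos = substitution_counts(pos_pairs)
    neg = substitution_counts(neg_pairs)
    subs = list(permutations("ACGT", 2))  # 12 ordered pairs with x != y
    pos_total = sum(pos.values()) + pseudocount * len(subs)
    neg_total = sum(neg.values()) + pseudocount * len(subs)
    matrix = {}
    for sub in subs:
        f_pos = (pos[sub] + pseudocount) / pos_total
        f_neg = (neg[sub] + pseudocount) / neg_total
        matrix[sub] = math.log2(f_pos / f_neg)
    return matrix


# Invented counts: G->A is common in positives, C->T in negatives.
pos_pairs = [("G", "A")] * 7 + [("C", "T")]
neg_pairs = [("G", "A")] + [("C", "T")] * 7
matrix = log_odds_matrix(pos_pairs, neg_pairs)
print(round(matrix[("G", "A")], 3), round(matrix[("C", "T")], 3))
```

Summing the matrix entries over the substitutions seen in a candidate alignment gives a "sum of substitution scores" feature of the kind used on the poster; positive sums indicate substitution patterns typical of real splice sites.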

[Figure: A) sum of substitution scores and B) average conservation scores (PhastCons, range 0 to 1) for the intronic regions +8...+20 and −20...−8, each shown as density distributions of positive and negative training samples. Two example Drosophila alignments are marked in the distributions: one classified as real intron (SVM probability 0.999) and one classified as false prediction (SVM probability 0.001). The annotated feature values are average = 0.002 with sum = 31.6, and sum = −3.7 with average = 0.92.]

5. Insects – Results

[Figure: UCSC Genome Browser views of five D. melanogaster loci (chr3R, chr2R, chr3L, chr3L, chrX) showing the predicted introns together with the tracks FlyBase Protein-Coding Genes, FlyBase Noncoding Genes, D. melanogaster mRNAs from GenBank, D. melanogaster ESTs That Have Been Spliced, and Conservation. Labeled annotations include CG14614, dally and pncr009:3L-RA.]

A) Predicted intron located in a 5' UTR (− strand). B) Predicted intron belonging to an antisense transcript of dally. C) EST-confirmed intron prediction. D) Predicted EST-confirmed intron revising the current FlyBase annotation. E) Clustered predictions at a putative novel protein-coding gene (blastx hits in several species).

• Area under the ROC curve: 0.983; at p > 0.95: 80% true positives at 0.12% false positives

• 369 predictions outside of known protein-coding genes (p > 0.95)

• 131 introns confirmed by ESTs or FlyBase transcripts, 238 unconfirmed

• After discarding novel protein-coding ones: 129 novel mlncRNAs

6. Insects – Exp. verification

• RT-PCR in 5 different developmental stages of D. melanogaster: embryo, larva, pupa, male, female

• 18/29 (62%) experimentally validated: mlncRNAs 7/12, introns in putative coding transcripts 11/17

7. Vertebrates – Refinements

Meet increasing requirements

• Vertebrate introns ≠ insect introns (2% vs. 54% short introns)

• Rather than predicting complete introns, we switch to individual splice-site prediction → new (SVM) features are needed to distinguish real from false splice sites. We propose:

(1) The human MaxEntScan splice-site score.

(2-4) Three log-odds substitution score variants s_tree, s_pair, s_median.

(5) The total number of species in an alignment.

(6) The total number of species with conserved GT/AG dinucleotides and a MaxEntScan score >= 0.

(7) The slope of a regression line fitted to the splice site's PhastCons sequence conservation profile over [-20,+20].

(8) The average GC content.

(9) The mean pairwise identity.
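Three of the simpler features in this list, (7), (8) and (9), can be sketched with the standard library alone. The gap handling (a gap against a base counts as a mismatch, columns with gaps in both rows are skipped), the function names and the toy inputs are assumptions; the poster does not state these conventions.

```python
from itertools import combinations


def gc_content(seqs):
    """Fraction of G/C among all non-gap bases of the aligned sequences."""
    bases = "".join(seqs).replace("-", "").upper()
    return sum(b in "GC" for b in bases) / len(bases)


def pairwise_identity(a, b):
    """Identity of two aligned rows; columns gapped in both rows are skipped."""
    cols = [(x, y) for x, y in zip(a.upper(), b.upper())
            if not (x == "-" and y == "-")]
    return sum(x == y for x, y in cols) / len(cols)


def mean_pairwise_identity(seqs):
    """Average identity over all unordered pairs of alignment rows."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def regression_slope(xs, ys):
    """Least-squares slope of ys against xs (e.g. conservation vs. position)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy splice-site alignment and a toy conservation profile (invented values).
rows = ["GTAAGT", "GTAAGT", "GTGAGT"]
positions = [-2, -1, 0, 1, 2]
profile = [1.0, 0.8, 0.6, 0.4, 0.2]
print(gc_content(rows), mean_pairwise_identity(rows),
      regression_slope(positions, profile))
```

For a donor site one would expect conservation to drop when moving from the exon into the intron, which shows up as a negative slope in this toy profile.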

Improve log odd substitution scores

• Reconstruct ancestral sequences for each splice-site region using prequel and learn splice-site substitution patterns for each edge e of the 44-species tree:

\[ s_{tree} = \sum_{e \in E} \log_2 \left( \frac{ f_{pos}(x \to y) \,/\, \sum_{n \in A} f_{pos}(x \to n) }{ f_{neg}(x \to y) \,/\, \sum_{n \in A} f_{neg}(x \to n) } \right) \]
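One possible reading of the tree-based score, as a sketch: for a single alignment position, each tree edge contributes the log2 ratio of the normalized positive and negative frequencies of the substitution observed on that edge. Several points are assumptions here: the frequency tables are stored per edge, the alphabet A is taken to be {A, C, G, T}, the sum over n includes n = x, and all frequencies are nonzero (for example through pseudocounts). Edge names and numbers are invented.

```python
import math

ALPHABET = "ACGT"  # assumed alphabet A


def normalized(freq, x, y):
    """f(x->y) divided by the sum of f(x->n) over all n in the alphabet."""
    total = sum(freq.get((x, n), 0.0) for n in ALPHABET)
    return freq.get((x, y), 0.0) / total


def s_tree(edge_substitutions, f_pos, f_neg):
    """Sum of log2 odds over tree edges for one alignment position.

    edge_substitutions: {edge: (parent_base, child_base)}
    f_pos, f_neg:       {edge: {(x, y): frequency}} learned from real
                        and pseudo splice sites (hypothetical layout)
    """
    score = 0.0
    for edge, (x, y) in edge_substitutions.items():
        p = normalized(f_pos[edge], x, y)
        q = normalized(f_neg[edge], x, y)
        score += math.log2(p / q)
    return score


# Two invented edges: on e1 the G is kept, on e2 a T changes to C.
f_pos = {"e1": {("G", "G"): 90, ("G", "A"): 10},
         "e2": {("T", "T"): 80, ("T", "C"): 20}}
f_neg = {"e1": {("G", "G"): 60, ("G", "A"): 40},
         "e2": {("T", "T"): 80, ("T", "C"): 20}}
observed = {"e1": ("G", "G"), "e2": ("T", "C")}
print(round(s_tree(observed, f_pos, f_neg), 3))
```

In this toy case only edge e1 is informative: keeping the G is more typical of real sites (0.9 vs. 0.6), so the score is positive.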

8. Vertebrates – Results

• 2 models, AUC: ∼0.93 (donor), ∼0.94 (acceptor)

• Intron candidates: arbitrarily defined as adjacent donor/acceptor pairs with distance <= 5000 nt on the same strand

• chr21: 886 pairs (p > 0.9), 105 with a typical PhastCons profile/basin, 16 manually chosen for experimental validation (ongoing work)
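The pairing rule for intron candidates can be sketched as follows. "Adjacent" is interpreted here as consecutive predicted sites on the same chromosome and strand with no other predicted site in between, which is an assumption, as are the tuple layout and the function name. On the minus strand the acceptor has the smaller genomic coordinate.

```python
from collections import defaultdict


def pair_splice_sites(donors, acceptors, max_dist=5000):
    """Pair adjacent donor/acceptor predictions into intron candidates.

    donors, acceptors: lists of (chrom, strand, position).
    Returns (chrom, strand, left_pos, right_pos) in genomic order.
    """
    sites = defaultdict(list)
    for chrom, strand, pos in donors:
        sites[(chrom, strand)].append((pos, "D"))
    for chrom, strand, pos in acceptors:
        sites[(chrom, strand)].append((pos, "A"))

    introns = []
    for (chrom, strand), lst in sorted(sites.items()):
        lst.sort()  # genomic order
        # Plus strand: donor then acceptor; minus strand: acceptor then donor.
        first, second = ("D", "A") if strand == "+" else ("A", "D")
        for (p1, k1), (p2, k2) in zip(lst, lst[1:]):
            if k1 == first and k2 == second and p2 - p1 <= max_dist:
                introns.append((chrom, strand, p1, p2))
    return introns


# Invented predictions; the 20000/30000 pair is too far apart.
donors = [("chr21", "+", 1000), ("chr21", "+", 20000), ("chr21", "-", 9000)]
acceptors = [("chr21", "+", 2500), ("chr21", "+", 30000), ("chr21", "-", 8000)]
print(pair_splice_sites(donors, acceptors))
```

Candidates produced this way would still need the PhastCons profile check mentioned above before being considered for validation.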

[Figure: UCSC Genome Browser views of predicted introns on chr22. Custom tracks: chr22 ACC, chr22 DO, chr22 phast-filtered introns. Common tracks: Vertebrate Multiz Alignment & Conservation (44 Species), Primate / Placental Mammal / Vertebrate Conservation by PhastCons, Repeating Elements by RepeatMasker.]

• intron:chr22:18111410-18112940 (scale 1 kb), shown with ENCODE Affymetrix/CSHL Subcellular RNA Localization by Tiling Array tracks for GM128 and K562 cell fractions

• intron:chr22:22066024-22066128 (scale 100 bases), shown with CONTRAST, SGP, Geneid and Genscan gene predictions

• intron:chr22:22659660-22661320, both ends shown at base resolution, with ENCODE Gencode, Ensembl (ENST00000405781), SIB, SGP, Geneid and Genscan gene annotations

Acknowledgements

for contributing:

• Micha, Sven, Sandro, Manja, Claudia, Christine, Rolf, Sonja, Gunter and Peter

for funding:

• German Research Foundation (STA 850/7-1 and Hi 1423/2-1)

• Graduiertenkolleg Wissensrepräsentation of the University of Leipzig

• European Network of Excellence "The Epigenome"

• 6th Framework Programme of the European Union (SYNLET)

References

[1] M. Hiller, S. Findeiß, S. Lein, M. Marz, C. Nickel, D. Rose, C. Schulz, R. Backofen, S. J. Prohaska, G. Reuter, P. F. Stadler, Conserved introns reveal novel transcripts in Drosophila melanogaster, Genome Res. 19 (2009) 1289–1300.
