tics Day 1

Transcript of tics Day 1

  • 8/8/2019 tics Day 1

    1/96

    WELCOME TO THE WORKSHOP ON

    BIOINFORMATICS

    M.Sc II REVISED SYLLABUS

  • 8/8/2019 tics Day 1

    2/96

    DAY 1

    30/09/09

    INTRODUCTION

  • 8/8/2019 tics Day 1

    3/96

    NCBI

    National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/

    Entrez, the Life Sciences Search Engine

    The Entrez page is home to the Entrez Global Query database search engine (the Entrez cross-database search page).

    The entire group of individual Entrez databases is organized on this page, with literature databases at the top, including PubMed, PubMed Central, Journals, Books, OMIM and OMIA.

    http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi?term=

  • 8/8/2019 tics Day 1

    4/96

    NCBI

    The NCBI Site Search is also listed.

    The sequence databases include Nucleotide, Protein, Genome, Structure, and SNPs.

    The remaining databases are Taxonomy, Gene, UniGene, HomoloGene, Conserved Domains, 3D Domains, UniSTS, PopSet, GEO Profiles, GEO Datasets, PubChem Bio-Assay, PubChem Compound, PubChem Substance, Cancer Chromosomes, Probe, MeSH, Journals and NLM Catalog.

    Links to popular NCBI Web pages, such as PubMed, Human Genome, Map Viewer, and BLAST, are on the toolbar.

    There is also a link to the "GenBank" database, leading to the Nucleotide database.

  • 8/8/2019 tics Day 1

    5/96

    NCBI

    By using the Entrez Global Query, a search across all Entrez databases is performed by entering a simple search term or phrase in the "Search across databases" query box.

    Select the Go button to execute the search, or press the Enter key on your keyboard.

    The CLEAR button erases search terms in the query box; use it to begin a new search.

    The results found in each database are displayed on the Global Query page.

    Click on the result number or its adjacent database name to get to the specific results.

    See the link to the Global Query Help document, which is to the right of the CLEAR button.

  • 8/8/2019 tics Day 1

    6/96

  • 8/8/2019 tics Day 1

    7/96

    Nucleotide Database

    When a search is done in the Nucleotide database, Entrez search results are also shown for the three component Nucleotide databases on the Search statistic line.

    The component Nucleotide databases together contain all the sequence data from GenBank, EMBL, and DDBJ, the members of the International Collaboration of Sequence Databases.

    The new component databases are included within the Entrez linking scheme, and links within and between databases can be selected as usual from the various datasets.

    Popular search strategies such as Limits, Preview/Index, History, and MyNCBI can be used within each individual database.

    The Nucleotide database also includes the Reference Sequence (RefSeq) records. RefSeqs are an NCBI-curated, non-redundant set of sequences.

  • 8/8/2019 tics Day 1

    8/96

    Protein Database

    The Protein database contains sequence

    data from the translated coding regions

    from DNA sequences in GenBank, EMBL,

    and DDBJ as well as protein sequences

    submitted to Protein Information Resource

    (PIR), SWISS-PROT, Protein Research

    Foundation (PRF), and Protein Data Bank (PDB) (sequences from solved structures).

  • 8/8/2019 tics Day 1

    9/96

    Genome Database

    The Genome database provides views for

    a variety of genomes, complete

    chromosomes, sequence maps with

    contigs, and integrated genetic and

    physical maps.

  • 8/8/2019 tics Day 1

    10/96

    Structure Database

    The Structure database or Molecular Modeling Database (MMDB) contains experimental data from crystallographic and NMR structure determinations.

    The data for MMDB are obtained from the Protein Data Bank (PDB).

    The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy.

  • 8/8/2019 tics Day 1

    11/96

    Conserved Domains

    Conserved Domains is a database of

    protein domains.

    The source databases for Conserved

    Domains are Pfam, Smart, and COG.

  • 8/8/2019 tics Day 1

    12/96

    3D Domains

    3D Domains contains protein domains

    from the Entrez Structure Database.

  • 8/8/2019 tics Day 1

    13/96

    UniSTS

    UniSTS is a unified, non-redundant view of sequence tagged sites (STSs).

    UniSTS integrates marker and mapping data from a variety of public resources.

    Data sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, and WashU chrX physical map), and various mouse maps (Whitehead RH map, Whitehead YAC map, and Jackson Laboratory's MGD map).

  • 8/8/2019 tics Day 1

    14/96

    Gene

    Gene provides a unified query

    environment for genes defined by

    sequence and/or in NCBI's Map Viewer.

    You can query on names, symbols,

    accessions, publications, GO terms,

    chromosome numbers, E.C. numbers, and

    many other attributes associated with genes and the products they encode.

  • 8/8/2019 tics Day 1

    15/96

    Taxonomy Database

    The Taxonomy database contains the

    names of all organisms that are

    represented in the NCBI genetic database

    by at least one nucleotide or protein sequence.

  • 8/8/2019 tics Day 1

    16/96

    PubMed Central

    PubMed Central (PMC) is the U.S.

    National Library of Medicine's digital

    archive of life sciences journal literature.

    Access to the full text of articles in PMC is

    free, except where a journal requires a

    subscription for access to recent articles.

  • 8/8/2019 tics Day 1

    17/96

    Journals

    The Journals database can be searched

    using the journal title, MEDLINE

    abbreviation, NLM ID, ISO abbreviation, or

    ISSN.

    The database includes the journals in all

    Entrez databases, e.g., PubMed,

    Nucleotide, Protein.

  • 8/8/2019 tics Day 1

    18/96

    MeSH

    MeSH (Medical Subject Headings) is the

    National Library of Medicine's controlled

    vocabulary used for indexing articles in

    PubMed.

    MeSH terminology provides a consistent

    way to retrieve information that may use

    different terminology for the same concepts.

  • 8/8/2019 tics Day 1

    19/96

    Bookshelf

    The Bookshelf has a collection of

    Biomedical books that are linked in Entrez.

    The NCBI Handbook is also available from

    the Bookshelf.

  • 8/8/2019 tics Day 1

    20/96

    OMIM Database

    The OMIM (Online Mendelian Inheritance

    in Man) database is a catalog of human

    genes and genetic disorders.

  • 8/8/2019 tics Day 1

    21/96

    OMIA Database

    Online Mendelian Inheritance in Animals (OMIA) is a database of genes, inherited disorders and traits in animal species (other than human and mouse) authored by Professor Frank Nicholas of the University of Sydney, Australia, with help from many people over the years.

    The database contains textual information and references, as well as links to relevant records from OMIM, PubMed, Gene, and soon to NCBI's Phenotype database.

  • 8/8/2019 tics Day 1

    22/96

    Boolean Operators used in Entrez

    AND: To AND two search terms together instructs Entrez to find all documents that contain BOTH terms.

    OR: To OR two search terms together instructs Entrez to find all documents that contain EITHER term.

    NOT: To NOT two search terms together instructs Entrez to find all documents that contain search term 1 BUT NOT search term 2.

    Boolean operators AND, OR, NOT must be entered in UPPERCASE (e.g., promoters OR response elements).
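
The uppercase rule for Boolean operators can be illustrated with a small sketch. The helper names `combine` and `esearch_url` are hypothetical, and the use of the NCBI E-utilities `esearch.fcgi` endpoint is an assumption added here for illustration; the slides themselves only describe the web query box. The code only builds the query text and URL and does not contact NCBI.

```python
from urllib.parse import urlencode

BOOLEAN_OPERATORS = ("AND", "OR", "NOT")


def combine(term1, operator, term2):
    """Join two search terms with a Boolean operator, forced to UPPERCASE."""
    op = operator.upper()
    if op not in BOOLEAN_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    return f"{term1} {op} {term2}"


def esearch_url(db, query):
    """Build (but do not send) an E-utilities search URL for one database."""
    # Assumption: the public E-utilities ESearch endpoint.
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": db, "term": query})


query = combine("promoters", "or", "response elements")
print(query)
print(esearch_url("pubmed", query))
```

Forcing the operator to uppercase inside the helper mirrors the rule on the slide, so a lowercase "or" is not passed through as if it were an ordinary search word.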

  • 8/8/2019 tics Day 1

    23/96

    Unique identifiers can be accession numbers, which apply to a complete sequence record, or sequence identification numbers, which apply to the individual sequences within a record.

    The format of accession numbers varies, depending upon the source database.

    Each data domain in Entrez contains records from a number of different sources.

    Searching for Unique Identifiers

  • 8/8/2019 tics Day 1

    24/96

    Accession number: the unique identifier for a sequence record.

    An accession number applies to the complete record and is usually a combination of a letter(s) and numbers, such as a single letter followed by five digits (e.g., U12345) or two letters followed by six digits (e.g., AF123456).

    Some accessions might be longer, depending on the type of sequence record.

    Accession numbers do not change, even if information in the record is changed at the author's request.

    Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record.

    Searching for Unique Identifiers

  • 8/8/2019 tics Day 1

    25/96

    Records from the RefSeq database of reference sequences have a different accession number format that begins with two letters followed by an underscore bar and six or more digits, for example:

    NT_123456 constructed genomic contigs

    NM_123456 mRNAs

    NP_123456 proteins

    NC_123456 chromosomes

    Searching for Unique Identifiers

    http://www.ncbi.nlm.nih.gov/RefSeq/
    http://www.ncbi.nlm.nih.gov/RefSeq/key.html

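
The accession formats described on these slides can be checked with regular expressions. This is a minimal sketch covering only the three patterns the slides mention; the function name `classify_accession` and the labels it returns are hypothetical, and real databases use additional formats that this sketch reports as "unrecognized".

```python
import re

# Only the formats named on the slides; other real-world formats exist.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d{6,}$")),
    ("1 letter + 5 digits", re.compile(r"^[A-Z]\d{5}$")),
    ("2 letters + 6 digits", re.compile(r"^[A-Z]{2}\d{6}$")),
]

# RefSeq prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contigs",
    "NM": "mRNAs",
    "NP": "proteins",
    "NC": "chromosomes",
}


def classify_accession(accession):
    """Return a label describing which slide format the accession matches."""
    for label, pattern in PATTERNS:
        if pattern.match(accession):
            if label == "RefSeq":
                kind = REFSEQ_PREFIXES.get(accession[:2], "other RefSeq type")
                return f"RefSeq ({kind})"
            return label
    return "unrecognized"


for example in ("U12345", "AF123456", "NM_123456", "NC_123456"):
    print(example, "->", classify_accession(example))
```
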
  • 8/8/2019 tics Day 1

    26/96

    GI numbers: a series of digits that are assigned consecutively by NCBI to each sequence it processes.

    The GI ("GenInfo Identifier") is a sequence identification number, in this case, for the nucleotide sequence.

    If a sequence changes in any way, a new GI number will be assigned.

    A separate GI number is also assigned to each protein translation within a nucleotide sequence record, and a new GI is assigned if the protein translation changes in any way.

    Searching for Unique Identifiers

  • 8/8/2019 tics Day 1

    27/96

    Nucleotide sequence:

    GI: 6995995

    VERSION: NM_000492.2

    Protein translation:

    GI: 6995996

    VERSION: NP_000483.2

    Searching for Unique Identifiers
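
The VERSION values in the example above have the form accession.version, so NM_000492.2 is version 2 of accession NM_000492. A minimal sketch of splitting such a value follows; the function name `split_version` is hypothetical and is not part of any NCBI tool.

```python
def split_version(version_field):
    """Split an 'accession.version' string into (accession, version number)."""
    accession, _, version = version_field.partition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not in accession.version form: {version_field!r}")
    return accession, int(version)


# The two VERSION values shown on the slide.
print(split_version("NM_000492.2"))
print(split_version("NP_000483.2"))
```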

  • 8/8/2019 tics Day 1

    28/96

    EBI

    The European Bioinformatics Institute (EBI) is a

    non-profit academic organisation that forms part

    of the European Molecular Biology Laboratory (EMBL).

    The EBI is a centre for research and services in

    bioinformatics.

    http://www.ebi.ac.uk

    The Institute manages databases of biological

    data including nucleic acid, protein sequences

    and macromolecular structures.

    http://www.embl.org/

  • 8/8/2019 tics Day 1

    29/96

    EBI

    It is the European node for globally coordinated efforts to collect and disseminate biological data.

    Many of its databases are household names to biologists; they include EMBL-Bank (DNA and RNA sequences), Ensembl (genomes), ArrayExpress (microarray-based gene-expression data), UniProt (protein sequences), InterPro (protein families, domains and motifs) and PDBe (macromolecular structures).

    Others, such as IntAct (protein-protein interactions), Reactome (pathways) and ChEBI (small molecules), are new resources that help researchers to understand not only the molecular parts that go towards constructing an organism, but how these parts combine to create systems.

    The details of each database vary, but they all uphold the same principles of service provision.

    http://www.ebi.ac.uk/embl/
    http://www.ensembl.org/
    http://www.ebi.ac.uk/arrayexpress/
    http://www.ebi.ac.uk/uniprot/
    http://www.ebi.ac.uk/interpro/
    http://www.ebi.ac.uk/pdbe/
    http://www.ebi.ac.uk/intact/
    http://www.reactome.org/
    http://www.ebi.ac.uk/chebi/

  • 8/8/2019 tics Day 1

    30/96

    EMBL - Bank

    EMBL-Bank is produced as part of the International Nucleotide Sequence Database Collaboration (see side panel and figure).

    Each of the three groups (DDBJ, EMBL-Bank and GenBank) collects a proportion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between the groups on a daily basis.

    EMBL-Bank contains over 150 million DNA and RNA sequences, ranging from as few as ten base pairs to entire genomes.

    Its sequences come from three main sources: individual research groups, genome-sequencing projects and patent applications.

  • 8/8/2019 tics Day 1

    31/96

    ENSEMBL

    Ensembl provides a comprehensive resource for the scientific community which allows analysis of genetic information within and between species.

    Hence, the resource is of use in a wide range of research fields from evolutionary biology to clinical research.

    Ensembl annotates chordate genomes (i.e. vertebrates and closely related invertebrates such as the sea squirt).

    Gene sets from model organisms such as yeast and fly are also imported for comparative analysis.

    All Ensembl genes are placed according to the experimental evidence of protein and mRNA sequences obtained from UniProt/Swiss-Prot, UniProt/TrEMBL and RefSeq.

    Sequence data is obtained from relevant genome sequencing centres and consortia.

  • 8/8/2019 tics Day 1

    32/96

    ENSEMBL

    With Ensembl you can:

    Retrieve all or part of a genome sequence.

    Use the sequence alignment search tools BLAST and BLAT against any Ensembl genome.

    Link to genome annotation from microarray results.

    View expressed sequence tags (ESTs), clones, mRNA and proteins for any chromosomal region.

    Examine genes, markers and single nucleotide polymorphisms (SNPs) in a chromosomal region.

    View variations such as SNPs across strains (rat, mouse) or populations (human).

    View all alternative transcripts (splice variants) for a gene.

  • 8/8/2019 tics Day 1

    33/96

    UniProt

    UniProt is produced by the UniProt Consortium, a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR).

    UniProt comprises four components:

    The UniProt Knowledgebase (UniProtKB)

    UniProt Reference Clusters (UniRef)

    UniProt Archive (UniParc)

    UniProt Metagenomic and Environmental Sequences (UniMES)

  • 8/8/2019 tics Day 1

    34/96

    UniProt

    The UniProt Knowledgebase, and in particular

    UniProtKB/Swiss-Prot, is used to access

    functional information on proteins.

    Every UniProtKB entry contains the amino acid sequence, protein name or description, taxonomic data and citation information; in addition to this, annotations are added.

    These include widely accepted biological ontologies, classifications and cross-references, as well as clear indications of the quality of annotation in the form of evidence attribution to experimental and computational data.

  • 8/8/2019 tics Day 1

    35/96

    UniProt

    The UniRef databases provide clustered

    sets of sequences from UniProtKB and

    selected UniParc records to provide

    complete coverage of sequence space at

    several resolutions.

    UniRef90 and UniRef50 yield a database

    size reduction of approximately 40% and

    65%, respectively, providing significantly faster sequence searches.

  • 8/8/2019 tics Day 1

    36/96

    UniProt

    UniParc is the most comprehensive publicly accessible non-redundant protein sequence database available, providing links to all underlying sources and versions of these sequences.

    You can instantly find out whether a sequence of interest is already in the public domain and, if not, identify its closest relatives.

  • 8/8/2019 tics Day 1

    37/96

    UniProt

    UniMES is a repository specifically for

    metagenomic and environmental data.

  • 8/8/2019 tics Day 1

    38/96

    DAY 1

    30/09/09

    Session II

  • 8/8/2019 tics Day 1

    39/96

    Sequence Alignment

    Quite simply, the comparison of two or

    more DNA or protein sequences to each

    other.

    The purpose of alignment is to highlight

    similarity between the sequences.

  • 8/8/2019 tics Day 1

    40/96

    SEQUENCE ALIGNMENT

    Sequence alignment is a standard technique in bioinformatics for visualizing the relationships between residues in a collection of evolutionarily or structurally related proteins.

    An alignment provides a bird's eye view of the underlying evolutionary, structural, or functional constraints characterizing a protein family in a concise, visually intuitive format.

  • 8/8/2019 tics Day 1

    41/96

    Fundamental assumption of sequence

    alignment:

    Sequences that are similar share a

    common ancestral sequence.

    Due to common ancestry, similar

    sequences have similar functionality.

    Sequences that share a common

    ancestor are said to be homologous.

    SEQUENCE ALIGNMENT

  • 8/8/2019 tics Day 1

    42/96

    Why do we have divergent copies of the

    same sequence in genomes?

    Speciation Events

    - Divergence of a single species into two

    or more new species

    Gene Duplication

    - Errors in replication

    SEQUENCE ALIGNMENT

  • 8/8/2019 tics Day 1

    43/96

    What kind of Alignment?

    - Global vs. Local

    - Pairwise vs. Multiple Sequence Alignment

    SEQUENCE ALIGNMENT
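
The difference between global and local alignment can be sketched with two small dynamic-programming scorers. This is an illustration under simple assumptions (match +1, mismatch -1, a flat gap cost of -1 per position); the function names are hypothetical, and production tools use substitution matrices and more elaborate gap costs. The global scorer aligns both sequences end to end, while the local scorer is allowed to restart at zero and so reports the best-matching sub-region.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best end-to-end (global) alignment score, Needleman-Wunsch style."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best sub-region (local) alignment score, Smith-Waterman style."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Two made-up sequences sharing only the central "HEAG" region.
seq1 = "WWWHEAGWWW"
seq2 = "PPPHEAGPPP"
print("global:", global_score(seq1, seq2))
print("local: ", local_score(seq1, seq2))
```

With these made-up sequences the local score is higher than the global score, because the unrelated flanking residues are penalized only when the whole length must be aligned.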

  • 8/8/2019 tics Day 1

    44/96

    BLAST

    The comparison of nucleotide or protein sequences from the same or different organisms is a very powerful tool in molecular biology.

    By finding similarities between sequences, scientists can infer the function of newly sequenced genes, predict new members of gene families, and explore evolutionary relationships.

    Now that whole genomes are being sequenced, sequence similarity searching can be used to predict the location and function of protein-coding and transcription-regulation regions in genomic DNA.

  • 8/8/2019 tics Day 1

    45/96

    BLAST

    Basic Local Alignment Search Tool (BLAST) is the tool most frequently used for calculating sequence similarity.

    BLAST comes in variations for use with different query sequences against different databases.

    All BLAST applications, as well as information on which BLAST program to use and other help documentation, are listed on the BLAST homepage.

    http://www.ncbi.nlm.nih.gov/BLAST/

  • 8/8/2019 tics Day 1

    46/96

    BLAST

    Nucleotide BLAST searches allow one to input nucleotide sequences and compare these against other nucleotides.

    Standard nucleotide-nucleotide BLAST - Takes nucleotide sequences in FASTA format, GenBank Accession numbers or GI numbers and compares them against the NCBI nucleotide databases.

    MEGABLAST - This program uses a "greedy algorithm" (Webb Miller et al.) for nucleotide sequence alignment searches and concatenates many queries to save time spent scanning the database. It is optimized for aligning sequences that differ slightly and is up to 10 times faster than more common sequence similarity programs. It can be used to swiftly compare two large sets of sequences against each other.

    http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=10890397&dopt=Abstract

  • 8/8/2019 tics Day 1

    47/96

    BLAST

    Protein BLAST allows one to input protein sequences and compare these against other protein sequences.

    Standard protein-protein BLAST - Takes protein sequences in FASTA format, GenBank Accession numbers or GI numbers and compares them against the NCBI protein databases.

    PSI-BLAST - Position Specific Iterated BLAST uses an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search and the results of each "iteration" are used to refine the profile. This iterative searching strategy results in increased sensitivity.

    PHI-BLAST - Pattern Hit Initiated BLAST combines matching of a regular expression pattern with a Position Specific iterative protein search. PHI-BLAST can locate other protein sequences which both contain the regular expression pattern and are homologous to a query protein sequence.

    http://www.ncbi.nlm.nih.gov/blast/html/blastcgihelp.html

  • 8/8/2019 tics Day 1

    48/96

    BLAST

    Pairwise BLAST performs a comparison between two sequences using the BLAST algorithm. Note that the program considers "Sequence 1" to be the Query sequence and "Sequence 2" to be the Subject sequence. There are the following program options:

    blastn - for nucleotide - nucleotide comparisons

    blastp - for protein - protein comparisons

    tblastn - compares the protein "Sequence 1" against the nucleotide "Sequence 2" which has been translated in all six reading frames

    blastx - compares the nucleotide "Sequence 1" (translated in all six reading frames) against the protein "Sequence 2"

    tblastx - compares nucleotide "Sequence 1" translated in all six reading frames against the nucleotide "Sequence 2" translated in all six reading frames.
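
The program options above amount to a lookup on the two sequence types. A minimal sketch of that lookup follows; the function `choose_program` and its `translate_both` flag are hypothetical names used only for illustration and are not part of BLAST itself.

```python
# (Sequence 1 type, Sequence 2 type, translate both?) -> program name
PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("protein", "nucleotide", False): "tblastn",
    ("nucleotide", "protein", False): "blastx",
    ("nucleotide", "nucleotide", True): "tblastx",
}


def choose_program(seq1_type, seq2_type, translate_both=False):
    """Pick the pairwise BLAST program for a query/subject combination."""
    key = (seq1_type, seq2_type, translate_both)
    if key not in PROGRAMS:
        raise ValueError(f"no BLAST program for combination {key}")
    return PROGRAMS[key]


print(choose_program("protein", "nucleotide"))
print(choose_program("nucleotide", "nucleotide", translate_both=True))
```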

  • 8/8/2019 tics Day 1

    49/96

    Why would you translate nucleotide

    sequence to protein before comparing it?

    More information in proteins!

    - Detect more distant homology.

    BLAST
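
Translation "in all six reading frames" can be sketched as follows: three frames on the given strand and three on its reverse complement. This is a minimal illustration using the standard genetic code; the function names are hypothetical, the example sequence is made up, and any codon containing a character other than A, C, G or T is written as X.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Frames 1-3 on the given strand, then frames 1-3 on the reverse strand."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


for frame in six_frame_translations("ATGGCCTAA"):
    print(frame)
```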

  • 8/8/2019 tics Day 1

    50/96

    Most proteins are modular in nature, with functional domains often being repeated within the same protein as well as across different proteins from different species.

    The BLAST algorithm is tuned to find these domains or shorter stretches of sequence similarity.

    The local alignment approach also means that an mRNA can be aligned with a piece of genomic DNA, as is frequently required in genome assembly and analysis.

    If instead BLAST started out by attempting to align two sequences over their entire lengths (known as a global alignment), fewer similarities would be detected, especially with respect to domains and motifs.

    BLAST

  • 8/8/2019 tics Day 1

    51/96

    A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another.

    To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score.

    Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.

    BLAST
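
The two-part gap penalty (a fixed deduction for opening a gap plus a further deduction for each position it spans) can be sketched as below. The default values 11 and 1 and the convention "opening cost plus extension cost times length" are assumptions chosen for illustration; the slides do not give specific numbers, and scoring schemes differ in how they count the first gap position.

```python
def gap_penalty(length, gap_open=11, gap_extend=1):
    """Penalty deducted from the alignment score for one gap of `length`."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    # Assumed convention: fixed opening cost plus a cost per gapped position.
    return gap_open + gap_extend * length


# One long gap is cheaper than several short gaps covering the same positions.
print("one gap of length 3:   ", gap_penalty(3))
print("three gaps of length 1:", 3 * gap_penalty(1))
```

Because the opening cost dominates, an alignment with one longer gap is preferred over one with many scattered single-position gaps, which is the behavior the slide describes.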

  • 8/8/2019 tics Day 1

    52/96

    BLAST

    Once BLAST has found a similar sequence to the query in the database, it is helpful to have some idea of whether the alignment is good and whether it portrays a possible biological relationship, or whether the similarity observed is attributable to chance alone.

    BLAST uses statistical theory to produce a bit score and expect value (E-value) for each alignment pair (query to hit).

    The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment.

    The E-value gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used. The lower the E-value, the more significant the hit. A sequence alignment with an E-value of 0.05 means that about 0.05 alignments scoring this well are expected by chance alone in a search of this database, which corresponds to roughly a 5 in 100 (1 in 20) chance of such a hit occurring by chance.

    The BLAST report header.

    http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
    http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.glossary.1237

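
The relationship between bit score, search size and E-value can be sketched with the standard formula E = m x n x 2^(-S'), where S' is the bit score and m and n are the query and database lengths. This is an illustration only: the function names are hypothetical, the lengths in the example are made up, and BLAST itself applies corrections (such as effective lengths) that are not modeled here. The second function converts an E-value into the probability of seeing at least one such hit by chance.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for bit score S' and search space m * n."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_probability(e_value):
    """Probability of at least one chance hit this good: 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value)


# Made-up search: a 300-residue query against 10 million database residues.
for bits in (20, 40, 60):
    e = expect_value(bits, 300, 10_000_000)
    print(f"bit score {bits}: E = {e:.3g}, P = {chance_probability(e):.3g}")

# For small E-values the probability is close to E itself.
print(chance_probability(0.05))
```
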
  • 8/8/2019 tics Day 1

    53/96

    The top line gives information about the type of program (in this case,

    BLASTP), the version (2.2.1), and a version release date.

    The research paper that describes BLAST is then cited, followed by the

    request ID (issued by QBLAST), the query sequence definition line, and

    a summary of the database searched.

    The Taxonomy reports link displays this BLAST result on the basis of

    information in the Taxonomy database

    Graphical overview of BLAST results

    http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.glossary.1237

  • 8/8/2019 tics Day 1

    54/96

    Graphical overview of BLAST results.

    The query sequence is represented by the numbered red bar at the top of the figure.

    Database hits are shown aligned to the query, below the red bar.

    Of the aligned sequences, the most similar are shown closest to the query.

    In this case, there are three high-scoring database matches that align to most of the query sequence.

    The next twelve bars represent lower-scoring matches that align to two regions of the query, from about residues 3-60 and residues 220-500.

    The cross-hatched parts of these bars indicate that the two regions of similarity are on the same protein, but that the intervening region does not match.

    The remaining bars show lower-scoring alignments.

    Mousing over a bar displays the definition line for that sequence in the window above the graphic.

  • 8/8/2019 tics Day 1

    55/96

  • 8/8/2019 tics Day 1

    56/96

    One-line descriptions in the BLAST report.

    Each line is composed of four fields:

    (a) the gi number, database designation, Accession number, and locus name for the matched sequence, separated by vertical bars (Appendix 1);

    (b) a brief textual description of the sequence, the definition. This usually includes information on the organism from which the sequence was derived, the type of sequence (e.g., mRNA or DNA), and some information about function or phenotype. The definition line is often truncated in the one-line descriptions to keep the display compact;

    (c) the alignment score in bits. Higher scoring hits are found at the top of the list; and

    (d) the E-value, which provides an estimate of statistical significance.

  • 8/8/2019 tics Day 1

    57/96

    For the first hit in the list, the gi number is

    116365, the database designation is sp (for

    SWISS-PROT), the Accession number is

    P26374, the locus name is RAE2_HUMAN, the definition line is Rab proteins, the score is 1216,

    and the E-value is 0.0.

    Note that the first 17 hits have very low E-values (much less than 1) and are either RAB proteins

    or GDP dissociation inhibitors.

    The other database matches have much higher

    E-values, 0.5 and above, which means that these sequences may have been matched by

    chance alone.
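
The first field of a one-line description, with its parts separated by vertical bars, can be split apart as in the sketch below. The identifier string is reassembled here from the values quoted on the slide (gi 116365, sp, P26374, RAE2_HUMAN), so its exact layout is an assumption, and the function name `parse_identifier` is hypothetical.

```python
def parse_identifier(field):
    """Split 'gi|number|database|accession|locus' into named parts."""
    parts = field.split("|")
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError(f"unexpected identifier layout: {field!r}")
    _, gi, database, accession, locus = parts
    return {
        "gi": int(gi),
        "database": database,
        "accession": accession,
        "locus": locus,
    }


# Assumed layout, rebuilt from the values given for the first hit.
record = parse_identifier("gi|116365|sp|P26374|RAE2_HUMAN")
for name, value in record.items():
    print(f"{name}: {value}")
```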

  • 8/8/2019 tics Day 1

    58/96

    A pairwise sequence alignment from a BLAST report.

  • 8/8/2019 tics Day 1

    59/96

    The alignment is preceded by the sequence identifier, the full definition line, and the length of the matched sequence, in amino acids.

    Next comes the bit score (the raw score is in parentheses) and then the E-value.

    The following line contains information on the number of identical residues in this alignment (Identities), the number of conservative substitutions (Positives), and if applicable, the number of gaps in the alignment.

    Finally, the actual alignment is shown, with the query on top, and the database match, labeled as Sbjct, below.

    The numbers at left and right refer to the position in the amino acid sequence.

    One or more dashes (-) within a sequence indicate insertions or deletions.

    Amino acid residues in the query sequence that have been masked because of low complexity are replaced by Xs (see, for example, the fourth and last blocks).

    The line between the two sequences indicates the similarities between the sequences.

    If the query and the subject have the same amino acid at a given location, the residue itself is shown.

    Conservative substitutions, as judged by the substitution matrix, are indicated with +.
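
The Identities and Gaps counts, and the middle line of an alignment block, can be recomputed from the two aligned rows. This is a simplified sketch: the function name and the example rows are made up, and the "+" marks for conservative substitutions are left out because they require a substitution matrix.

```python
def alignment_summary(query, sbjct):
    """Return (identities, gaps, length, midline) for two aligned rows."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have the same length")
    identities = 0
    gaps = 0
    midline = []
    for q, s in zip(query, sbjct):
        if q == "-" or s == "-":
            gaps += 1          # a dash marks an insertion or deletion
            midline.append(" ")
        elif q == s:
            identities += 1    # identical residues are echoed in the midline
            midline.append(q)
        else:
            midline.append(" ")
    return identities, gaps, len(query), "".join(midline)


# Made-up aligned rows, not taken from the report on the slide.
identities, gaps, length, midline = alignment_summary("MKT-LLA", "MKSALLA")
print(f"Identities = {identities}/{length}, Gaps = {gaps}/{length}")
print("Query  MKT-LLA")
print("       " + midline)
print("Sbjct  MKSALLA")
```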

  • 8/8/2019 tics Day 1

    60/96

    BLAST

  • 8/8/2019 tics Day 1

    61/96

    BLAST

    Steps to BLAST a particular sequence

    Search all databases for any protein/gene name

    Click on the hit obtained and derive the FASTA sequence

    Go to the NCBI BLAST home page

    Choose the type of BLAST program to be used

    Copy the FASTA format of the sequence and paste it in the search field.
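
The FASTA format used in these steps is a ">" header line followed by one or more lines of sequence. A minimal reader is sketched below; the function name `parse_fasta` is hypothetical, and the example record (its header and residues) is made up for illustration rather than taken from a database.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>example_protein hypothetical record for illustration
MKTAYIAKQR
QISFVKSHFS
"""
for header, sequence in parse_fasta(example):
    print(header)
    print(sequence, len(sequence))
```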

  • 8/8/2019 tics Day 1

    62/96

  • 8/8/2019 tics Day 1

    63/96

  • 8/8/2019 tics Day 1

    64/96

    DAY 1

    30/09/09

    Session III

  • 8/8/2019 tics Day 1

    65/96

    BLAST ORTHOLOGS AND PARALOGS

    Homology refers to any similarity between characteristics of organisms that is due to their shared ancestry.

    Homologous sequences are orthologous if they were separated by a speciation event: when a species diverges into two separate species, the divergent copies of a single gene in the resulting species are said to be orthologous.

    Orthologs, or orthologous genes, are genes in different species that are similar to each other because they originated from a common ancestor.

    http://en.wikipedia.org/wiki/Characteristic
    http://en.wikipedia.org/wiki/Organisms
    http://en.wikipedia.org/wiki/Common_descent
    http://en.wikipedia.org/wiki/Speciation

  • 8/8/2019 tics Day 1

    66/96

    BLAST ORTHOLOGS AND PARALOGS

    Homologous sequences are paralogous if they were separated by a gene duplication event: if a gene in an organism is duplicated to occupy two different positions in the same genome, then the two copies are paralogous.

    A set of sequences that are paralogous are called paralogs of each other.

    Paralogs typically have the same or similar function, but sometimes do not: due to lack of the original selective pressure upon one copy of the duplicated gene, this copy is free to mutate and acquire new functions.

    http://en.wikipedia.org/wiki/Gene_duplication

  • 8/8/2019 tics Day 1

    67/96

    PARALOGS

    Orthologs and paralogs are two fundamentally different types of homologous genes that evolved, respectively, by vertical descent from a single ancestral gene and by duplication.

    Orthology and paralogy are key concepts of evolutionary genomics.

    A clear distinction between orthologs and paralogs is critical for the construction of a robust evolutionary classification of genes and reliable functional annotation of newly sequenced genomes.

    Genome comparisons show that orthologous relationships with genes from taxonomically distant species can be established for the majority of the genes from each sequenced genome.
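
    The distinction can be made concrete with a toy gene tree. The sketch below is illustrative only: the tree, the gene labels and the function names are assumptions made for this example and are not the output of any BLAST tool. Each internal node is labelled with the event that produced it. Two genes are treated as orthologs if their last common ancestor node is a speciation, and as paralogs if it is a duplication.

```python
# Toy gene tree (hypothetical data for illustration).
# Internal nodes are (event, left, right); leaves are gene names.
TREE = (
    "duplication",
    ("speciation", "human_geneA1", "mouse_geneA1"),
    ("speciation", "human_geneA2", "mouse_geneA2"),
)


def leaves(node):
    """Return the names of all genes below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def lca_event(node, a, b):
    """Return the event at the last common ancestor of genes a and b.

    Assumes a and b are two different leaves of the tree.
    """
    event, left, right = node
    for child in (left, right):
        if not isinstance(child, str):
            names = leaves(child)
            if a in names and b in names:
                # Both genes sit in the same subtree, so look deeper.
                return lca_event(child, a, b)
    return event


def classify(tree, a, b):
    """Label a pair of homologous genes as orthologs or paralogs."""
    return "orthologs" if lca_event(tree, a, b) == "speciation" else "paralogs"


if __name__ == "__main__":
    print(classify(TREE, "human_geneA1", "mouse_geneA1"))
    print(classify(TREE, "human_geneA1", "human_geneA2"))
```

    Note that genes from different species can still be paralogs (here human_geneA1 and mouse_geneA2), because the duplication that separated them is older than the speciation.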

    BLAST - Orthology

    http://oxytricha.princeton.edu/BlastO/

    Obtain the sequence of a particular protein in FASTA format

    Paste it in the search query field

    Keep all parameters as default

    Click the search button and wait for the results to be displayed

    Results can vary with the database searched
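
    Before pasting a query it can help to check that it really is in FASTA format: a header line starting with ">" followed by one or more lines of sequence. The following is a minimal sketch of a FASTA reader; the function name and the example records are assumptions made for illustration, and the sequences shown are placeholders rather than real proteins.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without breaks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">query1 placeholder protein\nMKTAYIAK\nQRQISFVK\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```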

    BLAST - Paralogy

    Perform blastp with human angiogenin

    Perform blastp with:

    Phosphoglycerate kinase

    Phospholipases

    Serine proteinase

    Zinc metalloproteinase

    DAY 1

    30/09/09

    Session IV


    Sequence Alignment

    Sequence alignment is, quite simply, the comparison of two or more DNA or protein sequences with each other.

    The purpose of alignment is to highlight similarity between the sequences.


    Sequence alignment is a standard technique in bioinformatics for visualizing the relationships between residues in a collection of evolutionarily or structurally related proteins.

    An alignment provides a bird's-eye view of the underlying evolutionary, structural, or functional constraints characterizing a protein family in a concise, visually intuitive format.


    Fundamental assumption of sequence alignment:

    Sequences that are similar share a common ancestral sequence.

    Due to common ancestry, similar sequences have similar functionality.

    Sequences that share a common ancestor are said to be homologous.


    Why do we have divergent copies of the same sequence in genomes?

    Speciation Events

    - Divergence of a single species into two or more new species

    Gene Duplication

    - Errors in replication


    What kind of Alignment?

    - Global vs. Local

    - Pairwise vs. Multiple Sequence Alignment

    GLOBAL VS LOCAL ALIGNMENT

    A global alignment is one that compares the two sequences over their entire lengths, and is appropriate for comparing sequences that are expected to share similarity over the whole length.

    The alignment maximises regions of similarity and minimises gaps using the scoring matrices and gap parameters provided to the program.
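
    A global alignment of this kind can be computed by dynamic programming (the Needleman-Wunsch approach). The sketch below is a minimal illustration, not the implementation used by any particular server: it assumes a simple match/mismatch score in place of a full scoring matrix and a single linear gap penalty, and the parameter values are arbitrary choices for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    Assumes a simple match/mismatch score and a linear gap penalty.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Leading gaps are penalised, because the whole length must be aligned.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    s, x, y = needleman_wunsch("ACGT", "AGT")
    print(s)
    print(x)
    print(y)
```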


    Global sequence alignment algorithms align sequences over their entire lengths.

    A second comparison method, local alignment, searches for regions of local similarity and need not include the entire length of the sequences.
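
    Local alignment can be sketched with the same kind of dynamic programming, with two changes (the Smith-Waterman approach): cell scores are never allowed to fall below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. As before, this is a minimal illustration with assumed match, mismatch and gap values, not the code of any specific tool.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b; returns (score, aligned_a, aligned_b).

    Assumes a simple match/mismatch score and a linear gap penalty.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The floor of 0 lets an alignment start anywhere.
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```

    In the usage example only the shared core of the two sequences is reported; the unrelated flanking residues are left out of the alignment, which is the behaviour described above.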

    PAIRWISE SEQUENCE ALIGNMENT

    EMBOSS Pairwise Alignment Algorithms

    http://www.ebi.ac.uk/Tools/emboss/align/index.html

    This tool is used to compare two sequences.

    When you want an alignment that covers the whole length of both sequences, use needle (global alignment).

    When you are trying to find the best region of similarity between two sequences, use water (local alignment).

    Water aligns the best-matching subsequences of two sequences. It does not necessarily align whole sequences against each other; use needle if you wish to align closely related sequences along their whole lengths.

    The %id is the percentage of identical matches between the two sequences over the reported aligned region.

    The %similarity is the percentage of matches between the two sequences over the reported aligned region where the scoring matrix value is greater than or equal to 0.0.

    The Overall %id and Overall %similarity are calculated in a similar manner for the number of matches over the length of the longer of the two sequences.
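
    These definitions can be illustrated on an already aligned pair. The sketch below is an assumption-laden example: the scoring function toy_score is a made-up stand-in for a real scoring matrix, the aligned strings are invented, and gap columns are counted in the length of the aligned region but never as matches.

```python
# Hypothetical "similar" residue pairs standing in for a real scoring matrix.
TOY_SIMILAR = {frozenset("IL"), frozenset("KR"), frozenset("DE")}


def toy_score(x, y):
    """Made-up score: 1 for identity, 0 for a listed similar pair, else -1."""
    if x == y:
        return 1
    return 0 if frozenset((x, y)) in TOY_SIMILAR else -1


def percent_id_and_similarity(aligned_a, aligned_b, score):
    """Return (%id, %similarity) over the aligned region, gaps included."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    length = len(aligned_a)
    identical = similar = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # a gap column is neither identical nor similar
        if x == y:
            identical += 1
        if score(x, y) >= 0:
            similar += 1  # identities also count as similar
    return 100.0 * identical / length, 100.0 * similar / length


if __name__ == "__main__":
    pid, psim = percent_id_and_similarity("MKIL-DE", "MRLLADQ", toy_score)
    print(round(pid, 2), round(psim, 2))
```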

    What will you use at NCBI for pairwise sequence alignment?

    LALIGN

    http://www.ch.embnet.org/software/LALIGN_form.html

    MULTIPLE SEQUENCE ALIGNMENT

    A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.

    In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a lineage and are descended from a common ancestor.

    From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.


    Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

    http://pbil.univ-lyon1.fr/alignment.html

    http://au.expasy.org/tools/#align


    Multiple alignments of protein sequences are important tools in studying sequences.

    The basic information they provide is identification of conserved sequence regions.

    This is very useful in designing experiments to test and modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying new members of protein families.


    ClustalW is a fully automatic program for global multiple alignment of DNA and protein sequences.

    The alignment is progressive and considers the sequence redundancy.

    Trees can also be calculated from multiple alignments.

    The program has some adjustable parameters with reasonable defaults.

    http://www.ebi.ac.uk/Tools/clustalw2/index.html


    Show Colors

    A button labeled 'Show Colors' will be displayed in the Alignment section of the results page. If you press this button, the alignment will be shown in color according to the table below.

    NOTE: This option only works when you have chosen ALN or GCG as the output format.

    AVFPMILW - RED - Small (small + hydrophobic (incl. aromatic - Y))

    DE - BLUE - Acidic

    RK - MAGENTA - Basic

    STYHCNGQ - GREEN - Hydroxyl + Amine + Basic - Q

    Others - GRAY


    CONSENSUS SYMBOLS:

    An alignment will display by default the following symbols denoting the degree of conservation observed in each column:

    "*" means that the residues or nucleotides in that column are identical in all sequences in the alignment.

    ":" means that conserved substitutions have been observed, according to the COLOUR table above.

    "." means that semi-conserved substitutions are observed.
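
    The first two symbols can be reproduced with a short sketch. It is a simplified illustration, not ClustalW's own procedure: the four residue groups are taken from the colour table given for "Show Colors" (AVFPMILW, DE, RK, STYHCNGQ), the function names are assumptions, and the "." symbol is left out because the groups that define semi-conserved substitutions are not listed here.

```python
# Residue groups as listed in the colour table (red, blue, magenta, green).
COLOUR_GROUPS = ["AVFPMILW", "DE", "RK", "STYHCNGQ"]


def group_of(residue):
    """Return the index of the colour group of a residue, or None."""
    for index, members in enumerate(COLOUR_GROUPS):
        if residue in members:
            return index
    return None


def consensus_line(aligned):
    """Build a consensus line with '*' and ':' for a list of aligned sequences."""
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned):
        if "-" in column:
            marks.append(" ")  # columns containing gaps get no symbol
        elif len(set(column)) == 1:
            marks.append("*")  # identical in all sequences
        else:
            groups = {group_of(residue) for residue in column}
            conserved = len(groups) == 1 and None not in groups
            marks.append(":" if conserved else " ")
    return "".join(marks)


if __name__ == "__main__":
    rows = ["MKDA", "MRET", "MKDS"]
    for row in rows:
        print(row)
    print(consensus_line(rows))
```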


    PHYLOGENETIC TREE

    A Phylogram is a branching diagram (tree) assumed to be an estimate of a phylogeny, in which branch lengths are proportional to the amount of inferred evolutionary change.

    A Cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny where the branches are of equal length; thus cladograms show common ancestry, but do not indicate the amount of evolutionary "time" separating taxa.

    Tree distances can be shown: just click on the diagram to get a menu of options. The ".dnd" file is a file that describes the phylogenetic tree.

    These options are now controlled with new buttons in the output, as well as a pop-up menu that is available by right-clicking on the applet.

    The buttons on the page include "Show as Phylogram Tree", "Show as Cladogram Tree" and "Show Distances".
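
    The ".dnd" file stores the tree as text in Newick format, where each name can be followed by ":" and a branch length. The sketch below is illustrative: the tree string and function names are invented for the example. It reads the leaf branch lengths (the information a phylogram draws to scale) and then strips all lengths, leaving only the branching order that a cladogram shows.

```python
import re

# Hypothetical Newick tree with three sequences and made-up branch lengths.
EXAMPLE_TREE = "((seqA:0.10,seqB:0.12):0.05,seqC:0.30);"


def leaf_branch_lengths(newick):
    """Return {leaf name: branch length} for the leaves of a Newick string."""
    pairs = re.findall(r"(\w+):(-?\d+(?:\.\d+)?)", newick)
    return {name: float(length) for name, length in pairs}


def strip_branch_lengths(newick):
    """Remove every ':length' so that only the tree topology remains."""
    return re.sub(r":-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?", "", newick)


if __name__ == "__main__":
    print(leaf_branch_lengths(EXAMPLE_TREE))
    print(strip_branch_lengths(EXAMPLE_TREE))
```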

    MULTIPLE SEQUENCE ALIGNMENT

    KALIGN

    T-COFFEE

    COBALT

    DIALIGN

    KALIGN

    Kalign is a fast alignment method for protein and nucleotide sequences.

    It uses a fast approximate string matching algorithm to estimate sequence distances quickly and accurately.

    As a result Kalign is very fast compared to other programs and can align 1500 sequences in under 10 seconds.

    T-COFFEE

    T-Coffee is a multiple sequence alignment program.

    Multiple sequence alignment programs are meant to align a set of sequences previously gathered using other programs such as BLAST.

    The main characteristic of T-Coffee is that it allows you to combine results obtained with several alignment methods.

    For instance, if you have an alignment coming from ClustalW2, another alignment coming from Dialign, and a structural alignment of some of your sequences, T-Coffee will combine all that information and produce a new multiple sequence alignment having the best agreement with all these methods.

    By default, T-Coffee will compare all your sequences two by two, producing a global alignment and a series of local alignments (using lalign). The program will then combine all these alignments into a multiple alignment.
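
    Comparing all sequences "two by two" means that the number of pairwise comparisons grows quadratically with the number of sequences: n sequences give n(n-1)/2 pairs. The short sketch below only enumerates the pairs; it does not perform any alignment, and the sequence names are placeholders.

```python
from itertools import combinations


def all_pairs(names):
    """List every unordered pair of sequence names."""
    return list(combinations(names, 2))


if __name__ == "__main__":
    names = ["seq1", "seq2", "seq3", "seq4"]
    pairs = all_pairs(names)
    print(len(pairs))  # 4 sequences give 4 * 3 / 2 pairs
    for first, second in pairs:
        print(first, "vs", second)
```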

    COBALT

    COBALT (Constraint-based Multiple Alignment Tool) is a multiple sequence alignment tool that finds a collection of pairwise constraints derived from the conserved domain database, the protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.

    Pairwise constraints are then incorporated into a progressive multiple alignment.

    DIALIGN

    DIALIGN is a software program for multiple sequence alignment developed by Burkhard Morgenstern et al.

    While standard alignment methods rely on comparing single residues and imposing gap penalties, DIALIGN constructs pairwise and multiple alignments by comparing entire segments of the sequences.

    No gap penalty is used.

    This approach can be used for both global and local alignment, but it is particularly successful in situations where sequences share only local homologies.

    http://bibiserv.techfak.uni-bielefeld.de/dialign/submission.html


    Names of the aligned sequences are shown on the left-hand side of the alignment.

    Numbers on the left-hand side of the alignment denote the position of the first residue in a line within the respective sequence.

    Capital letters denote aligned residues.

    Lower-case letters denote residues not considered to be aligned by DIALIGN. Thus, if a lower-case letter is standing in the same column with other letters, this is pure chance; these residues are not considered to be homologous.


    The number of '*' characters below the alignment reflects the degree of local similarity among sequences.

    The number of '*' characters is normalized such that regions of maximum similarity have N '*' characters per column. N can be specified by the user. By default, N = 5. Note that the number of '*' characters depicts the relative degree of similarity within an alignment, since in every alignment, the region of maximum similarity gets