nomenclature - i-med.ac.at · 2018-06-02 · using r packagesdeseq, deseq2, edger htseq can be used...


Post on 08-Aug-2020



Symbol  Meaning           Description
R       A or G            puRine
Y       C or T            pYrimidine
W       A or T            Weak hydrogen bonds
S       G or C            Strong hydrogen bonds
M       A or C            aMino groups
K       G or T            Keto groups
H       A, C, or T (U)    not G (H follows G)
B       G, C, or T (U)    not A (B follows A)
V       G, A, or C        not T (U) (V follows U)
D       G, A, or T (U)    not C (D follows C)
N       G, A, C or T (U)  aNy nucleotide

Nomenclature of nucleic acids

Base      Symbol  Occurrence
Adenine   A       DNA, RNA
Guanine   G       DNA, RNA
Cytosine  C       DNA, RNA
Thymine   T       DNA
Uracil    U       RNA

+ strand 5´-ACGGTCGCTGTCGGTAGC-3´
- strand 3´-TGCCAGCGACAGCCATCG-5´

e.g. in FASTA format:
>gene sequence|gi12345|chr17|-
GCTACCGACAGCGACCGT

DNA sequences are always written from 5′ to 3′
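The relation between the two strands can be checked with a few lines of Python. This is a minimal sketch that assumes upper-case sequences without ambiguity codes; the function name `reverse_complement` is illustrative. It shows that the - strand of the example, read 5′ to 3′, is the sequence given in the FASTA record.

```python
# Complement table for the four DNA bases (assumption: upper-case, no IUPAC ambiguity codes).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


plus_strand = "ACGGTCGCTGTCGGTAGC"
minus_strand = reverse_complement(plus_strand)
print(minus_strand)
```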

Positions in the genome (genome assembly) are given per chromosome

e.g. human GRCh37/hg19

chr11:1‐100      chr11:49,686,777‐49,689,777

Positions on the chromosome are counted from position 1 for both (!) strands, so both strands share the same coordinates

+ strand 5´-ACGGTCGCTG…………TCGGTAGC-3´
- strand 3´-TGCCAGCGAC…………AGCCATCG-5´

chr11:1 … 2523 … 2529 (same coordinates on the + and the - strand)

Nomenclature


A  Ala  alanine
B  Asx  aspartic acid or asparagine
C  Cys  cysteine
D  Asp  aspartic acid
E  Glu  glutamic acid
F  Phe  phenylalanine
G  Gly  glycine
H  His  histidine
I  Ile  isoleucine
K  Lys  lysine
L  Leu  leucine
M  Met  methionine
N  Asn  asparagine
P  Pro  proline
Q  Gln  glutamine
R  Arg  arginine
S  Ser  serine
T  Thr  threonine
V  Val  valine
W  Trp  tryptophan
X  Xaa  unknown or 'other' amino acid
Y  Tyr  tyrosine
Z  Glx  glutamic acid or glutamine

Amino Acids

Translation, genetic code and reading frames


Peptide chain, amino acid sequence, proteins

Protein sequences are always written from the N‐terminal end to the C‐terminal end

backbone

sidechains

E.g. SCD sequence in FASTA format

Regulation of eukaryotic transcription


Different levels of regulation

Transcriptional regulation has largest effect on phenotype!

Chromatin states

Ernst et al. Nature 2011. 


DNA methylation

Cytosine 5-Methylcytosine

microRNA and siRNA


Organization of the human genome

E.g. > 1 million copies of Alu-repeats

Sequence alignment

Exact string matching

Naive
Z‐box algorithm (Boyer‐Moore, Knuth‐Morris‐Pratt, …)
Suffix arrays, suffix tries
Burrows‐Wheeler transform (BWT), FM‐index
Hash tables (spaced seeds)

Aligning 2 sequences

Dot matrix
Gaps (gap open and extension, linear and convex penalty function)
Distances between sequences (Hamming distance, Levenshtein distance, edit operations)
Substitution matrices
  Odds ratio (random model [independent = q(x_i)·q(y_i)] vs match model [joint = p(x_i, y_i)])
  Log odds ratio (scores are additive)
  PAM, BLOSUM
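The two distances named above can be sketched as follows. This is a minimal illustration, not an optimized implementation; the function names are illustrative. Hamming distance counts mismatching positions of equal-length sequences, Levenshtein distance counts the minimum number of edit operations (insertion, deletion, substitution).

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # match or substitution
        prev = curr
    return prev[-1]
```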

Dynamic programming
  Idea: new best alignment = previous best alignment + local best alignment
Global alignment (Needleman‐Wunsch)
  Construct matrix F, F(0,0)=0
  Traceback from bottom right (= best score) to top left (= start of sequences)
Local alignment (Smith‐Waterman)
  Construct matrix F, F(0,j) = F(i,0) = 0, include 0 as option
  Traceback from max score to 0
  Used for read mapping in NGS applications
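The difference between the two recursions can be seen in a short sketch. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap penalty -1), which is an illustrative choice and not taken from the slides, and it returns only the optimal score; the traceback is omitted for brevity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score; the result is read from the bottom-right cell."""
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        F[i][0] = i * gap                      # leading gaps are penalized
    for j in range(1, len(b) + 1):
        F[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[-1][-1]


def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score; first row/column are 0 and 0 is always an option."""
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            best = max(best, F[i][j])          # the traceback would start here
    return best
```

With these parameters, two sequences that share only a short common substring get a low or negative global score but a positive local score.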


Sequence alignment

Search similar sequences in database (db)

W‐mer indexing (hash tables)
FASTA (evaluate position differences of small words (2‐6 characters) in query and db; hash tables)
BLAST (Basic Local Alignment Search Tool)
  Query words (W=3), find neighbourhood words with score > threshold T (e.g. T=13) = seeds
  Extend seeds until the score drops off by more than X = high‐scoring segment pairs (HSP)
  Evaluate significance ($E(S) = K m n e^{-\lambda S}$)
  BLAST variants (blastn, blastp, blastx, tblastx, tblastn, MegaBlast)
BLAT (BLAST‐like alignment tool; mostly for highly similar DNA sequences)

Multiple sequence alignment

Dynamic programming in n dimensions (computationally very intensive)
  Compute pairwise alignments to find an upper bound
  Create a heuristic multiple alignment to find a lower bound
  Search in the n‐dimensional scoring matrix
Progressive tree alignment (ClustalW)
  Perform hierarchical clustering using distances between sequences (e.g. edit distance): merge sequences to find ancestor sequences by finding sequences with minimum edit distance to the two child sequences
  Assign weights to each branch of the tree, based on the distance between sequences
  Align sequences (starting from the closest, using a version of dynamic programming) using the weights in the score function

Profile Hidden Markov Model

ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

Emission probabilities (M1-M6 = match states for alignment columns 1-3 and 7-9, I = insertion state):

State  P(A)  P(C)  P(G)  P(T)
M1     0.8   0.0   0.0   0.2
M2     0.0   0.8   0.2   0.0
M3     0.8   0.2   0.0   0.0
I      0.2   0.4   0.2   0.2
M4     1.0   0.0   0.0   0.0
M5     0.0   0.0   0.2   0.8
M6     0.0   0.8   0.2   0.0

Transition probabilities: M1→M2 1.0, M2→M3 1.0, M3→I 0.6, M3→M4 0.4, I→I 0.4, I→M4 0.6, M4→M5 1.0, M5→M6 1.0

[AT][CG][AC][ACGT]*A[TG][GC]

Regular Expressions

p(ACACATC) = 0.8*1*0.8*1*0.8*0.6*0.4*0.6*1*1*0.8*1*0.8 = 0.047
log‐odds = log(p(S)/0.25^L) = log(0.047/0.25^7)

insertion state

- For multiple alignments (e.g. DNA sequences)
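The probability and log-odds of ACACATC given above can be reproduced numerically. The sketch below takes the emission and transition factors directly from the product written out above; using the natural logarithm for the log-odds is an assumption, since the base of the logarithm is not stated.

```python
import math

# Factors along the path of ACACATC, in the order they appear in the product above:
# three match states, one insertion, three match states.
emissions = [0.8, 0.8, 0.8, 0.4, 1.0, 0.8, 0.8]
transitions = [1.0, 1.0, 0.6, 0.6, 1.0, 1.0]

p = math.prod(emissions) * math.prod(transitions)

# Null model: every base has probability 0.25, sequence length L = 7.
# Assumption: natural logarithm.
log_odds = math.log(p / 0.25 ** len(emissions))

print(round(p, 3), round(log_odds, 2))
```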


Protein Sequence Analysis

Sequence alignment
  BLAST
  FASTA
Uses collective characteristics of a family of proteins
  Position specific score matrix (PSSM)
  Profile HMM
  ProfileScan, Pfam, CDD, Prosite, BLOCKS
  PSI‐BLAST
Amino acid composition
  Hydrophobicity
  Charge
  Theoretical pI, molecular weight
Secondary structure (alpha helix, beta strand, beta sheet)
Specialized structures
Tertiary structure

Neural network for secondary structure prediction


PredictProtein

•   Multi‐step predictive algorithm (Rost et al., 1994)

– Protein sequence queried against SWISS‐PROT
– MaxHom used to generate iterative, profile‐based multiple sequence alignment (Sander and Schneider, 1991)
– Multiple alignment fed into neural network (PHDsec)

• Accuracy: Average > 70%, Best‐case > 90%

• http://www.predictprotein.org/

SignalP

• Neural network trained based on phylogeny
– Gram‐negative prokaryotic
– Gram‐positive prokaryotic
– Eukaryotic

• Predicts secretory signal peptides
• http://www.cbs.dtu.dk/services/SignalP/

Signal peptide score (S)

Cleavage site score (C)

Combined Score (Y)


Two‐color microarrays

– Oligonucleotides, 60‐80 mers in length
– cDNA fragments from a library (varying lengths)

Two color microarray analysis

Experimental design (Biological replicates, Technical replicates, Dye swap, Reference design)

Image analysis (align grid, identify spots and background)

Preprocessing (subtract background, filter saturated or bad spots)

Normalization (idea: expression of majority of genes is not changing across conditions)

Normalization factor N=sum[Ri]/sum[Gi] => Gi′=N*Gi, Ri′=Ri

MA‐plot [M=log2(R/G); A=log2(R*G)/2]

=> M (=log ratios) are dependent on A (average intensities)

=> LOWESS normalization
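The global normalization factor and the MA transformation can be written out as a short sketch. The intensity values are hypothetical and the function names are illustrative; the intensity-dependent LOWESS fit itself is not implemented here.

```python
import math


def normalization_factor(red, green):
    """N = sum[Ri] / sum[Gi]."""
    return sum(red) / sum(green)


def ma_values(red, green):
    """M = log2(R/G) (log ratio), A = log2(R*G)/2 (average log intensity)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(r * g) / 2 for r, g in zip(red, green)]
    return m, a


# Hypothetical background-corrected spot intensities for three genes.
red = [400.0, 1600.0, 100.0]
green = [100.0, 1600.0, 400.0]

n = normalization_factor(red, green)
green_scaled = [n * g for g in green]      # Gi' = N * Gi, Ri' = Ri
m, a = ma_values(red, green_scaled)
```

A LOWESS normalization would then fit a smooth curve of M against A and subtract the fitted value from each M.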

Identification of differentially expressed genes

Moderated t‐test (t=mean(M)/[(a+s)/sqrt(n)]; a is estimated from all genes)

R‐package limma

As a result, for each gene the log2 fold change (=M), the p‐value, and the adjusted p‐value (Benjamini‐Hochberg corrected p‐value based on the false discovery rate, FDR) are calculated

All genes with an adjusted p‐value < 0.1 are considered statistically significantly differentially expressed


Affymetrix microarrays


Affymetrix chips

Analysis of one‐color arrays (Affymetrix)

Preprocessing (apply a model of perfect match (PM) and mismatch (MM) = background; simply taking PM‐MM is not correct)

Normalization has to be done between arrays (not within an array)

Quantile normalization

Variance stabilizing normalization (VSN)

Probe summarization

Median Polish summarization

RMA (Robust Multiarray Average), available in R/Bioconductor, is the method of choice; it includes all 3 steps, and the resulting intensity values are log2‐transformed

Identification of differentially expressed genes (as for two color arrays) using the R/Bioconductor package limma


Methods to correct p‐values for multiple testing

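With 1000 tests at a significance level of 0.05, about 50 false positives are expected to be declared significant (if all null hypotheses are true).

To account for multiple testing, the following error rates are used (V = number of false positives, R = number of tests declared significant):

Family wise error rate (FWER): P(V>0)

False discovery rate (FDR): E(V/R)

Benjamini‐Hochberg adjustment for the sorted p‐values p(1) ≤ … ≤ p(n): if p(i)*n/i > p(i+1)*n/(i+1), then set p(i)*n/i = p(i+1)*n/(i+1)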
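The adjustment rule can be sketched in Python as follows. This is a minimal version of the Benjamini-Hochberg procedure written for illustration (the function name is illustrative); it returns adjusted p-values in the original order and caps them at 1.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda k: pvalues[k])   # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that an adjusted value
    # never exceeds the adjusted value of the next larger p-value.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

For the p-values 0.04, 0.041 and 0.2, the raw value 0.04*3/1 = 0.12 is larger than 0.041*3/2 = 0.0615, so both get the adjusted value 0.0615.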

• Potential for surveying the entire transcriptome, including novel, un‐annotated regions.

• Helps to identify expression and function of regulatory non‐coding RNAs (e.g. lincRNA)

• Potential for determining gene structure and isoform level expression using reads mapping to splice junctions.

• Potential for making better presence/absence calls on regions.

• More expensive than microarrays

• No need to design probes

Transcriptome sequencing (RNAseq)


Wang et al., Nature Rev Gen, 2009

Base calling (Phred score) 

Phred quality score Q and base‐calling error probabilities P

$Q_{\mathrm{Phred}} = -10 \log_{10} P$    $Q_{\mathrm{Solexa}} = -10 \log_{10} \frac{P}{1-P}$

For P=0.05 the quality score Q=13 
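The two quality definitions translate directly into code. A minimal sketch with illustrative function names:

```python
import math


def phred_quality(p: float) -> float:
    """Q_Phred = -10 * log10(P)."""
    return -10 * math.log10(p)


def solexa_quality(p: float) -> float:
    """Q_Solexa = -10 * log10(P / (1 - P))."""
    return -10 * math.log10(p / (1 - p))


def error_probability(q: float) -> float:
    """Inverse of the Phred definition: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


print(round(phred_quality(0.05)))
```

For small error probabilities the two scales are nearly identical; they differ noticeably only for low-quality bases.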


Base calling (FastQ format) 

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88

Quality scores are encoded in ASCII
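Decoding the quality line is a matter of subtracting an offset from each character code. The sketch below assumes an offset of 33 (Sanger/Phred+33) as the default; which offset applies to the record shown here is not stated, and older Solexa/Illumina files use an offset of 64, so the offset is left as a parameter.

```python
def decode_qualities(quality_line: str, offset: int = 33) -> list:
    """Convert ASCII-encoded quality characters to integer scores.

    Assumption: offset 33 unless told otherwise; pass offset=64 for
    older Solexa/Illumina encodings.
    """
    return [ord(ch) - offset for ch in quality_line]


record = """@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88"""

# A FASTQ record has four lines: name, sequence, separator, qualities.
name, sequence, _, quality = record.splitlines()
scores = decode_qualities(quality)
```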

1. Read mapping

2. Transcriptome reconstruction

3. Expression quantification

4. Differential expression analysis

Analysis steps


Read mapping

Unspliced aligners: Bowtie
Spliced aligners (exon first vs seed extend): Tophat (uses Bowtie, which maps exons first), SpliceMap

Tools: Bowtie, BWA, Eland, MAQ, SOAP2, GSNAP, STAR

Result is a Sequence Alignment/Map (SAM/BAM) format file

GFF/GTF files (General Feature Format, Gene Transfer Format) keep information about exon, gene and transcript positions in the genome assembly

FastQ files => Trim adaptors and bad quality reads (FASTQC)

[Workflow figure: FastQ files, reference genome, SAM/BAM file, GTF files; FPKM normalization and estimation of the uncertainty of assigning mapped reads to isoforms; advanced transformation and t-test; differentially expressed genes and isoforms between conditions]


Expression quantification

Garber et al., Nature Methods,  2011

RNAseq normalization

Reads per kilobase per million (RPKM) (divide by library size and transcript length)

Quantile normalization

TMM (trimmed mean of M values).

Fragments per kilobase per million reads (FPKM) (for paired‐end sequencing)
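RPKM can be computed directly from its definition. A minimal sketch with an illustrative function name and hypothetical numbers; FPKM uses the same formula with fragments (read pairs) counted instead of single reads.

```python
def rpkm(read_count: float, transcript_length_bp: float, total_mapped_reads: float) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    per_kilobase = read_count / (transcript_length_bp / 1e3)
    return per_kilobase / (total_mapped_reads / 1e6)


# Hypothetical example: 500 reads on a 2 kb transcript in a library of 20 million mapped reads.
print(rpkm(500, 2000, 20_000_000))
```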


Differential expression analysis for sequencing count data

Expect Poisson distribution (Mean=Variance) as it is typical for count data

        A   B   C   D
Gene1   1  23   2   6
Gene2   0  74   8   7
Gene3  33   4  14   8

But counts for the same gene from different biological replicates have a variance exceeding the mean (overdispersion), which can be modeled by a negative binomial model.

The dispersion is estimated from the raw count distribution of all genes

Differentially expressed genes are tested by a negative binomial test, Wald test, or likelihood ratio test

Using R packages DESeq, DESeq2, edgeR

HTSeq can be used to generate the gene matrix of raw count data

How many reads are needed (depth)?

two mouse libraries (ES,EB) yeast

E.g. 20-40 million reads should be sufficient for human


Clustering

• Unsupervised or supervised (classification)

• Agglomerative: bottom up approach, whereby single expression profiles are successively joined to form nodes.

• Divisive: top down approach, each cluster is successively split in the same fashion, until each cluster consists of one single profile.

• Pearson correlation

• Euclidean distance

• Manhattan distance

Similarity distance measures

Manhattan distance: $d_M = \sum_{i=1}^{n} |x_i - y_i|$

Pearson correlation: $-1 \le r \le 1$
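The three measures listed above can be compared on small expression profiles. A minimal sketch in plain Python with illustrative function names; the profiles in the usage note are hypothetical.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient r, with -1 <= r <= 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)
```

Two profiles such as [1, 2, 3] and [2, 4, 6] have r = 1 although their Euclidean distance is not zero, which is why correlation is often preferred when the shape of a profile matters more than its magnitude.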


Hierarchical clustering

• Agglomerative (bottom up), unsupervised
• Cluster genes or samples (or both = biclustering)
• Distances are encoded in a dendrogram (tree)
• Cut the tree to get clusters
• Pearson correlation (usually used)
• Computationally intensive (correlation matrix)

1. Identify clusters (items) with closest distance
2. Join to new clusters
3. Compute distance between clusters (items) (see linkage)
4. Return to step 1

6 clusters

15 clusters

Linkage

Single‐linkage clustering: minimal distance

Complete‐linkage clustering: maximal distance

Average‐linkage clustering: calculated using average distance (UPGMA); average of distances, not (!) of expression values

Weighted pair‐group average: like UPGMA but weighted according to cluster size

Within‐groups clustering: average of merged cluster is used instead of cluster elements

Ward’s method: smallest possible increase in the sum of squared errors


• partition n genes into k  clusters, where k has to be  predetermined

• k‐means clustering minimizes the variability within clusters and maximizes it between clusters

• Moderate memory and time consumption

K‐means

1. Generate random points (“cluster centers”) in n dimensions (results depend on these seeds).

2. Compute distance of each data point to each of the cluster centers.

3. Assign each data point to the closest cluster center.

4. Compute new cluster center position as average of points assigned.

5. Loop to (2), stop when cluster centers do not move very much.
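The five steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: drawing the initial centers from the data points, the parameter names, and the tolerance are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    """Return (centers, clusters) for a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: random seeds (assumption: drawn from the data points themselves).
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Steps 2 and 3: assign every point to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Step 4: new center = coordinate-wise average of the assigned points
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(v) / len(members) for v in zip(*members)) if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        # Step 5: stop when the centers no longer move.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters
```

Because the result depends on the seeds, it is common to run the procedure several times with different seeds and keep the solution with the smallest within-cluster variability.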

Principal component analysis (PCA)

PCA is a data reduction technique that simplifies multidimensional data sets into a smaller number of dimensions (r<n).

Variables are summarized by linear combinations, the principal components. The origin of the coordinate system is moved to the center of the data (mean centering). The coordinate system is then rotated so that the first axis captures the maximum variance.

Each subsequent principal component is orthogonal to the previous ones. The first 2 PCs often already explain 80‐90% of the variance. This analysis can be done by a special matrix decomposition (singular value decomposition, SVD).


Classification (e.g. support vector machines)

Cross validation

K‐fold cross‐validation, leave‐k‐out cross‐validation (LKOCV)

For leave‐k‐out with k=1 it is called leave‐one‐out cross‐validation (LOOCV)

Variance‐bias trade‐off

Receiver operating characteristics

Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)

Area under curve (AUC): measure of classifier performance
A: ideal classifier, AUC=1
B: good classifier, AUC~0.8
C: random, AUC=0.5

ROC curve axes: sensitivity (y) against 1‐specificity (x)
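Sensitivity, specificity and the AUC can be computed without plotting the curve. The sketch below uses the rank interpretation of the AUC (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, ties counting one half), which equals the area under the ROC curve; function names and example scores are illustrative.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


def roc_auc(scores, labels) -> float:
    """AUC from classifier scores and 0/1 labels via pairwise comparison."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```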


Biological meaning of the gene sets?

• Gene ontology terms

• Pathway mapping

• Linking to Pubmed abstracts or associated MESH terms

• Regulation by the same transcription factor (module)

• Protein families and domains

• Gene set enrichment analysis

• Over representation analysis

Gene Ontology

The three organizing principles of GO are 

• cellular component (e.g. mitochondrion)
• biological process (e.g. lipid metabolism)
• molecular function (e.g. hydrolase activity)

Each entry in GO has a unique numerical identifier of the form GO:nnnnnnn, and a term name (e.g. fibroblast growth factor receptor binding).

URL: http://www.geneontology.org/

Different evidence codes (e.g. IDA, inferred from direct assay)
Directed acyclic graph (2 relations: part of, is a)
Different levels (specific terms, e.g. sphingolipid metabolism, vs general terms, e.g. metabolism)
GO terms can occur …

The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.


Overrepresentation analysis

m: gene universe (whole microarray)
g: all genes with GO term
c: genes in cluster (gene list)
i: genes in cluster with GO term

Fisher exact test for contingency table

m-g   c-i
g     i
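The one-sided Fisher exact test for overrepresentation is equivalent to the upper tail of a hypergeometric distribution. The sketch below uses the symbols m, g, c and i as defined above; the function name and the example numbers are illustrative.

```python
from math import comb


def enrichment_p_value(m: int, g: int, c: int, i: int) -> float:
    """P(X >= i) when c genes are drawn from a universe of m genes,
    g of which are annotated with the GO term (hypergeometric tail)."""
    upper = min(c, g)
    total = comb(m, c)
    favourable = sum(comb(g, x) * comb(m - g, c - x) for x in range(i, upper + 1))
    return favourable / total


# Hypothetical example: universe of 10 genes, 5 with the term,
# cluster of 5 genes, all 5 carrying the term.
print(enrichment_p_value(10, 5, 5, 5))
```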

Regulatory sequences

Experimental methods

Electrophoretic mobility shift assays (EMSA)
DNase I and exonuclease footprinting
Chromatin immunoprecipitation (ChIP)
  ‐ ChIP‐chip
  ‐ ChIP‐seq
Systematic evolution of ligands by exponential enrichment (SELEX)
Reporter gene assays (luciferase)

Computational methods

Matrix based (requires knowing the transcription factor in advance)
  Alignment of experimentally verified transcription factor binding sites (TFBS)
  Position frequency matrix (PFM)
  Position weight matrix (PWM), position specific scoring matrix (PSSM)
    W(b,i) = log2(p(b,i)/p(b)); p(b,i) = f(b,i)/N; b..base, i..position, f..frequency
  Evaluation of a sequence
    S = ∑ W(n_i, i); n_i..nucleotide of the sequence at position i
  Information content D_i = 2 + ∑ p(b,i) log2 p(b,i)
  Sequence logo
  TF/PWM databases: Transfac, JASPAR, Genomatix
MatInspector (based on information content)
  SIM = ∑ Ci(j)*score(b,j) / ∑ Ci(j)*max_score(j); Ci = K*(∑ p(b,i)*ln p(b,i) + ln 4)
Threshold for similarity (e.g. allow max 1 match in 10000 bp of coding sequences)
Background sequences (Markov chains)
Phylogenetic footprinting (predicted binding site is conserved, helps to reduce false positives)
Profile Hidden Markov Model (HMM)
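The PWM formulas above can be illustrated with a small sketch. It assumes a uniform background p(b) = 0.25 by default and adds a pseudocount so that bases never observed at a position do not produce log2(0); the pseudocount, the function names and the example binding sites are assumptions for the illustration.

```python
import math

BASES = "ACGT"


def position_weight_matrix(sites, background=None, pseudocount=1.0):
    """W(b,i) = log2(p(b,i) / p(b)) from aligned binding sites of equal length."""
    background = background or {b: 0.25 for b in BASES}
    n = len(sites)
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        weights = {}
        for b in BASES:
            # p(b,i) = f(b,i)/N, smoothed with a pseudocount (assumption).
            p = (column.count(b) + pseudocount * background[b]) / (n + pseudocount)
            weights[b] = math.log2(p / background[b])
        pwm.append(weights)
    return pwm


def score(pwm, seq):
    """S = sum over positions of W(n_i, i)."""
    return sum(column[base] for column, base in zip(pwm, seq))


def information_content(column_probs):
    """D_i = 2 + sum_b p(b,i) * log2 p(b,i), in bits (0 = uniform, 2 = fully conserved)."""
    return 2 + sum(p * math.log2(p) for p in column_probs.values() if p > 0)


# Hypothetical aligned binding sites.
pwm = position_weight_matrix(["ACG", "ACG", "ACT", "ACG"])
```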


Motif discovery

Word based counting
Expectation maximization (MEME, ChIP‐MEME)
Gibbs sampling

Associate regulatory sequences with expression
  Linear regression

MicroRNA target prediction
  Sequence complementarity (seed matches)
  Conservation
  Thermodynamics
  Site accessibility
  UTR context
  Correlation of expression profiles (GenMiR++)

Databases at the NCBI

• Pubmed
• Protein
• Nucleotides
• Structure
• Genome
• Books
• Cancer Chromosomes
• Conserved Domains
• 3D Domains
• Gene
• Genome Project
• dbGaP
• GEO Profiles
• GEO Datasets
• GENSAT
• HomoloGene
• Journals
• MeSH
• NLM Catalog
• OMIA
• OMIM
• PMC
• PopSet
• Probe
• Protein Cluster
• SNP
• Taxonomy
• UniGene
• UniSTS


GenBank (see also NCBI Nucleotide, Protein)

GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain (ASN.1 format, GenBank flat file)

Other databases

Gene (One record represents a single gene from an organism)

Gene ID             5091
Official Symbol     PC
Official Full Name  pyruvate carboxylase

For human, provided by the HUGO Gene Nomenclature Committee (HGNC)

RefSeq (curated database, one record per transcript per organism)

NT_  genomic contig
NM_  mRNA
NP_  protein
NR_  non‐coding RNA
XM_  mRNA (automatic annotation)
XP_  protein (automatic annotation)

SwissProt/UniProt (protein sequences)
PDB (protein structures)
HPRD (protein‐protein interaction, ppi)
ENSEMBL/Biomart
UCSC genome browser/table browser


Orthologs

Homologs: A – B – C
Orthologs: B1 – C1
Paralogs: C1 – C2 – C3
Inparalogs: C2 – C3
Outparalogs: B2 – C1
Xenologs: A1 – AB1

Ortholog prediction
  Best reciprocal hits (blastp)

Databases:

HomoloGene (NCBI)
InParanoid (Stockholm)
YOGY (eukarYotic OrtholoGY) (Sanger)

Gene set enrichment analysis

1. Given an a priori defined set of genes S.

2. Rank genes (e.g. by t‐value between 2 groups of microarray samples) → ranked gene list L.

3. Calculation of an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.

4. Estimation of the statistical significance (nominal P value) of the ES by using an empirical phenotype‐based permutation test procedure.

5.  Adjustment for multiple hypothesis testing by controlling the false discovery rate (FDR).
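Step 3 can be illustrated with a running-sum statistic. The sketch below is a simplified, unweighted (Kolmogorov-Smirnov-like) version: the running sum goes up for every gene in S and down for every gene not in S, and the ES is the maximum deviation from zero. The published GSEA statistic additionally weights the increments by the ranking metric, which is left out here; function and variable names are illustrative.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for a ranked gene list.

    Assumption: all hits count equally (no weighting by the ranking metric).
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1 / n_hit if is_hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running               # keep the largest deviation from zero
    return best


# Hypothetical ranked list, most up-regulated gene first.
ranked = list("ABCDEFGH")
print(enrichment_score(ranked, {"A", "B"}))
```

A set concentrated at the top of the list gives a positive ES, a set concentrated at the bottom a negative ES; significance would then be assessed by recomputing the ES after permuting the phenotype labels (step 4).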


Subramanian A et al. Proc Natl Acad Sci (2005)

Biochemical,  Metabolic, Signaling Pathways

Boehringer Mannheim map
NCI curated pathway maps (http://pid.nci.nih.gov)
Signal Transduction Knowledge Environment (http://stke.sciencemag.org/cm)
Kyoto Encyclopedia of Genes and Genomes, KEGG (http://www.genome.jp/kegg/)
BioCyc (EcoCyc)
Biocarta (http://www.biocarta.com/genes/index.asp)
Transpath
Reactome (http://www.reactome.org/)
PathwayCommons (http://www.pathwaycommons.org/)


Pathway Commons

• Aim: convenient access to pathway information
• Facilitate creation and communication of pathway data
• Aggregate pathway data in the public domain
• Provide easy access for pathway analysis

Cytoscape

• Access pathway commons from cytoscape

• http://www.cytoscape.org
• Open source software for network visualization
• Active community
• >40 plugins extend functionality, e.g. BiNGO, ClueGO (for gene ontology)
• Easy to use and good documentation

VizMapper Various layout

Cline MS et al Nat Protoc. 2007


Map gene expression to pathways

• GenMAPP, Cytoscape, Pathway Explorer

• Pairwise similarity measures (Pearson correlation, Spearman rank correlation, partial correlation, mutual information)
• Connection strength (adjacency functions, weighted vs unweighted)
• Network modules (measures of node dissimilarity) (hierarchical clustering with 1‐TOM)
• Reverse engineering (Boolean, differential equation, Bayesian network)
• Different network representations (metabolic, transcriptional, signaling, ppi)
• Network measures (connectivity, clustering coefficients)
• Connectivity and scale‐free network topology
• Network motifs

Concepts for network analysis


Gene association network

MICO

• Discretizing expression profiles → groups of genes with identical profile

• REVEAL algorithm based on mutual information
  M(x,y) = I(x) + I(y) ‐ I(x,y)
  M(x,y) = I(x) → directed

• Correlation
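The mutual information used by REVEAL can be computed from discretized profiles. In the sketch below, `entropy` plays the role of I(x) in the formula above and the joint entropy that of I(x,y); the function names and the binary example profiles are illustrative.

```python
import math
from collections import Counter


def entropy(values) -> float:
    """Shannon entropy (in bits) of a discretized profile."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(x, y) -> float:
    """M(x,y) = I(x) + I(y) - I(x,y), with I denoting entropy."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


# Hypothetical binary (on/off) expression profiles over four conditions.
gene_x = [0, 0, 1, 1]
gene_y = [0, 0, 1, 1]     # identical to gene_x
gene_z = [0, 1, 0, 1]     # independent of gene_x

print(mutual_information(gene_x, gene_y), mutual_information(gene_x, gene_z))
```

When M(x,y) equals I(x), the state of x is fully determined by y, which is the criterion noted above for a directed relation.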

Pparg

Apmap

Bogner‐Strauss et al. Cell Mol Life Sci. 2010

Adjacency function 

weighted 

unweighted 


Weighted gene coexpression network analysis (WGCNA)

Hierarchical clustering of the topological overlap dissimilarity (1‐TOM) → modules (subnetworks)

Topological overlap measure TOM (common neighbors), defined from:
  number of common neighbors of gene i and gene j
  connectivity of gene i and connectivity of gene j
  adjacency function between gene i and gene j
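How the three quantities combine can be sketched as follows. The formula used here, TOM(i,j) = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), is the definition commonly used in WGCNA and is stated as an assumption, since the formula itself did not survive in this transcript; l_ij is the number of common neighbors, k the connectivity and a_ij the adjacency.

```python
def topological_overlap(adj, i, j):
    """TOM between nodes i and j of a symmetric adjacency matrix with zero diagonal.

    Assumption: the commonly used WGCNA definition
    (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij).
    """
    n = len(adj)
    shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
    k_i = sum(adj[i][u] for u in range(n) if u != i)
    k_j = sum(adj[j][u] for u in range(n) if u != j)
    return (shared + adj[i][j]) / (min(k_i, k_j) + 1 - adj[i][j])


# Hypothetical unweighted networks: a triangle and a path 0 - 1 - 2.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

print(topological_overlap(triangle, 0, 1), topological_overlap(path, 0, 2))
```

The dissimilarity 1 - TOM is then used as the distance for hierarchical clustering into modules.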

Reverse Engineering

[Figure: modeling vs. reverse engineering; labels: input, system, temporal series of data]

Predictive power vs. inferential power
Instantaneous model vs. synchronous model; constraint: the system has to be stable
Boolean networks (advantage: a lot of knowledge from information theory; not quantitative, but the topology is correct, which allows sensitivity/robustness analyses)
Differential equations (problem: number of samples << number of genes => underdetermined; re‐sampling, simulated annealing, genetic algorithm)
Bayesian network (acyclic graph, conditional probability, conditional independence, Bayesian score to select the best model, causal relations, can introduce some a priori knowledge)

Perturbation (add a hormone cocktail to the cell culture to start differentiation)


Different network representation

Clustering coefficient

Connectivity (degree)                        

Topological overlap (TOM)

Network measures


• Every node can be reached from every other by a small number of hops or steps

• High clustering coefficient and low mean‐shortest path length (random graphs don’t necessarily have high clustering coefficients)

• Social networks, the Internet, and biological networks all exhibit small‐world network characteristics

• Six degrees of separation (Kevin Bacon Game)

Small‐world network

Complex network models                   

Scale‐free network

Modular networks

Hierarchical networks(metabolic networks)

power law: many genes with few neighbors, few genes with many neighbors (hubs)

Clustering coefficients

Connectivity(degree)

constant


Scale‐free networks are robust

• Complex systems (cell, internet, social networks), are resilient to component failure

• Network topology plays an important role in this robustness (even if ~80% of nodes fail, the remaining ~20% still maintain network connectivity)

• Attack vulnerability if hubs are selectively targeted

• In yeast, only ~20% of proteins are lethal when deleted,  and are 5 times more likely to have degree k>15 than k<5.

• Cellular networks are disassortative: hubs tend not to interact directly with other hubs.

• Hubs tend to be “older” proteins (so far claimed for protein‐protein interaction networks only) 

• Hubs also seem to have more evolutionary pressure—their protein sequences are more conserved than average between species (shown in yeast vs. worm)

• Experimentally determined protein complexes tend to contain solely essential or non‐essential proteins—further evidence for modularity.

Network motifs

Negative and positive autoregulatory loop (NAR, PAR)

NAR speeds up the response time of gene circuits
NAR can reduce cell‐cell variation in protein levels
PAR works in the opposite way

Feedforward loop: suppresses short signals