week 1 notes/slides/information for the purdue big data ... 1 day... · big data challenges in...

55
Week 1 Notes/Slides/Information for the Purdue Big Data For Biologists Workshop 2016 Fleet 2016

Upload: others

Post on 07-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Week1Notes/Slides/InformationforthePurdueBigDataFor

BiologistsWorkshop

2016

Fleet 2016

Page 2: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Welcome!

DAY1Session1:

WELCOME

EncouragingBigDataThinkinginBiomedicalresearch

Page 3: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Introductions

Who are you? Where are you from?

What is your research interest?

Why are you interested in “big data”?

Fleet 2016

Page 4: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

InstructorsandTeachingAssistants

Main Instructors:James C. Fleet, PhD (Nutrition Science)Wanqing Liu, PhD (Medicinal Chemistry and Molecular

Pharmacology)Pete Pascuzzi, PhD (Libraries)Min Zhang, PhD (Statistics)

Teaching Assistants:Chen Chen (Statistics)Min Ren (Statistics)

Harley SchawadronFleet 2016

Page 5: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

GuestLecturers

Week 1:Doug Crabill (Purdue University)Sean Davis (National Cancer Institute)Nadia Atallah (Purdue University)Ingenuity Systems Staff

Week 2:Jennifer Neville (Purdue University)Shi Li Lin (Ohio State University)Martin Lundquist (John Hopkins University)

Fleet 2016

Page 6: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Strengths/WeaknessesOf Hypothesis-DrivenScience

A B

C D

I

E

F

G

H

K. Hokusai

Fleet 2016

Page 7: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

DisruptiveTechnologiesPropelScientificAdvancement

GenomicGEO

ENCODEGTEx

ProteomicProteomicsDB

PhenotypesJAX

TCGA

MetabolomicHMDB

Genetic: WebQTL/JAX1000 genomes

What can we do with all this information?

Fleet 2016

Sequenced genomes helped scientists see that data generation need not be a barrier to scientific progress.

2001 Breakthrough of the Year

Page 8: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Newtechnologiesbringchallengestousingand

integratingdata

12 Brain-imaging modalities

Requires Multidimensional

Thinking and Analytical Tools

Dinov (2016) J Med Stat Inform 4:3

Bra

in-i

mag

ing

mod

alit

y

Fleet 2016

Page 9: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

WhatisBiologicalBigDataScience(BBDS)?

An approach that integrates multi-dimensional, high-density, variously formatted data types to gain insight in the function or regulation of biological systems.

Data Velocity

Data Volume

DataVariety

Mb Gb Tb Pbbatch

periodic

Near Real Time

Real Time

“omic”

Image

population

physiologic

Two new “V’s”:VeracityValue

Fleet 2016

Page 10: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

BigDatainBiomedicineHasTwo“Flavors”

“Omics” Driven• Genotyping• Gene expression• NGS data

Payer-provider Driven• Electronic Medical

Records• Pharmacy information• Insurance Records

Basic Biological Understanding and Treatment

Discovery

Optimize Healthcare

Delivery and Economics

Fleet 2016

Page 11: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

BiologicalBigDataScience(BBDS)

CharacterizationAnalysis

Focus

Levels of ScaleGeneCell

OrganismCommunity

Global

Analytical Disciplines

Computational Disciplines

Life Sciences

Fleet 2016

Where do traditional biomedical researchers

fit?

Page 12: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

CentralDogmaofBiologyHasBeenReshapedbyBigData

Modified from Doerge (2002) NRG 3:43

Phenotype

Fleet 2016

1000 Genomes,

dbSNP

GTex, BioGPS

ENCODE

ProteomicsDB

Page 13: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

IsTraditionalScienceOutdated?

• Needs ‘falsifiable’ hypothesis

• Needs appropriate data• Statistical significance • Small effects in lots of data• Determines Causation

• No hypothesis needed• Exploratory?

• Full data not needed• Post-hoc explanation• Statistical

significance?• “Thin” data?

Hypothesis-driven vs Data-driven

“faced with massive data, this approach to science - hypothesize, model, test — is becoming obsolete.”

Chris Anderson, Editor, Wired Magazine, 2008

Fleet 2016

Page 14: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

TheWrongWaytoDoBioinformatics

Experimental Scientist

Well Designed Experiment 1

Results

And then a miracle

happens….

Fleet 2016

Page 15: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

TheChallenge ofBBDS

Statistics

Functional AnalysisVisualization

ComputingAlgorithms Statistics

Experimental Scientist

Well Designed Experiment 1

ResultsProcessed Data “n”

Computing

AlgorithmsStatistics

Sample Analysis

Raw Data

Processed Data 1

ComputingAlgorithmsStatistics

Core Facility

Fleet 2016

Page 16: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

BigDataChallengesinBiomedicine

1) Locating data2)Getting access to data3) Organizing, Managing, and Processing data4)Computational issues

• Infrastructure• Speed (computer and algorithms)

5) Analytical methods6) Training researchers to use BBD effectively

Fleet 2016

From the NIH “Big Data to Knowledge” (BD2K) program

Page 17: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

StagesofBigDataEducation

AWARENESS

FAMILIARITY

LOW LEVEL ABILITY

HIGH LEVEL ABILITY

NAIVE

EXPERTSkill +

Independence

Appreciate Value

Vocabulary to Talk with Experts to Define Goals

Use Pre-existing Software

Code, Develop Tools

Creative Work, Novel Approaches

Fleet 2016

Page 18: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

WhatWe’llcoverduringtheworkshop

Week 1Unit 1: Microarray

Unit 2: Next Generation Sequencing

Week 2Unit 3: Biomarker Discovery

Unit 4: Genetic Variation

Biological Goals:

• Regulation of gene expression

• Phenotype prediction

• Genotype-phenotype relationships

Fleet 2016

Page 19: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

WhatWewillcoverduringtheworkshopWeek 1Unit 1: Microarray

Unit 2: Next Generation Sequencing

Week 2Unit 3: Biomarker Discovery

Unit 4: Genetic Variation

Technical Goals:

• Analysis pipelines• Statistical issues• Visualization• Functional

annotation• Databases• Project management• Computation and

programming

Fleet 2016

Page 20: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Day1Session2:WorkingwiththePurdueComputerInfrastructure

Doug CrabillDepartment of Statistics

Purdue University

Page 21: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

SitestoUnderstandComputing

� UNIX operating system� Learn UNIX� http://www.tutorialspoint.com/unix/index.htm

� Linux operating system� http://www.tutorialspoint.com//operating_system/os_lin

ux.htm� R coding

� http://bioinformatics.knowledgeblog.org/2011/06/21/using-r-a-guide-for-complete-beginners/

� https://www.r-project.org/about.html

Fleet 2016

Page 22: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Monday

BREAK #1

Page 23: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Day 1Session 3: Data Repositories and Pre-processed Data Sites

James C. Fleet, PhDDistinguished ProfessorDepartment of Nutrition Science

Pete Pascuzzi, PhDAssistant ProfessorPurdue Libraries

Page 24: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Data Archives Web link Description

NIH Data Sharing Repositories

https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html

Trans-NIH BioMedical Informatics Coordinating Committee (BMIC) sites

Gene Expression Omnibus (GEO)

http://www.ncbi.nlm.nih.gov/geo/

NCBI; transcriptome and ChIP-seq datasets

Array Expresshttp://www.ebi.ac.uk/arrayexpress/

EMBL-EBI repository to archive functional genomics data

European Nucleotide Archive (ENA) http://www.ebi.ac.uk/enaComprehensive record of worlds nucleotide sequencing information

The Cancer Genome Atlas (TCGA) http://cancergenome.nih.gov/Multi "omic" phenotype characterization of tumors

Proteomics IDEntifications (PRIDE)http://www.ebi.ac.uk/pride/archive/ European proteomics datasets

Metabolomics Workbench

http://metabolomicsworkbench.org/standards/nominatecompounds.php metabolomic datasets

Fleet 2016

Page 25: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Fleet 2016

Page 26: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

GEO Datasets Week 1• GSE15947: Time course of 1,25(OH)2 D treated RWPE1 cells• GSE54783: The Osteoblast to Osteocyte Transition: Epigenetic

changes and response to the vitamin D3 hormone

GSE #: Accession number for a complete dataset that is submitted to GEO

GSM #: Accession number for a specific sample within a dataset

GPL #: The platform used to generate a dataset

SRX #: Accession number for a sample generated by NGS that is deposited in the Short Read Archive (SRA)

Fleet 2016

Page 27: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Data Archives Web link Description

Oncominehttps://www.oncomine.org/resource/login.html 715 microarray datasets from 19 cancers

Gene Expression across Normal and Tumor tissue (GENT)

http://medical-genome.kribb.re.kr/GENT/

gene expression patterns in human cancer from Affy Chips (+ 1000 cell lines)

BioGPShttp://biogps.org/#goto=welcome

Human/rat/mouse transcript data by tissue

cBioPrortal http://www.cbioportal.org/ TCGA cancer genomics

Genotype-Tissue Expression project (Gtex)

http://www.gtexportal.org/home/

human, multi-tissue gene expression and gene variation for eQTL

Immunological Genome Project (Immgen) http://www.immgen.org/

transcriptome data from cultured mouse immune cells

Human Brain Transcriptome http://hbatlas.org/ transcriptome and associated metadata for developing and adult human brain.

NHLBI Kidney Transcriptome database

https://hpcwebapps.cit.nih.gov/ESBL/Database/Transcriptomic/index.html

Segment-specific expression in rat kidney

Kidney Systems Biology Projecthttps://hpcwebapps.cit.nih.gov/ESBL/Database/

Multi-omic database from rat and mouse studies

Saccharomyces Genome Databasehttp://www.yeastgenome.org/transcriptome-data-in-yeastmine

Integrated biological information on budding yeast

miRBase http://mirbase.org/

published miRNA sequences,annotation. Expression dataset links available

Fleet 2016

Page 28: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Day 1Session 4: Understanding R and Bioconductor

James C. Fleet, PhDDistinguished ProfessorDepartment of Nutrition Science

Pete Pascuzzi, PhDAssistant ProfessorPurdue Libraries

Page 29: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Fleet 2016

Page 30: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Fleet 2016

Page 31: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Fleet 2016

Page 32: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Rstudio is a Graphical User Interface (GUI) that let’s you use R more conveniently

ScriptWindow

CommandWindow

Bioconductor packageWindow

Fleet 2016

Page 33: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Day 1Session 5-6:

Microarray Overview

Data Processing and QC

James C. Fleet, PhDDistinguished ProfessorDepartment of Nutrition Science

Pete Pascuzzi, PhDAssistant ProfessorPurdue Libraries

Page 34: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

1,25(OH)2 D Induces Gene Transcription through the Vitamin

D Receptor

25-OH D 1, 25 (OH)2 D

CYP27B1

Fleet et al. (2012) Biochem J 441:61

Classical Role: Regulate whole body calcium metabolism

NGS project: Osteoblast and osteocytes

Novel Role:Prevent and treat cancer

Microarray project:Normal prostate epithelial cell

Page 35: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Zinser and Welsh 2004, Carcinogenesis 25:2361Larriba et al., 2011, PLoS One 6:e23524

Colon Cancer

Mordan-Mcombs et al., 2010, JSBMB 121:368

APCmin MiceTumor size

WT HT KO(5 months)

Tum

or s

ize (m

m)

MMTV-neu MiceTumor Incidence

Breast Cancer

WT

KOTum

or-F

ree

Mice

(%)

Age (mo)3 6 9 12 15 18

LPB-Tag MiceTumor Size

Prostate Cancer

Tum

or w

eigh

t (m

gs)

Age (wks)7 9 12 15 18

KO

WT

VDRDeletionAcceleratesCancerDevelopmentinMice

Fleet 2016

Page 36: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

StrategiesofCancerPrevention

Umar et al. (2012) Nat. Rev. Cancer 12:835

Cancer Prevention

Cancer Treatment

Time (y)

Diagnosis

Survival without Cancer

Survival with

Cancer

Fleet 2016

Page 37: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

ResearchParadigm:StudyGeneexpressionwithDNA

microarray

Kovalenko et al. (2010) 1,25 dihydroxyvitamin D-mediated orchestration of anticancer, transcript-level effects in the immortalized, non-transformed prostate epithelial cell line, RWPE1. BMC Genomics 11:26

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2820456/

Fleet 2016

Page 38: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

StudyDesign� RWPE1 cells

� Human prostate epithelial cell line (ATCC CRL-11609)� 54 y old white male

� Peripheral zone of normal health prostate� HPV18 immortalized� Not tumorigenic in nude mice

Fleet 2016Kovalenko et al. (2010) BMC Genomics 11:26

60% Confluent Culture

100 nM 1,25(OH)2 D or EtOH control

6 h 24 h 48 h

Media change

2 x 3 factorial design, n = 4/group (24 total samples)

GEO ID = GSE15947

Page 39: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Affymetrix Microarray Analysis Workflow

High Quality RNA

Affymetixanalysis

Affymetix QC analysis

Process Data

• RMA• Normalize• MAS5 (P/A)

Data Filters• P/A call• Fold

AffymetixRaw Data

Additional QC analysis

Normalized Vetted Data

Statistical Analysis

Differentially Expressed Gene List

Clustering and

visualization

Pathway and GenesetAnalysis

Network building

Interpret Experiment

Well Designed

Study

Fleet 2016

Page 40: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Affymetrix Tiling Arrays

11-25 probes or probe pairs per gene

Perfect Match = 25 bpMismatch in one base

Algorithms from Affymetrixand others define “expression”

Used to define “background”

Fleet 2016

Page 41: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Probe ArrayPost-hybridization

Probe Set

Probe Pair(PM/MM)

Probe Cell

40 x 107 copies/cell

25 base long probes

Probe ArrayPre-hybridization

cDNA library and Biotin Label

Hybridization to Array

U133 Plus 2 GeneChip:54,000 probe sets

Affymetrix GeneChip Microarrays

U133 Plus 2 GeneChip:11 probe pairs/gene

Fleet 2016

Page 42: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Residual Plots of Scanned GeneChips

http://plmimagegallery.bmbolstad.com/

High quality (mostly white)

ResidualsRed = (+)Blue = (-)White = 0

Too Variable (intense red/blue)

ScratchedUneven (edge drying)

Uneven

Fleet 2016

Page 43: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Relative Log Expression (RLE)

Large spread and non-zero

= bad

Control limits

IQR limits

Outside control limits

= bad

provides a measure of reproducibility of gene expression data that can be compared across batches, experiments or trials.

Fleet 2016

Page 44: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Normalized Unscaled Standard Error (NUSE)

Large spread and non-zero = bad

Outside control limits

= bad

Control limits

IQR limits

Values have no units. Used to assess relative quality of arrays within an analysis set

Fleet 2016

Page 45: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

95%

99%

541

Fleet 2016

Page 46: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Monday

BREAK #2

Page 47: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Day 1Session 7: Differential Gene Expression

James C. Fleet, PhDDistinguished ProfessorDepartment of Nutrition Science

Pete Pascuzzi, PhDAssistant ProfessorPurdue Libraries

Page 48: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Statistics for Biologists

http://www.nature.com/collections/qghhqm

Nature Publishing:

Collections of short articles on basic statistical concepts for biologists

Fleet 2016

Page 49: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

What is a p-Value?

the probability of obtaining an effect at least as extreme as the one you observed.

Control Mean = 260Treatment Mean = 330.6

260330.6

J. Frost, MiniTab BlogFleet 2016

Page 50: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

What is type I error?

Rejecting the null hypothesis when it is in fact true….a

false positive.

Null Hypothesis is True

Null Hypothesis is False

Reject Null Hypothesis

Type I error (False positive, a)

Correct Inference (True positive)

Fail to Reject Null Hypothessis

Correct Inference (True negative)

Type II error (False negative, b)

Fleet 2016

Page 51: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Why is type I error a problem for “omics” studies?

“large p, small n problem”

100 repeats of same experiment: 5 potential false positives when a = 0.05

25,000 transcripts on an array = 1,250 false positives at a = 0.05

P(at least one Type I error among m tests) =

e.g. When α = .05 and m = 10, P = 0.401

Fleet 2016

Page 52: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

How do we control type I error in “omics” studies?

Familywise Error Rate Correction (FWER):

the probability of making even one false discovery in a set of comparisons.

e.g. Bonferroni test

a/(# comparisons) = 0.05/25,000 = 0.000002

Very conservative but useful if the goal is to only find the changes that are most reliable (and likely the largest)

Fleet 2016

Page 53: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

How do we control type I error in “omics” studies?

False Detection Rate (FDR):

e.g. Benjamini and Hochberg procedure

i/m(Q) = (rank/(# comparisons) )* (false discovery rate)

Gene P value Rank FDR

Gene 1 0.0001 1 0.000002

Gene 2 0.0004 2 0.000004

Gene 1500 0.049 1500 0.003

Gene 25,000 0.988 25000 0.05

Accepts a set rate of false positives within a number of comparisons (e.g. 5% FDR means 5/100 “significant” comparisons are likely false

positives)Fleet 2016

Page 54: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Data Reduction as a Strategy to Minimize Type I Error Problem

Filter out genes with:• Low expression

or• High variation

Hackstadt and Hess (2009) BMC Informatics 10:11

MAS5 (use M=P) # “Present”

At least 4/24 = P 28,883

75% total = P 21,448

75% P in at least 1 group 25,985

50% total = P 23,793

50% P in at least 1 group 29,260

Drop Bottom 25% # “Present”

75% total = P 40,597

75% P in at least 1 group 45,689

Drop genes SD/mean>0.25 # “Present”

75% total = P 18,661

54,677 genes on the U133 Plus 2.0 array

Fleet 2016

Page 55: Week 1 Notes/Slides/Information for the Purdue Big Data ... 1 Day... · Big Data Challenges in Biomedicine 1)Locating data 2)Getting access to data 3)Organizing, Managing, and Processing

Processing and Statistics Influence the DEG List

Processing Statistic DEG at 48 h

gcRMA SAM 3566

RMA SAM 2249

RMA limma 1021

Fleet 2016