03 wright carolina ebrc

8/8/2019 03 Wright Carolina EBRC

1/32

1

Fred Wright, Ph.D.

University of North Carolina at Chapel Hill

The CarolinaEnvironmental

Bioinformatics Research Center


2/32

2

Organization of Center Three major Research Projects: (1) Biostatistics,

(2) Cheminformatics, and (3) ComputationalInfrastructure for Systems Toxicology

Administrative Unit

Public Outreach and Training Activity (POTA) Functional areas ofAnalysis, Methods

Development and Tools Development overseen

by a panel of experienced investigators


3/32

3

Director

Dr. Fred WrightExternal Science

Advisory Committee

Administrative Co-Director

Dr. Mary Sym

Analysis - Dr. Lawrence KupperMethods - Dr. J. Stephen Marron

Tools - Dr. Jan Prins

Integration

Translation(POTA)

Dr. BradHemminger

QualityAssurance

Dr. Mary Sym

Research Oversight

Project 1

Biostatistics

Dr. Fred Wright

Project 2

Chem-informatics

Dr. AlexTropsha

Project 3

Computationand SystemsToxicology

Dr. Ivan Rusyn

Dr. David Stotts

CEBRC Center Organization

Scientific Co-Director

Dr. Ivan Rusyn


4/32

4

(1) Biostatistics inComputational Toxicology

Emphasis on strengths inmicroarray analysis,

elucidation of

networks/pathways,Bayesian approaches

Stresses existing

capabilities


5/32

5

(2) Chem-informatics seeks to establish a

universally applicable androbust predictive toxicologymodeling framework

Focuses on Quantitative

Structure Activity/PropertyRelationships (QSAR)

Establishes a modeling

workflow, toxicityprediction scheme and planfor software development

Figure 4. Predictive QSPR Modeling Workflow

InputStructure

File

ConvertStructures

dbtranslateBabeletc.

MolconnZGenAP

etc.

GenerateDescriptors

Utility(UNC)

NormalizeDescriptors

Descriptor Generation

MolconnZ-ToDescr

(UNC)

ReformatDescriptors

Descriptor formatting

InputDescriptor

File

InputActivity

File

Train & TestSet Selection

SE6(UNC)

Build & Test Models

Randomize(UNC)

RandomizeActivities

RWKNN,SAPLS (UNC),

etc

QSARAlgorithm

KNNPredict,

SAPLSPred (UNC)etc.

PredictTest Set

Report & Visualize Results

ModStat(UNC)

CompileResults

Weblab, TSAR,MOE, Spotfire,

etc.

VisualizeResults Database

to Screen

Screen Database

Utility

NormalizeDescriptors

DBMine,KNNPredict,

etc.

MineDatabase

TSAR, MOE,Spotfire,

etc.

VisualizeHits

QSAR

Model(s)

programsfunctions User input

Do y-randomTesting?

Randomize

Y-rand

test

QSARModel

Predict

Pass orFail?

Train/testSelection?

Ye s

No

PassFailUpdate Status Db

To Complete

DescriptorPreparation

QSARModeling/

Predict

No

UpdateStatus DB

To Complete

SE6

QSARModelpredict

Are there moretraining sets?

No

Ye s

CollectGRID

Process

Results

Ye s

Figure 5. Workflow Logic of Model Generation Process


6/32

6

(3) Computational

Framework for

Systems Toxicology

Uses model for toxicityprofiling in multiple strains

of mice to set up

computational infrastructure Some data mining activity

will develop user-friendly

software tools from methodsin Projects 1 and 2

Control C57BL/10J BUB/BnJ MSM/Ms

Liver Injury


7/32

7

Project 1Biostatistics in Computational Toxicology Fred Wright, Ph.D. (P.I.) statistical genetics, genomic analysis

Mayetri Gupta, Ph.D. sequence analysis, motif detection Young Troung, Ph.D. Bayesian network genetic analysis, SVM

methods for metabolomic data

Joseph Ibrahim, Ph.D. Bayesian analysis of microarray data Danyu Lin, Ph.D. haplotype-phenotype analysis, microarray

analysis

Fei Zou, Ph.D. statistical genetics, genomic analysis

Andrew Nobel, Ph.D. clustering, data dimensional reduction,

genetic pathway analysis

Masters trained personnel


8/32

8

Project 1 objectives

to provide statistical analysis capability

to develop appropriate new methods to apply to

computational toxicology problems

to develop computational tools

to disseminate research findings, train students, and

coordinate additional statistical research incomputational toxicology.


9/32

9

Methods (to name a few)

Sample size estimation for high-throughput data

P-value computation, significance testing Multiple-testing issues, false discovery rates

Dose-response modeling

New measures of differential expression

Transcriptional regulation and motif discovery

Network analysis, discrimination methods Pathway analysis


10/32

10

Tools Much of initial code has been implemented in

R/Bioconductor. This is directly useful to other

statistical investigators.

Working with project 3 investigators and students toproduce user-friendly web-based and/or standalone

applications

Working to increase utility of methods by integrationwith informatics and biological annotation

We view the SAM software as a model for

independent successful dissemination. Project 3

personnel are working on implementing appropriate

procedures in ArrayTrack


11/32

11

Log(sample mean)

Expression measurementsshow a mean-variancerelationship

Which we can exploit to

reduce the false discoveryrate

Grey envelope shows allthe SAM procedures

Example 1. New ways of detecting

differential expression

Hu and Wright, in press

Lowest curve is a newmaximum likelihoodprocedure

Log(sample mean)


12/32

12

Example 2. Significant genes/pathways/categories: theSignificance Analysis ofFunction and Expression

procedure (honest pathway significance testing)

Category

p-value

Genes in

category

Shading

indicatesindividually

significant

genes

Barry et al. Bioinformatics 21:1943-1949 , 2005


13/32

13

Key: blue (p


14/32

14

Example 3. Isotonic regression: gene expression dose-response data

Model -

fshould be

strictly

increasing or

decreasing

Hu et al., 2005, Bioinformatics21: 3524-3529).


15/32

15

Pyrethroid Biomarker Project (J. Harrill, K.

Crofton and colleagues, U.S. E.P.A)

Problem: Lack of a cost efficient biomarker of effect

hampers assessments of the cumulative risk ofpyrethroid insecticides.

Aim: Develop a biochemical biomarker of effect for

pyrethroids that reflects changes in neuronal firingrates.

Methods: Use gene arrays and RT-PCR to identify

dose-responsive transcripts in rat CNS. Permethrin

and deltamethrin each examined at four doses,

Affymetrix arrays.


16/32

16

( ) ( )highest lowest f dose f doseM

v

=

Dose-response, cont.: a statistic to rank genes

Standard error estimate.Could be improved.

Dose-response dataon pyrethroid in ratbrains, courtesy ofJ. Harrill and K.Crofton, U.S. E.P.A.


17/32

17

Project 2Chem-informatics

Alex Tropsha, Ph.D. (P.I.) computational chemistry, QSAR Weifan Zheng, Ph.D. computational methods in drug discovery,

QSAR

Alexander Golbraikh, Ph.D. mathematical approaches in QSAR

development

Yufeng Liu, Ph.D. Support vector machines, semi-supervised

machine learning

additional personnel


18/32

18

Project 2 objectives

to develop an innovative QSPR modeling workflow

based on the principles of combinatorial QSPR

modeling, model validation and consensus prediction

to develop toxicity predictors using the workflow

to integrate modeling tools and endpoint predictors

using workflow design middleware and workflow

deployment in a predictive toxicology web portal

Applied to toxicology datasets


19/32

19

Project 2 builds on years of research in the Tropsha lab

on QSAR/QSPR modeling and developing robustpredictors

Many of the machine learning and cross-validation ideas

are used in statistical genomics

Descriptors topological molecular indices, size and

shape, hydrophilic/phobic indices, physical properties,

etc.

Try to predict biological activity

Analysis of the Carcinogenic Potency Database

(collaboration with Dr. A. Richard, EPA) has been

performed, applied to 693 compounds, with

classification kNN QSAR prediction accuracies

estimated at 85%-90%.


20/32

20

Only accept models

that have aq2 > 0.6

R2 > 0.6, etc.

MultipleTraining Sets

Validated Predictive

Models with High Internal& External Accuracy

OriginalDataset

MultipleTest Sets

Combi-QSARModelingSplit into

Training andTest Sets

ActivityPrediction

Y-Randomization

Database

Screening

Only accept models

that have aq2 > 0.6

R2 > 0.6, etc.

MultipleTraining Sets

Validated Predictive

Models with High Internal& External Accuracy

OriginalDataset

MultipleTest Sets

Combi-QSARModelingSplit into

Training andTest Sets

ActivityPrediction

Y-Randomization

Database

Screening

Flowchart of predictive toxicology framework based on

validated combi-QSAR models. Numerous public datasets

proposed.


21/32

21

Project 3Computational Infrastructure for Systems Toxicology

David Stotts, Ph.D. (co-P.I.) computer science, software

engineering

Ivan Rusyn, Ph.D. (co-P.I.) toxicology, genomics Wei Wang, Ph.D. computer science, data mining

David Threadgill, Ph.D. mammalian genetics, genomics

Additional programmers and students


22/32

22

Project 3 objectivesDevelop and implement algorithms that streamlineanalysis of multi-dimensional data streams.

Facilitate the development of a standard workflow for

(i) analysis of -omics data, (ii) linkages to classical

indicators of adverse health effects, and (iii) integration

with other types of biological information such as

sequence and cross-species comparisons.

Implement user-friendly software for approaches fromProjects 1-3


23/32

23

A driving biological problem:

Toxicogenetic analysis of susceptibility to

toxicant-induced organ injury The model being used by Drs. Threadgill and

Rusyn involves extensive profiling of numerous

mouse strains (over 40) for relevant organs Early data on acetominophen and alcohol on liver

Proposals for trichloroethylene and other toxicants

on liver, kidney, and other organs


24/32

24Image courtesy of D.W. Threadgill

The Mouse as a Model for StudyingGenotype-Phenotype Interactions


25/32

25

Strain-specific susceptibility toacetaminophen (APAP)-induced

liver injury. Serum ALT levels(top panel) and tissuehistopathological changes(bottom panel) were assessed24 hrs after a single doseexposure to APAP (300mg/kg, i.g., 24 hrs).

A large variation in response by

genetic background


26/32

26

Toxicological and expression analysis of genotype-specific responses toethanol in liver. Serum and liver tissues were collected from mice of 6different strains after acute (5 g/kg, 6 hrs; A) or subchronic (4 weeks, B)treatment with ethanol.


27/32

27

Systems Biology Approach


28/32

28

Transcriptome map for the murine brain and liver.

Examination of genetic networks that regulate

gene expression in liver (webQTL and beyond)

Source: Ivan Rusyn and colleagues


29/32

29

Correlation between gene expression of CYP2C29 and several liver-specific phenotypes recorded for BXD strains.

Source: Ivan Rusyn and colleagues


30/32

30

Development of new methods for -omics data analysis:Finding associations between gene expression profiles and

strain-specific genotyping data

SiZer

Smoothing

approach


31/32

31

Data analysis procedures in concert with project 1,including principal component analyses, distance-weighted discrimination, SAFE, etc.

Specific data mining approaches also proposed, suchas subspace clustering (SNPs vs. phenotypes, geneexpression), that fall outside of typical statistical

framework The computational challenges are immense when wecompare different omics platforms (e.g., 100,000SNPs X 30,000 transcripts)

This requires serious computer science (activities ofUNC SNP group).


32/32

32

Collaborations with the NCCT and NJ ebCTC Several collaborative projects identified and ongoing

Several graduate students working on projects at

NCCT or shared advising with CEBRC members

Implementation of ArrayTrack extensions underway

in Project 3

Plan for deliberate expansion of collaborations with

NCCT

The ebCTC is highly complementary to our Center.

Coordination with ebTrack/ArrayTrack development

is one key area of interchange.

03 wright carolina ebrc

Documents