03 wright carolina ebrc

Upload: srikantapatra

Post on 10-Apr-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/8/2019 03 Wright Carolina EBRC

    1/32

    1

    Fred Wright, Ph.D.

    University of North Carolina at Chapel Hill

    The CarolinaEnvironmental

    Bioinformatics Research Center

  • 8/8/2019 03 Wright Carolina EBRC

    2/32

    2

    Organization of Center Three major Research Projects: (1) Biostatistics,

    (2) Cheminformatics, and (3) ComputationalInfrastructure for Systems Toxicology

    Administrative Unit

    Public Outreach and Training Activity (POTA) Functional areas ofAnalysis, Methods

    Development and Tools Development overseen

    by a panel of experienced investigators

  • 8/8/2019 03 Wright Carolina EBRC

    3/32

    3

    Director

    Dr. Fred WrightExternal Science

    Advisory Committee

    Administrative Co-Director

    Dr. Mary Sym

    Analysis - Dr. Lawrence KupperMethods - Dr. J. Stephen Marron

    Tools - Dr. Jan Prins

    Integration

    Translation(POTA)

    Dr. BradHemminger

    QualityAssurance

    Dr. Mary Sym

    Research Oversight

    Project 1

    Biostatistics

    Dr. Fred Wright

    Project 2

    Chem-informatics

    Dr. AlexTropsha

    Project 3

    Computationand SystemsToxicology

    Dr. Ivan Rusyn

    Dr. David Stotts

    CEBRC Center Organization

    Scientific Co-Director

    Dr. Ivan Rusyn

  • 8/8/2019 03 Wright Carolina EBRC

    4/32

    4

    (1) Biostatistics inComputational Toxicology

    Emphasis on strengths inmicroarray analysis,

    elucidation of

    networks/pathways,Bayesian approaches

    Stresses existing

    capabilities

  • 8/8/2019 03 Wright Carolina EBRC

    5/32

    5

    (2) Chem-informatics seeks to establish a

    universally applicable androbust predictive toxicologymodeling framework

    Focuses on Quantitative

    Structure Activity/PropertyRelationships (QSAR)

    Establishes a modeling

    workflow, toxicityprediction scheme and planfor software development

    Figure 4. Predictive QSPR Modeling Workflow

    InputStructure

    File

    ConvertStructures

    dbtranslateBabeletc.

    MolconnZGenAP

    etc.

    GenerateDescriptors

    Utility(UNC)

    NormalizeDescriptors

    Descriptor Generation

    MolconnZ-ToDescr

    (UNC)

    ReformatDescriptors

    Descriptor formatting

    InputDescriptor

    File

    InputActivity

    File

    Train & TestSet Selection

    SE6(UNC)

    Build & Test Models

    Randomize(UNC)

    RandomizeActivities

    RWKNN,SAPLS (UNC),

    etc

    QSARAlgorithm

    KNNPredict,

    SAPLSPred (UNC)etc.

    PredictTest Set

    Report & Visualize Results

    ModStat(UNC)

    CompileResults

    Weblab, TSAR,MOE, Spotfire,

    etc.

    VisualizeResults Database

    to Screen

    Screen Database

    Utility

    NormalizeDescriptors

    DBMine,KNNPredict,

    etc.

    MineDatabase

    TSAR, MOE,Spotfire,

    etc.

    VisualizeHits

    QSAR

    Model(s)

    programsfunctions User input

    Do y-randomTesting?

    Randomize

    Y-rand

    test

    QSARModel

    Predict

    Pass orFail?

    Train/testSelection?

    Ye s

    No

    PassFailUpdate Status Db

    To Complete

    DescriptorPreparation

    QSARModeling/

    Predict

    No

    UpdateStatus DB

    To Complete

    SE6

    QSARModelpredict

    Are there moretraining sets?

    No

    Ye s

    CollectGRID

    Process

    Results

    Ye s

    Figure 5. Workflow Logic of Model Generation Process

  • 8/8/2019 03 Wright Carolina EBRC

    6/32

    6

    (3) Computational

    Framework for

    Systems Toxicology

    Uses model for toxicityprofiling in multiple strains

    of mice to set up

    computational infrastructure Some data mining activity

    will develop user-friendly

    software tools from methodsin Projects 1 and 2

    Control C57BL/10J BUB/BnJ MSM/Ms

    Liver Injury

  • 8/8/2019 03 Wright Carolina EBRC

    7/32

    7

    Project 1Biostatistics in Computational Toxicology Fred Wright, Ph.D. (P.I.) statistical genetics, genomic analysis

    Mayetri Gupta, Ph.D. sequence analysis, motif detection Young Troung, Ph.D. Bayesian network genetic analysis, SVM

    methods for metabolomic data

    Joseph Ibrahim, Ph.D. Bayesian analysis of microarray data Danyu Lin, Ph.D. haplotype-phenotype analysis, microarray

    analysis

    Fei Zou, Ph.D. statistical genetics, genomic analysis

    Andrew Nobel, Ph.D. clustering, data dimensional reduction,

    genetic pathway analysis

    Masters trained personnel

  • 8/8/2019 03 Wright Carolina EBRC

    8/32

    8

    Project 1 objectives

    to provide statistical analysis capability

    to develop appropriate new methods to apply to

    computational toxicology problems

    to develop computational tools

    to disseminate research findings, train students, and

    coordinate additional statistical research incomputational toxicology.

  • 8/8/2019 03 Wright Carolina EBRC

    9/32

    9

    Methods (to name a few)

    Sample size estimation for high-throughput data

    P-value computation, significance testing Multiple-testing issues, false discovery rates

    Dose-response modeling

    New measures of differential expression

    Transcriptional regulation and motif discovery

    Network analysis, discrimination methods Pathway analysis

  • 8/8/2019 03 Wright Carolina EBRC

    10/32

    10

    Tools Much of initial code has been implemented in

    R/Bioconductor. This is directly useful to other

    statistical investigators.

    Working with project 3 investigators and students toproduce user-friendly web-based and/or standalone

    applications

    Working to increase utility of methods by integrationwith informatics and biological annotation

    We view the SAM software as a model for

    independent successful dissemination. Project 3

    personnel are working on implementing appropriate

    procedures in ArrayTrack

  • 8/8/2019 03 Wright Carolina EBRC

    11/32

    11

    Log(sample mean)

    Expression measurementsshow a mean-variancerelationship

    Which we can exploit to

    reduce the false discoveryrate

    Grey envelope shows allthe SAM procedures

    Example 1. New ways of detecting

    differential expression

    Hu and Wright, in press

    Lowest curve is a newmaximum likelihoodprocedure

    Log(sample mean)

  • 8/8/2019 03 Wright Carolina EBRC

    12/32

    12

    Example 2. Significant genes/pathways/categories: theSignificance Analysis ofFunction and Expression

    procedure (honest pathway significance testing)

    Category

    p-value

    Genes in

    category

    Shading

    indicatesindividually

    significant

    genes

    Barry et al. Bioinformatics 21:1943-1949 , 2005

  • 8/8/2019 03 Wright Carolina EBRC

    13/32

    13

    Key: blue (p

  • 8/8/2019 03 Wright Carolina EBRC

    14/32

    14

    Example 3. Isotonic regression: gene expression dose-response data

    Model -

    fshould be

    strictly

    increasing or

    decreasing

    Hu et al., 2005, Bioinformatics21: 3524-3529).

  • 8/8/2019 03 Wright Carolina EBRC

    15/32

    15

    Pyrethroid Biomarker Project (J. Harrill, K.

    Crofton and colleagues, U.S. E.P.A)

    Problem: Lack of a cost efficient biomarker of effect

    hampers assessments of the cumulative risk ofpyrethroid insecticides.

    Aim: Develop a biochemical biomarker of effect for

    pyrethroids that reflects changes in neuronal firingrates.

    Methods: Use gene arrays and RT-PCR to identify

    dose-responsive transcripts in rat CNS. Permethrin

    and deltamethrin each examined at four doses,

    Affymetrix arrays.

  • 8/8/2019 03 Wright Carolina EBRC

    16/32

    16

    ( ) ( )highest lowest f dose f doseM

    v

    =

    Dose-response, cont.: a statistic to rank genes

    Standard error estimate.Could be improved.

    Dose-response dataon pyrethroid in ratbrains, courtesy ofJ. Harrill and K.Crofton, U.S. E.P.A.

  • 8/8/2019 03 Wright Carolina EBRC

    17/32

    17

    Project 2Chem-informatics

    Alex Tropsha, Ph.D. (P.I.) computational chemistry, QSAR Weifan Zheng, Ph.D. computational methods in drug discovery,

    QSAR

    Alexander Golbraikh, Ph.D. mathematical approaches in QSAR

    development

    Yufeng Liu, Ph.D. Support vector machines, semi-supervised

    machine learning

    additional personnel

  • 8/8/2019 03 Wright Carolina EBRC

    18/32

    18

    Project 2 objectives

    to develop an innovative QSPR modeling workflow

    based on the principles of combinatorial QSPR

    modeling, model validation and consensus prediction

    to develop toxicity predictors using the workflow

    to integrate modeling tools and endpoint predictors

    using workflow design middleware and workflow

    deployment in a predictive toxicology web portal

    Applied to toxicology datasets

  • 8/8/2019 03 Wright Carolina EBRC

    19/32

    19

    Project 2 builds on years of research in the Tropsha lab

    on QSAR/QSPR modeling and developing robustpredictors

    Many of the machine learning and cross-validation ideas

    are used in statistical genomics

    Descriptors topological molecular indices, size and

    shape, hydrophilic/phobic indices, physical properties,

    etc.

    Try to predict biological activity

    Analysis of the Carcinogenic Potency Database

    (collaboration with Dr. A. Richard, EPA) has been

    performed, applied to 693 compounds, with

    classification kNN QSAR prediction accuracies

    estimated at 85%-90%.

  • 8/8/2019 03 Wright Carolina EBRC

    20/32

    20

    Only accept models

    that have aq2 > 0.6

    R2 > 0.6, etc.

    MultipleTraining Sets

    Validated Predictive

    Models with High Internal& External Accuracy

    OriginalDataset

    MultipleTest Sets

    Combi-QSARModelingSplit into

    Training andTest Sets

    ActivityPrediction

    Y-Randomization

    Database

    Screening

    Only accept models

    that have aq2 > 0.6

    R2 > 0.6, etc.

    MultipleTraining Sets

    Validated Predictive

    Models with High Internal& External Accuracy

    OriginalDataset

    MultipleTest Sets

    Combi-QSARModelingSplit into

    Training andTest Sets

    ActivityPrediction

    Y-Randomization

    Database

    Screening

    Flowchart of predictive toxicology framework based on

    validated combi-QSAR models. Numerous public datasets

    proposed.

  • 8/8/2019 03 Wright Carolina EBRC

    21/32

    21

    Project 3Computational Infrastructure for Systems Toxicology

    David Stotts, Ph.D. (co-P.I.) computer science, software

    engineering

    Ivan Rusyn, Ph.D. (co-P.I.) toxicology, genomics Wei Wang, Ph.D. computer science, data mining

    David Threadgill, Ph.D. mammalian genetics, genomics

    Additional programmers and students

  • 8/8/2019 03 Wright Carolina EBRC

    22/32

    22

    Project 3 objectivesDevelop and implement algorithms that streamlineanalysis of multi-dimensional data streams.

    Facilitate the development of a standard workflow for

    (i) analysis of -omics data, (ii) linkages to classical

    indicators of adverse health effects, and (iii) integration

    with other types of biological information such as

    sequence and cross-species comparisons.

    Implement user-friendly software for approaches fromProjects 1-3

  • 8/8/2019 03 Wright Carolina EBRC

    23/32

    23

    A driving biological problem:

    Toxicogenetic analysis of susceptibility to

    toxicant-induced organ injury The model being used by Drs. Threadgill and

    Rusyn involves extensive profiling of numerous

    mouse strains (over 40) for relevant organs Early data on acetominophen and alcohol on liver

    Proposals for trichloroethylene and other toxicants

    on liver, kidney, and other organs

  • 8/8/2019 03 Wright Carolina EBRC

    24/32

    24Image courtesy of D.W. Threadgill

    The Mouse as a Model for StudyingGenotype-Phenotype Interactions

  • 8/8/2019 03 Wright Carolina EBRC

    25/32

    25

    Strain-specific susceptibility toacetaminophen (APAP)-induced

    liver injury. Serum ALT levels(top panel) and tissuehistopathological changes(bottom panel) were assessed24 hrs after a single doseexposure to APAP (300mg/kg, i.g., 24 hrs).

    A large variation in response by

    genetic background

  • 8/8/2019 03 Wright Carolina EBRC

    26/32

    26

    Toxicological and expression analysis of genotype-specific responses toethanol in liver. Serum and liver tissues were collected from mice of 6different strains after acute (5 g/kg, 6 hrs; A) or subchronic (4 weeks, B)treatment with ethanol.

  • 8/8/2019 03 Wright Carolina EBRC

    27/32

    27

    Systems Biology Approach

  • 8/8/2019 03 Wright Carolina EBRC

    28/32

    28

    Transcriptome map for the murine brain and liver.

    Examination of genetic networks that regulate

    gene expression in liver (webQTL and beyond)

    Source: Ivan Rusyn and colleagues

  • 8/8/2019 03 Wright Carolina EBRC

    29/32

    29

    Correlation between gene expression of CYP2C29 and several liver-specific phenotypes recorded for BXD strains.

    Source: Ivan Rusyn and colleagues

  • 8/8/2019 03 Wright Carolina EBRC

    30/32

    30

    Development of new methods for -omics data analysis:Finding associations between gene expression profiles and

    strain-specific genotyping data

    SiZer

    Smoothing

    approach

  • 8/8/2019 03 Wright Carolina EBRC

    31/32

    31

    Data analysis procedures in concert with project 1,including principal component analyses, distance-weighted discrimination, SAFE, etc.

    Specific data mining approaches also proposed, suchas subspace clustering (SNPs vs. phenotypes, geneexpression), that fall outside of typical statistical

    framework The computational challenges are immense when wecompare different omics platforms (e.g., 100,000SNPs X 30,000 transcripts)

    This requires serious computer science (activities ofUNC SNP group).

  • 8/8/2019 03 Wright Carolina EBRC

    32/32

    32

    Collaborations with the NCCT and NJ ebCTC Several collaborative projects identified and ongoing

    Several graduate students working on projects at

    NCCT or shared advising with CEBRC members

    Implementation of ArrayTrack extensions underway

    in Project 3

    Plan for deliberate expansion of collaborations with

    NCCT

    The ebCTC is highly complementary to our Center.

    Coordination with ebTrack/ArrayTrack development

    is one key area of interchange.