hop indonesia

Upload: irvan-teha

Post on 09-Apr-2018

  • 8/8/2019 hop Indonesia

    1/80

    Dr. Virendrakumar (Virendra) C. Bhavsar
    Professor; Dean, 2003-2008
    Director, Advanced Computational Research Lab., 2000-10
    Faculty of Computer Science
    University of New Brunswick (UNB), Fredericton, Canada

    Visiting Professor
    Center for Development of Advanced Computing (C-DAC)
    Pune, India

    Bioinformatics: An Overview

    Outline

    Introduction: UNB, C-DAC, Bioinformatics

    Genome: Genes, Proteomes, Evolution

    Databases and Information Retrieval

    Sequence Alignment and Phylogenetic trees

    Protein Structure and Drug Discovery

    Proteomics and Systems Biology

    Infrastructure: UNB and C-DAC

    Research Work at the University of New Brunswick and C-DAC

    Future

    University of New Brunswick (UNB)

    Faculty of Computer Science

    The First Faculty of CS in Canada

    University of New Brunswick

    Fredericton, New Brunswick, Canada

    Oldest English Language University in Canada

    Established in 1785

    Fredericton and UNB

    Center for Development of Advanced Computing (C-DAC)

    India

    C-DAC: HPC Evolution and Road Map

    History (1987)
    - India requires a supercomputer for weather forecasting
    - The Government of the USA refuses the sale of a supercomputer to India
    - The Government of India decides to launch a national initiative for the development of indigenous supercomputers

    [Figure: road map timeline. Milestones: 1991 PARAM 8000; 1994 PARAM 9000; 1998 PARAM 10000; 2002-03 PARAM Padma; 2007 10 TF; 2010 100 TF; 2012-13 1 PF. Garuda Grid Computing: PoC (100 Mbps, 17 locations), then Main Phase Garuda. Stage labels: Technology Denial; Viable HPC business computing environment; Platform for user community to interact/collaborate; Social computing with participatory approach.]

    C-DAC Centres

    Headquarters: Pune

    Centres:
    - Pune
    - Knowledge Park, Bangalore
    - Electronics City, Bangalore
    - Chennai
    - Delhi
    - Hyderabad
    - Kolkata
    - Mohali
    - Mumbai
    - Noida
    - Thiruvananthapuram

    Total manpower is 2100 across all the centres of C-DAC

    C-DAC's Thrust Areas

    - High Performance Computing & Grid Computing: Hardware, Software, Systems, Applications, Research, Technology, Infrastructure
    - Multilingual Computing: Tools, Fonts, Products, Solutions, Research, Technology Development
    - Software Technologies: OSS, Multimedia, ICT for masses, E-Governance, Geomatics
    - Professional Electronics: Digital Broadband, Wireless Systems, Network Technologies, Power Electronics, Real-Time Systems, Embedded Systems, VLSI/ASIC Design, Agri Electronics
    - Cyber Security & Cyber Forensics: Cyber Security tools, technologies & solution development, Research & Training
    - Health Informatics: Hospital Information System, Telemedicine, Decision Support System
    - Ubiquitous Computing: RFID, Design, Development and Integration of Ubicomp System Components
    - Education & Training: e-Learning Technologies & Services

    Compute Nodes

    No. of Processors : 248 (Power 4 @ 1 GHz)

    Aggregate Peak Computing : 1005 GFs (~1 TF)

    File Servers

    No. of Processors : 24 (UltraSparc-III@900MHz)

    Aggregate Memory : 96 GigaBytes

    Internal Storage : 0.4 TeraBytes

    File System : QFS

    Operating System : Solaris 8

    Networks

    Primary : PARAMNet-II @ 2.5 Gbps Full Duplex

    Backup : Gigabit Ethernet @ 1 Gbps Full Duplex

    Management : 10/100 Mbps Fast Ethernet

    External Storage

    Storage Array : 5 TeraBytes with 16 T3 disk arrays

    Tape Library : 12 TeraBytes - L700 (5 LTO drives)

    Software

    HPCC - C-DAC's High performance computing and communication software suite

    Compilers, Parallel Libraries and Tools

    Ranked 171 in 2nd quarter end and 258 as per the latest ranking

    C-DAC

    Advanced Computing Training School (ACTS)

    ACTS @ a glance

    - An outfit initiated by C-DAC R&D in 1993
    - Began with a modest 20 students and has grown to over 5000 students
    - Trained more than a quarter million students
    - Grown from one city and one centre to 30 cities and 50 centres within India
    - Over 150 crores of investment and 600-plus dedicated manpower
    - Spread from India to international locations
    - From one course to more than 10 courses

    International Presence

    Tajikistan

    Uzbekistan

    Mauritius

    Ghana

    Seychelles

    Myanmar

    Russia

    Tanzania

    Turkmenistan

    Lesotho

    Belarus

    Saudi Arabia

    Azerbaijan

    Armenia

    Post Graduate Courses

    Post Graduate Diploma Programs

    DAC : Diploma in Advanced Computing
    DACA : Diploma in Advanced Computer Arts
    DVLSI : Diploma in VLSI Design
    WiMC : Diploma in Wireless & Mobile Computing
    DSSD : Diploma in System Software Development
    DGi : Diploma in Geo informatics
    DISCS : Diploma in Information System & Cyber Security
    DHI : Diploma in Healthcare Informatics
    DLC : Diploma in Language Computing
    DIVESD : Diploma in Integrated VLSI & Embedded System Design
    DESD : Diploma in Embedded Systems Design
    DPC : Diploma in Parallel Computing

    M.Tech. Programs

    Computer Science & Engineering

    Software Engineering

    Information Technology

    VLSI

    Artificial Intelligence

    Grid Computing & Storage Management

    Embedded Systems Design

    Wireless & Network Technology

    Process Control & Instrumentation

    Training Programmes UNDER Tech sangam

    Bioinformatics

    Bioinformatics

    Definitions

    Bioinformatics: the creation and development of advanced information and computational techniques for solving problems in biology

    High Performance Computing (HPC): hardware and software for high-speed computations and large storage

    Bio Introduction

    Molecular Biology

    Living organisms (on Earth)
    - Lipids: separate inside from outside
    - Proteins: build 3D machinery to perform biological functions
    - DNA: stores information on how to build the machinery

    Diagram of a cell
    - Lipid membranes: provide barrier
    - Protein structures: do work
    - DNA in the nucleus: stores information

    Molecular Biology

    Deoxyribonucleic Acid (DNA)

    Composition
    - Sequence of nucleotides
    - Nucleotide = deoxyribose sugar + phosphate group + base

    Molecular Biology - DNA

    - DNA contains the genetic instructions used in the development and functioning of all known living organisms, with the exception of some viruses.
    - DNA molecules provide long-term storage of information.
    - DNA is often compared to a set of blueprints, a recipe or a code, since it contains the instructions needed to construct other components of cells, such as proteins and RNA molecules.
    - Genes: the DNA segments that contain instructions to construct the above components of cells.
    - Other DNA sequences have structural purposes or are involved in regulating the use of this genetic information.

    Molecular Biology - DNA

    - Chemically, DNA consists of two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds.
    - These two strands run in opposite directions to each other and are therefore anti-parallel.
    - Attached to each sugar is one of four types of molecules called bases. It is the sequence of these four bases along the backbone that encodes information. This information is read using the genetic code, which specifies the sequence of the amino acids within proteins.
    - The code is read by copying stretches of DNA into the related nucleic acid RNA, in a process called transcription.

    Molecular Biology - DNA

    - Within cells, DNA is organized into long structures called chromosomes.
    - Chromosomes are duplicated before cells divide, in a process called DNA replication.
    - Eukaryotic organisms (animals, plants, fungi, and protists) store most of their DNA inside the cell nucleus and some of their DNA in organelles, such as mitochondria or chloroplasts.
    - Prokaryotes (bacteria and archaea) store their DNA only in the cytoplasm.

    Molecular Biology

    RNA: Ribonucleic acid
    - A long chain of nucleotide units
    - Each nucleotide consists of a nitrogenous base, a ribose sugar, and a phosphate

    RNA is very similar to DNA, with these differences:
    - RNA is usually single-stranded; DNA is usually double-stranded
    - RNA nucleotides contain ribose, while DNA contains deoxyribose (a type of ribose that lacks one oxygen atom)
    - RNA has the base uracil rather than the thymine that is present in DNA

    Molecular Biology

    DNA, RNA, Proteins

    - DNA: DNA → DNA (Replication)
    - RNA: DNA → RNA (Transcription / Gene Expression)
    - Protein: RNA → Protein (Translation)

    Proteins and nucleic acids (DNA, RNA) are essential components of living organisms

    DNA → (Transcription) → RNA → (Translation) → Proteins

    [Figure: a chromosome unwound into DNA, with Gene 1, Gene 2, ... marked along it]
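    The replication and transcription steps above can be sketched on plain strings. This is a minimal illustration only: the function names are hypothetical, the input is assumed to be the coding (sense) strand written 5' to 3', and real transcription is carried out on the template strand by RNA polymerase.

```python
# Minimal sketch: strand complementarity (replication) and transcription.
# Function names are illustrative assumptions, not from the slides.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))

def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a coding strand: same sequence with T replaced by U."""
    return coding_strand.upper().replace("T", "U")

if __name__ == "__main__":
    print(reverse_complement("ATGGCG"))
    print(transcribe("ATGGCG"))
```

    Treating the mRNA as "the coding strand with U in place of T" is a common shortcut that avoids modelling the template strand explicitly.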

    Raw biological data: nucleic acids (DNA)

    Raw biological data: amino acid residues (proteins)

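    Standard Genetic Code

    First base in the left column, second base across the top; Ter = stop codon.

           T             C             A             G
    T      TTT Phe (F)   TCT Ser (S)   TAT Tyr (Y)   TGT Cys (C)
           TTC Phe (F)   TCC Ser (S)   TAC Tyr (Y)   TGC Cys (C)
           TTA Leu (L)   TCA Ser (S)   TAA Ter       TGA Ter
           TTG Leu (L)   TCG Ser (S)   TAG Ter       TGG Trp (W)

    C      CTT Leu (L)   CCT Pro (P)   CAT His (H)   CGT Arg (R)
           CTC Leu (L)   CCC Pro (P)   CAC His (H)   CGC Arg (R)
           CTA Leu (L)   CCA Pro (P)   CAA Gln (Q)   CGA Arg (R)
           CTG Leu (L)   CCG Pro (P)   CAG Gln (Q)   CGG Arg (R)

    A      ATT Ile (I)   ACT Thr (T)   AAT Asn (N)   AGT Ser (S)
           ATC Ile (I)   ACC Thr (T)   AAC Asn (N)   AGC Ser (S)
           ATA Ile (I)   ACA Thr (T)   AAA Lys (K)   AGA Arg (R)
           ATG Met (M)   ACG Thr (T)   AAG Lys (K)   AGG Arg (R)

    G      GTT Val (V)   GCT Ala (A)   GAT Asp (D)   GGT Gly (G)
           GTC Val (V)   GCC Ala (A)   GAC Asp (D)   GGC Gly (G)
           GTA Val (V)   GCA Ala (A)   GAA Glu (E)   GGA Gly (G)
           GTG Val (V)   GCG Ala (A)   GAG Glu (E)   GGG Gly (G)

    Triplets of DNA bases, called codons, each code for an amino acid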
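    The codon table can be applied mechanically to translate a DNA reading frame. The sketch below builds the standard table in T, C, A, G order and translates from the first base until a stop codon. The function name is hypothetical, and the example input is assumed to start at the ATG inside the sequence shown on the Genome Organization slide.

```python
from itertools import product

BASES = "TCAG"
# Amino acids in codon order TTT, TTC, TTA, TTG, TCT, ..., GGG; "*" marks Ter (stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

if __name__ == "__main__":
    print(translate("ATGGCGATGAGTCGTGTAGCTTCT"))
```

    Any trailing bases that do not fill a whole codon are ignored, which is a simplification for incomplete reads such as ESTs.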

    A Protein Structure

    Protein 3D structure

    The structure of the protein sequence determines the functionality

    http://anatomy.med.unsw.edu.au/cbl/research/cytoskeleton/swissprotactin.htm

    Informatics

    FASTA formatted Sequences

    FASTA: "FAST-All" alignment; it works with any alphabet, an extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment.

    Sample FASTA formatted Sequences

    EST sequence (A, C, G, T)

    >gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library, mRNA sequence
    ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT
    CCAAAAAACTGAATACATTTTCTATTGTCAATATCTATAGCCATATGACTTTCAGTGCGACCTATGCATT
    CATAACTCCCGCTACCAAATCCACCATGTAGTGCTTACAACAACAAGCCTAGTGAGAACGTAAGCCTGGT
    CTGGAGCCAAAAGCAAATTATGTATACTAAAAAACCCCCTGGCTAAAATGCATATCATGATTAGTAGTGA
    CATT

    Protein Sequence (20 different amino acids)

    >gi|532319|pir|TVFV2E|TVFV2E envelope protein
    ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
    QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
    HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
    MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
    TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
    APTEVRRYTGGHERQKRVPF
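    A FASTA record is a header line starting with ">" followed by one or more sequence lines. The sketch below reads such records from text. The function name is hypothetical, the EST in the demo is deliberately shortened, and the second record is invented for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)

# Shortened demo input; the second record is hypothetical.
SAMPLE = """>gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library, mRNA sequence
ACAAGTCACTATAGGG
CATT
>demo_protein hypothetical record
ELRLRY
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(SAMPLE):
        print(name.split()[0], len(seq))
```

    A generator keeps memory use flat, which matters once a file holds tens of thousands of ESTs.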

    Biological Databases

    Genome databases: flat files or relational databases

    GenBank, EMBL, DDBJ, PDB, SWISSPROT, PIR

    Classification of Biological databases:

    - primary databases (GenBank, EMBL, DDBJ)

    - secondary databases (SWISSPROT, PDB, PIR)

    Biological databases

    Like any other database

    Data organization for optimal analysis

    Data is of different types

    Raw data (DNA, RNA, protein sequences)

    Curated data (annotated DNA, RNA and protein sequences and structures, expression data)

    Biological databases - Examples

    Nucleotide Databases
    Alternative Splicing, EMBL-Bank, Ensembl, Genomes Server, Genome, MOT, EMBL-Align, Simple Queries, dbSTS Queries, Parasites, Mutations, IMGT

    Genome Databases
    Human, Mouse, Yeast, C.elegans, FLYBASE, Parasites

    Protein Databases
    Swiss-Prot, TrEMBL, InterPro, CluSTr, IPI, GOA, GO, Proteome Analysis, HPI, IntEnz, TrEMBLnew, SP_ML, NEWT, PANDIT

    Structure Databases
    PDB, MSD, FSSP, DALI

    Microarray Database
    ArrayExpress

    Literature Databases
    MEDLINE, Software Biocatalog, Flybase Archives

    Alignment Databases
    BAliBASE, Homstrad, FSSP

    PDB: Protein Data Bank

    - 3D macromolecular structural data
    - Data originates from NMR or X-ray crystallography techniques
    - If the 3D structure of a protein has been solved, PDB has it

    What to take home

    Databases are a collection of data

    Need to access and maintain easily and flexibly

    Biological information is vast and sometimes very redundant

    Distributed databases bring it all together with quality controls, cross-referencing and standardization

    Computers can only create data; they do not give answers

    Bioinformatics

    Premise of Bioinformatics

    Gene sequences determine biological function
    - Genomic DNA → Amino acids → Proteins → Function

    Similar composition → similar function?
    - DNA sequences
    - Amino acid sequences
    - Protein 3-D structure

    Predicting protein function
    - Designer drugs
    - Personalized treatments

    Bioinformatics

    Determining protein function

    Hard way

    -Biological / chemical analyses

    - Determine 3D structure w/ x-ray crystallography, NMR

    Easy way?

    - Sequence protein / DNA, then find close match in database

    - Guess function based on match

    - Validate guess in lab

    Bioinformatics is imprecise

    - Similar to data-mining

    - Only suggests possible relationships

    - Must validate: correlation ≠ causation

    Growth of Bioinformatics

    1970s

    - DNA sequencing

    - Alignment w/ dynamic programming (Needleman-Wunsch, 1970; Smith-Waterman followed in 1981)

    1980s

    - Sequence databases (EMBL, GenBank)

    - Alignment w/ FASTA (linked lists, hashing)

    1990s

    - Automatic DNA sequencing

    - Alignment w/ BLAST (neighborhood words, probabilities)

    - Internet & WWW

    Now

    - Genomics, Proteomics
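    As an illustration of the dynamic-programming alignment named in the timeline, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for the example, and only the best score is returned, with no traceback.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 lets an alignment restart anywhere (local, not global)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

    The table costs O(len(a) * len(b)) time and memory, which is why the heuristic tools listed for later decades (FASTA, BLAST) were developed for database-scale searches.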

    Bioinformatics Topics

    Sequence alignments

    - Find similarity between DNA / protein (amino acid) sequences

    Genome assembly

    - Combining genomic fragments to form whole genome

    Gene identification & annotation

    - Identify and classify genes on the genome

    Microarrays & gene expression analysis

    - Use DNA microarray (gene chip) to measure mRNA

    Protein folding

    - Compute 3-D protein structure from protein sequence

    Phylogenetic analysis

    - Find genetic relationships between sequences / species

    What Does Genomics Mean?

    Genomics: a science that studies the genetic material of a species at the molecular level

    A scientific approach to identify and define the function of genes, as well as uncover when and how genes work together to produce traits

    Structural genomics approaches (mapping) focus on traits controlled by one or a few genes, and often only provide information regarding the location of a gene or genes

    Examine the interrelationships and interactions between thousands of genes

    How do we do this?

    Genome Organization

    - Proteins are building blocks for living organisms
    - Proteins are derived from DNA
    - Transcription: the gene message (RNA) that codes for proteins is formed from DNA
    - Translation: RNA triplets (codons) code for amino acids
    - A gene can also be identified through its complementary DNA (cDNA); sequences read from active (expressed) genes are termed Expressed Sequence Tags (ESTs)

    [Figure: leaf and tuber tissue; a chromosome unwound into DNA, with Gene 1, Gene 2, ... marked along it]

    Promoter (switch)

    Coding ORF (message)

    ....TATACAGCAAAATAGAAAGATCTAGTGTCCCATGGCGATGAGTCGTGTAGCTTCT.

    DNA

    Gene 1 Gene 2 Etc.

    Genome Organization

    cDNA Collections (Libraries)

    Various tissues are collected from the plant, and messages are extracted

    Leaf

    Messages

    Tuber

    Messages

    cDNA Collections (Libraries)

    The messages are copied to form double-stranded DNA copies (cDNA) of each message

    Leaf cDNA Tuber cDNA

    Each copy is glued into a piece of bacterial DNA for easier storage, handling and propagation, resulting in a collection or library of cDNAs for each tissue

    cDNA Collections (Libraries)

    - The cDNAs are then read, or sequenced, to give the order of As, Cs, Gs and Ts for each
    - We are left with the sequence of each gene that is active (expressed) in each cell, tissue or organ studied
    - These are Expressed Sequence Tags, or ESTs
    - Using complex computer resources, these ESTs can be analyzed and compared with known sequences and proteins
    - Look for messages associated with specific organs or characteristics/traits

    Take Home Points

    - Messages from various genes are important, as they dictate which proteins are produced
    - Promoters are also important, as they dictate where a specific message and protein is produced
    - Genomics involves the study of all of the messages produced by the various plant cells
    - A lot of information needs to be organized and analyzed

    Database

    - Contains all the EST sequences
    - Contains useful annotations
    - Blast Searches
    - Contig Assemblies
    - Transmembrane Spanning Regions
    - Gel Pictures
    - EST Information

    Data Analysis

    Tens of thousands of ESTs available for study

    Most methods to study message distributions are low throughput AND time consuming

    Genomics necessitates the large scale study of

    gene expression

    How can we do this?

    Microarray Analysis

    Microarray Analysis - Processing

    [Figure: "Intensity Dependence Comparison", a scatter plot of Log(R/G) against 0.5*(Log(G)+Log(R)) for Slide3 and Slide70, each with a polynomial fit; reported fits R2 = 0.2014 and R2 = 0.6185]

    Processing pipeline: Image Processing, Data Normalization, Differential Gene Expression Analysis, Cluster Analysis, Pathway Analysis

    Microarray Analysis - Processing

    [Figure: spot image with signal and background regions labelled]

    Spot quality problems
    - Irregular size or shape
    - Irregular placement
    - Low intensity
    - Saturation
    - Spot variance
    - Background variance

    [Figure: example spots labelled indistinguishable, saturated, bad print, artifact, misalignment]

    Microarray Analysis - Processing

    - Calculate numeric characteristics of each spot
    - Throw out spots that do not meet minimum requirements for each characteristic
    - Throw out spots that do not have minimum overall combined quality
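    The per-characteristic filtering step can be sketched as a simple predicate over spot measurements. Everything specific here is an assumption made for illustration: the `Spot` fields, the threshold values, and the 16-bit saturation level are not taken from the slides.

```python
from dataclasses import dataclass

@dataclass
class Spot:
    gene: str
    signal: float      # mean foreground intensity
    background: float  # mean local background intensity
    area: int          # spot area in pixels

def keep_spot(spot, min_ratio=2.0, min_area=30, max_area=200, saturation=65535.0):
    """Return True if the spot passes every (hypothetical) minimum requirement."""
    if spot.signal >= saturation:                # saturated pixel values
        return False
    if not (min_area <= spot.area <= max_area):  # irregular size
        return False
    # Low intensity: the signal must stand clearly above the local background
    return spot.signal / max(spot.background, 1.0) >= min_ratio

if __name__ == "__main__":
    spots = [Spot("g1", 5000, 400, 80), Spot("g2", 65535, 400, 80)]
    print([s.gene for s in spots if keep_spot(s)])
```

    A combined quality score, as in the last bullet, could be layered on top by weighting the same measurements instead of applying hard cut-offs.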

    Microarray Analysis - Data Normalization

    Normalize data to correct for variances

    Dye bias

    Location bias

    Intensity bias

    Pin bias

    Slide bias

    Control vs. non-control spots
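    A common starting point for correcting dye bias is to work with the log ratio Log(R/G) and the mean log intensity 0.5*(Log(G)+Log(R)) of each spot, then centre the log ratios. The sketch below assumes base-2 logarithms and uses a global median shift, which is a deliberately simpler technique than an intensity-dependent curve fit; the function names are hypothetical.

```python
import math
from statistics import median

def ma_values(red, green):
    """Return (M, A) lists: M = log2(R/G), A = 0.5 * (log2(R) + log2(G))."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

def median_normalize(m):
    """Shift log ratios so their median is zero (global dye-bias correction)."""
    centre = median(m)
    return [value - centre for value in m]

if __name__ == "__main__":
    m, a = ma_values([200.0, 400.0, 800.0], [100.0, 100.0, 100.0])
    print(median_normalize(m))
```

    The shift rests on the assumption that most genes are not differentially expressed, so the typical log ratio should be zero.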

    Microarray Analysis - Clustering

    Expression Profile Clustering
    - Cluster genes based on expression profiles
    - Gene expression across several treatments
    - Hypothesis: genes with similar function have similar expression profiles
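    One simple way to group genes by expression profile is to compare profiles with Pearson correlation and put a gene into the first cluster whose seed profile it matches closely. This greedy scheme, the 0.9 threshold, and the gene names are assumptions for illustration; hierarchical or k-means clustering would be more typical in practice.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def cluster_profiles(profiles, threshold=0.9):
    """Greedy clustering: the first gene of each cluster acts as its seed."""
    clusters = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profile, profiles[cluster[0]]) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])  # no similar seed found: start a new cluster
    return clusters

if __name__ == "__main__":
    demo = {  # hypothetical expression across four treatments
        "geneA": [1, 2, 3, 4], "geneB": [2, 4, 6, 8.5],
        "geneC": [4, 3, 2, 1], "geneD": [8, 6, 4, 2],
    }
    print(cluster_profiles(demo))
```

    Correlation compares the shape of a profile rather than its magnitude, which fits the hypothesis that co-regulated genes rise and fall together.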

    Microarray Analysis - Data Management

    [Figure: project, database and engine components]

    Information Processing and Handling

    Assembly and annotation of genomic data

    EST analysis and databases

    Cluster analysis of microarray data

    Comparisons of various transcriptomic methods

    Integration of sequence, transcriptomic, proteomic, metabolomic, transgenic data

    Research Problems in Bioinformatics

    Find genomes of all organisms

    Identify and annotate all genes

    Compute sequence → 3D structure for all proteins

    Compare DNA / protein sequences for similarity

    Compare families of DNA / protein sequences

    Reason to be optimistic: biology is finite (~30,000 human genes; ~1000 protein superfamilies), but computer speeds keep increasing

    Fighting Bird Flu

    Virus in 3-D

    Bioinformatics Infrastructure: High Performance Computing

    Advances in Microprocessor Technology

    Clock rates: 1974: 1 MHz; 1988: 40 MHz; 2002: 2 GHz; 2009: P4 3.0 GHz, quad-core 2.66 GHz

    Intel Montecito chip: 1.72 billion transistors

    NVidia 280 series GPU: 1.4 billion transistors

    - Circuit complexity doubles every 18 months
    - Computing power at a given cost doubles every 18 months
    - Processor clock rates: 40% increase/year, plus more instructions/cycle
    - DRAM access times: only about 10% improvement/year, so caches are required

    Current Supercomputer, Nov 2009

    Jaguar, Oak Ridge National Lab., USA
    - 1.72 Petaflop/s on the Linpack benchmark (1 Petaflop/s = one quadrillion, i.e. a million billion or 10**15, floating-point operations per second)
    - 2.332 Petaflop/s peak (i.e. 2332 Teraflop/s)
    - Power: 1750 Watt/sq ft; ~50 million kWh per year
    - Space: 4352 square feet, larger than an NBA basketball court

    Future

    - IBM Cyclops64: supercomputer on a chip
    - C-DAC initiative for 2010: petaflop machine
    - NCSA, USA, 2011: petaflop machine
    - NASA, SGI and Intel Pleiades: 10 petaflops by 2012
    - 1 exaflop (10**18 flops) by 2019
    - Human brain neural simulations: 10 exaflops by 2025
    - 2-week full weather modeling: 1 zettaflop (10**21 flops) by 2030

    High Performance Computing and Networking @ University of New Brunswick

    Advanced Computational Research Lab (ACRL) Infrastructure

    People, Research, Excellence

    ACEnet: Atlantic Computational Excellence Network

    Hosting sites:

    Member sites:

    ACEnet

    Atlantic Canada is a distributed environment

    $30 million initiative

    Waterways make networking solutions difficult (e.g. Cabot Strait)

    ACEnet

    World-class HPC facilities

    Behave as a single, regionally distributed computational power grid

    Create and operate sophisticated collaboration facilities to bind together geographically dispersed research communities.

    ACEnet at UNB

    Fundy: SUN cluster, AMD Opteron, 632 cores

    ACEnet: 3324 cores

    Internet connectivity > 2Gbps at UNB

    Collaboration Grid

    Collaboration gear across Atlantic Canada
    - Lecture rooms equipped so ACEnet sites can share seminars and participate remotely
    - ACEnet cafés at each site sharing continuous video feeds
    - Desktop-level collaboration equipment for personal communication

    Access Grid streams tens to hundreds of Mbps across the CANARIE network

    ACEnet

    Bioinformatics Research @ University of New Brunswick

    The Canadian Potato Genome Project

    Collaborators

    Dr. Patricia Evans (UNB), Dr. Barry Flinn (BioAtlantech), Dr. David Dekoyer (Potato Research Center), Carleton University, Nova Scotia Agricultural College

    Students: Aijazuddin Syed (MCS Student), En Zhang (MCS Student),

    Zheng Wang (MCS Student), Marc Cooper (MCS Student),

    Rachita Sharma (PhD Student)

    Potato

    - Integral part of diet: French fries, mashed potatoes
    - Provides 12 essential vitamins
    - Fourth most important crop worldwide
    - Potato has not been explored in terms of functional and biochemical traits
    - Much is still unknown about how the potato genome controls development and processing/quality traits (disease resistance, stress tolerance, carbohydrate metabolism, tuber shape)

    Economic Importance Of The Potato

    - Integral part of the diet of a large proportion of the world's population
    - Supplies at least 12 essential vitamins and minerals
    - Still much unknown regarding the control of potato development and processing/quality traits (e.g. disease resistance, stress tolerance, carbohydrate metabolism, tuber shape)

    The Canadian Potato Genome Project (CPGP)

    46% of national potato production: $1 billion/year

    Home of McCain Foods Ltd.: $5.5 billion/year

    Potato Research Center (PRC) of AAFC

    Solanum Genomics International Inc./BioAtlantech

    Carleton University

    University of New Brunswick

    Nova Scotia Agricultural College (NSAC)

    CPGP Goals

    CPGP targets genes associated with tuber health and tuber quality:
    - Tuber health: Late Blight and Common Scab
    - Tuber quality: stable dry matter accumulation, cold sweetening and after-cooking darkening

    [Figure: leaf and tuber tissue; DNA with Gene 1, Gene 2, ... marked along it]

    Project Description

    Identification of a Differential Gene Expression Pattern and Genes Related to Resistance in Potato Late Blight

    - One of the most devastating diseases of potato worldwide
    - If left unmanaged, complete destruction of crops can occur
    - Attacks leaves and tubers: large necrotic lesions on leaves and dry rot that spreads through tubers; secondary (2°) bacteria and fungi often infect through late blight lesions

    Late Blight Project

    Collaborative effort with AAFC Potato Research Centre

    - Population of blight-sensitive and blight-resistant plants of near isogenicity
    - cDNA libraries made from leaves of a blight-sensitive and a blight-resistant plant
    - 2500 messages were sequenced from each library (5000 total ESTs)
    - Different ESTs to be profiled for expression
    - The tremendous amounts of data generated will need to be managed efficiently

    Database - Sequence Info

    Late Blight Project

    cDNA Microarray Using SGII Clones

    Hybridized with Cy3 (resistant) + Cy5 (susceptible) probes (reciprocal labelling experiments)

    - ANDLBRLF02345HTF.01 - Class II chitinase
    - ANDLBRLF01256HTF.01 - Pathogenesis-related protein P23 precursor
    - ANDLBRLF02041HTF.01 - Unknown protein

    What Use Is All Of This Information?

    Transgenics:
    - Enhance tuber quality, processing traits, disease resistance, stress tolerance more rapidly than breeding

    Expression Assisted Selection:
    - Obtain expression profiles for thousands of genes associated with specific traits or characteristics
    - Use these profiles as a baseline to compare with the expression profiles of unknown clones; crosses

    New Protein Products:
    - Identify genes encoding secreted proteins/ligands
    - Test these for growth-promoting/other effects
    - Express genes in batch cultures and purify proteins

    Example Of Gene Use

    - GFP expression in tobacco cells
    - GA-20 oxidase in potato: knockouts with enhanced tuber production
    - GA-20 oxidase in potato: knockouts with reduced tuber sprouting

    Information Processing and Handling

    Assembly and annotation of genomic data

    EST analysis and databases

    Cluster analysis of microarray data

    Comparisons of various transcriptomic methods

    Integration of sequence, transcriptomic, proteomic, metabolomic, transgenic data

    The Canadian Potato Genome Project

    Project workflow, in the order shown on the slide:
    - Sequence the gene and build cDNA libraries [Solanum Genomics Intl. Inc. (SGII)]; output: leaf and tuber cDNA
    - EST sequence generation [National Research Council at Halifax and SGII]; output: FASTA formatted EST sequences and trace files
    - Bioinformatics: base-calling, clustering, BLAST, annotations, and gene expression [UNB and PRC]
    - Microarray profiling [SGII, PRC, UNB, Ontario Cancer Institute, and NSAC]

    Sample FASTA formatted Sequences

    EST sequence

    >gi|39796586|gb|CK247430.1|CK247430 EST731067 potato callus cDNA library, mRNA sequence
    ACAAGTCACTATAGGGACATGCTTCAATTTTTTCAAAACATCTTGAATAGTACAAAGTGCACAACATACT
    CCAAAAAACTGAATACATTTTCTATTGTCAATATCTATAGCCATATGACTTTCAGTGCGACCTATGCATT
    CATAACTCCCGCTACCAAATCCACCATGTAGTGCTTACAACAACAAGCCTAGTGAGAACGTAAGCCTGGT
    CTGGAGCCAAAAGCAAATTATGTATACTAAAAAACCCCCTGGCTAAAATGCATATCATGATTAGTAGTGA
    CATT

    Protein Sequence

    >gi|532319|pir|TVFV2E|TVFV2E envelope protein
    ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
    QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
    HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
    MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
    TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
    APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
    LAAVEAQQQMLKLTIWGVK

    Standard Genetic Code

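    Rows: first base. Columns: second base. Within each cell the third base runs T, C, A, G. Ter = stop codon.

         T            C            A            G

    T    TTT Phe (F)  TCT Ser (S)  TAT Tyr (Y)  TGT Cys (C)
         TTC Phe (F)  TCC Ser (S)  TAC Tyr (Y)  TGC Cys (C)
         TTA Leu (L)  TCA Ser (S)  TAA Ter      TGA Ter
         TTG Leu (L)  TCG Ser (S)  TAG Ter      TGG Trp (W)

    C    CTT Leu (L)  CCT Pro (P)  CAT His (H)  CGT Arg (R)
         CTC Leu (L)  CCC Pro (P)  CAC His (H)  CGC Arg (R)
         CTA Leu (L)  CCA Pro (P)  CAA Gln (Q)  CGA Arg (R)
         CTG Leu (L)  CCG Pro (P)  CAG Gln (Q)  CGG Arg (R)

    A    ATT Ile (I)  ACT Thr (T)  AAT Asn (N)  AGT Ser (S)
         ATC Ile (I)  ACC Thr (T)  AAC Asn (N)  AGC Ser (S)
         ATA Ile (I)  ACA Thr (T)  AAA Lys (K)  AGA Arg (R)
         ATG Met (M)  ACG Thr (T)  AAG Lys (K)  AGG Arg (R)

    G    GTT Val (V)  GCT Ala (A)  GAT Asp (D)  GGT Gly (G)
         GTC Val (V)  GCC Ala (A)  GAC Asp (D)  GGC Gly (G)
         GTA Val (V)  GCA Ala (A)  GAA Glu (E)  GGA Gly (G)
         GTG Val (V)  GCG Ala (A)  GAG Glu (E)  GGG Gly (G)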
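
    The standard genetic code can be applied in code to turn a DNA reading frame into a protein sequence. The sketch below builds the 64-codon table in the same T, C, A, G order as the table; the function name translate and the use of "X" for unrecognized codons are choices made here for illustration.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for TTT, TTC, TTA, TTG, TCT, ... GGG; '*' marks Ter.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA string codon by codon, starting at offset `frame`."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        # Codons containing ambiguous bases such as N become 'X'.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)
```

    For example, translate("ATGGCCTAA") gives "MA*": Met, Ala, then a stop codon.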

    Database

    Contains all the EST sequences

    Contains useful annotations

    Blast Searches

    Contig Assemblies

    Transmembrane Spanning Regions

    Gel Pictures EST Information

    Data Analysis - Bioinformatics

    Tens of thousands of ESTs available for study

    Most methods to study message distributions are low throughput and time consuming

    Genomics necessitates the large-scale study of gene expression

    Automation required for routine processes

    Data acquisition for potato genome annotation

    Automated protein classification with rule maintenance

    Use agents to integrate the software and primary databases in a flexible and robust way

    Overview of Bioinformatics Research at UNB

    Automated Protein Classification and Rule Maintenance

    Automated Data Acquisition Pipeline

    TraceScan

    Multi-Agent System for Potato Genome Annotation

    EST sequences

    Homologs, Motifs, Fingerprints, Transmembrane, and Signal sites

    TraceScan - Keywords

    Chromatogram - visual representation of the digital output produced by an automated sequencing machine. A chromatogram is drawn as a set of four overlapping waveforms, one for each nucleotide base

    Base-calling - determining the set of nucleotide bases for a DNA sequence strand from the analysis of the digital output produced by a sequencing machine

    Heterozygosity - exists in the chromatogram where a second strong peak appears beneath a primary peak. This may indicate the presence of a secondary nucleotide base at that location in the sequence

    BLAST - Basic Local Alignment Search Tool
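
    The heterozygosity keyword can be illustrated with a small check on one chromatogram position. This is only a sketch under stated assumptions: the function name secondary_peak and the 0.5 height ratio are hypothetical, and TraceScan itself relies on modifications to PHRED rather than on this rule.

```python
def secondary_peak(intensities, min_ratio=0.5):
    """Return (primary_base, secondary_base or None) for one trace position.

    intensities: dict mapping 'A', 'C', 'G', 'T' to the peak height of
        that base's waveform at this position.
    min_ratio: hypothetical cutoff; the second peak counts as "strong"
        when it reaches this fraction of the primary peak height.
    """
    ranked = sorted(intensities, key=intensities.get, reverse=True)
    primary, runner_up = ranked[0], ranked[1]
    strong = (
        intensities[primary] > 0
        and intensities[runner_up] >= min_ratio * intensities[primary]
    )
    return (primary, runner_up) if strong else (primary, None)
```

    A position with heights A=900 and G=620 would be flagged as a possible A/G heterozygous site, while A=900 and G=100 would not.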

    Example of a Chromatogram

    The TraceScan Software System

    Designed to investigate sequence quality, potential polymorphisms, and base heterozygosity in EST sequences.

    Relies on the combined analysis of a DNA sequence trace file, the trace chromatogram, and multiple alignment of sequence homologs.

    Allows base-calls to be substituted where superimposed peaks have been detected in the trace.

    Base-calls deemed in error can be corrected to improve sequence quality and data reliability.

    TraceScan

    Visualizes DNA sequence chromatograms

    Detects overlapping trace peaks using modifications to the PHRED base-caller

    Peaks are highlighted on the user interface.

    Modifications to PHRED enable base-calls with overlapping peaks to be substituted.

    Base substitutions produce a new set of base quality scores for the sequence.
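
    PHRED base quality scores are defined as Q = -10 log10(P), where P is the estimated probability that the base-call is wrong. The helper names below are chosen for illustration; how TraceScan recomputes the scores after a substitution is not described in the slides.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base-call with quality q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Quality score for an estimated error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

    A quality of 20 corresponds to a 1 in 100 chance of a wrong call, and a quality of 30 to 1 in 1000.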

    TraceScan

    An interface to NCBI BLAST provides sequence comparison capabilities.

    Sequences are compared using BLASTN and BLASTX.

    BLASTN alignments are analyzed in search of discrepancies that may identify base-calling errors or putative polymorphisms in the trace sequence.

    Reading frames from BLASTX results are analyzed to examine whether substituted base-calls result in synonymous or non-synonymous codon substitutions.
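
    Whether a substituted base-call is synonymous can be decided from the standard genetic code once the reading frame is known. The sketch below assumes the codon and the position of the substituted base within it have already been located; classify_substitution is a name invented here.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base change inside one codon.

    position is 0, 1, or 2. Returns (mutated_codon, old_aa, new_aa, kind).
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "non-synonymous"
    return mutated, before, after, kind
```

    Changing CTT to CTC keeps leucine and is synonymous; changing GAT to GAA replaces aspartate with glutamate and is non-synonymous.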

    TraceScan System Architecture

    The Automated Data Acquisition Pipeline (ADAP) - Keywords

    Hypothetical Protein: The protein sequence obtained from transcription and translation of the DNA sequence. It is hypothetical because we do not know whether it is the real protein that the DNA codes for.

    Homologs: Evolutionarily related protein sequences

    Comparative genomics: A technique in which the functional traits of a protein sequence are learned from its homologs

    Motifs: Highly conserved regions of protein sequences

    Fingerprints: Collections of motifs

    BLASTP: Basic Local Alignment Search Tool for protein-to-protein searches

    Automated Data Acquisition Pipeline (ADAP)

    Gathers data for genome annotation

    ADAP features:

    Uses comparative genomics to learn from the homologs

    New variant of BLAST: Parameter Regulated Iterative BLAST (PRI-BLAST)

    Uses seven different analysis and search tools

    Uses a few software design patterns

    Perl, MySQL, Perl-DBI, BioPerl, EMBOSS, BLASTP, SGE 5.3, and Perl-Gtk on Linux

    ADAP Overview

    Input: FASTA-formatted EST sequences

    Phase 1: Hypothetical protein (HP) extraction and homolog generation

    Phase 2: Sequence-based protein structure prediction

    Phase 3: Database-search-based protein family prediction

    Homologs and HPs

    Potato ADAP database

    Perl-MySQL database interface

    Legend: data flow; database interactions

    Parameter Regulated Iterative BLAST (PRI-BLAST)

    A static set of BLASTP parameters (neighborhood score, E-value, fraction identical, BLOSUM matrix, etc.) is not adequate, since proteins evolve at different rates

    PRI-BLAST iteratively performs BLASTP on the query sequence and categorizes the query as

    a Celebrity query (C): many homologs

    an Average query (A): a few or no homologs

    an Obscured query (O): some homologs

    PRI-BLAST

    Rule module: decides which set of BLASTP parameters to use; halts the PRI-BLAST

    Statistical module: density of homologs is computed through SQL statements
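
    The iterate-then-halt idea behind PRI-BLAST can be sketched as a loop over parameter sets. Everything specific here is an assumption made for illustration: the function name pri_blast, the strict-to-permissive ordering, the min_homologs halting rule, and the injected search callable that stands in for a real BLASTP run. The actual rule and statistical modules are not described in enough detail in the slides to reproduce.

```python
def pri_blast(query, search, parameter_sets, min_homologs=5):
    """Relax search parameters until enough homologs are found.

    search(query, params) -> list of homolog ids (stand-in for BLASTP).
    parameter_sets: ordered from strictest to most permissive.
    min_homologs: hypothetical halting rule.
    Returns the list of (params, homolog_count) pairs that were tried.
    """
    history = []
    for params in parameter_sets:
        homologs = search(query, params)
        history.append((params, len(homologs)))
        if len(homologs) >= min_homologs:
            break  # halt: the current parameter set is good enough
    return history


def toy_search(query, params):
    # Toy stand-in: a looser E-value cutoff admits more of three fixed hits.
    hits = {"h1": 1e-40, "h2": 1e-8, "h3": 0.5}
    return [name for name, e in hits.items() if e <= params["evalue"]]


parameter_sets = [
    {"evalue": 1e-20, "matrix": "BLOSUM80"},
    {"evalue": 1e-5, "matrix": "BLOSUM62"},
    {"evalue": 1.0, "matrix": "BLOSUM45"},
]
history = pri_blast("QUERY", toy_search, parameter_sets, min_homologs=2)
```

    With the toy data the loop stops at the second parameter set, having found one and then two homologs.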

    Example BLASTP report

    BLASTP 2.2.8 [Jan-05-2004]

    Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer

    . . . . . Nucleic Acids Res. 25:3389-3402.

    Query= CK00043.5prime

    (182 letters)

    Database: All non-redundant GenBank CDS

    translations+PDB+SwissProt+PIR+PRF excluding environmental samples

    1,795,144 sequences; 592,604,613 total letters

    Searching..................................................done

    Score E

    Sequences producing significant alignments: (bits) Value

    gb|AAD46849.2| LD03471p [Drosophila melanogaster] 329 5e-90

    ref|NP_651977.1| CG6773-PA [Drosophila melanogaster] >gi|7300991... 285 1e-76

    ref|XP_312881.1| ENSANGP00000014751 [Anopheles gambiae] >gi|2129... 209 7e-54

    gb|AAH54585.1| Unknown (protein for MGC:63980) [Danio rerio] 184 4e-46

    .

    .

    .

    >gb|AAD46849.2| LD03471p [Drosophila melanogaster] Length = 386

    Score = 329 bits (1155), Expect = 5e-90

    Identities = 181/182 (99%), Positives = 181/182 (99%)

    Query: 1 VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG 60

    VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG

    Sbjct: 6 VKRRKKTRLAFNQFIWRPDERISSKMVSLLQEIDTEHEDMVHHAALDFYGLLLATCSSDG 65

    Query: 61 SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK 120

    SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK

    Sbjct: 66 SVRIFHSRKNNKALAELKGHQGPVWQVAWAHPKFGNILASCSYDRKVIVWKSTTPRDWTK 125

    Query: 121 LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPNXHTIGVN 180

    LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPN HTIGVN

    Sbjct: 126 LYEYSNHDSSVNSVDFAPSEYGLVLACASSDGSVSVLTCNTEYGVWDAKKIPNAHTIGVN 185

    Query: 181 AI 182

    AI

    Sbjct: 186 AI 187
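
    The one-line hit summaries in a report like the one above can be pulled into structured records before any statistics are computed. This is a minimal sketch for the plain-text layout shown here; BlastHit and parse_hit_line are names invented for illustration, and the "e-100" handling covers the shorthand some BLAST versions print.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    seq_id: str
    description: str
    bit_score: float
    evalue: float


def parse_hit_line(line):
    """Parse one 'Sequences producing significant alignments' line."""
    # The last two whitespace-separated fields are bit score and E-value.
    left, score, evalue = line.strip().rsplit(None, 2)
    seq_id, _, description = left.partition(" ")
    if evalue.startswith("e"):
        evalue = "1" + evalue  # 'e-100' means 1e-100
    return BlastHit(seq_id, description.strip(), float(score), float(evalue))


hit = parse_hit_line(
    "gb|AAD46849.2| LD03471p [Drosophila melanogaster] 329 5e-90"
)
```

    The resulting record has seq_id "gb|AAD46849.2|", a bit score of 329, and an E-value of 5e-90.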

    Motif-search-based Protein Sequence Analysis (mPSA)

    Motifs are conserved regions of protein sequences, and a fingerprint is a collection of motifs in some order

    mPSA (Phases 2 & 3) for the ADAP contains 6 mPSA tools from EMBOSS

    Phase 2: sequence-based mPSA

    secondary structure: transmembrane regions (Tmap), signal sites (Sigcleave), and general secondary structure (Garnier)

    super-secondary structure: DNA binding sites (Helixturnhelix)

    Phase 3: database-search-based mPSA

    protein motifs from PROSITE (Patmatmotifs) and protein fingerprints from PRINTS (Pscan)
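
    A PROSITE-style motif search can be approximated by converting the pattern into a regular expression. This sketch is not how Patmatmotifs is implemented; it covers only the common syntax elements (x, [..], {..}, repeat counts, and the < and > anchors), and the function names are chosen here. The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} means "any residue except P"
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "between 2 and 4 of any residue"
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        # '<' and '>' anchor the pattern to the N- and C-terminus
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def find_motif(sequence, prosite_pattern):
    """Return (start_index, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(prosite_pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]
```

    For instance, find_motif("AANGSAA", "N-{P}-[ST]-{P}") reports one match, "NGSA", starting at index 2.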

    [Figure: Homologues for Various Ranges of Lengths of Hyp. Proteins. Bar chart; x-axis: Length of Hyp. Protein, in bins from 10-15 to 110-115; y-axis: Number of Homologues, 0 to 10000; series: Homologues (Total)]

    Shorter protein sequences have more homologs, but these can be false positives

    Homologues with E

    Automated Protein Classification and Rule Maintenance

    Use machine-learning techniques to find some rules

    Apply the rules to classify uncharacterized sequences

    Categorized sequences and their related data

    Rule construction process

    A decision tree consisting of rules

    Uncharacterized sequences

    Rule application process

    Newly characterized sequences

    Automated Protein Classification and Rule Maintenance

    Source data collection

    Automated rule generation

    Machine-learning algorithms and their comparison

    Automated rule maintenance

    Automated Rule Generation

    C4.5 and CITree algorithms produce decision trees

    WEKA (Waikato Environment for Knowledge Analysis) will be used for analyzing the dataset (http://www.cs.waikato.ac.nz/~ml/index.html)
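
    C4.5 chooses the attribute to split on by gain ratio, that is, information gain divided by the entropy of the split itself. The sketch below computes that criterion for categorical attributes only; it is not a full C4.5 implementation, and the attribute and class names in the usage note are invented.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, attribute):
    """Gain ratio of splitting `rows` (list of dicts) on `attribute`."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Expected entropy of the class labels after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0
```

    With four sequences where a hypothetical "transmembrane" attribute perfectly separates two classes, the gain ratio is 1.0; an attribute that takes the same value for every sequence scores 0.0.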

    [Flowchart: Rule Generation process. Input: sequences and their related data. Steps shown: Start; Rule Construction & Decision Tree Creation; Rule Sieving; decision "Is the rule qualified?" (Yes/No); Update Rule Database; decision "End of Rules?" (Yes/No); Apply rules to annotate target sequences; Update Target Sequence Database; End. Data stores: Rule Database, Target Sequence Database]

    Comparison of Algorithms

    Evaluation criteria for the machine learning algorithms: accuracy and AUC (Area Under the ROC (Receiver Operating Characteristic) Curve)
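
    Both criteria are easy to compute for a binary classifier. The sketch below uses the rank interpretation of AUC: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The 0.5 decision threshold for accuracy is an assumption.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

    For scores [0.9, 0.8, 0.3, 0.1] with labels [1, 0, 1, 0], accuracy is 0.5 while AUC is 0.75, which shows that the two criteria can rank classifiers differently.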

    Performance analysis

    Tree Generated using Weka

    Multi-agent Systems

    A multi-agent system is one that consists of a number of agents, which interact with one another

    In the most general case, agents will be acting on behalf of users with different goals and motivations

    To successfully interact, they will require the ability to cooperate, coordinate, and negotiate with each other, much as people do

    Multi-Agent System for Potato Genome Annotation

    [Architecture diagram. Agents: Information Agent, Pipeline Agent, Rule Construction Agent, Database Update Agent. Other components: Automated Data Acquisition Pipeline, Classification Module. Data sources and stores: Web, NRDB, MONTH, PRINTS, PROSITE, Local Database, Rule Database, Target Sequence Database]

    Mapping Transcription Factors from a Model to a Non-Model Organism

    Transcription Factor

    Group of proteins that initiate transcription

    transcriptional activators

    transcriptional repressors

    Consists of DNA binding domains

    Binds to the binding site regions (specific DNA sequences)

    Controls the expression of the genes

    Human genome: 2600 proteins contain DNA-binding domains

    Transcription Factor Mapping

    Model Organism

    Investigated thoroughly by biologists

    Nodes: Transcription factors

    Non-Model Organism

    Not much data available

    Nodes: Predicted transcription factors

    [Diagram: transcription factors A, B, C in the Source Genome mapped to A1, B1, C1 in the Target Genome]

    Transcription Factor Mapping

    Methodology

    BLASTP is used to map transcription factors from E. coli and Bacillus subtilis to the E. coli group and the Bacillus group

    Parameter E-value threshold: 1e-5 to 10

    Not all transcription factors from one genome can be mapped to another genome

    The number of confirmed mappings between any two genomes is dependent on the definition of confirmed mapping used

    Compare the available transcription factors of the target genome to the predicted set of transcription factors
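
    The mapping step can be sketched as a best-hit filter over BLASTP results followed by a comparison with the known transcription factors of the target genome. The data layout, the function names, and the definition of a confirmed mapping used here (the predicted target protein is a known transcription factor) are assumptions for illustration; the slides note that the count depends on which definition is chosen.

```python
def map_transcription_factors(hits, evalue_threshold):
    """Map each source TF to its best target hit under the threshold.

    hits: iterable of (source_tf, target_protein, evalue) BLASTP results.
    """
    best = {}
    for source_tf, target, evalue in hits:
        if evalue > evalue_threshold:
            continue  # hit is not significant at this threshold
        if source_tf not in best or evalue < best[source_tf][1]:
            best[source_tf] = (target, evalue)
    return {tf: target for tf, (target, _) in best.items()}


def confirmed_fraction(mapping, known_target_tfs):
    """Fraction of predicted target TFs that are already known TFs."""
    predicted = set(mapping.values())
    if not predicted:
        return 0.0
    return len(predicted & set(known_target_tfs)) / len(predicted)


hits = [
    ("A", "A1", 1e-30),
    ("A", "X9", 0.05),
    ("B", "B1", 0.02),
    ("C", "C1", 5.0),
]
loose = map_transcription_factors(hits, evalue_threshold=0.1)
strict = map_transcription_factors(hits, evalue_threshold=0.01)
```

    With this toy data the 0.1 threshold maps A and B, while the 0.01 threshold maps only A, which illustrates how strongly the result depends on the chosen E-value cutoff.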

    Summary of Mapping Results

    Transcription factor mapping in bacterial genomes

    The proposed method is able to map most of the transcription factors

    Transcription factor sequence motifs are well preserved

    Best E-value thresholds: 0.1 and 0.01

    The correct choice of E-value threshold can be more important than the selection of an evolutionarily closer model organism

    Bioinformatics @ C-DAC

    Dr. Rajendra Joshi

    Group Coordinator: Bioinformatics

    Scientific and Engineering Computing Group

    Centre for Development of Advanced Computing

    Pune - 411007

    [email protected]

    http://bioinfo-portal.cdac.in

    Bioinformatics Resources & Applications Facility (BRAF)

    Funded by the Department of Information Technology (DIT), Ministry of Communications and Information Technology

    Grid-enabling of numerous bioinformatics codes like SW, BLAST, ClustalW, AMBER, CHARMM, etc.

    As part of BRAF, the team interacted with scientists from various CSIR labs, IITs, and industries

    AMD processors, 2.6 GHz (Total: 204 cores, 1060.8 GF)

    4 Sun X4600 servers (8 sockets, dual-core each), giving 64 cores

    32 Sun X2200 servers (dual-socket, dual-core each), giving 128 cores

    Backup server: Sun X2200 (4 cores)

    Storage server: two Sun X2200 (8 cores)

    InfiniBand switch (Mellanox DDR2, 48 ports)

    Storage: 20 terabytes, RAID 5

    Tape library with autoloader

    Benchmarking completed for AMBER, CHARMM, MEME, SW, Fasta, ClustalW, BLAST

    BIOGENE: 1 TF machine
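
    The quoted totals can be checked with a short calculation. The core counts come from the list above; the value of 2 floating-point operations per core per clock cycle is an assumption, chosen because it reproduces the stated 1060.8 GF figure.

```python
# (description, number of servers, cores per server)
servers = [
    ("Sun X4600 compute", 4, 8 * 2),   # 8 sockets x dual core
    ("Sun X2200 compute", 32, 2 * 2),  # dual socket x dual core
    ("Sun X2200 backup", 1, 4),
    ("Sun X2200 storage", 2, 4),
]

CLOCK_GHZ = 2.6
FLOPS_PER_CYCLE = 2  # assumed; not stated on the slide

total_cores = sum(count * cores for _, count, cores in servers)
peak_gflops = total_cores * CLOCK_GHZ * FLOPS_PER_CYCLE
```

    This gives 204 cores and a theoretical peak of about 1060.8 GF, roughly 1 TF.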

    Using BRAF Facility

    Gipsy portal: open the following URL in a browser

    http://gipsy.bioinfo-portal.cdac.in

    Command line login

    ssh -p 30005 gateway.cdac.in

    Help on command line usage is available in the README file in the user's home directory.

    Helpline: [email protected]

    Bioinformatics Application Software for High-End Clusters and Grid

    iMolDock : An interface for Molecular Docking on HPC

    GENOPIPE : Automated Genome Annotation Pipeline on HPC

    Anvaya : A Workflow Environment for High Throughput Comparative Genomics

    Taxo Grid : Phylogeny on Grid

    GenomeGrid : Bioinformatics Problem Solving Environment on Grid

    GIPSY : Bioinformatics Problem Solving Environment on HPC

    High-throughput Workflows for Genome Analysis

    Collaboration: Biotechnology and Biological Sciences Research Council (UK)

    A Systems Biology based approach for annotation of Salmonella and Mycobacterium genomes

    Establishment of a common bioinformatics pipeline for analyses of bacterial genomes, with emphasis on identification of virulence and pathogenic factors

    Collaboration: Institute of Animal Health (UK)

    Genome Annotation: Salmonella

    Causative agent of typhoid

    Transmitted via food contamination

    Economic losses as it affects livestock

    Annotation of 5 Salmonella genomes with a wide host range

    Food-borne disease cycle: Salmonella

    Genome Annotation via GENOPIPE

    Single nucleotide polymorphism

    Collaboration: University of Surrey (UK)

    Expert curation of the Mycobacterium leprae genome: causative agent of leprosy

    Development of a tool to calculate the molecular weight of metabolites

    Furin Complex

    Collaboration: Oregon Health & Science University (USA)

    Collaborative project initiated with OHSU in December 2009

    Provide computational support to the experimental studies at OHSU through MD simulations on the BIOGENE cluster

    The propeptide domain of the serine protease Furin acts as a pH sensor

    This phenomenon has been elucidated in silico through MD simulations

    Ten sets of simulations were performed using NAMD

    Collaborations: caBIG (NIH)

    The National Cancer Institute (NCI) is involved in deployment of an integrated biomedical informatics infrastructure, the cancer Biomedical Informatics Grid (caBIG)

    A network that will freely connect the entire cancer community

    caBIG would set up a node at C-DAC

    GARUDA Grid and BRAF resources may be used

    OA1 (GPR143), a GPCR

    Belongs to Class I GPCR, Rhodopsin family

    7TM receptors or heptahelical receptors

    An integral membrane glycoprotein of 404 aa

    Protein product of the ocular albinism type 1 gene

    Ocular albinism, an X-linked inherited disorder in which the eye lacks melanin pigment

    Homology-based approach along with CGMD simulation has been

    Collaborations: IIT Madras

    CGMD studies on GPCR

    Collaboration: Jubilant Biosys

    Simulate fragment binding sites by Molecular Dynamics simulation methods

    To identify the most probable site of interaction of chemical fragments in the protein

    Eight large simulations of 10 ns each were carried out

    Results handed over to Jubilant

    Collaboration: Nicholas Piramal

    Contract research project

    To understand protein-ligand interactions using Molecular Dynamics simulations

    Involves carrying out molecular dynamics simulations on very large biomolecular systems

    Benefits in designing better molecules for known drug targets

    Four 20 ns molecular dynamics simulations have been carried out

    Conclusion

    Biology is transforming from a science of observation and physical experiments into a computational science

    Bioinformatics - Exciting research area

    Challenges: Biology and Computer Science have different ways of working and need close collaboration

    Opportunities: new crops, personalized medicine, early diagnosis,

    Research Problems in Bioinformatics

    Find genomes of all organisms

    Identify and annotate all genes

    Compute the 3D structure from the sequence for all proteins

    Compare DNA / protein sequences for similarity

    Compare families of DNA / protein sequences

    Reason to be optimistic: Biology is finite

    ~30,000 human genes; ~1000 protein superfamilies

    but computer speeds keep increasing

    Business Opportunities

    Clinical research

    Gene therapy

    Molecular science

    Pharmaceutical companies - automated technologies to manufacture effective therapies and drugs, due to increasing concerns about drug safety and the stringent regulations that govern clinical trials for drug discovery

    Bioinformatics platform market growing at a very fast rate

    Global bioinformatics market: ~ $8.3 billion by 2014

    Knowledge management - 2009 - $1.3 billion

    Bioinformatics platforms market - 2014 - ~ $3.9 billion

    Business Opportunities

    Global bioinformatics market segments

    - Bioinformatics platforms

      - Sequence alignment platforms

      - Sequence manipulation platforms

      - Sequence analysis platforms

      - Structural analysis platforms

    - Content / Knowledge management tools

      - Specialized knowledge management tools

      - Generalized knowledge management tools

    - Services

      - Data Analysis

      - Sequencing Services

      - Database & Management services

    - Applications

    Thank You!