Semantic Modeling of Biological Sequences
Sudha Ram, Eller Professor
Department of Management Information Systems
Eller School of Management
The University of Arizona
March 5, 2004
Road Map
– Background
– Semantics of DNA sequences and primary protein structures
– Semantics of 3-D protein structures
– Summary and Future Work
Background
Human Genome Project (HGP)
– Started in 1990 by the Department of Energy
– Goal: to sequence the 24 distinct chromosomes comprising the human genome
– Completed in April 2003, earlier than expected
Achievements:
– Determined the complete sequence of 3 billion DNA subunits and identified all human genes
– Stored all the data in databases
Post-Genomic Era
“New generalizations and higher order biological laws are being approached but may be obscured by the simple mass of data”
---Morowitz et al., 1987
More Challenges
Usage and analysis of the data requires:
– Ad hoc and complicated queries
– Efficient data browsing and retrieving
– Integrated data sources
– Effective and user-friendly data presentation
Example: find all genes that are structurally similar to a given gene and expressed similarly over a specific DNA microarray dataset
Current Databases
Major DNA sequence databases:
– GenBank (Gene Bank)
– DDBJ (DNA Data Bank of Japan)
– EMBL (European Molecular Biology Laboratory)
Other databases:
– Different types
– Different scales
– Different models
--Bioinformatics Databases and Systems
Current Data Models
Data models:
– Flatfile (ASN.1)
– Relational
– XML and its extensions (BSML)
– Others
• Drawbacks?
Research Motivation
Usage and analysis of the data requires:
– Ad hoc and complicated queries
– Efficient data browsing and retrieving
– Integrated data sources
– Effective and user-friendly data presentation
Existing sequence/structure databases are not able to provide these capabilities:
– Flatfile format hides the semantics of the data
– Relationships and hierarchies are not clear
– Ad hoc and complicated queries are not supported
DNA Sequences
Linear Sequences
DNA sequences
– Genetic information carrier
– Composed of nucleotides (nucleic acid subunits)
Primary protein sequences
– Composed of amino acids
Protein Building Blocks
Proteins are the most important macromolecules in the factory of the living cell, where they perform various biological tasks
A protein is built from 20 kinds of amino acids; within a protein chain these are also known as subunits or residues
Protein Structures
Protein 3-D structures
– Intermolecular and intramolecular chemical forces fold the linear primary sequence into a 3-D structure that reaches the minimum-energy, most stable state
– Structures determine properties or functions
Levels of Protein Structures—I
Primary (Linear): each building block (amino acid) can be represented by a letter (of the English alphabet)
Secondary: The chain of covalently linked amino acids is further organized into regularly repeating patterns, such as alpha helices and beta sheets, through hydrogen bonding
Levels of Protein Structures—II
Tertiary: Alpha helices and beta sheets fold themselves further into a "chain", cross-linking with one another via their side chains.
Quaternary: For proteins with more than one chain, interaction can occur between the chains themselves.
Previous Work
“A sequence is a mapping between a collection of similarly structured records and the positions of an ordering domain”
----Seshadri et al., 1995
Different kinds of sequences are simply different combinations of an Ordering Domain and a Collection of Records
[Figure: mapping between an Ordering Domain and a Collection of Records. Labels: 1-Many Relationship, Many-1 Relationship, Record-Oriented View, Positional View]
Time Sequences—I
“For time is just this: number of movements in respect to the ‘before’ and ‘after’” (Aristotle)
– We want to capture attributes of the movements
– We want to know the order/time of the movements
Time is continuous
Temporal databases deal with the semantics of ordered sequences of data values in the time domain.
Time Sequences—II
“Time sequence is basically the sequence of values in the time domain for a single entity instance”
---Segev et al., 1987
Time sequences can be:
– Step-wise constant
– Discrete
– Continuous
[Figure: Time Points mapped to Attribute Data Points; cardinalities [1:M] and [0:1]]
Biological Sequences—III
Basic model for sequence can be adapted:
[Figure: Linear Order Domain (positions ..., 3, 4, 5, 6, 7, ...) mapped to a Collection of Nucleic Acids (A, C, G, T); cardinalities [1:M] and [1:1]]
[Figure: Coordinates Domain ((X1, Y1, Z1), (X2, Y2, Z2), ..., (Xn, Yn, Zn)) mapped to a Collection of Amino Acids (A, B, ..., Y, Z); cardinalities [1:M] and [1:1]]
Other Sequences
Process sequences
– Sequences of processes and subprocesses
Multimedia sequences
– Streams of multimedia data (image, audio)
Why a New Model for Biological Sequences?
Time sequences are continuous; biological sequences are discrete
Biological sequences carry more semantics, such as the relationship between sequences and their subsequences
In time sequences, some time points have no data; in biological sequences, every position has its own data
Usage of biological data requires that the sequence data be represented and analyzed in different ways
Relation to Gene Ontologies?
An ontology defines the biological properties associated with sequence data, whereas we model the semantics of the sequence data itself
No protein structure ontology exists
Both contribute to database integration.
What about Relational technology?
Relational technology doesn’t support data structures as complicated as biological sequences
Hierarchy and semantics are hidden in relations
We can never emphasize semantics too much!
DNA Sequences Semantic Model
DNA sequences and primary protein sequences:
[Figure: semantic model diagram]
Entity classes: ATOMS (Domain = {A1, A2, ..., Ai}) with subclasses AMINO_ACID, NUCLEIC_ACID, and OTHER_ATOMS; LINEARORDER (Domain = {X1, X2, ..., Xj}); SEQUENCES; SUBSEQUENCES; POLYNUCLEOTIDE; BIOFEATURES; FEATUREINFO; VERSIONS; BIODATABASES
Relationships: Sequential Aggregate; Fragment (O); Is_featured_by; records; keeps; Are_maintained_for; Is_connected_to
Attributes shown: Atom_identifier, Atom_Symbol, Atom_Weight, Atom_Category, Amino_Acid_Name, Nucleic_Acid_Name, Sequence_ID, Sequence_Version, Sequence_Length, Sub_Sequence_ID, Start_From, End_At
Annotations: SEQUENCES is labeled $(A_i, X_j)$, $1 \le i \le m$, $1 \le j \le n$; SUBSEQUENCES is labeled $(A_k, X_l)$, $k \le m$, $l \le n$
Cardinalities shown: [1:1], [1:M], [0:M], [M:N]
Ram, S. and Wei, W., Semantic Modeling of Biological Sequences. in Thirteenth Annual Workshop On Information Technology and Systems (WITS'03), Seattle, Washington, December 2003.
Entity Classes - I
ATOMS
– Superclass of families of atoms
– Collection of atomic components of a biological sequence
– Domain: possible components, {A1, A2, ..., Ai}
LINEARORDER
– Set of positions (integers) in the sequence
– Domain: (1, j), where j is the length of the sequence, {X1, X2, ..., Xj}
Entity Classes - II
SEQUENCES
– Ordered list of (ATOM, LINEARORDER) pairs
SUBSEQUENCES
– Part of a sequence
– Associated with biological activities
New Constructs-I
Sequential Aggregate
– It is an aggregation of ATOMS and LINEARORDER
– It is sequential because order matters
Normal Aggregate
– Indicates a whole-part relationship
– Example: course and students
Example of a Sequential Aggregate:
A-T-G-C-T-G-C-T-A-A
{ (A, 1), (T, 2), (G, 3), (C, 4), (T, 5), (G, 6), (C, 7), (T, 8), (A, 9), (A, 10) }
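The Sequential Aggregate example can be sketched in a few lines of Python. This is only an illustration of pairing each atom with its position in the linear order; the function names `sequential_aggregate` and `to_string` are hypothetical and do not come from the slides.

```python
def sequential_aggregate(symbols):
    """Pair each atom symbol with its 1-based position in the linear order."""
    return [(atom, position) for position, atom in enumerate(symbols, start=1)]


def to_string(pairs):
    """Rebuild the dash-separated sequence, ordering the pairs by position."""
    ordered = sorted(pairs, key=lambda pair: pair[1])
    return "-".join(atom for atom, _ in ordered)


# The ten-letter example sequence from the slide.
pairs = sequential_aggregate("ATGCTGCTAA")
print(pairs)
print(to_string(pairs))
```

Because every pair carries its own position, the original order can be recovered even if the pairs are stored unordered, which is the sense in which order matters here.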
New Constructs - II
Fragment
– Sequences are segmented
– Fragments can overlap: Fragment (O)
Example sequence:
A-T-G-C-T-G-C-T-A-A-G-T-C-C-A-T-T-A-C-G-G-T-A
Fragments:
– 1st to 6th: A-T-G-C-T-G
– 3rd to 12th: G-C-T-G-C-T-A-A-G-T
– 9th to 23rd: A-A-G-T-C-C-A-T-T-A-C-G-G-T-A
Overlaps:
– 3rd to 6th: G-C-T-G
– 9th to 12th: A-A-G-T
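The fragment example can be reproduced with a short Python sketch. It assumes fragments are identified by 1-based, inclusive start and end positions (as in "1st to 6th"); the helper names `fragment` and `overlap` are hypothetical.

```python
SEQUENCE = "ATGCTGCTAAGTCCATTACGGTA"  # the 23-letter example sequence


def fragment(sequence, start, end):
    """Return the subsequence covering 1-based positions start..end inclusive."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("fragment bounds must lie within the sequence")
    return sequence[start - 1:end]


def overlap(a, b):
    """Return the (start, end) range shared by two fragments, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


fragments = [(1, 6), (3, 12), (9, 23)]
for start, end in fragments:
    print(start, end, fragment(SEQUENCE, start, end))

# Overlap between the first two fragments and between the last two.
for a, b in zip(fragments, fragments[1:]):
    shared = overlap(a, b)
    print(a, b, shared, fragment(SEQUENCE, *shared))
```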
Relationships
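Ternary: Sequential Aggregation
[Figure: ATOMS (Domain = {A1, A2, ..., Ai}) and LINEARORDER (Domain = {X1, X2, ..., Xj}) joined by Sequential Aggregate into SEQUENCES, labeled $(A_i, X_j)$, $1 \le i \le m$, $1 \le j \le n$; cardinalities [1:1], [M:N], [1:M]]
Fragment:
[Figure: SEQUENCES, labeled $(A_i, X_j)$, $1 \le i \le m$, $1 \le j \le n$, and SUBSEQUENCES, labeled $(A_k, X_l)$, $k \le m$, $l \le n$, joined by Fragment (O); cardinalities [1:M], [0:M]]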
Utility of DNA Sequence Model
– Semantics of sequence data are captured
– Ad hoc queries are possible
[Figure: excerpt of the DNA sequence semantic model showing ATOMS, LINEARORDER, SEQUENCES, SUBSEQUENCES, and BIOFEATURES with the Sequential Aggregate, Fragment (O), and Is_featured_by relationships]
Example queries:
– Find a particular sequence and display the segment from the 2nd to the 200th position
– How many subsequences are fragmented from a specific sequence?
– Find all the sequences that share one or more specific subsequences
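As a rough illustration of why these queries become straightforward once sequences and subsequences are modeled explicitly, here is a minimal in-memory sketch in Python. The class `Sequence`, the helper functions, and the three toy records are all hypothetical; only the attribute ideas (a sequence identifier, and subsequences with start and end positions) come from the model.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    sequence_id: str
    residues: str
    # Each fragment is (sub_sequence_id, start_from, end_at), 1-based inclusive.
    fragments: list = field(default_factory=list)

    def segment(self, start, end):
        """Return positions start..end of this sequence."""
        return self.residues[start - 1:end]

    def subsequence_strings(self):
        """Return the set of residue strings covered by the fragments."""
        return {self.segment(s, e) for _, s, e in self.fragments}


def find(db, sequence_id):
    """Query 1 helper: locate one sequence by its identifier."""
    return next(s for s in db if s.sequence_id == sequence_id)


def count_subsequences(db, sequence_id):
    """Query 2: how many subsequences are fragmented from a sequence."""
    return len(find(db, sequence_id).fragments)


def sharing(db, wanted):
    """Query 3: ids of sequences sharing at least one wanted subsequence."""
    return [s.sequence_id for s in db if wanted & s.subsequence_strings()]


# Toy data, invented for illustration only.
db = [
    Sequence("S1", "ATGCTGCTAA", [("F1", 1, 6), ("F2", 3, 10)]),
    Sequence("S2", "GGATGCTGTT", [("F3", 3, 8)]),
    Sequence("S3", "CCCCAAAA", [("F4", 1, 4)]),
]

print(find(db, "S1").segment(2, 5))
print(count_subsequences(db, "S1"))
print(sharing(db, {"ATGCTG"}))
```

A real system would answer these through the database's query language rather than Python lists; the sketch only shows that each query maps onto one relationship in the model.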
Protein Databases
The Protein Data Bank, PDB (http://www.rcsb.org/pdb/) is the only worldwide archive of experimentally determined three-dimensional structures of proteins.
– Data are stored in flatfiles
– This format records the primary and secondary structure of proteins using groups of coordinates
– It does not record the tertiary and quaternary structures
– No relationships among structures at different levels are captured
Protein Structure Semantic Model
[Figure: protein structure semantic model diagram]
Entity classes: ATOMS, RESIDUES, PRIMARYSTRUCTURE, SECONDARYSTRUCTURE, TERTIARYSTRUCTURE, QUATERNARYSTRUCTURE
Attributes shown: Atom_Symbol, Atom_Type, Atom_Serial Number, Residue_Serial_Number, Residue_Name, Residue_Symbol, Structure_ID, Protein_Name, Protein_MW, Experiment, Entry Date, Sequential_Length, Secondary_Structure_ID, Secondary_Structure_Type, Chain_Number
Annotated relationships:
– Spatial-Aggregate//P(deg)/P(deg)/P(deg)//T(c)
– Sequential-Aggregate//LL,X/X<=LL
– Spatial-Bonding//B(A1-A2)//BL(A)//BE(kcal/mol), with the constraints:
    A1_Type = Amine_Hydrogen
    A2_Type = Carbonyl_Oxygen
    A1_Residue_Serial_Number (NE) A2_Residue_Serial_Number
    |A1_Residue_Serial_Number - A2_Residue_Serial_Number| = 4 if Secondary_Structure_Type = Alpha
– Spatial-Bonding//B(A1-A2)//BL(A)//BE(kcal/mol), with the constraints:
    If A1_Type = Amine_Hydrogen, then A2_Type = Hydroxyl_Oxygen
    If A1_Type = Sulfur, then A2_Type = Sulfur
    (A1_Charge = Acidic & A2_Charge = Basic) OR (A1_Charge = Basic & A2_Charge = Acidic)
    A1_Residue_Serial_Number (NE) A2_Residue_Serial_Number
– Spatial-Bonding//B(A1-A2)//BL(A)//BE(kcal/mol), with the constraints:
    If A1_Type = Amine_Hydrogen, then A2_Type = Hydroxyl_Oxygen
    If A1_Type = Sulfur, then A2_Type = Sulfur
    (A1_Charge = Acidic & A2_Charge = Basic) OR (A1_Charge = Basic & A2_Charge = Acidic)
    A1_Residue_Serial_Number (NE) A2_Residue_Serial_Number
    A1_Chain_Number (NE) A2_Chain_Number
Cardinalities shown: [1:1], [1:M], [1:N]
Entity Classes
ATOMS: This entity class is used to model chemical atoms (C, H, O, N etc) in the protein structure with each of them identified uniquely.
RESIDUES: This entity class represents amino acid subunits, which are the basic building blocks of protein structures.
PRIMARY STRUCTURE
SECONDARY STRUCTURE
TERTIARY STRUCTURE
QUATERNARY STRUCTURE
Relationships—I
Spatial-Aggregate
– P represents a point using x, y, and z coordinates in degrees
– T is the temperature at which the structure is determined
[Figure: ATOMS [1:1] and RESIDUES [1:1] joined by Spatial-Aggregate//P(deg)/P(deg)/P(deg)//T(c)]
Relationships—II
Sequential-Aggregate
– LL is the list length
– X is the position of the residue in the list
– The position of any residue has to be less than or equal to the list length (X <= LL)
[Figure: RESIDUES and PRIMARYSTRUCTURE joined by Sequential-Aggregate//LL,X/X<=LL; cardinalities [1:1] and [1:M]]
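The annotation X <= LL is an integrity constraint on the aggregate, and a small Python sketch shows how it could be enforced when a primary structure is assembled. The function name `build_primary_structure` and the three-residue input are hypothetical.

```python
def build_primary_structure(residues, list_length):
    """Attach 1-based positions X to residues, enforcing X <= LL."""
    structure = []
    for x, residue in enumerate(residues, start=1):
        if x > list_length:
            raise ValueError(f"position {x} exceeds list length {list_length}")
        structure.append((residue, x))
    return structure


# Three residue symbols with a declared list length of 3.
print(build_primary_structure("MKV", 3))
```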
Relationships—III
Spatial-Bonding
– Represents the strength and length of the chemical forces among atoms
– By describing the semantics of these bonds at each level using additional annotations, we can differentiate between the bonds as they apply to different levels of protein structure
Relationships—IV
An example of an annotated relationship, for secondary structures
– B: bond
– BE: bond energy
– BL: bond length
Spatial-Bonding//B(A1-A2)//BL(A)//BE(kcal/mol)
– A1_Type = Amine_Hydrogen
– A2_Type = Carbonyl_Oxygen
– A1_Residue_Serial_Number (NE) A2_Residue_Serial_Number
– |A1_Residue_Serial_Number - A2_Residue_Serial_Number| = 4 if Secondary_Structure_Type = Alpha
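The annotation for secondary structures reads naturally as a predicate over a pair of atoms. The Python sketch below is one possible encoding, assuming that (NE) means "not equal" and that the distance constraint compares the residue serial number of A1 with that of A2. The class `BondAtom` and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BondAtom:
    atom_type: str
    residue_serial_number: int


def satisfies_secondary_annotation(a1, a2, secondary_structure_type):
    """Check the Spatial-Bonding constraints listed for secondary structures."""
    # A1 must be an amine hydrogen and A2 a carbonyl oxygen.
    if a1.atom_type != "Amine_Hydrogen" or a2.atom_type != "Carbonyl_Oxygen":
        return False
    # The two atoms must belong to different residues.
    if a1.residue_serial_number == a2.residue_serial_number:
        return False
    # In an alpha helix the bonded residues are exactly four positions apart.
    if secondary_structure_type == "Alpha":
        gap = abs(a1.residue_serial_number - a2.residue_serial_number)
        return gap == 4
    return True


donor = BondAtom("Amine_Hydrogen", 9)
acceptor = BondAtom("Carbonyl_Oxygen", 5)
print(satisfies_secondary_annotation(donor, acceptor, "Alpha"))
```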
Utility of Protein Structure Model
[Figure: excerpt of the protein structure semantic model showing ATOMS, RESIDUES, PRIMARYSTRUCTURE, and SECONDARYSTRUCTURE with the annotated Spatial-Aggregate, Sequential-Aggregate, and Spatial-Bonding relationships]
Example queries:
– Find the sequence of amino acids for this protein structure
– Give me all the hydrogen bonds that contribute to the secondary structure
– Find a set of forces similar to this one, and the resulting 3-D structure
New Operators based on Semantics
– Sequence
– Subsequence
– Aggregate
Comparison of Sequences and Subsequences
– Identical
– Similar
– Partial
Allen’s Predicates: Before, After, Meets, During, Starts, Finishes, Contains, Overlaps.
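To make the predicates concrete, the following Python sketch applies them to subsequences described by 1-based, inclusive (start, end) positions. The slides do not define the predicates for discrete positions, so the treatment here is an assumption: two subsequences "meet" when one starts at the position immediately after the other ends, and "before" requires a gap of at least one position.

```python
def before(a, b):
    """a ends with at least one position between it and the start of b."""
    return a[1] + 1 < b[0]


def after(a, b):
    return before(b, a)


def meets(a, b):
    """b starts at the position immediately following the end of a."""
    return a[1] + 1 == b[0]


def during(a, b):
    """a lies strictly inside b."""
    return b[0] < a[0] and a[1] < b[1]


def contains(a, b):
    return during(b, a)


def starts(a, b):
    """a and b start together and a ends first."""
    return a[0] == b[0] and a[1] < b[1]


def finishes(a, b):
    """a and b end together and a starts later."""
    return a[1] == b[1] and a[0] > b[0]


def overlaps(a, b):
    """a starts first and b starts inside a but ends after it."""
    return a[0] < b[0] <= a[1] < b[1]


# Fragment ranges of the form used in the fragment example: 1-6, 3-12, 9-23.
first, second, third = (1, 6), (3, 12), (9, 23)
print(overlaps(first, second), overlaps(second, third), before(first, third))
```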