
Semantic Modeling of Biological Sequences

Sudha Ram
Eller Professor

Department of Management Information Systems

Eller School of Management

The University of Arizona

March 5, 2004

Road Map

• Background
• Semantics of DNA sequences and primary protein structures
• Semantics of 3-D protein structures
• Summary and Future Work

Background

• Human Genome Project (HGP): started in 1990 by the Department of Energy
  – Goal: to sequence the 24 distinct chromosomes comprising the human genome
  – Completed in April 2003, earlier than expected
• Achievements:
  – Determined the complete sequence of 3 billion DNA subunits and identified all human genes
  – Stored all the data in databases

Post-Genomic Era

“New generalizations and higher order biological laws are being approached but may be obscured by the simple mass of data”

---Morowitz et al., 1987

More Challenges

• Usage and analysis of the data requires:
  – Ad hoc and complicated queries
  – Efficient data browsing and retrieving
  – Integrated data sources
  – Effective and user-friendly data presentation
• Example query: find all genes that are structurally similar to a given gene and expressed similarly over a specific DNA microarray dataset

Current Databases

• Major DNA sequence databases:
  – GenBank (Gene Bank)
  – DDBJ (DNA Data Bank of Japan)
  – EMBL (European Molecular Biology Laboratory)
• Other databases:
  – Different types
  – Different scales
  – Different models

--Bioinformatics Databases and Systems

Current Data Models

• Data models:
  – Flatfile (ASN.1)
  – Relational
  – XML and its extensions (BSML)
  – Others
• Drawbacks?

Research Motivation

• Usage and analysis of the data requires:
  – Ad hoc and complicated queries
  – Efficient data browsing and retrieving
  – Integrated data sources
  – Effective and user-friendly data presentation
• Existing sequence/structure databases are not able to provide these capabilities:
  – The flatfile format hides the semantics of the data
  – Relationships and hierarchies are not clear
  – They don’t support ad hoc and complicated queries

DNA Sequences

• Linear sequences
  – DNA sequences
      Genetic information carrier
      Composed of nucleic acids (nucleotides)
  – Primary protein sequences
      Composed of amino acids

Protein Building Blocks

Proteins are the most important macromolecules in the factory of the living cell; they perform a wide range of biological tasks

A protein is built from 20 kinds of amino acids, also known as subunits or residues

Protein Structures

• Protein 3-D structures
  – Intermolecular and intramolecular chemical forces fold the linear primary sequence into a 3-D structure that reaches the minimum-energy, most stable state
  – Structures determine properties or functions

Levels of Protein Structures—I

Primary (Linear): each building block (amino acid) can be represented by a letter (of the English alphabet)

Secondary: The chain of covalently linked amino acids is further organized into regularly repeating patterns by hydrogen bonding

Levels of Protein Structures—II

Tertiary: Alpha helices and beta sheets fold themselves further into a "chain", cross-linking with one another via their side chains.

Quaternary: For proteins with more than one chain, interaction can occur between the chains themselves.

Previous Work

“A sequence is a mapping between a collection of similarly structured records and the positions of an ordering domain”

----Seshadri et al., 1995

Various sequences are just different Ordering Domain and Collection of Records combinations

[Figure: an Ordering Domain mapped to a Collection of Records. Labels: Record-Oriented View, Positional View, 1-Many Relationship, Many-1 Relationship.]
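The definition quoted on this slide can be made concrete with a small sketch. It treats a sequence as a mapping from positions of an ordering domain to records and derives the two views named in the figure. The function names `positional_view` and `record_oriented_view`, the sample data, and the use of Python dictionaries are illustrative assumptions, not part of the cited model.

```python
from collections import defaultdict

# A sequence maps positions of an ordering domain to records.
# Here the ordering domain is the integers 1..5 and the records are
# single-letter symbols (illustrative data only).
sequence = {1: "A", 2: "T", 3: "G", 4: "A", 5: "T"}


def positional_view(seq, position):
    """Start from a position and return the record stored there."""
    return seq[position]


def record_oriented_view(seq):
    """Start from the records and list the ordered positions of each one."""
    positions_by_record = defaultdict(list)
    for position in sorted(seq):
        positions_by_record[seq[position]].append(position)
    return dict(positions_by_record)


print(positional_view(sequence, 3))
print(record_oriented_view(sequence))
```

Both views read the same underlying mapping; they differ only in whether a query starts from a position or from a record.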

Time Sequences—I

“For time is just this: number of movements in respect to the ‘before’ and ‘after’”
---Aristotle

• We want to capture attributes of the movements
• We want to know the order/time of the movements
• Time is continuous
• Temporal databases deal with the semantics of ordered sequences of data values in the time domain

Time Sequences—II

“Time sequence is basically the sequence of values in the time domain for a single entity instance”

---Segev et al., 1987

• Time sequences can be:
  – Step-wise constant
  – Discrete
  – Continuous

[Figure: Time Points and Attribute Data Points, with cardinalities [1:M] and [0:1].]

Biological Sequences—III

Basic model for sequence can be adapted:

[Figure: Linear Order Domain (positions ..., 3, 4, 5, 6, 7, ...) and Collection of Nucleic Acids (A, C, G, T), with cardinalities [1:M] and [1:1].]

[Figure: Coordinates Domain ((X1, Y1, Z1), (X2, Y2, Z2), ..., (Xn, Yn, Zn)) and Collection of Amino Acids (A, B, ..., Y, Z), with cardinalities [1:M] and [1:1].]

Other Sequences

• Process sequences
  – Sequences of processes and subprocesses
• Multimedia sequences
  – Streams of multimedia data
  – Image, audio

Why a New Model for Biological Sequences?

• Time sequences are continuous; biological sequences are discrete
• Biological sequences carry more semantics, such as the relationship between sequences and their subsequences
• In time sequences, some time points have no data; in biological sequences, each position has its own data
• Usage of biological data requires that the sequence data be represented and analyzed in different ways

Relation to Gene Ontologies?

• An ontology defines the biological properties associated with sequence data, whereas we model the semantics of the sequence data itself
• No protein structure ontology exists
• Both contribute to database integration

What about Relational technology?

Relational technology doesn’t support data structures as complicated as biological sequences

Hierarchy and semantics are hidden in relations

We can never emphasize semantics too much!

DNA Sequences Semantic Model

DNA sequences and primary protein sequences:

[Figure: semantic model of DNA sequences and primary protein sequences]

Entity classes: ATOMS (Domain = {A1, A2, ..., Ai}) with subclasses (S) AMINO_ACID, NUCLEIC_ACID and OTHER_ATOMS; LINEARORDER (Domain = {X1, X2, ..., Xj}); SEQUENCES; SUBSEQUENCES; BIOFEATURES; FEATUREINFO; VERSIONS; POLYNUCLEOTIDE; BIODATABASES

Relationships: SequentialAggregate; Fragment (O); Is_featured_by; records; keeps; Are_maintained_for; Is_connected_to

Attributes: Atom_identifier, Atom_Symbol, Atom_Weight, Atom_Category, Amino_Acid_Name, Nucleic_Acid_Name, Sequence_ID, Sequence_Version, Sequence_Length, Sub_Sequence_ID, Start_From, End_At

Cardinalities shown: [1:1], [1:M], [0:M], [M:N]

SEQUENCES: $\{A_i X_j\},\ 1 \le i \le m,\ 1 \le j \le n$

SUBSEQUENCES: $(A_k X_l),\ k \le m,\ l \le n$

Ram, S. and Wei, W., Semantic Modeling of Biological Sequences. in Thirteenth Annual Workshop On Information Technology and Systems (WITS'03), Seattle, Washington, December 2003.

Entity Classes - I

• ATOMS
  – Superclass of families of atoms
  – Collection of atomic components of a biological sequence
  – Domain: possible components
• LINEARORDER
  – Set of positions (integers) in the sequence
  – Domain: (1, j), where j is the length of the sequence



Entity Classes - II

• SEQUENCES
  – Ordered list of (ATOM, LINEARORDER) pairs
• SUBSEQUENCES
  – Part of a sequence
  – Associated with biological activities

New Constructs-I

• Sequential Aggregate
  – An aggregation of ATOMS and LINEARORDER
  – Sequential because order matters
• Normal Aggregate
  – Indicates a whole-part relationship
  – Example: course and students

[Figure: SequentialAggregate applied to { (A, 1), (T, 2), (G, 3), (C, 4), (T, 5), (G, 6), (C, 7), (T, 8), (A, 9), (A, 10) } yields A-T-G-C-T-G-C-T-A-A]
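A minimal sketch of the Sequential Aggregate construct, using the ten (atom, position) pairs from this slide. The function name `sequential_aggregate` and the rule that positions must run from 1 to n without gaps or repeats are assumptions made for illustration; the slide itself only states that order matters.

```python
def sequential_aggregate(pairs):
    """Combine (atom, position) pairs into an ordered list of atoms.

    Assumption: positions must be exactly 1..n, so that every position
    holds exactly one atom. Otherwise a ValueError is raised.
    """
    ordered = sorted(pairs, key=lambda pair: pair[1])
    positions = [position for _, position in ordered]
    if positions != list(range(1, len(ordered) + 1)):
        raise ValueError("positions must be exactly 1..n")
    return [atom for atom, _ in ordered]


# The pairs are deliberately stored in an unordered set: the order of the
# result comes from the positions, not from the container.
pairs = {("A", 1), ("T", 2), ("G", 3), ("C", 4), ("T", 5),
         ("G", 6), ("C", 7), ("T", 8), ("A", 9), ("A", 10)}

sequence = sequential_aggregate(pairs)
print("-".join(sequence))  # A-T-G-C-T-G-C-T-A-A
```

A normal aggregate, by contrast, would be satisfied with the unordered set itself.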

New Constructs - II

• Fragment
  – Sequences are segmented
  – Fragments can overlap

[Figure: Fragment (O)]

Sequence: A-T-G-C-T-G-C-T-A-A-G-T-C-C-A-T-T-A-C-G-G-T-A
Fragment, 1st to 6th: A-T-G-C-T-G
Fragment, 3rd to 12th: G-C-T-G-C-T-A-A-G-T
Fragment, 9th to 23rd: A-A-G-T-C-C-A-T-T-A-C-G-G-T-A
Overlap, 3rd to 6th: G-C-T-G
Overlap, 9th to 12th: A-A-G-T
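The fragment example on this slide can be reproduced with a short sketch. It uses the same 23-letter sequence and the same three fragment ranges; the helper names `fragment` and `overlap` and the 1-based inclusive (start, end) convention are assumptions chosen to match the "1st to 6th" wording.

```python
from itertools import combinations

SEQUENCE = "ATGCTGCTAAGTCCATTACGGTA"  # the 23-letter example sequence


def fragment(seq, start, end):
    """Return the subsequence from start to end (1-based, inclusive)."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("fragment bounds fall outside the sequence")
    return seq[start - 1:end]


def overlap(a, b):
    """Return the (start, end) range shared by two fragments, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


bounds = [(1, 6), (3, 12), (9, 23)]
for a, b in combinations(bounds, 2):
    shared = overlap(a, b)
    if shared:
        print(a, b, "overlap", shared, fragment(SEQUENCE, *shared))
```

The first and third fragments share no positions, so only two overlaps are reported: 3rd to 6th (G-C-T-G) and 9th to 12th (A-A-G-T).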

Relationships

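• Ternary Sequential Aggregation
  [Figure: SequentialAggregate producing SEQUENCES, $\{A_i X_j\},\ 1 \le i \le m,\ 1 \le j \le n$; cardinalities shown: [1:1], [M:N], [1:M]]
• Fragment
  [Figure: Fragment (O) between SEQUENCES, $\{A_i X_j\},\ 1 \le i \le m,\ 1 \le j \le n$, and SUBSEQUENCES, $(A_k X_l),\ k \le m,\ l \le n$; cardinalities shown: [1:M], [0:M]]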

Utility of DNA Sequence Model

• Semantics of sequence data captured
• Ad hoc queries are possible

[Figure: excerpt of the DNA sequence model showing SEQUENCES, SUBSEQUENCES, BIOFEATURES, SequentialAggregate, Fragment (O) and Is_featured_by]

Example queries:
– Find a particular sequence and display a segment from 2nd to 200th
– How many subsequences are fragmented from a specific sequence?
– Find all the sequences that share one or more specific subsequences
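The three example queries can be sketched against a toy in-memory store. Everything here is hypothetical: the sequence IDs, the fragment records, and the function names are invented for illustration, and the tuple layout only loosely follows the Sequence_ID, Sub_Sequence_ID, Start_From and End_At attributes of the model. A real system would run such queries inside a database.

```python
# Hypothetical store: Sequence_ID -> residues.
SEQUENCES = {
    "S1": "ATGCTGCTAAGTCCATTACGGTA",
    "S2": "CCGCTGAT",
}

# Hypothetical fragment records: (Sub_Sequence_ID, Sequence_ID, Start_From, End_At).
FRAGMENTS = [
    ("F1", "S1", 1, 6),
    ("F2", "S1", 3, 6),
    ("F3", "S2", 3, 6),
]


def segment(sequence_id, start, end):
    """Query 1: positions start..end (1-based, inclusive) of one sequence."""
    return SEQUENCES[sequence_id][start - 1:end]


def count_subsequences(sequence_id):
    """Query 2: number of subsequences fragmented from one sequence."""
    return sum(1 for _, owner, _, _ in FRAGMENTS if owner == sequence_id)


def sequences_sharing(subsequence):
    """Query 3: IDs of sequences with a fragment equal to the given subsequence."""
    return sorted({owner for _, owner, start, end in FRAGMENTS
                   if segment(owner, start, end) == subsequence})


print(segment("S1", 2, 5))
print(count_subsequences("S1"))
print(sequences_sharing("GCTG"))
```

Because positions and fragments are explicit, each query is a direct lookup or filter rather than a text search over a flatfile.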

Protein Databases

• The Protein Data Bank, PDB (http://www.rcsb.org/pdb/), is the only worldwide archive of experimentally determined three-dimensional structures of proteins
• Data are stored in flatfiles
  – This format records the primary and secondary structure of proteins using groups of coordinates
  – It does not record the tertiary and quaternary structures
  – No relationships among structures at different levels are captured

Protein Structure Semantic Model

[Figure: protein structure semantic model]

Entity classes: ATOMS, RESIDUES, PRIMARYSTRUCTURE, SECONDARYSTRUCTURE, TERTIARYSTRUCTURE, QUATERNARYSTRUCTURE

Attributes: Atom_Symbol, Atom_Type, Atom_Serial Number, Residue_Serial_Number, Residue_Name, Residue_Symbol, Structure_ID, Protein_Name, Protein_MW, Experiment, Entry Date, Sequential_Length, Secondary_Structure_Type, Secondary_Structure_ID, Chain_Number

Annotated relationships:
– Sequential-Aggregate //LL,X/X<=LL
– Spatial-Aggregate //P(deg)/P(deg)/P(deg)//T(c)
– Spatial-Bonding //B(A1-A2)//BL(A)//BE(kcal/mol)

Cardinalities shown: [1:1], [1:M], [1:N]

Spatial-Bonding constraints, in the order they appear in the diagram:
– A1_Type = Amine_Hydrogen and A2_Type = Carbonyl_Oxygen; A1_Residue_Serial_Number (NE) A2_Residue_Serial_Number; if Secondary_Structure_Type = Alpha, |A1_Residue_Serial_Number - A2_Residue_Serial_Number| = 4
– If A1_Type = Amine_Hydrogen, then A2_Type = Hydroxyl_Oxygen; if A1_Type = Sulfur, then A2_Type = Sulfur; A1_Charge = Acidic & A2_Charge = Basic, OR A1_Charge = Basic & A2_Charge = Acidic; A1_Residue_Serial_Number (NE) A2_Residue_Serial_Number
– A1_Residue_Serial_Number (NE) A2_Residue_Serial_Number; A1_Chain_Number (NE) A2_Chain_Number

Entity Classes

ATOMS: This entity class models the chemical atoms (C, H, O, N, etc.) in the protein structure, each of them uniquely identified.

RESIDUES: This entity class represents amino acid subunits, which are the basic building blocks of protein structures.

PRIMARY STRUCTURE
SECONDARY STRUCTURE
TERTIARY STRUCTURE
QUATERNARY STRUCTURE

Relationships—I

Spatial-Aggregate

– P represents a point using x, y and z coordinates in degrees
– T is the temperature at which the structure is determined

[Figure: ATOMS and RESIDUES joined by Spatial-Aggregate, cardinalities [1:1] and [1:1], annotation //P(deg)/P(deg)/P(deg)//T(c)]

Relationships—II

Sequential-Aggregate

– LL is the list length
– X is the position of the residue in the list
– The position of any atom has to be less than or equal to the length

[Figure: RESIDUES and PRIMARYSTRUCTURE joined by Sequential-Aggregate, cardinalities [1:1] and [1:M], annotation //LL,X/X<=LL]
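A minimal sketch of how the annotation X <= LL could be enforced when residues are aggregated into a primary structure. The `Residue` class, the function name `primary_structure`, the sample residues, and the additional lower bound of 1 on positions are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Residue:
    serial_number: int  # X: position of the residue in the list
    symbol: str         # one-letter residue symbol


def primary_structure(residues, list_length):
    """Order residues by position, enforcing the annotation X <= LL.

    Assumption: positions are also required to be at least 1.
    """
    for residue in residues:
        if not 1 <= residue.serial_number <= list_length:
            raise ValueError(
                f"position {residue.serial_number} violates X <= LL "
                f"(LL = {list_length})"
            )
    ordered = sorted(residues, key=lambda r: r.serial_number)
    return "".join(r.symbol for r in ordered)


# Illustrative residues, given out of order on purpose.
residues = [Residue(2, "K"), Residue(1, "M"), Residue(3, "V")]
print(primary_structure(residues, list_length=3))  # MKV
```

Placing the check inside the aggregation step means a residue with an out-of-range position is rejected before it can become part of a stored structure.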

Relationships—III

• Spatial-Bonding
  – Represents the strength and length of the chemical forces among atoms
  – By describing the semantics of these bonds at each level using additional annotations, we can differentiate between these bonds as they apply to different levels of protein structures

Relationships—IV

An example of an annotated relationship, for secondary structures:
– B: Bond
– BE: Bond energy
– BL: Bond length
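The annotated Spatial-Bonding relationship can be sketched as a small record with a constraint check. The check follows the secondary-structure constraints listed in the protein structure model diagram, reading the alpha-helix rule as a difference of 4 between the two residue serial numbers. The class names, the function name, and the numeric bond values below are made-up illustrations, not data from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BondEnd:
    atom_type: str
    residue_serial_number: int


@dataclass(frozen=True)
class SpatialBonding:
    a1: BondEnd          # B(A1-A2): first atom of the bond
    a2: BondEnd          # B(A1-A2): second atom of the bond
    bond_length: float   # BL, unit written as A in the annotation
    bond_energy: float   # BE, in kcal/mol


def satisfies_secondary_constraints(bond, secondary_structure_type):
    """Check a bond against the secondary-structure constraints."""
    if bond.a1.atom_type != "Amine_Hydrogen":
        return False
    if bond.a2.atom_type != "Carbonyl_Oxygen":
        return False
    gap = abs(bond.a1.residue_serial_number - bond.a2.residue_serial_number)
    if gap == 0:  # the two atoms must belong to different residues (NE)
        return False
    if secondary_structure_type == "Alpha" and gap != 4:
        return False
    return True


# Made-up example values for illustration.
bond = SpatialBonding(
    a1=BondEnd("Amine_Hydrogen", 5),
    a2=BondEnd("Carbonyl_Oxygen", 1),
    bond_length=2.9,
    bond_energy=5.0,
)
print(satisfies_secondary_constraints(bond, "Alpha"))
```

Keeping the constraints in one function per structure level would mirror the idea that the same Spatial-Bonding relationship carries different annotations at different levels.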





Utility of Protein Structure Model

[Figure: excerpt of the protein structure model showing ATOMS, RESIDUES, PRIMARYSTRUCTURE and SECONDARYSTRUCTURE; Sequential-Aggregate //LL(X)/X<=Sequential_Length; Spatial-Aggregate //P(deg)/P(deg)/P(deg)//T(c); constraint A1_Residue_Serial_Number (NE) A2_Residue_Serial_Number; cardinalities shown: [1:1], [1:M], [1:N], [N:M]]

Example queries:
– Find the sequence of amino acids for this protein structure
– Give me all the hydrogen bonds that contribute to the secondary structure
– Find a set of forces similar to this one, and the resulting 3-D structure

New Operators based on Semantics

• Sequence
• Subsequence
• Aggregate
• Comparison of sequences and subsequences
  – Identical
  – Similar
  – Partial

Allen’s Predicates: Before, After, Meets, During, Starts, Finishes, Contains, Overlaps.
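A minimal sketch of the eight predicates named on this slide, applied to subsequences described by their (start, end) positions. Allen's relations were defined for time intervals; adapting them to 1-based inclusive position ranges is an assumption made here, in particular the choice that "meets" means the second range starts at the position immediately after the first one ends.

```python
# Each range is a (start, end) tuple of 1-based, inclusive positions.

def before(a, b):
    """a ends at least one position before b starts, leaving a gap."""
    return a[1] + 1 < b[0]

def after(a, b):
    return before(b, a)

def meets(a, b):
    """b starts at the position immediately following the end of a."""
    return a[1] + 1 == b[0]

def during(a, b):
    """a lies strictly inside b."""
    return b[0] < a[0] and a[1] < b[1]

def contains(a, b):
    return during(b, a)

def starts(a, b):
    """a and b start together and a ends first."""
    return a[0] == b[0] and a[1] < b[1]

def finishes(a, b):
    """a and b end together and a starts later."""
    return a[1] == b[1] and b[0] < a[0]

def overlaps(a, b):
    """a starts first and the two ranges share at least one position."""
    return a[0] < b[0] <= a[1] < b[1]


print(overlaps((1, 6), (3, 12)))  # True
print(before((1, 6), (9, 23)))    # True
```

With the fragment ranges from the earlier example, the 1st-to-6th fragment overlaps the 3rd-to-12th fragment and lies before the 9th-to-23rd fragment.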

Future Research

• Our ultimate goal is biological sequence database integration
• Additional semantic constructs
• Semantic reconciliation among databases
• Case studies
