
Master Project‐Learning Database Computer Science Yuanjian Wang Zufferey

Database Laboratory

Report of Master Project Learning Database

Professor: Spaccapietra Stefano

Assistant: Fabio Porto

Student: Yuanjian Wang Zufferey



Summary

1. INTRODUCTION
2. RELATED WORKS
3. CONCEPT MODEL
3.1 Project Architecture
3.2 UML Definition
3.2.1 Biological Neuron
3.2.2 Computational Model
3.2.3 Simulation Model
3.3 Definition of XML Schema
3.3.1 LearningDatabase
3.3.2 BioNeuron
3.3.3 NeuroModel
3.3.4 Hypotheses
3.3.5 Simulations
3.3.6 Constraints
4. LEARNING ALGORITHM
5. IMPLEMENTATION
5.1 Architecture of the Oracle Implementation
5.1.2 Computational Model
5.1.3 Biological question
5.1.4 Simulation
5.1.5 Competitive Learning Tables
5.1.6 Definition of Views
6. QUERIES
7. PERFORMANCE ANALYSIS
8. CONCLUSION
9. ACKNOWLEDGEMENT
REFERENCE



1. Introduction

I proposed this project to Prof. Stefano Spaccapietra and Dr. Fabio Porto because I believe that computer science has to behave as a tool that serves one or more application domains, such as chemistry, biology or finance. As a Master student who wants to specialize in Bio-Computing, I have taken the corresponding courses, including basic molecular biology, computational biology, and machine learning, the last of which I am most enthusiastic about. At the same time, my interest in database technology has never decreased. Combining the two techniques to provide useful services for the biological domain became my first idea for the Master project. In discussion with Dr. Fabio Porto, the need for a well-designed database schema for the Blue Brain Project was suggested. On further consideration, a second possible requirement was proposed: applying machine learning algorithms inside the database to automatically cluster unstructured, high-dimensional data such as experiment results and simulation results. This gave me the opportunity to design and realize an application system that combines the biological and computational aspects of neuroscience. With the very fast growth of knowledge in neuroscience, it is becoming more and more urgent to find a well-formed storage and retrieval tool to save and share such knowledge among neuroscientists, so that we can take advantage of new discoveries and share information quickly and reliably.

The basic requirement comes from neuroscience. We need an extensible database schema which can save and retrieve neural information, including both biological and mathematical information. One group of users is composed of biologists who work on neuron experiments and try to find the biological functions of neurons. The second group is composed of mathematicians, physicists, and computer scientists who create and manage the computational models that describe the electrical activity of neural cells. The advanced requirement is to make the two groups of scientists understand each other and, furthermore, to help them work better together. As the two groups do not interpret the same biological or simulation data in the same way, we have to find a method to translate or map corresponding terms that have different representations but the same meaning. Biologists are interested in the characteristics of the biological parts of neurons, for example the neural network, the type of cells, and the characteristics of the ion channels, receptors or transmitters of a neuron. Physicists are more interested in the potential change on the membrane and abstract each neuron as a unit. A computational model usually simulates part of the biological data, and a biologist may tune an experiment by cheap simulation on a computer before the real experiment on neural cells. Conversely, a physicist or mathematician can look for and test a biological interpretation after building a computational model. How do they find the possible hidden similarity between biological data and simulated data? This project proposes a solution: build two basic database schemas, one for each group, to store the common knowledge of biology and computation, and then build a bridge between biologists and physicists or mathematicians on the experiment and simulation data.

The most common method is to draw a curve from each set of recorded data and to see how well the two curves fit each other. But how can we run queries on a mass of data to find such similarity in the database, and in a reliable way?



Here we propose a neural learning database solution. Given biological neural data and simulation results, our database can learn the similarity between the two types of data and cluster them together. We answer queries such as:

• Is there any computational model whose simulation results are similar to the given biological experiment results? If yes, show the most similar results and models in a comparable way (for example, curves and graphs).

• Are there any biological results that fit the given computational simulation result well? If yes, show the most similar results in a comparable way (for example, curves).

The neural learning database will be implemented with a learning ability. It is able to cluster different data series based on the data content itself and on certain predefined criteria. At the same time, it can be reinforced by external positive or negative feedback from users. For example, when a proposed query result is strongly confirmed by the user, the correlation of the results is increased; in the negative case, the correlation is reduced. As the amount of learned data grows, the precision of the clustering can improve.

2. Related works

ModelDB [11] is an example of a common knowledge database that stores the computational aspect of neuroscience. It provides computational models classified by neuronal composition. The advantage of ModelDB is that we can find computational models in various formats, such as NEURON, Java, MATLAB, C++, etc. The inconvenience is that we cannot easily compare different formats directly. The definition of a computational model (equations, definitions of variables, parameters, etc.) is not easy to read. The connection between biological structure and model application cannot be found. At the same time, we cannot find similar simulation or experiment results across different computational models.

Traditional database management systems aim to classify data by certain predefined criteria. We usually need to know the structure of the data well in advance, and have to carefully define the structures and the detailed procedures that abstract the data for storage and classification. Semantic Web applications use a well-defined conceptual schema to supply annotations (knowledge markup technologies) that can be recognized by semantic analysis tools, so that web content can be classified based on the annotations. Based on a very carefully designed schema, ontology technology can mine valuable, high-quality ontological resources. Obviously, multimedia data without well-defined annotations cannot be mined by ontology.

Recently, machine learning has been widely used in all kinds of domains, such as image recognition, language processing, syntactic pattern recognition, search engines, medical diagnosis, bioinformatics and cheminformatics. In information retrieval systems, machine learning and clustering can be applied to categorize content-based results or to rank them more meaningfully.
As a broad subfield of artificial intelligence, machine learning [13] is concerned with the design and development of algorithms and techniques that allow computers to "learn". The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods. Machine learning is therefore closely related not only to data mining and statistics, but also to theoretical computer science. Generally, there are three kinds of algorithms: supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning aims to solve the classification problem. A supervised learning procedure usually begins by creating a function from the training data (inputs and their outputs), and then predicts the value of this function for any valid input (not necessarily part of the training set). The principle is generalization from the training set to unseen inputs. It learns the behavior of a function which maps a vector [X1, X2, …, XN] into one of several predefined classes by fitting several input-output examples of the function. The difficulty in applying supervised learning is that we have to know the possible classes in advance and need a set of classified examples for training. The most important supervised learning machine is the Artificial Neural Network. A simple artificial neural network with one neuron that behaves as a perceptron can be formalized as follows (Figure-1):

Given a training set $\{(X^{\mu}, t^{\mu}) \mid X^{\mu} = (x_0, x_1, x_2, \dots, x_n),\ t^{\mu} \in \{1, -1\},\ \mu = 1, 2, \dots, p\}$, where $t^{\mu}$ is the expected output.

Output: $Y = f(X)$; the task is to find $f(X)$ from the training set.

Prediction: $Y' = f(X')$, where $X'$ is not in the training set.

Figure‐ 1‐ A simple artificial neural network for supervised learning‐ perceptron learning
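The perceptron formalization above can be illustrated with a short sketch. This is not code from the project: the function names, the learning rate `eta`, the convention that `x[0] == 1` acts as the bias input, and the logical AND training set are all assumptions chosen for illustration.

```python
def predict(w, x):
    """Return +1 or -1 for input x under weights w (a tie at 0 maps to +1)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= 0 else -1


def train_perceptron(samples, epochs=100, eta=1.0):
    """Classic perceptron rule: on each mistake, move w by eta * t * x.

    samples is a list of (x, t) pairs, with x a tuple whose first
    component is the constant bias input 1 and t in {1, -1}.
    """
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        mistakes = 0
        for x, t in samples:
            if predict(w, x) != t:
                w = [wi + eta * t * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # every training pattern is classified correctly
            break
    return w


# Hypothetical training set: logical AND, encoded with targets in {1, -1}.
and_samples = [
    ((1, 0, 0), -1),
    ((1, 0, 1), -1),
    ((1, 1, 0), -1),
    ((1, 1, 1), 1),
]
weights = train_perceptron(and_samples)
```

The sketch only covers linearly separable data; for other data the loop simply stops after `epochs` passes.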

In unsupervised learning, an agent models a set of inputs for which classes and typical examples are not available. The common form is clustering, which is sometimes not probabilistic. The number of clusters adapts to the problem size, and the user can control the degree of similarity between members of the same cluster by means of a user-defined constant called the vigilance parameter. A typical Hebbian neural network is shown in Figure-2. Hebbian learning can be summarized as "cells that fire together, wire together". Starting from this point, we can put the cells that fired together in the same cluster.

Figure‐ 2‐ A simple artificial neural network for unsupervised learning: Hebbian learning
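As a rough illustration of "cells that fire together, wire together", the sketch below applies one plain Hebbian step to a single linear output unit. The function name, the learning rate `eta` and the example numbers are assumptions for illustration; the report does not specify the exact rule used in Figure-2.

```python
def hebbian_update(weights, x, eta=0.1):
    """One plain Hebbian step: w_i <- w_i + eta * y * x_i, with y = w . x."""
    y = sum(w * xi for w, xi in zip(weights, x))
    return [w + eta * y * xi for w, xi in zip(weights, x)]


# Inputs 0 and 1 are active together; input 2 is silent.
old_weights = [0.5, 0.5, 0.0]
new_weights = hebbian_update(old_weights, [1.0, 1.0, 0.0])
```

Only the weights of the co-active inputs grow, while the weight of the silent input is unchanged. The plain rule is unbounded, so practical variants usually add some form of normalization.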



Reinforcement learning is concerned with how an agent ought to take actions in an environment so as to maximize some notion of long-term reward. It is studied in the domain of real-time systems. Machine learning algorithms, especially supervised and unsupervised ones, make it possible for a database to learn in advance (supervised) or to learn progressively (unsupervised). As in this project we cannot obtain classified examples to train the database with a supervised algorithm, we have to look at unsupervised algorithms.

In the following sections, we first propose an Object-Oriented schema in UML and XML representations (Section 3). In Section 4 we describe the common clustering algorithms. We then introduce the detailed design (implementation) in Oracle in Section 5. In Section 6 we describe the queries the system supports. The performance of the clustering algorithm, an unsupervised competitive learning implementation, is shown in Section 7. We conclude the project in Section 8.



3. Concept Model

The fundamental work of this project is to design a well-formed data structure to store the necessary information. It not only provides the storage base, but also formalizes some workflows. The storage base is composed of two parts: neuronal biological information and computational model information. The workflows concern the procedure of constructing a computational model and the simulation procedure.

3.1 Project Architecture

Figure‐3‐ Architecture of Learning Database Project

Figure-3 shows the architecture of this project. Generally speaking, the project is defined at two levels: a UML or XML Object-Oriented class definition and a Relational Database definition. In the UML or XML Object-Oriented definition, we take the user's point of view and specify two groups of users: biologists in neuroscience, and scientists working in mathematics, physics, and computer science. Thus two types of information are modeled, as Biological Neurons and Computational Models. The biologists can retrieve the biological information about neurons (1. Biological Queries). Similarly, the other group of scientists can query the computational model information about neurons (2. Computational Queries). Derived information, such as the biological experiments and computational simulations, is also available. On this derived information, biologists or other scientists can not only search the experiments or simulations (3. Experiment or Simulation Queries), but also run similarity queries, that is, find the cluster information for a given experiment result or simulation result (4. Data Cluster Queries).

(Labels in Figure-3: Knowledge Database; Learning Database; Biological Neurons; Computational Models; Biological Experiments; Computational Simulations; XML or UML Object-Oriented Definition; XML or UML Object-Oriented Interface Layer; Relational Database Definition; 1. Biological Queries; 2. Computational Queries; 3. Experiment or Simulation Queries; 4. Data Cluster Queries; 5. Learning Process.)



In the Relational Database definition, we implement the Object-Oriented definition of UML or XML in a relational way. We distinguish two kinds of implementation. One is the Knowledge Database, which stores the biological neural information (including the neuron information and experiment information) and the computational neural information (combining the computational model with the simulation information). The other is the Learning Database, which learns from the simulation and experiment results to find the similarity between them and form clusters (5. Learning Process). The biologists or other scientists can then query for similar results of experiments or simulations.

We start by introducing the Object-Oriented conceptual model, expressed in two common representations, UML and XML, in the following subsections. In this step of the work, some of the terminology is taken from ModelDB and some from wiki definitions.

3.2 UML Definition

3.2.1 Biological Neuron

Firstly, we need to know what a neural cell is. Neurons [3] are electrically excitable cells in the nervous system that process and transmit information. Neurons are the core components of the brain, the spinal cord in vertebrates, the ventral nerve cord in invertebrates, and the peripheral nerves. Neurons are typically composed of a soma, or cell body, a dendritic tree and an axon (Figure-4). The majority of vertebrate neurons receive input on the cell body and dendritic tree, and transmit output via the axon. Neurons communicate via chemical and electrical synapses, in a process known as synaptic transmission. The fundamental process that triggers synaptic transmission is the action potential, a propagating electrical signal that is generated by exploiting the electrically excitable membrane of the neuron. The electrical properties of the ion channels and receptors on the membrane together determine the electrical properties of the neuron (Figure-5). Neurotransmitters are chemicals that are used to relay, amplify and modulate signals between a neuron and another cell.

Figure-4- A typical neural cell (figure from http://www.cs.nott.ac.uk/)

Figure-5- Receptor and ion channels (figure from http://www.neuropsychopathologie.fr)

In the UML model, a biological neuron [2] (Figure-6(a)) can be composed of multiple compartments: a soma, an axon (which can itself be composed of hillock, stem, and terminal sub-compartments), and dendrites (each of which can be composed of proximal, middle and distal sub-compartments). Additional compartments could be added in the future. Each compartment has some electrical properties on the membrane, such as the neurotransmitter receptor type (for example, ionotropic receptor or metabotropic receptor), the ion channel (for example, Na+, K+, Ca2+) or the transmitter type (for example, acetylcholine, the biogenic amines, the amino acid transmitters, etc.). For each property of the compartment we may be interested in the states of some measurements, such as membrane potential, membrane capacitance, conduction velocity of the axon, etc. For each neuron, we are interested in where it is (organ), which category it belongs to (classification), and what its functions are. We can describe the interoperation among the molecular information, such as the coding gene, the microscope image for visualization, and the experiments carried out to understand its dynamic characteristics.

Figure-6- (a) UML representation of a biological neural cell; (b) an instance of Neuron in UML.

A concrete instance of a neuron is shown in Figure-6(b). Under different classification criteria, a neuron can belong to different neuron classes. An organ contains numerous neurons. Each neuron can be composed of different compartments that carry specific electrical properties.



Figure‐7‐ A pyramidal neural cell

For example, a pyramidal cell [5] (Figure-7), also called a pyramidal neuron or projection neuron, is a multipolar neuron (Nueron_Classification) located in the hippocampus and cerebral cortex (OrganInstance). These cells have a triangularly shaped soma, or cell body, a single apical dendrite extending towards the pial surface, multiple basal dendrites, and a single axon (Compartment). The K+ channels (ElectricalProperty) on the dendrites of pyramidal cells are often studied.

3.2.2 Computational Model

Figure‐8‐ UML Representation of a neural computational model

Let us look at an example of a computational model, taken from ModelDB [11]: the Simple Model of Spiking Neurons [6], which combines the biological plausibility of Hodgkin-Huxley-type dynamics with the computational efficiency of integrate-and-fire neurons. The two equations of this model are:

$$\frac{dv}{dt} = 0.04v^2 + 5v + 140 - u + I$$

$$\frac{du}{dt} = a(bv - u)$$

With the auxiliary after-spike resetting:

$$\text{if } v \geq 30\ \text{mV, then } v \leftarrow c \text{ and } u \leftarrow u + d$$

There are three variables:

$v$: The membrane potential of the neuron.
$u$: The membrane recovery variable, which accounts for the activation of K+ ionic currents and the inactivation of Na+ ionic currents.
$I$: Delivered synaptic currents or injected dc-currents.

And there are four parameters:

$a$: The time scale of the recovery variable $u$.
$b$: The sensitivity of the recovery variable $u$ to the subthreshold fluctuations of the membrane potential $v$.
$c$: The after-spike reset value of the membrane potential $v$, caused by the fast high-threshold K+ conductances.
$d$: The after-spike reset of the recovery variable $u$, caused by slow high-threshold Na+ and K+ conductances.

We can easily define a computational model by its equations and by the variables and parameters included in those equations (Figure-8). For each variable and equation, a biological explanation may be supplied (as in our example). At each time step, this model reads its input through the ReadInterface and produces its results through the WriteInterface. It is a model based on the Hodgkin-Huxley-type dynamics and integrate-and-fire hypotheses. The references of this model can be found in [6]. The biological question that it tries to answer is the spiking and bursting behavior of known types of cortical neurons.

3.2.3 Simulation Model

Figure‐9‐ UML representation of a simulation



Simulations (Figure-9) are realizations of a computational model in some programming language or environment, such as MATLAB, NEURON or Java. The same realization of a computational model run under different simulation conditions (different start conditions, parameter settings and stop conditions) will produce different results. A simulation can include more than one simulation element. A simulation element can be a neural cell, a compartment of a neural cell, or an electrical property bound to a computational model. These elements can be connected to form a neural network or a detailed neural cell. For each simulation, we may discover different behaviors.

Figure-10- (A) a=0.02, b=0.2, c=-65, d=6, I=14, tonic spiking; (B) a=0.02, b=0.25, c=-65, d=6, I=0.5, phasic spiking

For example, we can bind a neural cell to the computational model described in 3.2.2 (Simple Model of Spiking Neurons). With different initial conditions and parameter settings, we get different simulation results (Figure-10). With the settings of (A) we get tonic spiking behavior, and with the settings of (B) phasic spiking appears.
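To make the effect of the parameter settings concrete, here is a minimal sketch that integrates the Simple Model of Spiking Neurons with a forward-Euler scheme. It is not the simulation code used in the project: the function name, the step size `dt`, the starting potential `v0`, the choice `u = b * v` at start, and applying the constant current from time zero are all assumptions made for illustration.

```python
def simulate_spiking_model(a, b, c, d, current, t_max=200.0, dt=0.25, v0=-70.0):
    """Forward-Euler integration; returns (membrane trace, spike times in ms)."""
    v = v0
    u = b * v  # assumed initial value of the recovery variable
    trace = []
    spike_times = []
    for k in range(int(t_max / dt)):
        v += dt * (0.04 * v * v + 5.0 * v + 140.0 - u + current)
        u += dt * a * (b * v - u)
        if v >= 30.0:  # after-spike resetting
            trace.append(30.0)  # clip the recorded peak at the threshold
            spike_times.append(k * dt)
            v = c
            u += d
        else:
            trace.append(v)
    return trace, spike_times


# Parameter sets (A) and (B) of Figure-10.
trace_a, spikes_a = simulate_spiking_model(0.02, 0.2, -65.0, 6.0, 14.0)
trace_b, spikes_b = simulate_spiking_model(0.02, 0.25, -65.0, 6.0, 0.5, v0=-64.0)
print("setting (A):", len(spikes_a), "spikes")
print("setting (B):", len(spikes_b), "spikes")
```

The recorded `trace` lists are the kind of time series that the simulation results in the schema are meant to store. The exact spike counts depend on the assumed step size and start conditions.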



3.3 Definition of XML Schema

We now introduce the XML representation (XML Schema) of the Object-Oriented concept model, from the top-level definition down to the bottom.

3.3.1 LearningDatabase

The top of the schema is the LearningDatabase element (Figure-11). It is composed of BioNeurons, NeuroModels, Hypotheses, Simulations and References elements.

Figure‐11‐ XML representation of Learning Database

BioNeurons (biological neurons) is the collection of BioNeuron elements. NeuroModels is the collection of neural computational models. Hypotheses is the collection of hypotheses that are proposed by different authors and referred to in computational models. Simulations is the collection of simulations, each of which simulates one computational model or a combination of several computational models. The References element is the collection of references that may be referred to in a computational model. We explain the Constraints in section 3.3.6.

3.3.2 BioNeuron

Figure ‐12‐ XML representation of Biological Neuron

BioNeuron (Figure-12) is an element that represents an individual neuron cell. It is identified by its ID and holds properties such as its name, its canonical form, and the organism in which it is located. To describe its molecular information, we can use Interoperation in the form of molecular information (Gene_Chromosome), visual information (Microscopy_Data) or dynamics data (Experimental_Data). An external resource location, such as an image file path, can be saved for each sub-element of Interoperation (the Resource attribute records this location). A neuron cell is composed of different compartments such as somas, axons or dendrites. Certain electrical properties (for example, channels, receptors or transmitters) can be bound to the membrane of a compartment. For each neural cell, more than one experiment can be executed, and in each experiment the measurements of the electrical properties may be recorded at each time step (named states) during the whole execution time. We record the unit, name, and value of each measurement. This information is important because the clustering algorithm needs it to cluster homologous data. More generally, we can save the result data sets in a file instead of saving each value in an element. Additional information, such as the protocols that the experiments followed or a detailed description, can be saved in the description element.

3.3.3 NeuroModel

Figure-13- (a) XML representation of Neural Computational Model; (b) XML representation of a variable; (c) XML representation of a parameter

A neural computational model (Figure-13(a), NeuroModel) is identified by its ID. It holds the name of the model, its author, the date it was built (BuiltDate) and a description. Its principal components are the definitions of Equations, Variables and Parameters. It usually tries to answer a BiologicalQuestion. To understand the computational model well, we need to know which references it has used. For retrieval, we need to know the keywords that the model concerns. Each computational model may have resource files such as program code (MATLAB, Java, NEURON, C++, etc.) and reference files; these and any other necessary files can be treated as additional files and zipped into one file. The location of this zip file is given by the 'resource' attribute of the AdditionalFiles element. Each variable (Figure-13(b)) or parameter (Figure-13(c)) has an identity (ID) that is unique within one computational model, a name, a unit and the symbol (for example, α, β, x, y, etc.) used to represent it in equations.

Figure ‐14‐ XML representation of the Equation

An equation (Figure-14), which has a unique ID within one computational model, is composed of variables and parameters. The mathematical expression of the equation is saved. An equation may describe a function of one region of an organism (for example, the brain), of one neural cell, of one compartment (for example, the soma), or of an electrical property (for example, the K+ channel property). In one computational model, we may define more than one group of equations that have similar behavior.
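The relationship between a model, its variables, its parameters and its equations can be sketched as a small data structure. This is an illustration only, not the project's schema: the class names, field names and the `undefined_symbols` helper are assumptions, and the unit strings are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Symbol:
    """A variable or parameter: ID unique within one model, name, unit, symbol."""
    id: str
    name: str
    unit: str
    symbol: str


@dataclass
class Equation:
    """An equation stores its expression and the IDs of the symbols it uses."""
    id: str
    expression: str
    variable_ids: List[str]
    parameter_ids: List[str]


@dataclass
class NeuroModel:
    id: str
    name: str
    variables: List[Symbol] = field(default_factory=list)
    parameters: List[Symbol] = field(default_factory=list)
    equations: List[Equation] = field(default_factory=list)

    def undefined_symbols(self) -> List[Tuple[str, str]]:
        """Return (equation ID, symbol ID) pairs that refer to unknown IDs."""
        known = {s.id for s in self.variables} | {s.id for s in self.parameters}
        missing = []
        for eq in self.equations:
            for sid in eq.variable_ids + eq.parameter_ids:
                if sid not in known:
                    missing.append((eq.id, sid))
        return missing


# Recovery equation of the Simple Model of Spiking Neurons as example data.
model = NeuroModel(
    id="m1",
    name="Simple Model of Spiking Neurons",
    variables=[
        Symbol("v", "membrane potential", "mV", "v"),
        Symbol("u", "membrane recovery variable", "unspecified", "u"),
    ],
    parameters=[
        Symbol("a", "time scale of u", "unspecified", "a"),
        Symbol("b", "sensitivity of u", "unspecified", "b"),
    ],
    equations=[Equation("eq2", "du/dt = a(bv - u)", ["v", "u"], ["a", "b"])],
)
```

Storing symbol IDs in the equation, instead of copies of the symbols, mirrors the idea that each variable and parameter is defined once per model and then referred to by its ID.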

Figure-15- (a) XML representation of a Biological Question; (b) XML representation of a Reference

For convenient retrieval, we are interested in the characteristics of a computational model, such as the biological question (Figure-15(a): BiologicalQuestion) it tries to answer, the references (Figure-15(b)) it refers to, the hypotheses it is based on, and the keywords it mentions. A reference can be a published paper (PaperReference), a theory from a scientist (TheoryFromPerson), or a book. To make a biological question precise, we can describe the research area it belongs to, the kind of topics it covers, or a reference it uses. A computational model may make an important contribution to answering the question. The special features (Feature) of the computational model with respect to this biological question, and the conclusions we can draw from it, are important for other interested people to refer to or consult. To formalize the terminology of research areas (Figure-16(a)) and topics (Figure-16(b)), we referred to the Wiki classification of Neuroscience [14] and the topics in ModelDB [11].

Figure-16- (a) XML representation of a Research Area; (b) XML representation of Topic

Figure ‐17‐ Interface definition: Reads and Writes interface of a neuron

For a single neuron, the different compartments may have the same or different properties, which can be described by the same or different computational models. To communicate between different models, we can define reads variables, which can be written by another control mechanism, and writes variables, which can be read by another control mechanism (Figure-17).

In a neural network, the connections between different neural cells behave in the same way. One neuron cell can send information to another through the reads variables of the second one (Figure-18). The information sent is carried by the writes variables of the first one.

Figure-18- A neuron supplies Reads and Writes interface for other neurons

3.3.4 Hypotheses

Figure-19- XML representation of Hypotheses

Hypotheses (Figure-19) is a collection of Hypothesis elements. A Hypothesis is proposed by someone and has a corresponding statement. It may be related to a region of an organism, to a cell, or to a compartment of a neural cell. Its identity (ID) is unique within the scope of the LearningDatabase.



3.3.5 Simulations

Figure ‐20‐ XML representation of Simulation

The Simulations element is the collection of Simulation elements. A Simulation (Figure-20) is implemented and executed in a SimulationEnvironment (for example, MATLAB, C++ or Java), and the corresponding program code is saved in SimulationResource (location information). A simulation can include more than one computational model at a time. For example, suppose we have two computational models, one defining the electrical property of the Na+ channel and the other that of the K+ channel, and one neural cell with three compartments (Figure-20): soma, axon and dendrite. The computational models of the Na+ channel and the K+ channel can then be applied, or bound, to the membrane of each of the three compartments in a simulation. As a result, a simulation can involve more than one computational model and more than one compartment.

Figure-20- A simulation example: a neuron cell with soma, dendrite and axon compartments; two computational models that define the Na channel and K channel properties are applied to the membrane of each compartment.



The variables defined in the same computational model (Figure-21) but applied to different compartments can carry different initial conditions (InitialConditions), stop conditions (StopConditions) and parameter settings (Assignments).

Connections (Figure-21) serve three functions. First, in a single neural cell simulation with multiple computational models bound to different compartments, a connection can describe the link between a compartment or property (ConnectTo element) and a computational model (From element). Second, it can describe the connection between compartments, or between a compartment and a property. Third, in a neural network, it can serve as the connection between different neural cells.

Figure ‐21‐ XML representation of sub elements of simulation: NeuronModels, who defines the simulation condition for each

applied computational model

Figure ‐22‐ XML representation


Database Laboratory

If we have discovered some abnormal phenomena or some interesting observations in a simulation, we can record it in Discovery element (Figure‐23). The cell type and the region of organism can serve as retrieval information. The observed results are recorded in Observation elements with the detailed compartment, electronic property and description, the measurement also.

Figure ‐23‐ XML representation of Discovery in simulation

The simulation results (Figure‐24) of the same computational model bound in different compartment will be different. The representation of simulation results can be graphs (curves) or tables that each axis or column saves the values of one variable during one simulation. Being limited to 3D graphs, the Graph element may have X, Y, and Z axis. Each axis save the variable (refer to the ID of variable in computational model) and its whole value as a list or external resource such as txt or xml file. The Tables element is the collection of all the TableData which saves one variable and its values for each step of simulation as list or external file.

Figure ‐24‐ XML representation of simulation result


Database Laboratory

3.3.6 Constraints We list the constraints include the key definition in Table‐ 1:

Key Scale Selection Field

NeuronID Learningdatabase ./BioNeurons/BioNeuron ID

NeuronModelID Learningdatabase ./NeuronModels/NeuronModel ID

PropertyID Learningdatabase ./Bioneurons/Properties/Property @ID

CompartmentID Learningdatabase ./ Bioneurons/Compartments/Compartment @ID

HypothesisID Learningdatabase ./Hypotheses/Hypothesis ID

SimulationID Learningdatabase ./Simulations/Simulation @ID

ReferenceID Learningdatabase ./References/Reference @ID

ParameterID NeuronModel Parameter ID

VariableID NeuronModel Variable ID

EquationID NeuronModel Equation ID Table ‐1‐ Key definition in XML representation

We also list out the key reference definition in Table‐2:

KeyRef refer selector fieldHypothesisRef HypothesisID NeuronModels/NeuronModel/BiologicalQuestion Hypothesis

NeuronModelRef NeuronModelID Simulations/Simulation/NeuronModels/NeuronModel @ID

CompartmentRef CompartmentID NeuronModels/NeuroModel/* |Simulations/* |BioNeurons/BioNeuron/Compartments

Compartment

PropertyRef PropertyID NeuronModels/NeuroModel/* |Simulatioins/*|BioNeurons/Compartments/* |BioNeurons/BioNeuron/Compartments/Compartment

Property

Table ‐2‐ Key Reference Definition in XML representation


Database Laboratory

4. Learning algorithm In this project we are trying to simulate the function of brain by using the database. The brain is not only serves as database that to store information but also has the capacity to learn from the information that it has met. A learning database, as we named, it is firstly a database which stores the information for the Neuroscience, secondly it learns from the information that it has stored. A simple example that everyone has experimented: we have learned more and more the classifications of plants or animals as we have seen more and more plants or animals. All people have different way to learn under different conditions. In the cases that we have a teacher to give the right answer is the easiest way and we usually named it as supervised learning. In the cases that we have to find the answers by ourselves, if we make mistake or we get right answers we may get a negative or positive compensation, it’s the case of reinforcement learning (for example, to learn bicycle); but without such feedback, we can only rely on some intuition. Such problems as classification and clustering are belonging to unsupervised learning, we can only try to put the similar things together and guess how many classes or clusters may exist. How to define the similarity for a class or a cluster? We have to give a formal definition such as the distance between the center of the cluster and the object that we want to classify. But what’s the center of the cluster? How can we find center or what’s the procedure or algorithm to find it? Such questions have to be replied before we can really learn something. Firstly, let’s look at a simple example in the following figure (Figure‐25):

Figure ‐25‐ a simple example for cluster problem

In this case we easily identify by eyes the 4 clusters into which the data can be divided. But how the computer can distingue it? What’s the criterion that the computer put one point into one of the cluster? Another more difficult example, as in our case, we want to identify if the experiment or simulation results in our database can be classified into any of the curves as shown in the Figure‐26 (Data samples for learning are created by the MatLab by defining different combination of elementary functions: exponentials, logarithms, and trigonometric). And worse, we don’t know in advance that which kind of typical curves we could have. How can we cluster the similar curves together?


Database Laboratory

Figure ‐26‐ The curves that created by Matlab by using the combination of elementary functions

In this project, learning algorithm is working as a clustering machine whose function is to find the cluster of each experiment or simulation results clusters and supply the possibility to retrieve the similar results when given a sample. There are many algorithms of unsupervised learning for cluster problem, such as K‐means Competitive Learning [][7], Kohonen Competitive Learning, and Fuzzy‐C‐Means Competitive Learning and Hierarchical Clustering Algorithms.

• K‐means CL: (MacQueen, 1967) [6] is one of the simplest unsupervised learning algorithms that solve the well known clustering problem. The standard k‐means algorithm calculates the distance between the input vector and the centre vector (prototype) (Figure‐27).

Figure ‐27‐ Visualization of the position between prototype and its sampels

The distance is usually defined as the Euclidian norm:

(4)

Where is the input vector, is the prototype vector and is the vectors’ dimension

and N is the number of the prototypes. The prototype with a minimum distance is named

winner that is the center of the cluster: (5)

The winner prototype is updated by a reduction of the learning rate towards the input:


Database Laboratory

(6)

This reduction of the learning rate makes each prototype vector the mean of all cases assigned to its cluster and guarantees convergence of the algorithm to an optimum value of the error function:

(7)

is the input vector that is classified in cluster . is the prototype of the cluster .

The algorithm is shown as following (Figure‐28):

Figure ‐28‐ Diagram of the K‐means algorithm

The convergence criterion is defined by the percentage of the number of changed winners for all inputs. The classical k‐means algorithm has the “dead units” problem. That is, some prototypes may never win the competition, so it may never be updated. The result is the “dead units” that can’t really represent prototypes. Furthermore, we need to know the exact number of cluster k, before performing the data clustering. Otherwise, it will lead to poor clustering performance. The resulting clusters depend on the initial random assignments. It minimizes intra‐cluster variance, but does not ensure that the result has a global minimum of variance. The time consummation is O(N2). N is the total size of inputs.

• Kohonen Competitive Learning (Kohonen, 1995/1997; Hecht‐Nielsen 1990): one of the “Kohonen network”, the Vector Quantization‐competitve networks can be viewed as unsupervised algorithm that is closely related to k‐means cluster analysis. The prototype

Convergence criterion is satisfied?

Start

Initialize the prototypes

Competition to find the winners for all inputs

End

yes no

Update the winner prototype

Competition to find the winners for all inputs


Database Laboratory

vector is moved a certain proportion of the distance between it and the training case, the proportion being specified by the learning rate, that is:

(8)

Kohonen’s learning law with a fixed learning rate does not converge. As is well known from stochastic approximation theory, convergence requires the sum of the infinite sequence of learning rates to be infinite, while the sum of squared learning rates must be finite (Koheonen, 1995, p.34). In this case, the learning rate has to be reduced in a suitable manner. These requirements are satisfied by MacQueen’s k‐means algorithm. The prototypes are randomly initialized from the input vector values. The algorithm is defined as following (Figure‐29):

Figure ‐29‐ Diagram of Koheonen algorithm

The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets. But as in K‐means algorithm, the clustering results of Kohonen Competitive Learning depend on the initialization of the prototypes and may produce the “dead units”. The level of time consummation is O(N M). N is the total size of inputs and M is the total number of cluster.

• Fuzzy c‐means: (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the center of cluster. For each point x we have a coefficient giving the degree of being in the kth cluster uk(x). Usually, the sum of those coefficients is defined to be 1: The fuzzy c‐means algorithm is very similar to the k‐means algorithm:

1. Choose a number of clusters. 2. Assign randomly to each point coefficients for being in the clusters. 3. Repeat until the algorithm has converged (that is, the coefficients' change

between two iterations is no more than ε, the given sensitivity threshold) : Compute the centroid for each cluster.

Whole set of inputs ?

Start

Initialize the prototypes

Competition to find the winner for one input

Update the winner prototype

End

yes no


Database Laboratory

4. For each point, compute its coefficients of being in the clusters. The algorithm minimizes intra‐cluster variance as well, but has the same problems as k‐means, the minimum is a local minimum, and the results depend on the initial choice of weights. The Expectation‐maximization algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes. It has better convergence properties and is in general preferred to fuzzy‐c‐means.

• Hierarchical Clustering Algorithms[1]: Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:

1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters the same as the distances (similarities) between the items they contain. 2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less. 3. Compute distances (similarities) between the new cluster and each of the old clusters. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

• Principal component analysis (PCA) [15]: is a vector space transform often used to reduce multidimensional data sets to lower dimensions for analysis. PCA uses the eigenvectors of the covariance matrix and it only finds the independent axes of the data under the Gaussian assumption. For non‐Gaussian or multi‐modal Gaussian data, PCA simply de‐correlates the axes. When PCA is used for clustering, its main limitation is that it does not account for class separability since it makes no use of the class label of the feature vector. There is no guarantee that the directions of maximum variance will contain good features for discrimination.

The reason that we chose Kohonen competitive learning algorithm is due to three principle limitations:

• First, the simplicity of competitive learning algorithm is a big advantage.

• Secondly, hierarchical clustering methods are unsuitable for isolating spherical or poorly separated clusters

• The last reason is that K‐means competitive learning algorithm can take a long time to converge to a solution, depending on the appropriateness of the reallocation criteria to the structure of the data. However, the structure of the data is unknown. Furthermore, for each new coming sample, we need to recalculate all the prototypes again. It’s quite impossible when there are huge samples existing in database.

To reduce the number of final “dead units” and improve the performance of the initialization of the prototypes, we have taken some strategies into account. We will discuss them in section 5 (Implementation).


Database Laboratory

5. Implementation This project is implemented in Oracle by using the PL/SQL language.

5.1 Architecture of the Oracle Implementation

Figure ‐30‐ Architecture of the Oracle implementation

The architecture of the Oracle implementation is shown in Figure‐30. There are three basic groups. The first one is the knowledgement about the biological neuron and computational model. The second is experiment and simulation. The last is the learning database. To supply the corresponding views as defined in the Object‐Oriented definition, we defined the queries’ views. To automatically abstract the experiment and simulation results into learning database, we defined the triggers on corresponding tables to import external data into the learning database tables. Learning Procedures and functions are defined to execute the competitive learning algorithm to find clusters. In the follow sub section, we introduce corresponding definitions for each part.

Biological Neuron Tables

Computational Model Tables

Biological Experiment Tables

Computational Simulation Tables

Learning Database Tables

1.Abstract Data :Trigger Definition

Queries’ Views Definition

2.Learning Procedures and Function definition


Database Laboratory

5.1.1 Biological Neuron

Figure ‐31‐ Schema definition in Oracle of Biological neuron

There are 9 tables that save the biological neuron information (Figure‐31):

• tb_organism: Table that saves the information about the biological organisms. It includes the name, description, and identity (id) information fields.

• tb_neuroncell: Biological neural cell information is saved in this table. Its fields include the name, canonical form (canonicalform), cell identity (cell_id), situated organism identity (id) and description of cell type (classification of neural cell: cell_type). The canonical form describes a standard way of presenting that cell. The interoperation of one neural cell is saved in table tb_interoperation. tb_compartmentcomposition saves its composition of compartments.

• tb_compartment: It saves the possible compartment of one neural cell. For example, the somas, axons, or dendrites. It is defined by its identity (id), name and type. It can have more than one property and they are saved in tb_compartmentcomposition.

• tb_property: The general properties such as electrical properties of ion channel, receptors on membrane of compartment are described in tb_property. It has its identity (id), type, name and description information.

• tb_compartmentcomposition: It builds the connection between compartments and properties of one neural cell. A compartment of one neural cell may have more than one property definition. A general property may be included in more than one compartment of a neural cell.

• tb_experiment: A biological experiment is recorded and identified by experimentid. The executer, description, and execution date (experimentdate) information are saved.

• tb_experiment_result: For each experiment, we can record more than one measurement. A measurement can be taken on the property of compartment of one cell. So we save the experiment identity (tb_experiment_experimentid), cell identity (cellid), compartment identity (compartmentid), and property identity (propertyid). For each measurement we record its name, unit and size (datasize). All values of one measurement are found in an external file (datalocation).


Database Laboratory

• tb_interoperation: The microscopy data, gene chromosome information of one neural cell can be saved in an external file and we could find it by the file location of each information (microscopydata, genechromosome).

• tb_observation: It’s the table to save the observation that we record in one experiment. This table can contain the observation of a simulation of computational models. The source_type decides that it’s an observation or experiment or simulation. Sim_or_exp_id saves the simulationid or experimentid. Each observation is unique by its id. The description of the observation is saved.

5.1.2 Computational Model

Figure ‐32‐ Schema definition in Oracle of computational model

There are 9 tables that save the computational model information (Figure‐32):

• tb_neuronmodel: It saves the basic information about the identity (modelid) and name of model, description, author and time when it is built (builtdate). If there are more external files about the model, it can be found in the additionalfile field that saves the file location. The definition of equations, variables, parameters are saved in tables of tb_equation, tb_parameter, tb_variable. More information for retrieval, such as keywords, concerning biological questions, and the references it has used is saved in tables of tb_keywords, tb_biologicalquestion, tb_refered.

• tb_variable: A variable may bind with some biological information such as the neural cell (cellid), compartment of neural cell (compartmentid), property of compartment


Database Laboratory

(propertyid). Or it may signify some region of organism or some other biological information (divers). A variable is represented by its symbol and name. It may have unit. It belongs to a computational model (modelid). It is identified by its id in a model.

• tb_parameter: A parameter may bind with some biological information such as the neural cell (cellid), compartment of neural cell (compartmentid), property of compartment (propertyid). Or it may signify some region of organism or some other biological information (divers).A variable is represented by its symbol and name. It may have unit. It belongs to a computational model (modelid). It is identified by its id in a model.

• tb_equation: An equation is an expression of variables and parameters. Apart from its biological meaning, it is identified by its id in a model and has an expression. Each equation may belong to different equation group.

• tb_variable_member: It saves for one equation the membership of variables.

• tb_parameter_member: It saves for one equation the membership of parameters.

• tb_refered: A model may refer to a reference that can be paper, book, etc.

• tb_keywords: The keywords that have defined in one model are saved in this table. Each keywordid signifies a unique keyword. It can be used by more than one model.

• tb_biologicalquestion: A computational model may reply a biological question in some research area and talks about certain topics. The detail of this table is introduced in next section.

5.1.3 Biological question:

Figure ‐33‐ Schema definition in Oracle of Biological question

There are 8 tables to save the biological question information (Figure‐33):

• tb_biologicalquestion: A computational model may reply to the questions that scientists have asked. For retrieval requirement, we may need to know which research area or


Database Laboratory

topics the question about. For each question we supply a unique identity (questionid). To reply the biological question, we may base on the known hypothesis (tb_hypothesis) or reference (tb_refered). The conclusions for the biological question are saved in table tb_contribution. The biological contributions of the computational model to reply the biological question are listed in table tb_conclusion.

• tb_hypothesis: A hypotheses is made by one person (author), it can be shared by more than one biological question. If it’s a clearly defined hypothesis on the compartment of neural cell (cellid, compartmentid), or on neural cell itself (cellid), or about the region of organism, we can record corresponding information in this table. The statement is the content of the hypothesis. It has one unique identity (id).

• tb_based_hypothesis: This table records for each biological question, the hypothesis it has based on.

• tb_reference: This table will record all kinds of possible references that can be used in this database. The type defines it’s a book, a paper, or other published articles. Description is the detailed information about the reference itself. For example, the author, published date, publisher, etc. It has the unique identity referenceid.

• tb_refered: It records for the computational model (tb_neuronmodel_modelid) and the biological question (questioned) the referred identity (referenceid) of the reference.

• tb_contribution: The contribution is the special devotion of the computational model when it tries to reply one biological question (questionid). It may have more than one contribution. We can index the content by the importance or other characters (indexofcontribution).

• tb_conclusion: The conclusion to reply one biological question is saved in this table (questionid). The similar structure as in tb_contribution is defined. We can index the multiple conclusion contents (indexofconclusion, content).


Database Laboratory

5.1.4 Simulation

Figure ‐34‐ Schema definition in Oracle of Simulation

There are 7 tables that are used to save the information about simulation (Figure‐34):

• tb_simulation: A simulation may be based on one computational model or based on the combination of multiple computational models. Each simulation is identified by its id. The simulationResource field saves the external files location (we suppose all the files in one zip). The SimulationEnvionment describes which kind of tools to simulate the computation models (such as Matlab, Java, or NEURON, etc). The simulated time (simulatedtime) is stored.

• tb_modelsimulation: Each computational model (modelid) that is simulated in one simulation (id) may be applied on the concrete biological environment. For example, given two computational models M1 and M2, we can apply the M1 on the K+ channel (propertyid) on membrane of soma (compartmentid) for one neural cell (cellid), and M2 on the Na+ channel (propertyid) on the membrane of the axon (compartmentid) for the same neural cell. In the description we can describe the connection between the two compartments of the cell. For each computational model (M1 and M2), they may have different parameters setting (tb_assignment). The start simulating condition (tb_startcondition) and stop simulation condition (tb_stopcondition) may different for each model. For each model, we can record the variables’ values during the simulation period (tb_resulttable) and we can construct the graphs based on the variables’ values (tb_resultgraph).

• tb_initialcondition: For each simulation (simulatonid), the initial value for each variable (variableid) of its computational model element (modelid) has been stored in this table.

• tb_stopcondition: For each simulation (simulatonid), the stop value for each variable (variableid) of its computational model element (modelid) has been stored in this table.

• tb_assignment: For each simulation (simulatonid), the assigned value for each parameter (parameterid) of its computational model element (modelid) has been stored in this table.


Database Laboratory

• tb_resulttable: The values of one variable (variableid) of a model (modelid) that has been recorded during simulation (simulatonid). The size of total value (resultsize) and the external file (datalocation) that saves these values as a column of data are stored.

• tb_resultgraph: A 3‐D or 2‐D dimension graph that is produced on the variables’ simulation results can be saved in this table. Xvariableid, yvariableid, zvariableid describe the x, y, z axis data source (We can find the corresponding values in tb_resulttable). Produced graph can be saved in the external file location (graphsource).

• tb_obsesrvation: As described in section 5.1.1.

5.1.5 Competitive Learning Tables There are 5 tables used to save the information that need to apply the competitive learning algorithm (Figure‐35).

Figure ‐35‐ Schema definition in Oracle of Competitive Learning

• tb_sample: This table is the summary for all the experiments’ results and simulations’ results. It’s the interface between learning database and the basic information which including biological neuron information and computational information. Each data set(A data set is an array with one dimension that saves values of one measurement in experiment or one variable in one simulation.) can be originated from simulation or experiment (sim_or_exp_id). If it’s from an experiment, it may include the cellid, compartmentid, propertyid information that are abstracted from experiment source; if it’s from a simulation, it may include the modelid and variableid information. Each data has the unit and size (datasize) information. Datalocation stores the external file that we can read the values from. Once we have read all the values from external file, the data_fromindex and data_toindex will save the values location begins and ends in tb_cluster_sample_values. Each value of one data will be saved in table


Database Laboratory

tb_cluster_sample_values. And data index (dataindex) is the unique identity that serves as the index and primary key.

• tb_cluster_sample_values: Values of one data from the external files are saved in this table. The order of the values in original file is kept. Each value has a data index (dataindex) that is unique for each value. All the values are saved in a vertical column. We need to transform each range of values for each data into horizontal table in usage.

• tb_cluster: This table records the cluster result for each data. Each data set in different cluster level belongs to only one cluster. Each cluster is identified by its clusterid and clusterlayer. The first level of the cluster layer here we defined by default is the clusters classified by the unit and data size. In each cluster with the same unit and data size, we will apply the competitive learning algorithm to find the corresponding cluster.

• tb_prototype: Each cluster has one prototype (the center of the cluster). A prototype is identified by its clusterid, layer (cluster level) and prototypeid. Its vector values are saved in table tb_cluster_sample_values. Data_fromindex and data_toindex save the location of value (data_fromindex<=value.dataindex<=data_toindex). Each prototype represents for one group of data with same unit and size (prototypesize). For quality analysis aim, we can write out the prototype values into an external file in datalocation.

• tb_feedback: It saves the feedback from user who has voted on the quality of the query that returns the similarity for one pair of samples. Dataindex1 and dataindex2 note the samples data index, author field notes who has voted. The time that user votes is saved in feedback_date. The value of the vote is saved in feedback field and it is in the interval [‐1, 1].

5.1.6 Definition of Views • v_organism_cell: View of the biological cells in one organism. The source of view is from

tb_neuroncell and tb_organism.

• v_neural_cell: View of Biological cell: a detailed neural cell with its compartments and properties information. The source of view is from tb_neuroncell, tb_compartment, tb_property and tb_compartmentcomposition.

• v_neural_cell_experiment: A view shows the experiments that have been done on the neural cells. The source is from tb_neuroncell, tb_experiment, and tb_experiment_result.

• v_experiment_observation: View of observation of the experiment. The source is from tb_experiment and tb_observation.

• v_computational_model: View of computational model: a detailed computational model with basic model description and its equations definition. The source is from tb_neuronmodel and tb_equation.

• v_equation_variables: View of equations variables: detailed equation descriptions about the definition of variables. The source is from tb_equation, tb_variable, and tb_variable_member.

• v_equation_parameters: View equations parameters: detailed equation description about the definition of parameters. The source is from tb_parameter_memeber, tb_parameter, and tb_equation.

• v_model_reference: View of the reference that referred by computational model. Source is from tb_neuronmodel, tb_referred, and tb_reference.


Database Laboratory

• v_model_keywords: Views of the keywords that used in each computational model. Source is from tb_neuronmodel, tb_keywords, and tb_keyword.

• v_model_hypothesis: View of the hypothesis of biological question. The source is from tb_neuronmodel, tb_biological_question, tb_based_hypothesis and tb_hypothesis.

• v_model_conclusion: View of the conclusion of biological question. The source is from tb_neuronmodel, tb_biological_question, and tb_conclusion.

• v_model_contribution: View of contribution of neural model. The source is from tb_neuronmodel, tb_biological_question, and tb_contribution.

• v_model_question_reference: View of the reference of the biological question. The source is from tb_neuronmodel, tb_biological_question, tb_referred, and tb_reference.

• v_simulation_models: View of simulation detailed composition. It defines the models that involved in one simulation. The source is from tb_simulation, tb_modelsimulation, and tb_neuronmodel.

• v_simulation_observation: View of the observation for the simulation. The source is from tb_simulation and tb_observation.

• v_simulation_startcondition: View of starting conditions for one simulation. The source is from tb_modelsimulation, tb_simulation, tb_initialcondition, and tb_variable.

• v_simulation_stopcondition: View of stopping conditions for one simulation. The source is from tb_neuronmodel, tb_modelsimulation, tb_simulation, tb_stopcondition and tb_variable.

• v_simulation_assignment: View of the parameters’ setting in one simulation. The source is from tb_modelsimulation, tb_simulation, tb_assignment and tb_parameter.

• v_simulation_result_list: View of recorded results in one simulation. The source is from tb_modelsimulation, tb_variable, tb_simulation, and tb_resulttable.

• v_experiment_cluster: View of the clustered samples that from experiment. The source is from tb_sample, v_neural_cell_experiment, and tb_cluster.

• v_simulation_cluster: View of the clustered samples that from simulation. The source is from tb_sample, v_simulation_result_list, and tb_cluster.

• v_neural_cell_experiment: View of the clustered samples that from the simulation. The source is from tb_sample, tb_cluster, and v_simulation_result_list.

• v_cluster_sim_exp_feedback: View of the feedback that is given on the pair of samples. One sample is from simulation and the other is from the experiment. The source is from tb_feedback, v_simulation_cluster, and v_experiment_cluster.

• v_cluster_sim_feedback: View of the feedback that is given on the pair of samples. Both samples are from simulation. The source is from tb_feedback and v_simulation_cluster.

• v_cluster_exp_feedback: View of the feedback that is given on the pair of samples. Both samples are from experiment. The source is from tb_feedback and v_ experiment _cluster.


Database Laboratory

5.2 Constraints The Constraints on tables are defined as in table‐3:

Table Name Primary Key Foreign Key

tb_property id

tb_simulation id

tb_organism id

tb_reference referenceid

tb_initialcondition simulationid, modelid, variableid

tb_stopcondition modelid, simulationid, variableid

tb_resultgraph simulationid, modelid

tb_compartment id

tb_variable id, modelid

tb_variable_member (id, modelid) REFERENCES tb_variable (id, modelid); (tb_equation_id, modelid) REFERENCES tb_equation (id, modelid)

tb_neuronmodel Modelid

tb_neuroncell cell_id (id) REFERENCES tb_organism (id)

tb_resulttable modelid, variableid, simulationid

tb_experiment_result lustered ted, propertyID, cellid,tb_experiment_experimentid

tb_assignment modelid, simulationid, parameterid

tb_observation observationid

tb_biologicalquestion questionid (id) REFERENCES tb_neuronmodel (modelid)

tb_experiment experimentid

tb_parameter id, modelid

tb_parameter_member (tb_equation_id, modelid) REFERENCES tb_equation (id, modelid); (id, modelid) REFERENCES tb_parameter (id, modelid)

tb_equation id, modelid

tb_cluster dataindex, lustered,clusterlayer

tb_sample dataindex

tb_prototype lustered, prototypeid,layer

tb_cluster_sample_values dataindex

tb_temp_values dataindex

tb_interoperation (cell_id) REFERENCES tb_neuroncell (cell_id)

tb_discovery (observationid) REFERENCES tb_observation (observationid); (tb_bio_questionid) REFERENCES


Database Laboratory

tb_biologicalquestion (questionid)

tb_contribution (tb_bio_questionid) REFERENCES tb_biologicalquestion ( uestioned)

tb_refered (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid); (tb_neuronmodel_modelid) REFERENCES tb_neuronmodel (modelid)

tb_keywords (id) REFERENCES tb_neuronmodel (modelid)

tb_modelsimulation (id) REFERENCES tb_simulation (id); (modelid) REFERENCES tb_neuronmodel (modelid)

tb_experiment_result (tb_experiment_experimentid) REFERENCES tb_experiment (experimentid)

tb_hypothesis (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid)

tb_conclusion (tb_bio_questionid) REFERENCES tb_biologicalquestion (questionid)

tb_compartmentcomposition (property_id) REFERENCES tb_property (id); (compartment_id) REFERENCES tb_compartment (id); (cell_id) REFERENCES tb_neuroncell (cell_id)

Table‐3‐ Constraints definition in Oracle

5.3 Triggers Strategy to keep the integration between base table and cluster structure (Table‐4): Triggers are defined on the table tb_experiement_result, tb_resulttable and tb_prototype to keep the integrity between data source and cluster source.

Trigger Name Table Description

t_experiment_delete tb_experiment_result After delete one experiment data, we delete the corresponding data in tables of tb_cluster,tb_sample,tb_cluster_sample_values

t_experiment_result_insert tb_experiment_result After insert a new experiment data, we insert the data saved in data file into tables of tb_cluster,tb_sample,tb_cluster_sample_values. If the clusters for the corresponding data size and unit are defined in learning database, we calculate the cluster for the new result.


Database Laboratory

t_experimentresult_update tb_experiment_result After have modified an experiment data, we modify the tables of tb_cluster, tb_sample, tb_cluster_sample_values to have the same data content. At same time, if the clusters for the corresponding data size and unit are defined in learning database, we recalculate the corresponding cluster for the modified result.

t_resulttable_delete tb_resulttable After delete one simulation result data, we delete the corresponding data in tables of tb_cluster,tb_sample,tb_cluster_sample_values

t_resulttable_insert tb_resulttable After insert a new simulation result data, we insert the data saved in data file into tables of b_cluster,tb_sample, tb_cluster_sample_values. If the clusters for the corresponding data size and unit are defined in learning database, we calculate the cluster for the new result.

t_resulttable_after_update tb_resulttable After have modified a simulation result data, we modify the tables of tb_cluster, tb_sample, tb_cluster_sample_values to have the same data content. . At same time, if the clusters are defined in learning database, we recalculate the corresponding cluster for the modified result.

t_prototype_delete tb_prototype When the prototype is delete, delete its values that are saved in tb_cluster_sample_values.

Table ‐4‐ Trigger definition in Oracle


Database Laboratory

5.4 Competitive learning algorithm implementation: Competitive learning algorithm applied on each homologous data groups such that they have the same unit and the data size. For example, a simulation result whose variable has the unit ‘mV’ and recorded value array with length 5000 is homologous with an experiment result of the membrane’s potential on a soma of neural cell which has the unit ‘mV’ and the measured 5000 values. We suppose all the simulation results and experiment results are saved on external files that can be accessed by the user in the authorized directory in Oracle. As being mentioned in section 5.1.5, four tables have been involved in the procedure of competitive learning. They are tb_sample, tb_cluster_sample_values, tb_cluster and tb_prototype. Each insertion of simulation result (tb_resulttable) or experiment result (tb_experiment_result) will trigger the corresponding importation data values procedure which read the external file location into the tb_sample and tb_cluster_sample_values tables. The update or delete operation on these two tables will trigger the update or deletes procedures on the corresponding tables: tb_sample, tb_cluster_sample_values, and tb_cluster. Once we have enough data samples to cluster, we can start the cluster procedure by initializing the prototypes and execute the competitive learning algorithm. To get better initialization of prototypes at the start point in which situation we do not be always “too far away” from interesting data, we need more extra effects to check the randomly chosen prototypes and reject the choices that too much overlapped prototypes exist. The overlap can be represented by the correlation between two prototypes. Prototype pair with high correlation (for example, more than 0.9) may result that the two prototypes represents the same cluster. 
We check correlation between prototypes to find the highly correlated prototypes and a criterion is defined to decide how many correlated prototypes accepted by the initialization. If we haven’t found the wanted level of correlation number criterion, there may be no such choice, and then we have to augment the tolerable correlated prototypes’ number. A maximum ratio of this criterion has to be defined, or it will loop infinitely. To avoid exhaustive tries, we have to limit to a reasonable iteration number to search the better choices. As we can’t know in advance the cluster number and missing the clusters is not that we want. We have to define a rather high ratio of prototypes regarding to the total size of samples data. The results may lead to more dead units but guarantee that we haven’t missed the poorly separated clusters. Once the initialization finds the satisfied prototypes, competitive learning begins. Another factor that influences the convergent speed (the moving average [14] of the length of the modification of prototypes) is learning rate. In our implementation we decrease it while the number of learned sample increasing in the initialization steps. In such way, later samples will influence the vectors less than earlier ones (As we learn fast at beginning, but less in later


Database Laboratory

learning.). For new inserted sample we keep a constant small learning rate. After all the samples have been clustered, the initialization of the competitive learning algorithm finishes. When new sample is inserted into database, it will be clustered by the competitive learning algorithm with constant small learning rate (as we talked about above). The learning procedure continues. To get better accuracy, we suggest that if the database size has grown in a considerable scale, we reinitialize more prototypes and execute the competitive learning again for all the samples. In the later performance analysis, we will show the precision improvement by reinitializing the prototypes with the growing data size. The flow in Oracle to execute competitive learning process is shown in Figure‐36:

Figure ‐36‐ Execution flow of the competitive learning flow

The sql scripts for realizing the process above (Figure‐36) are defined in 7 sql files: initialize.sql, createtable.sql, insertdata.sql, createview.sql, initializeprototype.sql, table_trigger.sql, searchcluster.sql. It’s sufficient to execute initialize.sql that will execute all the sql files consecutively. Then call the createpcadatasample procedure to insert the test samples. Next to call Initializeallprototype procedure will initialize prototypes for all the units and sample size. Finally to call execALLCompetiveLearning procedure will compute the cluster for each sample and update the prototypes. Given an external file that includes the high‐dimension data series and we want to know which cluster it belongs to, we can use the calculateCluster function that is defined in the searchcluster.sql and supply the data size, unit, file location and the cluster layer (in layer 0, it’s only the clusters of the same data size and unit, the layer 1 is the clusters for each data size and unit group). Given a sample that has stored in learning database, we want to know its nearest neighbors in the same cluster. We can call the function findneighbors that is defined in the searchclusters.sql. We need to supply the sample data index and number of neighbors we want to find.

Create user’s name and directory

Create Tables, Views and Triggers

Insert samples’ data

Initialize prototypes (by unit, data size)

Execute the competitive learning algorithm to find cluster for all the samples


Database Laboratory

6. Queries We decided to give the freedom to users to choose possible queries that they wants. The semester project of “EasyQueries” of laboratory of Databases of EPFL in 2007 (author: Ariane Pasquier) has supplied a good tools to exploit all the possibilities to query the databases. For summary, we can list some examples that are common for users: A. General queries:

1. Biological questions:

• Given the organ, list out its neural system.

• Given the neuron name or ID, list out its compartments with electronic properties, classifications, locations, and interpretations.

• Given the neuron name or ID, list out the experiment results.

• Given the type of receptor/neurotransmitter/channel, list out the neurons that have such type of receptor/ neurotransmitter/channel.

• Etc. 2. Computational questions:

• Given a neuron name or ID, list out its computational models.

• Given the name of a theory, list out the computational model based on this theory

• Given the research subject, list out the possible concerning models.

• Given the neuron name, list out the biological experimental data for data fitting

• Given the model name, list out the simulation results (table, graphs, condition and parameters).

• Given the properties of compartments, find out the computational model that is based on it.

• Etc. B. Advanced queries:

• Given the simulation resulting data (pre‐saved table or graph), find the closest experiment results or simulations.

• Given an external file that stores a simulation or experiment result, find the closest experiment results or simulations.

• Given a cluster id, find its prototype.

• Given a simulation result or experiment result, find its cluster.

• Etc.

Easyqueries supplies an interface that user can easily define his queries (Figure‐37). All the tables and views in the database corresponding to the role of user can be chosen as the querying object. It is original for query the tables in DERBY database. We slightly modified it to be able to query the database of Oracle. In our case, we use the assisted query with QBE to execute the queries.

Figure ‐37‐ EasyQueries interface to choose a database and table for querying


Database Laboratory

QBE, Query‐By‐Example, is a language for querying relational data with a graphic representation of the data. User can use QBE keywords to retrieve, update, delete, and insert data. The graphic representation is shown in Figure‐38. It refers to DB2 Query Management Facility (QMF) keywords definition [8], user can choose the queried table, the fields in the table (keywords: P.: Projection, UNQ: distinct etc.), define the conditions (for example: <=, >=, <>, etc) easily by using supplied keywords. For detailed explanation of this language, we comment reader to refer to [8]. Multiple tables can be joined by using the complex link. In our case, we have already created the necessary views for users to query almost all the visual information, the operation on one view is sufficient to query the necessary information.

Figure ‐38‐ Operation for query interface in EasyQueries

In Figure‐39, we show an example that a query to retrieval the view of v_organism_cell: we select all the fields (P. on the v_organism_cell) and chose the records that organism_name is like ‘brain’. The result SQL statement is previewed. Once we click on the ‘SEND QUERY’, we can get the result records below.

Figure ‐39‐ Query example in EasyQueries


Database Laboratory

7. Performance analysis Firstly, we look at the time consummation of the step of competitive learning process. As we mentioned before, the complexity of competitive learning is O(M*N), M is the number of clusters and N is the total samples. We fixed the number of clusters and executed competitive learning on different size of samples. The time consummation is shown in Figure‐. We can conclude that the time consummation is increased linearly with the size of samples.

Figure ‐40‐ Time consummation of competitive learning implemented in Oracle: Fixing dimension as 784, clusters 20,

varying the number of samples from 200 to 5000

Secondly, we check the final prototypes’ quality by two measurements: the number of dead units and the number of correlated pairs of prototypes.

(a) (b)

Figure ‐41‐ (a) Dead unit number under the condition: fixing dimension 500, 1000 samples, varying clusters’ number; (b) Correlation with the same simulation condition as in (a).

In Figure‐41(a) the dead units and the number of final correlated prototypes pairs with configuration of 1000 samples that each sample has 500 dimensions are shown. We can discover that with the fixed samples number and dimension, increasing the clusters will lead to more dead units (in our case, we defined the clusters that have less than 3 samples are dead units) and more correlated pairs of prototypes (Figure‐41(b) in our case, the pair of prototypes that has the correlation more than 0.8 are defined as correlated). The correlated prototypes may represent the same cluster and dead units show that there are some prototypes have never find samples belong to the cluster that the prototypes represent.


Database Laboratory

(a) (b)

Figure ‐42‐ (a) Final correlated prototypes in the simulation condition: Fixing 1000 samples, 20 clusters, varying the dimension of each sample;

(b) Total time consummation in the same simulation condition as in (a).

We studied various dimensions with the same number of samples and clusters (Figure‐42). The correlated prototypes pairs haven’t had great change (Figure‐42(a) between 0 or 3 pairs), but the consummation of time has great difference (from 113 to 1171 second) (Figure‐42(b)). We think the solution to reduce the dimension for the competitive learning is acceptable [12]. The prototypes’ quality is the most important measurement that indicates the successful factor for the algorithm. As we believe that more samples the database learned, more successful that it can cluster the correctly sample. To verify the cluster result’s correctness, we used the labeled hand written digits’ images that each image represents one number among 0 to 9 (The hand written digits data to test the precision of clustering is from the course “Unsupervised and reinforcement learning in neural networks” of Professor Wulfram Gerstner.). These images have fixed data dimension (784) and the clusters number proportional to the total samples (10%) and executed the learning process on different size of sample. We can define the precision of cluster as following:

(8)

The precision for clustering is shown in Figure‐43. We can easily to see that with more samples, we can learn more precisely. It confirms the solution that to improve the quality of the clusters, we may need to initialize the prototypes and execute the competitive learning again when the samples number has grown in some scales. In this example, the scale is 10. The precision is improved from 57.5% (200 samples, 20 clusters) to 82.5% (2000 samples, 200 clusters).


Database Laboratory

Figure ‐43‐ Precision of clustering under the simulation condition: Fixing the proportion of cluster related to the total

number of samples 10% and the dimension of each sample, varying the number of samples

In the case of 200 samples with 20 clusters, we can’t find all the possible prototypes for each digit. But in the case of 2000 samples and 200 clusters, different prototypes for each digit are found. To show the improvement, we reshaped the prototypes from the 784 dimension to the 32*32 matrix, and draw the digits. The 20 prototypes of 200 samples are shown in Figure‐44 and the 20 of 200 prototypes that 2000 samples have found are displayed in Figure‐45. We can observe in the two figures that in the first case, we have only found 1 prototypes of digit 1 but at least 4 prototypes in the second case; similarly, we have only found 2 prototype of digit 0 in the first case but at least 4 prototypes in the second case. Furthermore, in the first case there are 2 prototypes we can’t tell which digit they represent (in the second line, the third and the fifth prototype), but in the second case, we can distingue each digit easily.

Figure ‐44‐ total Prototypes of the 200 samples with 20 clusters


Database Laboratory

Figure ‐45‐ 20 of the 200 Prototypes with 2000 samples and 200 clusters

Figure ‐46‐ Comparison of time consummation between Oracle and Matlab to execute the same dimension of sample

500, with 1000 samples.

To further analyze the performance and we compared the time consummation of competitive learning process between the Oracle and Matlab. We fixed the dimension as 500 and took 1000 samples. By choosing different clusters number, we can get the following time consummation in Figure‐46 for Oracle and Matlab. Surprisingly, we can discover that Matlab is much faster in all the case. The maximum time during this test for Matlab to execute the competitive learning algorithm is only 3.5 seconds contrary to Oracle 4065 seconds. At same time it gives huge hope that we can improve the performance in Oracle by using external tools such as Matlab to execute the learning algorithm but Oracle serves as the storage and retrieval tools.


Database Laboratory

8. Conclusion Based on the requirement of the storage of biological neural information and computational models’ information, we designed a database that not only can store the biological and computational neural information systematically, but also can learn from the high dimension data series or graph information (transformed to high dimension vector) to find the cluster information. This project is implemented in Oracle 10G by using the PL/SQL language. Queries are possible by using the EasyQueries [10]. The performance of the project is studied by using data series produced by the elementary functions combinations and hand written digits images. Here we want to point out some problems about the implementation in Oracle:

• The most inconvenience that we have met is the mathematics’ calculation. We can’t execute the vector calculation easily in Oracle. The performance of execution of such procedures in Oracle is extremely heavy.

• The IO operation to load external file into Oracle is not efficient.

• In the analysis of the performance of the time consummation, Oracle is not efficient to execute the competitive learning algorithm. The Matlab is obviously much better.

In the future work, we may think about to implement competitive learning algorithm by using the external specialized components such as Matlab or C++ languages but Oracle serves as a storage and retrieval tool. We have supplied a well‐formed and easy to query knowledge base but we miss a good visual tools to input the knowledge. As the biological information and computational information are decomposed as detailed as possible, user may input enormous information. For example, for each computational model, we have to input the equations, parameters, variables separately if we have no an intelligent tool to abstract the parameters and variables from the equations. For each parameter or variable, if without the help of dictionary that user can easily choose the signification from, it’s a heavy work to input all the description or biological explanation manually. In the future work, such intelligent editor is necessary to supply to user. Furthermore, some existing databases may already have partial information. In this case we may need to develop a tool to automatically import the corresponding information. As we mentioned in the section 7, the possibility to reduce the dimension have no great effects on the precision, but reduce greatly the time consummation. The future work we may think about to calculate the principle components instead of calculate all the dimensions. In this project, we chose the simple one pass Kohonen competitive learning algorithm that can easily adapted for the database application. But in the future work, any learning algorithm can be added to the application level by defining a standard interface to retrieve data samples and return the results. Till now we are trying to find the possible separation between clusters. But the other requirement such as finding the common ancestor (For example: construction of phylogenetic tree) is big issue in bioinformatics. In the case of hand written digits, the different prototypes for the same digit


Database Laboratory

have been found, but how we know they are the same digit? In the future work, the result of cluster may serve as the base to find the construction of the phylogenetic tree.


Database Laboratory

9. Acknowledgement Thanks Professor Spaccapietra Stefano to have accepted my proposition for this project and I am deeply appreciated for the help of Dr. Fabio Porto during the whole work. Thanks the no‐stop support of my family.


Database Laboratory

Reference 1. Cluster analysis. (2008, June 17). In Wikipedia, The Free Encyclopedia. Retrieved 14:58, June

17, 2008, from http://en.wikipedia.org/w/index.php?title=Cluster_analysis&oldid=219899952

2. Neuron. (2008, May 18). In Wikipedia, the Free Encyclopedia. Retrieved 08:16, May 26, 2008, from http://en.wikipedia.org/w/index.php?title=Neuron&oldid=213235579

3. Membrane potential. (2008, May 21). In Wikipedia, The Free Encyclopedia. Retrieved 08:14, May 26, 2008, from http://en.wikipedia.org/w/index.php?title=Membrane_potential&oldid=214033480

4. Pyramidal cell. (2008, May 8). In Wikipedia, The Free Encyclopedia. Retrieved 13:46, May 26, 2008, from http://en.wikipedia.org/w/index.php?title=Pyramidal_cell&oldid=211030760

5. Izhikevich artificial neuron model from EM Izhikevich "Simple Model of Spiking Neurons" IEEE Transactions On Neural Networks, Vol. 14, No. 6, November 2003 pp 1569‐1572 6. Kanungo, T. Mount, D.M. Netanyahu, N.S. Piatko, C.D. Silverman, R. Wu, A.Y.

Almaden Res. Center, San Jose, CA: An Efficient k‐Means Clustering Algorithm:Analysis and Implementation. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 7, JULY 2002

7. IBM Corporation, August, 2005 http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.qmf.doc.using/dsqk2mst365.htm

8. IBM Corporation, August, 2005 http://publib.boulder.ibm.com/infocenter/dzichelp/v2r2/index.jsp?topic=/com.ibm.qmf.doc.using/dsqk2mst339.htm

9. Ariane Pasquier: User Manul of EasyQueries. EPFL Semester Project April 2007. 10. Hines ML, Morse T, Migliore M, Carnevale NT, Shepherd GM. ModelDB: A Database to

Support Computational Neuroscience. J Comput Neurosci. 2004 Jul‐Aug;17(1):7‐11. 11. Heng Tao Shen , Xiaofang Zhou, Aoying Zhou: An adaptive and dynamic imensionality

reduction method for high‐dimensional indexing. The VLDB Journal (2007) 16(2): 219–234 12. Wikipedia contributors, 'Machine learning', Wikipedia, The Free Encyclopedia, 11 June 2008,

20:59 UTC, http://en.wikipedia.org/w/index.php?title=Machine_learning&oldid=218711697

13. Neuroscience. (2008, June 10). In Wikipedia, The Free Encyclopedia. Retrieved 14:09, June 17, 2008, from http://en.wikipedia.org/w/index.php?title=Neuroscience&oldid=218438467

14. Moving average. (2008, June 4). In Wikipedia, The Free Encyclopedia. Retrieved 07:27, June 18, 2008, from http://en.wikipedia.org/w/index.php?title=Moving_average&oldid=216962510

15. Principal components analysis. (2008, June 16). In Wikipedia, The Free Encyclopedia. Retrieved 07:45, June 18, 2008, from http://en.wikipedia.org/w/index.php?title=Principal_components_analysis&oldid=219604812

report of master learning database€¦ · one well designed database schema for blue brain project...

Documents