inquiry optimization technique for a topic map database
In this paper, an inquiry optimization technique for a topic map database is presented.
Inquiry Optimization Technique
for a Topic Map Database
Yuki Kuribara
(Graduate School of Engineering,
Shibaura Institute of Technology)
Masaomi Kimura
(Information Engineering,
Shibaura Institute of Technology)
Topic maps
2010/10/6Data Engineering Lab3
Recently, many kinds of topic maps have been created
For web portal sites
For application development, and so on
When we target large topic maps, we need to construct databases for them, since databases can deal with data larger than the size of physical memory
[Figure: a topic map that fits on memory versus one that runs out of memory]
The role of database
Database systems should take responsibility for managing the information of topic maps
Query optimization
Transaction management
Physical data structure hiding
[Figure: a query for topic map information is handled by the database system, which provides query optimization, transaction management, and physical data structure hiding]
The physical data model for databases
There are several options for the data model of the database
A relational model (tables) and an object-oriented model are mainly used in topic map databases
When we traverse a topic map to retrieve information, an object-oriented model does not need to join tables multiple times, unlike a relational model
We propose to use the object-oriented model for the database
[Figure: an object-oriented model (Object A referring to Object B) compared with a relational model]
The logical data model for databases
We assume the topic map data structure defined by the Topic Maps Data Model (TMDM), since topic maps should follow TMDM
The data model consists of seven types of information items and 19 types of named properties
We implemented these items as classes whose instances hold references to the corresponding information item objects
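[Figure: class diagram of the implemented items]
TopicMap (1, +parent) to Topic (0..*, +topics)
TopicMap (1, +parent) to Association (0..*, +associations)
Topic (1, +parent) to TopicName (0..*, +topicNames)
Association (1, +parent) to AssociationRole (0..*, +roles)
Topic (1, +player) to AssociationRole (0..*, +roles)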
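The reference relationships in the class diagram can be sketched as plain Python classes. This is only an illustration of the idea, not the TOME implementation: the attribute names follow the role names in the diagram, while the helper methods `add_topic` and `add_association` are hypothetical conveniences added for this example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class TopicName:
    value: str
    parent: "Topic"  # +parent: the topic that owns this name


@dataclass(eq=False)
class Topic:
    parent: "TopicMap"                               # +parent
    topic_names: list = field(default_factory=list)  # +topicNames
    roles: list = field(default_factory=list)        # +roles played by this topic


@dataclass(eq=False)
class AssociationRole:
    parent: "Association"  # +parent
    player: Topic          # +player


@dataclass(eq=False)
class Association:
    parent: "TopicMap"                         # +parent
    roles: list = field(default_factory=list)  # +roles


@dataclass(eq=False)
class TopicMap:
    topics: list = field(default_factory=list)        # +topics
    associations: list = field(default_factory=list)  # +associations

    def add_topic(self, name: str) -> Topic:
        # Hypothetical helper: create a topic with one name and register it.
        topic = Topic(parent=self)
        topic.topic_names.append(TopicName(name, parent=topic))
        self.topics.append(topic)
        return topic

    def add_association(self, *players: Topic) -> Association:
        # Hypothetical helper: one role per player, linked in both directions.
        assoc = Association(parent=self)
        for player in players:
            role = AssociationRole(parent=assoc, player=player)
            assoc.roles.append(role)
            player.roles.append(role)
        self.associations.append(assoc)
        return assoc


tm = TopicMap()
doyle = tm.add_topic("Conan Doyle")
scarlet = tm.add_topic("A Study in Scarlet")
wrote = tm.add_association(doyle, scarlet)
```

Because every object holds direct references to its neighbours, moving from a topic to its associations (or back) is a pointer walk rather than a table join.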
The possibility of plural retrieval routes
When we retrieve information from a topic map, there may be more than one way to reach the same objects
How efficiently the objects are retrieved depends on the route chosen for the search
The database system needs to select the most suitable retrieval route (query optimization)
[Figure: the class diagram of TopicMap, Topic, TopicName, Association and AssociationRole, shown again with its reference relationships]
Query optimization
Database systems need to estimate a suitable execution plan
Without query optimization, the database system may take a very long time to retrieve objects
Although some topic map database systems exist, they do not seem to take optimization into consideration
The database should take responsibility for query optimization
Objective
In this presentation, we focus on the retrieval of topic objects that are related to a particular topic by a specific association
e.g., what did Conan Doyle write?
We propose an optimization technique based on the estimation of execution cost
[Figure: the particular topic specified in the query ('Conan Doyle') is connected by the specific association ('write') to the intended topic ('A Study in Scarlet')]
Retrieval plan - the association route
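e.g., what did Conan Doyle write?
1. We search the association objects 'write'
2. On the two sides of each association, we search the topic object 'Conan Doyle' and find the intended topic objects
[Figure: 'Conan Doyle' and 'A Study in Scarlet' connected by the association 'write'; the association is marked as step 1 and both topics as step 2]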
Retrieval plan - the topic route
e.g., what did Conan Doyle write?
1. We search the topic object 'Conan Doyle'
2. We then search the association objects 'write' referred to by its association role objects
3. We find the intended topics
[Figure: 'Conan Doyle', the association 'write' and 'A Study in Scarlet', marked as steps 1, 2 and 3 in that order]
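The two retrieval plans can be sketched on a tiny in-memory object graph. This is a minimal illustration, not the TOME implementation: the classes `Topic`, `Role` and `Association`, the helper `link`, and the extra 'born in' association are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Topic:
    name: str
    roles: list = field(default_factory=list)  # roles this topic plays


@dataclass(eq=False)
class Role:
    player: Topic
    parent: "Association"


@dataclass(eq=False)
class Association:
    name: str
    roles: list = field(default_factory=list)


def link(name, a, b):
    """Create an association between two topics, with one role per side."""
    assoc = Association(name)
    for topic in (a, b):
        role = Role(topic, assoc)
        assoc.roles.append(role)
        topic.roles.append(role)
    return assoc


def association_route(associations, assoc_name, topic_name):
    found = []
    for assoc in associations:                     # step 1: scan every association
        if assoc.name != assoc_name:
            continue
        players = [r.player for r in assoc.roles]  # step 2: topics on both sides
        if any(p.name == topic_name for p in players):
            found += [p.name for p in players if p.name != topic_name]
    return found


def topic_route(topics, assoc_name, topic_name):
    start = next(t for t in topics if t.name == topic_name)  # step 1: find the topic
    found = []
    for role in start.roles:                                 # step 2: its associations
        assoc = role.parent
        if assoc.name == assoc_name:                         # step 3: the other side
            found += [r.player.name for r in assoc.roles if r.player is not start]
    return found


doyle, scarlet, four, city = (
    Topic(n) for n in ("Conan Doyle", "A Study in Scarlet", "The Sign of Four", "Edinburgh")
)
topics = [doyle, scarlet, four, city]
associations = [
    link("write", doyle, scarlet),
    link("write", doyle, four),
    link("born in", doyle, city),  # illustrative extra association
]

by_assoc = association_route(associations, "write", "Conan Doyle")
by_topic = topic_route(topics, "write", "Conan Doyle")
print(by_assoc, by_topic)
```

Both functions return the same titles; they differ only in which objects they touch on the way, which is what the cost estimation has to capture.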
Estimation of execution cost
Systems have to choose the most suitable plan
It is necessary to define a cost that can effectively estimate the retrieval time (cost estimation)
We define estimation formulae for the retrieval cost of each plan
[Figure: a query for topic map information can follow Route A (cost: 10) or Route B (cost: 100)]
Cost of objects - definition of cost
We measured the total execution time and the retrieval time of objects
Object retrieval accounts for more than 99% of the processing time
It is therefore enough to measure the time to retrieve objects in order to evaluate the cost of query processing

Route               Execution time (A) (nano sec)   Retrieval time of objects (B) (nano sec)   Ratio of object retrieval time (B/A)
Association route   6.025×10^8                      5.991×10^8                                 99.44%
Topic route         1.035×10^8                      1.033×10^8                                 99.81%

[Figure: pie chart of the execution time of retrieval; retrieval time of objects: more than 99%, other time: less than 1%]
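The ratios in the table follow directly from the two measured columns; a short check using the values from the slide (the dictionary layout is just for this example):

```python
# (execution time A, object retrieval time B), both in nano sec, from the slide
measurements = {
    "association route": (6.025e8, 5.991e8),
    "topic route": (1.035e8, 1.033e8),
}

# B / A as a percentage for each route
ratios = {route: b / a * 100 for route, (a, b) in measurements.items()}

for route, ratio in ratios.items():
    print(f"{route}: {ratio:.2f}% of the execution time is object retrieval")
```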
Cost estimation formula
for the association route
$$C_{assoc\_route} = C_a N + 2\,(C_{ar} + C_t)\,\frac{N}{Q}$$
C_a N: we need to retrieve all associations, since multiple associations may have the same name
2 (C_ar + C_t): the cost is doubled since we retrieve the two topics on both sides of the association
N/Q: we approximate the number of associations with the specified name by the average number of associations per unique name
[Figure: 'Conan Doyle' and 'A Study in Scarlet' connected by the association 'write'; the association is marked as step 1 and both topics as step 2]
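A minimal sketch of this estimate, assuming the formula reads as "scan all N associations at cost C_a each, then pay 2 (C_ar + C_t) for each of the N/Q expected matches". The function name, parameter names and sample numbers are hypothetical and are not taken from the experiment.

```python
def assoc_route_cost(n_assoc: int, n_names: int,
                     c_a: float, c_ar: float, c_t: float) -> float:
    """Estimated cost of the association route under the assumed reading."""
    scan_all = c_a * n_assoc              # every association is retrieved
    expected_matches = n_assoc / n_names  # average associations per unique name
    return scan_all + 2 * (c_ar + c_t) * expected_matches


# Made-up inputs: 100 associations, 4 unique names
cost = assoc_route_cost(n_assoc=100, n_names=4, c_a=3.0, c_ar=1.0, c_t=5.0)
print(cost)  # 300 for the scan plus 2 * 6 * 25 for the matches
```

The scan term grows with N regardless of the query, so this route tends to become expensive on topic maps with many associations.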
Cost estimation formula
for the topic route
$$C_{topic\_route} = C_t\,\frac{M}{2} + (C_{ar} + C_a)\,\frac{2N}{M} + C_{ar}\,\frac{2N}{MQ}$$
M/2: the average number of topic retrievals needed to find the specified topic (note that each topic must have a unique name)
2N/M: the average number of associations per topic
2N/(MQ): the average number of associations per topic that have the name specified by the query
[Figure: 'Conan Doyle', the association 'write' and 'A Study in Scarlet', marked as steps 1, 2 and 3 in that order]
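The sketch below shows how an optimizer could compare the two estimates and pick a plan. It assumes the topic route cost reads as C_t·M/2 + (C_ar + C_a)·2N/M + C_ar·2N/(MQ) and the association route cost as C_a·N + 2 (C_ar + C_t)·N/Q; the function names and all numeric inputs are hypothetical.

```python
def topic_route_cost(m, n, q, c_t, c_a, c_ar):
    """Assumed reading of the topic route estimate."""
    degree = 2 * n / m  # average number of associations per topic
    return c_t * m / 2 + (c_ar + c_a) * degree + c_ar * degree / q


def assoc_route_cost(n, q, c_t, c_a, c_ar):
    """Assumed reading of the association route estimate."""
    return c_a * n + 2 * (c_ar + c_t) * n / q


def choose_plan(m, n, q, c_t, c_a, c_ar):
    """Return the name of the plan with the smaller estimated cost."""
    costs = {
        "association route": assoc_route_cost(n, q, c_t, c_a, c_ar),
        "topic route": topic_route_cost(m, n, q, c_t, c_a, c_ar),
    }
    return min(costs, key=costs.get), costs


# Many associations relative to topics: scanning associations is expensive
plan_dense, costs_dense = choose_plan(m=100, n=200, q=4, c_t=5.0, c_a=3.0, c_ar=1.0)
# Many topics, few associations: scanning topics is expensive
plan_sparse, costs_sparse = choose_plan(m=1000, n=20, q=2, c_t=5.0, c_a=3.0, c_ar=1.0)
print(plan_dense, plan_sparse)
```

With these made-up inputs the first map favours the topic route and the second the association route, which matches the intuition that the dominant term is the scan over whichever object class is larger.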
Experiment
In order to demonstrate our method, we applied our technique to TOME
TOME is a prototype topic map database developed by the authors
As target topic maps, we selected the following two, which have different sizes
Rampo Edogawa* topic map
# of topics: 29 (his name, his works and his hometown)
# of associations: 15 (his works and his hometown)
Pokemon topic map
# of topics: 174 (Pokemon names and their attributes)
# of associations: 432 (evolutionary and attribute relationships)
*Rampo Edogawa is a famous mystery writer in Japan.
Evaluation of cost estimation formulae
In order to evaluate our cost estimation formulae, we measured the execution time of a query under each plan and compared its tendency with that of the estimated cost

                          Average query execution time (nano sec)    Evaluated cost for each query execution plan
Topic map                 Association route    Topic route           Association route    Topic route
Rampo Edogawa Topic Map   31                   157                   133.2                164.0
Pokemon Topic Map         297                  31                    2533                 697.7

Rampo Edogawa: the association route has the shorter execution time (31 < 157) and the smaller cost (133.2 < 164.0)
Pokemon: the topic route has the shorter execution time (297 > 31) and the smaller cost (2533 > 697.7)
We can see the tendency: the smaller the estimated cost, the shorter the execution time
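The comparison amounts to checking that, for each topic map, the plan with the smallest estimated cost is also the plan with the shortest measured time. A small check over the values from the table (the data layout and the helper `cheapest` are assumptions of this example):

```python
# Values copied from the table: measured time and evaluated cost per plan
results = {
    "Rampo Edogawa": {
        "time": {"association": 31, "topic": 157},
        "cost": {"association": 133.2, "topic": 164.0},
    },
    "Pokemon": {
        "time": {"association": 297, "topic": 31},
        "cost": {"association": 2533, "topic": 697.7},
    },
}


def cheapest(values):
    """Return the plan name with the smallest value."""
    return min(values, key=values.get)


# True when the cost-based choice matches the fastest measured plan
agreement = {
    name: cheapest(r["cost"]) == cheapest(r["time"]) for name, r in results.items()
}

for name, r in results.items():
    print(name, "-> chosen by cost:", cheapest(r["cost"]),
          "| fastest measured:", cheapest(r["time"]))
```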
Conclusion
We proposed an optimization technique based on the estimation of execution cost
We showed that there can be more than one way to retrieve the same objects
We defined cost estimation formulae for the retrieval cost of each plan
We evaluated our optimization technique
The result of our experiment shows a roughly proportional tendency between the retrieval time and the object size
We can also see the tendency that the estimated cost is small when the execution time is short
The effect of buffers
If an object to be loaded already exists on memory, a buffer shortens the retrieval time
The cost estimated by the formulae needs to be modified (reduced) because of the effect of buffers
In our target query, there are two cases in which the buffer is used:
The topic for an association name that already exists on memory is loaded from the buffer
A topic that already exists on memory is loaded from the buffer
[Figure: 'Conan Doyle' connected by two 'Write' associations to 'The Sign of Four' and 'A Study in Scarlet']
The coefficients of buffer
In our target query, we need two coefficients:
For retrieval of topics
$$\frac{M}{2N} + \left(1 - \frac{M}{2N}\right) r$$
M/(2N): the probability that the topic does not exist on the buffer
For retrieval of topics for the association names
$$\frac{Q}{N} + \left(1 - \frac{Q}{N}\right) r$$
Q/N: the probability that the topic for the association name does not exist on the buffer
r: the effective retrieval ratio of cost for buffer
N: the number of association objects
M: the number of topic objects
Q: the number of unique association names
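One plausible reading of these coefficients is an expected cost factor: with probability p the object is not in the buffer and costs the full amount, otherwise it costs only the fraction r. The sketch below assumes that form, p + (1 - p)·r, with p = M/(2N) for topics and p = Q/N for association-name topics. The clamp to 1, the function names and the sample values of r and Q are assumptions of this example, not values from the slides.

```python
def buffer_coefficient(p_not_buffered: float, r: float) -> float:
    """Expected cost factor: full cost on a miss, fraction r on a buffer hit."""
    p = min(1.0, p_not_buffered)  # assumption: treat p as a probability
    return p + (1.0 - p) * r


def topic_coefficient(m: int, n: int, r: float) -> float:
    # 2N topic retrievals are spread over M distinct topics
    return buffer_coefficient(m / (2 * n), r)


def assoc_name_coefficient(q: int, n: int, r: float) -> float:
    # N association-name retrievals are spread over Q distinct names
    return buffer_coefficient(q / n, r)


# Pokemon topic map sizes from the experiment; r = 0.1 and Q = 4 are made up
k_topic = topic_coefficient(m=174, n=432, r=0.1)
k_name = assoc_name_coefficient(q=4, n=432, r=0.1)
print(round(k_topic, 4), round(k_name, 4))
```

Under this reading the coefficient is 1 when nothing can be reused (p = 1) and approaches r when almost every retrieval hits the buffer.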
The modified cost estimation formulae
Taking the buffering effect into consideration, we modify the cost estimation formulae as follows
The contribution of loading topic name objects is also taken into consideration
$$C_{assoc\_route} = (C_a + C_t + C_{tn})\,N + 2\,(C_{ar} + C_t + C_{tn})\,\frac{N}{Q}$$
$$C_{topic\_route} = (C_t + C_{tn})\,\frac{M}{2} + (C_{ar} + C_a + C_t + C_{tn})\,\frac{2N}{M} + (C_{ar} + C_t + C_{tn})\,\frac{2N}{MQ}$$
Cost estimation formula
for the association route
We define the cost estimation formula as follows
$$C_{assoc\_route} = (C_a + C_t + C_{tn})\,N + 2\,(C_{ar} + C_t + C_{tn})\,\frac{N}{Q}$$
Retrievals labelled on the formula:
Retrieval of TopicMap objects
First term: retrieval of Association objects, of Topic objects that are defined as the Association name, and of TopicName objects that are defined as the Association name
Second term: retrieval of AssociationRole objects, of Topic objects, and of TopicName objects that are defined as the Topic name
Buffer coefficient for retrieval of topics:
$$\frac{M}{2N} + \left(1 - \frac{M}{2N}\right) r$$
Buffer coefficient for retrieval of topics for the association names:
$$\frac{Q}{N} + \left(1 - \frac{Q}{N}\right) r$$
N: the number of association objects
M: the number of topic objects
Q: the number of unique association names
TMDM permits the redundant existence of multiple associations that have the same name
We assume that the association roles are uniformly assigned to associations
The simple form of the formula:
$$C_{assoc\_route} = C_a N + 2\,(C_{ar} + C_t)\,\frac{N}{Q}$$
The accurate cost estimation formula
for the association route
$$C_{assoc\_route} = (C_a + C_t + C_{tn})\,N + 2\,(C_{ar} + C_t + C_{tn})\,\frac{N}{Q}$$
First term: we have to consider the retrieval cost of topic and topic name objects and the effect of the buffer
$$\frac{Q}{N} + \left(1 - \frac{Q}{N}\right) r$$
Second term: we have to consider the retrieval cost of topic name objects and the effect of the buffer
$$\frac{M}{2N} + \left(1 - \frac{M}{2N}\right) r$$
C_a: the retrieval cost of association objects
C_ar: the retrieval cost of association role objects
C_t: the retrieval cost of topic objects
C_tn: the retrieval cost of topic name objects
N: the number of association objects
M: the number of topic objects
Q: the number of unique association names
Cost estimation formula
for the topic route
We define the cost estimation formula as follows
$$C_{topic\_route} = (C_t + C_{tn})\,\frac{M}{2} + (C_{ar} + C_a + C_t + C_{tn})\,\frac{2N}{M} + (C_{ar} + C_t + C_{tn})\,\frac{2N}{MQ}$$
Retrievals labelled on the formula:
Retrieval of TopicMap objects
First term: retrieval of Topic objects and of TopicName objects that are defined as the Topic name
Second term: retrieval of AssociationRole objects, of Association objects, of Topic objects that are defined as the Association name, and of TopicName objects that are defined as the Association name
Third term: retrieval of AssociationRole objects, of Topic objects, and of TopicName objects that are defined as the Topic name
M/2: TMDM permits the existence of only one topic that has a given name
2N/M: regarding the topic map as a graph, this is equal to the average degree
2N/(MQ): we assume that the association roles are uniformly assigned to associations
The accurate cost estimation formula
for the topic route
$$C_{topic\_route} = (C_t + C_{tn})\,\frac{M}{2} + (C_{ar} + C_a + C_t + C_{tn})\,\frac{2N}{M} + (C_{ar} + C_t + C_{tn})\,\frac{2N}{MQ}$$
First term: we have to consider the retrieval cost of topic name objects
Second term: we have to consider the retrieval cost of topic objects and topic name objects and the effect of the buffer
$$\frac{Q}{N} + \left(1 - \frac{Q}{N}\right) r$$
Third term: we have to consider the retrieval cost of topic name objects and the effect of the buffer
$$\frac{M}{2N} + \left(1 - \frac{M}{2N}\right) r$$
C_a: the retrieval cost of association objects
C_ar: the retrieval cost of association role objects
C_t: the retrieval cost of topic objects
C_tn: the retrieval cost of topic name objects
N: the number of association objects
M: the number of topic objects
Q: the number of unique association names
The simple form of the formula:
$$C_{topic\_route} = C_t\,\frac{M}{2} + (C_{ar} + C_a)\,\frac{2N}{M} + C_{ar}\,\frac{2N}{MQ}$$
Result: cost estimation of an object of each class

Retrieval time and object size are each normalized so that the value for the association role object is 1

Topic map                 Object             Retrieval time (nano sec)   Normalized retrieval time   Object size (byte)   Normalized object size
Rampo Edogawa Topic Map   topic              969200                      3.34                        608                  4.75
Rampo Edogawa Topic Map   topic name         496700                      1.71                        376                  2.94
Rampo Edogawa Topic Map   association role   289900                      1                           128                  1
Rampo Edogawa Topic Map   association        562600                      1.94                        376                  2.94
Pokemon Topic Map         topic              1053000                     5.5                         608                  4.75
Pokemon Topic Map         topic name         501600                      2.62                        376                  2.94
Pokemon Topic Map         association role   191400                      1                           128                  1
Pokemon Topic Map         association        577700                      3.02                        376                  2.94

We can see a similar tendency between the retrieval time and the object size
Retrieval cost of each object
We measured the retrieval time and the object size of each object
The result tells us that the retrieval time is almost proportional to the object size
Based on this, we define the cost as an object size scale factor (the ratio of the object size to the size of an association role object)

Pokemon Topic Map
Object                    Normalized retrieval time (association role = 1)   Object size scale factor
Topic object              5.5                                                4.75
Topic name object         2.62                                               2.94
Association role object   1                                                  1
Association object        3.02                                               2.94

We can see a similar tendency between the retrieval time and the object size
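The scale factors in the table can be reproduced from the measured object sizes by dividing each size by the size of an association role object. The sizes below are the ones reported in the results; the variable names are just for this example.

```python
# Object sizes in bytes, as reported in the results table
sizes = {
    "topic": 608,
    "topic name": 376,
    "association role": 128,
    "association": 376,
}

# Scale factor: size relative to an association role object
base = sizes["association role"]
scale_factors = {name: size / base for name, size in sizes.items()}

for name, factor in scale_factors.items():
    print(f"{name}: {factor:.2f}")
```

These factors are the values used as C_t, C_tn, C_ar and C_a when a cost is evaluated from object sizes instead of measured retrieval times.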
Future perspective
We will apply our method to other topic maps of much larger size
Our target topic maps have fewer than 1000 topics
We need to confirm the universality of the cost estimation formulae by evaluating various topic maps
We will develop a mechanism to measure the size of objects in a topic map
Since the size of objects depends on each topic map, we have to measure it to set cost values adequate for evaluating execution plans
Reference
M. Naito: An Introduction to Topic Maps. Tokyo Denki University Press, 2006.
Yuki Kuribara, Takeshi Hosoya, Masaomi Kimura: TOME: The Topic Map Database Extended, 2009.
Ontopia: tolog Language tutorial. http://www.ontopia.net/
ISO/IEC JTC1/SC34: Topic Maps – Data Model. http://www.isotopicmaps.org/sam/sam-model/
Pokemon Topic Map. http://www.ontopia.net/omnigator/models/topicmap_complete.jsp?tm=pokemon.ltm
Pajek. http://vlado.fmf.uni-lj.si/pub/networks/pajek/