IIT-MADRAS
Ganesh P Kumar 7/8/2014
Superfast Type Inference Engine Done as part of the Smart Assistant Project, in Samsung Research Institute, Bengaluru, India under the guidance of Dr. Lokendra Shastri
About The Smart Assistant Project
The age of smartphones is well and truly upon us. A majority of the population owns
smartphones, and the market for them has grown substantially over the last few years.
One of the most salient, and perhaps differentiating, features of smartphones is the
smart assistant. Be it Apple’s Siri or Microsoft’s Cortana, smart assistants have
definitely made their mark in today’s smartphone industry. What makes a smart assistant
‘smart’ is its ability to interact with the user in a manner that the user finds
comfortable, and the medium most users find most comfortable is spoken natural
language. So, any smart assistant will involve a significant NLP component that makes
inferences to generate new knowledge from existing knowledge. It is as part of this
ambitious project that I worked during my 2-month internship at SRI-B, from 14-05-2014
till 11-07-2014.
The Need for a ‘Fast’ Type Inference Engine
Before we understand the reason for a fast type inference engine, we need to understand
what type inference is. We need to be able to look at a statement in context and deduce
the types of the various entities involved. For example, if we see the statement “I have a
pet dog”, even without any other information, we may deduce that ‘dog’ is some living
thing, which is tame enough to be had as a pet.
Type inference is something that goes on every second in the human brain. It is a vital
component to understand the meaning of any given statement, and one would often
have to do multiple such inferences to parse such a statement properly. Type inference
is important not just for understanding statements but also for generating new
knowledge from a given statement. Many times, as humans, we generate knowledge by
listening to the statements around us. Sometimes, this knowledge is explicit in the
statement. For example, the statement “Tagore got the Nobel Prize” leads us to generate
the corresponding fact in our brain.
But the human brain, being as complex as it is, does not stop at this. The brain can also
generate knowledge that is not present directly in a statement. To someone who does
not know about Tagore, the same statement will lead to another deduction: if Tagore
got the Nobel Prize, Tagore must be a human (because only humans are awarded the
Nobel Prize), he must have done something noteworthy (Nobel Prize is an Award and
Awards are given for doing something noteworthy). Also, he must have got the Nobel
Prize in at least one of the 6 categories. So, he must have done something noteworthy in
one of these 6 categories. Note that the deduction does not stop here, but goes on, trying
to generate as much knowledge from the single statement by placing it in context of
already known facts. This is vital to interacting in an intelligent manner with any
human. If a smart assistant is to claim to be as intelligent as a close friend, it must be
able to master this ability. In particular, it must be able to perform type inference.
Examples of the latter are, “Tagore must be a human” and “Nobel Prize is an Award.”
It is in this context that the speed of type inference comes to the fore. Given a statement,
there can be hundreds, if not thousands, of type inferences that must be checked. If the
software is to maintain a semblance of ‘intelligence’ and ‘smartness’, then these type
inferences must not only be done correctly, but also quickly. A single query must not
take longer than say, 100 milliseconds to run. This was the challenge that was tackled
partially during the course of the internship.
A Connectionist Model for representing Types
In order to make the inference faster, we decided to adopt a proven connectionist model
for representing types and type hierarchies as given in Section 3 of the paper, “Types
and Quantifiers in SHRUTI – a connectionist model of rapid reasoning and relational
processing”, by my mentor Dr. Lokendra Shastri [Shastri, 2000]. To understand the
simple network-based scheme, it helps to think of it as an electrical circuit in which
each network node is an LED that can be on or off, and each interconnection is a wire
that runs from one node to another. A description of the network structure is given
below; Figure 1 illustrates it.
Each node in the type hierarchy is represented by a focal cluster of nodes in the network,
and the focal clusters are different for types and instances. A type focal cluster has 4
nodes (labelled ?v, ?e, +v and +e) while an instance focal cluster has just 2 (labelled +
and ?). It is the interconnection within and between the focal clusters that forms the
basis of the inference engine.
Figure 1 : A small connectionist network [Reprinted from (Shastri, 2000) with permission]. Note the interconnections between and within the different types of focal clusters.
Semantics
Before going in depth, we need to understand the semantics of the different types of
nodes. This is explained below:
+v:
Universal assertion.
If this node is on at the same time a corresponding predicate node is on (in the
same phase), then it means that the predicate is true for all instances of the
corresponding type.
+e:
Existential assertion.
If this node is on at the same time a corresponding predicate node is on (in the
same phase), then it means that the predicate is true for at least one instance of
the corresponding type.
?v:
Universal query.
If this node is on at the same time a corresponding predicate node is on (in the
same phase), then it means that a query is being made if the attribute holds for all
instances of the type. The query is considered true if the +v node turns on, else it
is false.
?e:
Existential query.
If this node is on at the same time a corresponding predicate node is on (in the
same phase), then it means that a query is being made if the attribute holds for at
least one instance of the type. The query is considered true if the +e node turns
on, else it is false.
+:
Instance assertion.
If this node is on at the same time a corresponding predicate node is on (in the
same phase), then it means that the predicate is true for the corresponding
instance.
?:
Instance Query
If this node is on at the same time a corresponding predicate node is on (in the
same phase), then it means that a query is being made if the attribute holds for
the corresponding instance. The query is considered true if the + node on the
same focal cluster turns on, else it is false.
Apart from this we have 2 nodes per predicate, which are similar to the + and ? nodes
discussed above.
Interconnections
The nodes in the network are interconnected to create the inference engine. These
interconnections are designed so that any kind of query can be answered by the
propagation of messages along them. They are explained as follows:
Within type focal clusters:
+v -> +e :
If the +v node is on, then the +e node of the same type must get switched on.
?e -> ?v
An existential query could be answered by asking a universal query on the same
type. If the universal query is true, the existential query automatically becomes
true.
+v -> ?v
If +v turns on, ?v must turn on to assimilate the fact in the system.
+e -> ?e
If +e turns on, ?e must turn on to assimilate the fact in the system.
Within instance focal clusters:
+ -> ?
An assertion on the instance must turn on the ? node to assimilate the fact in the
system.
Between type focal clusters:
+e -> +e’
An existential assertion on the subtype should turn on an existential assertion on
the supertype
+v’ -> +v
A universal assertion on the supertype should turn on a universal assertion on the
subtype.
?v -> ?v’
A universal query on the subtype should turn on a universal query on the
supertype.
?e’ -> ?e
An existential query on the supertype must turn on an existential query on the
subtype.
Between type and instance focal clusters
+ -> +e
An assertion on the instance means that there exists at least one object of the type
that has the attribute. Hence the +e node of the type must turn on.
+v -> +
If something is true about all instances of a type then the + node must turn on for
every instance of that type.
? -> ?v
A question about an instance may be answered by asking a universal query on the
corresponding type. Hence this connection.
?e -> ?
An existential query about the type could be answered by querying each instance
of that type.
Between type focal clusters and database/ontology:
?v -> DB
DB -> +v
Between instance focal clusters and database/ontology:
? -> DB
DB -> +
Querying the Network:
Shown above in Figure 1 is a small network for a simple type hierarchy: Agent -> Human -> John.
To understand how the system works, let us assume that we know the
fact “All agents are alive” and let us ask the network “Is John alive?”
Step 1: Clamp ?:John
Step 2: ?:John -> ?v:Human
Step 3: ?v:Human -> ?v:Agent
Step 4: ?v:Agent -> DB
Step 5: DB -> +v:Agent
Step 6: +v:Agent -> +v:Human
Step 7: +v:Human -> +:John
This is what we wanted. +:John has turned on. The response to the query is “TRUE”.
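The seven steps above can be traced with a small simulation. The following is a minimal sketch in Java, not the project's code: the class name QueryTraceSketch, the string node labels and the set that stands in for the database are all assumptions made for illustration, and only a single phase and the links used by this example are modelled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical single-phase sketch of the Agent -> Human -> John example.
final class QueryTraceSketch {
    // Directed links between node labels; only the links used by the example are listed.
    private final Map<String, List<String>> links = new HashMap<>();
    // Types for which the stand-in database holds the universal fact ("is alive").
    private final Set<String> universalFacts = new HashSet<>();

    QueryTraceSketch() {
        links.put("?:John", Arrays.asList("?v:Human"));    // ? -> ?v (instance to type)
        links.put("?v:Human", Arrays.asList("?v:Agent"));  // ?v -> ?v' (subtype to supertype)
        links.put("+v:Agent", Arrays.asList("+v:Human"));  // +v' -> +v (supertype to subtype)
        links.put("+v:Human", Arrays.asList("+:John"));    // +v -> + (type to instance)
    }

    // Records "the fact holds for all instances of this type" in the stand-in database.
    void assertForAll(String type) {
        universalFacts.add(type);
    }

    // Clamps the query node and spreads activation until the target turns on
    // or no further node can be switched on.
    boolean query(String clamped, String target) {
        Set<String> on = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        on.add(clamped);
        frontier.add(clamped);
        while (!frontier.isEmpty()) {
            String node = frontier.poll();
            if (node.equals(target)) {
                return true;
            }
            List<String> next =
                new ArrayList<>(links.getOrDefault(node, Collections.<String>emptyList()));
            // ?v -> DB -> +v: a universal query that the database can confirm
            // switches on the universal assertion of the same type.
            if (node.startsWith("?v:") && universalFacts.contains(node.substring(3))) {
                next.add("+v:" + node.substring(3));
            }
            for (String n : next) {
                if (on.add(n)) {
                    frontier.add(n);
                }
            }
        }
        return false;
    }
}
```

Calling assertForAll("Agent") and then query("?:John", "+:John") follows exactly the chain of Steps 1 to 7; without the stored fact the same query ends at ?v:Agent and returns false.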
Objective
This interesting design has to be implemented in a way that not only guarantees
correctness but is also fast. The goal of this internship was to implement this module
and achieve as much speedup as possible. Since the software would be running on the
cloud, memory was not a constraint.
We had 2 ways to approach the problem:
i. To implement it in an Object-oriented fashion, where each of these nodes is an
object with certain attributes.
ii. To implement the nodes as secondary (implied) objects, so that the attributes are
expressed directly in the form of arrays.
Both the methods have been shown below:
Figure 2 : The Design Dilemma. Should we go for an explicit representation of nodes or a more implicit one? The first offers tree parallelism, while the second promises faster access.
The latter was preferred in order to exploit array (table) lookup, which we expected to
be faster than object field lookup. While not very intuitive to work with, it
was a sacrifice we were willing to make if the desired speed could be achieved.
Moreover, working with tables opens the door to KB-level parallelism.
Synchronization Constraints
For correctness purposes, there are a few constraints that we must respect while
attempting to bring parallelism in the code. Broadly, let us divide a single iteration into
3 phases, which are explained as below:
Production stage
Production involves looking at the outputs of the upstream nodes and calculating the
current node’s output. This newly calculated value is stored in the node’s
bufferedOutput slot.
Firing stage
This involves moving the values from the bufferedOutput to the output slot. Also, if the
node that is firing is the target node for that phase, then report success and mark the
phase for resetting. Note that the target node for a phase p is defined as the node that is
expected to turn on so that the query associated with phase p can be answered as
“TRUE”.
Clamping/Resetting stage
If the clampManager has pending clamps, then it is resolved in this stage i.e. the clamp
manager actually clamps the required nodes to the corresponding clamp values in the
network. Similarly, any phase that needs to be reset is also reset during this period.
While trying to achieve concurrency, we must ensure that the following hold true:
Before production phase, firing phase for all upstream nodes must be complete.
Before firing phase, production phase must be over for the nodes.
Clamping must take place only after resetting. Otherwise, we will not be able to
differentiate between the recently clamped nodes and the nodes that were already
on. As a result, we may end up resetting nodes that were clamped in the same
iteration. In other words, the new query which was initiated by clamping a node
could be killed prematurely.
While clamping and resetting, no other operation must take place on the
network. As a result, the software was designed to have a single MasterController
that would repeat the following cycle indefinitely.
So one iteration of the MasterController can be broadly divided into 3 phases. Note
that these 3 phases cannot overlap with one another across the whole set of nodes
(which we will refer to as the Network), i.e. no node can be in the Firing phase or
Clamping phase while some other node is in the Production phase. This
places a tight bound on the amount of concurrency we can wring out of the system.
Figure 3: The 3-stage cycle of the MasterController : Produce, Fire, Reset/Clamp
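The ordering constraints of this cycle can be illustrated with a double-buffered loop. What follows is a minimal single-threaded, single-phase sketch in Java under stated assumptions: the class ThreeStageCycleSketch, the boolean arrays and the rule that a node turns on when any upstream node is on are illustrative choices, not the project's actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

// Hypothetical single-phase, single-threaded sketch of one MasterController cycle.
final class ThreeStageCycleSketch {
    final int[][] upstream;          // upstream[n] lists the nodes that feed node n
    final boolean[] clamps;          // nodes held on by an active query
    final boolean[] output;          // values visible to downstream nodes
    final boolean[] bufferedOutput;  // values computed in this iteration, not yet visible
    private final Queue<Integer> pendingClamps = new ArrayDeque<>();
    private boolean resetRequested;

    ThreeStageCycleSketch(int[][] upstream) {
        this.upstream = upstream;
        this.clamps = new boolean[upstream.length];
        this.output = new boolean[upstream.length];
        this.bufferedOutput = new boolean[upstream.length];
    }

    void requestClamp(int node) { pendingClamps.add(node); }

    void requestReset() { resetRequested = true; }

    void iterate() {
        // Production: read only output[], write only bufferedOutput[].
        for (int n = 0; n < output.length; n++) {
            boolean on = clamps[n];
            for (int m : upstream[n]) {
                on |= output[m];  // assumed rule: on if any upstream node is on
            }
            bufferedOutput[n] = on;
        }
        // Firing: publish the buffered values only after all production is done.
        System.arraycopy(bufferedOutput, 0, output, 0, output.length);
        // Resetting comes before clamping, so a clamp requested in this
        // iteration is never wiped out by a reset of the same iteration.
        if (resetRequested) {
            Arrays.fill(clamps, false);
            Arrays.fill(output, false);
            Arrays.fill(bufferedOutput, false);
            resetRequested = false;
        }
        Integer node;
        while ((node = pendingClamps.poll()) != null) {
            clamps[node] = true;
        }
    }
}
```

Because production reads only output and writes only bufferedOutput, a signal advances exactly one link per iteration regardless of the order in which nodes are visited, which is the property that later allows the nodes to be split across threads.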
Points of Potential Concurrency
There is probably just one phase in which concurrency is achievable. Since we cannot
overlap one phase on another, we have to think of parallelizing the Production phase
and the Firing phase separately. So we try two approaches: one by partitioning the
nodes into sets and letting threads handle production and firing separately for each set
(which will be referred to as Network Partitioning, or KB Parallelism), and another by
running multiple queries simultaneously on the same network (which will be referred to
as Multi-Phase Parallelism).
Network Partitioning
To implement the first kind of parallelism, we structure the code as follows:
A MasterController spawns several SubControllers, depending on the size of the
network, taking care that not more than 150 nodes end up on the same
SubController.
Each SubController handles the production and firing for its subset.
Once all the SubControllers are done, the MasterController resets the network (if
required) and checks with the ClampManager if there are any pending
clamp/declamp requests. If yes, the ClampManager resolves its pending requests
and returns control back to the MasterController that moves on to the next
iteration.
A diagrammatic representation of the algorithm is as follows:
Figure 4 : System Design. A single MasterController spawns multiple SubControllers that take care of production and firing concurrently, before synchronizing to begin the resetting/clamping stage.
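The partitioning scheme can be sketched with a thread pool in which every SubController is a task over a contiguous range of node indices. This is a hedged sketch, not the project's code: NetworkPartitionSketch, the boolean arrays and the use of ExecutorService.invokeAll as the barrier between production and firing are assumptions; only the limit of 150 nodes per SubController comes from the text. Resetting and clamping are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of Network Partitioning (KB Parallelism), single phase.
final class NetworkPartitionSketch {
    static final int MAX_NODES_PER_SUBCONTROLLER = 150;  // limit quoted in the text

    // Splits node indices 0..nodeCount-1 into contiguous [start, end) ranges.
    static List<int[]> partition(int nodeCount) {
        List<int[]> ranges = new ArrayList<>();
        for (int start = 0; start < nodeCount; start += MAX_NODES_PER_SUBCONTROLLER) {
            int end = Math.min(start + MAX_NODES_PER_SUBCONTROLLER, nodeCount);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    // Runs the given number of produce/fire iterations and returns the outputs.
    static boolean[] run(int[][] upstream, boolean[] clamps, int iterations) {
        int n = upstream.length;
        boolean[] output = new boolean[n];
        boolean[] buffered = new boolean[n];
        List<int[]> ranges = partition(n);
        int threads = Math.max(1,
            Math.min(ranges.size(), Runtime.getRuntime().availableProcessors()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Void>> production = new ArrayList<>();
            List<Callable<Void>> firing = new ArrayList<>();
            for (int[] r : ranges) {
                production.add(() -> {
                    for (int i = r[0]; i < r[1]; i++) {
                        boolean on = clamps[i];
                        for (int m : upstream[i]) {
                            on |= output[m];  // assumed rule: on if any upstream node is on
                        }
                        buffered[i] = on;
                    }
                    return null;
                });
                firing.add(() -> {
                    System.arraycopy(buffered, r[0], output, r[0], r[1] - r[0]);
                    return null;
                });
            }
            for (int it = 0; it < iterations; it++) {
                pool.invokeAll(production);  // returns only when every range has produced
                pool.invokeAll(firing);      // returns only when every range has fired
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return output;
    }
}
```

Since invokeAll blocks until all submitted tasks have completed, no range can start firing while another is still producing, which is the synchronization constraint stated earlier.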
Multi-phase Parallelism
To understand what I mean by multi-phase parallelism, let us revert back to the circuit-
like view of our network. Now, instead of assuming one LED in the place of each node,
assume a set of 8 LEDs, each one in a different color. The interconnections are made
between LEDs of the same color. So in essence, we have made 8 copies of the same
network. I call each different color a ‘phase’ and in essence, I must be able to run 8
queries simultaneously on the same network, using the same controller. Also, note that
these phases are independent, which is good for concurrency.
But having multiple phases simultaneously has its own share of complications. Some of
them are given below:
Earlier, it was enough if we just propagated Booleans through the network –
simple on/off signals that could be used to trace a query through the network.
Now, our signals will become slightly more complicated: we need to tell the node
which phase it has to fire in apart from the simple on/off signal.
Network resetting needs to be done per phase. We cannot afford to have a coarse
reset that simply reverts the network to the original state, which was possible in
the single-phase scenario.
Each node must be able to handle multiple phases simultaneously. In other
words, the phase of firing is not necessarily unique to the node. In the extreme
scenario, a node could be firing in all 8 phases.
The final version of the module combines both the kinds of parallelism. The code
structure and flow are discussed in the coming sections.
Code structure
The code structure is as discussed below:
Important Classes
Network
This contains the basic network involved in the type inference engine. It is through this
Network object that the queries progress.
MasterController
Each Network contains a MasterController object. This MasterController is responsible
for processing the queries. In some sense, while the Network object is like a basic circuit,
the MasterController is like the battery source that drives it.
SubController
The central MasterController spawns SubController threads to do the actual processing.
It is through these SubController threads that we have achieved KB-level parallelism.
ClampManager
Since we can clamp only at specified intervals of the execution, we need a special
ClampManager object to take care of it. The ClampManager collects requests for
clamping/declamping from multiple threads and resolves the pending clamps once it is
possible to do so.
PhaseManager
Each query, before it is actually handled, needs a phase in which to fire. The
PhaseManager object maintains an ArrayList of available phases. Each new query
requests the PhaseManager for its phase before firing.
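A phase pool of this kind could look roughly as follows. This is a sketch under assumptions: the report keeps the free phases in an ArrayList, whereas this example swaps in an ArrayBlockingQueue so that waiting for a free phase needs no extra locking; the class name PhaseManagerSketch and its method names are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of a pool of phases handed out to incoming queries.
final class PhaseManagerSketch {
    private final BlockingQueue<Integer> freePhases;

    PhaseManagerSketch(int phaseCount) {
        freePhases = new ArrayBlockingQueue<>(phaseCount);
        for (int p = 0; p < phaseCount; p++) {
            freePhases.add(p);
        }
    }

    // Blocks until a phase is free.
    int acquire() throws InterruptedException {
        return freePhases.take();
    }

    // Non-blocking variant: returns -1 when every phase is in use.
    int tryAcquire() {
        Integer p = freePhases.poll();
        return p == null ? -1 : p;
    }

    // Returns a phase to the pool once its query has been answered and reset.
    void release(int phase) {
        freePhases.add(phase);
    }
}
```

With 8 phases, the ninth concurrent query either waits in acquire() or is told by tryAcquire() that no phase is currently free.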
Life of a Query
Any query that comes in initially is handled by the MasterController object. The
MasterController decides which node to clamp, and requests the PhaseManager for a
phase to fire in. The PhaseManager returns with an available phase, or else waits till a
phase becomes available. The MasterController registers the query and the phase in a
HashMap, along with the desired target node for that query and requests the
ClampManager to clamp the node in that phase.
Once the node has been clamped, the MasterController thread takes care of propagating
the query through the network. If the target node for the given phase fires in that phase,
then the MasterController records the answer to the query as true and marks
the phase for resetting.
Once the firing phase is done, the MasterController checks if there is any phase that
must be reset and calls the Network to reset itself in that phase.
Handling Multiple Phases
Data Structure for Message passing:
To handle multiple phases, we follow an encoding scheme so that we can encode
information about the source, amplitude and phase of firing into a single 32-bit integer.
Figure 5 : A Message encoded into a 32-bit integer
Src is a single bit that tells whether the message is from an Entity focal cluster or from a
Predicate focal cluster.
The phase bitvector is, as the name suggests, a bitvector to encode the phases active in
this Message. So if the phase bitvector is 01110101, it means that the Message is active in
phases 0,2,4,5 and 6. So with this encoding scheme we have effectively tackled the
problem of passing more information and multi-phase firing of nodes.
For ease of use, this Message has been placed in a separate
class with functions that take care of encoding, decoding and extracting the phases
from a given Message.
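The encoding can be illustrated with plain bit operations. The exact bit layout is defined by Figure 5, which is not reproduced here, so the layout below is an assumption: bit 31 holds Src, bits 8 to 30 hold the amplitude, and bits 0 to 7 hold the phase bitvector with phase 0 in the least significant bit. The class MessageSketch and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a Message packed into one 32-bit int.
// Assumed layout: [ src: 1 bit | amplitude: 23 bits | phase bitvector: 8 bits ].
final class MessageSketch {
    static final int PHASE_BITS = 8;
    static final int PHASE_MASK = (1 << PHASE_BITS) - 1;  // low 8 bits
    static final int AMPLITUDE_MASK = (1 << 23) - 1;      // assumed amplitude width

    static int encode(boolean fromPredicate, int amplitude, int phaseBits) {
        int src = fromPredicate ? 1 : 0;
        return (src << 31)
            | ((amplitude & AMPLITUDE_MASK) << PHASE_BITS)
            | (phaseBits & PHASE_MASK);
    }

    static boolean fromPredicate(int message) {
        return (message >>> 31) == 1;
    }

    static int amplitude(int message) {
        return (message >>> PHASE_BITS) & AMPLITUDE_MASK;
    }

    // Lists the phases whose bit is set, lowest phase first.
    static List<Integer> phases(int message) {
        List<Integer> active = new ArrayList<>();
        for (int p = 0; p < PHASE_BITS; p++) {
            if ((message & (1 << p)) != 0) {
                active.add(p);
            }
        }
        return active;
    }
}
```

Under this layout, encoding the bitvector 01110101 and calling phases(...) on the result yields phases 0, 2, 4, 5 and 6, matching the example in the text.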
Phase-specific Network reset:
This is fairly simple to achieve once we have the underlying data structure ready.
Network reset on a phase i scans through the clamps array, output array and the
buffered output array and resets the phase on all existing Messages. A Message that was
firing only on phase i is removed.
This could have been made parallel, but the decision to make it serial had its basis in the
following observations:
One idea was to run the resetting for multiple phases concurrently. But on further
analysis, we see that all the threads (1 thread per phase) would be trying
to access the same Network object. This would result in a lot of contention, which
would kill the benefits of making the code concurrent.
There are only a small number (< 10) of phases firing at any
instant, of which only a small fraction (on average) needs resetting. Even
with perfect speedup, the absolute time saved would therefore be small.
This does not give much incentive to make the code parallel.
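The serial phase-specific reset described above amounts to clearing one bit in every stored Message. The sketch below is a simplified illustration in Java: it assumes the phase bitvector occupies the low 8 bits of each int and uses 0 to stand for "no Message", and the names PhaseResetSketch, clearPhase and resetPhase are hypothetical.

```java
// Hypothetical sketch of a phase-specific reset over int-encoded Messages.
final class PhaseResetSketch {
    static final int PHASE_MASK = 0xFF;  // assumed: phase bitvector in the low 8 bits

    // Clears one phase bit; a Message left with no active phase is removed (0).
    static int clearPhase(int message, int phase) {
        int cleared = message & ~(1 << phase);
        return (cleared & PHASE_MASK) == 0 ? 0 : cleared;
    }

    // Applies the reset to every array that stores Messages
    // (for example clamps, output and bufferedOutput).
    static void resetPhase(int phase, int[]... messageArrays) {
        for (int[] messages : messageArrays) {
            for (int i = 0; i < messages.length; i++) {
                messages[i] = clearPhase(messages[i], phase);
            }
        }
    }
}
```

A Message firing on phases 0 and 2 keeps firing on phase 0 after resetPhase(2, ...), while a Message firing only on phase 2 disappears.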
Base Algorithm
Production
Algorithm 1.1: Basic Production
procedure doProduction
    for each node n in SubController list
        if n is clamped
            bufferedOutput[n] = clamps[n]
        else
            inputs = new Array[Message]
            for each node m upstream to node n
                inputs.add(output[m])
            endfor
            bufferedOutput[n] = outputFromInputs(inputs)
        endif
    endfor
end procedure
In words, we are essentially looking at a node’s clamp value. If it is set, transfer it to the buffer immediately. Otherwise, take in the inputs from the upstream nodes, process them to get the output and then put this output generated into the buffer.
Firing
Algorithm 1.2: Basic Firing
procedure doFiring
    for each node n in SubController list
        if clamps[n] is not set
            output[n] = bufferedOutput[n]
        else
            output[n] = clamps[n]
        endif
        if bufferedOutput[n] is not empty
            for each phase p in bufferedOutput[n]
                if n is targetNode for p
                    reportTrue(p)
                    markForReset(p)
                endif
            endfor
        endif
    endfor
end procedure
Here, we just transfer the bufferedOutput to the output. After that we test if the current
node is the target node for any of the phases it is firing on. If it is, we report true on the
phase and mark it for resetting.
Optimization
Experiments showed that the performance of this algorithm was not good enough. To answer
a query with depth 1, the algorithm took, on average, 30.35 ms, and as the depth
increases, the time taken increases exponentially. This is clearly not the behavior
expected from a scalable, high-speed inference engine. As a result, we needed to
optimize this code, either by making it more parallel or by removing
unnecessary and redundant operations.
So, after a brief analysis of the code, we singled out one main problem. We hoped that
fixing it would give us the speedup we were aiming for.
Targets for Optimization
Note that, in both algorithms, we have blindly processed every node in the
SubController’s range. This is an enormous waste of resources and time. Of course, in
Algorithm 1.1, the only operation involved is a few logical checks and a simple table
lookup and write operation. But to put this in perspective, consider just one query that is
running, in a network with say, 10000 nodes. Assuming this is a depth 1 query, the
nodes that really need to be processed are probably a handful. However, this algorithm
will process all 10000 nodes before reporting “True”. This was the major bottleneck that
we faced and we needed to remove this.
Optimization Ideas
To remove the drawback stated above, we came up with two ideas that were
implemented. These are as discussed below:
Selective Activation
Every SubController needs to maintain a list of ‘active’ nodes – nodes that must be
processed this turn. These active nodes are the only ones for which the Production and
Firing will take place. However, note that this list is not static but dynamic: new
nodes must be added to it every iteration (the frontier nodes’ downstream nodes), and
nodes may be removed from it during an iteration (network reset). Also, while clamping,
we must make sure to add the clamped node to the active node list of the corresponding
SubController. Since this method tries to restrict the activation only to
those nodes that really need to be activated, it will be referred to as Selective
Activation.
Taking it one step further, we can maintain an active node list at the MasterController
that is essentially a union of the active node lists of all the SubControllers. This needs
more maintenance, but with this in place, we can start only those SubControllers
that are really required. This will reduce competition for system resources.
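The effect of an active node list can be shown by counting how many nodes are visited before the target fires. The following is a simplified sketch in Java, not the project's implementation: it tracks only on/off values in a single phase, ignores the split into SubControllers, and the class SelectiveActivationSketch and its method name are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical single-phase sketch of Selective Activation.
final class SelectiveActivationSketch {
    // Returns how many node visits were needed until the target fired,
    // or -1 if the target can never fire from the clamped node.
    static int visitsUntilTarget(int[][] downstream, int clamped, int target) {
        boolean[] output = new boolean[downstream.length];
        Set<Integer> active = new HashSet<>();
        active.add(clamped);
        int visits = 0;
        while (!active.isEmpty()) {
            Set<Integer> next = new HashSet<>();
            for (int n : active) {
                visits++;
                output[n] = true;  // the node fires
                if (n == target) {
                    return visits;
                }
                // Only the downstream nodes of the frontier become active next turn.
                for (int d : downstream[n]) {
                    if (!output[d]) {
                        next.add(d);
                    }
                }
            }
            active = next;
        }
        return -1;
    }
}
```

In a network of 10000 nodes where the query only has to travel along the chain 0 -> 1 -> 2, this visits 3 nodes, whereas Algorithms 1.1 and 1.2 would touch all 10000 nodes in every iteration.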
Signal Decay
Another idea was inspired by the concept of paging in Operating systems. The
fundamental philosophy behind paging as a concept is the fact that for a program to go
to completion, it is not required that the entire code be on RAM. What is required is that
the next instruction to be executed must be present in the RAM. Similarly, in the given
network, if you notice, what is required for a query to go to completion is just a few
nodes that must be turned on. It is not required that every node that is involved in the
query needs to fire till the query is satisfied (or times out, as the case may be). What is
required is just one ‘layer’ of nodes in the network that needs to fire so that the query
propagates to the next level. If we were to visualize this in the framework of our
electrical circuit, it would be as if the LEDs gradually dim and switch off after a
particular amount of time. As a result, this idea will be referred to as Signal Decay.
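The text does not spell out the decay mechanism, so the sketch below is only one possible reading: every node stays lit for a fixed number of iterations (lifetime) and is lit at most once per query. Under these assumptions, the Java example compares how many nodes are lit at the same time with and without decay. The names SignalDecaySketch, binaryTree and peakLitNodes are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of Signal Decay on a single phase.
final class SignalDecaySketch {
    // Builds a binary tree in array form: node i feeds nodes 2i+1 and 2i+2.
    static int[][] binaryTree(int nodeCount) {
        int[][] downstream = new int[nodeCount][];
        for (int i = 0; i < nodeCount; i++) {
            List<Integer> children = new ArrayList<>();
            if (2 * i + 1 < nodeCount) children.add(2 * i + 1);
            if (2 * i + 2 < nodeCount) children.add(2 * i + 2);
            downstream[i] = new int[children.size()];
            for (int c = 0; c < children.size(); c++) downstream[i][c] = children.get(c);
        }
        return downstream;
    }

    // Spreads a signal from node 0 and returns the largest number of nodes lit
    // in any one iteration. A lifetime of 0 or less disables decay.
    static int peakLitNodes(int[][] downstream, int lifetime) {
        boolean[] hasFired = new boolean[downstream.length];
        Deque<List<Integer>> litLayers = new ArrayDeque<>();  // oldest layer first
        List<Integer> frontier = new ArrayList<>();
        frontier.add(0);
        hasFired[0] = true;
        int lit = 0;
        int peak = 0;
        while (!frontier.isEmpty()) {
            litLayers.addLast(frontier);
            lit += frontier.size();
            if (lifetime > 0 && litLayers.size() > lifetime) {
                lit -= litLayers.removeFirst().size();  // the oldest layer has decayed
            }
            peak = Math.max(peak, lit);
            List<Integer> next = new ArrayList<>();
            for (int node : frontier) {
                for (int d : downstream[node]) {
                    if (!hasFired[d]) {  // assumption: a node is lit at most once per query
                        hasFired[d] = true;
                        next.add(d);
                    }
                }
            }
            frontier = next;
        }
        return peak;
    }
}
```

On a 31-node binary tree, all 31 nodes end up lit without decay, while a lifetime of 1 keeps at most the 16 nodes of the deepest layer lit, so the set of nodes to process stops growing with everything the query has already passed.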
Auxiliary Modules
In order to facilitate module testing and experimentation, the following auxiliary
modules were created.
Console
This was a small command-line-like environment through which the user could interact
with the program as it runs. Using it, the user can create a type hierarchy on the fly by
entering commands. It offers the user the ability to load and save type hierarchies and
also predicates. Shown below is a screenshot of the Console’s help menu.
Figure 6 : A screenshot of the Console's help menu
Tree Creator
This is used to create a random tree, given certain parameters like number of primary
concepts, maximum depth of each concept, average branching factor, average number of
instances per type and multi-parent probability rate. It was used to generate random
trees to test the algorithms on.
Tree Network Converter
This is the module used for creating a Network from a given Tree data structure. It was
used to generate the required Network from the Tree generated by the Tree Creator
class.
Experimental Setup
All the tests were run on a randomly generated tree. The tree was generated by the
Tree Creator with the following specifications:
Number of primary concepts = 2
Maximum allowable depth of primary concept = 5
Maximum allowable depth of type hierarchy = 7
Average branching factor = 2
Average number of instances per type = 2
Multi-parent probability rate = 10%
All tests were done on an Intel Core i7 machine, with 8 cores of 2.4 GHz each.
Time was measured with calls to System.nanoTime(). Any inaccuracies introduced by
this method have been assumed to be tolerable.
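A measurement of this kind can be sketched as follows. This is an illustration only: the helper TimingSketch.averageMillis and its warm-up runs are assumptions added here and are not described in the report; the only element taken from the text is the use of System.nanoTime().

```java
// Hypothetical sketch of timing a query with System.nanoTime().
final class TimingSketch {
    // Runs the query a few times untimed, then returns the mean time of the
    // timed runs in milliseconds.
    static double averageMillis(Runnable query, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            query.run();  // warm-up runs are an added assumption, not from the report
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            query.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / timedRuns;
    }
}
```

Averaging over several runs reduces the influence of timer granularity and of one-off delays on any single reading.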
Experimental Results
Data
Trial 1: KB Parallelism + Selective Activation + Signal Decay
Number of phases = 1
Trial 2: KB Parallelism + Selective Activation
Number of phases = 8
Trial 3: KB Parallelism + Selective Activation + Signal Decay
Number of phases = 1
Graphical Analysis and Reasoning
The initial tests run were to gauge the advantage of running on multiple phases over just
a single phase. The Network partitioning parallelism scheme had already been
implemented successfully, but without yielding significant improvement in speed. Thus,
phasing was applied on top of the existing parallel code and tests were run to see its effect.
The readings are as shown below.
Analysis:
As expected, having 8 phases is much faster than having a single phase. The
difference is negligible at lower querying depths, but becomes very significant as the
querying depth increases. The negligible difference at lower levels may be explained by
the fact that at lower levels, the queries seldom move far from their source. As a result,
there is less competition and hence the effect of parallelism is limited.
Analysis:
As can be seen, the implementation with KB Parallelism and no Signal Decay, running
on a single phase, is the worst. The remaining two implementations almost match
each other, and their two lines almost overlap. To get a better idea, we need to zoom in
on the two lines, which is shown in the chart below.
Analysis:
In this zoomed-in version of the previous chart, we can see that 8-fold phasing achieves
almost the same effect as Signal Decay. In fact, it is marginally better than Signal
Decay.
Conclusions
We can see that by harnessing the advances made in hardware and by introducing
concurrency where possible, it has been possible to achieve a tremendous speedup
(close to 600% for a query of depth 6). We can see that a basic Network Partitioning
scheme on a single phase is not very efficient because of the exponential rate at which the
active node list grows. However, applying Signal Decay on top of this gives us
excellent results, comparable to those of an 8-phase network.
Future Work
There are still some design choices that need to be evaluated and tested before we can
freeze on a final design. Some of them are:
Tree-based Network:
We made a design decision that it was better to go for implicit nodes so that we
could exploit the inherently faster table-lookup operation in the system. But an
explicit node, with a tree-based approach to the network, offers us the advantage
of tree parallelism, i.e. the query could propagate in multiple branches
concurrently. This is worth a trial just to see how it compares with our network.
8-fold phasing with Signal Decay:
We tested 8-fold phasing without Signal Decay and monophasing with Signal
Decay and found them to be comparable. So, the next step would be to try out 8-
fold phasing with Signal Decay in an attempt to wring out more parallelism.
Per-phase processing:
If the phase representation mechanism could be made simpler, then it could be
possible to process each phase separately. This would be highly advantageous in a
GPU system as it involves reading an element in an array, processing it and
putting it back into the same array (or a buffer that can be copied back once the
processing is done).
Acknowledgements
I thank Samsung Research Institute, Bengaluru, India and all its employees for giving
me this opportunity to work on this interesting project. They have been a wonderful host
for the duration of the internship. I also thank my mentor, Dr. Lokendra Shastri,
Director, Advanced Technologies Lab, SRI-B for guiding me through this project with
patience.
References:
Shastri, L. (2000). “Types and Quantifiers in SHRUTI – A Connectionist Model of Rapid
Reasoning and Relational Processing”. In Hybrid Neural Systems, Lecture Notes in
Computer Science (pp. 28-45). Springer.