superfast type inference engine

Ganesh P Kumar 7/8/2014 Superfast Type Inference Engine Done as part of the Smart Assistant Project, in Samsung Research Institute, Bengaluru, India under the guidance of Dr. Lokendra Shastri


About The Smart Assistant Project

The age of smartphones is well and truly upon us. A majority of the population owns a smartphone, and the market for the product has grown substantially over the last few years. One of the most salient, and perhaps differentiating, features of a smartphone is its smart assistant. Be it Apple’s Siri or Microsoft’s Cortana, smart assistants have made their mark on today’s smartphone industry. What makes a smart assistant ‘smart’ is its ability to interact with the user in a manner that the user finds comfortable, and the medium most users find most comfortable is spoken natural language. So any smart assistant will involve a significant NLP component that makes inferences to generate new knowledge from existing knowledge. It was as part of this ambitious project that I worked during my 2-month internship at SRI-B, from 14-05-2014 till


The Need for a ‘Fast’ Type Inference Engine

Before we understand the reason for a fast type inference engine, we need to understand

what type inference is. We need to be able to look at a statement in context and deduce

the types of the various entities involved. For example, if we see the statement “I have a

pet dog”, even without any other information, we may deduce that ‘dog’ is some living

thing that is tame enough to be kept as a pet.

Type inference is something that goes on every second in the human brain. It is a vital

component to understand the meaning of any given statement, and one would often

have to do multiple such inferences to parse such a statement properly. Type inference

is important not just for understanding statements but also for generating new

knowledge from a given statement. Many times, as humans, we generate knowledge by

listening to the statements around us. Sometimes, this knowledge is explicit in the

statement. For example, the statement “Tagore got the Nobel Prize” leads us to generate

the corresponding fact in our brain.



But the human brain, being as complex as it is, does not stop at this. The brain can also

generate knowledge that is not present directly in a statement. To someone who does

not know about Tagore, the same statement will lead to another deduction: if Tagore

got the Nobel Prize, Tagore must be a human (because only humans are awarded the

Nobel Prize), he must have done something noteworthy (Nobel Prize is an Award and

Awards are given for doing something noteworthy). Also, he must have got the Nobel

Prize in at least one of the 6 categories. So, he must have done something noteworthy in

one of these 6 categories. Note that the deduction does not stop here, but goes on, trying

to generate as much knowledge as possible from the single statement by placing it in the context of

already known facts. This is vital to interacting in an intelligent manner with any

human. If a smart assistant is to claim to be as intelligent as a close friend, it must be

able to master this ability. In particular, it must be able to perform type inference.

Examples of the latter are, “Tagore must be a human” and “Nobel Prize is an Award.”

It is in this context that the speed of type inference comes to the fore. Given a statement,

there can be hundreds, if not thousands, of type inferences that must be checked. If the

software is to maintain a semblance of ‘intelligence’ and ‘smartness’, then these type

inferences must not only be done correctly, but also quickly. A single query must not

take longer than, say, 100 milliseconds to run. This was the challenge that was tackled

partially during the course of the internship.

A Connectionist Model for representing Types

To make inference faster, we adopted a proven connectionist model for representing types and type hierarchies, described in Section 3 of the paper “Types and Quantifiers in SHRUTI – a connectionist model of rapid reasoning and relational processing” by my mentor, Dr. Lokendra Shastri [Shastri, 2000]. The scheme is network-based. It helps to think of it as an electrical circuit in which each network node is an LED that can be on or off, and each interconnection is a wire that runs from one node to another. The network structure is described below; Figure 1 illustrates it.

Each node in the type hierarchy is represented by a focal cluster of nodes in the network,

and the focal clusters are different for types and instances. A type focal cluster has 4

nodes (labelled ?v, ?e, +v and +e) while an instance focal cluster has just 2 (labelled +

and ?). It is the interconnection within and between the focal clusters that forms the

basis of the inference engine.



Figure 1: A small connectionist network [Reprinted from (Shastri, 2000) with permission]. Note the interconnections between and within the different types of focal clusters.


Before going in depth, we need to understand the semantics of the different types of

nodes. This is explained below:


+v: Universal assertion.

If this node is on at the same time as a corresponding predicate node (that is, in the same phase), the predicate is true for all instances of the corresponding type.

+e: Existential assertion.

If this node is on at the same time as a corresponding predicate node (in the same phase), the predicate is true for at least one instance of the corresponding type.

?v: Universal query.

If this node is on at the same time as a corresponding predicate node (in the same phase), a query is being made as to whether the attribute holds for all instances of the type. The query is considered true if the +v node turns on; otherwise it is false.

?e: Existential query.

If this node is on at the same time as a corresponding predicate node (in the same phase), a query is being made as to whether the attribute holds for at least one instance of the type. The query is considered true if the +e node turns on; otherwise it is false.

+: Instance assertion.

If this node is on at the same time as a corresponding predicate node (in the same phase), the predicate is true for the corresponding instance.

?: Instance query.

If this node is on at the same time as a corresponding predicate node (in the same phase), a query is being made as to whether the attribute holds for the corresponding instance. The query is considered true if the + node on the same focal cluster turns on; otherwise it is false.

Apart from this we have 2 nodes per predicate, which are similar to the + and ? nodes

discussed above.


The nodes in the network are interconnected to create the inference engine. These

interconnections are designed so that any kind of query can be answered by the

propagation of messages along them. They are explained as follows:



Within type focal clusters:

+v -> +e

If the +v node is on, the +e node of the same type must be switched on.

?e -> ?v

An existential query can be answered by asking a universal query on the same type. If the universal query is true, the existential query automatically becomes true.

+v -> ?v

If +v turns on, ?v must turn on to assimilate the fact in the system.

+e -> ?e

If +e turns on, ?e must turn on to assimilate the fact in the system.

Within instance focal clusters:

+ -> ?

An assertion on the instance must turn on the ? node to assimilate the fact in the system.


Between type focal clusters (a primed node belongs to the supertype):

+e -> +e’

An existential assertion on the subtype should turn on an existential assertion on the supertype.

+v’ -> +v

A universal assertion on the supertype should turn on a universal assertion on the subtype.

?v -> ?v’

A universal query on the subtype should turn on a universal query on the supertype.

?e’ -> ?e

An existential query on the supertype must turn on an existential query on the subtype.


Between type and instance focal clusters:

+ -> +e

An assertion on the instance means that there exists at least one object of the type that has the attribute. Hence the +e node of the type must turn on.

+v -> +

If something is true of all instances of a type, then the + node must turn on for every instance of that type.

? -> ?v

A question about an instance may be answered by asking a universal query of the corresponding type. Hence this connection.

?e -> ?

An existential query about the type can be answered by querying each instance of that type.

Between type focal clusters and database/ontology:

?v -> DB

DB -> +v

Between instance focal clusters and database/ontology:

? -> DB

DB -> +
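The type-to-type and type-to-instance rules above can be written down as a small link generator. This is only a sketch: the class ClusterWiring, its method names and the string labels such as "?v:Human" are invented here for illustration and are not taken from the project's code.

```java
import java.util.List;

/** Emits the links prescribed by the wiring rules (illustrative helper, hypothetical names). */
final class ClusterWiring {
    static String link(String from, String to) {
        return from + " -> " + to;
    }

    /** Links between the focal clusters of a subtype and its supertype. */
    static List<String> isA(String sub, String sup) {
        return List.of(
            link("+e:" + sub, "+e:" + sup),   // existential assertions travel up
            link("+v:" + sup, "+v:" + sub),   // universal assertions travel down
            link("?v:" + sub, "?v:" + sup),   // universal queries travel up
            link("?e:" + sup, "?e:" + sub));  // existential queries travel down
    }

    /** Links between an instance focal cluster and the focal cluster of its type. */
    static List<String> instanceOf(String inst, String type) {
        return List.of(
            link("+:" + inst, "+e:" + type),
            link("+v:" + type, "+:" + inst),
            link("?:" + inst, "?v:" + type),
            link("?e:" + type, "?:" + inst));
    }
}
```

For example, ClusterWiring.isA("Human", "Agent") yields the four links between the Human and Agent clusters of Figure 1.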

Querying the Network:

Shown above in Figure 1 is a small network for a simple type hierarchy: Agent -> Human -> John. To understand how the system works, let us assume that we know the fact “All agents are alive” and ask the network “Is John alive?”

Step 1: Clamp ?:John

Step 2: ?:John -> ?v:Human

Step 3: ?v:Human -> ?v:Agent

Step 4: ?v:Agent -> DB

Step 5: DB -> +v:Agent

Step 6: +v:Agent -> +v:Human

Step 7: +v:Human -> +:John

This is what we wanted. +:John has turned on. The response to the query is “TRUE”.
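The seven steps can be replayed with a toy simulation. The sketch below is single-phase and uses string labels for nodes; the class TinyTypeNetwork is hypothetical, and the database lookup of Steps 4 and 5 is collapsed into one direct link that exists only when the fact is known.

```java
import java.util.*;

/** Toy single-phase network for the hierarchy Agent -> Human -> John (illustrative only). */
final class TinyTypeNetwork {
    private final Map<String, List<String>> edges = new HashMap<>();
    private final Set<String> on = new HashSet<>();

    void link(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    /** Clamps the query node, then spreads activation one hop per step. */
    boolean query(String clamp, String target) {
        on.clear();
        on.add(clamp);
        List<String> frontier = List.of(clamp);
        while (!frontier.isEmpty()) {
            List<String> next = new ArrayList<>();
            for (String n : frontier)
                for (String m : edges.getOrDefault(n, List.of()))
                    if (on.add(m)) next.add(m);   // newly switched-on nodes form the next frontier
            if (on.contains(target)) return true;
            frontier = next;
        }
        return false;                              // activation died out before reaching the target
    }

    static TinyTypeNetwork johnExample(boolean agentsAreAlive) {
        TinyTypeNetwork net = new TinyTypeNetwork();
        net.link("?:John", "?v:Human");            // Step 2
        net.link("?v:Human", "?v:Agent");          // Step 3
        if (agentsAreAlive)
            net.link("?v:Agent", "+v:Agent");      // Steps 4 and 5, with the DB collapsed away
        net.link("+v:Agent", "+v:Human");          // Step 6
        net.link("+v:Human", "+:John");            // Step 7
        return net;
    }
}
```

Calling johnExample(true).query("?:John", "+:John") follows the same path as Steps 1 to 7; without the fact, activation stops at ?v:Agent and the query is not answered.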




This design has to be implemented in a way that is not only correct but also fast. The goal of this internship was to implement this module and achieve as much speedup as possible. Since the software would be running in the cloud, memory was not a constraint.

We had 2 ways to approach the problem:

i. To implement it in an Object-oriented fashion, where each of these nodes is an

object with certain attributes.

ii. To implement the nodes as secondary (implied) objects, so that the attributes are

expressed directly in the form of arrays.

Both methods are illustrated in Figure 2.

Figure 2 : The Design Dilemma. Should we go for an explicit representation of nodes or a more implicit one? The first offers tree parallelism, while the second promises faster access.

The latter was preferred to exploit the naturally fast table lookup mechanism in the

computer as compared to object field lookup. While not very intuitive to work with, it

was a sacrifice we were willing to make if the desired speed could be achieved.

Moreover, working with tables opens the door for a KB-level parallelism which can be
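A minimal sketch of the table-based (implicit node) layout, assuming each node is identified by an integer index and each type focal cluster occupies four consecutive indices. The array names follow the ones used later in this report (output, bufferedOutput, clamps); the class name, the slot constants and the index formula are assumptions made for illustration.

```java
/** Structure-of-arrays layout: node i is just an index into parallel tables (illustrative). */
final class NodeTables {
    // Assumed slot order inside a type focal cluster: ?v, ?e, +v, +e
    static final int Q_V = 0, Q_E = 1, P_V = 2, P_E = 3;

    final int[] output;          // value each node is currently emitting
    final int[] bufferedOutput;  // value computed this iteration, not yet visible
    final int[] clamps;          // externally forced value, 0 when unclamped
    final int[][] upstream;      // upstream[i] lists the nodes feeding node i

    NodeTables(int nodeCount) {
        output = new int[nodeCount];
        bufferedOutput = new int[nodeCount];
        clamps = new int[nodeCount];
        upstream = new int[nodeCount][0];
    }

    /** Index of one node inside type focal cluster number `cluster`. */
    static int typeNode(int cluster, int slot) {
        return cluster * 4 + slot;
    }
}
```

With this layout, reading or writing a node's state is a single array access rather than a field lookup on a node object.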


Synchronization Constraints

For correctness purposes, there are a few constraints that we must respect while

attempting to bring parallelism in the code. Broadly, let us divide a single iteration into

3 stages, which are explained below:

Production stage

Production involves looking at the outputs of the upstream nodes and calculating the

current node’s output. This newly calculated value is stored in the node’s

bufferedOutput slot.



Firing stage

This involves moving the values from the bufferedOutput to the output slot. Also, if the

node that is firing is the target node for that phase, then report success and mark the

phase for resetting. Note that the target node for a phase p is defined as the node that is

expected to turn on so that the query associated with phase p can be answered as true.


Clamping/Resetting stage

If the ClampManager has pending clamps, they are resolved in this stage, i.e. the ClampManager actually clamps the required nodes to the corresponding clamp values in the network. Similarly, any phase that needs to be reset is reset during this stage.

While trying to achieve concurrency, we must ensure that the following hold true:

i. Before the production stage of a node, the firing stage of all its upstream nodes must be complete.

ii. Before the firing stage of a node, its production stage must be over.

iii. Clamping must take place only after resetting. Otherwise, we cannot differentiate between the recently clamped nodes and the nodes that were already on, and we may end up resetting nodes that were clamped in the same iteration. In other words, a new query initiated by clamping a node could be killed prematurely.

iv. While clamping and resetting, no other operation may take place on the network.

As a result, the software was designed to have a single MasterController that repeats the following cycle indefinitely.

One iteration of the MasterController is thus divided into 3 stages. These stages cannot overlap with one another across the whole set of nodes (which we refer to as the Network), i.e. no node can be in the Firing stage or the Clamping stage while some other node is in the Production stage. This places a tight bound on the amount of concurrency we can wring out of the system.
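A minimal single-threaded sketch of one such iteration, assuming plain Boolean signals and a single query, so that a reset clears the whole network. Apart from output, bufferedOutput and clamps, the names (CycleSketch, pendingClamps, iterate) are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Single-threaded sketch of the Produce / Fire / Reset-Clamp cycle (illustrative). */
final class CycleSketch {
    final boolean[] output, bufferedOutput, clamps;
    final int[][] upstream;
    final Deque<Integer> pendingClamps = new ArrayDeque<>();

    CycleSketch(int[][] upstream) {
        int n = upstream.length;
        this.upstream = upstream;
        output = new boolean[n];
        bufferedOutput = new boolean[n];
        clamps = new boolean[n];
    }

    /** Runs one full iteration; returns true if `target` fired in it. */
    boolean iterate(int target) {
        // Stage 1, production: read only `output`, write only `bufferedOutput`.
        for (int n = 0; n < output.length; n++) {
            boolean v = clamps[n];
            for (int m : upstream[n]) v |= output[m];
            bufferedOutput[n] = v;
        }
        // Stage 2, firing: publish the buffered values.
        System.arraycopy(bufferedOutput, 0, output, 0, output.length);
        boolean answered = output[target];
        // Stage 3: reset first, then clamp, so that fresh clamps are never wiped.
        if (answered) {
            Arrays.fill(output, false);
            Arrays.fill(bufferedOutput, false);
            Arrays.fill(clamps, false);
        }
        while (!pendingClamps.isEmpty()) clamps[pendingClamps.poll()] = true;
        return answered;
    }
}
```

Because production never writes to output, the order in which nodes are visited inside Stage 1 does not matter, which is what makes that stage a candidate for parallel execution.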



Figure 3: The 3-stage cycle of the MasterController : Produce, Fire, Reset/Clamp

Points of Potential Concurrency

Opportunities for concurrency are limited. Since one stage cannot overlap another, concurrency has to come from within a stage: the Production stage and the Firing stage must each be parallelized separately. We tried two approaches. The first partitions the nodes into sets and lets threads handle production and firing separately for each set (referred to as Network Partitioning, or KB Parallelism). The second runs multiple queries simultaneously on the same network (referred to as Multi-Phase Parallelism).

Network Partitioning

To implement the first kind of parallelism, we structure the code as follows:

i. A MasterController spawns several SubControllers, depending on the size of the network, taking care that no more than 150 nodes end up on the same SubController.

ii. Each SubController handles the production and firing for its subset.

iii. Once all the SubControllers are done, the MasterController resets the network (if required) and checks with the ClampManager for pending clamp/declamp requests. If there are any, the ClampManager resolves them and returns control to the MasterController, which moves on to the next iteration.

A diagrammatic representation of the algorithm is shown in Figure 4.


Figure 4 : System Design. A single MasterController spawns multiple SubControllers that take care of production and firing concurrently, before synchronizing to begin the resetting/clamping stage.
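One way to sketch this scheme in Java is with an ExecutorService, using invokeAll as the barrier between stages. This is an assumption about how such a design could look, not the project's actual SubController code: the class PartitionedCycle and its methods are hypothetical, and Boolean signals are used for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;

/** Sketch of KB parallelism: node ranges are produced and fired by worker tasks. */
final class PartitionedCycle {
    static final int MAX_NODES_PER_SUBCONTROLLER = 150; // limit quoted in the text

    static int subControllerCount(int nodeCount) {
        return (nodeCount + MAX_NODES_PER_SUBCONTROLLER - 1) / MAX_NODES_PER_SUBCONTROLLER;
    }

    /** One produce-then-fire pass; each invokeAll returns only when every range is done. */
    static void produceAndFire(ExecutorService pool, boolean[] output, boolean[] buffered,
                               boolean[] clamps, int[][] upstream) {
        int n = output.length, parts = subControllerCount(n);
        List<Callable<Void>> produce = new ArrayList<>(), fire = new ArrayList<>();
        for (int p = 0; p < parts; p++) {
            int lo = p * MAX_NODES_PER_SUBCONTROLLER;
            int hi = Math.min(n, lo + MAX_NODES_PER_SUBCONTROLLER);
            produce.add(() -> {
                for (int i = lo; i < hi; i++) {
                    boolean v = clamps[i];
                    for (int m : upstream[i]) v |= output[m];
                    buffered[i] = v;
                }
                return null;
            });
            fire.add(() -> {
                System.arraycopy(buffered, lo, output, lo, hi - lo);
                return null;
            });
        }
        try {
            pool.invokeAll(produce); // barrier: all production ends before any firing
            pool.invokeAll(fire);    // barrier: all firing ends before reset/clamp
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The reset/clamp stage would then run on the calling thread after produceAndFire returns, which keeps it serial as the constraints require.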

Multi-phase Parallelism

To understand what I mean by multi-phase parallelism, let us return to the circuit-like view of our network. Instead of one LED in the place of each node, assume a set of 8 LEDs, each of a different color, with interconnections made only between LEDs of the same color. In essence, we have made 8 copies of the same network. I call each color a ‘phase’, so I can run 8 queries simultaneously on the same network, using the same controller. The phases are independent of one another, which is good for concurrency.

But running multiple phases simultaneously has its own share of complications. Some of them are given below:

i. Earlier, it was enough to propagate Booleans through the network: simple on/off signals that could be used to trace a query through the network. Now our signals become slightly more complicated: apart from the simple on/off signal, we need to tell the node which phase it has to fire in.

ii. Network resetting needs to be done per phase. We cannot afford a coarse reset that simply reverts the network to its original state, which was possible in the single-phase scenario.

iii. Each node must be able to handle multiple phases simultaneously. In other words, the phase of firing is not necessarily unique to the node. In the extreme scenario, a node could be firing in all 8 phases.

The final version of the module combines both the kinds of parallelism. The code

structure and flow are discussed in the coming sections.

Code structure

The code is organized as follows.

Important Classes

Network

This contains the basic network involved in the type inference engine. It is through this Network object that the queries progress.

MasterController

Each Network contains a MasterController object. This MasterController is responsible for processing the queries. In some sense, while the Network object is like a basic circuit, the MasterController is like the battery that drives it.

SubController

The central MasterController spawns SubController threads to do the actual processing. It is through these SubController threads that we achieve KB-level parallelism.

ClampManager

Since we can clamp only at specified points of the execution, we need a special ClampManager object to take care of it. The ClampManager collects requests for clamping/declamping from multiple threads and resolves the pending clamps once it is possible to do so.

PhaseManager

Each query, before it is actually handled, needs a phase in which to fire. The PhaseManager object maintains an ArrayList of available phases. Each new query requests its phase from the PhaseManager before firing.
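The phase hand-out can be sketched as a small pool of free phase numbers. Note the swap: the report's PhaseManager keeps an ArrayList, whereas this sketch uses an ArrayBlockingQueue so that waiting for a free phase comes for free. The class name PhasePool and its methods are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Pool of free phases; a query blocks in acquire() until one is available (illustrative). */
final class PhasePool {
    private final BlockingQueue<Integer> free;

    PhasePool(int phaseCount) {
        free = new ArrayBlockingQueue<>(phaseCount);
        for (int p = 0; p < phaseCount; p++) free.add(p);
    }

    /** Blocks until a phase is free, then hands it to the caller. */
    int acquire() {
        try {
            return free.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Returns a phase to the pool once its query has been answered and reset. */
    void release(int phase) {
        free.add(phase);
    }

    int available() {
        return free.size();
    }
}
```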

Life of a Query

Any query that comes in initially is handled by the MasterController object. The

MasterController decides which node to clamp, and requests the PhaseManager for a

phase to fire in. The PhaseManager returns with an available phase, or else waits till a

phase becomes available. The MasterController registers the query and the phase in a

HashMap, along with the desired target node for that query and requests the

ClampManager to clamp the node in that phase.

Once the node has been clamped, the MasterController thread takes care of propagating the query through the network. If the target node for a given phase fires in that phase, the MasterController records the answer to the query as true and marks the phase for resetting.

Once the firing phase is done, the MasterController checks if there is any phase that

must be reset and calls the Network to reset itself in that phase.

Handling Multiple Phases

Data Structure for Message passing:

To handle multiple phases, we follow an encoding scheme so that we can encode

information about the source, amplitude and phase of firing into a single 32-bit integer.

Figure 5 : A Message encoded into a 32-bit integer

Src is a single bit that tells whether the message is from an Entity focal cluster or from a

Predicate focal cluster.

The phase bitvector is, as the name suggests, a bitvector that encodes the phases active in this Message. So if the phase bitvector is 01110101, the Message is active in phases 0, 2, 4, 5 and 6 (counting from the rightmost bit). With this encoding scheme we have effectively tackled the problem of passing more information and of multi-phase firing of nodes.



For programmer ease of access and usage, this Message has been placed in a separate

class with functions that can take care of encoding, decoding and extracting the phases

from a given Message.
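A sketch of such a helper class follows. The exact bit layout of Figure 5 is not reproduced in this text, so the layout here is an assumption: bit 31 holds the source flag, bits 8 to 30 the amplitude, and bits 0 to 7 the phase bitvector. The class and method names are likewise hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Packs source flag, amplitude and an 8-bit phase bitvector into one int (assumed layout). */
final class MessageCodec {
    static final int PHASE_BITS = 8;
    static final int PHASE_MASK = (1 << PHASE_BITS) - 1; // bits 0..7
    static final int AMP_BITS = 23;
    static final int AMP_MASK = (1 << AMP_BITS) - 1;     // occupies bits 8..30 once shifted
    static final int SRC_SHIFT = 31;                     // bit 31

    static int encode(boolean fromPredicate, int amplitude, int phaseBits) {
        return ((fromPredicate ? 1 : 0) << SRC_SHIFT)
                | ((amplitude & AMP_MASK) << PHASE_BITS)
                | (phaseBits & PHASE_MASK);
    }

    static boolean fromPredicate(int msg) { return (msg >>> SRC_SHIFT) == 1; }

    static int amplitude(int msg) { return (msg >>> PHASE_BITS) & AMP_MASK; }

    static int phaseBits(int msg) { return msg & PHASE_MASK; }

    /** Lists the active phases, with the rightmost bit standing for phase 0. */
    static List<Integer> phases(int msg) {
        List<Integer> active = new ArrayList<>();
        for (int p = 0; p < PHASE_BITS; p++)
            if ((msg & (1 << p)) != 0) active.add(p);
        return active;
    }
}
```

Merging two Messages that arrive at the same node then reduces to a bitwise OR of their phase bitvectors, which is what lets one node fire in several phases at once.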

Phase-specific Network reset:

This is fairly simple to achieve once we have the underlying data structure ready.

Network reset on a phase i scans through the clamps array, output array and the

buffered output array and resets the phase on all existing Messages. A Message that was

firing only on phase i is removed.
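A minimal sketch of this reset, assuming each Message is an int whose low 8 bits hold the phase bitvector and that 0 stands for "no Message". The class PhaseReset is hypothetical; the tables passed in would be the clamps, output and buffered output arrays.

```java
/** Clears one phase from every Message in the given tables (illustrative). */
final class PhaseReset {
    static final int PHASE_MASK = 0xFF; // assumed: low 8 bits are the phase bitvector

    static void resetPhase(int phase, int[]... tables) {
        int keep = ~(1 << phase);
        for (int[] table : tables) {
            for (int i = 0; i < table.length; i++) {
                int cleared = table[i] & keep;
                // A Message left with no active phase is removed altogether.
                table[i] = (cleared & PHASE_MASK) == 0 ? 0 : cleared;
            }
        }
    }
}
```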

This could have been made parallel, but the decision to make it serial had its basis in the

following observations:

i. One idea was to run the resets for multiple phases concurrently, with one thread per phase. On further analysis, however, all these threads would be accessing the same Network object. The resulting contention would cancel out the benefit of making the code concurrent.

ii. Only a small number (< 10) of phases are firing at any instant, and on average only a small fraction of them need resetting. Even with perfect speedup, the absolute time saved would be small, so there is little incentive to make this code parallel.

Base Algorithm


Algorithm 1.1: Basic Production

procedure doProduction
    for each node n in SubController list
        if n is clamped
            bufferedOutput[n] = clamps[n]
        else
            inputs = new Array[Message]
            for each node m upstream to node n
                inputs.add(output[m])
            endfor
            bufferedOutput[n] = outputFromInputs(inputs)
        endif
    endfor
end procedure



In words, we are essentially looking at a node’s clamp value. If it is set, transfer it to the buffer immediately. Otherwise, take in the inputs from the upstream nodes, process them to get the output and then put this output generated into the buffer.


Algorithm 1.2: Basic Firing

procedure doFiring
    for each node n in SubController list
        if clamps[n] is not set
            output[n] = bufferedOutput[n]
        else
            output[n] = clamps[n]
        endif
        if bufferedOutput[n] is not empty
            for each phase p in bufferedOutput[n]
                if n is targetNode for p
                    reportTrue(p)
                    markForReset(p)
                endif
            endfor
        endif
    endfor
end procedure

Here, we just transfer the bufferedOutput to the output. After that we test if the current

node is the target node for any of the phases it is firing on. If it is, we report true on the

phase and mark it for resetting.


Experiments showed that this algorithm was too slow. To answer a query of depth 1, it took 30.35 ms on average, and the time taken grew exponentially with depth. This is clearly not the behavior expected from a scalable, high-speed inference engine. We therefore needed to optimize the code, either by making it more parallel or by removing unnecessary and redundant operations.

After a brief analysis of the code, we singled out one main problem, hoping that fixing it would deliver the speedup we were aiming for.



Targets for Optimization

Note that both algorithms blindly process every node in the SubController’s range. This is an enormous waste of resources and time. Admittedly, the work per node in Algorithm 1.1 is only a few logical checks, a table lookup and a write. But to put this in perspective, consider a single query running in a network of, say, 10000 nodes. If this is a depth-1 query, only a handful of nodes really need to be processed, yet the algorithm processes all 10000 nodes before reporting “True”. This was the major bottleneck we faced, and we needed to remove it.

Optimization Ideas

To remove the drawback stated above, we came up with two ideas that were

implemented. These are as discussed below:

Selective Activation

Every SubController maintains a list of ‘active’ nodes, the nodes that must be processed this turn. Production and Firing take place only for these nodes. The list is dynamic: new nodes (the downstream nodes of the frontier nodes) must be added every iteration, and nodes may be removed during an iteration (on a network reset). Also, when clamping a node, we must add it to the active node list of the corresponding SubController. Since this method restricts activation to those nodes that really need it, it will be referred to as Selective Activation.

Taking it one step further, we can maintain an active node list at the MasterController that is essentially the union of the active node lists of all the SubControllers. This needs more maintenance, but with it in place we can start only those SubControllers that are really required, which reduces competition for system resources.
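The idea can be sketched with a frontier set, under two simplifying assumptions: signals are Boolean, and a node placed on the active list always turns on (because something upstream of it has just fired). The class ActiveFrontier and its members are invented for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Processes only nodes on the active list instead of scanning the whole network. */
final class ActiveFrontier {
    final int[][] downstream;   // downstream[i] lists the nodes fed by node i
    final boolean[] output;
    private Set<Integer> active = new HashSet<>();

    ActiveFrontier(int[][] downstream) {
        this.downstream = downstream;
        this.output = new boolean[downstream.length];
    }

    /** Clamping a node puts it on the active list for the next step. */
    void clamp(int node) {
        active.add(node);
    }

    /** Runs one iteration over the active nodes only; returns how many were visited. */
    int step() {
        Set<Integer> next = new HashSet<>();
        int visited = 0;
        for (int n : active) {
            visited++;
            if (!output[n]) {
                output[n] = true;
                for (int d : downstream[n]) next.add(d); // frontier moves one hop
            }
        }
        active = next;
        return visited;
    }
}
```

In a 10000-node chain with one clamped node, each step visits a single node rather than all 10000, which is the saving the paragraph above describes.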

Signal Decay

Another idea was inspired by the concept of paging in operating systems. The philosophy behind paging is that a program can run to completion without its entire code being in RAM; what is required is that the next instruction to be executed is in RAM. Similarly, for a query to go to completion in our network, only a few nodes need to be on at any time. It is not necessary for every node involved in the query to keep firing until the query is satisfied (or times out, as the case may be). What is required is just one ‘layer’ of nodes firing, so that the query propagates to the next level. In the framework of our electrical circuit, it is as if the LEDs gradually dim and switch off after a particular amount of time. This idea will therefore be referred to as Signal Decay.
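The report does not spell out the decay mechanism, so the following is only one possible reading: each firing node is given a fixed lifetime, counted in iterations, after which it switches off by itself. The class DecayingSignals, the lifetime parameter and the per-node counter are all assumptions.

```java
/** One possible decay scheme: a node stays on for `lifetime` iterations after firing. */
final class DecayingSignals {
    private final int[] ttl;     // iterations left before node i switches off; 0 means off
    private final int lifetime;

    DecayingSignals(int nodeCount, int lifetime) {
        this.ttl = new int[nodeCount];
        this.lifetime = lifetime;
    }

    void fire(int node) {
        ttl[node] = lifetime;
    }

    boolean isOn(int node) {
        return ttl[node] > 0;
    }

    /** Ages every signal by one iteration; returns how many nodes are still on. */
    int tick() {
        int on = 0;
        for (int i = 0; i < ttl.length; i++) {
            if (ttl[i] > 0 && --ttl[i] > 0) on++;
        }
        return on;
    }
}
```

With a short lifetime, the set of nodes that are on at any moment stays close to a single layer of the network, so the active list no longer grows with every hop.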

Auxiliary Modules

In order to facilitate module testing and experimentation, the following auxiliary

modules were created.


Console

This is a small command-line-like environment through which the user can interact with the program as it runs. Using it, the user can create a type hierarchy on the fly by entering commands. It also lets the user load and save type hierarchies and predicates. Shown below is a screenshot of the Console’s help menu.

Figure 6 : A screenshot of the Console's help menu

Tree Creator

This is used to create a random tree, given certain parameters like number of primary

concepts, maximum depth of each concept, average branching factor, average number of

instances per type and multi-parent probability rate. It was used to generate random

trees to test the algorithms on.

Tree Network Converter

This is the module used for creating a Network from a given Tree data structure. It was used to generate the required Network from the Tree generated by the Tree Creator.


Experimental Setup

All the tests were run on a randomly generated tree. The tree was generated by the

TreeNetworkConvertor object with the following specifications:

Number of primary concepts = 2

Maximum allowable depth of primary concept = 5

Maximum allowable depth of type hierarchy = 7

Average branching factor = 2

Average number of instances per type = 2

Multi-parent probability rate = 10%

All tests were done on an Intel Core i7 machine, with 8 cores of 2.4 GHz each.

Time was measured with calls to System.nanoTime(). Any inaccuracies this introduces have been assumed to be tolerable.
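A measurement of this kind can be sketched as below. The warm-up runs and the averaging over several runs are assumptions added here to reduce JIT and scheduling noise; the report only states that System.nanoTime() was used. The class QueryTimer is hypothetical.

```java
/** Times a task with System.nanoTime(), averaging over several runs (illustrative). */
final class QueryTimer {
    /** Runs `task` untimed `warmup` times, then `runs` timed times; returns mean ms per run. */
    static double meanMillis(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) task.run();
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) task.run();
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1e6 / runs;
    }
}
```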

Experimental Results


Trial 1: KB Parallelism + Selective Activation + Signal Decay

Number of phases = 1



Trial 2: KB Parallelism + Selective Activation

Number of phases = 8

Trial 3: KB Parallelism + Selective Activation + Signal Decay

Number of phases = 1

Graphical Analysis and Reasoning

The initial tests run were to gauge the advantage of running on multiple phases over just

a single phase. The Network partitioning parallelism scheme had already been

implemented successfully, but without yielding significant improvement in speed. Thus,

phasing was applied on top of the existing parallel code and tests taken to see its effect.

The readings are as shown below.




As expected, having 8 phases is much faster than having a single phase. The

difference is negligible at lower querying depths, but becomes very significant as the

querying depth increases. The negligible difference at lower levels may be explained by

the fact that at lower levels, the queries seldom move far from their source. As a result,

there is less competition and hence the effect of parallelism is limited.




As can be seen, the implementation with KB Parallelism and no Signal Decay, running

on a single phase, is the worst. The remaining two implementations almost match

each other, and their lines nearly overlap. To get a better idea, we need to zoom in

on the two lines, as shown in the chart below.


In this zoomed-in version of the previous chart, we can see that 8-fold phasing achieves almost the same effect as Signal Decay. In fact, it is marginally better than Signal Decay.



By harnessing advances in hardware and introducing concurrency where possible, it has been possible to achieve a tremendous speedup (close to 600% for a query of depth 6). A basic Network Partitioning scheme on a single phase is not very efficient because of the exponential rate at which the active node list grows. However, applying Signal Decay on top of it gives excellent results, competitive with an 8-phase network.



Future Work

There are still some design choices that need to be evaluated and tested before we can

freeze on a final design. Some of them are:

Tree-based Network:

We made a design decision that it was better to go for implicit nodes so that we

could exploit the inherently faster table-lookup operation in the system. But an

explicit node, with a tree-based approach to the network, offers us the advantage

of tree parallelism, i.e. the query could propagate in multiple branches

concurrently. This is worth a trial just to see how it compares with our network.

8-fold phasing with Signal Decay:

We tested 8-fold phasing without Signal Decay and monophasing with Signal

Decay and found them to be comparable. So, the next step would be to try out 8-

fold phasing with Signal Decay in an attempt to wring out more parallelism.

Per-phase processing:

If the phase representation mechanism could be made simpler, then it could be

possible to process each phase separately. This would be highly advantageous in a

GPU system as it involves reading an element in an array, processing it and

putting it back into the same array (or a buffer that can be copied back once the

processing is done).


Acknowledgements

I thank Samsung Research Institute, Bengaluru, India and all its employees for giving

me this opportunity to work on this interesting project. They have been a wonderful host

for the duration of the internship. I also thank my mentor, Dr. Lokendra Shastri,

Director, Advanced Technologies Lab, SRI-B for guiding me through this project with



References

Shastri, L. (2000). “Types and Quantifiers in SHRUTI – A Connectionist Model of Rapid Reasoning and Relational Processing.” In Hybrid Neural Systems, Lecture Notes in Computer Science (pp. 28-45). Springer.