Visual analytics for discovering entity relationship on text data
Hanbo Dai, Ee-Peng Lim, Hady Wirawan Lauw, HweeHwa Pang
Analysis scenario
• A homeland security analyst
  – Finds relationships between two terrorists in large, complex information sources
  – Needs user judgment
[Figure: example entity network. Labels: Jemaah Islamiah, Al-Qaeda, Mas Selamat, Osama Bin Laden, Justinus Andjarwirawan, Abu Latif. Annotations: "Born in Central Java", "Was not directly connected"]
Visual analytics system architecture
Two TUBE (Text-Cube) instances for entity relationship discovery
[Figure: T1 and T2 drawn over the entities e0, e1, e2, e3, e4]
T1 = <S1, B1, M1, D> (nodes)
  – Document evidence, e.g. {d1, d2, …}
  – Mask value (0/1)
  – Measures, e.g. path_strength
T2 = <S2, B2, M2, D> (edges)
  – Document evidence, e.g. {d3, d4, …}
  – Mask value (0/1)
  – Measures, e.g. strength
ER-Explorer interface
Visual analytical operations
• Insert
• Cluster
• Delete
Our tool helps to discover new relationships
Conclusion
• Interactive visual method to discover entities and relationships embedded in text data
• ER-Explorer equipped with TUBE model and operations
• Our tool assisted analysts in finding relationships between two terrorists
Backup slides
Case study
• Dataset: the hijacking of IC814
• Entities of type Person, Organization, Event, and GPE are extracted
• Co-occurrence relationships are identified at the sentence level
• Each sentence is treated as a document
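The sentence-level co-occurrence step can be sketched as follows. This is a minimal illustration, assuming entity extraction has already been done and each sentence arrives as a collection of entity names. The function name `cooccurrence_evidence` and the toy sentences are hypothetical and are not taken from the IC814 dataset.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_evidence(sentences):
    """Map each unordered entity pair to the ids of sentences mentioning both.

    `sentences` is a dict {doc_id: iterable of entity names}; each sentence is
    treated as one document, as in the case study.
    """
    evidence = defaultdict(set)
    for doc_id, entities in sentences.items():
        # sorted() gives every pair a canonical (a, b) order with a < b.
        for a, b in combinations(sorted(set(entities)), 2):
            evidence[(a, b)].add(doc_id)
    return dict(evidence)


# Hypothetical toy input, not taken from the dataset.
sentences = {
    "d1": ["A", "B"],
    "d2": ["A", "B", "C"],
    "d3": ["C"],
}
evidence = cooccurrence_evidence(sentences)
```

Each key of `evidence` is a candidate relationship, and its value is the set of supporting sentences that an analyst could inspect.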
Text-Cube Model Represents Entities and Relationships
• An entity is either a named entity or a conceptual entity
• An n-dimensional TUBE is a tuple T = <S, B, M, D>
  – S: Schema = {s1, s2, …, sn}
    • si denotes the list of entities of dimension i
  – B: Mask
    • 0 or 1 value
  – M: Measure = {m1, m2, …, m|M|}
    • Each measure mi is associated with a measure function mfi
  – D: Document collection
  – A TUBE T has |s1| × |s2| × … × |sn| cells
• A cell c
  – Has document evidence, denoted Fd(c)
  – Is present if B(c) = 1, or hidden if B(c) = 0
  – Has measure value c.mj, computed by mfj(c)
  – Represents a co-occurrence relationship if Fd(c) is not empty
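The tuple T = <S, B, M, D> can be sketched as a small data structure. This is only an illustration under stated assumptions, not the authors' implementation: the class name `Tube`, the choice to treat unmasked cells as hidden, the reading of Fd(c) as "documents mentioning every entity of the cell", and the `support` measure are all hypothetical. Entities are plain strings here, so conceptual entities are not modelled.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Tube:
    """Illustrative n-dimensional TUBE T = <S, B, M, D>."""
    schema: list      # S: one list of entities per dimension
    documents: dict   # D: {doc_id: set of entities mentioned}
    measures: dict = field(default_factory=dict)  # M: {name: f(tube, cell)}
    mask: dict = field(default_factory=dict)      # B: {cell: 0 or 1}

    def cells(self):
        """All |s1| x |s2| x ... x |sn| cells, as tuples of entities."""
        return list(product(*self.schema))

    def evidence(self, cell):
        """Fd(c): ids of documents that mention every entity of the cell."""
        return {d for d, ents in self.documents.items() if set(cell) <= ents}

    def is_present(self, cell):
        """Present if B(c) = 1; cells without a mask entry count as hidden."""
        return self.mask.get(cell, 0) == 1

    def measure(self, cell, name):
        """c.m_j, computed by the measure function mf_j."""
        return self.measures[name](self, cell)


# Hypothetical toy data; `support` simply counts evidence documents.
docs = {"d1": {"e0", "e1"}, "d2": {"e1", "e2"}}
ents = ["e0", "e1", "e2"]
t2 = Tube(schema=[ents, ents], documents=docs,
          measures={"support": lambda t, c: len(t.evidence(c))})
t2.mask[("e0", "e1")] = 1
```

Computing evidence on demand keeps the sketch short; a real system would more likely index documents per cell once.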
Measure formulas
Two TUBE Instances for entity relationship discovery
• A discovery task is to find interesting paths between two entities, source (s) and target (t)
  – A path represents a chain of relationships
• 1-dimensional TUBE instance: T1 = <S1, B1, M1, D>
  – S1 initialized as all named entities
  – M1 = {path_strength}
    • The strength of the shortest path between s and t through an entity
• 2-dimensional TUBE instance: T2 = <S2, B2, M2, D>
  – S2 initialized as all named entities on both dimensions
  – M2 = {name_sim, strength, dom_entity}
    • name_sim
      – Computed by edit distance
    • strength
      – Computed by Jaccard coefficient or Dice coefficient
    • dom_entity
      – ej dominates ei if ej appears whenever ei appears
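The three T2 measures can be sketched with standard definitions. The slides name the techniques but not their exact formulas, so the details below are assumptions: `name_sim` normalizes Levenshtein distance by the longer name's length, `jaccard` and `dice` are computed over the sets of documents mentioning each entity, and `dominates` is read as set containment of those document sets. path_strength is left out because its formula is not given.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def name_sim(a, b):
    """Similarity in [0, 1]; the normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def jaccard(docs_i, docs_j):
    """|Di & Dj| / |Di | Dj| over document-id sets."""
    union = docs_i | docs_j
    return len(docs_i & docs_j) / len(union) if union else 0.0


def dice(docs_i, docs_j):
    """2 |Di & Dj| / (|Di| + |Dj|) over document-id sets."""
    total = len(docs_i) + len(docs_j)
    return 2 * len(docs_i & docs_j) / total if total else 0.0


def dominates(docs_j, docs_i):
    """True if ej appears in every document where ei appears."""
    return bool(docs_i) and docs_i <= docs_j
```

For example, `jaccard({"d1", "d2", "d3"}, {"d2", "d3", "d4"})` shares two of four distinct documents, and `dominates({"d1", "d2"}, {"d1"})` holds because the second entity never appears without the first.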
Related Work
• Social network visualization
  – Assumes entities and relations
    • have been identified and verified
    • can be studied without supporting documents
  – Uses only measures of graph structure, such as degree and centrality
• Automatic path/subgraph finding algorithms
  – Users have little control over the relations and entities involved
  – Do not consider semantically identical entities
Formal definition of entity
• Entity e is defined as a named object or a set of other entities.
Tube operations
• Insert
  – Add an entity to a dimension
• Remove
  – Remove an existing entity from a dimension
• SelectCell
  – Assign 0 or 1 to an entry (a cell in T) in Mask
• Cluster
  – Add a new conceptual entity representing a subset of entities to a dimension
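The four operations can be sketched on a stripped-down TUBE that keeps only the schema and the mask. This is a hypothetical illustration: the class name `TubeOps`, the use of a `frozenset` to represent a conceptual entity, and the choice to drop mask entries of a removed entity are assumptions, not details from the slides.

```python
class TubeOps:
    """Illustrative TUBE holding only the schema S and the mask B."""

    def __init__(self, n_dims):
        self.schema = [[] for _ in range(n_dims)]  # S: entities per dimension
        self.mask = {}                             # B: {cell tuple: 0 or 1}

    def insert(self, dim, entity):
        """Insert: add an entity to a dimension."""
        if entity not in self.schema[dim]:
            self.schema[dim].append(entity)

    def remove(self, dim, entity):
        """Remove: drop an existing entity from a dimension and its cells."""
        self.schema[dim].remove(entity)
        self.mask = {c: v for c, v in self.mask.items() if c[dim] != entity}

    def select_cell(self, cell, value):
        """SelectCell: assign 0 or 1 to the cell's mask entry."""
        if value not in (0, 1):
            raise ValueError("mask value must be 0 or 1")
        self.mask[cell] = value

    def cluster(self, dim, members):
        """Cluster: add a conceptual entity for a subset of a dimension."""
        members = frozenset(members)
        if not members <= set(self.schema[dim]):
            raise ValueError("members must already be in the dimension")
        self.insert(dim, members)
        return members


t1 = TubeOps(1)
for e in ["e0", "e1", "e2"]:
    t1.insert(0, e)
group = t1.cluster(0, ["e0", "e1"])  # conceptual entity {e0, e1}
t1.select_cell((group,), 1)          # show the conceptual entity
t1.select_cell(("e2",), 1)
t1.remove(0, "e2")                   # also discards the ("e2",) cell
```

Note that `cluster` only adds the conceptual entity; hiding its members would be a separate `select_cell` call with value 0.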
Visual Analytics Operations
• Insert an entity
  – SelectCell in T1 and T2
  – Reveals all relationships this entity has with all entities in the network
• Delete
  – Delete a named entity
    • SelectCell in T1
  – Delete a conceptual entity
    • Remove in T1 and T2
  – Delete a relationship (a cell)
    • SelectCell in T2
• Cluster
  – Cluster in T1 and T2