Quick survey: How many of us...
● Regularly develop Elastic applications?
● Develop Elastic applications that include names of…
○ ...People?
○ ...Places?
○ ...Products?
○ ...Organisations?
○ …(other entity types)?
● Have names in languages besides English?
● Want to have better name search?
● Are Elasticsearch or plugin developers?
Motivating Questions...
● How could a border officer know whether you’re on a terrorist watch list?
● How does your bank know if you’re wiring money to a Colombian drug lord?
● How can an ecommerce site treat “Ho-medics Ultra sonic” and “Homedics Ultrasconic” as the same thing?
● How can a system search for mentions of people across news articles?
Reality...
April 15, 2013, 2:49 PM
Real life example
David K. Murgatroyd, VP of Engineering
Boarding Pass
Current Best Practice?
● multi_field type with a field per possible variation (http://stackoverflow.com/questions/20632042/elasticsearch-searching-for-human-names)
"mappings": { ...
  "type": "multi_field",
  "fields": {
    "pty_surename": { "type": "string", "analyzer": "simple" },
    "metaphone": { "type": "string", "analyzer": "metaphone" },
    "porter": { "type": "string", "analyzer": "porter" } …
● Complex query against each field
● Generally gives high recall (but how do you get high precision too?)
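To make the recall/precision point concrete, here is a minimal sketch of the "one field per variation" idea. The class name, the field names "simple" and "phonetic", and the crude phonetic key are all hypothetical; the key is only a stand-in for a real phonetic analyzer such as metaphone, not an implementation of it.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class NameVariantFields {

    /** Crude, hypothetical stand-in for a phonetic analyzer (NOT metaphone). */
    static String crudePhoneticKey(String token) {
        StringBuilder key = new StringBuilder();
        char previous = 0;
        for (char c : token.toLowerCase(Locale.ROOT).toCharArray()) {
            if (c < 'a' || c > 'z') {
                continue; // ignore anything that is not a basic Latin letter
            }
            boolean first = key.length() == 0;
            boolean skipped = "aeiouyhw".indexOf(c) >= 0; // vowels and weak consonants
            boolean doubled = c == previous;              // immediate repeat, e.g. "tt"
            if (first || (!skipped && !doubled)) {
                key.append(c);
            }
            previous = c;
        }
        return key.toString();
    }

    /** One value per variation, mirroring one sub-field per analyzer. */
    static Map<String, String> variants(String surname) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("simple", surname.toLowerCase(Locale.ROOT));
        fields.put("phonetic", crudePhoneticKey(surname));
        return fields;
    }

    /** A match on ANY variation counts as a hit: high recall, weak precision. */
    static boolean matchesAnyField(String indexed, String query) {
        Map<String, String> a = variants(indexed);
        Map<String, String> b = variants(query);
        for (String field : a.keySet()) {
            if (a.get(field).equals(b.get(field))) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch "Smith" matches "Smyth", which is the recall we want, but it also matches "Smoot", which is the precision problem the slide asks about.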
So can a name field-type do this?
● Manage all the subfields
● Contribute score that reflects phenomena
● Be part of queries using many field types
● Have multiple fields per document
● Have multiple values per field (coming soon)
“Jesus Alfonso Lopez Diaz”
vs.
“LobezDias, Chuy”
Can we do better?
● Incorporates our proprietary name matching technology
● Provides similarity scores for name pairs
● Uses Elasticsearch's Rescore query
● Allows for higher precision ranking and thresholding
● Multi-lingual name search
RNI
Elastic + RNI
Plug-in Implementation
Query time:
● User Query: match : { name: "Bob Smitty" }
● Main Query: bool: name.Key1:... name.Key2:... name.Key3:...
● Rescore Query (over the subset returned from the index): name_score : { field : "name", name : "Bob Smitty" }
● Result: name: "Robert Smith", dob: 2/13/1987, score: .79
Indexing:
● User Doc: { name: "Robert Smith", dob: "1987/02/13" }
● Index: { name: "Robert Smith", name.Key1:…, name.Key2:…, name.Key3:…, dob: "1987/02/13" }
Demo
How could you use such a Field?
● Plugin contains custom mapper which does all the work behind the scenes
PUT /ofac/ofac/_mapping
{
  "ofac" : {
    "properties" : {
      "name" : { "type" : "rni_name" },
      "aka" : { "type" : "rni_name" }
    }
  }
}
What happens at index time?
● NameMapper indexes keys for different phenomena in separate (sub) fields
@Override
public void parse(ParseContext context) throws IOException {
    Name name = NameBuilder.data(nameString).build();
    // Generate keys for name
    Collection<FieldSpec> fields = helper.deriveFieldsForName(name);
    // Parse each key with the appropriate Mapper
    for (FieldSpec field : fields) {
        Mapper mapper = keyMappers.get(field.getField().fieldName());
        context = context.createExternalValueContext(field.getStringValue());
        mapper.parse(context);
    }
}
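The effect of that parse step can be sketched without any Elasticsearch classes: one name value is expanded into several sub-field values. The slides do not say which keys RNI derives (they appear only as Key1, Key2, Key3), so the ".token" and ".prefix" sub-fields below are hypothetical placeholders chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class NameKeyExpansion {

    /**
     * Expands one name value into sub-field values.
     * The sub-field names and key definitions are assumptions, not RNI's keys.
     */
    static Map<String, List<String>> deriveKeyFields(String field, String name) {
        List<String> tokens = new ArrayList<>();
        List<String> prefixes = new ArrayList<>();
        // Split on anything that is not a letter (spaces, commas, hyphens, ...)
        for (String raw : name.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading separator yields an empty first piece
            }
            String token = raw.toLowerCase(Locale.ROOT);
            tokens.add(token);
            prefixes.add(token.substring(0, Math.min(2, token.length())));
        }
        Map<String, List<String>> subFields = new LinkedHashMap<>();
        subFields.put(field + ".token", tokens);
        subFields.put(field + ".prefix", prefixes);
        return subFields;
    }
}
```

Calling deriveKeyFields("name", "Robert Smith") yields a "name.token" entry and a "name.prefix" entry, analogous to the name.Key1, name.Key2, ... sub-fields stored next to the original value.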
What happens at query time?
● Step #1: NameMapper generates analogous keys for a custom Lucene query that finds good candidates for re-scoring
@Override
public Query termQuery(Object value, @Nullable QueryParseContext context) {
    // Parse name string
    Name name = NameBuilder.data(value.toString()).build();
    QuerySpec spec = helper.buildQuerySpec(new NameIndexQuery(name));
    // Build Lucene query
    Query query = spec.accept(new ESQueryVisitor(names.indexName() + "."));
    return query;
}
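A rough picture of what such a candidate query does, using a tiny in-memory inverted index instead of Lucene. This is only a sketch under assumptions: the two-letter-prefix key is hypothetical, and the OR-over-keys lookup merely imitates a bool query across the key sub-fields.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class CandidateQuerySketch {

    // key -> ids of documents whose name produced that key
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Hypothetical key: the lowercase two-letter prefix of each name token. */
    static Set<String> keys(String name) {
        Set<String> result = new LinkedHashSet<>();
        for (String raw : name.split("[^\\p{L}]+")) {
            if (!raw.isEmpty()) {
                String token = raw.toLowerCase(Locale.ROOT);
                result.add(token.substring(0, Math.min(2, token.length())));
            }
        }
        return result;
    }

    void index(int docId, String name) {
        for (String key : keys(name)) {
            postings.computeIfAbsent(key, k -> new TreeSet<>()).add(docId);
        }
    }

    /** OR over all query keys; the value is the number of keys a document shares. */
    Map<Integer, Integer> candidates(String queryName) {
        Map<Integer, Integer> hits = new TreeMap<>();
        for (String key : keys(queryName)) {
            Set<Integer> docs = postings.getOrDefault(key, Collections.<Integer>emptySet());
            for (int doc : docs) {
                hits.merge(doc, 1, Integer::sum);
            }
        }
        return hits;
    }
}
```

The query only has to be cheap and generous: anything sharing at least one key becomes a candidate, and the ranking is left to the re-scoring step.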
What else happens at query time?
● Step #2: Uses a Rescore query to score names in the best candidate documents and reorder accordingly
○ Tuned for high precision name matching
○ Computationally expensive
"rescore" : {
  "query" : {
    "rescore_query" : {
      "function_score" : {
        "name_score" : {
          "field" : "name",
          "query_name" : "LobEzDiaS, Chuy"
        }
        ...
● The 'name_score' function matches the query name against the indexed name in every candidate document and returns the similarity score
@Override
public double score(int docId, float subQueryScore) {
    // Create a scorer for the query name
    CachedScorer cs = createCachedScorer(queryName);
    // Retrieve name data from doc values
    nameByteData.setDocument(docId);
    // Read the first value; multiple values per field are not yet supported
    Name indexName = bytesToName(nameByteData.valueAt(0).bytes);
    // Score the query against the indexed name in this document
    return cs.score(indexName);
}
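The actual similarity computation is proprietary and not shown in these slides. As a stand-in, the sketch below scores a name pair with a normalised edit distance per token; it only illustrates the shape of a pairwise score in [0, 1] and is not the RNI algorithm. All names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class NameSimilaritySketch {

    /** Classic Levenshtein distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        int[] current = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertionOrDeletion = Math.min(previous[j], current[j - 1]) + 1;
                current[j] = Math.min(substitution, insertionOrDeletion);
            }
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.length()];
    }

    static List<String> tokens(String name) {
        List<String> result = new ArrayList<>();
        for (String raw : name.split("[^\\p{L}]+")) {
            if (!raw.isEmpty()) {
                result.add(raw.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    /** 1.0 for identical tokens, approaching 0.0 as they diverge. */
    static double tokenSimilarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }

    /** Average, over the query tokens, of the best-matching indexed token. */
    static double score(String queryName, String indexedName) {
        List<String> query = tokens(queryName);
        List<String> indexed = tokens(indexedName);
        if (query.isEmpty() || indexed.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (String q : query) {
            double best = 0.0;
            for (String d : indexed) {
                best = Math.max(best, tokenSimilarity(q, d));
            }
            total += best;
        }
        return total / query.size();
    }
}
```

Because every query token is compared with every indexed token, the cost grows quickly with the number of documents scored, which is why this kind of scoring is reserved for a small set of candidates.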
What does that function do?
High Recall Query (Elastic)
→ Subset of High Recall Results (Total < windowsize & Score > minimumScoreThreshold)
→ Re-scoring: High Precision Query
→ Scored Results
Trading Off Accuracy for Speed
● window_size
○ Controls how many of the subset documents to rescore (imagine a HUGE name index)
○ Trade-off accuracy vs speed
● minScoreToCheck - (Added by Us)
○ Lucene score threshold subset docs must meet to be rescored
○ Trade-off accuracy vs speed
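How the two knobs interact can be sketched as follows. This is a simplified model, not the plugin's code: it assumes that hits outside the window or below the threshold keep their first-pass order and are listed after the rescored hits, and it replaces the first-pass score outright rather than combining the two scores. The Hit class and the scorer parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

final class RescoreWindowSketch {

    static final class Hit {
        final String name;
        final double score;

        Hit(String name, double score) {
            this.name = name;
            this.score = score;
        }
    }

    /**
     * Rescores only the top windowSize hits whose first-pass score reaches
     * minScoreToCheck; every other hit is appended in first-pass order.
     */
    static List<Hit> rescore(List<Hit> firstPass, int windowSize, double minScoreToCheck,
                             ToDoubleFunction<String> expensiveScorer) {
        List<Hit> ranked = new ArrayList<>(firstPass);
        ranked.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());

        List<Hit> rescored = new ArrayList<>();
        List<Hit> untouched = new ArrayList<>();
        for (int i = 0; i < ranked.size(); i++) {
            Hit hit = ranked.get(i);
            if (i < windowSize && hit.score >= minScoreToCheck) {
                // The expensive, high-precision scorer runs only here
                rescored.add(new Hit(hit.name, expensiveScorer.applyAsDouble(hit.name)));
            } else {
                untouched.add(hit);
            }
        }
        rescored.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        rescored.addAll(untouched);
        return rescored;
    }
}
```

A smaller window or a higher threshold means fewer calls to the expensive scorer, at the risk of never rescoring a good match that the first pass ranked low.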
What Challenges Were There?
● Design based on similar Solr plugin
● 1-2 months solo development time
● Nice plugin infrastructure
● Missing some useful javadocs/comments
● No (official) plugin development guide
● Used other plugin implementations as guides https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#_plugins
Summary: How it works
● Custom field type mapping
○ Splits a single field into multiple fields covering different phenomena
○ Supports multiple name fields in a document
○ Intercepts the query to inject a custom Lucene query
● Custom re-score function
○ Re-scores documents with algorithm specific to name matching
○ Limits intense calculations to only top candidates
○ Highly configurable
Major terrorist attack is ‘inevitable’ as Isis fighters return, say EU officials
The Guardian, Thursday 25 September 2014
Trojan horse: ISIS militants come to Europe disguised as refugees, US intel sources claim
RT.com, Thursday 9 October 2014
Europe fears more 9/11 or 7/7 terrorist attacks delivered by rising Militant Group ISIS
Europeans are returning from Syria en masse. Many are fleeing the rise of ISIS, but many may also be radicalised natives. This is bad for Europe, and the US may also have to consider its visa waiver program for Europe.
ISIS militants may enter Europe posing as refugees
The Turkish border is the issue: poor passport control and no visa requirements mean free crossing of the Syrian-Turkish border, and many refugees take this route. It is nearly impossible to separate jihadists from legitimate refugees.
The problem at the border
● Land border control is at the heart of the problem
○ Islamic State averse to using air travel
○ lack of a visa requirement between Turkey and Syria
○ large number of refugees crossing Turkey-Syria land border
○ large number of European ex-pats leaving Syria
○ The Islamic State ‘Trojan horse’ - Jihadist terrorists radicalised from ex-pats or posing as refugees
● There is currently a high dependence on visas for preventing movements
○ This reliance could be relieved by having effective control at the borders for name / identity checking
Islamic State militant poses with flag
The problem in numbers
● ~5000 EU citizens currently fighting alongside IS in Syria
● Including 500+ German citizens
● 1000+ French citizens, only ~150 returned to France so far (Charlie Hebdo was only a handful)
● Compare with fewer than 200 EU citizens who fought alongside al-Qaeda / Taliban
● FBI watchlist: 400,000 suspects, 1M aliases; 1,600 new names, 600 deletions, 4,800 corrections every single day!