TNA: how taxonomy applications were built

Jeremie Charlet

04 08 2015

Trial-and-error experiments on Taxonomy Applications

Introduction

What is Taxonomy?

To better understand what it is about, let’s run a search on Discovery!

Taxonomy is just about classification.

Here it concerns the applications used to apply categories (or subjects) to the records in Discovery.

Project involving several people from the Taxonomy team and the Systems Development team

Solution

Administration interface for taxonomists

Application to categorise everything once
1. To do it for the first time
2. To apply the latest modifications from taxonomists to all documents

Application to categorise documents every day
1. To categorise new documents
2. To re-categorise documents when they are updated

Plan

This presentation is all about how we built this categorisation system

A. Using category queries
1. Get it right
2. Get it fast
   a) Evolution of the algorithm
   b) Fine tuning
   c) Scale out

B. Attempt using machine learning
o Using a training set based algorithm

A. Using category queries

How to categorise a document?

Solution (from the former system, Autonomy): 1 category = 1 search query

Example, category “air force”:
"Air Force" OR "air forces" OR "Air Ministry" OR "Air Historical Branch" OR "Air Department" OR "Air Board" OR "Air Council" OR "Department of the Air Member" OR "air army" …
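To make the "1 category = 1 search query" idea concrete, here is a minimal sketch that treats a category query as an OR of quoted phrases and matches it case-insensitively against a record's text. The function names, the shortened query and the sample records are hypothetical; the real system ran its queries through Lucene rather than plain string matching.

```python
import re


def parse_or_query(query: str) -> list[str]:
    """Extract the quoted phrases from a query such as '"a b" OR "c"'."""
    return [phrase.lower() for phrase in re.findall(r'"([^"]+)"', query)]


def matches(query: str, text: str) -> bool:
    """True if any phrase of the OR query occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in parse_or_query(query))


# Shortened, illustrative version of the "air force" category query.
AIR_FORCE = '"Air Force" OR "air forces" OR "Air Ministry" OR "Air Council"'

print(matches(AIR_FORCE, "Minutes of the Air Council, 1936"))  # True
print(matches(AIR_FORCE, "Poor Law Union correspondence"))     # False
```

A record belongs to the category as soon as one phrase is found, which is why the quality of each query matters so much in what follows.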

A.1. Get it right

Many parameters to take into account:
• Is case sensitivity important?
• Use synonyms?
• Ignore stop words (of, the, a, …)?
• Which attributes to use (title, description, …)? Are some more important than others?
• And many others

> Iterative process

How to evaluate whether our results are valid?
> Use documents and categories from the former system
> Categorise them again and compare results

To do that quickly, we created a Command Line Interface:

[jcharlet@server ~]$ ./runCli.sh -EVALcategoriseEvalDataSet --lucene.index.useSynonymFilter=true
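The "categorise again and compare" step can be sketched as follows. This is only an illustration: the slides do not say which metric the evaluation used, so precision and recall over (document, category) pairs are an assumption, and the data and function name are hypothetical.

```python
def evaluate(expected: dict[str, set[str]],
             actual: dict[str, set[str]]) -> tuple[float, float]:
    """Return (precision, recall) over all (document, category) pairs."""
    expected_pairs = {(doc, cat) for doc, cats in expected.items() for cat in cats}
    actual_pairs = {(doc, cat) for doc, cats in actual.items() for cat in cats}
    correct = len(expected_pairs & actual_pairs)
    precision = correct / len(actual_pairs) if actual_pairs else 0.0
    recall = correct / len(expected_pairs) if expected_pairs else 0.0
    return precision, recall


# Hypothetical data: categories from the former system vs. the new run.
former = {"doc1": {"Air Force"}, "doc2": {"Poverty", "Health"}}
new = {"doc1": {"Air Force"}, "doc2": {"Poverty"}, "doc3": {"Navy"}}

precision, recall = evaluate(former, new)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running such a comparison after every parameter change is what turns the evaluation into a regression and benchmarking tool.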

A.1. Get it right

Findings
1. Automating the evaluation
o saved me a lot of time
o regression tool
o benchmarking tool
2. Using a training set based system was not satisfactory
3. Needed to ignore case sensitivity + punctuation in most cases
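Finding 3 can be illustrated with a small normalisation step applied to both the query phrases and the record text before matching. This is a sketch under assumptions: in the real system this kind of work would be done by the Lucene analysis chain, and the function name is hypothetical.

```python
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalise(text: str) -> str:
    """Lower-case, replace ASCII punctuation with spaces, collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TO_SPACE).split())


print(normalise("AIR-FORCE; Air Ministry."))  # air force air ministry
```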

A.2 Get it fast

How to apply our 140 categories to 22 million records quickly?

How fast do we need our system to be?

• Former system: 10+ days
o clunky
o Have to wait months to do it again
o What if categorisation goes wrong? Start again for 10 days?

• Target: ~1 day, i.e. 1 document categorised in 4ms
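The link between the per-document budget and the total run time is simple arithmetic, worked through below with the figures from the slide (22 million records, 4ms per document). The function name is hypothetical.

```python
RECORDS = 22_000_000


def total_hours(ms_per_doc: float, records: int = RECORDS) -> float:
    """Total run time in hours when each record takes ms_per_doc milliseconds."""
    return records * ms_per_doc / 1000 / 3600


print(round(total_hours(4), 1))  # 24.4 hours, i.e. roughly 1 day
```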

A.2.a Evolution of the algorithm

• Let’s categorise 1 document at a time
• Run queries in parallel
• Run inverted queries
• Run every query against every document one after another on the file index
• Run queries against memory index
• Run queries in memory to find candidates and run the candidates against the file index

Solution Time to categorise everything

Works

A few years

Fewer years

About 10 days

?

About 10 days (60ms/doc)
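The last approach (find candidates in memory, then confirm them against the file index) is a two-stage filter. The sketch below shows the general shape only: the slides do not explain how candidates were selected, so the word-overlap pre-filter, the category data and all names are assumptions, and the "file index" is simulated by a plain phrase check.

```python
# Hypothetical categories: each one is an OR of phrases.
CATEGORIES = {
    "Air Force": ["air force", "air ministry"],
    "Poverty": ["poor law", "workhouse"],
}

# Stage 1 data, kept in memory: every word used by each category.
CATEGORY_WORDS = {
    name: {word for phrase in phrases for word in phrase.split()}
    for name, phrases in CATEGORIES.items()
}


def find_candidates(text: str) -> list[str]:
    """Cheap in-memory pass: keep categories sharing at least one word with the document."""
    doc_words = set(text.lower().split())
    return [name for name, words in CATEGORY_WORDS.items() if words & doc_words]


def confirm(name: str, text: str) -> bool:
    """Stand-in for the expensive check (the full query run against the file index)."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in CATEGORIES[name])


def categorise(text: str) -> list[str]:
    """Run the expensive check only on the candidates found by the cheap pass."""
    return [name for name in find_candidates(text) if confirm(name, text)]


print(categorise("Air Ministry files on the poor"))  # ['Air Force']
```

The cheap pass may let false candidates through ("Poverty" here), but it must never drop a true match; the expensive pass then only runs for a handful of categories instead of all of them.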

A.2.b Fine tuning

Use the right driver for your system (NRTCachingDirectory instead of the default one)
> 1 line in 1 file = 20% faster on search queries

Use a filter instead of a query to search on only 1 document + use the low-level API carefully

Profile your application frequently
> Identify ugly code, where to add cache, where to add concurrency
> Spent 7% of the time creating Query objects for every document: instead, create them once and store them in memory
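The "create them once and store them in memory" fix looks roughly like this. It is an analogy, not the original code: compiled regular expressions stand in for Lucene Query objects, and the category data and names are hypothetical.

```python
import re

# Hypothetical categories: each one is an OR of phrases.
CATEGORY_PHRASES = {
    "Air Force": ["air force", "air ministry"],
    "Poverty": ["poor law", "workhouse"],
}


def compile_query(phrases: list[str]) -> re.Pattern:
    """Build the (relatively costly) query object for one category."""
    return re.compile("|".join(re.escape(p) for p in phrases), re.IGNORECASE)


# Built once at start-up and reused for every document,
# instead of being rebuilt inside categorise() on each call.
COMPILED = {name: compile_query(phrases) for name, phrases in CATEGORY_PHRASES.items()}


def categorise(text: str) -> list[str]:
    """Apply the pre-built queries to one document."""
    return [name for name, pattern in COMPILED.items() if pattern.search(text)]


print(categorise("Air Ministry minutes"))  # ['Air Force']
```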

A.2.c Scale out

Requires a suitable architecture: ~microservices-like vs monolithic application

Back to the solution…

GUI for taxonomists (+ backend for GUI)
• Available at all times
• Does search queries
• Updates categories

Application to categorise everything once
• Runs once in a while
• Needs a huge number of instances to do the job as fast as possible
• Categorises everything

Application to categorise documents every day
• Runs every night
• Receives categorisation requests from another system

On the currently available platform:

2 * 24-core CPU
40 GB RAM

2 * 6 categorisation processes

Categorise 22m documents in 1d 8h = 5ms to categorise 1 doc

Before scaling out: run queries in memory to find candidates and run the candidates against the file index: about 10 days (60ms/doc)

Progress is linear: 60ms/doc spread over 2 * 6 = 12 processes gives 5ms/doc
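The figures on this slide can be checked with a few lines of arithmetic. The sketch only reuses numbers from the slides (22m documents, 1d 8h, 2 * 6 processes); the function name is hypothetical.

```python
def ms_per_doc(days: float, hours: float, records: int) -> float:
    """Average wall-clock time per document, in milliseconds."""
    return (days * 24 + hours) * 3600 * 1000 / records


overall = ms_per_doc(1, 8, 22_000_000)  # whole platform
per_process = overall * 2 * 6           # what each of the 12 processes spends per doc

print(round(overall, 2))      # 5.24
print(round(per_process, 1))  # 62.8
```

Each process still needs about 60ms per document, as before; the gain comes from running 12 of them side by side.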

Let’s imagine that we use cloud services

Let’s suppose we already pay for something equivalent on Microsoft Azure: 4 * D3 instance

INSTANCE   CORES   RAM     DISK SIZES   PRICE
D3         4       14 GB   200 GB       £0.4179/hr

How much does it cost to use twice that number of servers to be twice as fast (ideally)?
NOTHING (* if you shut down your servers once the process has ended)
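Why doubling the servers costs nothing extra, under ideal linear scaling and pay-per-hour billing, can be shown with a short calculation. The price is the D3 rate quoted above; the 32-hour run time is a hypothetical figure used only for illustration.

```python
PRICE_PER_HOUR = 0.4179  # GBP per D3 instance per hour, as quoted above


def cost(instances: int, hours: float) -> float:
    """Total bill when instances are shut down as soon as the run ends."""
    return instances * hours * PRICE_PER_HOUR


base = cost(4, 32)     # hypothetical: 4 instances running for 32 hours
doubled = cost(8, 16)  # twice the instances, half the time (ideal scaling)

print(round(base, 2), round(doubled, 2))  # 53.49 53.49
```

The bill depends only on instance-hours, so it stays the same as long as scaling is linear and idle servers are shut down.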

B. Using machine learning

Research on a training set based solution for 2 months: biggest failure, best learning

1. Take a data set of known (already classified) documents
2. Split it into a test set and a training set
o Train the system with the training set
o Evaluate it using the test set
o Iterate until satisfactory
3. Move it to production
o Classify new documents using the trained system
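The train / evaluate / classify workflow can be sketched with a toy classifier. This is not the tool used in the project (which was Lucene's built-in one): the word-count model below is a deliberately simple stand-in, and the data and names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def split(dataset, test_ratio=0.25, seed=0):
    """Shuffle the known documents and split them into (training set, test set)."""
    items = list(dataset)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * (1 - test_ratio))
    return items[:cut], items[cut:]


def train(training_set):
    """Count how often each word appears under each category."""
    model = defaultdict(Counter)
    for text, category in training_set:
        model[category].update(text.lower().split())
    return model


def classify(model, text):
    """Pick the category whose training words overlap most with the text."""
    words = text.lower().split()
    return max(model, key=lambda cat: sum(model[cat][w] for w in words))


def accuracy(model, test_set):
    """Share of test documents that receive their known category."""
    hits = sum(classify(model, text) == category for text, category in test_set)
    return hits / len(test_set)


# Hypothetical, already classified documents.
known = [
    ("air ministry minutes", "Air Force"),
    ("workhouse poor law records", "Poverty"),
]
model = train(known)
print(classify(model, "poor law union"))  # Poverty
```

In a real run, `split` would produce the two sets from the known documents, `accuracy` would be measured on the test set only, and the loop would repeat until the score is satisfactory.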

Why it did not work
1. Using category queries to create the training set
o Highly dependent on the validity/accuracy of the category queries
2. Nature of our categories
o far too many (136)
o categories too vague or too similar (“Poverty”): do not suit such a system
3. Not the right tool? We used the built-in tool of Lucene (a search engine)
4. Nature of the data?

Why we should get into it
• Capabilities are impressive (examples)
• Enabled thanks to Cloud Computing (the computing power needed is all available)
• Machine Learning As A Service
> You can play with it for free (*), start prototyping
