Autoclassification - rules versus machine learning

Post on 22-Jan-2018

Jeff Fried, CTO, BA Insight

@jefffried #tbc2016

Rules-Based vs. Document-Based Bake-off

Focused on Search and SharePoint since 2004

Longtime Search Nerd

• CTO, BA Insight

• Senior PM, Microsoft

• VP, FAST

• SVP, LingoMotors

About Jeff Fried

Passionate About

• Search

• SharePoint

• Search-driven applications

• Information Strategy

Blog:

BAinsight.com/blog

Technet Column

“A View from the Crawlspace”

jeff.fried@bainsight.com

About BA Insight

– Connectivity

– Applications

– Classification

– Analytics

Metadata Drives Great User Experiences

Documents from many sources: All client or matter-relevant documents are integrated.

Rich Metadata: Content annotated automatically – concepts, categories, citations, matters, clients, etc.

Navigation Controls: Explore, Discover, Drill-down

Manual Tagging is impractical and remarkably inconsistent

Automation

Called: AutoClassification, AutoTagging, Metadata Generation, Text Analytics, …

Complicators

Common Techniques across Applications

Rules-based Approach

Enhanced Content: Enriched with Metadata and Content Types

Search, Visualization, Workflow

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

Rule-based Classifier (Example)

R1: (Give Birth = no) AND (Can Fly = yes) → Birds
R2: (Give Birth = no) AND (Live in Water = yes) → Fishes
R3: (Give Birth = yes) AND (Blood Type = warm) → Mammals
R4: (Give Birth = no) AND (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
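As a rough illustration of how rules R1 to R5 could be applied, here is a minimal Python sketch. It assumes the rules are tried in order and the first match wins; the slides do not say how conflicts are resolved, so that ordering is an assumption, as are the attribute names (`give_birth`, `can_fly`, and so on) and the helper `animal`.

```python
# Rules R1-R5 from the slide: (rule name, required attribute values, class).
# Attribute names are illustrative, not taken from any product.
RULES = [
    ("R1", {"give_birth": "no", "can_fly": "yes"}, "birds"),
    ("R2", {"give_birth": "no", "live_in_water": "yes"}, "fishes"),
    ("R3", {"give_birth": "yes", "blood_type": "warm"}, "mammals"),
    ("R4", {"give_birth": "no", "can_fly": "no"}, "reptiles"),
    ("R5", {"live_in_water": "sometimes"}, "amphibians"),
]


def classify(record, rules=RULES):
    """Return (rule_name, label) for the first rule whose conditions all hold.

    Assumption: rules are ordered and the first match wins.
    Returns (None, None) when no rule covers the record.
    """
    for name, conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return name, label
    return None, None


def animal(blood_type, give_birth, can_fly, live_in_water):
    """Hypothetical helper that builds one row of the table as a dict."""
    return {
        "blood_type": blood_type,
        "give_birth": give_birth,
        "can_fly": can_fly,
        "live_in_water": live_in_water,
    }


# A few rows copied from the table above.
examples = {
    "human": animal("warm", "yes", "no", "no"),
    "pigeon": animal("warm", "no", "yes", "no"),
    "salmon": animal("cold", "no", "no", "yes"),
    "turtle": animal("cold", "no", "no", "sometimes"),
    "leopard shark": animal("cold", "yes", "no", "yes"),
}

for name, record in examples.items():
    print(name, classify(record))
```

Two properties of this rule set show up in the sketch. The turtle satisfies both R4 and R5, so the rules are not mutually exclusive and some tie-breaking policy is needed. The leopard shark satisfies none of them, so the rules are not exhaustive and a real rule set would need a default class.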

Example Rules Engine UI

Examples of Rules

Boolean

• “IT” OR “Information Technology” or “MIS”

• (“Expert” OR “Witness”) NOT “police”

• “New York” AND “environmental policy”

• *work

• "legal" -briefs

• "Legal" NEAR(5) "issue"

Property-based

• filetype:docx

• title:"2029 L.P" or title:2030

• footer="BA Insight Confidential" or footer:proprietary or footer:BA*

Overriding/changing Linguistics

• NOSTEM("illumination")

• CASE("prerequisites")

• SOUNDLIKE("prerech")

Regular expressions

• title:REGEX([0-4])

• REGEX("\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))")
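To make the semantics of two of the rule styles above concrete, here is a small Python sketch of a Boolean rule and a proximity rule. It is not the syntax or behavior of any particular rules engine: the tokenizer, the case-insensitive matching, and the reading of NEAR(5) as "within five tokens, in either order" are all assumptions made for illustration, and the function names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def has_phrase(text, phrase):
    """True if the phrase occurs in the text as a run of whole tokens."""
    words, target = tokens(text), tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def near(text, left, right, distance):
    """True if the single words `left` and `right` occur within `distance`
    tokens of each other, in either order (assumed meaning of NEAR)."""
    words = tokens(text)
    left_pos = [i for i, w in enumerate(words) if w == left.lower()]
    right_pos = [i for i, w in enumerate(words) if w == right.lower()]
    return any(abs(i - j) <= distance for i in left_pos for j in right_pos)


def expert_witness_rule(text):
    # Sketch of: ("Expert" OR "Witness") NOT "police"
    return (has_phrase(text, "Expert") or has_phrase(text, "Witness")) \
        and not has_phrase(text, "police")


def legal_issue_rule(text):
    # Sketch of: "Legal" NEAR(5) "issue"
    return near(text, "legal", "issue", 5)


print(expert_witness_rule("The expert testified about the contract."))
print(expert_witness_rule("A police witness gave a statement."))
print(legal_issue_rule("This raises a legal and regulatory issue for the firm."))
```

Stemming, phonetic matching, and scoring thresholds are left out here; a production engine would layer those on top of this kind of token matching.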

Controlling scores & thresholds

Taxonomy Management is often included with Auto-Classification Tools

Where do you get Taxonomies?

Semantics! Machine Learning! AI!

Key Concepts

False positives vs. false negatives

Look at the impact of each in your context
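One common way to quantify the two error types is with precision (how many assigned tags were right) and recall (how many correct tags were found). The sketch below computes both from a handful of made-up labels; the data and the function names are hypothetical and are not from the presentation.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one target class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # tagged, and should have been
        elif p == positive:
            fp += 1          # false positive: tagged, but should not have been
        elif a == positive:
            fn += 1          # false negative: missed a document that should be tagged
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision penalizes false positives; recall penalizes false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth and classifier output for six documents.
actual = ["contract", "contract", "memo", "contract", "memo", "memo"]
predicted = ["contract", "memo", "memo", "contract", "contract", "memo"]

tp, fp, fn, tn = confusion_counts(actual, predicted, positive="contract")
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall)
```

Which number matters more depends on the application: a records-retention tag may tolerate false positives better than false negatives, while a "confidential" filter on public search might be the reverse.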

Machine Learning Approach

Example: identify people as good or bad from their appearance

Decision Tree Classifier

Building an accurate classifier

Training and Test Data
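A minimal sketch of the train/test idea, assuming a simple random holdout: some labeled examples are set aside, and the classifier is scored only on those. The documents, labels, and the keyword "classifier" below are all hypothetical stand-ins; a real project would plug in its own model and a much larger labeled set.

```python
import random


def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(classifier, labeled_examples):
    """Fraction of (text, label) pairs the classifier labels correctly."""
    correct = sum(1 for text, label in labeled_examples if classifier(text) == label)
    return correct / len(labeled_examples)


# Hypothetical labeled documents.
labeled = [
    ("invoice 1001 due in 30 days", "finance"),
    ("invoice 1002 past due", "finance"),
    ("quarterly budget forecast", "finance"),
    ("payment received for invoice 0998", "finance"),
    ("motion to dismiss filed", "legal"),
    ("deposition of expert witness", "legal"),
    ("settlement invoice dispute", "legal"),
    ("draft brief for appeal", "legal"),
]

train, test = train_test_split(labeled, test_fraction=0.25, seed=0)


def keyword_classifier(text):
    """Stand-in for a trained model: one hand-picked keyword."""
    return "finance" if "invoice" in text else "legal"


print(len(train), len(test), accuracy(keyword_classifier, test))
```

The point of the split is that accuracy measured on the held-out examples is a fairer estimate than accuracy on the same documents used to build or tune the classifier.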

Choosing the algorithm

Rules-based:

+ Easy to get started

+ Transparent and debuggable

+ Easily controlled (when # rules not too large)

- Need taxonomies

- Rule maintenance effort

- Harder to cover domain fully and to switch domains

Machine learning:

+ Don’t need taxonomies

+ Improves without manual maintenance

+ Handles new data types/domains more easily

- Need a training set

- Opaque, usually can’t debug

- Can’t specify or control specific examples

What would you use for

Case Study: Content Identification and Movement

Benchmarks

Large scale example

Combinations of Techniques usually work better

Examples of hybrid configurations

Example: clustering combined with rules

carrot2

Open Source & Platform packages offer an easy way to play

How to get started

Set up a metadata framework

– keep it simple

Develop or acquire managed vocabularies for critical elements

Start with rule-driven automation

Test out ML-based techniques as you grow

www.BAinsight.com

Jeff.Fried@BAinsight.com

@jefffried
