
Natural Intelligence: the Human Factor in A.I.
Big Data Expo 2017, Utrecht, Netherlands

About Me
• Former member of the Search team at @WalmartLabs
  • Former Head of the Metrics & Measurements team; I also led the Human Evaluation team
• About the Metrics and Measurements team
  • A team of engineers, analysts, and scientists in charge of providing accurate and exhaustive measurements
  • We also had an auditing role towards adjacent teams
• What do we measure?
  • Engineering metrics related to model and data quality
  • Business metrics (revenue, etc.)
  • More exotic customer-centric metrics (customer value, customer satisfaction, model impact, etc.)
• Currently Head of Data Science at Atlassian, in charge of the Search & Smarts team


Outline
❑ Humans & Big Data
  • The role of human beings in the era of Big Data
  • Why do we need to tag data?
  • How do we get tagged data?
❑ The Era of Crowdsourcing
  • What is Crowdsourcing?
  • Use cases and details about Crowdsourcing
  • Traditional crowds vs. curated crowds
❑ The Human-in-the-Loop Paradigm
  • Definition and details about Human-in-the-Loop ML
  • Introduction to Active Learning


Humans & Big Data:
The Role of Human Beings in the Era of Machine Learning

The Era of Very Big Data
❑ VOLUME
  • More data was created from 2013 to 2015 than in the entire previous history of the human race
  • By 2020, accumulated data will reach 44 trillion gigabytes
❑ VELOCITY
  • By 2020, ~1.7 MB of new data / second / human being
  • 1.2 trillion search queries on Google per year
❑ VARIETY
  • 31 million messages and 2.8 million videos per minute on Facebook
  • Up to 300 hours of video / minute are uploaded to YouTube
  • In 2015, 1 trillion photos were taken; billions were shared online

[Image: a data center at Google]


Supervised vs. Unsupervised Machine Learning

Supervised ML — requires tagged data:
• Classification: a problem where the output variable is a category
  (examples: SVM, random forest, Bayesian classifiers)
• Regression: a problem where the output variable is a real value
  (examples: linear regression, random forest)

Unsupervised ML — doesn't require tagged data:
• Clustering: discovery of inherent groupings in the data
  (examples: k-means, k-nearest neighbors)
• Association rules: discovery of rules describing the data
  (example: the Apriori algorithm)

Example applications (see the code contrast below):
• Supervised: image recognition, speech recognition
• Unsupervised: feature learning, autoencoders
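To make the contrast concrete, here is a minimal sketch (not from the talk) using scikit-learn on a toy dataset: the supervised classifier needs the tags y at training time, while the unsupervised clustering algorithm discovers groupings from X alone.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=300, centers=3, random_state=42)  # toy data

# Supervised: classification requires the tags y at training time
clf = RandomForestClassifier(random_state=42).fit(X, y)
print("predicted category:", clf.predict(X[:1]))

# Unsupervised: clustering discovers inherent groupings without y
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("discovered groupings:", km.labels_[:10])
```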

The Case of Deep Learning
• Deep Learning has both supervised and unsupervised applications
• NB: Deep Learning algorithms are data-greedy…

Tagged Data
• Gathering quality tagged training data is a common bottleneck in ML:
  • expensive
  • quality control is hard and requires a second human pass
  • hardly scalable → heavy use of sampling strategies
• How do companies doing Machine Learning get tagged data?
  • Implicit tagging: customer engagement
  • Explicit tagging: manual labor
• A few strategies to get tagged data for cheap/free:
  • games (Google Quick Draw)
  • incentivization (extra lives or bonuses in games)


https://quickdraw.withgoogle.com/

The Wisdom from the Crowd
Why human input matters: the use case of image colorization

[Figure: a colorization model turns a grayscale photo into a plausibly colored one]

• Image recognition relies on a tagged training data set (watermelon, grapes, bananas, pineapple, orange, …)
• Colorization relies on 'general' knowledge ("Bananas are generally ___"):
  • obvious for human beings
  • fastidious for machines
→ Colorization is straightforward for humans because they can 'tap' into their general knowledge

Crowdsourcing: Human Wisdom at Scale

What is Crowdsourcing?

Crowdsourcing: the process of getting labor or funding, usually online, from a crowd of people
➢ Crowdsourcing = 'crowd' + 'outsourcing'
➢ The act of taking a function once performed by employees and outsourcing it to an undefined (generally large) network of people in the form of an open call

History of Crowdsourcing
• The term was first used in 2005 by the editors at Wired
• The official definition was published in the Wired article "The Rise of Crowdsourcing", June 2006
• It describes how businesses were using the Internet to "outsource work to the crowd"

What Crowdsourcing helps with:
• Scale → peer-production (for jobs to be performed collaboratively)
• Reach → connecting with a large network of potential laborers (if tasks are undertaken by sole individuals)

The Nature of Crowdsourcing

Microtasks
• Data generation: user-generated content such as reviews, pictures, translations, etc.
• Data validation: validation of translations, etc.
• Data tagging: image tagging, product categorization, etc.
• Data curation: curation of news feeds, etc.

Macrotasks
• Solution development: algorithm improvement, etc.
• Crowd contests: design competitions, algorithmic competitions, etc.

Funding

Some Cool Crowdsourcing Applications
• Mapping: Photo Sphere; Google Maps crowdsources info for wheelchair-accessible places
• Traffic: Google Traffic; Waze, a traffic-reporting app
• Translation: Google Translate
• Epidemiology: flu-tracking applications

Companies Based on Crowdsourcing
• Quora is a question-and-answer site where questions are asked, answered, edited, and organized by its community of users.
• Waze is a community-based traffic and navigation app where drivers share real-time traffic and road info.
• Kaggle is a platform for predictive-modelling competitions in which companies post data and data miners compete to produce the best models.
• Stack Overflow is a platform for users to ask and answer questions, vote questions and answers up or down, and edit them.
• Flickr is an image and video hosting website that is widely used by bloggers to host images that they embed in social media.

The Challenges of Crowdsourcing
• Reliability
  • Retail: absence of emotional involvement (judges are not actually spending money on items)
  • Waze: locals were sending fake information to limit traffic in their area
• Relevance of knowledge
  • Retail: judges might not have appropriate knowledge of the items they are evaluating
• Subjectivity
  • Search: relevance scores vary depending on profile and personal preferences
• Speed & cost
  • Human evaluations take time and can only be performed sporadically and on samples
  • Not practical for measurement purposes

Crowdsourcing vs. Curated Crowds

Traditional Crowdsourcing Model
+ Speed: many hands generate light work
+ Lower cost: typically a few pennies per task
− No quality control
− Lack of control: little to no incentive to deliver on time
− High maintenance: clear instructions needed; automated understanding checks
− Lower reliability: high overlap required (see the sketch below)
− Lack of confidentiality: anyone can see your tasks

Curated Crowd
+ Quality control: judges are held to quality metrics and removed if they don't deliver the required quality
+ Better quality: very little overlap needed
+ Expertise: judges become experts at the required task
+ Constraints on the crowd: judges are less likely to drop out
− More expensive: typically the primary source of income for judges
− Consistency required: need frequent tasks to keep skills sharp
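As an illustration of why the traditional model needs "high overlap", here is a toy aggregation sketch (the judgments and the `min_overlap` value are made up, not from the talk): the same task is sent to several judges and their answers are combined by majority vote to compensate for unreliable individual workers.

```python
from collections import Counter

def aggregate(judgments, min_overlap=3):
    """Majority-vote aggregation over redundant judgments of one task."""
    if len(judgments) < min_overlap:
        return None                       # not enough redundancy yet
    label, votes = Counter(judgments).most_common(1)[0]
    confidence = votes / len(judgments)   # crude agreement score
    return label, confidence

print(aggregate(["relevant", "relevant", "not relevant",
                 "relevant", "relevant"]))   # -> ('relevant', 0.8)
```

A curated crowd needs far less of this redundancy ("very little overlap needed"), which is the trade-off sketched above.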

Crowdsourcing Applications in e-Commerce: the example of Product Tagging
• Catalog Curation
  • Product description curation
  • Product tagging & categorization
  • Product deduplication
  • Taxonomy testing
• Search Relevance Evaluation
  • Relevance scores (query-item pair scores)
  • Engine comparison (ranking-to-ranking)
• Review Moderation
  • Removal/flagging of obscene reviews
• Mystery Shopping
  • Analysis and discovery of new trends
  • Evaluation of new products
  • Competitive analysis

Use Case: Evaluation of Search Engine Relevance

Side-by-Side Engine Comparison
[Figure: Ranking A and Ranking B presented side by side]
• Judge 1: prefers ranking A
• Judge 2: prefers ranking A
• Judge 3: prefers ranking B
→ Human evaluation makes it possible to measure the intangible with little risk

Use Case: Evaluation of Search Engine Relevance

Query-Item Relevance Scoring for Measurement of Ranking Quality
[Figure: judges assign graded relevance scores (5/5, 4/5, 3/5, 2/5, …) to each query-item pair in a ranking]

Discounted cumulative gain:

$$DCG_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)}$$

$$nDCG_p = \frac{DCG_p}{IDCG_p}$$

$$IDCG_p = \sum_{i=1}^{|REL_p|} \frac{2^{rel_i}-1}{\log_2(i+1)}$$

where $rel_i$ is the graded relevance of the item at position $i$.
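As a quick illustration, here is a small Python sketch of the formulas above. The judge scores are hypothetical (the 5/5, 4/5, … values in the figure are placeholders), and a single gain function is used for both DCG and IDCG so that a perfect ordering scores exactly 1.0; the `exponential` flag reproduces the $2^{rel_i}-1$ gain the slide uses for IDCG.

```python
import math

def dcg(rels, exponential=False):
    gain = (lambda r: 2 ** r - 1) if exponential else (lambda r: r)
    return sum(gain(rel) / math.log2(i + 1)
               for i, rel in enumerate(rels, start=1))

def ndcg(rels, exponential=False):
    ideal = sorted(rels, reverse=True)        # best-first ordering gives IDCG
    return dcg(rels, exponential) / dcg(ideal, exponential)

scores = [5, 4, 5, 3, 2]                      # hypothetical judgments by position
print(round(ndcg(scores), 3))                 # -> 0.989 (close to the ideal order)
```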

Human-in-the-Loop:
When Human Beings still Outperform the Machine

Fact: the brain has ~38 petaflops (a petaflop is a thousand trillion operations per second) of processing power…

The Dream of Automation

Automation: the use of various control systems for operating equipment such as machinery and processes with minimal or reduced human intervention

The 4 Industrial Revolutions
• FIRST REVOLUTION – 1784: mechanical production, railroads, steam power
• SECOND REVOLUTION – 1870: mass production, electrical power, assembly lines
• THIRD REVOLUTION – 1969: automated production, electronics, computers
• FOURTH REVOLUTION – ongoing: artificial intelligence, big data
→ Automation is not a new idea

Why automate?
• Automate boring/repetitive tasks
• Perform tasks at scale
• Perform tasks with enhanced precision
• Deliver consistent products
• Use machines where they outperform humans

When Full Automation can't be Achieved… Human-in-the-Loop

Human-in-the-loop (HITL): a model or a system that requires human interaction

• The idea of using human beings to enhance the machine is not new; we have been doing Human-in-the-Loop all along
  • Example: autopilot technology for planes
• Human intervention/presence is useful:
  • to handle corner cases (outlier management)
  • to "keep an eye" on the system (sanity check)
  • to correct unwanted behavior (refinement)
  • to validate appropriate behavior (validation)

Human-in-the-Loop Paradigm

Pareto Principle: aka the 80/20 rule, the law of the vital few, or the principle of factor sparsity; states that, for many events, roughly 80% of the effects come from 20% of the causes

ML version of the Pareto Principle:
• Evidence suggests that some of the most accurate ML systems to date need:
  • 80% computer/AI-driven input
  • 19% human input
  • 1% unknown randomness to balance things out
• The combination of machine and human intervention achieves maximum machine accuracy

How can human knowledge be incorporated into ML models?
A. Helping label the original dataset that will be fed into an ML model
B. Helping correct inaccurate predictions that arise as the system goes live

Human-in-the-Loop Use Case #1
An example of the HITL approach: face recognition

[Figure: photos auto-tagged with names (Mary, Roberto, Victoria, Laura, Sebastian, Cecelia)]

Accuracy
• Facebook's DeepFace software reaches 97.25% accuracy

HITL as a feedback loop
• When the confidence is below a certain threshold, the system:
  • suggests a label
  • asks the uploader to validate/approve or correct the suggestion
• The new data is used to improve the accuracy of the algorithm (see the sketch below)
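A hedged sketch of that feedback loop, assuming a generic classifier: the names `model`, `ask_uploader`, and the threshold value are illustrative, not part of any real face-recognition API.

```python
THRESHOLD = 0.90  # assumed confidence cutoff, tuned per application

def tag_photo(model, photo, training_buffer, ask_uploader):
    label, confidence = model.predict(photo)    # e.g. ("Mary", 0.72)
    if confidence >= THRESHOLD:
        return label                            # confident: auto-tag, no human
    # Below threshold: suggest the label, ask the uploader to confirm or fix
    corrected = ask_uploader(photo, suggestion=label)
    training_buffer.append((photo, corrected))  # new tagged data for retraining
    return corrected
```

Each low-confidence photo thus produces exactly the kind of tagged example the earlier "Tagged Data" slide says is hard to get.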

Human-in-the-Loop Use Case #2
An example of the HITL approach: autonomous vehicles

Teaching the machine
• Driving systems were trained using a human to oversee the process

Accuracy considerations
• The autopilot system is now over 99% accurate
• However, 99% accuracy means that people can die 1% of the time (!!)
• Though we have seen huge advances in the accuracy of pure machine-driven systems, they tend to fall short of acceptable accuracy rates

Corner cases
• Fun fact: Volvo's self-driving cars fail in Australia because of kangaroos ("Volvo's driverless cars 'confused' by kangaroos")
• Reaching 100% is hard because of corner cases
• A HITL approach helps get the accuracy to ~100%

The Success of Human-in-the-Loop: the Example of Chess

[Photo: Garry Kasparov]

The Human vs. the Machine
• In 1997, chess grandmaster Garry Kasparov was beaten by the IBM supercomputer Deep Blue

Freestyle or "Advanced" Chess
• Advanced: a human chess master works with a computer to find the best possible move
• Freestyle: a team can be made of any combination of human beings + computers
• In 2005, Steven Cramton, Zackary Stephen, and their 3 computers won a Freestyle Chess tournament

Why it works
• Computers are great at reading tough tactical situations
• But humans are better at understanding long-term strategy
• Humans use computers to limit "blunders" while using their intuition to force the opponent into board states that confuse the computer(s)

Active Learning:
The Best of Both Worlds

Active Learning

Active Learning: a special case of semi-supervised ML in which a learning algorithm can interactively query the user (oracle) to obtain the desired outputs at new data points, maximizing validity and relevance

General Strategy
If D is the entire data set, at each iteration i, D is broken up into three subsets (sketched in code below):
1. D_{K,i}: data points where the label is known
2. D_{U,i}: data points where the label is unknown
3. D_{Q,i}: data points for which the label is queried (sometimes, even when the label is known)

Benefits
• Labels are queried only when necessary (lower cost)

Next-Generation Algorithms
• Proactive learning:
  • relaxes the assumption that the oracle is always right
  • casts the problem as an optimization problem with a budget constraint
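Here is a minimal pool-based sketch of that strategy (illustrative only, assuming uncertainty sampling as the query rule and a simulated oracle standing in for the human judge):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=500, random_state=0)
oracle_label = lambda i: y_true[i]         # simulated human oracle

known = list(range(10))                    # D_K: seed set with known labels
pool = list(range(10, 500))                # D_U: unlabeled pool
y_known = [oracle_label(i) for i in known]

model = LogisticRegression(max_iter=1000)
for _ in range(20):                        # query one label per iteration
    model.fit(X[known], y_known)
    proba = model.predict_proba(X[pool])[:, 1]
    # D_Q: the point the model is least sure about (probability closest to 0.5)
    q = pool.pop(int(np.argmin(np.abs(proba - 0.5))))
    known.append(q)                        # the oracle's answer moves q into D_K
    y_known.append(oracle_label(q))
print("labels queried:", len(known) - 10)  # -> labels queried: 20
```

Only 20 of the 490 pool points receive a (costly) label, which is the "query labels only when necessary" benefit above.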

Active Learning: How does it Work?

Machine Learning needs:
• Logic (algorithm)
• Data
• Optimization
• Feedback ← Human-in-the-Loop

Active Learning = a Machine Learning algorithm using an "oracle" to reduce mistakes/uncertainty

Query Strategy — labels are queried for:
• data points for which model uncertainty is high (uncertainty sampling)
• data points on which the different models of an ensemble method disagree the most (query by committee; see the sketch below)
• data points causing the most changes to the model (expected model change)
• data points causing overall variance to be high (variance reduction)

[Figure: the training loop — the active learning algorithm selects a single example from the unlabeled data, the human oracle provides the correct label, the labeled example is added to the labeled data, and the classifier is updated]

[Figure: the inference loop — if the machine learning classifier's confidence level is high, emit the output; if not, the example goes to annotation by a human oracle (Human-in-the-Loop) and the label feeds back into the model (Active Learning)]

By adding a human feedback loop, we allow the system to:
• actively learn
• correct itself where it got it wrong
• improve the algorithm over iterations
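A small sketch of the query-by-committee strategy named above, with an illustrative three-model committee: the point to query is the one on which the committee members disagree the most.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, random_state=1)
labeled, pool = slice(0, 30), slice(30, 200)

committee = [LogisticRegression(max_iter=1000),
             RandomForestClassifier(random_state=1),
             GaussianNB()]
votes = np.array([m.fit(X[labeled], y[labeled]).predict(X[pool])
                  for m in committee])             # shape: (3, pool size)

# Disagreement = fraction of committee members not voting with the majority
majority = (votes.sum(axis=0) > len(committee) / 2).astype(int)
disagreement = (votes != majority).mean(axis=0)
query_idx = 30 + int(np.argmax(disagreement))      # offset back into X
print("ask the oracle to label data point", query_idx)
```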

Active Learning at Walmart e-Commerce
3 use cases using Active Learning in the context of Search/Retail

❑ Machine Learning Lifecycle Management (Programming by Feedback)
  • Automatic monitoring of input and output values for an ML algorithm
  • An algorithm detects failings and outliers in real time and suggests an action
  • A human validates the action, creating tagged data for full automation

❑ Diagnosis of Catalog Data Issues (Reinforcement Learning)
  • An algorithm uncovers demoted items and suggests the most likely reason for the demotion
  • An engineer manually confirms/corrects the suggestion, generating training data for full automation

❑ Refinement of the Query Tagging Algorithm (Optimization)
  • The human evaluation team manually measures the accuracy of the query tagging model
  • Mistagged queries are used to discover patterns specific to problematic queries, which are reported to engineers
  • The sample is enriched with problematic queries (the evaluation team can diagnose problems with algorithms)
  • Example: "red t-shirt Size M" → red = color, t-shirt = product type, Size M = size (see the sketch below)
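A toy sketch of that query tagging example; a production tagger would be a trained sequence model, and the tiny lexicons here are hypothetical stand-ins.

```python
import re

COLORS = {"red", "blue", "black"}                # hypothetical lexicons
PRODUCT_TYPES = {"t-shirt", "dress", "sneakers"}

def tag_query(query):
    """Assign color / product_type / size tags to query tokens."""
    tags = {}
    size = re.search(r"\bsize\s+(\w+)\b", query, flags=re.IGNORECASE)
    if size:
        tags["size"] = size.group(1)
    for token in query.lower().split():
        if token in COLORS:
            tags["color"] = token
        elif token in PRODUCT_TYPES:
            tags["product_type"] = token
    return tags

print(tag_query("red t-shirt Size M"))
# -> {'size': 'M', 'color': 'red', 'product_type': 't-shirt'}
```

In the active-learning setup above, queries this tagger gets wrong are exactly the ones worth routing to the human evaluation team.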

Conclusion and Takeaways
• Why do humans and machines complement each other?
  • Human beings are memory-constrained
  • Computers are knowledge-constrained
• Tagged data is more important than ever
  • But getting quality data is challenging given the volume of data
  • Crowdsourcing offers more flexibility to tag data at scale
• The Human-in-the-Loop paradigm
  • Improves the accuracy of machine learning algorithms (classifiers)
  • Many examples of successful endeavors using "Augmented Intelligence"
  • Active Learning is a booming area of ML/AI

Thank You!