Cognitive Systems Institute talk, 8 June 2017 (v1.0)


Page 1

José Hernández-Orallo
Dep. de Sistemes Informàtics i Computació, Universitat Politècnica de València
[email protected]

Talk for the Cognitive Systems Institute Speaker Series, 8 June 2017

* Based on parts of the book "The Measure of All Minds": http://allminds.org

Page 2

EVALUATING COGNITIVE SYSTEMS: TASK-ORIENTED OR ABILITY-ORIENTED?

"Greatest accuracy, at the frontiers of science, requires greatest effort, and probably the most expensive or complicated of measurement instruments and procedures" (David Hand, 2004).

Page 3

COGNITIVE SYSTEMS: MUCH MORE THAN AI

- Computers: AI or AGI systems, robots, bots, …
- Cognitively-enhanced organisms, cognitive prosthetics: cyborgs, technology-enhanced humans
- Biologically-enhanced computers: human computation and its data
- (Hybrid) collectives: virtual social networks, crowdsourcing
- Minimal or rare cognition: artificial life (more like bacteria, plants, etc.)

Their societal impact on work, leisure, health, etc. is difficult to assess because we do not know the cognitive capabilities of all these new systems.

Page 4

THE EVALUATION DISCORDANCE: AI EVALUATION

"[AI is] the science of making machines do things that would require intelligence if done by [humans]." Marvin Minsky (1968).

Machines can do the "things" (tasks) without featuring intelligence. Once a task is solved ("superhuman" performance), it is no longer considered an AI problem (the "AI effect").

By this standard, AI would have progressed very significantly (see, e.g., Nilsson, 2009, chap. 32, or Bostrom, 2014, Table 1, pp. 12-13). But AI is now full of idiot savants.

Page 5

THE EVALUATION DISCORDANCE: AI EVALUATION

Specific (task-oriented) AI systems, each carrying the same warning: "Intelligence NOT included."

- Machine translation, information retrieval, summarisation
- Pattern recognition: computer vision, speech recognition, etc.
- Robotic navigation
- Driverless vehicles
- Prediction and estimation
- Planning and scheduling
- Automated deduction
- Knowledge-based assistants
- Game playing

Page 6

THE EVALUATION DISCORDANCE: AI EVALUATION

Specific domain evaluation settings, grouped by evaluation type:

Problem benchmarks:
- CADE ATP System Competition
- Termination Competition
- The reinforcement learning competition
- Program synthesis (syntax-guided synthesis)
- International Aerial Robotics Competition (pilotless aircraft)
- DARPA driverless cars, Cyber Grand Challenge, Rescue Robotics
- The planning competition
- UCI repository, PRTools, or KEEL dataset repository
- KDD-cup challenges and ML Kaggle competitions
- Machine translation corpora: Europarl, SETimes corpus, the EuroMatrix, Tenjinno competitions, …
- NLP corpora: Linguistic Data Consortium, …
- The Arcade Learning Environment
- Pathfinding benchmarks (gridworld domains)
- Genetic programming benchmarks
- FIRA HuroCup humanoid robot competitions

Peer confrontation:
- RoboCup and FIRA (robot football/soccer)
- General game playing AAAI competition
- World Computer Chess Championship
- Computer Olympiad
- Annual Computer Poker Competition
- Trading agent competition
- Warlight AI Challenge

Human discrimination:
- Loebner Prize
- BotPrize (videogame player) contest
- Robo Chat Challenge
- CAPTCHAs
- Graphics Turing Test

Page 7

THE EVALUATION DISCORDANCE: AI EVALUATION

General-purpose systems and cognitive components, each carrying the warning: "Some intelligence MAY BE included."

- Cognitive robots
- Intelligent assistants
- Pets, animats and other artificial companions
- Smart environments
- Agents, avatars, chatbots
- Web-bots, smartbots, security bots, …

How to evaluate general-purpose systems and cognitive components?

Page 8

THE EVALUATION DISCORDANCE: AI EVALUATION

The "Mythical Turing Test" (Sloman, 2014) and its myriad variants: mythical human-level machine intelligence.

A red herring for general-purpose AI!

Page 9

THE EVALUATION DISCORDANCE: AI EVALUATION

What benchmarks? More comprehensive ones?

- ARISTO (Allen Institute for AI): college science exams
- Winograd Schema Challenge: questions targeting understanding
- Weston et al.'s "AI-complete question answering" (bAbI)
- CLEVR: relations over visual objects

BEWARE: AI-completeness has been claimed before, for calculation, chess, Go, the Turing test, … Now AI is superhuman on most of them! (e.g., https://arxiv.org/pdf/1706.01427.pdf)

Page 10

THE EVALUATION DISCORDANCE: TEST MISMATCH

What about psychometric tests or animal tests in AI?

In 2003, Sanghi and Dowe wrote a simple program that passed many IQ tests, in about 960 lines of Perl!

This made the point unequivocally: programs passing IQ tests are not necessarily intelligent.

Page 11

THE EVALUATION DISCORDANCE: TEST MISMATCH

This has not been a deterrent!

- Psychometric AI (Bringsjord and Schimanski, 2003): an "agent is intelligent if and only if it excels at all established, validated tests of intelligence".
- Detterman, editor of the journal Intelligence, posed "A challenge to Watson" (Detterman, 2011), with a second level required to "be truly intelligent": tests not seen beforehand.
- "IQ tests are not for machines, yet" (Dowe and Hernández-Orallo, 2012).

Page 12

THE EVALUATION DISCORDANCE: TEST MISMATCH

What about developmental tests (or tests for children)?

- Developmental robotics: battery of tests (Sinapov, Stoytchev, Schenk, 2010-13)
- Cognitive architectures: the Newell "test" (Anderson and Lebiere, 2003), drawing on a taxonomy for cognitive architectures, and the "Cognitive Decathlon" (Mueller, 2007), drawing on a psychometric taxonomy (CHC)
- AGI: high-level competency areas (Adams et al., 2012), task breadth (Goertzel et al., 2009; Rohrer, 2010), robot preschool (Goertzel and Bugaj, 2009)

Page 13

THE EVALUATION DISCORDANCE: TEST MISMATCH

Adapting tests between disciplines (AI, psychometrics, comparative psychology) is problematic:

- A test from one discipline is only valid and reliable for its original population.
- Passing it may be neither necessary nor sufficient for the ability it targets.
- Machines and hybrids represent a new population.
- Nowadays, many benchmarks assume that AI will use deep learning or millions of examples.

But machines and hybrids are also an opportunity to understand how to evaluate cognition. Still, we need a different foundation.

Page 14

THE ALGORITHMIC CONFLUENCE: WHAT IQ TESTS MEASURE

"Beyond the Turing Test"…

- An "intelligence" definition and test (the C-test) based on algorithmic information theory (Hernández-Orallo, 1998-2000).
- Letter series of the kind common in cognitive tests (Thurstone), here generated from a Turing machine so that they have certain properties (projectibility, stability, …).
- Their difficulty is calculated by Levin's Kt complexity (see the sketch below).
- Linked with Levin's universal search, Solomonoff's inductive inference, and Kolmogorov complexity.
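To make the Kt idea concrete, here is a minimal sketch in Python. It assumes a toy generator language of my own invention (a start letter plus a cyclic list of offsets), not the C-test's actual reference machine; Kt scores a string by the size of the smallest program producing it plus the logarithm of that program's running time.

```python
# Toy Kt-style difficulty for letter series: enumerate tiny "programs"
# (start letter + cyclic list of offsets) and keep the best score
#   size(program) + log2(time-to-print),
# mirroring Levin's Kt = |p| + log(time). The program space, size measure
# and search bounds are illustrative assumptions, not the C-test's machine.
import math
from itertools import product
from string import ascii_lowercase

def run(start, deltas, n):
    """Generate n letters: begin at `start`, repeatedly add cyclic offsets."""
    out, pos = [], ascii_lowercase.index(start)
    for i in range(n):
        out.append(ascii_lowercase[pos % 26])
        pos += deltas[i % len(deltas)]
    return "".join(out)

def kt_difficulty(series, max_cycle=3, max_delta=3):
    """Smallest size + log2(time) over programs reproducing `series`."""
    best = math.inf
    for k in range(1, max_cycle + 1):
        for start in ascii_lowercase:
            for deltas in product(range(-max_delta, max_delta + 1), repeat=k):
                if run(start, deltas, len(series)) == series:
                    size = 1 + k  # one symbol for the start, k for the offsets
                    best = min(best, size + math.log2(len(series)))
    return best

print(kt_difficulty("abcdef"))  # constant offset +1: low difficulty
print(kt_difficulty("aabbcc"))  # offsets (0, 1): a longer program, so harder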

Page 15

THE ALGORITHMIC CONFLUENCE: WHAT IQ TESTS MEASURE

The metric is derived by slicing the items by difficulty h (given by Kt) and aggregating performance across the slices (sketched below).

This is IQ-test re-engineering! Intelligence is no longer just "what intelligence tests measure" (Boring, 1923). Clues about what IQ tests really measure? Inductive inference.

Human performance correlates with the difficulty (h) of each exercise. But remember Sanghi and Dowe (2003)!
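A sketch of this kind of difficulty-sliced aggregation, under the assumption of uniform weighting across slices (the published C-test formula may weight difficulties differently):

```python
# Group test items by their difficulty h (e.g., Kt), compute average
# performance per slice, then aggregate across slices. The uniform
# weighting over observed difficulties is an illustrative assumption.
from collections import defaultdict

def sliced_score(results):
    """results: list of (difficulty_h, hit) pairs, hit in {0, 1}."""
    by_h = defaultdict(list)
    for h, hit in results:
        by_h[h].append(hit)
    per_slice = {h: sum(hits) / len(hits) for h, hits in by_h.items()}
    return sum(per_slice.values()) / len(per_slice), per_slice

overall, per_slice = sliced_score([(4, 1), (4, 1), (6, 1), (6, 0), (8, 0)])
print(per_slice)  # {4: 1.0, 6: 0.5, 8: 0.0}
print(overall)    # 0.5
```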

Page 16

THE ALGORITHMIC CONFLUENCE: SITUATED TESTS

From a passive to an interactive view: intelligence as performance in a range of worlds.

The set of worlds M is described by Turing machines, and intelligence is measured as an aggregate over them:

  Υ(π) = Σ_{μ ∈ M} p(μ) · R(π, μ)

where R aggregates the rewards r_i and p assigns probabilities to environments. How? (A toy Monte Carlo version is sketched below.)

(Diagram: an agent π interacts with an environment μ, emitting actions a_i and receiving observations o_i and rewards r_i.)
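A Monte Carlo sketch of this aggregate, with toy stand-ins for the environments, the policy and p(μ); the real proposal ranges over Turing machines, with p derived from a universal distribution.

```python
# Estimate Upsilon(pi) = sum over mu of p(mu) * R(pi, mu) by sampling.
# Environments, the policy and p() below are toy stand-ins.
import random

def make_env(seed):
    """Toy environment: a hidden bias; reward 1 when the action matches it."""
    rng = random.Random(seed)
    bias = rng.choice([0, 1])
    return lambda action: 1.0 if action == bias else 0.0

def policy(history):
    """Toy policy: repeat whichever action was rewarded before, else guess."""
    for action, reward in reversed(history):
        if reward > 0:
            return action
    return random.choice([0, 1])

def R(pi, env, steps=20):
    """Aggregate reward of policy pi in env (here: average over steps)."""
    history, total = [], 0.0
    for _ in range(steps):
        a = pi(history)
        r = env(a)
        history.append((a, r))
        total += r
    return total / steps

# p(mu): uniform over a finite sample of environments, a crude stand-in
# for a universal distribution such as 2^-K(mu).
envs = [make_env(s) for s in range(100)]
upsilon = sum(R(policy, mu) for mu in envs) / len(envs)
print(f"Estimated aggregate score: {upsilon:.2f}")
```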

Page 17

THE ALGORITHMIC CONFLUENCE: SOLUTIONAL APPROACH

Three approaches, each fixing two choices: the distribution over the range of difficulties and the distribution over the diversity of solutions. Each choice is either universal (e.g., Legg and Hutter) or uniform, giving combinations such as universal/uniform, universal/universal and uniform/uniform.

Under these choices, the three approaches are NOT equivalent.

Page 18

THE ALGORITHMIC CONFLUENCE: SOLUTIONAL APPROACH

A different view of "general intelligence":

Policy-general intelligence: aggregate by difficulty (e.g., with a bounded uniform distribution) and, for each difficulty, look for diversity: the ability to find, integrate and emulate a diverse range of successful policies (a toy rendering follows).

Connected to the task-independence of the g factor. This raises a fascinating question: is there a universal g factor?
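One way to read this operationally, as a loose sketch with entirely hypothetical data structures: score the agent, per difficulty level, by the fraction of known successful policies it can find or emulate, then average uniformly over the bounded difficulty range.

```python
# A loose sketch of "policy-general intelligence": uniform aggregation over
# a bounded range of difficulties; within each difficulty, the score is how
# much of the *diversity* of successful policies the agent covers.
# The policy "signatures" are hypothetical illustrations of the idea.

def policy_general_score(agent_policies, successful_by_h):
    """successful_by_h: {difficulty: set of successful policy signatures}.
    agent_policies: set of signatures the agent can find or emulate."""
    per_h = []
    for h, successful in successful_by_h.items():
        per_h.append(len(successful & agent_policies) / len(successful))
    return sum(per_h) / len(per_h)  # uniform weight over difficulties

demo = {1: {"p1", "p2"}, 2: {"p3", "p4", "p5"}, 3: {"p6"}}
print(policy_general_score({"p1", "p2", "p3", "p6"}, demo))  # about 0.78
```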

Page 19

FROM TASKS TO ABILITIES: CLUSTERING BY SIMILARITY

Focus first on intermediate levels between tasks and abilities: do we have an intrinsic notion of similarity between tasks?

Task breadth? How to arrange abilities?
- Hierarchically (e.g., Cattell-Horn-Carroll)
- Spatially (e.g., Guttman's model)

Page 20

FROM TASKS TO ABILITIES: CLUSTERING BY SIMILARITY

Example (ECA rules as tasks). The task descriptions are not used, and no population is used either; instead, the best solutions for each task are compared. Task similarity is then examined as difficulty increases (18 rules of difficulty 8). A sketch of the pipeline follows.

(Figures: dendrogram using complete linkage; metric multidimensional scaling.)
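A sketch of the pipeline this example suggests, with made-up solution encodings in place of the actual best policies for ECA rule tasks; it assumes scipy and scikit-learn are available.

```python
# Pairwise distances between tasks are computed from their best-known
# solutions (here a made-up Hamming distance over bit-string encodings,
# standing in for real policy comparisons), then complete-linkage
# clustering and metric MDS are applied to the distance matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

def solution_distance(sol_a, sol_b):
    """Toy distance: Hamming distance between two solution encodings."""
    return sum(c1 != c2 for c1, c2 in zip(sol_a, sol_b))

# Hypothetical best solutions for five tasks (e.g., five ECA rules).
solutions = ["00110101", "00110111", "11001010", "11001000", "01010101"]

n = len(solutions)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = solution_distance(solutions[i], solutions[j])

Z = linkage(squareform(D), method="complete")  # complete-linkage tree
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)  # 2-D spatial arrangement
print(Z)       # scipy.cluster.hierarchy.dendrogram(Z) would draw the tree
print(coords)  # tasks with similar solutions land close together
```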

Page 22

IS THIS SUFFICIENT? OPEN QUESTIONS

What do these platforms and tests measure? It depends on the tasks we define!

Many things remain to be done:
- Task analysis: similarities, difficulties and requirements (data).
- Abilities: to be conceptualised and identified.
- Ability-oriented (or feature-oriented) evaluation.
- Incremental, gradual, curriculum, …: from task similarity to task dependency.

Recent workshops (EGPAI@ECAI2016, MAIN@NIPS2016) and upcoming ones: EGPAI@IJCAI2017, MAIN@NIPS2017?

Page 23

IS THIS SUFFICIENT? OPEN QUESTIONS

We want cognitive components that can be easily integrated into standalone cognitive systems.

What to measure: "specific entities", "networks" or "services" (Spohrer and Banavar, 2015).

We need a different kind of 'specification' of:
- what the components are able to do;
- what the integrated systems will be able to do, depending on their integration (tight, loose, teams, etc.);
- how general abilities are included or emerge.

Page 24

CONCLUSIONS

Increasing need for the evaluation of cognitive systems:
- A plethora of new systems: AI, hybrids, collectives, etc.
- Crucial to assess their cognitive profiles, which are unlike and beyond humans'.
- Critical for recognising which professions can be automated first.
- Key for compensating for cognitive impairments (e.g., ageing).

From task-oriented to ability-oriented evaluation. Evaluating cognitive abilities requires a change of paradigm:
- from a populational to a universal perspective;
- from agglomerative (task diversity) to solutional (policy diversity) approaches;
- a hierarchical view, clustered bottom-up.

Page 25

THANK YOU!

More info:

Book: "The Measure of All Minds: Evaluating Natural and Artificial Intelligence", Cambridge University Press, 2017. http://www.allminds.org

AI evaluation survey: "Evaluation in artificial intelligence: From task-oriented to ability-oriented measurement", Artificial Intelligence Review, 2016.