making sense of cyberspace, keynote for software engineering institute cyber intelligence...

Post on 16-Jul-2015

©2015 Carnegie Mellon University

Making Sense of Cyberspace

Cyber Intel Research Consortium

May 5, 2015

Jason Hong

Computer Human Interaction: Mobility Privacy Security

Computing Trends (x-axis: Time)

• Bandwidth

• Storage

• Computing Power

• Information

Human Capabilities (x-axis: Time)

Cognitive Processing

Visual acuity

Communication bandwidth

Attention

7 ± 2

Evidence suggests it’s more like 4

How to Scale Up Our Ability to Understand and Analyze?

• Research in mobility, privacy, security

– Anti-phishing

– Smartphone apps, Internet of Things

• Key themes today

– Information visualization

– Machine learning

– Wisdom of crowds

• Which state has the highest income?

• Is there a relationship between income and education?

• Are there any outliers?

Visualization for Eliciting Knowledge from Data

How You Represent Info Matters

• What is VIII x XCI = ???

• Another example:

– Game with two players

– Alternately choose numbers between 1 and 9

– First to hold three numbers that sum to 15 wins

– Each number can be taken only once
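The Roman-numeral question becomes routine once the operands are rewritten in positional notation. A minimal Python sketch of that change of representation follows; the helper name `roman_to_int` is made up for illustration and is not part of the talk.

```python
# Value of each Roman symbol.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'XCI' to an integer."""
    total = 0
    # Pair each symbol with its successor; the last symbol is paired with a space.
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[ch]
        # A smaller symbol placed before a larger one is subtracted (XC = 90).
        if nxt in ROMAN_VALUES and ROMAN_VALUES[nxt] > value:
            total -= value
        else:
            total += value
    return total


a = roman_to_int("VIII")   # 8
b = roman_to_int("XCI")    # 91
product = a * b            # 8 * 91 = 728
```

The multiplication itself is unchanged; only the notation makes it hard or easy.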

How You Represent Info Matters

• It’s actually Tic-Tac-Toe

– Trivial in this representation

8 1 6

3 5 7

4 9 2
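One way to check the equivalence, assuming the game is played with the numbers 1 through 9 as in the grid above: every set of three distinct numbers that sums to 15 should be exactly one row, column, or diagonal of the grid. The sketch below tests that; the names `SQUARE` and `board_lines` are illustrative.

```python
from itertools import combinations

# The grid from the slide (a 3x3 magic square).
SQUARE = [[8, 1, 6],
          [3, 5, 7],
          [4, 9, 2]]


def board_lines(square):
    """Return the 8 tic-tac-toe lines of a 3x3 grid as sets of numbers."""
    rows = [tuple(r) for r in square]
    cols = [tuple(c) for c in zip(*square)]
    diags = [tuple(square[i][i] for i in range(3)),
             tuple(square[i][2 - i] for i in range(3))]
    return {frozenset(line) for line in rows + cols + diags}


# Every way to pick three distinct numbers from 1..9 that sum to 15.
winning_triples = {frozenset(t)
                   for t in combinations(range(1, 10), 3)
                   if sum(t) == 15}

# True when "three numbers that sum to 15" and "three in a row" coincide.
same_game = winning_triples == board_lines(SQUARE)
```

Because the two sets coincide, a player who can win at Tic-Tac-Toe on this grid can win the number game with the same moves.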

Other Design Principles?

1. Start out going Southwest on ELLSWORTH AVE towards BROADWAY by turning right.

2. Turn RIGHT onto BROADWAY.

3. Turn RIGHT onto QUINCY ST.

4. Turn LEFT onto CAMBRIDGE ST.

5. Turn SLIGHT RIGHT onto MASSACHUSETTS AVE.

6. Turn RIGHT onto RUSSELL ST.

Some Lessons

1. How you represent info matters

2. InfoViz is what you do show as well as what you don’t

Focus + Context

• More pixels to stuff that matters, fewer pixels to supporting context

Appropriate Use of Color

Some Lessons

1. How you represent info matters

2. InfoViz is what you do show as well as what you don’t

3. Focus + Context

4. Appropriate use of colors (perception)

InfoViz Can Show and Hide Info

All Visualizations Show and Hide Info

Some Lessons

1. How you represent info matters

2. InfoViz is what you do show as well as what you don’t

3. Focus + Context

4. Appropriate use of colors (perception)

5. All visualizations have inherent biases

Visualizations Don’t Have to Be Complex to be Effective

Example from Jeff Heer

Work by Jeff Heer

Collaborative Analysis

• Many Eyes (by IBM)

Can We Have Crowds of People Help Analyze Data?

• What are better ways of getting groups of people working together?

• One approach is to use lots of crowd workers to improve accuracy and turnaround time

– Common structure in your problem?

– Can you break it down into small tasks?

Wisdom of Crowds Approach

• Mechanics of PhishTank

– Submissions require at least 4 votes and 70% agreement

– Some votes weighted more

• Total stats (Oct 2006 – Feb 2011)

– 1.1M URL submissions from volunteers

– 4.3M votes

– resulting in about 646k identified phish

• Why so many votes for only 646k phish?
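The labeling rule above (at least 4 votes and 70% agreement, with some votes weighted more) can be sketched as follows. This is a minimal illustration, not PhishTank's implementation: the function name `label_submission`, the weight values, and the choice to apply the 70% threshold to weighted votes are all assumptions.

```python
from typing import Optional, Sequence, Tuple


def label_submission(votes: Sequence[Tuple[bool, float]],
                     min_votes: int = 4,
                     agreement: float = 0.70) -> Optional[str]:
    """Label a URL from (is_phish, weight) votes, or return None if undecided.

    Assumption: agreement is measured on the weighted share of votes.
    """
    if len(votes) < min_votes:
        return None  # not enough votes yet
    total_weight = sum(weight for _, weight in votes)
    if total_weight <= 0:
        return None
    phish_share = sum(weight for is_phish, weight in votes if is_phish) / total_weight
    if phish_share >= agreement:
        return "phish"
    if 1 - phish_share >= agreement:
        return "not phish"
    return None  # voters disagree; more votes are needed


# Three of four equally weighted voters say "phish": 75% agreement.
decided = label_submission([(True, 1.0), (True, 1.0), (True, 1.0), (False, 1.0)])
# A 2-2 split stays unlabeled, however long it has been waiting.
split = label_submission([(True, 1.0), (True, 1.0), (False, 1.0), (False, 1.0)])
```

Under a rule like this, a split submission keeps absorbing votes without ever producing a label, which is one plausible reason the vote count is so much larger than the number of identified phish.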

PhishTank Statistics

Statistic        Jan 2011
Submissions      16019
Total Votes      69648
Valid Phish      12789
Invalid Phish    549
Median Time      2hrs 23min

• 69648 votes allow at most 17412 labels (at 4 votes each)

– But only 12789 phish and 549 legitimate identified

– 2681 URLs not identified at all

• Median delay of 2+ hours still has room for improvement (Jan 2015 is 14+ hours)
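The figures on this slide can be re-derived from the January 2011 counts. A short Python check, using variable names chosen here for illustration and the 4-votes-per-label minimum stated earlier:

```python
# January 2011 counts from the slide.
submissions = 16019
total_votes = 69648
valid_phish = 12789
invalid_phish = 549
votes_per_label = 4  # PhishTank's minimum votes per submission

# If every label used exactly the minimum number of votes:
max_labels = total_votes // votes_per_label              # 17412

# Submissions that ended the month with no label at all:
unlabeled = submissions - valid_phish - invalid_phish    # 2681

# Share of submissions that did get a label, and votes spent per submission.
labeled_share = (valid_phish + invalid_phish) / submissions
votes_per_submission = total_votes / submissions
```

So the votes cast would have been enough to label every submission, yet roughly one in six stayed unlabeled.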

Ways of Smartening the Crowd

• Change the order URLs are shown

– Ex. most recent vs closest to completion

• Change how submissions are shown

– Ex. show one at a time or in groups

• Adjust threshold for labels

– PhishTank is 4 votes and 70%

– Ex. vote weights, algorithm also votes

• Motivating people / allocating work

– Filtering by brand, competitions, teams of voters, leaderboards

Clustering Phish

• Motivations

– Most phish are generated by toolkits and thus similar

– Labeling single sites as phish can be hard, easier if multiple examples

– Reduce labor by labeling suspicious sites in bulk

Most Phish Can be Clustered

• With all data over two weeks, 3180 of 3973 web pages can be grouped (80%)

– Used shingling and DBSCAN (see paper)

– 392 clusters, size from 2 to 153 URLs
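To give a feel for the technique, here is a small pure-Python sketch of word shingling followed by DBSCAN over Jaccard distance. It is not the pipeline from the paper: the shingle size `k`, the thresholds `eps` and `min_pts`, and the toy page texts are all assumptions made for illustration.

```python
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[tuple]:
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a: Set[tuple], b: Set[tuple]) -> float:
    """1 minus the Jaccard similarity of two shingle sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(items: List[Set[tuple]], eps: float, min_pts: int) -> List[int]:
    """Cluster items with DBSCAN; returns one cluster id per item, -1 for noise."""
    UNSEEN, NOISE = -2, -1
    n = len(items)
    labels = [UNSEEN] * n

    def neighbors(i: int) -> List[int]:
        # All items within eps of item i (including i itself).
        return [j for j in range(n)
                if jaccard_distance(items[i], items[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] != UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point of this cluster
            if labels[j] != UNSEEN:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:   # j is a core point, so keep expanding
                queue.extend(more)
        cluster += 1
    return labels


# Toy page texts: three near-duplicates (as a toolkit might produce) and one outlier.
pages = [
    "please verify your account password to continue using online banking",
    "please verify your account password to continue using online services",
    "please verify your account password to continue using mobile banking",
    "weekly newsletter about gardening tips and seasonal recipes for spring",
]
labels = dbscan([shingles(p) for p in pages], eps=0.5, min_pts=2)
```

With these toy inputs the three near-duplicate pages share a cluster and the unrelated page is marked as noise, so one human judgment could label all three at once.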

Comparing Coverage and Time

CrowdVerify

• This is the New York Times Privacy Policy

• Can we use the crowd to help summarize? Flag surprises?

• Can we build predictive language models?

Privacy and Security: CrowdVerify

Which statement is more important?

A: We receive data whenever you visit a game, application, or website that uses Facebook Platform …

B: When you sign up for Facebook, you are required to provide information such as your name, email address, birthday, and gender.
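One simple way to turn many such A-versus-B judgments into a ranking is to count how often each statement is chosen. The sketch below does that; it is an assumption about how scores could be aggregated, not a description of CrowdVerify's actual scoring, and the statement IDs are placeholders.

```python
from collections import Counter


def rank_statements(judgments):
    """Rank statements by how often crowd workers picked them.

    judgments: iterable of (statement_a, statement_b, winner) tuples.
    Returns (statement, score) pairs, highest score first.
    """
    scores = Counter()
    for a, b, winner in judgments:
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not one of the pair")
        scores[a] += 0  # make sure statements with no wins still appear
        scores[b] += 0
        scores[winner] += 1
    # Sort by descending score, breaking ties by statement ID.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical judgments from four workers over three statements.
ranking = rank_statements([
    ("S1", "S2", "S1"),
    ("S1", "S3", "S1"),
    ("S2", "S3", "S3"),
    ("S2", "S3", "S3"),
])
```

Statements near the top are the ones the crowd considers most important to surface in a summary.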

Privacy and Security: CrowdVerify on Google’s Policies

[Bar chart: Frequency Experiment, Sum of Scores. Y-axis: Scores, 0 to 140. X-axis: Statement ID. Statement IDs shown: Google 08, 13, 07, 30, 10, 12, 04, 16, 03, 11, 05, 09, 28, 02, 06, 23, 14.]

Top Statement - Google

• Google 08: When you use our services or view content provided by Google, we may automatically collect and store certain information in server logs. This may include: telephony log information like your phone number, calling-party number, forwarding numbers, time and date of calls, duration of calls, SMS routing information and types of calls.

Bottom 3 Statements - Google

• Google 29: Whenever you use our services, we aim to provide you with access to your personal information.

• Google 26: For example, you can: Take information out of many of our services.

• Google 24: People have different privacy concerns. Our goal is to be clear about what information we collect, so that you can make meaningful choices about how it is used.

Designing Crowd Systems

• Motivation for contributing?

– Money, altruism, fun?

• Quality control?

– How to ensure good quality? Prevent bozos?

• How is crowd wisdom aggregated?

– How are people and computers organized?

– How are decisions made?

• Skill level required?

– Novice, intermediate, expert?

One Real-World Success: Clickworkers

• 100k+ volunteers identified 2M Mars craters from space photographs (14k hours of free work!!)

• Aggregate results “virtually indistinguishable” from experts

Another Real-World Success: Galaxy Zoo

• Problem: you are an astronomer looking at the Sloan Digital Sky Survey, which has images of over 1M galaxies. You want to classify these by galaxy type.

DARPA Red Balloon

• 2009 challenge to find 10 red weather balloons at unknown locations in the USA

• Winner gets $40k

• Surprise: top team wins in < 9 hours

– Other teams did well too

– 9 balloons in 9 hours (1 team)

– 8 balloons (2 teams)

– 7 balloons (5 teams)

DARPA Red Balloon

• MIT Team (10 balloons in less than 9 hours)

• Overall team strategy

– Get as many people as possible (over 5k people)

– Use multiple approaches to verify

• How people were organized

– Core team / recruit / spotters

• “Social engineering”

– Recursive financial structure to get more people

– Incentivizes people who might not be able to help

• Technologies used

– Surprisingly little, one centralized web site + Google Maps

MIT Reward Structure
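The recursive financial structure can be sketched as a payout that halves at each step up the chain of invitations. The figures used below ($2,000 to the finder, halving for each inviter, against $4,000 available per balloon from the $40k prize) follow the MIT team's published description as best recalled here. They are not stated on these slides, so treat them and the function name `balloon_payouts` as illustrative assumptions.

```python
def balloon_payouts(chain, finder_reward=2000.0):
    """Split the reward for one balloon along an invitation chain.

    chain: people ordered from the finder back through successive inviters.
    Each person receives half of what the person before them received.
    """
    payouts = {}
    reward = finder_reward
    for person in chain:
        payouts[person] = reward
        reward /= 2
    return payouts


# Hypothetical chain: the finder, who invited them, and who invited the inviter.
payouts = balloon_payouts(["finder", "inviter", "inviter_of_inviter"])
total_paid = sum(payouts.values())

# The series 2000 + 1000 + 500 + ... never reaches 4000, so the payout for
# any chain length fits within the assumed per-balloon budget.
per_balloon_budget = 40000 / 10
```

Because inviters are paid as well as finders, people who could not search themselves still had a reason to recruit others.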

Not Quite as Successful: Search for Steve Fossett

DARPA Network Challenge

Summary

• Collecting info is easy, understanding it is hard

• How to scale?

– Information visualization

– Machine learning

– Wisdom of crowds

More Resources

• A Tour through the Visualization Zoo

– https://queue.acm.org/detail.cfm?id=1805128

• Interactive Dynamics for Visual Analysis

– https://queue.acm.org/detail.cfm?id=2146416

• Tableau Software

– http://www.tableau.com/

• Many Eyes

– http://www-01.ibm.com/software/analytics/many-eyes/

• DARPA Red Balloon Challenge

– http://cacm.acm.org/magazines/2011/4/106587-reflecting-on-the-darpa-red-balloon-challenge/fulltext

• Pirolli and Card, The Sensemaking Process and Leverage Points for Analyst Technology as Identified Through Cognitive Task Analysis

• Apolo, combining data mining with infoviz

– https://www.youtube.com/watch?v=EFxYVWkj1aE

Patterns in Crowd Analysis

[Diagram: three patterns drawn with boxes labeled "Unit of Work", "Final", and "Decision": Voting (Ex. PhishTank), MapReduce (Ex. CrowdVerify), and Funnel.]

Visualization of the Internet

Combine Data Mining + Viz

User can specify exemplars of a group

Belief Propagation to find more nodes
