Making Sense of Cyberspace, keynote for Software Engineering Institute Cyber Intelligence...
TRANSCRIPT
©2015 Carnegie Mellon University
Making Sense of Cyberspace
Cyber Intel Research Consortium
May 5, 2015
Jason Hong
Computer Human Interaction: Mobility Privacy Security
Computing Trends
[Chart: trends over time]
• Bandwidth
• Storage
• Computing Power
• Information
Human Capabilities
[Chart: capabilities over time]
• Cognitive processing
• Visual acuity
• Communication bandwidth
• Attention
• …
7 ± 2
Evidence suggests it’s more like 4
How to Scale Up Our Ability to Understand and Analyze?
• Research in mobility, privacy, security
– Anti-phishing
– Smartphone apps, Internet of Things
• Key themes today
– Information visualization
– Machine learning
– Wisdom of crowds
• Which state has the highest income?
• Is there a relationship between income and education?
• Are there any outliers?
Visualization for Eliciting Knowledge from Data
How You Represent Info Matters
• What is VIII × XCI?
• Another example:
– A game with two players
– Players alternately choose numbers from 1 to 9
– The first to hold three numbers that sum to 15 wins
– Each number can be taken only once
How You Represent Info Matters
• It’s actually Tic-Tac-Toe
– Trivial in this representation
8 1 6
3 5 7
4 9 2
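The equivalence is easy to check in a short script: in the magic square above, every row, column, and diagonal sums to 15, so "collect three numbers that sum to 15" is exactly "get three in a row."

```python
# Magic square from the slide: every line sums to 15,
# so the 1-9 number game is isomorphic to Tic-Tac-Toe.
square = [[8, 1, 6],
          [3, 5, 7],
          [4, 9, 2]]

rows = [sum(r) for r in square]
cols = [sum(c) for c in zip(*square)]
diags = [sum(square[i][i] for i in range(3)),
         sum(square[i][2 - i] for i in range(3))]

assert all(s == 15 for s in rows + cols + diags)
print(rows, cols, diags)  # every total is 15
```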
Other Design Principles?
1. Start out going Southwest on ELLSWORTH AVE Towards BROADWAY by turning right.
2. Turn RIGHT onto BROADWAY.
3. Turn RIGHT onto QUINCY ST.
4. Turn LEFT onto CAMBRIDGE ST.
5. Turn SLIGHT RIGHT onto MASSACHUSETTS AVE.
6. Turn RIGHT onto RUSSELL ST.
Some Lessons
1. How you represent info matters
2. InfoViz is about what you show as well as what you don't
Focus + Context
• More pixels for what matters, fewer pixels for supporting context
Focus + Context
Appropriate Use of Color
Appropriate Use of Color
Some Lessons
1. How you represent info matters
2. InfoViz is about what you show as well as what you don't
3. Focus + Context
4. Appropriate use of colors (perception)
US Election 2004
InfoViz Can Show and Hide Info
All Visualizations Show and Hide Info
InfoViz Can Show and Hide Info
All Visualizations Show and Hide Info
Some Lessons
1. How you represent info matters
2. InfoViz is about what you show as well as what you don't
3. Focus + Context
4. Appropriate use of colors (perception)
5. All visualizations have inherent biases
Visualizations Don’t Have to Be Complex to be Effective
Example from Jeff Heer
Work by Jeff Heer
Collaborative Analysis
• Many Eyes (by IBM)
Can We Have Crowds of People Help Analyze Data?
• What are better ways of getting groups of people working together?
• One approach is to use lots of crowd workers to improve accuracy and time
– Common structure in your problem?
– Can you break it down into small tasks?
Wisdom of Crowds Approach
• Mechanics of PhishTank
– Submissions require at least 4 votes and 70% agreement
– Some votes weighted more
• Total stats (Oct 2006 – Feb 2011)
– 1.1M URL submissions from volunteers
– 4.3M votes
– Resulting in about 646k identified phish
• Why so many votes for only 646k phish?
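The mechanics above can be sketched as a small decision rule (a sketch of the stated thresholds, not PhishTank's actual code; the function name is mine):

```python
# PhishTank-style labeling: a submission gets a label only once it has
# at least 4 votes and at least 70% agreement among the voters.
MIN_VOTES = 4
AGREEMENT = 0.70

def label(votes_phish, votes_legit):
    """Return 'phish', 'legitimate', or None (still undecided)."""
    total = votes_phish + votes_legit
    if total < MIN_VOTES:
        return None          # not enough votes yet
    if votes_phish / total >= AGREEMENT:
        return "phish"
    if votes_legit / total >= AGREEMENT:
        return "legitimate"
    return None              # votes too split to decide

print(label(5, 1))  # 'phish' (6 votes, 83% agreement)
print(label(2, 1))  # None (too few votes)
print(label(3, 3))  # None (no 70% majority)
```

The last case hints at the answer to the slide's question: split or insufficient votes never resolve, so votes accumulate without producing a label.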
PhishTank Statistics (Jan 2011)
Submissions: 16,019
Total votes: 69,648
Valid phish: 12,789
Invalid phish: 549
Median time: 2 hrs 23 min
• 69,648 votes could yield at most 17,412 labels (at 4 votes each)
– But only 12,789 phish and 549 legitimate sites were identified
– 2,681 URLs were not labeled at all
• A median delay of 2+ hours still has room for improvement (Jan 2015: 14+ hours)
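The vote arithmetic above, worked through with the slide's own numbers:

```python
# Jan 2011 figures from the slide: with a 4-vote minimum, 69,648 votes
# could close at most 69,648 // 4 = 17,412 URLs in the best case,
# yet far fewer submissions were actually resolved.
votes, submissions = 69648, 16019
valid, invalid = 12789, 549

max_labels = votes // 4        # every URL closed at exactly 4 votes
resolved = valid + invalid
unresolved = submissions - resolved

print(max_labels)   # 17412
print(resolved)     # 13338
print(unresolved)   # 2681 URLs never got a label
```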
Ways of Smartening the Crowd
• Change the order URLs are shown
– Ex. most recent vs closest to completion
• Change how submissions are shown
– Ex. show one at a time or in groups
• Adjust threshold for labels
– PhishTank is 4 votes and 70%
– Ex. vote weights, algorithm also votes
• Motivating people / allocating work
– Filtering by brand, competitions, teams of voters, leaderboards
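One way to read "vote weights, algorithm also votes" is weighted majority voting, with an automated classifier cast as just another weighted voter. A sketch (the function and weights are illustrative, not PhishTank's implementation):

```python
# Weighted voting: each vote carries a weight; an automated classifier
# participates as one more weighted voter alongside humans.
def weighted_label(votes, threshold=0.70):
    """votes: list of (is_phish, weight) pairs from humans and algorithms."""
    total = sum(w for _, w in votes)
    phish = sum(w for is_phish, w in votes if is_phish)
    if total == 0:
        return None
    if phish / total >= threshold:
        return "phish"
    if (total - phish) / total >= threshold:
        return "legitimate"
    return None  # still undecided

# Two trusted voters plus a classifier (weight 2.0) outvote one novice.
votes = [(True, 1.5), (True, 1.5), (False, 1.0), (True, 2.0)]
print(weighted_label(votes))  # 'phish' (5.0 of 6.0 = 83%)
```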
Clustering Phish
• Motivations
– Most phish are generated by toolkits and thus similar
– Labeling single sites as phish can be hard, easier if multiple examples
– Reduce labor by labeling suspicious sites in bulk
Most Phish Can be Clustered
• With all data over two weeks, 3180 of 3973 web pages can be grouped (80%)
– Used shingling and DBSCAN (see paper)
– 392 clusters, size from 2 to 153 URLs
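The pipeline named above can be sketched end to end: shingle each page's text, measure Jaccard distance between shingle sets, and cluster with a bare-bones DBSCAN. The parameters here (k=4, eps=0.4, min_pts=2) are illustrative choices, not the paper's:

```python
# Sketch of toolkit-phish clustering via shingling + DBSCAN.
def shingles(text, k=4):
    """Set of overlapping k-character shingles of the text."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def jaccard_dist(a, b):
    """Jaccard distance between two shingle sets (0 = identical)."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 0.0)

def dbscan(items, eps=0.4, min_pts=2):
    """Bare-bones DBSCAN over Jaccard distance; -1 marks unclustered noise."""
    sets = [shingles(t) for t in items]
    labels = [None] * len(items)
    cluster = -1
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(items))
                     if jaccard_dist(sets[i], sets[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1  # noise, may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [m for m in range(len(items))
                           if jaccard_dist(sets[j], sets[m]) <= eps]
            if len(j_neighbors) >= min_pts:  # j is itself a core point
                queue.extend(m for m in j_neighbors if labels[m] is None)
    return labels

# Two kit-generated lookalikes cluster together; the unrelated page is noise.
pages = ["verify your paypal account now",
         "verify your paypal account today",
         "completely unrelated news article"]
print(dbscan(pages))  # [0, 0, -1]
```

A real deployment would shingle rendered page content and tune eps/min_pts, but the structure, near-duplicate pages collapsing into a few hundred clusters, is the same.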
Comparing Coverage and Time
CrowdVerify
• This is the New York Times Privacy Policy
• Can we use the crowd to help summarize? Flag surprises?
• Can we build predictive language models?
Which statement is more important?
A: "We receive data whenever you visit a game, application, or website that uses Facebook Platform …"
B: "When you sign up for Facebook, you are required to provide information such as your name, email address, birthday, and gender."
CrowdVerify on Google's Policies
[Chart: sum of scores per statement ID from the frequency experiment; statements ranked Google 08, 13, 07, 30, 10, 12, 04, 16, 03, 11, 05, 09, 28, 02, 06, 23, 14]
Top Statement - Google
• Google 08: When you use our services or view content provided by Google, we may automatically collect and store certain information in server logs. This may include: telephony log information like your phone number, calling-party number, forwarding numbers, time and date of calls, duration of calls, SMS routing information and types of calls.
Bottom 3 Statements - Google
• Google 29: Whenever you use our services, we aim to provide you with access to your personal information.
• Google 26: For example, you can: Take information out of many of our services.
• Google 24: People have different privacy concerns. Our goal is to be clear about what information we collect, so that you can make meaningful choices about how it is used.
Designing Crowd Systems
• Motivation for contributing?
– Money, altruism, fun?
• Quality control?
– How to ensure good quality? Prevent bozos?
• How is crowd wisdom aggregated?
– How are people and computers organized?
– How are decisions made?
• Skill level required?
– Novice, intermediate, expert?
One Real-World Success: Clickworkers
• 100k+ volunteers identified 2M Mars craters from space photographs (14k hours of free work!!)
• Aggregate results “virtually indistinguishable” from experts
Another Real-World Success: Galaxy Zoo
• Problem: you are an astronomer looking at the Sloan Digital Sky Survey, which has images of over 1M galaxies. You want to classify these by galaxy type.
DARPA Red Balloon
• 2009 challenge to find 10 red weather balloons at unknown locations across the USA
• Winner gets $40k
• Surprise: the top team won in under 9 hours
– Other teams did well too:
– 9 balloons in 9 hours (1 team)
– 8 balloons (2 teams)
– 7 balloons (5 teams)
DARPA Red Balloon
DARPA Red Balloon
• MIT team (10 balloons in less than 9 hours)
• Overall team strategy
– Get as many people as possible (over 5k people)
– Use multiple approaches to verify sightings
• How people were organized
– Core team / recruiters / spotters
• "Social engineering"
– Recursive financial structure to recruit more people
– Incentivizes even people who might not be able to help directly
• Technologies used
– Surprisingly little: one centralized web site + Google Maps
MIT Reward Structure
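The recursive structure can be sketched as a geometric series (the dollar figures are those reported for the challenge: $2,000 to the balloon finder, halving at each step up the recruitment chain; the function is mine). The key property is that the total payout per balloon is bounded no matter how long the chain gets:

```python
# MIT's recursive incentive: finder gets $2000, their recruiter $1000,
# that person's recruiter $500, and so on, halving up the chain.
def payouts(base=2000.0, chain_length=5):
    """Reward per person, walking up the recruitment chain."""
    return [base / (2 ** i) for i in range(chain_length)]

chain = payouts()
print(chain)       # [2000.0, 1000.0, 500.0, 250.0, 125.0]
print(sum(chain))  # 3875.0 -- the series never exceeds 2 * base = $4000
```

Because each balloon costs at most $4,000 of the $40k prize, the team could afford to reward recruiters, which is what made recruiting itself profitable.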
Not Quite as Successful: Search for Steve Fossett
DARPA Network Challenge
Summary
• Collecting info is easy, understanding it is hard
• How to scale?
– Information visualization
– Machine learning
– Wisdom of crowds
More Resources
• A Tour through the Visualization Zoo
– https://queue.acm.org/detail.cfm?id=1805128
• Interactive Dynamics for Visual Analysis
– https://queue.acm.org/detail.cfm?id=2146416
• Tableau Software
– http://www.tableau.com/
• Many Eyes
– http://www-01.ibm.com/software/analytics/many-eyes/
More Resources
• DARPA Red Balloon Challenge
– http://cacm.acm.org/magazines/2011/4/106587-reflecting-on-the-darpa-red-balloon-challenge/fulltext
• Pirolli and Card, The Sensemaking Process and Leverage Points for Analyst Technology as Identified Through Cognitive Task Analysis
• Apolo, combining data mining with infoviz
– https://www.youtube.com/watch?v=EFxYVWkj1aE
Patterns in Crowd Analysis
[Diagram: two patterns. Voting (e.g. PhishTank): many units of work feed votes into a final decision. MapReduce (e.g. CrowdVerify): units of work are funneled and combined into a decision.]
Visualization of the Internet
Combine Data Mining + Viz
• User can specify exemplars of a group
• Belief Propagation finds more nodes like them