what to expect when you are visualizing (v.2)

Post on 15-Jan-2017


WHAT TO EXPECT WHEN YOU ARE VISUALIZING

Krist Wongsuphasawat / @kristw

Based on true stories

Forever querying

Never-ending cleaning

Hopelessly prototyping

Last-minute coding

and many more…

Computer Engineer, Bangkok, Thailand

Chulalongkorn University

Programming + Soccer

(P.S. These are actually not my robots, but our competitors’.)

PhD in Computer Science, Information Visualization, Univ. of Maryland

IBM, Microsoft

Data Visualization Scientist, Twitter

#interactive visualizations

Open-source projects

Visual Analytics Tools

DATA + ME = VIS

Me

clients, data, requirements, etc.

WHAT TO EXPECT?

1. EXPECT POTENTIAL MISMATCHES

INPUT (DATA)

What clients think they have

What they usually have

YOU

What clients think you are

What they will get

OUTPUT (VIS)

What clients ask for

What they really need

COMMUNICATE

I need this. Take this.

I need this. Here you are.

I need this. Take this.

& COMPROMISE

2. EXPECT DIFFERENT REQUIREMENTS

DIFFERENT GOALS

Present: Communicate information effectively

Explore: Exploratory analysis, reusable tools for exploration

Explore + Present: Analyze data + tell story

Enjoy: More flexible

3. EXPECT TO CLEAN DATA

DATA SOURCES

Open data: Publicly available

Internal data: Private, owned by clients’ organization

Self-collected data: Manual, site scraping, etc.

Combine the above

MANY FORMS OF DATA

Standalone files: txt, csv, tsv, json, Google Docs, …, pdf*

APIs: better quality with more overhead

Databases: doesn’t necessarily mean they are organized

Big data: bigger pain

HAVING ALL TWEETS

How people think I feel.

How I really feel.

CHALLENGES

Get relevant Tweets: hashtag: #oscars, keywords: “spotlight” (movie name)

Too big: Need to aggregate & reduce size

Slow: Long processing time (hours)
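The first two challenges can be sketched on a small sample as follows. This is only an illustration: the tweet shape (`text`, `timestamp` in milliseconds), the substring matching rule, and the hourly bucket are assumptions, not the pipeline used in the talk.

```javascript
// Hypothetical tweet shape: { text: string, timestamp: number (ms since epoch) }
const HOUR_MS = 60 * 60 * 1000;

// Keep only tweets that mention one of the hashtags or keywords (case-insensitive).
function isRelevant(tweet, terms) {
  const text = tweet.text.toLowerCase();
  return terms.some(term => text.includes(term.toLowerCase()));
}

// Reduce size: replace individual tweets with one count per hour.
function countPerHour(tweets) {
  const counts = new Map();
  for (const tweet of tweets) {
    const hour = Math.floor(tweet.timestamp / HOUR_MS) * HOUR_MS;
    counts.set(hour, (counts.get(hour) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([hour, count]) => ({ hour, count }));
}

// Made-up sample data.
const sampleTweets = [
  { text: 'Watching the #Oscars tonight', timestamp: 0 },
  { text: 'Spotlight deserves best picture', timestamp: 10 * 60 * 1000 },
  { text: 'Lunch was great', timestamp: 20 * 60 * 1000 },
  { text: '#oscars red carpet', timestamp: 70 * 60 * 1000 },
];

const relevant = sampleTweets.filter(t => isRelevant(t, ['#oscars', 'spotlight']));
const hourly = countPerHour(relevant);
console.log(hourly);
```

On the real archive the same two steps would run on the cluster, so that only the aggregated counts reach the laptop.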

GETTING BIG DATA

Hadoop Cluster: Data Storage → Tool: Pig / Scalding (slow) → Smaller dataset

Your laptop: Smaller dataset → Tool: node.js / python / excel (fast) → Final dataset

CLEANING

Data come in different formats: tsv to json

Quality of data collection: null, missing data, typos, timestamp

Filter: Remove unnecessary data

Conversion:

Change country code from 3-letter (USA) to 2-letter (US)

Correct time of day based on users’ timezone

Convert lat/lon to county

etc.
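A minimal sketch of three of these steps (format conversion, missing values, country-code conversion). The column names and the three-entry lookup table are made up for illustration; a real table would cover every country code.

```javascript
// Illustrative lookup table only; a real one would cover every ISO 3166-1 code.
const ALPHA3_TO_ALPHA2 = { USA: 'US', THA: 'TH', GBR: 'GB' };

// Parse tab-separated text (first line = header) into an array of objects.
function tsvToJson(tsv) {
  const [headerLine, ...lines] = tsv.trim().split('\n');
  const headers = headerLine.split('\t');
  return lines.map(line => {
    const cells = line.split('\t');
    const row = {};
    headers.forEach((header, i) => {
      const value = cells[i];
      // Treat absent cells, empty cells and the literal string "null" as missing.
      row[header] = value === undefined || value === '' || value === 'null' ? null : value;
    });
    return row;
  });
}

// Convert types and codes; anything that cannot be converted becomes null.
function cleanRow(row) {
  return {
    user: row.user,
    country: ALPHA3_TO_ALPHA2[row.country] || null,
    count: row.count === null ? null : Number(row.count),
  };
}

// Hypothetical input with one good row, one missing value and one unknown code.
const rawTsv = 'user\tcountry\tcount\nA\tUSA\t3\nB\tTHA\tnull\nC\tXXX\t5\n';
const cleaned = tsvToJson(rawTsv)
  .map(cleanRow)
  .filter(row => row.country !== null && row.count !== null); // drop unusable rows
console.log(JSON.stringify(cleaned));
```

Whether rows with missing values are dropped, as here, or repaired depends on the task.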

4. EXPECT TO CLEAN DATA A LOT

70-80% of time cleaning data

“DATA JANITOR”

WHY?

Definition of “clean” depends on the task. e.g. Restaurant reviews

USER  RESTAURANT  RATING
========================
A     MCDONALD’S  3
B     MCDONALDS   3
C     MCDONALD    4
D     MCDONALDS   5
E     IHOP        4
F     SUBWAY      4
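For a task such as "average rating per restaurant", the three spellings of the same chain have to be merged first; for a task such as "count reviews per user", they do not matter. A sketch of the first case, where the alias table is a hypothetical one built by hand after looking at the data:

```javascript
const reviews = [
  { user: 'A', restaurant: 'MCDONALD’S', rating: 3 },
  { user: 'B', restaurant: 'MCDONALDS', rating: 3 },
  { user: 'C', restaurant: 'MCDONALD', rating: 4 },
  { user: 'D', restaurant: 'MCDONALDS', rating: 5 },
  { user: 'E', restaurant: 'IHOP', rating: 4 },
  { user: 'F', restaurant: 'SUBWAY', rating: 4 },
];

// Hypothetical alias table: maps known variants to one canonical spelling.
const ALIASES = { MCDONALD: 'MCDONALDS' };

// Uppercase, strip punctuation (including apostrophes), then apply aliases.
function canonicalName(name) {
  const stripped = name.toUpperCase().replace(/[^A-Z0-9 ]/g, '').trim();
  return ALIASES[stripped] || stripped;
}

function averageRatingByRestaurant(rows) {
  const totals = new Map();
  for (const { restaurant, rating } of rows) {
    const key = canonicalName(restaurant);
    const total = totals.get(key) || { sum: 0, n: 0 };
    total.sum += rating;
    total.n += 1;
    totals.set(key, total);
  }
  const result = {};
  for (const [key, { sum, n }] of totals) result[key] = sum / n;
  return result;
}

const averages = averageRatingByRestaurant(reviews);
console.log(averages);
```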

Data issues can present themselves at any time in the project timeline.

RAMSAY & RAMSEY

It takes time to process data. Run. Wait… Oops! Re-run. Wait…

RECOMMENDATIONS

Always assume that you will have to do it again: document the process, automate

Reusable scripts: break a gigantic do-it-all function into smaller ones

Reusable data: keep for future projects
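One way to read the "smaller functions" recommendation, as a sketch: each step takes rows and returns rows, and a tiny `pipeline` helper chains them. The step names and the row shape are hypothetical.

```javascript
// Each step takes rows and returns new rows, so it can be reused and tested alone.
const dropMissing = rows =>
  rows.filter(r => typeof r.text === 'string' && r.text.trim() !== '');

const normalizeText = rows =>
  rows.map(r => ({ ...r, text: r.text.trim().toLowerCase() }));

const dedupeByText = rows => {
  const seen = new Set();
  return rows.filter(r => {
    if (seen.has(r.text)) return false;
    seen.add(r.text);
    return true;
  });
};

// Compose small steps into one pipeline instead of writing a do-it-all function.
const pipeline = (...steps) => input => steps.reduce((rows, step) => step(rows), input);

const cleanTweets = pipeline(dropMissing, normalizeText, dedupeByText);

// Made-up sample rows.
const cleanedTweets = cleanTweets([
  { text: ' Hodor ' },
  { text: null },
  { text: 'hodor' },
  { text: 'Winter is coming' },
]);
console.log(cleanedTweets);
```

When a new data issue shows up late in the project, only the affected step has to change before the whole pipeline is re-run.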

5. EXPECT TO TRY AND BREAK THINGS

https://twitter.com/hashtag/d3brokeandmadeart

#D3BROKEANDMADEART

6. EXPECT TO ITERATE UNTIL IT WORKS

7. EXPECT DEADLINE

EXAMPLE PROJECTS

EXAMPLE 1: STORYTELLING

WHAT TO EXPECT

timely: Deadline is strict. There can also be unexpected events.

wide audience: easy to explain and understand, multi-device support

one-off projects

content screening

Reveal the talking points of every episode of Game of Thrones from fans’ conversations

Problem is coming.

CHAPTER I

Problem

Want to know what the audience of a TV show talks about, from Tweets

HBO’s Game of Thrones

Based on a book series “A Song of Ice and Fire” Medieval Fantasy. Knights, magic and dragons.

Brief Story

A King dies. 

A lot of contenders wage a war to reclaim the throne.

Minor characters with no claim to the throne set their own plans in action to gain power

when all the major characters end up killing each other.

Brave/Honest/Honorable characters die.

Intelligent but shady characters and characters who know nothing

continue to live.

While humans are busy killing each other, ice zombies “White walkers” are invading from the North.

The only group that seems to care about this is a neutral group called the Night’s Watch.


Many characters. Anybody can die.

6 seasons (60 episodes) so far

Multiple storylines in each episode

Ideas

Common words: Too much noise

Characters: How often was each character mentioned?

I demand a trial by prototyping.

CHAPTER II

Prototyping

Pull sample data from Twitter API

Entity recognition and counting: naive approach

List of names

Daenerys Targaryen,Khaleesi

Jon Snow

Sansa Stark

Tyrion Lannister

Arya Stark

Cersei Lannister

Khal Drogo

Gregor Clegane,Mountain

Margaery Tyrell

Joffrey Baratheon

Bran Stark

Theon Greyjoy

Jaime Lannister

Brienne

Eddard Stark,Ned Stark

Ramsay Bolton

Sandor Clegane,Hound

Ygritte

Stannis Baratheon

Petyr Baelish,Little Finger

Robb Stark

Bronn

Varys

Catelyn Stark

Oberyn Martell

Daario Naharis

Davos Seaworth

Jorah Mormont

Melisandre

Myrcella Baratheon

Tywin Lannister

Tommen Baratheon

Grey Worm

Tyene Sand

Rickon Stark

Missandei

Roose Bolton

Robert Baratheon

Jojen Reed

Jeor Mormont

Tormund Giantsbane

Lysa Arryn

Yara Greyjoy,Asha Greyjoy

Samwell Tarly,Sam

Hodor

Victarion Greyjoy

High Sparrow

Dragon

Winter

Dothraki

Sample Tweet

Sample data

Character   Count
Hodor       10000
Jon Snow    5000
Daenerys    4000
Bran Stark  3000
…           …

*These numbers are made up for presentation, not real data.
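The naive approach can be sketched as a case-insensitive substring match against the list of names and aliases. The matching rule (substring, one count per character per Tweet) is an assumption, and the sample Tweets are made up.

```javascript
// Each entry: canonical name followed by optional aliases, as in the list of names.
const CHARACTERS = [
  ['Daenerys Targaryen', 'Khaleesi'],
  ['Jon Snow'],
  ['Hodor'],
  ['Eddard Stark', 'Ned Stark'],
];

// Naive matching: case-insensitive substring search.
// A character is reported at most once per tweet, under its canonical name.
function mentionedCharacters(text, characters) {
  const lower = text.toLowerCase();
  return characters
    .filter(names => names.some(name => lower.includes(name.toLowerCase())))
    .map(names => names[0]);
}

function countMentions(tweets, characters) {
  const counts = new Map(characters.map(names => [names[0], 0]));
  for (const text of tweets) {
    for (const name of mentionedCharacters(text, characters)) {
      counts.set(name, counts.get(name) + 1);
    }
  }
  return [...counts.entries()]
    .map(([character, count]) => ({ character, count }))
    .sort((a, b) => b.count - a.count);
}

// Made-up sample tweets.
const tweets = [
  'HODOR!!! hold the door',
  'Jon Snow and Khaleesi',
  'hodor hodor hodor',
  'Ned Stark deserved better',
];
const mentionCounts = countMentions(tweets, CHARACTERS);
console.log(mentionCounts);
```

Substring matching is cheap but noisy: short aliases such as "Sam" or "Hound" would also match inside unrelated words, which is part of why this is only a prototype.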

When you play the game of vis, you iterate or you die.

CHAPTER III

Where to go from here?

+ episodes

The Guardian & Google Trends

http://www.theguardian.com/news/datablog/ng-interactive/2016/apr/22/game-of-thrones-the-most-googled-characters-episode-by-episode

+ emotion

+ connections

+ connections

Gain insights from a single episode emotion & connections

Sample data

INDIVIDUALS (+ top emojis)

Character  Count
Hodor      10000
Jon Snow   5000
Daenerys   4000
…          …

CONNECTIONS (+ top emojis)

Characters        Count
Jon Snow+Sansa    1000
Tormund+Brienne   500
Bran Stark+Hodor  300
…                 …

Graph

INDIVIDUALS become NODES, CONNECTIONS become LINKS

*These numbers are made up for presentation, not real data.
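A sketch of how the two tables could be derived: given the characters mentioned in each Tweet, count individuals for the nodes and co-mentioned pairs for the links. The input shape and the field names (`id`, `source`, `target`, `count`) are assumptions, and emojis are left out.

```javascript
// Input: for each tweet, the list of characters it mentions (made-up sample).
const mentionsPerTweet = [
  ['Jon Snow', 'Sansa Stark'],
  ['Sansa Stark', 'Jon Snow'],
  ['Tormund Giantsbane', 'Brienne'],
  ['Hodor'],
  ['Bran Stark', 'Hodor', 'Jon Snow'],
];

function buildGraph(mentionLists) {
  const nodeCounts = new Map();
  const linkCounts = new Map();
  for (const list of mentionLists) {
    // De-duplicate and sort so that A+B and B+A share one key.
    const names = [...new Set(list)].sort();
    for (const name of names) {
      nodeCounts.set(name, (nodeCounts.get(name) || 0) + 1);
    }
    // Every unordered pair of characters in the same tweet is one connection.
    for (let i = 0; i < names.length; i++) {
      for (let j = i + 1; j < names.length; j++) {
        const key = `${names[i]}+${names[j]}`;
        const link = linkCounts.get(key) || { source: names[i], target: names[j], count: 0 };
        link.count += 1;
        linkCounts.set(key, link);
      }
    }
  }
  const nodes = [...nodeCounts].map(([id, count]) => ({ id, count }));
  const links = [...linkCounts.values()];
  return { nodes, links };
}

const graph = buildGraph(mentionsPerTweet);
console.log(graph.nodes, graph.links);
```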

Network Visualization

Node-link diagram

Force-directed layout http://blockbuilder.org/kristw/762b680690e4b2b2666dfec15838a384

Issue: Hairball

Why?

Too many nodes & edges

nodes = nodes.filter(n => n.count > 100);
links = links.filter(l => l.count > 100);
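If the node and link thresholds are adjusted independently, a link can survive while one of its endpoints is filtered out. A hedged variant that guards against this, assuming links refer to nodes by a hypothetical `id` field:

```javascript
// Filter nodes and links by separate thresholds, then drop dangling links.
function filterGraph(nodes, links, minNodeCount, minLinkCount) {
  const keptNodes = nodes.filter(n => n.count > minNodeCount);
  const keptIds = new Set(keptNodes.map(n => n.id));
  const keptLinks = links.filter(
    l => l.count > minLinkCount && keptIds.has(l.source) && keptIds.has(l.target)
  );
  return { nodes: keptNodes, links: keptLinks };
}

// Made-up sample graph.
const sampleNodes = [
  { id: 'Hodor', count: 1000 },
  { id: 'Jon Snow', count: 500 },
  { id: 'Bronn', count: 80 },
];
const sampleLinks = [
  { source: 'Hodor', target: 'Jon Snow', count: 120 },
  { source: 'Jon Snow', target: 'Bronn', count: 60 },
  { source: 'Hodor', target: 'Bronn', count: 20 },
];

const filtered = filterGraph(sampleNodes, sampleLinks, 100, 50);
console.log(filtered);
```

Here the Jon Snow and Bronn link passes the link threshold but is removed because Bronn is no longer in the graph.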

The force is (too) strong.

force
  .charge(…)
  .gravity(…)
  .linkDistance(…)
  .linkStrength(…)

Issue: Occlusions

Tried: Fixed positions

+ Collision Detection

http://blockbuilder.org/kristw/2850f65d6329c5fef6d5c9118f1de6e6

+ Community Detection

https://github.com/upphiminn/jLouvain

+ Collision Detection (with clusters)

https://bl.ocks.org/mbostock/7881887

Tormund + Brienne

Issue: Convex hull

http://bl.ocks.org/mbostock/4341699

d3.geom.hull(vertices)

x & y only, no radius

Example

Fix it

Fix it
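One possible fix, sketched without D3 so it runs on its own: replace every circle with points sampled on its circumference, then compute the hull of those points, so the outline clears each node's radius. The sampling count and the `padding` parameter are assumptions; the hull routine here is Andrew's monotone chain, not `d3.geom.hull`.

```javascript
// Andrew's monotone chain convex hull over [x, y] pairs.
function convexHull(points) {
  const pts = points.slice().sort((a, b) => a[0] - b[0] || a[1] - b[1]);
  if (pts.length < 3) return pts;
  const cross = (o, a, b) =>
    (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
  const lower = [];
  for (const p of pts) {
    while (lower.length >= 2 && cross(lower[lower.length - 2], lower[lower.length - 1], p) <= 0) {
      lower.pop();
    }
    lower.push(p);
  }
  const upper = [];
  for (let i = pts.length - 1; i >= 0; i--) {
    const p = pts[i];
    while (upper.length >= 2 && cross(upper[upper.length - 2], upper[upper.length - 1], p) <= 0) {
      upper.pop();
    }
    upper.push(p);
  }
  // The last point of each half is the first point of the other half.
  lower.pop();
  upper.pop();
  return lower.concat(upper);
}

// Replace each circle {x, y, r} by points sampled on its (padded) circumference.
function circlesToPoints(circles, samples = 12, padding = 0) {
  const points = [];
  for (const { x, y, r } of circles) {
    for (let i = 0; i < samples; i++) {
      const angle = (2 * Math.PI * i) / samples;
      points.push([x + (r + padding) * Math.cos(angle), y + (r + padding) * Math.sin(angle)]);
    }
  }
  return points;
}

// Two made-up nodes: centers alone give a degenerate two-point "hull".
const circles = [{ x: 0, y: 0, r: 10 }, { x: 100, y: 0, r: 10 }];
const centersOnly = convexHull(circles.map(c => [c.x, c.y]));
const withRadius = convexHull(circlesToPoints(circles));
console.log(centersOnly.length, withRadius.length);
```

More samples per circle give a rounder outline at the cost of more hull vertices.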

Let’s get other episodes.

Hadoop remembers.

CHAPTER IV

More data

Hadoop

Rewrite the scripts in Scalding to get archived data

How much data do we need?

Whole week?

5 days?

2 days?

A day?

etc.

How much data do we need?

Transitions

not so smooth

After switching episode

1. Store old positions for existing objects.

2. Assign new initial positions.*

3. Run simulation without updating <svg> for n rounds.

4. Animate objects from old to new positions.

5. Resume simulation and update <svg> every tick.

*Initial positions

Default: random

Better starting points: Heuristics based on degree of nodes
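Steps 1 and 2 can be sketched as below. The specific heuristic (the higher the degree, the closer to the center) is one plausible choice and not necessarily the one used in the talk; the field names and the assumption that links refer to nodes by `id` are also illustrative.

```javascript
// Step 1: remember where existing nodes are, keyed by id, for the later animation.
function snapshotPositions(nodes) {
  return new Map(nodes.map(n => [n.id, { x: n.x, y: n.y }]));
}

// Count how many links touch each node.
function computeDegrees(nodes, links) {
  const degree = new Map(nodes.map(n => [n.id, 0]));
  for (const { source, target } of links) {
    degree.set(source, (degree.get(source) || 0) + 1);
    degree.set(target, (degree.get(target) || 0) + 1);
  }
  return degree;
}

// Step 2: seed positions on rings around the center; the radius shrinks as degree grows.
function seedByDegree(nodes, links, width, height) {
  const degree = computeDegrees(nodes, links);
  const maxDegree = Math.max(1, ...degree.values());
  const maxRadius = Math.min(width, height) / 2;
  nodes.forEach((node, i) => {
    const radius = maxRadius * (1 - degree.get(node.id) / maxDegree);
    const angle = (2 * Math.PI * i) / nodes.length;
    node.x = width / 2 + radius * Math.cos(angle);
    node.y = height / 2 + radius * Math.sin(angle);
  });
  return nodes;
}

// Made-up sample: node "a" is the hub.
const seedNodes = [{ id: 'a', x: 5, y: 5 }, { id: 'b', x: 6, y: 6 }, { id: 'c', x: 7, y: 7 }];
const seedLinks = [{ source: 'a', target: 'b' }, { source: 'a', target: 'c' }];

const oldPositions = snapshotPositions(seedNodes);
seedByDegree(seedNodes, seedLinks, 200, 200);
console.log(oldPositions.get('a'), seedNodes[0]);
```

The stored positions are what step 4 animates from, once the simulation has settled on new ones.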

Animate Nodes & Links

Remove

delay

Move & Change size/thickness

Add new

// Create selection
const selection = svg.selectAll('g.node')
  .data(nodes, d => d.entity.id);

// Fade “exit” nodes to opacity 0 and remove
selection.exit()
  .transition()
  .duration(1000)
  .style('opacity', 0)
  .remove();

// Add “enter” nodes with opacity 0
const sEnter = selection.enter().append('g')
  .classed('node', true)
  .attr('transform', d => `translate(${d.x},${d.y})`)
  .style('opacity', 0)
  .call(force.drag);

sEnter.append('circle')
  .attr('r', d => d.r)
  .style('fill', d => options.colorScale(d.entity.group));

// After 1s delay, use transition to move nodes and fade in new nodes
// (in D3 v3 the update selection also contains the nodes just entered)
const sTrans = selection.transition()
  .delay(1000)
  .duration(2000)
  .attr('transform', d => `translate(${d.x},${d.y})`)
  .style('opacity', 1);

sTrans.select('circle')
  .attr('r', d => d.r);

Animate Communities

Remove

delay

Move & Change shape*

Add new

http://blockbuilder.org/kristw/f9ffe87dd8b4038b5867e853c27cebb7

Default: t=0, t=1

Smoother: t=0, t=0.5, t=0.51, t=1

Code

// original
path.attr('d', hull);

// with custom interpolation
path.attrTween('d', (d, i, currentAttr) => interpolateHull(d, currentAttr));
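The slides do not show `interpolateHull` itself. As a rough idea of what a custom hull interpolation can look like, the sketch below resamples both outlines to the same number of points and blends them point by point. The function names, the point count `n`, and the whole approach are assumptions; it also assumes both outlines start near each other and wind in the same direction, otherwise the shape twists.

```javascript
// Resample a closed polygon ([x, y] pairs) to n points evenly spaced along its perimeter.
function resample(polygon, n) {
  const edges = polygon.map((p, i) => {
    const q = polygon[(i + 1) % polygon.length];
    return { p, q, length: Math.hypot(q[0] - p[0], q[1] - p[1]) };
  });
  const perimeter = edges.reduce((sum, e) => sum + e.length, 0);
  const step = perimeter / n;
  const result = [];
  let edgeIndex = 0;
  let edgeStart = 0; // distance along the perimeter where the current edge starts
  for (let k = 0; k < n; k++) {
    const target = k * step;
    while (edgeIndex < edges.length - 1 && edgeStart + edges[edgeIndex].length < target) {
      edgeStart += edges[edgeIndex].length;
      edgeIndex++;
    }
    const e = edges[edgeIndex];
    const t = e.length === 0 ? 0 : (target - edgeStart) / e.length;
    result.push([e.p[0] + (e.q[0] - e.p[0]) * t, e.p[1] + (e.q[1] - e.p[1]) * t]);
  }
  return result;
}

// Returns a function of t in [0, 1] that yields the blended outline.
function interpolatePolygon(from, to, n = 32) {
  const a = resample(from, n);
  const b = resample(to, n);
  return t => a.map((p, i) => [p[0] + (b[i][0] - p[0]) * t, p[1] + (b[i][1] - p[1]) * t]);
}

// Turn points into an SVG path string.
function toPath(points) {
  return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
}

// Made-up outlines with different numbers of vertices.
const square = [[0, 0], [4, 0], [4, 4], [0, 4]];
const triangle = [[0, 0], [8, 0], [4, 8]];
const blend = interpolatePolygon(square, triangle, 8);
console.log(toPath(blend(0)));
console.log(toPath(blend(0.5)));
```

Because both outlines always have the same number of points, the path string keeps the same structure for every `t`, which avoids a sudden jump when the set of hull vertices changes.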

Colors

Default: d3.category10(). Distinct, but says nothing about the context

Custom palette: Colors related to the groups/houses.

Black = Night’s Watch, Blue = North, Red = Daenerys, Gold = Lannister, …

Hold the vis.

CHAPTER V

The vis is not enough.

Legend

Navigation

Top 3

Adjust threshold

Recap

Filtered Recap

Tooltip

Demo

https://interactive.twitter.com/game-of-thrones

Mobile Support

A visualizer always evaluates his work.

CHAPTER VI

“Feedback is the breakfast of champions.”

— Ken Blanchard

Self & Peer

Does it solve the problem?

Google Analytics

Pageviews

Visitors

Actions

Referrals Sites/Social

Feedback

Feedback

EXAMPLE 2: VISUAL ANALYTICS TOOLS

[Diagram: from Data sources to Output through the steps get, explore, analyze, and present; supported by ad-hoc scripts and tools for exploration]

WHAT TO EXPECT

richer, more features: to support exploration of complex data

more technical audience: product managers, engineers, data scientists

accuracy

designed for dynamic input

long-term projects

USER ACTIVITY LOGS

[Diagram: Users, Twitter, Product Managers, Engineers, and Log data in Hadoop, connected by the actions Use, Curious, Instrument, and Write]

WHAT IS BEING LOGGED?

activities

tweet from home timeline on twitter.com

tweet from search page on iPhone

sign up

log in

retweet

etc.

ORGANIZE?

LOG EVENT A.K.A. “CLIENT EVENT” [Lee et al. 2012]

1) User ID

2) Timestamp

3) Event name

client : page : section : component : element : action

web : home : timeline : tweet_box : button : tweet

4) Event detail
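A small sketch of reading an event name into its six levels. Treating `-` as "level not specified" is an assumption based on the event names that appear later in the deck; the function name is hypothetical.

```javascript
const LEVELS = ['client', 'page', 'section', 'component', 'element', 'action'];

// Split "client:page:section:component:element:action" into named fields.
function parseEventName(name) {
  const parts = name.split(':');
  if (parts.length !== LEVELS.length) {
    throw new Error(`Expected ${LEVELS.length} levels, got ${parts.length}: ${name}`);
  }
  const event = {};
  LEVELS.forEach((level, i) => {
    // Assumption: "-" (or an empty part) means the level is not specified.
    event[level] = parts[i] === '-' || parts[i] === '' ? null : parts[i];
  });
  return event;
}

const tweetEvent = parseEventName('web:home:timeline:tweet_box:button:tweet');
const impressionEvent = parseEventName('iphone:home:-:-:-:impression');
console.log(tweetEvent, impressionEvent);
```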

LOG DATA

bigger than Tweet data

[Diagram, built up over several slides: Users, Twitter, Engineers, Product Managers, Data Scientists, and Log data in Hadoop, connected by the actions Use, Curious, Instrument, Write, Ask, Find, Clean, Analyze, and Monitor]

Scribe Radar

Project / Find & Monitor client events

GOALS

Search for client events

Explore client event collection

Monitor changes

CLIENT EVENT HIERARCHY

iphone home -

- - impression

tweet tweet click

iphone:home:-:-:-:impression

iphone:home:-:tweet:tweet:click
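The hierarchy can be sketched as a tree keyed by the colon-separated levels, with counts summed at every level. The `count` field and the sample numbers are made up for illustration.

```javascript
// Build a tree from event names; each node sums the counts of the events below it.
function buildHierarchy(events) {
  const root = { name: 'root', count: 0, children: new Map() };
  for (const { name, count } of events) {
    root.count += count;
    let node = root;
    for (const part of name.split(':')) {
      if (!node.children.has(part)) {
        node.children.set(part, { name: part, count: 0, children: new Map() });
      }
      node = node.children.get(part);
      node.count += count;
    }
  }
  return root;
}

// Made-up counts for the two event names shown above.
const tree = buildHierarchy([
  { name: 'iphone:home:-:-:-:impression', count: 100 },
  { name: 'iphone:home:-:tweet:tweet:click', count: 20 },
]);

const section = tree.children.get('iphone').children.get('home').children.get('-');
console.log(section.count, [...section.children.keys()]);
```

The two events share the `iphone`, `home`, and `-` levels and branch at the component level.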

DETECT CHANGES

TODAY compared to 7 DAYS AGO

iphone home -

- - impression

tweet tweet click

CALCULATE CHANGES

+5% +5% +5%

+10% +10% +10%

-5% -5% -5%

DIFF
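A sketch of the diff step: compare today's count of each event with the count from 7 days ago and express the difference as a percentage. The input shape (plain objects keyed by event name) and the handling of events with no baseline are assumptions.

```javascript
// Relative change in percent; null when there is no baseline to compare against.
function percentChange(today, weekAgo) {
  if (weekAgo === 0) return today === 0 ? 0 : null;
  return ((today - weekAgo) * 100) / weekAgo;
}

// Compare two { eventName: count } objects over the union of their event names.
function diffCounts(todayCounts, weekAgoCounts) {
  const names = new Set([...Object.keys(todayCounts), ...Object.keys(weekAgoCounts)]);
  return [...names].sort().map(name => {
    const today = todayCounts[name] || 0;
    const weekAgo = weekAgoCounts[name] || 0;
    return { name, today, weekAgo, change: percentChange(today, weekAgo) };
  });
}

// Made-up counts.
const diffs = diffCounts(
  { 'iphone:home:-:-:-:impression': 105, 'iphone:home:-:tweet:tweet:click': 95 },
  { 'iphone:home:-:-:-:impression': 100, 'iphone:home:-:tweet:tweet:click': 100 }
);
console.log(diffs);
```

The same calculation can be applied at every level of the hierarchy by summing counts per level before taking the diff.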

DISPLAY CHANGES

iphone home -

- - impression

tweet tweet click

Map of the Market [Wattenberg 1999], StemView [Guerra-Gomez et al. 2013]

DISPLAY CHANGES

home -

- - impression

tweet tweet click

iphone

Demo Demo Demo

Demo / Scribe Radar

Twitter for Banana

WORKFLOW

Requested / Identify needs

Design & Prototype: Make it work for sample dataset

Refine & Generalize

Productionize

Document & Release

Maintain & Support: Keep it running, feature requests & bug fixes

8. EXPECT TO REFINE AND POLISH

REFINE & POLISH

UX / UI

Color

Animation

Mobile support

Performance: Loading time, data file size

“The little of visualisation design” by Andy Kirk http://www.visualisingdata.com/2016/03/little-visualisation-design/

9. EXPECT TO GET FEEDBACK

FEEDBACK

Logging

User study

Forum, User group

Office hours

10. EXPECT TO IMPROVE

HOW TO BE BETTER?

Time is limited.

Grow the team

Expand skills

Improve tooling: Solve a problem once and for all

Automate repetitive tasks

http://twitter.github.io/labella.js

Demo / Labella.js

https://github.com/twitter/d3kit

Demo / d3Kit

http://www.slideshare.net/kristw/d3kit

yeoman.io

Demo / Yeoman

SUMMARY

INPUT YOU OUTPUT

EXPECT

1) potential mismatches

2) different requirements

3) to clean data

4) to clean data a lot

5) to try and break things

6) to iterate until it works

7) deadline

8) to refine and polish

9) to get feedback

10) to improve

Krist Wongsuphasawat / @kristw

kristw.yellowpigz.com

#VOTE

ACKNOWLEDGEMENT

Nicolas Garcia Belmonte, Robert Harris, Miguel Rios, Simon Rogers, Jimmy Lin, Linus Lee, Chuang Liu, and many colleagues at Twitter.

RESOURCES

Images

Banana phone http://goo.gl/GmcMPq
Bar chart https://goo.gl/1G1GBg
Boss https://goo.gl/gcY8Kw
Champions League http://goo.gl/DjtNKE
Database http://goo.gl/5N7zZz
Fishing shark http://goo.gl/2fp4zW
Globe visualization http://goo.gl/UiGMMj
Harry Potter http://goo.gl/Q9Cy64
Holding phone http://goo.gl/It2TzH
Kiwi orange http://goo.gl/ejQ73y
Kiwi http://goo.gl/9yk7o5
Library https://goo.gl/HVeE6h
Library earthquake http://goo.gl/rBqBrs
Minion http://goo.gl/I19Ijg
NBA http://goo.gl/p7HBdG
NFL http://goo.gl/feQMZs
Orange & Apple http://goo.gl/NG6RIL
Pile of paper http://goo.gl/mGLQTx
Premier League http://goo.gl/AqIINO
Scrooge McDuck https://goo.gl/aKv8D7
The Sound of Music https://goo.gl/dqHlzj
Trash pile http://goo.gl/OsFfo3
Tyrion http://goo.gl/WaBonl
Watercolor Map by Stamen Design

THANK YOU

QUESTIONS?
