Data and Society Lecture 2: Big Data 1

Fran Berman, Data and Society, CSCI 4370/6370 Data and Society Lecture 2: Big Data 1 1/26/18 Big Data, Data and Commerce, Data and the Election


Page 1: Data and Society Lecture 2: Big Data 1bermaf/Data Course 2018/Lecture 2...About Big Data [Strata] •Value of big data: analytical use, enabling new products •Ways that big data

Fran Berman, Data and Society, CSCI 4370/6370

Data and Society

Lecture 2: Big Data 1

1/26/18

Big Data, Data and Commerce, Data and the Election


Announcements 1/26

• Reminders from last time:

– Please sign attendance sheet each time you are here (your participation grade depends partly on attendance).

– If you decide to drop the class, please let me know ([email protected]) and I will let someone in on the waiting list.

– Office Hours: Friday 1-2 (AE 218) or by appointment (send email to [email protected])

• No Wednesday class next week

• Op-Ed draft due February 9 – instructions in Lecture 1

• Discussion article for next week (Friday, February 2). Please read:

– “I had my DNA Picture Taken with Varying Results”, New York Times, http://www.nytimes.com/2013/12/31/science/i-had-my-dna-picture-taken-with-varying-results.html


More guidance for op-eds

• Choose a data-related topic on which someone may need persuading, i.e. one that has some “controversy”

– You don’t have to persuade most people that “World peace is good”

• Make a persuasive argument; don’t just present an opinion

– “I like pizza” may be an opinion, but it is not an op-ed topic.

• Choose a topic that is specific enough that you can find studies, reports, etc. to back you up

– “Automation is good” is hard to justify. “Autonomous vehicles on the highways are safer than human-driven cars” is easier

• Make sure that there is a data or technology angle


Wednesday Section | Friday Lecture: First Half of Class | Friday Lecture: Second Half of Class | Assignments

January 17: NO class | January 19 L1: CLASS INTRO AND LOGISTICS | Presentation Model / Op-Ed Instructions | Op-Ed instructions

January 24: NO class | January 26 L2: BIG DATA 1 | 4 Presentations |

January 31: NO class | February 2 L3: BIG DATA 2 -- IoT | 4 Presentations |

February 7: NO class | February 9 L4: DATA AND SCIENCE | 4 Presentations | Op-Ed due Feb. 9

February 14: 5 Presentations | February 16 L5: DATA AND HEALTH / LESLIE McINTOSH GUEST SPEAKER | 4 Presentations | Op-Ed drafts returned Feb. 21

February 21: 5 Presentations | February 23 L6: DATA STEWARDSHIP AND PRESERVATION | 4 Presentations | Research Paper instructions

February 28: 5 Presentations | March 2 L7: DATA INFRASTRUCTURE | 4 Presentations | Op-Ed Final due March 2

March 7: 5 Presentations | March 9: NO CLASS / PAPER PREPARATION | |

March 14: Spring Break | March 16: SPRING BREAK | |

March 21: NO class | March 23: NO CLASS / PAPER PREPARATION | |

March 28: 5 Presentations | March 30 L8: DATA RIGHTS, POLICY, REGULATION | 4 Presentations | Research Paper due March 28

April 4: NO class | April 6 L9: DATA AND ETHICS | 4 Presentations |

April 11: 5 Presentations | April 13 L10: DATA AND COMMUNICATION | 4 Presentations |

April 18: 5 Presentations | April 20 L10: DATA FUTURES | 4 Presentations |

April 25: 5 Presentations | April 27 L11: HOT TOPICS / TBD | |


Today (1/26/18)

• Lecture 2: Big Data Applications

• Discussion

• Break

• 4 Student Presentations


Lecture 2: Big Data Applications


Outline

• What is big data and what does it tell us?

• Data-driven Commerce: Data and Target

• Data-driven Politics: Data and the 2016 Election

• Discussion:

• “Eight (no Nine) Problems with Big Data”, New York Times, https://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html


What is big data?

• Wikipedia: “Broad term for data sets so large or complex that traditional data processing applications are inadequate.”

• McKinsey: “Datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze”

• O’Reilly Radar: “Data that exceeds the processing capacity of conventional database systems. The data that is too big, moves too fast, or doesn’t fit the structures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”


What does big data tell us?

• Big data is often noisy, dynamic, heterogeneous, inter-related and untrustworthy. Why do we find it useful?

– General statistics obtained from frequent patterns and correlation analysis can disclose more reliable hidden patterns and knowledge

– Interconnected big data forms large heterogeneous information networks, with which information redundancy can be explored to compensate for missing data, cross check conflicting cases, validate trustworthy relationships, disclose inherent clusters, and uncover hidden relationships and models.


Big data visualization from the Cooper Hewitt Design Museum

(Thanks to Sarah Schattschneider!)

• “Flight Patterns” by Aaron Koblin: https://www.youtube.com/watch?v=ttH7sQ48n5k

• From https://collection.cooperhewitt.org/objects/68743525/: “Flight Patterns is a data visualization project that traces domestic airline traffic during a single 24-hour period over North America. Flight paths, using datasets provided by the Federal Aviation Administration, are rendered as arced trajectories. The result is a stunning visual animation that elegantly renders air traffic data as cartography.”


About Big Data [Strata]

• Value of big data: analytical use, enabling new products

• Ways that big data impacts infrastructure

– Volume: big data calls for scalable storage and a distributed approach to querying

– Velocity: big data infrastructure must adapt to the speed of the input and the need for quick analysis and turnaround. Need for stream processing technologies

– Variety: Source data often “messy”, non-homogeneous, unstructured. Infrastructure must organize and find meaning from it.


Big Data – Potential for Innovation

[Figure: diagram with axes “Things You Know” / “Things You Don’t Know” and “Questions You’re Asking” / “Questions You Haven’t Thought Of”; labeled regions: Conventional Data Analytics, Data Acquisition, Data-enabled Exploration, BIG DATA]

Image adapted from NIST. Original credit: Jason Kolb, Applied Data Labs; modified from the original at: www.applieddatalabs.com/content/new-reality-business-intelligence-and-big-data


How is big data useful in industry?

• (Big) data is being used by virtually every industry and is being used to boost/improve production

• Big data contributing to new ways of creating value:

– Creating transparency

– Enabling experimentation to discover needs, expose variability and improve performance

– Segmenting populations to customize actions

– Replacing / supporting human decision making with automated algorithms

– Supporting new business models, products, services

• Big data becoming a competitive advantage and means of industry growth

• Big data enabling substantial growth in productivity and customer satisfaction.

• Big data enabling new insights and discoveries


Big data can mean big profits


McKinsey’s take on Big Data (circa 2011)


Capitalizing on big data – easy or hard?

Road blocks to capturing the value of big data

• Need for data policy – privacy, security, intellectual property and liability (ownership, rights, fair use, etc.)

• Need for new and evolving systems, technologies and techniques for managing and leveraging big data

• Need for new practice, policy and infrastructure to gain/provide access to data

• Need to evolve / change industry structure and culture


Beware of too much inference from Big Data! Correlation vs. Causation

• Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. [http://whatis.techtarget.com/definition/correlation]

• Causation, or causality, is the capacity of one variable to influence another. The first variable may bring the second into existence or may cause the incidence of the second variable to fluctuate.

• Causation is often confused with correlation, which indicates the extent to which two variables tend to increase or decrease in parallel. However, correlation by itself does not imply causation. There may be a third factor, for example, that is responsible for the fluctuations in both variables. [http://whatis.techtarget.com/definition/causation]

Correlations from Spurious correlations: http://www.tylervigen.com/spurious-correlations
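The "third factor" point above can be illustrated with a minimal sketch. Everything in it is synthetic and assumed for illustration: two made-up variables `a` and `b` never influence each other, but both depend on a hidden variable `third`. They still come out strongly correlated, and the correlation largely disappears once the third factor is subtracted out.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Hypothetical hidden third factor that drives both observed variables.
third = [rng.gauss(0, 1) for _ in range(n)]

# a and b each equal the third factor plus independent noise;
# neither one is computed from the other.
a = [t + rng.gauss(0, 0.5) for t in third]
b = [t + rng.gauss(0, 0.5) for t in third]

r_raw = pearson(a, b)

# Subtract the third factor's contribution and correlate what is left.
a_resid = [x - t for x, t in zip(a, third)]
b_resid = [y - t for y, t in zip(b, third)]
r_controlled = pearson(a_resid, b_resid)

print(f"correlation of a and b:            {r_raw:.2f}")
print(f"after removing the third factor:   {r_controlled:.2f}")
```

In real data the third factor is usually unknown or unmeasured, which is why a strong correlation alone cannot settle a causal question.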


Data-Driven Commerce


Predictive Analytics

• Retailers highly interested in the buying habits of their customers: what you like, what you need, which coupons will help draw you to their store, etc.

• Retailers also use highly sophisticated models of human behavior: buying behavior, formation of habits, etc. to help determine how to best draw customers

• Many retailers, including Target, are hiring statisticians, mathematicians and data scientists to improve the bottom line through strategic marketing


Predictive Analytics at Target

• Target develops profile of customer information for each customer

– Information indexed by a unique guest ID number: credit card information, name, email address, purchases, demographic information as available, etc.

– Information is collected by Target or bought from other sources (information available includes ethnicity, job history, magazines you read, if you’ve declared bankruptcy or gotten divorced, what kinds of topics you talk about online, etc.)

• Retailers know that at major life events, old routines fall apart and usual brand loyalties and buying habits are in flux: graduating from college, birth of a child, moving to a new area / town, etc.

• Target wanted to focus on the life event of having a child

– New parents will develop new buying routines for diapers, toys, lotion, baby food, clothes, etc.

– If Target can change the buying habits of new parents before the birth of the baby, they are pre-competitive and can win big


Marketing to Pregnant Women

• Target statistician Andrew Pole analyzed data from customers who had signed up in Target’s baby registry

• Analyses identified ~25 products that, when analyzed together, contributed to a “pregnancy prediction” score (e.g. unscented lotion, vitamin supplements, etc.). Score also estimated due date.

• Target used the pregnancy prediction score and estimated due date to identify which Target customers to send baby product coupons to, which coupons to send, and when

• Anecdote:
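The idea of combining purchases into a single prediction score can be sketched as a weighted sum. This is only an illustration of the general technique, not Target's model: the weights, the threshold, and the third product are invented here, and only "unscented lotion" and "vitamin supplements" come from the slide.

```python
# Hypothetical point values per signal product (all numbers are assumptions).
PRODUCT_POINTS = {
    "unscented lotion": 30,      # named on the slide; weight invented
    "vitamin supplements": 25,   # named on the slide; weight invented
    "cotton balls": 15,          # hypothetical product and weight
}


def prediction_score(basket, points=PRODUCT_POINTS):
    """Add up the points of the distinct signal products in a basket."""
    return sum(points.get(item, 0) for item in set(basket))


def should_send_coupons(basket, threshold=50):
    """Flag a customer when the score reaches a (hypothetical) threshold."""
    return prediction_score(basket) >= threshold


basket = ["unscented lotion", "vitamin supplements", "bread"]
print(prediction_score(basket))        # score for the example basket
print(should_send_coupons(basket))     # whether it crosses the threshold
print(should_send_coupons(["bread"]))  # a basket with no signal products
```

A real model would be fitted to registry data rather than hand-weighted, and would also need the due-date estimate the slide mentions, which this sketch leaves out.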


Minimizing the “creepiness factor”

• Behavioral research and data analysis helping drive much more in-depth predictive analytics

• Combining prediction and analysis with marketing infrastructure:

• Target had the capacity to send customers customized ad books. Once it is determined that they are potentially pregnant, seemingly random pregnancy and baby products can be included with other ads that accurately target the consumer.

• Company began to mix baby products with other things (e.g. lawn mowers, wineglasses, etc.)

• Customers found this less creepy and used the baby coupons


Personalized marketing

• Soon after the new ad campaign, Target’s “Mom and Baby” sales greatly increased, and Target’s overall revenues grew over time ($44B in 2002 to $67B in 2010)

• Similar data mining approach being used in many, many stores and businesses: department stores, Facebook, Google, etc.

• Key issues about privacy remain and your rights within the burgeoning market for data about you are yet to be sorted out.


21 Things “Big Data” Knows about You (Forbes) -- 1

http://www.forbes.com/sites/bernardmarr/2016/03/08/21-scary-things-big-data-knows-about-you/#23aec89b66a7

1. Your browser knows what you’ve searched for.

2. Google also knows your age and gender — even if you never told them. They make a pretty comprehensive ads profile of you, including a list of your interests (which you can edit) to decide what kinds of ads to show you.

3. Facebook knows when your relationship is going south. Based on activities and status updates on Facebook, the company can predict (with scary accuracy) whether or not your relationship is going to last.

4. Google knows where you’ve travelled, especially if you have an Android phone.

5. And the police know where you’re driving right now — at least in the U.K., where closed circuit televisions (CCTV) are ubiquitous. Police have access to data from thousands of networked cameras across the country, which scan license plates and take photographs of each car and their driver. In the U.S., many cities have traffic cameras that can be used similarly.

6. Your phone also knows how fast you were going when you were traveling. (Be glad they don’t share that information with the police!)

7. Your phone has also probably deduced where you live and work.

8. The Internet knows where your cat lives. Using the hidden meta-data about the geographic location of where the photo was taken which we share when we publish photos of our cats on sites like Instagram and other social media networks.

9. Your credit card company knows what you buy. Of course your credit card company knows what you buy and where, but this has raised concerns that what you buy and where you shop might impact your credit score. They can use your purchasing data to decide if you’re a credit risk.

10. Your grocery store knows what brands you like. For every point a grocery store or pharmacy doles out, they’re collecting mountains of data about your purchasing habits and preferences. The chains are using the data to serve up personalized experiences when you visit their websites, personalized coupon offers, and more.


21 Things “Big Data” Knows about You (Forbes) -- 2

http://www.forbes.com/sites/bernardmarr/2016/03/08/21-scary-things-big-data-knows-about-you/#23aec89b66a7

11. HR knows when you’re going to quit your job. An HR software company called Workday is testing out an algorithm that analyzes text in documents and can predict from that information, which employees are likely to leave the company.

12. Target knows if you’re pregnant. (Sometimes even before your family does.)

13. YouTube knows what videos you’ve been watching. And even what you’ve searched for on YouTube.

14. Amazon knows what you like to read, Netflix knows what you like to watch. Even your public library knows what kinds of media you like to consume.

15. Apple and Google know what you ask Siri and Cortana.

16. Your child’s Barbie doll is also telling Mattel what she and your child talk about.

17. Police departments in some major cities, including Chicago and Kansas City, know you’re going to commit a crime -- before you do it.

18. Your auto insurance company knows when and where you drive -- and they may penalize you for it, even if you’ve never filed a claim.

19. Data brokers can help unscrupulous companies identify vulnerable consumers. For example, they may identify a population as a “credit-crunched city family” and then direct advertisements at you for payday loans.

20. Facebook knows how intelligent you are, how satisfied you are with your life, and whether you are emotionally stable or not – simply based on a big data analysis of the ‘likes’ you have clicked.

21. Your apps may have access to a lot of your personal data. Angry Birds gets access to your contact list in your phone and your physical location. Bejeweled wants to know your phone number. Some apps even access your microphone to record what’s going on around you while you use them.


Big Data and the 2016 Election


What Happened?

Why were most predictions off?

• Were the models wrong?

• Was the data wrong?

• Were the samples wrong?

• Were the interpretations wrong?

• Was voter behavior just one of the low probability outcomes?

Map from http://www.270towin.com/2016_Election/interactive_map


How people voted: Exit Polls and Election Results

Characteristic | Breakdown(s)

Age | • 18-29: 55% for Clinton, 37% for Trump • 30-44: 50% for Clinton, 42% for Trump • 45+: 53% for Trump • 45-64: 44% for Clinton • 65+: 45% for Clinton

Gender | • 54% of women voted for Clinton; 42% of women voted for Trump • 53% of men voted for Trump; 41% of men voted for Clinton

Ethnicity | • White voters: 58% for Trump, 37% for Clinton • Black voters: 88% for Clinton; 8% for Trump • Hispanic and Asian voters: 65% for Clinton; 29% for Trump

Education | • College grads: 49% for Clinton • Postgrads: 58% for Clinton • High school or less: 51% for Trump • Some college / Associate degree: 52% for Trump

Religion | • Catholic: 52% for Trump • Protestants / Christians: 58% for Trump • Jewish: 71% for Clinton • Other: 62% for Clinton • No religion: 68% for Clinton

Income | • Under $30K: 53% for Clinton • $30K-$49.99K: 51% for Clinton • $50K-$99.99K: 50% for Trump • $100K-$199.99K: 48% for Trump • $250K+: 48% for Trump

Locale (Urban vs. Rural) | • Cities with > 50K residents: 59% for Clinton, 35% for Trump • Rural areas: 62% for Trump, 34% for Clinton • Suburbs: 50% for Trump, 45% for Clinton


Voter stats

All Americans: 320,000,000+

Voting age population: 251,107,404 (78.5%)

Eligible voters: 231,556,622 (72.4%)

Registered voters: ~200,000,000 (62.5%)

Voters: 138,884,643 (43.4% of all citizens, 60% of eligible voters)

Statistics from: http://heavy.com/news/2016/11/eligible-voter-turnout-for-2016-data-hillary-clinton-donald-trump-republican-democrat-popular-vote-registered-results/
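The percentages on this slide follow directly from the counts. A short check, assuming the base for each percentage is the 320,000,000 "All Americans" figure (the slide gives it as "320,000,000+", so the results are approximate):

```python
# Counts as given on the slide; variable names are just for this sketch.
population = 320_000_000   # "All Americans" (slide says 320,000,000+)
voting_age = 251_107_404
eligible = 231_556_622
registered = 200_000_000   # slide gives this as approximate (~)
voters = 138_884_643


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


print("voting age / population:", pct(voting_age, population))
print("eligible / population:  ", pct(eligible, population))
print("registered / population:", pct(registered, population))
print("voters / population:    ", pct(voters, population))
print("voters / eligible:      ", pct(voters, eligible))
```

The last two lines show why the choice of denominator matters: the same turnout reads as roughly 43% or roughly 60% depending on whether it is measured against everyone or against eligible voters.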


What Happened?

Why were most predictions off?

• Were the models wrong?

• Were the interpretations wrong?

• Was the data wrong?

• Were the samples wrong?

• Was voter behavior just one of the low probability outcomes?

Map from http://www.270towin.com/2016_Election/interactive_map


Model and Interpretation Accuracy

Many challenges in modeling and interpretation:

• Raw polling data supplemented by estimates on how many people will vote and what undecided voters will do

• Historical inferences about past patterns of turnout, demographics, economic conditions and party loyalty may not be accurate for present day

• If a poll shows that a candidate “wins” by a small margin that is within the margin of error, it is risky to interpret this as a “win”

From: https://projects.fivethirtyeight.com/2016-election-forecast/

From: http://www.latimes.com/politics/la-na-pol-usc-latimes-poll-20161108-story.html


Data Integrity – Was poll data accurate?

• Many suspected that people lied about voting for Trump

• Trafalgar Group’s approach to improving data accuracy -- Adjust numbers to account for people’s hesitance to admit a Trump vote

– Used robotic calls, with which Trump voters seemed more comfortable

– Added a “neighbor” question -- Who do you think your neighbors will vote for? – and checked to see if the numbers were different

– Created a demographic of people who had not voted in 6+ years but planned to vote for Trump

• Trafalgar predicted Trump win in Pennsylvania and Michigan (but not all states)


Sampling Accuracy

Figure from http://www.forbes.com/sites/startswithabang/2016/11/09/the-science-of-error-how-polling-botched-the-2016-election/#75748a437da8

Key sampling questions

• How representative is the sample of the population?

• How biased are the sampling vehicles – land lines, human interviews, tweets, non-self screening respondents, etc.?

• How representative is the sample of turnout? For eligible voters? For eligible voters who actually vote?

• How accurate is the data (are people lying)?

• How big is the sample / what is the margin of error?
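The last question can be made concrete with the textbook margin-of-error formula for a proportion. This is a minimal sketch that assumes a simple random sample and a 95% confidence level (z = 1.96); real polls use weighting and other adjustments that widen the margin, and the poll numbers below are hypothetical.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a share p from a simple random sample of n."""
    return z * math.sqrt(p * (1 - p) / n)


def lead_margin_of_error(p_a, p_b, n, z=1.96):
    """Margin of error on the lead (p_a - p_b) when both shares come from the same sample."""
    variance = (p_a + p_b - (p_a - p_b) ** 2) / n
    return z * math.sqrt(variance)


def lead_is_outside_margin(p_a, p_b, n, z=1.96):
    """True when the observed lead is larger than its margin of error."""
    return abs(p_a - p_b) > lead_margin_of_error(p_a, p_b, n, z)


# Hypothetical poll: 47% vs 45% among 1,000 respondents.
print(f"margin on one share: {margin_of_error(0.47, 1000):.3f}")
print(f"margin on the lead:  {lead_margin_of_error(0.47, 0.45, 1000):.3f}")
print("lead outside margin?", lead_is_outside_margin(0.47, 0.45, 1000))
```

Note that the margin on the lead is larger than the margin on either share alone, so a two-point lead in a poll of this size says little about who is ahead.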


Who made correct predictions?

• Investor’s Business Daily (IBD/TIPP poll)

Predicted: Trump would win by 1.6%

Approach:

– Start with random sample from public, adjust for census statistics and age, gender, religion, look at registered and likely voters, adjust for party registrations, enthusiasm

– Poll made more calls to smartphones than landlines.

– People represented wide range of people in the country (including a representative sample of types of phones used)

– Poll questioned respondents about enthusiasm and factored this into results

• USC / LA Times poll (USC economics prof Arie Kapteyn)

Predicted: Trump would win by 3%

Approach:

– Pollsters sought to balance both big groups (e.g. men and women) and smaller groups (e.g. young minority voters)

– Weighting of responses in polls used to make them more fully representative. Sample includes representation of demographic statistics including race, gender, age.
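Weighting responses to make a sample more representative can be sketched as follows. This is a generic post-stratification example, not the actual procedure of either poll above; the two groups, the respondent counts and the population shares are all hypothetical.

```python
def raw_support(sample):
    """Unweighted share of supporters across all respondents."""
    total = sum(respondents for respondents, _ in sample.values())
    supporters = sum(count for _, count in sample.values())
    return supporters / total


def weighted_support(sample, population_share):
    """Weight each group's support rate by that group's share of the population."""
    estimate = 0.0
    for group, (respondents, supporters) in sample.items():
        estimate += population_share[group] * supporters / respondents
    return estimate


# Hypothetical sample: group -> (respondents, supporters of candidate A).
# The under-45 group is under-represented relative to the population.
sample = {"under_45": (200, 120), "45_plus": (800, 360)}

# Hypothetical population shares (e.g. taken from census statistics).
population_share = {"under_45": 0.5, "45_plus": 0.5}

print(f"raw estimate:      {raw_support(sample):.3f}")
print(f"weighted estimate: {weighted_support(sample, population_share):.3f}")
```

The weighted estimate is only as good as the assumed population shares, which is one place where the turnout assumptions discussed earlier enter the result.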


Who made correct predictions?

• PrimaryModel.com / Helmut Norpoth (Stony Brook political science prof)

Predicted: Trump would win against Clinton with 87% certainty.

– Predicted the last 5 presidential elections correctly; predicted the results of every presidential election except 1 in the last 104 years.

Approach:

– Uses primaries rather than polls to predict outcomes.

– Takes “swing of the electoral pendulum” into consideration (Republicans favored after two Democratic terms)

• Allan Lichtman / American University historian

Predicted: Trump wins

Approach:

– Developed 13 T/F keys that predict election outcome. True favors incumbent party. If 6+ are false, change is predicted.

– Has worked in every election for the last 30 years.

• Lichtman’s Keys:

1. Party Mandate: After the midterm elections, the incumbent party holds more seats in the U.S. House of Representatives than after the previous midterm elections.

2. Contest: There is no serious contest for the incumbent party nomination.

3. Incumbency: The incumbent party candidate is the sitting president.

4. Third party: There is no significant third party or independent campaign.

5. Short-term economy: The economy is not in recession during the election campaign.

6. Long-term economy: Real per capita economic growth during the term equals or exceeds mean growth during the previous two terms.

7. Policy change: The incumbent administration effects major changes in national policy.

8. Social unrest: There is no sustained social unrest during the term.

9. Scandal: The incumbent administration is untainted by major scandal.

10. Foreign/military failure: The incumbent administration suffers no major failure in foreign or military affairs.

11. Foreign/military success: The incumbent administration achieves a major success in foreign or military affairs.

12. Incumbent charisma: The incumbent party candidate is charismatic or a national hero.

13. Challenger charisma: The challenging party candidate is not charismatic or a national hero.
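The decision rule stated above (13 true/false keys, true favors the incumbent party, six or more false keys predict a change) can be written as a small function. The function name and the example inputs are made up for illustration; the sketch does not assign actual key values for any election.

```python
def incumbent_party_predicted_to_win(keys):
    """Apply the rule from the slide to a list of 13 booleans.

    True means the key favors the incumbent party. If six or more keys
    are False, a change of party is predicted.
    """
    if len(keys) != 13:
        raise ValueError("expected exactly 13 keys")
    false_count = sum(1 for key in keys if not key)
    return false_count < 6


# Hypothetical inputs: five false keys versus six false keys.
print(incumbent_party_predicted_to_win([True] * 8 + [False] * 5))
print(incumbent_party_predicted_to_win([True] * 7 + [False] * 6))
```

The rule is a simple threshold over judgments; deciding whether each key is true or false is where the expertise (and the potential for bias) lies.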


Bottom Line

• Big data is a great tool but it is not a guarantee of outcomes

• Predictions and estimations are not a guarantee of outcomes

• Garbage in, garbage out

• Bias can be built into the system at many places: data collection, data sampling, models, interpretation, etc.

• There is no magic bullet …


Lecture 2 Sources 1

• “Big data: The next frontier for innovation, competition and productivity”, Report from the McKinsey Global Institute, http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation

• “What is big data?” O’Reilly Radar, http://radar.oreilly.com/2012/01/what-is-big-data.html

• “How Target figured out a teen girl was pregnant before her father did,” Forbes, http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#5df063ed34c6

• “How Companies Learn your secrets”, The New York Times, http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html?pagewanted=all&_r=0

• “Exit polls and election results – what we learned”, The Guardian, https://www.theguardian.com/us-news/2016/nov/12/exit-polls-election-results-what-we-learned

• “The Science of Error: How Polling Botched The 2016 Election”, Forbes, http://www.forbes.com/sites/startswithabang/2016/11/09/the-science-of-error-how-polling-botched-the-2016-election/#75748a437da8

• “The trouble is not with polling but with the limits to human interpretation of data,” Quartz, http://qz.com/832908/confirmation-bias-is-why-we-couldnt-predict-a-trump-victory/

Lecture 2 Sources 2

• “There Are Many Ways to Map Election Results. We’ve Tried Most of Them.”, NY Times, http://www.nytimes.com/interactive/2016/11/01/upshot/many-ways-to-map-election-results.html?_r=0

• “Trump’s win isn’t the death of data – it was flawed all along,” Wired, https://www.wired.com/2016/11/trumps-win-isnt-death-data-flawed-along/

• “2016 Election Oracles: These People Predicted Trump Would Win”, Heavy, http://heavy.com/news/2016/11/2016-final-election-results-predictions-helmut-norpoth-abramowitz-michael-moore-nate-silver-vote-count-turn-out-electoral-college-maps-donald-trump-hillary-clinton-polls-forecasting-pennsylvania-michi/

• “No, one 19-year-old Trump supporter probably isn’t distorting the polling averages all by himself”, LA Times, http://www.latimes.com/politics/la-na-pol-daybreak-poll-questions-20161013-snap-story.html

• “How IBD Accurately Gauged Voter Enthusiasm and Got the Polls Right,” Townhall, http://townhall.com/tipsheet/cortneyobrien/2016/11/11/how-ibd-got-the-polls-right-n2244109

Discussion Article

• “Eight (No, Nine!) Problems with Big Data”, New York Times, https://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html

• (From “Eight (No, Nine!) Problems with Big Data”, NY Times). Limitations of big data:

1. “… although big data is very good at detecting correlations, …, it never tells us which correlations are meaningful”

2. “ … big data can work well as an adjunct to scientific inquiry but rarely succeeds as a wholesale replacement.”

3. “ … many tools that are based on big data can be easily gamed.”

4. “ … even when the results of a big data analysis aren’t intentionally gamed, they often turn out to be less robust than they initially seem.”

5. “ … whenever the source of information for a big data analysis is itself a product of big data, opportunities for vicious cycles abound [echo chamber effect].”

6. “ … risk of too many correlations.”

7. “ … big data is prone to giving scientific-sounding solutions to hopelessly imprecise questions.”

8. “ … big data is at its best when analyzing things that are extremely common, but often falls short when analyzing things that are less common.”

9. “ … the hype.”
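Points 1 and 6 in the list above can be made concrete with a short experiment. The sketch below is an illustration only, not something from the article or the lecture: it generates 200 series of pure random noise (the counts, the 0.6 threshold, and the helper name `pearson` are arbitrary choices) and counts how many pairs look strongly correlated anyway.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


rng = random.Random(42)

# 200 unrelated "variables", each with 20 observations of pure noise.
n_series, n_points = 200, 20
series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]

# Test every pair of variables against every other one.
threshold = 0.6
total_pairs = 0
strong_pairs = 0
for x, y in itertools.combinations(series, 2):
    total_pairs += 1
    if abs(pearson(x, y)) > threshold:
        strong_pairs += 1

print(f"{strong_pairs} of {total_pairs} pairs have |r| > {threshold}")
```

None of these variables has anything to do with any other, yet some pairs clear the threshold by chance alone. The more variables a data set has, the more such accidental correlations it contains, and the data by itself gives no way to tell them from meaningful ones.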

No class next Wednesday. Class next Friday.

• Next Friday: Data and Science; Discussion

• Read for February 2 Discussion:

– “I had my DNA Picture Taken with Varying Results”, New York Times, http://www.nytimes.com/2013/12/31/science/i-had-my-dna-picture-taken-with-varying-results.html

Presentation Articles for February 2

• “Rethinking Storage in the Age of Big Data”, ComputerWeekly.com, http://www.computerweekly.com/feature/Rethinking-storage-in-the-age-of-big-data (Jiyu H)

• “Google Flu Trends: The Limits of Big Data”, New York Times, http://bits.blogs.nytimes.com/2014/03/28/google-flu-trends-the-limits-of-big-data/ (Griffin M)

• “The Shazam Effect”, The Atlantic, http://www.theatlantic.com/magazine/archive/2014/12/the-shazam-effect/382237/ (Madison W)

• “Giving Viewers What They Want,” NY Times, http://www.nytimes.com/2013/02/25/business/media/for-house-of-cards-using-big-data-to-guarantee-its-popularity.html?pagewanted=all&_r=1& (Reilly K)

Presentation Articles for February 9

• “The LSST and Big Data Science”, Astronomy, http://www.astronomy.com/news/2017/12/the-lsst-and-big-data-science (Alex C)

• “On Long Migrations, Birds Chase an Eternal Spring”, New York Times, https://www.nytimes.com/2017/01/05/science/on-long-migrations-birds-chase-an-eternal-spring.html (Ethan G)

• “Rhinoceros DNA Database Successful in Aiding Poaching Prosecutions”, The Guardian, https://www.theguardian.com/science/2018/jan/08/rhinoceros-dna-database-successful-in-aiding-poaching-prosecution (Sarah M)

• “What Is Multispectral Imaging And How Is It Changing Archaeology And Digital Humanities Today?”, Forbes, https://www.forbes.com/sites/drsarahbond/2017/11/30/what-is-multispectral-imaging-and-how-is-it-changing-archaeology-and-digital-humanities-today/#30ce91cf5151 (Tae P)

Break

Presentation Articles for Today

• “We gave four good pollsters the same raw data. They had four different results.”, NY Times, http://nyti.ms/2cWlWL7 (Josh S)

• “Silicon Valley’s Mission to Save California Ag from Dying of Thirst,” Wired, https://www.wired.com/2017/05/silicon-valley-aims-save-californias-water/ (Alagu C)

• “Where you live can have a lot to say about your health,” Washington Post, https://www.washingtonpost.com/national/health-science/where-you-live-can-have-a-lot-to-say-about-your-health/2016/12/30/6d94c510-cc73-11e6-a747-d03044780a02_story.html?utm_term=.689dd5bb5168 (Nathalie P)

• “98 Personal Data Points that Facebook uses to Target ads at You,” Washington Post, https://www.washingtonpost.com/news/the-intersect/wp/2016/08/19/98-personal-data-points-that-facebook-uses-to-target-ads-to-you/?tid=a_inl&utm_term=.c8df6d534fda (Ethan S)