
Bill Howe, PhD

Introduction to Data Science

This morning

• Context for “Data Science”
• Databases and Relational Algebra
• NoSQL

“The intuition behind this ought to be very simple: Mr. Obama is maintaining leads in the polls in Ohio and other states that are sufficient for him to win 270 electoral votes.”

Nate Silver, Oct. 26, 2012

“…the argument we’re making is exceedingly simple. Here it is: Obama’s ahead in Ohio.”

Nate Silver, Nov. 2, 2012

“The bar set by the competition was invitingly low. Someone could look like a genius simply by doing some fairly basic research into what really has predictive power in a political campaign.”

Nate Silver, Nov. 10, 2012


Nate Silver

“…the biggest win came from good old SQL on a Vertica data warehouse and from providing access to data to dozens of analytics staffers who could follow their own curiosity and distill and analyze data as they needed.”

Dan Woods, Jan. 13, 2013, CITO Research

“The decision was made to have Hadoop do the aggregate generations and anything not real-time, but then have Vertica to answer sort of ‘speed-of-thought’ queries about all the data.”

Josh Hendler, CTO of H & K Strategies

Related: Obama campaign’s data-driven ground game

“In the 21st century, the candidate with [the] best data, merged with the best messages dictated by that data, wins.”

Andrew Rasiej, Personal Democracy Forum

Hurricane Sandy Josef Fruehwald


Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030

1) Convert all the digitized books in the 20th century into n-grams (Thanks, Google!)


2) Label each 1-gram (word) with a mood score. (Thanks, WordNet!)

3) Count the occurrences of each mood word

A 1-gram: “yesterday”
A 5-gram: “analysis is often described as”
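The three steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the sample sentence and the `MOOD_WORDS` lexicon are made up for the example and stand in for the Google n-gram data and the WordNet-derived mood lists.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical mood lexicon: word -> mood label (a stand-in for a real word list).
MOOD_WORDS = {"happy": "joy", "glad": "joy", "afraid": "fear", "sad": "sadness"}

def mood_counts(text):
    """Steps 2 and 3: label each 1-gram with a mood and count the labels."""
    tokens = text.lower().split()
    counts = Counter()
    for (word,) in ngrams(tokens, 1):
        if word in MOOD_WORDS:
            counts[MOOD_WORDS[word]] += 1
    return counts

corpus = "I was glad yesterday but sad and afraid today and happy again"
print(ngrams(corpus.split()[:6], 5))   # the two 5-grams in the first six words
print(mood_counts(corpus))
```

A real analysis would normalize these counts by the total number of words per year before comparing decades.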


…
2. Michel J-P, Shen YK, Aiden AP, Veres A, Gray MK, et al. (2011) Quantitative analysis of culture using millions of digitized books. Science 331: 176–182. doi: 10.1126/science.1199644.
3. Lieberman E, Michel J-P, Jackson J, Tang T, Nowak MA (2007) Quantifying the evolutionary dynamics of language. Nature 449: 713–716. doi: 10.1038/nature06137.
4. Pagel M, Atkinson QD, Meade A (2007) Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature 449: 717–720. doi: 10.1038/nature06176.
…
6. DeWall CN, Pond RS Jr, Campbell WK, Twenge JM (2011) Tuning in to Psychological Change: Linguistic Markers of Psychological Traits and Emotions Over Time in Popular U.S. Song Lyrics. Psychology of Aesthetics, Creativity and the Arts 5: 200–207. doi: 10.1037/a0023195.
…

What is Data Science?

• Fortune – “Hot New Gig in Tech”

• Hal Varian, Google’s Chief Economist, NYT, 2009:
– “The next sexy job”
– “The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”

• Mike Driscoll, CEO of Metamarkets:
– “Data science, as it's practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.”
– “Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what's possible.”

Drew Conway’s Data Science Venn Diagram

What do data scientists do?

“They need to find nuggets of truth in data and then explain it to the business leaders”

Data scientists “tend to be ‘hard scientists’, particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem.”

-- DJ Patil, Chief Scientist at LinkedIn

-- Richard Snee, EMC

Mike Driscoll’s three sexy skills of data geeks

• Statistics – traditional analysis

• Data Munging – parsing, scraping, and formatting data

• Visualization – graphs, tools, etc.

“Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management and preservation of large collections of information.”

Jeffrey Stanton, Syracuse University School of Information Studies

An Introduction to Data Science

Data Science is about Data Products

• “Data-driven apps”
– Spellchecker
– Machine Translator

• Interactive visualizations
– Google flu application
– Global Burden of Disease

• Online Databases
– Enterprise data warehouse
– Sloan Digital Sky Survey

(Mike Loukides)

Data science is about building data products, not just answering questions

Data products empower others to use the data.

May help communicate your results (e.g., Nate Silver’s maps)

May empower others to do their own analysis (e.g., Global Burden of Disease)

A Typical Data Science Workflow

1) Preparing to run a model

2) Running the model

3) Interpreting the results

Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging

“80% of the work”

-- Aaron Kimball

“The other 80% of the work”




What are the abstractions of data science?

“Data Jujitsu”
“Data Wrangling”
“Data Munging”

Translation: “We have no idea what this is all about”

1850s: matrices and linear algebra (today: engineers and scientists)
1950s: arrays and custom algorithms (today: C/Fortran performance junkies)
1950s: s-expressions and pure functions (today: language purists)
1960s: objects and methods (today: software engineers)
1970s: files and scripts (today: system administrators)
1970s: relations and relational algebra (today: large-scale data engineers)
1980s: data frames and functions (today: statisticians)
2000s: key-value pairs + one of the above (today: NoSQL hipsters)

But what are the abstractions of data science?

Pre-Relational: if your data changed, your application broke.

Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.

“Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”

Key Ideas: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation

Relational Database History

-- Codd 1970

Key Idea: “Physical Data Independence”

physical data independence

files and pointers


SELECT seq FROM ncbi_sequences WHERE seq = 'GATTACGATATTA';

f = fopen("table_file", "rb");
fseek(f, 10030440, SEEK_SET);
while (1) {
    fread(buf, 1, 8192, f);
    if (memcmp(buf, "GATTACGATATTA", 13) == 0) {
        . . .
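One way to see physical data independence in action is to change the storage layout underneath a query and check that the query text and its answer stay the same. The sketch below uses Python's built-in `sqlite3` module; the table name comes from the slide, while the rows and the index name `idx_seq` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ncbi_sequences (id INTEGER PRIMARY KEY, seq TEXT)")
conn.executemany(
    "INSERT INTO ncbi_sequences (seq) VALUES (?)",
    [("GATTACGATATTA",), ("CCGTA",), ("TTAGGC",)],  # made-up rows
)

query = "SELECT seq FROM ncbi_sequences WHERE seq = ?"

# Run the query against the initial physical layout (no index on seq).
before = conn.execute(query, ("GATTACGATATTA",)).fetchall()

# Change the physical representation by adding an index.
# The query text does not change; only the access path may.
conn.execute("CREATE INDEX idx_seq ON ncbi_sequences (seq)")
after = conn.execute(query, ("GATTACGATATTA",)).fetchall()

print(before == after, after)
```

The file-and-pointer version, by contrast, hard-codes a byte offset and block size, so any change to the file layout would break it.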

Key Idea: An Algebra of Tables



[Figure: two join operators]

Other operators: aggregate, union, difference, cross product
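As a rough illustration of an algebra of tables, each operator below takes one or more tables (here, lists of dicts) and returns a new table, so operators compose freely. The function names and the sample tables are assumptions made for this sketch, not part of any particular system.

```python
def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, columns):
    """Projection: keep only the named columns, removing duplicate rows."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[c] for c in columns)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(columns, key)))
    return out

def join(left, right, key):
    """Equi-join on one shared column, using simple nested loops."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

# Made-up example tables.
people = [{"pid": 1, "name": "Ann"}, {"pid": 2, "name": "Bob"}]
accounts = [{"pid": 1, "site": "a.example"}, {"pid": 1, "site": "b.example"}]

# Because every operator returns a table, expressions nest like arithmetic.
result = project(
    select(join(people, accounts, "pid"), lambda r: r["name"] == "Ann"),
    ["site"],
)
print(result)
```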

Equivalent logical expressions


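right associative:
$\sigma_{p=\text{knows}}(R) \bowtie_{o=s} \left(\sigma_{p=\text{holdsAccount}}(R) \bowtie_{o=s} \sigma_{p=\text{accountHomepage}}(R)\right)$

left associative:
$\left(\sigma_{p=\text{knows}}(R) \bowtie_{o=s} \sigma_{p=\text{holdsAccount}}(R)\right) \bowtie_{o=s} \sigma_{p=\text{accountHomepage}}(R)$

selection over a cross product:
$\sigma_{p_1=\text{knows} \,\wedge\, p_2=\text{holdsAccount} \,\wedge\, p_3=\text{accountHomepage}}(R \times R \times R)$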
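A small check of the equivalence, under the assumption that R is a table of (subject, predicate, object) triples and that joins match the object of the left side to the subject of the right side. The triples and helper names are invented for this sketch.

```python
# Made-up triples (subject, predicate, object) standing in for R.
R = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "holdsAccount", "acct1"),
    ("carol", "holdsAccount", "acct2"),
    ("acct1", "accountHomepage", "http://example.org/bob"),
    ("dave", "holdsAccount", "acct3"),
]

def sigma(pred):
    """Select triples with the given predicate, kept as (subject, object) paths."""
    return [(s, o) for (s, p, o) in R if p == pred]

def join(left, right):
    """Join two path tables where the left path's last node equals the right path's first."""
    return [l + r[1:] for l in left for r in right if l[-1] == r[0]]

knows = sigma("knows")
holds = sigma("holdsAccount")
home = sigma("accountHomepage")

right_assoc = join(knows, join(holds, home))
left_assoc = join(join(knows, holds), home)

print(sorted(right_assoc) == sorted(left_assoc))
print(right_assoc)
```

Both orders return the same rows, but they build different intermediate tables, which is why an optimizer is free to pick whichever order is cheaper.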


Why do we care? Algebraic Optimization

N = ((z*2)+((z*3)+0))/1

Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x

Apply rules 1, 3, 4, 2:
N = (2+3)*z

two operations instead of five, no division operator
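The rewrite above can be mechanized. The sketch below represents expressions as nested tuples `(op, left, right)` and applies laws 1, 2, and 3 bottom-up; law 4 (commutativity) is left out because it only reorders operands. The tuple encoding and function names are assumptions made for this illustration.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}

def simplify(e):
    """Rewrite an expression tree bottom-up using laws 1, 2 and 3."""
    if not isinstance(e, tuple):
        return e                                   # variable name or constant
    op, a, b = e[0], simplify(e[1]), simplify(e[2])
    if op == "+" and b == 0:                       # law 1: x + 0 = x
        return a
    if op == "/" and b == 1:                       # law 2: x / 1 = x
        return a
    if (op == "+" and isinstance(a, tuple) and isinstance(b, tuple)
            and a[0] == b[0] == "*" and a[1] == b[1]):
        return ("*", a[1], ("+", a[2], b[2]))      # law 3: n*x + n*y = n*(x+y)
    return (op, a, b)

def count_ops(e):
    """Number of operators in an expression tree."""
    return 0 if not isinstance(e, tuple) else 1 + count_ops(e[1]) + count_ops(e[2])

def evaluate(e, env):
    """Evaluate an expression tree given variable bindings."""
    if isinstance(e, tuple):
        return OPS[e[0]](evaluate(e[1], env), evaluate(e[2], env))
    return env[e] if isinstance(e, str) else e

N = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = simplify(N)

print(optimized, count_ops(N), count_ops(optimized))
print(evaluate(N, {"z": 7}) == evaluate(optimized, {"z": 7}))
```

A relational query optimizer works the same way in spirit: it rewrites an operator tree using algebraic laws and keeps a cheaper but equivalent plan.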

Same idea works with the Relational Algebra!

So what? RA is now ubiquitous

• Galaxy – “bioinformatics workflows”
“…Operate on Genomics Intervals -> Join”

• Pandas and Blaze: High Performance Arrays in Python
merge(left, right, on='key')

• dplyr in R
filter(x), select(x), arrange(x), group_by(x), inner_join(x, y), left_join(x, y)

• Hadoop and contemporaries all evolved to support RA-like interfaces: Pig, HIVE, Cascalog, Flume, Spark/Shark, Dremel

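| Year | System/Paper | Scale to 1000s | Primary Index | Secondary Indexes | Transactions | Joins/Analytics | Integrity Constraints | Views | Language/Algebra | Data model | my label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1971 | RDBMS | O | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | tables | sql-like |
| 2003 | memcached | ✔ | ✔ | O | O | O | O | O | O | key-val | nosql |
| 2004 | MapReduce | ✔ | O | O | O | ✔ | O | O | O | key-val | batch |
| 2005 | CouchDB | ✔ | ✔ | ✔ | record | MR | O | ✔ | O | document | nosql |
| 2006 | BigTable/Hbase | ✔ | ✔ | ✔ | record | compat. w/MR | / | O | O | ext. record | nosql |
| 2007 | MongoDB | ✔ | ✔ | ✔ | EC, record | O | O | O | O | document | nosql |
| 2007 | Dynamo | ✔ | ✔ | O | O | O | O | O | O | ext. record | nosql |
| 2008 | Pig | ✔ | O | O | O | ✔ | / | O | ✔ | tables | sql-like |
| 2008 | HIVE | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | sql-like |
| 2008 | Cassandra | ✔ | ✔ | ✔ | EC, record | O | ✔ | ✔ | O | key-val | nosql |
| 2009 | Voldemort | ✔ | ✔ | O | EC, record | O | O | O | O | key-val | nosql |
| 2009 | Riak | ✔ | ✔ | ✔ | EC, record | MR | O |  |  | key-val | nosql |
| 2010 | Dremel | ✔ | O | O | O | / | ✔ | O | ✔ | tables | sql-like |
| 2011 | Megastore | ✔ | ✔ | ✔ | entity groups | O | / | O | / | tables | nosql |
| 2011 | Tenzing | ✔ | O | O | O | O | ✔ | ✔ | ✔ | tables | sql-like |
| 2011 | Spark/Shark | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | sql-like |
| 2012 | Spanner | ✔ | ✔ | ✔ | ✔ | ? | ✔ | ✔ | ✔ | tables | sql-like |
| 2013 | Impala | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | sql-like |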

NoSQL and related systems, by feature

04/15/2023 30Bill Howe, UW



Scale was the primary motivation!

Rick Cattell’s clustering from “Scalable SQL and NoSQL Data Stores”, SIGMOD Record, 2010

extensible record stores

document stores

key-value stores




MapReduce-based Systems


non-Google open source implementation
direct influence / shared features


implementation of





NoSQL Systems

direct influence / shared features


implementation of


Voldemort Riak





A lot of these systems give up joins!

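| Year | Source | System/Paper | Scale to 1000s | Primary Index | Secondary Indexes | Transactions | Joins/Analytics | Integrity Constraints | Views | Language/Algebra | Data model | my label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1971 | many | RDBMS | O | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | tables | SQL-like |
| 2003 | other | memcached | ✔ | ✔ | O | O | O | O | O | O | key-val | lookup |
| 2004 | Google | MapReduce | ✔ | O | O | O | ✔ | O | O | O | key-val | MR |
| 2005 | couchbase | CouchDB | ✔ | ✔ | ✔ | record | MR | O | ✔ | O | document | filter/MR |
| 2006 | Google | BigTable (Hbase) | ✔ | ✔ | ✔ | record | compat. w/MR | / | O | O | ext. record | filter/MR |
| 2007 | 10gen | MongoDB | ✔ | ✔ | ✔ | EC, record | O | O | O | O | document | filter |
| 2007 | Amazon | Dynamo | ✔ | ✔ | O | O | O | O | O | O | key-val | lookup |
| 2007 | Amazon | SimpleDB | ✔ | ✔ | ✔ | O | O | O | O | O | ext. record | filter |
| 2008 | Yahoo | Pig | ✔ | O | O | O | ✔ | / | O | ✔ | tables | RA-like |
| 2008 | Facebook | HIVE | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | SQL-like |
| 2008 | Facebook | Cassandra | ✔ | ✔ | ✔ | EC, record | O | ✔ | ✔ | O | key-val | filter |
| 2009 | other | Voldemort | ✔ | ✔ | O | EC, record | O | O | O | O | key-val | lookup |
| 2009 | basho | Riak | ✔ | ✔ | ✔ | EC, record | MR | O |  |  | key-val | filter |
| 2010 | Google | Dremel | ✔ | O | O | O | / | ✔ | O | ✔ | tables | SQL-like |
| 2011 | Google | Megastore | ✔ | ✔ | ✔ | entity groups | O | / | O | / | tables | filter |
| 2011 | Google | Tenzing | ✔ | O | O | O | ✔ | ✔ | ✔ | ✔ | tables | SQL-like |
| 2011 | Berkeley | Spark/Shark | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | SQL-like |
| 2012 | Google | Spanner | ✔ | ✔ | ✔ | ✔ | ? | ✔ | ✔ | ✔ | tables | SQL-like |
| 2012 | Accumulo | Accumulo | ✔ | ✔ | ✔ | record | compat. w/MR | / | O | O | ext. record | filter |
| 2013 | Cloudera | Impala | ✔ | O | O | O | ✔ | ✔ | O | ✔ | tables | SQL-like |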

• Ex: Show all comments by “Sue” on any blog post by “Jim”

• Method 1:
– Look up all blog posts by Jim
– For each post, look up all comments and filter for “Sue”

• Method 2:
– Look up all comments by Sue
– For each comment, look up all posts and filter for “Jim”

• Method 3:
– Filter comments by Sue, filter posts by Jim
– Sort all comments by blog id, sort all blogs by blog id
– Pull one from each list to find matches
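The three methods can be sketched in Python to show that they are different physical plans for the same logical join. The `posts` and `comments` records and their field names are invented for this example; a real key-value or document store would replace the list scans with lookups.

```python
# Made-up data: posts(id, author) and comments(post_id, author, text).
posts = [{"id": 1, "author": "Jim"}, {"id": 2, "author": "Ann"}, {"id": 3, "author": "Jim"}]
comments = [
    {"post_id": 1, "author": "Sue", "text": "Nice"},
    {"post_id": 2, "author": "Sue", "text": "Hmm"},
    {"post_id": 3, "author": "Bob", "text": "OK"},
    {"post_id": 3, "author": "Sue", "text": "Agreed"},
]

def method1():
    """For each post by Jim, scan its comments and keep those by Sue."""
    out = []
    for p in posts:
        if p["author"] == "Jim":
            out += [c["text"] for c in comments
                    if c["post_id"] == p["id"] and c["author"] == "Sue"]
    return out

def method2():
    """For each comment by Sue, look up its post and keep it if Jim wrote it."""
    by_id = {p["id"]: p for p in posts}
    return [c["text"] for c in comments
            if c["author"] == "Sue" and by_id[c["post_id"]]["author"] == "Jim"]

def method3():
    """Sort-merge join: filter both sides, sort by blog id, advance two cursors."""
    ps = sorted((p for p in posts if p["author"] == "Jim"), key=lambda p: p["id"])
    cs = sorted((c for c in comments if c["author"] == "Sue"), key=lambda c: c["post_id"])
    out, i, j = [], 0, 0
    while i < len(ps) and j < len(cs):
        if ps[i]["id"] < cs[j]["post_id"]:
            i += 1
        elif ps[i]["id"] > cs[j]["post_id"]:
            j += 1
        else:
            out.append(cs[j]["text"])  # post ids are unique, so only advance comments
            j += 1
    return out

print(method1(), method2(), method3())
```

Without a join operator, the application programmer has to pick and hand-code one of these plans; with one, the system can choose based on the data.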




• Two value propositions
– Performance: “I started with MySQL, but had a hard time scaling it out in a distributed environment”
– Flexibility: “My data doesn’t conform to a rigid schema”

NoSQL Criticism

Stonebraker CACM (blog 2)


NoSQL Criticism: flexibility argument

• Who are the customers of NoSQL?
– Lots of startups

• Very few enterprises. Why? Most applications are traditional OLTP on structured data; a few other applications sit around the “edges”, but are considered less important

Stonebraker CACM (blog 2)

Some Takeaways

• Data wrangling is the hard part of data science, not statistics

• Relational algebra is the right abstraction for reasoning about data wrangling

• Even “NoSQL” systems that explicitly rejected relational concepts eventually brought them back

