nosql activity - the stanford university infolab

Outline

● What is NoSQL?● What is MapReduce/Hadoop?● Example MapReduce Problem● Exercise: Write your own queries in

Hadoop!

● “No SQL”○ No relational database

● Umbrella term for many different types of datastores○ Key-Value Stores, Document Stores, Graph

Database systems, etc.● (Really, it’s more like “Not Only SQL” – we don’t

want to abandon the relational DBMS entirely)

What is NoSQL?

Why NoSQL?

● In general, we want our databases to be:○ Convenient○ Reliable○ Safe○ Scalable○ Efficient

Why NoSQL?

● In general, we want our databases to be:○ Convenient○ Reliable○ Safe○ Scalable○ Efficient

● Nowadays, we care a lot more about scalability and efficiency

Gaining in popularity…..

…..but still got a long ways to go

RethinkDBRavenDB

Amazon DynamoDB

VoldemortBerkeleyDB

Amazon SimpleDB

Cloudata

ThruDB

Terrastore

RaptorDB

FoundationDB

LevelDB

What is MapReduce?

● Created in 2004 at Google● Problem: 100’s of data files distributed

across 1,000’s of machines○ how do we get that information, quickly?

● Solution: Extract the data from the files in parallel○ take advantage of the fact that the data is

distributed over 1,000’s of machines

What is MapReduce?

● No data model, data stored in files● Primarily used on distributed filesystems● Users provide two functions

○ map function (data transformation)○ reduce function (data aggregation)

● System takes care of parallelizing the process

Mapping and Reducing

● map: divide problem into subproblems○ input: single line from data file○ output: 0 or more (key, value) pairs

● reduce: work on each subproblem, combine results○ input: (key, list of values)○ output: 0 or more output records

What is Hadoop?

● An open source implementation of MapReduce○ Google didn’t want to share :-(

● Also used over distributed filesystems● Same mechanics as Google’s version of

MapReduce

Example Problem: Word Counts How now brown cowBrown cow is blue

Example Problem: Word Counts

How now brown cow

Brown cow is blue

InputHow now brown cowBrown cow is blue


How now brown cow

Brown cow is blue

Input MapHow now brown cowBrown cow is blue


How now brown cow

Brown cow is blue

Input Map (how, 1)

(now, 1)

(brown, 1)

(cow, 1)

(brown, 1)

(cow, 1)

(is, 1)

(blue, 1)

How now brown cowBrown cow is blue


How now brown cow

Brown cow is blue

Input Map (how, 1)

(now, 1)

(brown, 1)

(cow, 1)

(brown, 1)

(cow, 1)

(is, 1)

(blue, 1)


Reduce


How now brown cow

Brown cow is blue

Input Map (how, 1)

(now, 1)

(brown, 1)

(cow, 1)

(brown, 1)

(cow, 1)

(is, 1)

(blue, 1)


Reduce

how, 1now, 1brown, 2cow, 2is, 1blue, 1

Now it’s your turn!● ssh into corn, and copy the Hadoop starter code

○ cp -r /usr/class/cs145/NoSQL-activity .

○ cd NoSQL-activity/● Run the initialization script

○ local-hadoop/start-local-hadoop.py○ Don’t forget to run local-hadoop/stop-

local-hadoop.py before you log out!

Query #1: Word Counts (again!)● (We’ll do this one together.)● Starter code can be found in src/query1● Dataset can be found in /usr/class/cs145/NoSQL-data● Compile and run your code using query1-wordcount.sh● Results will show up in output1/ directory

○ check results by running diff output1/part-00000 /usr/class/cs145/NoSQL-answers/output1/part-00000

Query #2: Hashtag Counts

● Count the number of times each Hashtag appears in the Twitter dataset○ a hashtag is a term that starts with ‘#’

● Answer should be of the form:○ <Hashtag> <Integer>

● How many times does #goStanford appear in the dataset?

Query #3: Inverted Index on Mentions● Create a mapping from a Twitter

username to a list of Tweets that the username appears in○ A username always starts with ‘@’

● Answer should be of the form:○ <@Username> <List of Tweet ID Numbers>

● What Tweet IDs include mentions of @BillCosby? @AndrewYNg?

nosql activity - the stanford university infolab

Documents