
Page 1: Store and Process Big Data with Hadoop and Cassandra

Store and Process Big Data with Hadoop and Cassandra

Apache BarCamp

By Deependra Ariyadewa

WSO2, Inc.

Page 2: Store and Process Big Data with Hadoop and Cassandra

Store Data with Cassandra

● Project site : http://cassandra.apache.org

● The latest release version is 1.0.7

● Cassandra is in use at Netflix, Twitter, Urban Airship, Constant Contact, Reddit, Cisco, OpenX, Digg, CloudKick and Ooyala

● Cassandra Users : http://www.datastax.com/cassandrausers

● The largest known Cassandra cluster holds over 300 TB of data on over 400 machines.

● Commercial support http://wiki.apache.org/cassandra/ThirdPartySupport

Page 3: Store and Process Big Data with Hadoop and Cassandra

Cassandra Deployment Architecture

[Diagram: Cassandra ring. Each row key maps to a set of columns, key => {(k,v),(k,v),(k,v)}; hash(key) determines the key's position on the ring, e.g. hash(key1), hash(key2).]
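As a rough illustration of the hash(key) step (not from the original deck), a RandomPartitioner-style placement derives an MD5-based token from each row key and uses it as the key's position on the ring; the key names below are hypothetical:

import java.math.BigInteger;
import java.security.MessageDigest;

public class TokenDemo {

    // Sketch of RandomPartitioner-style hashing: MD5(key) interpreted as a
    // non-negative BigInteger gives the key's token (ring position).
    static BigInteger token(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
        return new BigInteger(digest).abs();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical keys, mirroring hash(key1) and hash(key2) in the diagram.
        System.out.println("hash(key1) = " + token("key1"));
        System.out.println("hash(key2) = " + token("key2"));
    }
}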

Page 4: Store and Process Big Data with Hadoop and Cassandra

How to Install Cassandra

● Download the artifact apache-cassandra-1.0.7-bin.tar.gz from http://cassandra.apache.org/download/

● Extract: tar -xzvf apache-cassandra-1.0.7-bin.tar.gz

● Set up folder paths:
  mkdir -p /var/log/cassandra
  chown -R `whoami` /var/log/cassandra
  mkdir -p /var/lib/cassandra
  chown -R `whoami` /var/lib/cassandra

Page 5: Store and Process Big Data with Hadoop and Cassandra

How to Configure Cassandra

Main configuration file: $CASSANDRA_HOME/conf/cassandra.yaml

cluster_name: 'Test Cluster'
seed_provider:
    - seeds: "192.168.0.121"
storage_port: 7000
listen_address: localhost
rpc_address: localhost
rpc_port: 9160

Page 6: Store and Process Big Data with Hadoop and Cassandra

Cassandra Clustering

initial_token:

partitioner: org.apache.cassandra.dht.RandomPartitioner

http://wiki.apache.org/cassandra/Operations
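If initial_token is left blank, each node picks a token for itself; the Operations wiki recommends assigning evenly spaced tokens instead. A minimal sketch (not part of the original deck) of the usual calculation for a RandomPartitioner ring, token(i) = i * 2^127 / nodeCount:

import java.math.BigInteger;

public class InitialTokens {

    public static void main(String[] args) {
        int nodeCount = 4; // hypothetical cluster size
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        for (int i = 0; i < nodeCount; i++) {
            // Evenly spaced tokens: i * 2^127 / nodeCount
            BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(nodeCount));
            System.out.println("node " + i + " initial_token: " + token);
        }
    }
}

Each computed value then goes into the corresponding node's cassandra.yaml as initial_token.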

Page 7: Store and Process Big Data with Hadoop and Cassandra

Cassandra DevOps

$CASSANDRA_HOME/bin$ ./cassandra-cli --host localhost

[default@unknown] show keyspaces;
Keyspace: system:
  Replication Strategy: org.apache.cassandra.locator.LocalStrategy
  Durable Writes: true
    Options: [replication_factor:1]
  Column Families:
    ColumnFamily: HintsColumnFamily (Super)
      "hinted handoff data"
      Key Validation Class: org.apache.cassandra.db.marshal.BytesType
      Default column value validator: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.BytesType/org.apache.cassandra.db.marshal.BytesType
      Row cache size / save period in seconds / keys to save : 0.0/0/all
      Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
      Key cache size / save period in seconds: 0.01/0
      GC grace seconds: 0
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.0
      Replicate on write: true
      Bloom Filter FP chance: default
      Built indexes: []
      Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy

Page 8: Store and Process Big Data with Hadoop and Cassandra

Cassandra CLI

[default@apache] create column family Location with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
f04561a0-60ed-11e1-0000-242d50cf1fbf
Waiting for schema agreement...
... schemas agree across the cluster

[default@apache] set Location[00001][City]='Colombo';
Value inserted.
Elapsed time: 140 msec(s).

[default@apache] list Location;
Using default limit of 100
-------------------
RowKey: 00001
=> (column=City, value=Colombo, timestamp=1330311097464000)

1 Row Returned.
Elapsed time: 122 msec(s).

Page 9: Store and Process Big Data with Hadoop and Cassandra

Store Data with Hector

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

import java.util.HashMap;
import java.util.Map;

public class ExampleHelper {

    public static final String CLUSTER_NAME = "ClusterOne";
    public static final String USERNAME_KEY = "username";
    public static final String PASSWORD_KEY = "password";
    public static final String RPC_PORT = "9160";
    public static final String CSS_NODE0 = "localhost";
    public static final String CSS_NODE1 = "css1.stratoslive.wso2.com";
    public static final String CSS_NODE2 = "css2.stratoslive.wso2.com";

    // Builds a Hector Cluster handle for the three nodes above, passing the
    // credentials through to the underlying connections.
    public static Cluster createCluster(String username, String password) {
        Map<String, String> credentials = new HashMap<String, String>();
        credentials.put(USERNAME_KEY, username);
        credentials.put(PASSWORD_KEY, password);
        String hostList = CSS_NODE0 + ":" + RPC_PORT + ","
                + CSS_NODE1 + ":" + RPC_PORT + ","
                + CSS_NODE2 + ":" + RPC_PORT;
        return HFactory.createCluster(CLUSTER_NAME,
                new CassandraHostConfigurator(hostList), credentials);
    }
}

Page 10: Store and Process Big Data with Hadoop and Cassandra

Store Data with Hector

Create keyspace:

    KeyspaceDefinition definition = new ThriftKsDef(keyspaceName);
    cluster.addKeyspace(definition);

Add column family:

    ColumnFamilyDefinition familyDefinition = new ThriftCfDef(keyspaceName, columnFamily);
    cluster.addColumnFamily(familyDefinition);

Write data:

    Mutator<String> mutator = HFactory.createMutator(keyspace, new StringSerializer());
    String columnValue = UUID.randomUUID().toString();
    mutator.insert(rowKey, columnFamily, HFactory.createStringColumn(columnName, columnValue));

Read data:

    ColumnQuery<String, String, String> columnQuery = HFactory.createStringColumnQuery(keyspace);
    columnQuery.setColumnFamily(columnFamily).setKey(key).setName(columnName);
    QueryResult<HColumn<String, String>> result = columnQuery.execute();
    HColumn<String, String> hColumn = result.get();
    System.out.println("Column: " + hColumn.getName() + " Value : " + hColumn.getValue() + "\n");
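Putting the snippets together, a minimal end-to-end sketch against the Hector 1.0-era API (not from the original deck). It reuses the ExampleHelper class from the previous slide; the keyspace name, credentials, and import locations are assumptions, and the Location column family mirrors the earlier CLI example:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.ThriftCfDef;
import me.prettyprint.cassandra.service.ThriftKsDef;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.ColumnQuery;
import me.prettyprint.hector.api.query.QueryResult;

public class HectorLocationDemo {

    public static void main(String[] args) {
        // Hypothetical keyspace name and credentials.
        String keyspaceName = "ApacheDemo";
        String columnFamily = "Location";

        Cluster cluster = ExampleHelper.createCluster("admin", "admin-password");

        // Create the schema, as on this slide.
        cluster.addKeyspace(new ThriftKsDef(keyspaceName));
        cluster.addColumnFamily(new ThriftCfDef(keyspaceName, columnFamily));

        Keyspace keyspace = HFactory.createKeyspace(keyspaceName, cluster);

        // Write: Location['00001']['City'] = 'Colombo' (same row as the CLI example).
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert("00001", columnFamily, HFactory.createStringColumn("City", "Colombo"));

        // Read the column back.
        ColumnQuery<String, String, String> query = HFactory.createStringColumnQuery(keyspace);
        query.setColumnFamily(columnFamily).setKey("00001").setName("City");
        QueryResult<HColumn<String, String>> result = query.execute();
        System.out.println("City = " + result.get().getValue());
    }
}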

Page 11: Store and Process Big Data with Hadoop and Cassandra

Variable Consistency

● ANY: Wait until some replica has responded.

● ONE: Wait until one replica has responded.

● TWO: Wait until two replicas have responded.

● THREE: Wait until three replicas have responded.

● LOCAL_QUORUM: Wait for a quorum of replicas in the datacenter where the connection was established.

● EACH_QUORUM: Wait for quorum on each datacenter.

● QUORUM: Wait for a quorum of replicas ((replication_factor / 2) + 1, e.g. 2 of 3), no matter which datacenter.

● ALL: Blocks for all the replicas before returning to the client.

Page 12: Store and Process Big Data with Hadoop and Cassandra

Variable Consistency

Create a customized Consistency Level:

ConfigurableConsistencyLevel configurableConsistencyLevel = new ConfigurableConsistencyLevel();
Map<String, HConsistencyLevel> clmap = new HashMap<String, HConsistencyLevel>();

// Use consistency level ONE for reads and writes on MyColumnFamily.
clmap.put("MyColumnFamily", HConsistencyLevel.ONE);
configurableConsistencyLevel.setReadCfConsistencyLevels(clmap);
configurableConsistencyLevel.setWriteCfConsistencyLevels(clmap);

Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", myCluster, configurableConsistencyLevel);

Page 13: Store and Process Big Data with Hadoop and Cassandra

CQL

Insert data with CQL:

cqlsh> INSERT INTO Location (KEY, City) VALUES ('00001', 'Colombo');

Retrieve data with CQL:

cqlsh> SELECT * FROM Location WHERE KEY = '00001';

Page 14: Store and Process Big Data with Hadoop and Cassandra

Apache Hadoop

● Project Site: http://hadoop.apache.org

● The latest release version is 1.0.1

● Hadoop is in use at Amazon, Yahoo, Adobe, eBay, Facebook

● Commercial support: http://hortonworks.com, http://www.cloudera.com

Page 15: Store and Process Big Data with Hadoop and Cassandra

Hadoop Deployment Architecture

Page 16: Store and Process Big Data with Hadoop and Cassandra

How to Install Hadoop

● Download the artifact from: http://hadoop.apache.org/common/releases.html

● Extract: tar -xzvf hadoop-1.0.1.tar.gz

● Copy and extract the installation on each data node:
  scp hadoop-1.0.1.tar.gz user@datanode01:/home/hadoop

● Start Hadoop: $HADOOP_HOME/bin/start-all.sh

Page 17: Store and Process Big Data with Hadoop and Cassandra

Hadoop CLI - HDFS

Format the NameNode:

$HADOOP_HOME/bin/hadoop namenode -format

File operations on HDFS:

$HADOOP_HOME/bin/hadoop dfs -lsr /
$HADOOP_HOME/bin/hadoop dfs -mkdir /users/deep/wso2
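The same operations are also available from Java through the Hadoop 1.x FileSystem API. A minimal sketch (not from the original deck), assuming the cluster's core-site.xml and hdfs-site.xml are on the classpath and reusing the path from the slide:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {

    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name and friends from the Hadoop config files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hadoop dfs -mkdir /users/deep/wso2
        fs.mkdirs(new Path("/users/deep/wso2"));

        // Rough equivalent of: hadoop dfs -ls / (listStatus is not recursive)
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }

        fs.close();
    }
}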

Page 18: Store and Process Big Data with Hadoop and Cassandra

MapReduce

[Diagram: MapReduce data flow. Source: http://developer.yahoo.com/hadoop/tutorial/module4.html]

Page 19: Store and Process Big Data with Hadoop and Cassandra

Simple MapReduce Job

Mapper:

public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Emit (word, 1) for every token in the input line.
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

Page 20: Store and Process Big Data with Hadoop and Cassandra

Simple MapReduce Job

Reducer:

public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Sum the counts emitted by the mappers for this word.
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Page 21: Store and Process Big Data with Hadoop and Cassandra

Simple MapReduce Job

Job Runner:

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
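For reference, a minimal sketch (not from the original deck) of how the pieces fit into a single driver class using the old org.apache.hadoop.mapred API shown on these slides. The Map and Reduce classes from the two previous slides are assumed to be pasted in as static nested classes where the comments indicate:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    // ... Map class from the Mapper slide goes here ...
    // ... Reduce class from the Reducer slide goes here ...

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0] = HDFS input directory, args[1] = HDFS output directory
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}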

Page 22: Store and Process Big Data with Hadoop and Cassandra

High-Level MapReduce Interfaces

● Hive

● Pig

Page 23: Store and Process Big Data with Hadoop and Cassandra

Q & A