

Daniel Kador Interview on ScaleYourCode.com

5-minute summary

Feel free to email, tweet, blog, and pass this around the web … but

please don’t alter any of its contents when you do. Thanks, and enjoy!

ScaleYourCode.com

https://scaleyourcode.com/interviews/interview/17

https://scaleyourcode.com/


Who is Dan Kador?

Daniel, or Dan, has been working on Keen for about 4 years now. Before that, he spent 5 years at Salesforce as an engineer on their developer platform team.


https://twitter.com/dkador


What is Keen.io?

Keen makes it easy for any software developer to collect data from anything

connected to the Internet. It could be a browser, mobile app, backend server, or a

wearable device.

Once data is collected, it lives on their servers where they handle the hard

problems of storing, processing, and securing data. Then, they make it really easy

for the owners of that data to ask questions using Keen's API.

For a basic example, you can send data on product purchases and query that data

to figure out how many sales of product X happened in the last day.
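To make the question concrete, here is a minimal sketch of the same count done over an in-memory list. It does not use Keen's API; the field names `product` and `timestamp` and the helper `count_events` are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

def count_events(events, product, now, window=timedelta(days=1)):
    """Count purchase events for `product` with a timestamp in (now - window, now]."""
    start = now - window
    return sum(
        1
        for e in events
        if e["product"] == product and start < e["timestamp"] <= now
    )

# Hypothetical purchase events; a real event would carry many more properties.
now = datetime(2015, 6, 1, 12, 0, tzinfo=timezone.utc)
purchases = [
    {"product": "X", "timestamp": now - timedelta(hours=2)},
    {"product": "X", "timestamp": now - timedelta(hours=30)},  # outside the window
    {"product": "Y", "timestamp": now - timedelta(hours=1)},
    {"product": "X", "timestamp": now - timedelta(minutes=5)},
]
sales_of_x = count_events(purchases, "X", now)
```

A hosted service answers the same question over far more data than fits in one list, which is where the rest of the stack comes in.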

Once you have those results, you can easily share them through charts,

dashboards, or other API integrations.

Keen has lots of different use cases but one of the main ones is for people who

want to deliver analytics to their customers. Many of their customers rely on

Keen's service for their business to work.


https://keen.io/


What does Keen's stack look like?

Running on SoftLayer

Nginx (Load balancer)

Flask (Python, apps)

Tornado (Python, for API)

Kafka (Event queuing)

Storm (Distributed computing)

Cassandra (Data Store)

They are hosted entirely in SoftLayer today. That decision was made back in 2012

because they needed more IOPS than AWS could give them at the time.

They use Nginx as a load balancer for both the API and their website.

Webservers are Python built on top of Flask. The API is Python built on top of

Tornado.

There are a couple of other services hiding underneath:

There is a data processing tier which is where most of the action is. Kafka is what

handles their event queuing.

Then they have Storm, which handles the distributed computing for both writing

data (persisting it to disk) and querying data.

Finally, they have Cassandra which is their persistence layer.


http://www.softlayer.com/

http://aws.amazon.com/

http://nginx.org/

http://flask.pocoo.org/

http://www.tornadoweb.org/en/stable/

http://kafka.apache.org/

http://storm.apache.org/

http://cassandra.apache.org/


How many requests per second? How much data are they handling?

Some of their stats aren't public, but Dan says they get thousands of requests a second. He says that the more important stat is how much data they process, which is trillions of events a day.

Events are rich JSON documents that potentially include hundreds of properties.



Why Cassandra?

Cassandra is distributed at its core: it was built from the start to be a distributed system. It has a columnar structure, which maps well to analyzing specific properties within a JSON document. It also gives them the ability to scale by adding more nodes.
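Scaling by adding nodes works because Cassandra spreads rows around a token ring using consistent hashing. The sketch below is a simplified illustration of that idea, not Cassandra's actual partitioner: the `Ring` class, the MD5-based `token` function, the node names, and the choice of 64 virtual nodes are all assumptions.

```python
import bisect
import hashlib

def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 used only for illustration)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    """A toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []   # sorted ring positions
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node claims several positions to even out its share.
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            bisect.insort(self._tokens, t)
            self._owner[t] = node

    def owner(self, key):
        # A key belongs to the first position at or after its own token, wrapping around.
        i = bisect.bisect_left(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[i]]

keys = [f"event-{i}" for i in range(10_000)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.owner(k) for k in keys}
ring.add("node-d")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only the keys that the new node takes over change owner; everything else stays where it was, which is what makes adding capacity cheap compared with resharding a single SQL table.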

Most customers they talk to have something like MySQL. They've tried creating

their events table, and if they achieve any reasonable amount of success, that

event table breaks their database. It's too much data for a single SQL node.

How did they tweak Cassandra to fit their needs?

They've had to do a lot of work to get it to work just right for them. There are

certain JVM tweaks you can do like heap size tuning and GC tuning.

There aren't any magic numbers that work here; it depends on your workload.

Daniel says that the most interesting thing they did with Cassandra was to figure

out how to store JSON documents in a columnar way. They've had to tweak things

at the data model layer like transforming groups of events into columns and then

storing those columns in a more compressed format. This makes it easier for

them to do bulk aggregations. The most important optimization has been the data

model.
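As a rough illustration of the idea described above (not Keen's actual data model, which the linked slides cover), a batch of JSON events can be pivoted into one column per property and each column compressed on its own. The helper names, the dotted property paths, and the use of `json` plus `zlib` are assumptions.

```python
import json
import zlib

def flatten(doc, prefix=""):
    """Turn nested JSON into a flat {"a.b": value} mapping."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat

def to_columns(events):
    """Pivot a batch of events into one value list per property (None = missing)."""
    rows = [flatten(e) for e in events]
    names = sorted({name for row in rows for name in row})
    return {name: [row.get(name) for row in rows] for name in names}

def pack(columns):
    """Serialize and compress each column independently."""
    return {
        name: zlib.compress(json.dumps(values).encode())
        for name, values in columns.items()
    }

def read_column(packed, name):
    """Decompress and decode a single column without touching the others."""
    return json.loads(zlib.decompress(packed[name]))

events = [
    {"product": "X", "price": 10, "user": {"country": "US"}},
    {"product": "Y", "price": 25, "user": {"country": "FR"}},
    {"product": "X", "price": 10, "user": {"country": "US", "plan": "pro"}},
]
packed = pack(to_columns(events))
total = sum(read_column(packed, "price"))
```

An aggregation over `price` only has to read and decompress the `price` column, which is why this layout suits bulk aggregations over events with hundreds of properties.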

If you're interested in learning how they figured out the JSON challenge, check

out slides by Josh Dzielak.


http://searchsoa.techtarget.com/definition/Java-virtual-machine

http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_tune_jvm_c.html

http://www.cubrid.org/blog/dev-platform/how-to-tune-java-garbage-collection/

https://speakerdeck.com/dzello/store-json-the-hard-way?slide=72


How do they use Kafka?

Kafka is a core piece of their stack because it is the layer that all incoming data is persisted to. When an event is sent to Keen, they don't consider it safe until it is persisted to Kafka. Storm then consumes from the Kafka queues and, in turn, persists the data to Cassandra.
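The flow described above can be sketched with in-memory stand-ins. None of this is Kafka, Storm, or Cassandra client code: `DurableLog`, `Consumer`, `Store`, and `ingest` are hypothetical names that only show the ordering of "append to the log, then acknowledge, then write to storage later".

```python
class DurableLog:
    """Stand-in for a Kafka topic: an append-only list addressed by offset."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

class Store:
    """Stand-in for the Cassandra persistence layer."""

    def __init__(self):
        self.rows = []

    def write(self, record):
        self.rows.append(record)

def ingest(log, event):
    """API tier: acknowledge the event only after the log append has returned."""
    offset = log.append(event)
    return {"accepted": True, "offset": offset}

class Consumer:
    """Stand-in for the Storm topology: read from the log, write to the store."""

    def __init__(self, log, store):
        self.log = log
        self.store = store
        self.position = 0  # next offset to process

    def poll(self):
        while self.position < len(self.log.records):
            self.store.write(self.log.records[self.position])
            self.position += 1

log, store = DurableLog(), Store()
acks = [ingest(log, {"n": i}) for i in range(3)]
consumer = Consumer(log, store)
consumer.poll()
```

Because the consumer tracks its own position in the log, a slow or restarted storage tier delays writes to the store without losing events that were already acknowledged.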



How do they use Storm?

Storm is a stream processing service.

All events from all customers go to the same queue. Storm is responsible for

collating all these events and collecting them into the appropriate collections (the

equivalent to tables in MySQL).
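A minimal sketch of that collation step, assuming each queued message carries a `project` and a `collection` name next to the event itself. That envelope shape is an assumption for illustration, and this is plain Python rather than a Storm topology.

```python
from collections import defaultdict

def collate(queue):
    """Group events from one shared queue by (project, collection)."""
    buckets = defaultdict(list)
    for message in queue:
        key = (message["project"], message["collection"])
        buckets[key].append(message["event"])
    return dict(buckets)

# Hypothetical messages from two customers interleaved on the same queue.
shared_queue = [
    {"project": "acme", "collection": "purchases", "event": {"product": "X"}},
    {"project": "globex", "collection": "pageviews", "event": {"path": "/"}},
    {"project": "acme", "collection": "purchases", "event": {"product": "Y"}},
]
collections = collate(shared_queue)
```

Keying every bucket by project as well as collection name keeps two customers' data apart even when they pick the same collection name.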

In addition, they make sure that they are keeping data secure and not giving

access to anyone who shouldn't have access.

They also do a bunch of data processing related to what they call add-ons. These

add-ons can include IP addresses, geo information, parsing user agent strings, and

a few other options.
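One way to picture add-ons is as a chain of small functions that each attach extra properties to an event. The sketch below is hypothetical: the function names, the `request` dictionary, the output property names, and the deliberately crude user agent check are all assumptions, not Keen's implementation.

```python
def ip_addon(event, request):
    """Attach the sender's IP address as seen by the API tier."""
    return {**event, "ip_address": request["remote_addr"]}

def user_agent_addon(event, request):
    """Attach a coarse browser name; real parsers handle far more cases."""
    ua = request.get("user_agent", "")
    if "Firefox/" in ua:
        browser = "Firefox"
    elif "Chrome/" in ua:
        browser = "Chrome"
    else:
        browser = "Other"
    return {**event, "parsed_user_agent": {"browser": browser}}

def enrich(event, request, addons):
    """Run the event through each add-on in order, returning a new dict."""
    for addon in addons:
        event = addon(event, request)
    return event

request = {
    "remote_addr": "203.0.113.7",
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/43.0 Safari/537.36",
}
event = {"product": "X"}
enriched = enrich(event, request, [ip_addon, user_agent_addon])
```

Each add-on returns a new dictionary instead of mutating its input, so the original event stays available if an enrichment step needs to be retried.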

After that, they do storage optimizations like compression and serialization.



What do they use for data visualization?

http://blog.cartodb.com/visualizing-endangered-species-trades-at-ecohacknyc/

I told Dan I've used D3.js to visualize data, and asked him what they use to create

dashboards and charts.

Keen's JavaScript SDK is built on a couple of things. When they started, they used

a wrapper around Google Charts. These days, they also have wrappers for D3.

D3 is a very powerful tool that has been used to create amazing documents.

There's an example from the New York Times that shows a mapping of the decline

of "Stop-and-Frisk", which uses thousands if not millions of data sets, but you can

clearly see a difference within seconds of looking at the map. Could you imagine

sifting through data on paper instead?


http://d3js.org/

http://www.nytimes.com/interactive/2014/09/19/nyregion/stop-and-frisk-map.html?_r=0


What do they use to monitor?

This is incredibly important. Dan said that, in some ways, they have probably spent more time building monitoring tools than building the core product.

They actually use Keen to monitor Keen, at least in part. Of course that doesn't help much if the system is down, so they also use other systems. The biggest one they rely on is DataDog. They use it as a central aggregation point for metrics, from frontend load balancer stats to Cassandra and JVM metrics. Alerting happens through PagerDuty.


https://www.datadoghq.com/

https://www.pagerduty.com/


They had their first failure in 2013. What happened, and how do they handle failures since their service is critical to businesses?

In 2013, when they were still using MongoDB, they ran into an issue where a

customer sent a query that should have been killed by their rate limiting, but it

wasn't. The result?

Frozen app servers.

Unfortunately, their load balancer was configured to send the requests off to

another app server if it didn't receive a response within a certain amount of time.

Nginx ended up sending the bad query to every single one of their app servers,

grinding the entire service down to a halt. Ouch.

How did they fix it? They ended up doing a few things, but one of the first changes was to configure Nginx to pass a request to another server only on error, not on timeout. If you want more info, read the outage report linked below.
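To see why retrying on timeout spread the problem, here is a small simulation, not Nginx itself. The `AppServer` and `proxy` names and the `runaway-query` request are hypothetical. In Nginx this kind of policy is controlled by the `proxy_next_upstream` directive, whose default value is `error timeout`; the exact configuration Keen used is not given in the summary.

```python
class Timeout(Exception):
    """The upstream did not answer in time."""

class ConnectionFailed(Exception):
    """The upstream could not be reached at all."""

class AppServer:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.frozen = False

    def handle(self, request):
        if not self.up:
            raise ConnectionFailed(self.name)
        if self.frozen:
            raise Timeout(self.name)
        if request == "runaway-query":
            self.frozen = True  # the query never finishes, so the server stays stuck
            raise Timeout(self.name)
        return f"{self.name} ok"

def proxy(servers, request, retry_on):
    """Try servers in order, moving to the next one only for exceptions in retry_on."""
    for server in servers:
        try:
            return server.handle(request)
        except retry_on:
            continue  # this failure type is allowed to fail over
        except (Timeout, ConnectionFailed):
            return "gateway error"  # give up instead of trying another server
    return "no upstream available"

def frozen_after(retry_on):
    """Send one runaway query through the proxy and count the frozen servers."""
    servers = [AppServer(f"app-{i}") for i in range(4)]
    proxy(servers, "runaway-query", retry_on)
    return sum(s.frozen for s in servers)

frozen_with_timeout_retry = frozen_after((ConnectionFailed, Timeout))
frozen_error_only = frozen_after((ConnectionFailed,))
```

With timeout retries enabled, the single bad query visits and freezes every server in turn. With error-only retries it freezes one server and the client gets an error, while requests still fail over when a server is genuinely unreachable.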

They have to monitor 24/7 to be able to detect these failures and prevent them if

possible, but sometimes things just go wrong. Systems fail in really unexpected

ways, and it can be hard to know how to detect these failures. The way they

handle these situations is to get the entire engineering team on board, figure out

a protocol, and just get to work.


https://keen.io/blog/56191254159/our-first-outage


There was another failure early this year. What happened?

This was a failure in their Kafka tier.

This time, not all services were down. Queries were still working, but since Kafka

was down, they were rejecting new events.

The outage was caused by a maintenance event that should have been routine.

Kafka started behaving in nasty ways. To fix it, they had to shut the system down,

reconfigure it in some critical ways, and bring it back up.

How do you prevent these things from happening again?

A lot of it has to do with monitoring more things and getting more visibility into your systems. You want to detect failures as soon as you can, or even before they happen if possible.

Once these failures happen, take a good look at what caused them and

implement new protocols & procedures to make sure they never happen again.



Conclusion

I have to try and fit as much information in this summary as possible without

making it too long. This means I cut out some information that could really help

you out one day. I highly recommend you view the full interview if you found this

interesting, as there are more hidden gems.

You can watch it, read it, or even listen to it. There's also an option to download the MP3 so you can listen on your way to work. Of course, the show is also on iTunes (https://itunes.apple.com/tt/podcast/scale-your-code-podcast/id987253051?mt=2) and Stitcher (http://www.stitcher.com/podcast/scaleyourcode?refid=stpr).

Thanks for reading, and please let me know if you have any feedback!

- Christophe Limpalair (@ScaleYourCode)



