The Path to TrstRank: Building One-Click Twitter Influence Metrics

Upload: infochimps-a-csc-big-data-business

Post on 20-Aug-2015

Since the launch of Twitter, people have clamored for ways to access and “slice and dice” its data. One of the most common ways people use the Twitter data corpus is to measure a person’s importance and influence. Klout is an example of one product that specializes in this kind of “influencer” data.

A few years ago, we created our own take on Klout, one that draws on our vast historical record of Twitter relationships to produce an accurate number describing how influential a Twitter user is. It’s called TrstRank, and it ranks a user on a scale of 1-10, with 10 being the most influential.

Coming up with a number like TrstRank is no small task. Setting aside the issue of getting the data, there are some very real Big Data problems surrounding the product that require special tools to solve efficiently. And when you’re a bootstrapped startup, as we were at the time, you have to be resourceful to get by.

The biggest issue with pursuing a new data product like TrstRank is the same one any company faces when it ventures into new territory: the high risk of wasting time and money.

Wasting Time

One of the first problems you run into as a small team trying your hand at data science is the excess time spent on server and machine configuration, instead of focusing on modeling, algorithms, and manipulating the data.

© 2012 Infochimps, Inc. All rights reserved.

What is TrstRank?

TrstRank is an Infochimps-developed dataset and API that provides Twitter influence metrics with the click of a button. It measures Twitter user reputation, importance, and influence in a far more robust way than counting followers: it is a sophisticated measure of a user’s relative importance within the entire Twitter network.
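To illustrate why a network-wide measure can disagree with a raw follower count, here is a minimal sketch that runs a PageRank-style power iteration on a toy follower graph. This is an assumption made purely for illustration: the text does not say which algorithm TrstRank uses, and the `pagerank` helper, the toy graph, and the damping value of 0.85 are all hypothetical.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """PageRank-style scores on a follower graph (illustrative only).

    `follows` maps each user to the list of users they follow. An edge
    u -> v passes an equal share of u's score to v.
    """
    users = sorted(set(follows) | {v for vs in follows.values() for v in vs})
    n = len(users)
    score = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new = {u: (1.0 - damping) / n for u in users}
        for u in users:
            targets = follows.get(u, [])
            if targets:
                share = damping * score[u] / len(targets)
                for v in targets:
                    new[v] += share
            else:
                # A user who follows nobody spreads their score evenly.
                share = damping * score[u] / n
                for v in users:
                    new[v] += share
        score = new
    return score


# Hypothetical toy graph: "popular" has two followers, "expert" has one,
# but expert's single follower ("hub") is itself widely followed.
toy_follows = {
    "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"],
    "hub": ["expert"],
    "x": ["popular"], "y": ["popular"],
}
follower_count = {}
for user, targets in toy_follows.items():
    for target in targets:
        follower_count[target] = follower_count.get(target, 0) + 1

scores = pagerank(toy_follows)
for user in ("expert", "popular"):
    print(user, follower_count[user], round(scores[user], 3))
```

In this toy graph "expert" ends up with a higher score than "popular" despite having fewer followers, because the score depends on who follows you and not only on how many do.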

Ramp-up time for even the first phase of a project like TrstRank can be a whole day or more of engineering time.

Wasting Money

From our earliest days, Infochimps has been based on the Amazon Web Services (AWS) cloud, taking advantage of the flexibility and scalability it provides. With AWS, you pay for what you use, so you are always inclined to eliminate waste. In our early days we even created decision trees for whether to shut down a cluster, depending on how many hours it would stay up but unused.

This can create conflicting goals for the data scientist, who would prefer to leave a cluster up overnight, even if it’s unused, rather than deal with setting everything up again the next day!
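The trade-off between paying for an idle cluster and paying for engineering time to rebuild it can be sketched as a simple cost comparison. This is not the actual decision tree mentioned in the text; the `should_shut_down` helper and every rate below are hypothetical, with the rebuild effort set to one working day to match the "whole day or more" of ramp-up time described here.

```python
def should_shut_down(idle_hours, machines, machine_rate, setup_hours, engineer_rate):
    """Return True when idling the cluster costs more than rebuilding it later.

    All rates are in the same currency per hour; the values used below
    are made up for illustration.
    """
    idle_cost = idle_hours * machines * machine_rate
    rebuild_cost = setup_hours * engineer_rate
    return idle_cost > rebuild_cost


# Hypothetical numbers: 30 machines at 0.50/hour, 8 hours of setup at 50/hour.
overnight = should_shut_down(idle_hours=14, machines=30, machine_rate=0.50,
                             setup_hours=8, engineer_rate=50.0)
weekend = should_shut_down(idle_hours=60, machines=30, machine_rate=0.50,
                           setup_hours=8, engineer_rate=50.0)
print("shut down overnight:", overnight)
print("shut down over the weekend:", weekend)
```

With these made-up rates, an overnight idle period is cheaper than a rebuild while a full weekend is not, which is the kind of conflict the paragraph above describes. Cutting the setup time changes the answer in favor of shutting down more often.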

Enter Ironfan

We created Ironfan to solve our own problem of how to save time and money during our data science operations in the cloud. When we came up with the idea for TrstRank, it was a simple operation to spin up a cluster for early analysis and experimentation. We could validate some of our algorithms and ideas on a simple cluster before moving to something more heavyweight.

Ironfan and TrstRank, Now

Ironfan has continued as a key tool for our monthly TrstRank operation. We continue to scrape Twitter for follower information, and each month we crunch the TrstRank numbers again with the updated data.

With Ironfan, we’re able to run a multi-step operation on 8 billion tweets on clusters of 30 m1.xlarge EC2 machines, while running only the resources we need, when they’re needed. TrstRank takes 72 hours to complete, and we pay for resources in proportion to that usage. Without Ironfan, we’d be looking at 2-3x the cost in time and money!
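For a sense of scale, the figures above give an upper bound on machine time for one run, assuming all 30 machines stay up for the full 72 hours (the text suggests actual usage is lower, since resources run only when needed). The hourly rate in this sketch is a hypothetical placeholder, not a real AWS price.

```python
MACHINES = 30        # cluster size, from the text
RUN_HOURS = 72       # duration of one TrstRank run, from the text
HOURLY_RATE = 0.50   # hypothetical cost per machine-hour, NOT a real AWS price


def run_cost(machines, hours, rate):
    """Cost of keeping `machines` up for `hours` at `rate` per machine-hour."""
    return machines * hours * rate


instance_hours = MACHINES * RUN_HOURS  # 30 * 72 = 2160 machine-hours at most
base_cost = run_cost(MACHINES, RUN_HOURS, HOURLY_RATE)

print("upper bound on machine-hours per run:", instance_hours)
print("cost at the hypothetical rate:", base_cost)
# The "2-3x" claim in the text, applied to the same hypothetical baseline.
print("2x and 3x of that cost:", 2 * base_cost, 3 * base_cost)
```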

About Infochimps

Our mission is to make the world’s data more accessible. Infochimps helps companies understand their data. We provide tools and services that connect their internal data, leverage the power of cloud computing and new technologies such as Hadoop, and provide a wealth of external datasets, which organizations can connect to their own data.

Contact Us

Infochimps, Inc.
1214 W 6th St. Suite 202
Austin, TX 78703

1-855-DATA-FUN (1-855-328-2386)

Twitter: @infochimps

Get a free Big Data consultation

Let’s talk Big Data in the enterprise!

Get a free conference with leading big data experts about your enterprise big data project. Meet with leading data scientists Flip Kromer and/or Dhruv Bansal to talk shop about your project objectives, design, infrastructure, tools, etc. Find out how other companies are solving similar problems. Learn best practices and get recommendations, free.