2 - 3 - what is data- (11-25)

Upload: m-faheem-aslam

Post on 04-Jun-2018

TRANSCRIPT

  • 8/13/2019 2 - 3 - What Is Data- (11-25)

    What is data? Seems like a pretty good place to start, for a class called Data Analysis, to define what we mean by data, and where else would we turn for that definition but Wikipedia: "Data are values of qualitative or quantitative variables, belonging to a set of items." This is a pretty good definition, and each part of the phrase tells us something important. Let's start with the end and work our way toward the beginning.

    A set of items: this is the set of objects that you're interested in knowing something about. In a statistics class, this is sometimes referred to as the population. The set of items you care about depends on the question you are asking. It might be a set of people with a particular disease, a set of cars produced by a specific manufacturer, a set of visits to a website, or a set of credit card transactions.

    Corresponding to each item is a set of variables. Variables are measurements or characteristics of an item. The reason they are called variables is that the most interesting measurements or characteristics are those that vary from item to item, although knowing that a variable doesn't actually vary across items can also be informative.

    Variables are broken down into two types: quantitative and qualitative. Quantitative variables are variables that can be measured with an ordered set of numbers. In a clinical drug study, quantitative variables might be the height, weight, or blood pressure of patients. Qualitative variables are variables that can be defined by a label. Qualitative variables might be the country of origin of a patient, the sex, or the treatment status of that patient.
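To make the definition concrete, here is a minimal Python sketch of a set of items with quantitative and qualitative variables. The `Patient` class, its field names, and every value in it are invented for illustration; they do not come from the lecture or from any real study.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class Patient:
    # Quantitative variables: measured with an ordered set of numbers.
    height_cm: float
    weight_kg: float
    systolic_bp: int
    # Qualitative variables: defined by a label.
    country: str
    sex: str
    treatment: str


# The set of items (the population of interest). All values are made up.
patients = [
    Patient(170.0, 65.0, 120, "Canada", "F", "drug"),
    Patient(160.0, 72.5, 135, "Brazil", "M", "placebo"),
    Patient(180.0, 80.0, 128, "Canada", "M", "drug"),
]

# Quantitative variables can be ordered and averaged.
mean_height = mean(p.height_cm for p in patients)
tallest = max(patients, key=lambda p: p.height_cm)

# Qualitative variables are summarized by counting labels.
country_counts = Counter(p.country for p in patients)
treatment_counts = Counter(p.treatment for p in patients)
```

The split shows up in what you can sensibly do with each variable: averaging or ordering heights is meaningful, while for country or treatment status the natural summary is a count per label.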

    An important distinction when it comes to data is whether it is raw or processed. Raw data comes from the original source without any modifications made by the data analyst. It is often hard to use for analysis because it is large or it has problems that need to be fixed. Data analysis includes the process of pre-processing these data into a form

    together like a puzzle to get the sequence of letters in the person's genome. Now, here the raw data could be the image files, but they're huge, often terabytes of data. It could be the intensity files, which are also large and often unwieldy for analysis, or it could be the short sequences of letters estimated for each fragment. These are, in fact, what most analysts use as the raw data when building human genomes.

    Regardless of what is considered the raw data, it's pretty clear that the way the images are processed and the way base pairs are estimated with a statistical model might have a pretty big impact on the genome produced when the short fragments are pieced together. So keeping these steps in mind is important for the analysis, and they should be recorded so that people who use the data downstream are able to understand what particular nuances of the processing steps could impact their analysis.

    So, what do raw data look like? This is the raw data at the level of the short fragments that are produced by a sequencing machine. They include the sequence of letters as well as some information about the quality of those estimated letters.

    This is another example of some raw data.

    This comes from the Twitter API, or Application Programming Interface. These are interfaces that allow you to access the data that are being produced by companies like Twitter and Facebook. When you access the data, they come in a very structured format. The structured format may or may not be very easy to analyze directly in order to get information about the way that users use these services. Another example is an electronic medical record.

    Electronic medical records contain measurements of quantitative and qualitative variables. They also may contain free text typed by the doctor about allergies or medication history. These data often need to be processed before they can be analyzed with statistical models. So what do processed data look like?
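As one illustration of the kind of processing meant here, the sketch below flattens nested, API-style records into flat rows that are easier to analyze. The record layout and field names are made up for this example and are not the actual schema of Twitter's API or of any medical record system; `flatten` is a hypothetical helper, not a library function.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested structures, extending the key prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Invented raw records in a structured but nested format.
raw_records = [
    {"id": 1, "text": "hello world", "user": {"name": "alice", "followers": 10}},
    {"id": 2, "text": "data analysis", "user": {"name": "bob", "followers": 25}},
]

# Processed form: one flat row per record, with the same columns in every row.
rows = [flatten(record) for record in raw_records]
columns = sorted(rows[0])
```

Each row in `rows` now has the same four keys, so the records can be loaded into a table directly. Real raw data usually need more than this (missing fields, lists, free text), so treat it as a sketch of the idea only.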

    We're going to be talking a little about data processing coming up next week. But to give you a flavor of what we're going to be talking about, this is what we're going for. Processed data, or tidy data, have the following properties. Each variable forms a column, so in each of these columns are the measurements for one specific variable, and each observation forms a row. In this case, this was a study of peer reviewers in an experiment we performed in 2011. So each row corresponds to a particular question solved by a particular reviewer. The corresponding variables for that question lie in each of the columns. So row 1 contains all the values for question 1.

    Each table or file stores data about one kind of observation. So, for example, in a clinical study, you wouldn't include in the same table information about patients as well as information about the hospitals that those patients are being included in. The goal is to separate the data in such a way that it's easy to answer the questions that you're trying to answer in your downstream analysis.

    So how much data is out there? This is an infographic that describes how much information is available, or being collected, in any given year.

    Here, it suggests that about 1.8 zettabytes were created in 2011. You might dispute the exact value of this number, but it gives you an idea of the order of magnitude of data being created each year. 1.8 zettabytes is equivalent to about 3 tweets per minute for every person in the United States, every minute for an entire year. That's a lot of information, and it's why you hear very often about big data.

    Big data is usually defined as data sets that are so large they cannot be analyzed with a single computer. Despite being different in this way, they're similar in that the data are still being used to answer specific questions that people want to address, and typically the common statistical and machine learning algorithms can be applied to these data once it's possible to handle the data themselves.

    So an important thing to keep in mind when talking about big data is that it really depends on your perspective. This is a picture of an IBM 350 hard drive. This hard drive can store about 5 megabytes of data. Some of the data sets that you'll analyze during this class on your laptop will be larger than 5 megabytes and so, to the people in these pictures, they would be big data. The big data that we analyze today are big largely because our computers are not able to handle them.

    So why is big data a big deal right now? Here's an example. In 1969, 296 individuals in Nebraska and Boston were given letters, with the eventual goal of each letter ending up in the hands of a specific person in Boston: each person would mail the letter to a friend, who would then mail it to a friend, and so on. 64 such letter chains made it to the target person and, on average, it took about 5.2 people between the person mailing the original letter and the target individual. This number was rounded up and became the basis for the usual 6 degrees of separation that you hear about in the media.

    Well, recently, a similar study was performed on 30 billion conversations from 240 million people, a substantially larger data set. They ended up estimating the average number of degrees of separation between people to be 6.6, which would round up to 7 degrees of separation. The interesting thing here is that these data are now much easier to collect, and it's happened so rapidly that it might not be possible for our computers to keep up.
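For a sense of what estimating degrees of separation involves computationally, here is a small Python sketch that averages shortest-path lengths over a toy network using breadth-first search. It assumes that "degrees of separation" means the number of hops on the shortest path between two people, which is one common reading and not necessarily the exact definition used in either study. The graph and the function names are invented for illustration.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return the hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def average_separation(graph):
    """Average shortest-path length over all connected pairs of distinct nodes."""
    total, pairs = 0, 0
    for source in graph:
        for target, hops in bfs_distances(graph, source).items():
            if target != source:
                total += hops
                pairs += 1
    return total / pairs


# Toy undirected network: a simple chain a - b - c - d.
toy_graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}

hops_a_to_d = bfs_distances(toy_graph, "a")["d"]
avg = average_separation(toy_graph)
```

Running one breadth-first search per person is fine for four nodes but becomes expensive at the scale of hundreds of millions of people, which is the kind of gap between data size and computing capacity described above.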

    This is why people talk a lot about big data and, in particular, they struggle with how to handle the data, given that our computers have not grown as fast as the data have grown.

    Regardless of whether the data are big or small, this is something to keep in mind. This is a quote by John Tukey, one of the most famous data analysts: "The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data."

    So it's important to keep in mind that if you're trying to answer a specific question, which is the basis for most good data analysis, you may not have the data to answer that question. You know, that's a hard decision to come to. The only thing I'd add to this quote is that, no matter how big your data are, you still need the right data to answer your question. So taking a step back and thinking about whether your data will answer the question you're trying to answer is the important first step in doing a data analysis.