3 - 7 - Summarizing Data (23-21)

Upload: m-faheem-aslam

Post on 04-Jun-2018


TRANSCRIPT

  • 8/13/2019 3 - 7 - Summarizing Data (23-21)


This video is about summarizing data. In the previous videos, we talked a little bit about how we would get data from files, either on the internet or files that we have on our local computer. Once we have the data loaded into R, we want to do some sort of summarization to see if the data have any problems, or to identify characteristics of the data that might be useful to us during the analysis.

So, why do we summarize data? The first thing to keep in mind is that almost always, the data are too big to look at the whole thing. So, except in very extreme circumstances, it's very difficult to just eyeball the entire data set and see interesting patterns or potential problems with the data. Since the first step is to find those problems, or to find issues that are interesting to look at downstream, you definitely need to summarize the data in ways that make it possible to identify those patterns.

When you do these summaries, some things that you might be looking out for are missing values, and values that are outside of the expected ranges. So, if you're measuring temperature in Celsius in Baltimore and you see a measurement of 250, that's probably a little bit high, so you should look out for those sorts of things.
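As a small sketch of that kind of range check, using a hypothetical vector of Celsius temperatures (these values are made up for illustration):

```r
# Hypothetical Baltimore temperatures in Celsius; 250 is a deliberate
# out-of-range value of the kind you want to catch early
temps <- c(12.5, 14.1, 9.8, 250, 13.3)

# range() shows the minimum and maximum, which makes the outlier obvious
range(temps)

# which() points at the offending observation(s)
which(temps > 60)
```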

And lest you think that those sorts of things never happen, there are almost always at least one or two crazy values in every data set that I've seen. You might also look for values that seem to be in the wrong unit, say if most of the measurements are in Celsius and one measurement is in Fahrenheit. You also want to look for mislabeled variables or columns, and variables that are the wrong class. So, variables that look like they should be quantitative but are actually labeled as character variables, and so forth.

So, we're going to talk a little bit about the ways that you can summarize data. Again, this is not comprehensive; there are a large number of ways that you can summarize data, and depending on the type of data that you're looking at, some will be better than others. So, I'm going to give you an overview of the basic and most useful ways to summarize data, and if you need to summarize data in other ways, the best way to do that is to look at the data type that you're summarizing and search on Google. Seriously, that's the best way to do it.

So, this is an earthquake data set that we're going to be using to illustrate some of these ideas. This is available from data.gov. This is another one of those examples where the time that you download the data set really matters: this data set is actually updated every week, and it contains only the earthquakes for the past seven days. So, if you're running these slides at some unspecified time in the future.

And it's seven days after I created them, some of the exact numbers that you're going to be seeing are going to be a little bit different. So, that's just something to keep in mind when you're running these slides or looking at these commands: if you get something slightly different, it might just be because you ran it at a different time.

So, this is the URL for the data set that we're going to be looking at. Again, what we can do is use the download.file command we learned about in the getting data lectures. We can pass the file URL to download.file, save the data set to the earthquakeData.csv file, and then record the date that it was downloaded. These slides were created on Sunday, January 27th, 2013. So again, if it's seven days beyond that, you're going to get a slightly different data set.

Then we can read in the csv file that results using read.csv, and now we have it stored in the eData variable. So again, the purpose of summarizing is that it's very hard to look at the whole data set. If I just type eData and hit Return in R after loading it in, I get a very long data

frame. It gives me the variables here across the columns, and in the rows are each of the observations, with each one corresponding to a specific earthquake. And so, we get the source, the earthquake ID, the version, the date and time. And you can't actually see all the other variables that are being output as well; they fall off the screen here. So, looking at the full data set is not a viable option for understanding potential patterns in the data.

So, here are some important first views, the very first things that you always run when you load a data set into R. First, you look at the dimensions of the data frame. Here, in this case, I did dim(eData) and I end up seeing that there are about 1,000 earthquakes, or 1,057 rows exactly, and there are 10 columns. One reason that I always run this as one of the first commands is that if I know there are 11 variables and I only see 10 here, then there was a problem reading the data into R. Similarly, if I know there should be, say, 10,000 rows and I only see 1,057, which will often happen if the data are stored in a weird format, you can detect that the data have been read in incorrectly from this very simple summary.

The other thing you can do is look at the names of the variables in the data frame. So again, you apply names to eData and we get the list of names of the 10 variables. These should be the variable names that you're expecting, in this case for the earthquake data. And you can also look specifically at the number of rows or the number of columns in the data set.
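Since the earthquake feed changes weekly, these first checks can be sketched on a small made-up stand-in for eData (the column names here follow the ones mentioned in the lecture, but the values are invented):

```r
# A tiny stand-in data frame for eData; the real one had 1,057 rows and 10 columns
eData <- data.frame(
  Src       = c("ak", "ci", "ak"),
  Lat       = c(61.2, 34.1, -61.0),
  Lon       = c(-149.9, -118.2, 154.3),
  Magnitude = c(4.5, 2.1, 5.0)
)

dim(eData)    # rows then columns
names(eData)  # the variable names you expect to see
nrow(eData)   # rows only
ncol(eData)   # columns only
```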

And dim actually gives you both the row count, that's the first number, and the column count, that's the second number you get from dim. Or you can get them individually using the nrow or ncol commands.

So then, there are some other ways that you can start summarizing the data once you've looked at the very basics in terms of the size and shape of the data set that you've loaded in. One of these, for quantitative variables, is to look at the quantiles of that variable. The quantiles are sort of like percentiles. You can imagine, if you took the SAT and you were at the 99th percentile, then 99% of the people who took the SAT that year got a lower score than you did. It's the same sort of thing for the quantiles. So, 0% of the values are less than -61 for the eData latitude variable. When I apply quantile to the eData latitude, I get the 0th, 25th, 50th, 75th, and 100th percentiles. So, this gives me an idea of the range of values that I

observe, and where the middle of the values is, so I can use this to identify whether some of the values are really outside of the range. If I saw, for example, a latitude of 5,064, I'd know that it was either measured on some very different scale or it's an incorrect value.

You can also apply the summary command to the entire data frame. When you do that, you get quantile information for the quantitative variables, but you also get other information for the other variables, say, for qualitative variables. For example, the source variable in the data frame is not a quantitative variable; it takes different character values corresponding to different detectors, and in this case most of them were detected with the ak detector, and there are 330 of them that correspond to that. So, summary describes both the quantitative and qualitative variables for you, so you can get a first-glance look at what the data set looks like and notice any particular problems.

The other thing that's very useful is to determine whether variables that should be characters are being loaded by R as numeric variables, or vice versa. The more likely scenario is that a numeric variable is loaded as a character variable. So, you can check for that. First, you can look at the class of the entire data frame and, of course, it comes up as a data frame. Then, you can look at the class of each individual column. There's a slightly tricky way of doing that: you look at the first row of the eData frame. By selecting just the first row, the comma here tells you: if you put a number before the comma, it will select a row; if you put a number after the comma, it will select a column. So here, we've selected the first row of that eData data set, and what we would like to do is apply the class function to every single element of that first row, and we can do that with this sapply

function. What sapply does is run along every value in this vector and apply the function to it. So, we see that for the source variable we get a factor, for the earthquake ID we get a factor, and so forth. For latitude and longitude we get numeric variables, as well as for magnitude. These are all what we were expecting. So, this is another way to determine whether the data have been loaded properly and whether the variables were loaded in the way that you expect them to be loaded.

The next step is to start looking at the actual values that you see for different variables. A couple of very useful functions here are unique, length, and table. One example is to look at the unique values. Some variables, particularly qualitative variables, will only have a certain number of unique values, whereas quantitative variables might have entirely unique values. So, we're looking at this qualitative variable source, and when we look at the unique values, you can see listed here all the values that that variable takes. This is a way of summarizing, very succinctly, a qualitative variable. And if you see that there are classes of that variable that should not be there, you can start exploring them further.

You can also look at the length of the unique values for a particular variable. So again, we've taken the unique values for this source variable and looked at the length of that, and we see that there are 11 unique values for source. This is another way of succinctly summarizing how many values you see, and if you expected to see more or fewer, you can quickly assess that you have a problem with the data.

You can also make a table of the qualitative variables. If you make a table of a quantitative variable, you're going to get a very big table, because every value will be unique and you'll get exactly one for each of the categories. But for a qualitative variable, if you do

a table of a qualitative variable, in this case eData$Src, you can see that each of the unique values is listed, and underneath each unique value is the number of times that it appears. Remember, in summary we saw that ak appeared 330 times in the source variable, and so again, when we take the table of that source variable, we see that it appears 330 times. But we also see, for all the other values that the variable can take, the number of times that it takes each value.
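The functions discussed above can be sketched together on a small stand-in data frame (the values and column names here are invented for illustration):

```r
# Invented stand-in for the earthquake data
eData <- data.frame(
  Src = factor(c("ak", "ak", "ci", "nc", "ak")),
  Lat = c(61.2, 63.5, 34.1, 38.8, 65.0)
)

quantile(eData$Lat)        # 0th, 25th, 50th, 75th, 100th percentiles
summary(eData)             # quantiles for Lat, counts for Src
sapply(eData[1, ], class)  # class of each column: factor, numeric
unique(eData$Src)          # the distinct source values
length(unique(eData$Src))  # how many distinct sources there are
table(eData$Src)           # how often each source appears
```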

So, this gives you a little bit of an idea about the distribution of a qualitative variable.

The table command is actually more flexible than just letting you look at single variables. Suppose we want to look at the relationship between the source variable and the version variable for this data set: we can do table of the first variable, eData$Src, comma, the second variable, eData$Version, and we actually see a two-dimensional table now. What this table shows, first along the rows, are the values of the source variable, ak, ci, and so forth. And along the columns, we see the different versions that you can have: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, but also A, B, D, and E. Then, what you see in each cell of this table is the count of the number of times that the source variable is equal to, say, ak for a row of the data frame and the version variable is equal to 2. So, 211 rows of the data frame have the source variable equal to ak and the version variable equal to 2, and it's the same for each of the cells. So, there you can kind of see the relationship between these two variables, and see, for example, that most of the values seem to be occurring up here among this smaller number of detectors and this smaller number of versions. You can also see places where there are no values, or places where particular sources, ak, ci, and so forth, do not have any values that come from particular versions.

Another way that you can look at data, in addition to table and unique, is to look at any and all. Any and all are particularly useful

when looking at missing data, but also if you want to see whether some particular characteristic exists for any of the values of a variable in a data set. For example, if we look at the latitude data, so if you go eData$Lat and we look at the first ten values, this just subsets to the first ten of those latitude values, and we see them listed here. Suppose we want to see which of those values are greater than 40. It's kind of hard to eyeball it directly, but you can define a logical variable. The way that you do that is eData$Lat[1:10] > 40. What that does is, for every value, it checks whether it's greater than 40 or not, and if it is greater than 40, it reports TRUE. So, for example, the third value is 65, which is greater than 40, so it's TRUE. And if it's less than 40, as in this case, 38.83, then it reports FALSE for that value. So now, we have a new vector that's the same length as the original 10-long vector of latitudes, and it tells us which of the values are greater than 40. And then, if we want to see if any of them are greater than 40, we can just ask whether any of these values are greater than 40, and it tells us TRUE.
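The logical-vector idea above can be sketched with a few made-up latitude values:

```r
lat <- c(65.0, 38.8, 61.2, 34.1)  # invented latitude values

gt40 <- lat > 40  # logical vector: TRUE where the latitude exceeds 40
gt40              # same length as lat, a pattern of TRUE/FALSE

any(gt40)  # TRUE: at least one latitude is above 40
all(gt40)  # FALSE: not every latitude is above 40
```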

So sometimes, what you're looking for is just whether, for some variable, any of the values has some particular characteristic, and you can use the any command to check whether that's true. The all command, on the other hand, checks whether all values have the same property. For example, we again define the logical vector of which eData$Lat values are greater than 40, and then we can see if all of these values are greater than 40 by applying all to that vector. In this case, we actually get FALSE, because a large number of the entries are equal to FALSE. The only way this would return TRUE is if every single value of the vector were equal to TRUE. So, these two functions, any and all, allow you to evaluate whether there are particular patterns in the data set, particularly whether a pattern holds for all of the values or for at least one of the values.

The other thing we can do is subset the values, and we can do this in more complicated ways than we saw in the original lectures. One example is that we can use the ampersand sign to do AND operations. What we're doing here is looking at the data frame eData and taking the columns with the latitude and longitude names for this data set; those are the columns that we're going to subset to, because they come after the comma. Before the comma, we want to specify the rows that we're going to subset to. And so, what we're going to do is, again, define these logical vectors. So, eData$Lat > 0 will be equal to TRUE whenever the latitude is greater than 0, and will be equal to FALSE whenever it's less than or equal to zero. Similarly, we can define the same sort of thing for longitude: a logical vector that's equal to TRUE whenever the longitude is

greater than zero, and equal to FALSE whenever the longitude is less than or equal to 0. Then, we want to find all the cases where both the latitude and the longitude are greater than zero. To do that, we just stick an ampersand in between the two logical vectors, and what you get out is the set of rows where both the latitude and the longitude are greater than 0.

Another case is where you want either the latitude or the longitude to be greater than 0; one of those two things has to be true. Here, we use the OR symbol, the vertical bar, to determine whether either the latitude or the longitude is greater than 0. In this case, you see some rows where latitude is positive and longitude is negative, and some rows the other way around, where longitude is positive and latitude is negative. But one or the other of these two conditions has to hold: either the latitude is positive or the longitude is positive. So, you don't see any rows where both the latitude and the longitude take on negative values.

So now, after we've looked at a couple of different ways that we can subset the data, look at unique values, and all sorts of other things, what we're going to do is look at another data set. This is a data set that was put together for a paper that I wrote a couple of years ago on submissions and reviews in an experiment. In this experiment, people solved problems, like SAT problems, which were submitted to a computer. The computer then randomly assigned them to other people to review, and the people that reviewed those problems could say either that the solution was correct or incorrect. From that, we can learn a little bit about the peer review system. This is particularly relevant because your data analyses will be graded through a peer review system. And we learned that cooperation between peer reviewers and authors increased the accuracy of the review process.
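Returning to the earthquake example for a moment, the AND/OR subsetting described above can be sketched like this (the data frame and its values are invented stand-ins):

```r
# Invented stand-in with latitude and longitude columns
eData <- data.frame(
  Lat = c(61.2, -33.9, 35.7, -1.3),
  Lon = c(-149.9, 151.2, 139.7, 36.8)
)

# AND (&): rows where both the latitude and the longitude are positive
eData[eData$Lat > 0 & eData$Lon > 0, c("Lat", "Lon")]

# OR (|): rows where at least one of the two is positive
eData[eData$Lat > 0 | eData$Lon > 0, c("Lat", "Lon")]
```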

So, we're going to look at these data because they will show us a couple of other ways that we can manipulate data sets, look at summaries, and figure out how things are working. Here, we need to download two data sets, which are on Dropbox, so we've assigned the two URLs for the two data sets. Then, we download the two files using the same methodology that we've used before. They're both csv files, so we use read.csv to read the two files in, and we can look at the top of those files. Here is the top of the reviews file; we see it has an ID, a solution ID, a reviewer ID, a start and stop time, and so forth, and you can see that they're all quantitative variables here. And then we also look at the head of the solutions file; again, it has an ID, but now a problem ID, and then some of the similar variables that we saw before.

So, one thing that we might want to do is determine whether there are any missing values, and one way to do that is to use the is.na function. Suppose that we want to look at the reviews time_left variable; we can look at the first 10 values of that variable and see which of them are NA. If you apply is.na to a vector, it will look at every value one at a time and tell you whether that value is NA, that is, missing.
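That is.na check can be sketched on an invented stand-in for the time_left variable:

```r
# Invented stand-in for reviews$time_left, with some missing values
time_left <- c(1676, 2040, NA, 1731, NA, 1304)

is.na(time_left)       # TRUE wherever a value is missing
sum(is.na(time_left))  # count of missing values (each TRUE counts as 1)
```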

So, in this case, counting along the first ten values, the eighth value of that time_left variable is missing, and all the rest come up FALSE because they're not NA values. Then, the other thing that we can do, if we have this logical vector defined by using is.na on the entire time_left vector, is to use sum to calculate the total number of times that you see an NA value. Remember, TRUE means the value is missing, that it's an NA value. So, if we do sum of a logical vector, what it does is count up the number of times that you see TRUE, and so you see 84 missing values for this reviews$time_left variable. And indeed, if you do a table of is.na(reviews$time_left), you're now going to look at a table of this logical vector of whether each value is missing or not, and you see that 84 of the times it's missing, and 115 of the times it is not missing.

So, an important issue about dealing with tables and missing values is going to be illustrated with this example. Here, I've just created, and this has nothing to do with the previous experiment, a vector with the values 0, 1, 2, 3, NA, 3, 3, 2, 2, 3, NA being the missing value. If I type table of that vector, I actually see the number of times that 0, 1, 2, and 3 appear, but I don't see the number of times that the missing indicator appears. That's because one of the options, the useNA option, is set by default not to show NAs. So, if you run table on that exact same

vector, but you set the useNA option to "ifany", then if there are NA values, you will see that the NA count appears here as well. So, there's one missing value in that vector. That's just an important little trick to remember: if you're looking at the number of values in a vector and you want to make sure that you see the missing values as well, you need to change the useNA parameter that you're passing to the table function.
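That useNA behavior can be sketched directly:

```r
x <- c(0, 1, 2, 3, NA, 3, 3, 2, 2, 3)  # the example vector from the slide

table(x)                   # NAs are silently dropped by default
table(x, useNA = "ifany")  # the NA count now shows up as its own category
```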

So, another thing that you can do is summarize by rows and columns. Rather than summarizing the individual variables at the level of a table, you can just look at the sum of all the values in a particular column, or the mean of all the values in a particular column. This can be useful when you're checking whether any variables have an unusually high or unusually low mean. It really only applies to values that are quantitative. Since we're using only quantitative variables in these reviews, we can take the column sums. The column sums tell you, for example, the sum of all the reviewer IDs, which in this case might not be a particularly useful number.
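The column and row summaries can be sketched on an invented numeric stand-in for the reviews data; note how NA values propagate unless na.rm = TRUE is set:

```r
# Invented numeric stand-in for the reviews data frame
reviews <- data.frame(
  id        = c(1, 2, 3, 4),
  time_left = c(1676, NA, 2040, 1731)
)

colSums(reviews)                 # time_left comes back NA because of the missing value
colMeans(reviews, na.rm = TRUE)  # the NA is skipped, so a real mean comes back
rowMeans(reviews, na.rm = TRUE)  # same idea, one mean per row
```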

But you see here that the column sums for reviews are NA for the start, stop, time_left, and accept variables. That's because if there are any NA values, the sum will always be equal to NA. So, you might need to use the na.rm parameter to ignore the NA values. For example, if you take the column means of the same reviews data frame and you set na.rm=TRUE, then it takes the mean of each column, and it does that by completely skipping any values that are NA. So, for example, for the start variable, it takes the mean of the start variable while completely ignoring any values that are equal to NA. And so, these are the values that you end up getting for each of the column means, ignoring the NA values.

You can also do the same thing for row means. All this does is, instead of getting a mean for each column, it gets a mean for each row. And again, you might need to set na.rm=TRUE, because otherwise any row with an NA will get a value of NA when you apply rowMeans to it.

So, I know this was a super quick summary of ways that you can summarize data. But this is the first pass in data

analysis: it's always worth running one or several of these functions to get a feel for the shape, the structure, the number of NAs, and so forth that exist in a data set. It also lets you summarize a little bit of the quantitative distribution of variables, using quantile and things like that. So, the next thing that we're going to talk about is data munging, and that's going to be a key component of any data analysis. It's usually performed after summarizing the data sets, but it can also be performed before summarizing them.