Statistics M2 Assignment 1 - Paris Diderot University

Assignment 1

Goals: set yourself up on Github; start learning R; practice the statistical reasoning we did in class.

Preparing: Git and Github

From this point on, you’re going to use a “version control” system to do all your work. Version control is a bit like an “undo” feature, but far more sophisticated. The main thing a version control system gives you is a detailed record of everything you’ve ever done. If you have ever had the thought “I’m sure everything was working / was just the way I wanted it last Tuesday…”, this is the solution. (If you have not, you almost certainly will over the course of this class.) It also gives you the possibility to add multiple collaborators to a project. The advantage over shared folders (for example, Dropbox) for this purpose is that, since each change is logged, you know exactly who changed what when, and can go back to any previous state at any time.

Create a Github account

Github is a site that provides free hosting for code projects managed through Git (we will find out what this means in a minute). You store the files for a single project in what’s called a “repository.” We’ll discuss this more below.

The only condition for storing your project on Github is that all of the files be publicly available for viewing (not for modification), unless you pay. Since, as a researcher funded by public money, all your work should be publicly available anyway (and more and more people will bother you about this; and under increasingly many conditions this is legally required; and some journals will require it; and people in chemistry look at us like we’re criminals when we tell them we don’t keep a detailed record of literally every single thing we do, open to scrutiny; etc etc), this should be no problem for you. Sometimes people have legitimate concerns that their data or their analysis will be used by some nefarious competitor before they can publish. In linguistics as it is today, this fear is almost never well-founded. In fast-moving fields where people publish every couple of weeks, it may be a concern. At any rate, in the future, if you want to wait until after you’ve published your paper to make your work available to the world, you can pay for private hosting on Github, you can use BitBucket (another service), or you can set up a private Git server in your lab or research group that allows you to control access.

In this class, however, you’re required to put everything in a public Github repository (and, yes, this means that you can copy each other, and, no, I don’t mind if you get ideas from one another, but if you copy and paste code it will be obvious, and you will often have to explain your code in words, which will also be pretty obvious if you copy).

So: if you don’t have one already, create an account at http://github.com/, and tell me your username (e.g., by email). If you don’t do this, I won’t be able to mark your assignment.

Create a repository

When you log in to Github, you should see a green button called “New repository.” Create a new repository called “assignment-1” with the “Initialize this repository with a README” box checked. Also have it create a “.gitignore” file (i.e., check that box). Once you create the repository, you’ll be taken to the main page of the repository, which will always show the README file. The README file is there to explain what is in the repository. Notice, however, that you can’t edit the README file from here.

What did you just do?

Git is a system for storing code. Actually, you can use it to store anything you’re working on. (Almost. There are certain things you should never use it for, or bad things will happen immediately. More on this later.) It stores things, however, in a complicated way, which you will get used to and come to appreciate, after you understand it.


The first thing to understand if you have never programmed before is that you will be storing your code in files. These files will have names, and you will be able to find them someplace, stored away on some computer, in some folder. These things will become clear soon. The data you work with will also be stored in files. If you work on our class server, these will be stored in a folder you have access to on the class server. Let’s call this collection of files your workspace. What Git does is help you create three further copies of the files in your workspace (the index or staging area; the repository or local repository; and the remote repository), for reasons which will become clear.

In fact, let’s call these four copies (the workspace plus the other three) The Four Books of Git.

I know we don’t know what they are yet. But hold tight. You have just created the rightmost copy: the remote repository.

Creating your local repository

I’m going to assume you’re working on the class server at Paris V (there are instructions on how to set up Git, R, and RStudio on your own computer at the end of the assignment, but that is optional). Log in using your username and password. (Keep a tab open with your new Github repository, though.)

In the upper right hand corner of RStudio, you will see a button that says “Project: (None)”. Click it to bring down a menu, and create a new project using the option “Version Control: Checkout a project from a version control repository.” Select “Git.” Remember that in the last step, we created a repository on Github, stored on the Github servers. That was one of the Four Books of Git. (Which one?) You’re about to ask RStudio to create the other three.

The remote repository is called “remote” because it’s stored in a different place, remote from the rest. In this case, on Github’s servers. The other Three Books of Git are stored in the same place as each other. They’re called “local” because they’re where you’re working. You have direct access to them from where you are working. Since you’re working on the class server, they’re not stored on the computer right in front of you (you won’t find copies on your hard drive or on the computers in the classroom), but they are stored on the computer you’re working on (i.e., the class server, to which you are connected). But since the remote repository is remote, you need to give RStudio its address. To find this, go back into the tab you left open to your new repository on Github, and click the green “Clone or download” button. Click on “Use HTTPS”. Then, copy the address given (it should be of the form “https://github.com/ … (etc etc) …”).

Now go back into RStudio and paste this in the top field. This is the location of the remote repository. It’s also asking you for a directory name. RStudio wants a name for the directory which will contain the other Three Books of Git. Call it “assignment-1”.

(Directory equals folder. It is a widely known fact that you can easily use computers quite ably your whole life without thinking too much about directories/folders and filesystems. I am of the opinion that files and folders are really not a very effective way for human beings to interact with computers for most tasks, so it’s not surprising that most people manifestly do not think much about what folders things are in, and get surprised when they’re asked a question about “what folder.” If you feel surprised at the idea that everything on a computer is in a folder/directory, read over the Wikipedia page for Directories: https://en.wikipedia.org/wiki/Directory_(computing).)

RStudio has just created a directory called “assignment-1.” (Again, assuming you’re working on the class server, this directory is on the class server, and not on your own computer.) This directory acts as your workspace (The First Book of Git). Hidden in this directory, in places you can’t easily see, are the index (The Second Book of Git) and the local repository (The Third Book of Git).

You can see what is in this directory in the file browser in the lower right hand panel of RStudio. Notice that it contains the README file you created when you created your repository, called “README.md”. Using the file browser in the lower right hand panel of RStudio, you can see the contents of one folder at a time. You can browse what’s in the parent folder, which is your “Home directory” on the class server. To look at what’s in it, click the button marked “Home” at the top of the file browser. You’ll see that it contains the folder you just created (the “assignment-1” folder). Click on that folder to set the file browser to show its contents again.

Updating the Four Books of Git

The point of the workspace is just to let you work freely and make changes to what you’re doing: changes which are saved somewhere (in your workspace), but which aren’t being tracked. If you open up the “README.md” file by clicking on it in the file browser, you’ll see what it contains: just a title, the name of the repository. Since the README file should actually have some useful information in it, underneath that, you can add in some information: type in something explaining what this repository contains. (Hint: this repository contains your work for Assignment 1, so you could put, “This repository contains my work for Assignment 1.”) You can then save that file in the normal way that you would save things in most other programs (by clicking File > Save, or by clicking the floppy disk icon, or by hitting Control-S on Windows or Linux or Command-S on Mac).

These changes are not being tracked. If you make another change (for example, in the future, you’re probably going to forget what “Assignment 1” refers to, so you might want to change it to say “for Assignment 1, Stats, Fall 2017 (Paris 7)” instead of just “for Assignment 1”), and then save again, you will never be able to go back to the old version. (Yes, there is an Undo button in RStudio, but once you close RStudio, all will be forgotten.) Nor can anyone else see these changes, and that includes you, if you don’t happen to be connected to the class server. They are saved, but they aren’t “in the system.”

To get things “in the system,” you need to go open up the “Git” panel in RStudio. This is a little hidden: it’s the third tab in the upper right hand panel, next to “Environment” and “History”. Before RStudio will let you do anything, you need to tell it who you are, so that Git can save this information when it tracks your changes. Give Git your name and email address. In the Git panel, click “More” (it’s got a picture of a blue gear; make sure you click the one in the “Git” panel, though, as there’s also one of these in the file browser); now hit “Shell…”. This is access to the class server’s command line. Type:

git config --global user.name "John Doe"

and hit return (replace “John Doe” with your name). Now type:

git config --global user.email "johndoe@example.com"

and hit return (replace “johndoe@example.com” with your email).

Close the Shell window. Now you’re ready. You should never have to do this again, unless you set RStudio/Git up on another computer.

You’ll see that there are a couple of things showing in the main “Git” panel. There is the “assignment-1.Rproj” file, which has yellow question marks next to it. And there is the “README.md” file, which has a blue “M” next to it.

The yellow question marks are Git telling you, “This file is unknown to me.” It’s in your workspace, but it hasn’t been registered anywhere else. In fact, RStudio created this file when you asked it to create the project folder for Assignment 1. It’s there to store some of your RStudio settings specific to this project, so that you can go back to working on Assignment 1 with RStudio open exactly the way you left it last time. I don’t usually save this in my Git repository unless I have a good reason to, but it would never really hurt anything, so you can click the empty checkbox to the left. You’ll see the yellow question marks turn into a green “A”. The “A” stands for “Added”. This means that this file is now in the index, the Second Book of Git.

The purpose of the index is to let you choose what gets copied into the local repository, the Third Book of Git. The Git panel in RStudio gives you a list of all the files that are different in your workspace with respect to the local repository. The goal is to pick a single thing that you did (like, when you finish the answer to one of the questions on the assignment, or some sub-part of an answer to one of the questions on the assignment): a change that you would like to track. Remember that Git doesn’t track the individual changes you make to your workspace. It’s up to you to track them in the Books of Git, one at a time, so that you can come back to them later. It’s a good habit to start doing this every time you feel you’ve made some progress (even if later you discover that you were actually wrong, and the thing didn’t work). The index is a temporary space where you can select the changes you’ve made to the workspace that correspond to one small unit of progress. The first bit of progress you made was setting up the project in RStudio, so, having added this change to the index, you’re now ready to copy it into the local repository.

To do this, click the Commit button in the Git panel. This will show the changes that have been added to the index (a.k.a., “staged”), give you one last chance to change your mind about what goes in there, and then it will ask you to write a short message, called the “commit message” (obligatory). Conventionally, you write these messages in the imperative, as if they hadn’t been done yet: “Add RStudio project file” would be a good commit message. The first line of the commit message should be a short summary. If you want to give more detailed information, leave a blank line, and then write a longer explanation. If your commit message isn’t short, your commit may not be a single small unit of progress. (Remember: computers are dumb. In this case, the computer is sufficiently dumb that it can only bring back old changes if you stop and label them, and put them in the local repository. Otherwise it’s going to lose track of them. To match the computer on its level, you need to stop and think about what you’ve accomplished.) Once you’ve written your commit message, hit “Commit”.

What was once in the Second Book of Git is now in the Third Book of Git. The Second Book of Git has been wiped clean. You see a message that tells you how Git went about updating the local repository in some detail that is not important, but which should convince you that Git is very efficient in its way of bookkeeping changes.

You did a second thing, of course, which was to update the README file. Make a second commit registering this change into the local repository. Give it a description like: “Add description of repository to README”.

Once you’ve done this, your Git panel should be empty, because your workspace should match your local repository. At the top of the Git panel, it will say, “Your branch is ahead of ‘origin/master’ by 2 commits.” The reference to ‘origin/master’ is a reference to the remote repository. The remote repository is stored on Github. In order to sync them up, hit “Push” (with the up arrow, for “upload”). You’ll be asked for the username and password you gave when you created your Github account. The Four Books of Git are in harmony. You can check this by refreshing that tab where you were viewing the Github repository.
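If you are curious what the checkboxes and the Commit button correspond to on the command line, here is a rough sketch you could try from the Shell. It rests on a few assumptions that are not part of the instructions above: it works in a throwaway folder made with mktemp rather than in your assignment-1 folder, it starts from “git init” instead of a Github clone, and the name and email are placeholders. The two-character codes it prints are the command-line versions of the yellow question marks and the green “A”.

```shell
# Throwaway sandbox (assumption): nothing here touches your real assignment-1 folder.
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet

# Identity for this sandbox only; placeholder name and address.
git config user.name "John Doe"
git config user.email "johndoe@example.com"

# A new file in the workspace is unknown to Git: status code "??".
echo "# assignment-1" > README.md
status_untracked="$(git status --porcelain)"
echo "$status_untracked"

# Checking the box in RStudio corresponds to "git add": workspace -> index, code "A".
git add README.md
status_staged="$(git status --porcelain)"
echo "$status_staged"

# The Commit button corresponds to "git commit": index -> local repository.
git commit --quiet -m "Add README"
status_clean="$(git status --porcelain)"   # empty: workspace matches local repository
commit_count="$(git rev-list --count HEAD)"
echo "commits so far: $commit_count"
```

Once the last status comes back empty, the workspace, the index, and the local repository agree, which is the same thing an empty Git panel is telling you.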

Looking at older versions

I won’t show you how to load them into your workspace (because you usually don’t want to do that), but you can look at the changes you’ve made by clicking, in the Github page, where it says “3 commits” (this is the total number of commits you’ve made up to now: one when you created the repository; another when you added the RStudio project file; and another when you modified the README). By clicking on the individual commits, you get to see the changes that you made at each step.

Notice that you did not sync with the remote repository three times (“push”). You only pushed once, just now. The Third and Fourth Books of Git don’t just contain your current work. A repository contains everything you have ever done, organized in this convenient fashion. You could also have viewed the same changes by looking at the local repository. Go back into RStudio, and, in the Git panel, hit “More” (the blue gear again). Click “Shell…”. Now type “git log”. You have the same list of commits (without the detailed changes shown like you do in the Github interface; there is a command, “git diff”, that allows you to see them: try “git diff HEAD^ HEAD”). What’s in the local repository, like what’s in the remote repository, is the whole history of your project.
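To get a feel for “git log” and “git diff HEAD^ HEAD” without risking your real project, you could run something like the following in the Shell. This is only a sketch under stated assumptions: it builds a throwaway repository with mktemp, uses a placeholder identity, and makes two commits that imitate the README changes described above.

```shell
# Throwaway repository (assumption), separate from your assignment-1 folder.
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet
git config user.name "John Doe"              # placeholder identity
git config user.email "johndoe@example.com"

# First commit: the README with just a title.
echo "# assignment-1" > README.md
git add README.md
git commit --quiet -m "Add README"

# Second commit: add a description line to the README.
echo "This repository contains my work for Assignment 1." >> README.md
git add README.md
git commit --quiet -m "Add description of repository to README"

# One line per commit, newest first.
history="$(git log --oneline)"
echo "$history"

# What changed between the previous commit (HEAD^) and the latest one (HEAD).
last_change="$(git diff HEAD^ HEAD)"
echo "$last_change"
```

In the diff output, lines starting with “+” were added by the latest commit and lines starting with “-” were removed by it.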

Cloning and pulling


When you eventually collaborate, or, potentially for the nearer future, if you decide to install R on your own computer and work there, instead of on the class server, then the way you’ll stay synced up (with your collaborators or your friends) is to add, for each separate computer where you want a copy of the project, Three More Books of Git: a new workspace, index, and local repository.

Above, after you created your remote repository on Github, you then went into RStudio and created a new project. You gave it the address of the remote repository, and RStudio created Three Matching Books of Git in your directory on the class server. If you wanted to work on the project on your own copy on your own computer (not connecting to the class server, with actual copies of the workspace on your hard drive: for example, you’re taking a long flight soon); or if someone else wanted to work on the project; then you, or they, would do the same thing again on the appropriate computer. This is called “Cloning.” (If someone else wanted to Push, you’d need to add them as a collaborator on the project. By default, only your username and password will work for Push.)

If some changes are pushed to the remote repository, you will then probably want to Pull them, to update your local repository. You can try this now by hitting the “Pull” (down arrow) button in the Git panel, but nothing will happen.
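To see a Pull that actually does something, you need a second copy that has fallen behind. Here is one way to imitate that entirely on one computer. Everything in it is an assumption made for illustration: a bare repository called remote.git stands in for Github, the folders server_copy and laptop_copy stand in for the class server and your own computer, and the identity is a placeholder.

```shell
sandbox="$(mktemp -d)"
cd "$sandbox"

# Stand-in for the remote repository on Github (assumption).
git init --quiet --bare remote.git

# First set of local books: clone, commit, push.
git clone --quiet "$sandbox/remote.git" server_copy
cd "$sandbox/server_copy"
git config user.name "John Doe"              # placeholder identity
git config user.email "johndoe@example.com"
echo "# assignment-1" > README.md
git add README.md
git commit --quiet -m "Add README"
git push --quiet origin HEAD

# Second set of local books, cloned once the remote has some history.
git clone --quiet "$sandbox/remote.git" "$sandbox/laptop_copy"

# More work is committed and pushed from the first copy...
echo "This repository contains my work for Assignment 1." >> README.md
git add README.md
git commit --quiet -m "Add description of repository to README"
git push --quiet origin HEAD

# ...so the second copy is now behind, and Pull brings it up to date.
cd "$sandbox/laptop_copy"
git pull --quiet --ff-only

server_head="$(git -C "$sandbox/server_copy" rev-parse HEAD)"
laptop_head="$(git -C "$sandbox/laptop_copy" rev-parse HEAD)"
laptop_commits="$(git -C "$sandbox/laptop_copy" rev-list --count HEAD)"
echo "server copy is at $server_head"
echo "laptop copy is at $laptop_head ($laptop_commits commits)"
```

After the pull, both copies point at the same latest commit, which is what “synced up” means here.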

There are a couple of skills you’d need to learn before working with other collaborators (they’re useful for working on your own code as well, but not essential at first). Namely, you’d need to learn to “merge” and to create separate “branches” within a repository. You can find out more about these in the tutorials below.

Using Github without having to enter your username and password (optional)

If you start to get annoyed by having to enter your username and password each time you Push, then you can set up a better way of proving who you are. Follow these steps (https://help.github.com/articles/connecting-to-github-with-ssh/) from the Shell, accessible by hitting “More” in the Git panel in RStudio. (On the class server, you don’t need to follow the steps for MacOS, even if your computer is a Mac. The class server isn’t on MacOS, it’s running Ubuntu. Also, the “pbcopy” command won’t work. You have to type “cat ~/.ssh/id_rsa.pub”, and then copy the output to the clipboard yourself with the mouse.)

You’ll then have to change the way RStudio connects to your repository, from HTTPS to SSH. First, push your latest changes. Then, open a web browser tab to your repository on Github, and click the green “Clone or download” button. Instead of HTTPS, make sure that “Use SSH” is selected. Copy the address given (it should be of the form “git@github.com: … (etc etc) …”). In RStudio, go into the Shell, and remove your existing HTTPS connection to the remote repository:

git remote remove origin

Now add the new SSH connection:

git remote add origin [ADDRESS YOU COPIED FROM GITHUB]
git push -u origin master

You can close the Shell and you should be able to Pull and Push from RStudio without a username or password.
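If you want to double check which address your repository is using before and after the switch, you can ask Git directly. The sketch below does this in a throwaway repository so that it can be run safely; the addresses containing “your-username” are made-up placeholders, not real repositories. In your real project folder, “git remote -v” should show the same information.

```shell
# Throwaway repository (assumption); no network connection is made.
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet

# Start from an HTTPS address (placeholder).
git remote add origin "https://github.com/your-username/assignment-1.git"
remote_before="$(git config --get remote.origin.url)"
echo "before: $remote_before"

# Swap it for the SSH form (placeholder).
git remote remove origin
git remote add origin "git@github.com:your-username/assignment-1.git"
remote_after="$(git config --get remote.origin.url)"
echo "after:  $remote_after"
```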

If you set up Git on another computer, and you don’t want to have to use your username and password from there, you’ll need to copy your SSH private key to that computer. That’s not included in this tutorial, but I can give you a hand if you’re not sure how. Then just do this again on that computer.

Big binary files and Git’s history (boring, but not optional)

One last thing about Git. I told you at the beginning that you can use Git to store almost anything. You shouldn’t use Git to store large files, especially not large binary files. We’ll get back to the binary part in a second. The “large” part means, “more than a few megabytes”, at least when it applies to binary files.

First, what I’m telling you concretely is that you shouldn’t put large files in the local repository or remote repository. If you accidentally put one in the index (by checking the box and getting the “A” for “Added” in RStudio), you can always uncheck it. When you think about the fact that you want to be able to distribute your code and work across multiple computers or collaborators, you can already see that it would be a bit of a pain if, for every new computer you add, you have to first download a repository that’s really big.


But it’s actually worse than this with Git, because of the “binary” part. “Binary” as opposed to “text” files is something you can understand better by reading this article (https://perlmaven.com/what-is-a-text-file) and the Wikipedia pages linked therein. Some examples of binary files are images, PDFs, Word documents, audio files, and compressed files like Zip files or TGZ files. Some examples of text files are source code, R-markdown files (which you’ll learn about today), most HTML files that make up web pages (but later in this assignment I’m going to give you a warning about some of the ones that R creates, which are partly binary), data that’s saved in CSV (comma-separated value) files, or any document you’ve written that you can open up with Notepad on Windows, TextWrangler on Mac, or GEdit on Linux, or Emacs or vi. These programs are all called “text editors” because they won’t save what you’ve written as a binary file. They’ll just save the actual sequence of letters/numbers/spaces that you’ve written directly. That’s a text file.

So, remember that Git saves your whole history in the repositories. For text files, it’s good at doing this efficiently. When you put a change to a file in the index and then commit, it will scan through the files to see what’s been updated, and only save those changes. When you add a file, it will analyze the file and find an efficient way to store it in the first place. When you “remove” a file, it will not be in the latest version, but it will still be there in the history inside the repository.

But, for binary files, these processes are not efficient at all. Git won’t be able to find the changes if you make changes to binary files. It will just store both copies, the old one, and the new one, completely. That means that every time you make a change to that binary file, it’s going to increase the size of the repository by a lot (even if the old version is no longer in your workspace, on any computer, anywhere; it’s in your repository). On top of that, it’s just not very good at storing binary files efficiently at all, and because, whenever you do a push, it likes to go in and do a bit of internal reorganizing, this means that the simple fact of adding a large binary file to your repository is going to slow down literally every single time you push. Because it’s going to go back and look over that file in detail to see if it can store it better. It can’t. But it’s dumb. And, because Git stores your whole history, even if you remove the file from the remote repository, and all the local repositories on any computer in the world, this will still happen.

Large text files aren’t so bad. They only cause the problem of the initial pull to be slow. It’s probably not the best idea, because it’s inconvenient, but it only happens once. So if you wanted to store a bunch of data as a CSV in your repository, it wouldn’t cause too many problems. If you stored it as a compressed file, or if you were doing speech corpus work and stored the original audio files, it probably would. Large binary files, say, over five megabytes, are Not Fit for Git. This will come up later in this assignment.
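One habit that may help (it is a suggestion, not something this assignment requires) is to look for big files in your project folder before you check any boxes in the Git panel. The sketch below shows the idea in a throwaway folder; the file names are invented for the example, and the 5 megabyte cutoff is just the rule of thumb mentioned above. In a real project you would run only the “find” line, from inside the project folder.

```shell
# Throwaway folder with one small text file and one large fake "audio" file (assumptions).
sandbox="$(mktemp -d)"
cd "$sandbox"
printf 'id,value\n1,0.5\n' > small_data.csv
dd if=/dev/zero of=recording.wav bs=1048576 count=6 2>/dev/null   # 6 megabytes of zeros

# List regular files bigger than 5 megabytes, skipping Git's own hidden folder.
too_big="$(find . -type f -size +5M ! -path './.git/*')"
echo "$too_big"
```

Anything this prints is a candidate for staying out of the repository.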

Learn more about Git

Git isn’t only in RStudio. It’s its own tool. You may find it useful to try out other tutorials about Git, which will teach you different things (including some things we didn’t talk about), in a different way. Here are a few.

http://r-bio.github.io/intro-git-rstudio/

https://www.youtube.com/watch?v=uUuTYDg9XoI

https://try.github.io/levels/1/challenges/1

Exercise 1: Getting started with RMarkdown files

In this class, you’re going to get used to working using RMarkdown files. RMarkdown files are a way of doing what’s called “literate programming”. This will make more sense if you first look at an example. This is a lesson from my friend Joe’s stats course at the LSA this year: http://jofrhwld.github.io/teaching/courses/2017_lsa/lectures/Session_5.nb.html

You can see that the document I sent you to is a web page. And it has a bunch of plots in it, to explain linear regression. You will also have seen that it has a bunch of R code in it (for example, right at the top, under “~2 minute setup”). It turns out that this web page was actually automatically generated. Joe didn’t manually paste in those plots and that R code into the web page, the way he would have if he were typing his notes up in, for example, Word, or LaTeX. Joe didn’t do that (I don’t do that either). Those plots were produced by R, and all Joe did to put them in the document was to write the R code that made them. He wrote them into an RMarkdown file, which was then automatically run and converted into the HTML file for the webpage. He could have also made a PDF, or a Word document, without ever leaving RStudio.


Here’s a snippet of Joe’s RMarkdown file (which you can also download by clicking “Code” in the upper left hand corner of the page), in the form of a screenshot of what he probably saw when he was editing it:

Here’s what I’m looking at right now as I’m editing this document:

One of the points of RMarkdown is so that you can save time and mistakes by compiling all the work you’ve done (all your data analysis) into a summary document, in a way that will allow you to guarantee that, before you or anyone else looks over your work, you can make sure everything is working. All the code that you write has to be working at least well enough to not crash before you can see the document. Your code is in your document (you can also make slides, and of course papers and theses), so if you like looking at your document better than your code, this is a good way to make sure you pay some attention to your code.

On the other hand, if you wind up enjoying looking at your code better than you like explaining it and explaining the results (both ways will probably happen to you at some point), this is a good way to force yourself to write a bit of information about what your results mean, which will force you to think about it. And if the result is a messy document with a lot of ugly code in it, it’s also a good way to encourage you to simplify and reorganize your code so that it makes sense to the reader, which is something we will talk about later on in the class.

Your main task in this part of the assignment is to learn the basics of RMarkdown.

Start by reading parts of this tutorial (http://rmarkdown.rstudio.com/lesson-2.html): “How it works”, then “Code chunks”, then “Inline code”, then “Markdown basics”. You can read as much as you want, but read at least these to get an idea of what it’s all about.

Now watch this video (https://www.youtube.com/watch?v=DNS7i2m4sB0). Follow along with the instructions. You’re probably going to wind up making a lot of plots that aren’t interesting to you, or aren’t meaningful to you, but do them anyway, to get familiar with using R. You will understand what all this is by the end of the course.

Use this RMarkdown cheat sheet (https://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf) for reference.

Now create a new RMarkdown file with the title “Assignment 1”, your name as the author, and save it under the filename “assignment_1.Rmd”. Get rid of all the boilerplate text. That title in the header will show up at the top of the document. Below the level of “Title”, Markdown also has “Section headers” and “Subsection headers” and so on. Make one section header (using “##”: this is on the RMarkdown cheat sheet) for each of the six exercises in this assignment: Exercise 1, Exercise 2, Exercise 3, Exercise 4, Exercise 5, and Exercise 6. Knit it to HTML to preview. You won’t be able to create PDFs on the class server, because, in the interest of space, I don’t have the necessary tools installed, but that’s fine. Everything should show up in HTML. You should however be able to knit things as Word documents, which is kind of fun. I’ll definitely only be looking at your work in HTML format, so make sure your assignments look fine in HTML.

Commit the Rmd to your local repository with a meaningful commit message. By convention, remember to write your commit messages in the imperative. Don’t commit the HTML file. So this commit should only be adding one file. (I’ll explain why in a minute.)

This is the last time I’ll tell you explicitly to make a commit. Do it each time you feel like you may have accomplished something meaningful. You will thank yourself! And try to split your changes into small units, which you commit one at a time, even if you’ve made many changes. You don’t have to push yet, but you can.

Now, why did I tell you not to commit the HTML file? Well, there are a few reasons. Normally, when you store your code in a repository and there are output files that can be automatically generated by the computer, you tend not to save them in the repository, so that people don’t have to download them. You also tend not to save the output files because this forces you to go and make sure that they’ll actually work fine on another computer (that is, that someone else really can run all the code and generate the document). You wouldn’t want your collaborators to get stuck. I don’t expect you to do this for this class if you’re working exclusively on the class server, and I don’t expect you’ll have to, because I’ll be looking at your assignments on the class server, but it’s a good idea (and I do expect you to test all your work on the class server if you’re working on your own computer).

But the real reason is what we talked about under “Big binary files” above. So, normal HTML files on the internet are really just text files. But, when R knits your RMarkdown files, it embeds all the plots directly into the HTML. That’s nice, because you can just send that HTML file, by itself, to your collaborator or advisor, as an email attachment for example, and they’ll be able to look over everything. But that also means that the HTML has a large binary file or files (image files) stuck inside it. And you don’t want to put that into a Git repository, for reasons we already talked about.
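If you would rather not rely on remembering to leave the HTML box unchecked, one option is to tell Git to ignore knitted HTML files altogether, using the “.gitignore” file that was created along with your repository. This is a suggestion of mine for illustration and not a required step; the sketch below tries it in a throwaway repository with empty stand-in files. In your real project, the only line you would need is the one that appends “*.html” to “.gitignore”.

```shell
# Throwaway repository (assumption) with stand-ins for the source and knitted files.
sandbox="$(mktemp -d)"
cd "$sandbox"
git init --quiet
touch assignment_1.Rmd assignment_1.html

# Ask Git to ignore every file whose name ends in .html.
echo "*.html" >> .gitignore

# The Rmd (and .gitignore itself) still show up as untracked; the HTML no longer does.
visible="$(git status --porcelain)"
echo "$visible"

# check-ignore succeeds (exit status 0) when the path matches an ignore rule.
if git check-ignore -q assignment_1.html; then html_ignored="yes"; else html_ignored="no"; fi
echo "HTML ignored: $html_ignored"
```

Note that this rule would also hide any hand-written HTML files, so it only makes sense in a project where all the HTML is generated.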

Finally, let’s insert some text. Type one sentence under Exercise 1 giving one fact about Git, R, RStudio, or statistics, or about one of your experiences with one of them during class, or in the course of doing this assignment so far.

Exercise 2: Starting R programming

In Exercise 2, we’re going to learn a bit of R programming. Let’s put a couple of R code chunks under Exercise 2. I’m going to walk you through roughly what each line means, but then I’m going to leave it up to you to explain the rest. Here’s the first bit of code that you should paste inside a new R code chunk (if you don’t know how to insert code chunks, go back to the tutorials I linked above).

possible_outcomes <- c(0, 1, 2, 3, 4, 5)
outcome_probabilities <- c(0.1, 0.5, 0.2, 0.1, 0.05, 0.05)
n_data_points <- 400

set.seed(1)
fake_data_points <- sample(possible_outcomes,
                           n_data_points,
                           replace=T,
                           prob=outcome_probabilities)
set.seed(NULL)

fake_data_set <- tibble::data_frame(`Fake measurement`=fake_data_points)

When you put this code chunk into your document and knit, you won’t see any output. You’ll just see the code. Create another (separate) code chunk, right underneath it, which has the following code in it.

ggplot2::ggplot(fake_data_set, ggplot2::aes(x=`Fake measurement`)) +
  ggplot2::geom_histogram(bins=5, colour="black", fill="lightgrey")

This will make you a histogram, like the ones that we saw in class, made using ggplot. We’re going to gloss over the usage of ggplot for this assignment. Just assume that it works. You won’t need to make your own plots yet. Let’s look over the first code chunk. Let’s start with the first line.

possible_outcomes <- c(0, 1, 2, 3, 4, 5)

This line stores a value in a variable. Variables are just ways of labelling information in a computer program. Watch this video explaining variables. It doesn’t make reference to R, but the concept of variables in computer programming is general. After watching this video, you’ll understand what it means when I say that:

• R doesn’t require you to declare variables before you use them, but many programming languages do
• R automatically initializes variables for you, but many programming languages don’t
• There are various conventions for naming variables that you should follow so that your code is easier for you and everyone else to read. (Code is for people. Computers don’t read code directly. They have to convert it to machine language first.) One rule is that, except under certain special conditions, you start variable names with a small letter (not a capital letter). Another rule is that variable names are meaningful, obvious descriptions of what is stored in the variable. A question arises with regard to what you do if you want to use multiple words to name your variable. R actually lets you use spaces if you do some special magic, but don’t do this. Almost no programming language lets you do this, and it will throw your readers off. The special magic is also a bit of a pain. Another possibility is to put a capital letter every time a new word starts, except at the beginning. That’s what they did in this video. But don’t do this either. Java programs and JavaScript programs are typically written like this. I am slightly convinced that it’s slightly difficult to read (although there is zero empirical evidence), but the real reason is that you should go easy on your readers, and your readers will expect to be reading R code. Modern R code is usually written with underscores (_), as I’ve done here. So do that.
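To make the naming conventions above concrete, here is a small sketch. The variable names are made up for illustration and are not part of the assignment code.

```r
# Recommended: lower-case words separated by underscores.
n_early_students <- 3

# Legal, but not recommended: a name containing spaces has to be wrapped
# in backticks every single time you use it.
`n early students` <- 3

# Also legal, but this is the Java/JavaScript convention rather than the
# usual modern R one.
nEarlyStudents <- 3

n_early_students + 1
```

All three names work; only the first follows the convention used in this assignment.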

Now watch this video and follow along by typing the code directly into the R console, which should be available on the bottom left of your RStudio screen (you don’t need to first type it into a text window and hit “Run” the way the guy in the video does; you can type it directly into the console).

After watching this video, I can now tell you two important things:

• The line in our code chunk creates a vector of numbers, and they represent the possible values of some kind of observation that we might have in a data set. If this is too abstract, imagine that our imaginary data set is “number of students who arrived early for class in a class of five students.”

• Vectors in R have to be of one single type of information. They can’t mix, for example, character strings (text) and numbers. They have to be either all text or all numbers (or all of some other type, and there are some others).
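A quick way to see the “one single type” rule for yourself is to try mixing types and ask R what it ended up with. This is only a sketch with made-up variable names; class() is the standard R function for asking what type of thing a value is.

```r
numbers_only <- c(1, 2, 3)
text_only <- c("one", "two", "three")

# Try to mix numbers and text in one vector.
mixed <- c(1, "two", 3)

class(numbers_only)  # "numeric"
class(text_only)     # "character"
class(mixed)         # "character": the numbers were converted to text
mixed
```

R does not refuse the mixed vector; it quietly converts everything to text so that the vector still holds a single type.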

A note. When we’re talking about “variables” in “memory” with “variable names”, we’re not talking about files in a folder on a computer. We’re not talking about your workspace, in Git terms. In fact we’re not talking about anything that’s stored permanently on the class server, or on your computer. Here’s a video that explains the difference between what we call “memory” and “disk” or “storage”. To see this more clearly, I’d like you to copy and paste that first line of the code chunk into the R console, and hit the return key. You’ll see that, over on the right, in the “Files” pane of RStudio that lists files like “.gitignore” and “assignment_1.Rmd” (if it’s not currently showing, bring it up in the lower right), nothing new has been added. Yet, you have indeed created a new variable with this information in it, and, to demonstrate that, just type the name of this variable into the console, and its contents will be printed (in a slightly difficult to understand format that has a “[1]” in front of it: more on that later). This variable is stored in memory. When you log out of your RStudio session and log back in, it’ll be gone (this is mostly true; R has a way of backing up what’s in memory, but anyway, it’s true by default on the class server, as you can demonstrate for yourself). And this variable is not among the things that you could ever put in your Git repository. It’s something different. It’s not stored permanently as a file. It’s only stored in memory.
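If you would like to check the memory-versus-disk distinction from the console rather than by looking at the panes, here is a minimal sketch. It uses the standard R functions ls() (variables in memory), list.files() (files on disk in the current folder), and rm() (remove a variable from memory); the names in_memory_before, on_disk, and in_memory_after are made up for the example.

```r
possible_outcomes <- c(0, 1, 2, 3, 4, 5)

# ls() lists the variables currently held in memory by this R session.
in_memory_before <- "possible_outcomes" %in% ls()

# list.files() lists the files on disk in the current folder. Creating a
# variable does not create a file with that name.
on_disk <- "possible_outcomes" %in% list.files()

# rm() removes the variable from memory. No file is touched.
rm(possible_outcomes)
in_memory_after <- "possible_outcomes" %in% ls()

in_memory_before
on_disk
in_memory_after
```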

Code is a set of replicable and reliable instructions for making things happen, including creating variables in memory. What you’ve got in this code chunk will have the same effect on my computer or your computer as it does on the class server. Its first step will be creating a variable in memory.

The line that starts with fake_data_points <- ... is also the beginning of a variable assignment. To start to understand this better, I’d like you to run the entire code chunk now (the first code chunk, not the second one that added the histogram). You can do this by clicking somewhere inside it, and then either clicking the “Run” button (towards the top of the screen: it has a white box and a green arrow pointing right) followed by “Run Current Chunk”; or by pressing Ctrl-Shift-Return on Windows or Linux, or Cmd-Shift-Return on Mac. You’ll see that the “Console” panel at the bottom of the screen has disappeared, but you can bring it back by clicking on the word “Console”. You’ll then see that it’s as if you typed all that stuff into the console. RStudio has done it for you.

https://www.youtube.com/watch?v=_sVtcPgHAjI

https://www.youtube.com/watch?v=rpG0Dj-GO8Y

https://www.youtube.com/watch?v=8TfLBrtQ2sY

Type “fake_data_points” into the console to print the contents of this variable. You’ll see that it’s long. You’ll also see that it has a lot more numbers in square brackets at the left hand side (e.g., not just [1], but also some other numbers like [36], [71], and so on; the actual numbers will vary from person to person and screen to screen). These numbers are counting the elements of the vector for you. So what [1] is saying is: this line that I’ve printed for you starts off with the first element of the vector. What [36] would be saying would be: this next line starts off with the 36th element of the vector. These numbers aren’t stored in the vector in memory. They’re just there to help you read its contents.
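Here is a small sketch showing that the bracketed numbers are positions rather than contents. The vector example_vector is made up for the illustration (it simply holds the whole numbers from 101 to 140), and square brackets after a vector name are the standard R way of asking for the element at a given position.

```r
example_vector <- 101:140

# Printing the vector shows bracketed position markers at the start of
# each printed line. Where the lines break depends on your console width.
example_vector

length(example_vector)  # 40
example_vector[1]       # 101
example_vector[36]      # 136
```

The number 36 never appears among the contents of example_vector; it only tells you which position you are looking at.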

Its contents, on the other hand, are a large quantity of numbers, as you can see. They’re all either 0, 1, 2, 3, 4, or 5. In fact, this variable contains a number of fake data points, which are the ones plotted in the histogram.

The statement that creates these fake data points and does this variable assignment is split over multiple lines. It starts on the line that I pointed to (fake_data_points <- ...) and it ends on the line that ends in ).
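As a separate illustration of one statement spread over several lines, here is a sketch that uses a different function, seq(), so as not to give away Exercise 2. The variable name three_steps is made up. R keeps reading until the opening parenthesis has been closed, so these three lines form a single statement.

```r
# One statement, three lines: R keeps reading until the closing ")".
three_steps <- seq(from=0,
                   to=10,
                   by=5)

three_steps  # 0 5 10
```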

So: your task in Exercise 2 is to give me your best guess as to what all of the other lines in this code chunk are doing. This exercise is not graded on whether you have the right answer. I’m going to evaluate it based on how you arrived at your guess, which you’re going to explain to me. (If you tell me in your answer you already understood this code because you already know R, that’s fine, I won’t evaluate this part of the exercise. I’ll just tick off whether you did something for this part.) The answer “I really don’t know” is also fine, as long as you explain to me what you tried doing in order to figure it out. And the only answer that’s not fine is one that you copied from someone or someplace else and that doesn’t demonstrate that you actually tried to understand what it said. It’s fine to ask, and it’s fine to put the text of, or a link to, a site or a book where you got some helpful information. (Or R’s help. R has help. See under Statement 4 in my example answer below.) But you have to explain it for yourself, and if you don’t understand it all, you have to say what part you don’t understand and why. Keep in mind: all that’s happened by the end of this chunk is that we’ve created some fake data that we’re going to plot later on.

Organize your answer statement by statement. So it should look something like this.

• Statement 1. My best guess is that this statement creates a vector of numbers called “possible_outcomes”. These numbers are all the possible outcomes that we expect to observe in the fake data set we’re going to plot. I arrived at this guess using several sources of evidence:

– I think it’s a variable assignment because in the videos, and on the made-up website “R-ll about R”, they show variable assignment statements that have, first, the variable name, then “<-”, then the contents of the variable.

– I think it contains all the possible outcomes because it’s called “possible_outcomes”, although I am aware that not all programmers are good programmers who give their variables meaningful names, which is why I double checked this in other ways.

– I printed the contents of “fake_data_points” and/or “fake_data_set” on the console, and I went through manually to verify that the only fake observations we ever observe are 0, 1, 2, 3, 4, and 5: the numbers contained in this variable.

– I tried changing the code and re-running it, by changing the contents of “possible_outcomes”, and I observed that …

– My other source of information is that Ewan already said this is what this statement does.

– [Okay, Statement 1 is a little artificial. If you do in fact decide to do something to “figure out” what this statement does, do mention it, but you won’t get any extra points for that. It’s just for your edification.]

• Statement 2. …

• Statement 3. …

• Statement 4. [Blank lines aren’t statements. I’m talking about the line: set.seed(1). Also, I have tips about how you should start this one that will help you for the rest. First, try the help. The way (one way) that you use the help in R is by going to the console, and typing in a question mark followed by the name of the thing you need help with. What you need help with here is “set.seed”. So you’d type ?set.seed and hit return. Second, don’t have high expectations about the help. Expect to do some internet searches. Third: experiment, and take note of the other line that has set.seed in it. Fourth: consider figuring out other parts of the code first.] …

• Statement 5. [Remember, this is one statement split up over four lines. We’ve already talked about what it does, so the question is for you to make that explicit, and then explain how it works, explaining all the different parts of the statement as best you can.] …

• Statement 6. [The last line in the code chunk. Note that what’s being used here is data_frame and not data.frame. When you try and look in R’s help, it might want to “correct” you to data.frame. Turns out that’s not so bad, because, as you’ll see if you look in the help for data_frame, the two are related. Do as much research as you can to give as good an explanation as you can.]

And remember: I don’t know because … is a great, great, great, answer. It’s very often the best answer, for sufficiently well thought through values of “…”.

Exercise 3: Reasoning about numerical data

iris_groups23 <- dplyr::filter(iris, Species %in% c("versicolor", "virginica"))
ggplot2::ggplot(iris_groups23, ggplot2::aes(x=Sepal.Width)) +
  ggplot2::geom_histogram(colour="black", fill="lightgrey", binwidth=0.1) +
  ggplot2::facet_grid(Species ~ .)

[Figure: two stacked histograms of Sepal.Width (x-axis, ticks from 2.0 to 3.5) against count (y-axis, ticks from 0.0 to 12.5), one panel for versicolor and one for virginica.]

The above pair of histograms are taken from a very famous data set. It has nothing to do with language, but it’s famous, so you should see it. First, read about the data set on its Wikipedia page (yes, a data set with its own Wikipedia page: I told you it was famous). Make sure you look at the pictures of the flowers, as it will make a lot more sense.

This is just one of the measurements, the sepal width, for just two of the species, Iris virginica and Iris versicolor. The bins of the histogram are spaced every 0.1 centimetres, not including the upper value (thus, the first bin ranges from 2.0 to 2.1, inclusive of 2.0 but not of 2.1).
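To illustrate what “inclusive of the lower value, exclusive of the upper value” means in practice, here is a sketch on a handful of made-up widths (not the iris data). It uses base R’s cut() with right=FALSE, which assigns each number to a bin that includes its lower edge and excludes its upper edge. This is only meant to illustrate the convention described above; it does not reproduce how ggplot computes its bins. The names made_up_widths, bin_edges, and bins are invented for the example.

```r
# Made-up widths in centimetres, chosen so that some fall exactly on a bin edge.
made_up_widths <- c(2.0, 2.05, 2.1, 2.15, 2.2)
bin_edges <- c(2.0, 2.1, 2.2, 2.3)

# right=FALSE: each bin includes its lower edge and excludes its upper edge,
# so 2.1 is counted in the second bin, not the first.
bins <- cut(made_up_widths, breaks=bin_edges, right=FALSE)

# Number of made-up observations falling in each bin.
table(bins)
```

With these made-up values, the three bins contain 2, 2, and 1 observations respectively.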

Just like in all of the problems we did in class, I’ll tell you that the two groups of observations have the same N, and that, in this case it’s 50 each.


https://en.wikipedia.org/wiki/Iris_flower_data_set

Question 3a. Explain what the histograms mean by explaining what a few of the bars mean for each. (In general, feel free to include any of the figures I provide in your answer, by copying the code into an R chunk or chunks in your document.) Check that the histogram is correct for this subset of the “versicolor” measurements by drawing it out by hand, on paper. Hint: you can work out what the bins are from the histograms. Scan or take a picture and include it in your assignment. (I will explain how to include it below.)

library(magrittr)
iris_versicolor_subset <- dplyr::filter(iris,
                                        Sepal.Width <= 2.5,
                                        Species == "versicolor") %>%
  dplyr::select(Sepal.Width, Species)
knitr::kable(iris_versicolor_subset)

Sepal.Width  Species
2.3          versicolor
2.4          versicolor
2.0          versicolor
2.2          versicolor
2.2          versicolor
2.5          versicolor
2.5          versicolor
2.4          versicolor
2.4          versicolor
2.3          versicolor
2.5          versicolor
2.3          versicolor
2.5          versicolor

To include the drawing, get the image of your drawing onto your computer. Then look at this link to find out how to upload the picture from your computer to the class server. Now, insert a code chunk with code like this:

knitr::include_graphics("[FILENAME OF IMAGE FILE YOU UPLOADED]")

If you want to adjust the size of the image, you can add, instead of the opening {r}, {r, out.width='50%'}, to make the image take up only 50 percent of the total width of the page (for example).

Question 3b.

Here is the histogram of the data, with both groups pooled together:

iris_groups23 <- dplyr::filter(iris, Species %in% c("versicolor", "virginica"))
ggplot2::ggplot(iris_groups23, ggplot2::aes(x=Sepal.Width)) +
  ggplot2::geom_histogram(colour="black", fill="lightgrey", binwidth=0.1)


https://www.youtube.com/watch?v=aTv2gHYhreM

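[Figure: single histogram of Sepal.Width (x-axis, ticks from 2.0 to 3.5) against count (y-axis, ticks from 0 to 20), with versicolor and virginica pooled together.]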

Verify that the two small histograms add up to the big one, using the examples of four specific bins.
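The logic of “the small histograms add up to the big one” can be sketched on made-up numbers (not the iris data): count each group separately, count the pooled data, and compare. The vectors made_up_group_1 and made_up_group_2 and the other names are invented for the example; table() and factor() are standard R functions.

```r
# Two made-up groups of measurements.
made_up_group_1 <- c(2.0, 2.2, 2.2, 2.5)
made_up_group_2 <- c(2.2, 2.5, 2.5, 3.0)
all_values <- c(2.0, 2.2, 2.5, 3.0)

# Count how often each value occurs, using the same set of values for
# every table so that the counts line up.
counts_1 <- table(factor(made_up_group_1, levels=all_values))
counts_2 <- table(factor(made_up_group_2, levels=all_values))
counts_pooled <- table(factor(c(made_up_group_1, made_up_group_2),
                              levels=all_values))

counts_1 + counts_2
counts_pooled

# TRUE if the per-group counts add up to the pooled counts everywhere.
all(counts_1 + counts_2 == counts_pooled)
```

For the question itself you are asked to do this check by reading four bars off the figures, not by running code.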

Question 3c.

We want to discuss two hypotheses, like we did in class:

• Hypothesis A: The virginica and versicolor iris species are the same in terms of sepal width.
• Hypothesis B: The virginica and versicolor iris species are different in terms of sepal width.

Given that we are talking about two whole species, rather than two individual irises, or even just these specific irises that we’ve observed, the two hypotheses are in need of a good deal of explanation. Explain what the two hypotheses actually mean. Make reference to the figures above. Be specific about exactly what ranges of values, and how many, we would predict to see in further measurements from each species, under each of the two hypotheses.

Question 3d.

Try and construct an argument in favour of either Hypothesis A or Hypothesis B. It shouldn’t be stated in statistical terms that we haven’t talked about yet. It should, however, make reference to the three histograms above. Note that I don’t actually know the true answer to this question, so there is no way I can evaluate you on being “right.” I’m just asking you to put forward some kind of argument, one way or the other.

Exercise 4: Reasoning about categorical data

# Install the data: this line is commented out, no need for it if you're
# on the class server, as I've already installed the data package. If you
# want to include any figures in your assignment, or if you want to look at
# the raw data yourself, and you're
# working on your own computer instead of the class server, copy this line,
# without the initial "#" , and run it just once on the console, while
# hooked up to an internet connection. The data should install.
#
# devtools::install_github("ewan/stats_course", subdir="data/stress_shift")

ggplot2::ggplot(stressshift::stress_shift_permit,
                ggplot2::aes(x=Category, fill=Syllable)) +
  ggplot2::geom_bar(position="dodge", colour="black") +
  ggplot2::scale_fill_brewer(palette="Set3")

[Figure: bar plot of count (y-axis, ticks from 0 to 40) by Category (x-axis: Noun, Verb), with side-by-side bars coloured by Syllable (Syllable 1, Syllable 2).]

These bar plots represent the number of times the English word permit (noun) was marked as having the stress on the first syllable, or on the second syllable; and the number of times permit (verb) was marked as having the stress on the first syllable (I can tell you: once), or on the second syllable; in a large collection of English dictionaries. As in all the exercises we’ve done up to now, and everything we’ll see today, there are the same number of data points for both nouns and verbs: exactly 46 each.
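If you want to see the kind of counting that sits behind a bar plot like this, here is a sketch on a tiny made-up table of six dictionary entries. These are not the real stress_shift_permit data; only the column names Category and Syllable are taken from the plotting code above, and everything else (made_up_entries, counts_by_category) is invented for the example.

```r
# Six made-up dictionary entries: three for the noun, three for the verb.
made_up_entries <- data.frame(
  Category = c("Noun", "Noun", "Noun", "Verb", "Verb", "Verb"),
  Syllable = c("Syllable 1", "Syllable 1", "Syllable 2",
               "Syllable 2", "Syllable 2", "Syllable 1")
)

# Cross-tabulate: one row per category, one column per stress pattern.
counts_by_category <- table(made_up_entries$Category,
                            made_up_entries$Syllable)
counts_by_category

# Row totals: the number of entries per category (the same for both here).
rowSums(counts_by_category)

# Column totals: the counts with Noun and Verb pooled together.
colSums(counts_by_category)
```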

Question 4a. Here is the same data, with both Noun and Verb pooled together:

ggplot2::ggplot(stressshift::stress_shift_permit, ggplot2::aes(x=0, fill=Syllable)) +
  ggplot2::geom_bar(position="dodge", colour="black") +
  ggplot2::scale_fill_brewer(palette="Set3") +
  ggplot2::xlab("") +
  ggplot2::theme(axis.text.x=ggplot2::element_blank(),
                 axis.ticks.x=ggplot2::element_blank()) +
  ggplot2::xlim(c(-1,1))

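[Figure: bar plot of count (y-axis, ticks from 0 to 40) with Noun and Verb pooled together, with side-by-side bars coloured by Syllable (Syllable 1, Syllable 2).]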

We want to discuss two hypotheses, like we did in class:

• Hypothesis A: Permit (noun) and permit (verb) are the same in terms of their stress.
• Hypothesis B: Permit (noun) and permit (verb) are different in terms of their stress.

Why isn’t the question already answered for us by this graph? Explain what the two hypotheses really mean, like you did in the previous exercise. Make reference to the figures above. Be specific about exactly what we would predict to observe for permit (noun) and permit (verb) under each of the two hypotheses.

It will probably be useful to state explicitly what you think the possible sources of variability in dictionary entries might be. These dictionaries date back to the seventeenth century, and this data was collected for a study of historical change, but do not limit yourself to just the age of the dictionary. Note, though, that in this data set, a single dictionary never lists two different possible variants of the stress pattern, either for the noun or for the verb.

Question 4b. Try and construct an argument in favour of either Hypothesis A or Hypothesis B. It shouldn’t be stated in statistical terms that we haven’t talked about yet. It should, however, make reference to the three bar plots above. As before, I’m just asking you to put forward some kind of argument, one way or the other.

Exercise 5: Reasoning about count data

library(magrittr)
set.seed(1)
ver_balanced <- languageR::ver %>%
  dplyr::group_by(SemanticClass) %>%
  dplyr::sample_n(198)
set.seed(NULL)

ggplot2::ggplot(ver_balanced, ggplot2::aes(x=Frequency)) +
  ggplot2::geom_histogram(fill="lightgrey", colour="black", binwidth=250) +
  ggplot2::facet_grid(SemanticClass ~ .)

[Figure: two stacked histograms of Frequency (x-axis, ticks from 0 to 20000) against count (y-axis, ticks from 0 to 150), one panel for opaque and one for transparent.]

The above histogram is count data: frequency counts from a corpus for verbs containing the Dutch prefix ver-. This prefix is cognate with the now somewhat rare English prefix for-, as in forswear, forbid, and forget. Dutch verbieden has the same meaning as English forbid, and as should be evident, there is (today) nothing that transparently resembles the meaning of bid (as in, make a bet) in either forbid or verbieden. Thus both are semantically “opaque.” Dutch, however, has some ver- verbs that are semantically transparent (verminderen: minder means “fewer”, and verminderen means “reduce”, i.e., “make fewer”). These histograms are grouped according to whether they are semantically opaque or semantically transparent. There are 198 observations in each group. The bin width is 250, as above, inclusive on the bottom, exclusive on the top.

Question 5a. This is a histogram of count data. Each observation is a count of the number of times a certain word occurs. Thus, there are two kinds of counts involved in generating this graph. Explain in your own words what they each are.

Question 5b. We want to discuss two hypotheses, like we did in class:

• Hypothesis A: Semantically transparent and opaque ver- verbs are the same in terms of their frequency.
• Hypothesis B: Semantically transparent and opaque ver- verbs are different in terms of their frequency.

Why isn’t the question already answered for us by this graph? Explain what the two hypotheses really mean, like you did in the previous exercise. Make reference to the figures above. Be specific about exactly what we would predict to observe for transparent and opaque verbs under each of the two hypotheses.

Question 5c. Try and construct an argument in favour of either Hypothesis A or Hypothesis B. It shouldn’t be stated in statistical terms that we haven’t talked about yet. It should, however, make reference to the three histograms above. As before, I’m just asking you to put forward some kind of argument, one way or the other.


Installation on your own computer (optional)

This step is only if you want to install R, RStudio, and Git on your own computer. The advantage to this is that you won’t be dependent on having an internet connection, on the server at Paris V (which might be slow or out of service - I hope not - but which will definitely disappear after the academic year is over), and, importantly, that you’ll have a lot of disk space. There isn’t much disk space for you on the server, as we discussed in class. You’ll also have a chance to test your code on another system, which is a very good idea. If you are distributing your code, it’s so other people can use it. It should work for them. If your code won’t work with a generic, vanilla R installation, with no extra packages installed, you want to explain in your README how to make it work, and to do this, you want to test it out somewhere other than the class server. Nevertheless, this step can sometimes give people trouble, so it is strictly optional.

Windows

Installing Git

Installing R and RStudio

If you install Git before RStudio, then RStudio should detect Git. Skip down to “After things are installed” below.

Mac and Linux

Installing Git

Installing R and RStudio

If you install Git before RStudio, then RStudio should detect Git.

After things are installed

Go through the exact same steps on your own computer that we just went over above on the class server in the first step of the assignment. You won’t have to re-create your Github account or the Assignment 1 repository, of course, you’ll just need to tell RStudio to create the workspace, index, and local repository on your local computer, based on what’s in the remote repository.

Remember that these are independent from the ones on the class server. You can think of them as Books 5, 6, and 7 of Git (or Books 1a, 2a, 3a). To sync up the class server and your own computer, you need to sync them up via Book 4, the remote repository. If you’re working on both your computer and the class server, the best thing to do is, whenever you sit down to work on either one, first Pull. When you’re finished, Push. If you want to avoid dealing with Merge (resolving conflicts), do this systematically, and it will never come up. Merge is something you’d rather be mastering once you get the basics down, which is why I haven’t talked about it here.

We haven’t talked about R packages yet, but I told you in class that you shouldn’t install R packages on the class server, because of space issues. But you will definitely need to install some packages on your own computer. To find out what packages you need to get started, try running the code chunks you see in this assignment. R will complain that things aren’t installed (for example, dplyr and ggplot2). In RStudio, this is pretty easy. From the Tools menu, hit “Install packages…”. Most packages will be available from the central R package repository, CRAN, and you’ll therefore be able to install them just by finding them in the list.
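If you prefer the console to the menu, here is a minimal sketch of the same idea, to be run on your own computer only (not on the class server). The list of package names is an assumption based on the code chunks that appear in this assignment, and the variable names needed_packages and missing_packages are made up. requireNamespace() and install.packages() are standard R functions. The stressshift data package is not on this list because it is installed from Github, as described in Exercise 4.

```r
# Packages that the code chunks in this assignment appear to use
# (an assumed list; adjust it if R complains about something else).
needed_packages <- c("dplyr", "ggplot2", "tibble", "knitr",
                     "magrittr", "languageR", "devtools")

# requireNamespace() returns TRUE if a package is already installed.
is_installed <- vapply(needed_packages, requireNamespace,
                       logical(1), quietly=TRUE)
missing_packages <- needed_packages[!is_installed]

# Show which packages still need installing.
missing_packages

# On your own computer, remove the "#" below and run the line once,
# while connected to the internet:
# install.packages(missing_packages)
```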


https://www.youtube.com/watch?v=cEGIFZDyszA

https://www.youtube.com/watch?v=GAGUDL-4aVw

https://git-scm.com/book/en/v2/Getting-Started-Installing-Git

https://www.youtube.com/watch?v=Ywj6yNfc5nM
