This document is a short guide for someone starting R, or just coming back to it. The writeup is terse and does not seek to explain matters fully. Instead, it is intended as a quick reference for the reader to get going fast and then on to other work.
In this writeup you will:
- install R on your own computer (unless you have already done so);
- install RStudio (optional but useful);
- do one quick basic operation in R to make sure your system is now capable of running R.
I like to use R because:
- R is open source software, and is freely available; you don’t have to be logged into the LSE (or your work) network to use it; you can use it without being connected to the Internet; you can perform your research while on a long flight or on the beach; you can freely install and use R on as many machines as you like;
- a large community of scientists across many disciplines works with R regularly, and posts online questions, answers, and experiences regarding it (e.g., “R vs Stata: .. Datasets”, “R – a second language”, and many others);
- R is a language besides being statistical software, so you can extend R to pretty much any application your mind can imagine;
- R encourages open science—literate programming and reproducible research—and thus makes convenient the replication of empirical findings;
- worldwide, network servers from Argentina and Colombia through Vietnam and New Zealand carry its latest versions;
- in poorer societies these features of R promote research and human capital accumulation, so public spending can then usefully go elsewhere rather than on costly licensing arrangements;
- R is constantly being improved.
Convenient summaries of R commands are available (e.g., the cheatsheet or the Wikibook) but won’t necessarily be the best way to start learning to use the software. Books on R (e.g., Michael Crawley 2012 or its earlier first edition) are similarly useful as references but, again, might not always be where someone should head first to get going quickly.
Instead, what I’ve found useful to start is simply to cut and paste from what other people have already written that is most closely related to what I want to do. I intend the exercises that follow to give you that kind of a base so you can then get going on your own research. There is nothing holy or admirable or morally uplifting about writing code from scratch when others have already done so. Our primary goal on this journey is to find out things about the world; aesthetics are secondary.
Before plunging in, some points that many first-time users might not routinely think about:
- To run lines of code, you have to be totally obsessive about getting things exactly right [sometimes you’re lucky and things work anyway even if you slip up, but it’s best not to rely on that].
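- If something appears in quotes, i.e., like "[…]", make sure you put those quotes in exactly. A double quote " and a single quote ' are different characters: whichever one opens a piece of text must also close it, and the curly quotes that word processors substitute will not work in R code. Use the right ones.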
- If a name or a command is UPPER CASE or lower case, make sure that’s exactly how you type it. R distinguishes case.
- Sometimes R chatters back at you, with no action required back from you. Sometimes it tells you something you need to fix. Either way, pay attention to what it says, even if only to ignore it after you get the meaning.
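The points about case and quotes can be checked directly in the Console. The following is a minimal sketch; the object names myValue and MyValue are made up for illustration.

```r
# R distinguishes case: these two names refer to two separate objects.
myValue <- 1
MyValue <- 2
identical(myValue, MyValue)    # FALSE

# A piece of text must be closed by the same kind of quote that opened it.
# Written either way, the result is the same three-letter text.
identical("abc", 'abc')        # TRUE
```

Typing myvalue (all lower case) at this point would most likely produce an "object not found" message, since no object of that exact name has been created here.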
R, RStudio, and Perl
R is the core collection of routines for statistical computing, while RStudio provides a convenient front end to R. The way that I operate, RStudio works best for me. Others might prefer to engage with R directly, or use a different front-end environment to R.
For many things I do, it is convenient to have R draw on the added functionality of the (separate) Perl language. As one example, to read data from Excel spreadsheets, R uses Perl modules, previously written by others and made freely available.
Therefore, for some, R alone suffices; for me, I want all three.
(Others might wish to use R in tandem with yet other additional software. They’re free to do so.)
If you’re on Mac OS X or Linux, you can skip installing Perl as you already have it on your machine. If, however, you’re on Windows, point your Internet browser here and download and install Strawberry Perl for Windows.
For R, point your Internet browser at this landing page. This gives you information on R generally and shows you a link to download R. Go there and select from the list a CRAN Mirror nearest you. I chose the one at Imperial College but it doesn’t much matter: they all work the same way. If you’re reading this document from Beijing, say, you might of course want to choose a different CRAN Mirror. Once you’ve selected the mirror, choose “Download R for Windows”, “Download R for (Mac) OS X”, or “Download R for Linux”, depending on your system. Download the installer file and run it to install R on your machine.
Again, this is optional but RStudio provides a clean and convenient interface to use R. Point your browser here and choose the version appropriate for your platform. The website actually guesses what you’re going to need and serves that up for you as a lead recommendation. If, however, the website gets it wrong, you go ahead and choose what will work for you. Download and install.
(If you really want to get fancy, you can select the RStudio Server but that’s only if you run your own Linux server, in which case you likely shouldn’t even be reading this document.)
With a fresh R on your computer, depending on what you want to do, you will need to install some libraries first—but you’ll only need to do this once. Here, we want to read data from an Excel spreadsheet into R, so we need to augment R with some libraries.
Fire up RStudio and go into “Tools/Install Packages” (i.e., mouse over to the “Tools” menu item, click on it, and activate the “Install Packages…” entry). You’ll see that some defaults have already been filled in: if you know what you want with alternative values to these defaults, go ahead and plug in those values. Otherwise, just leave them. Type gdata into “Packages (separate multiple…)”. Make sure “Install dependencies” is activated, and then press “Install”.
The gdata library is what will let R read data in Excel spreadsheets. To install this library, your machine needs to be online, as R will reach out to its servers, wherever they are, to retrieve this code and then install the library on your machine.
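The same installation can also be done from the Console rather than the menus. Below is a minimal sketch under that assumption; ensure_package is a hypothetical helper name introduced here, not something that ships with R, while install.packages() and requireNamespace() are standard R functions.

```r
# Hypothetical helper: install a package only if it is not already available,
# then report whether it can now be loaded.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Needs an Internet connection; dependencies = TRUE mirrors ticking
    # "Install dependencies" in the RStudio dialog.
    install.packages(pkg, dependencies = TRUE)
  }
  requireNamespace(pkg, quietly = TRUE)
}
```

Calling ensure_package("gdata") once should leave the library installed; later calls then return TRUE without downloading anything again.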
In RStudio (if you’re there still, or if you’ve just come back to this, start up RStudio and then) set your working directory by “Session/Set Working Directory …/Choose Directory”. This is one way to do it; alternatively, you can key a setwd() line of R code into the panel labelled Console and hit return, giving your own working directory as the quoted path. The tilde ~ in such a path denotes your home directory, wherever that might be. (Mine differs depending on whether I happen to be using Windows or Linux right at that moment; on Linux it is /home/dquah. The nice thing about using the tilde is that my code then works the same regardless of where I am.) Alternatively, you can copy and paste such a line into your own RStudio Console, edit the relevant clause with your keyboard, and then hit return. You can do this for any of the chunks of R code that follow.
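As an illustration of setting a working directory in code, here is a small sketch. The folder name my-r-work is made up, and the folder is created inside R's temporary directory only so that the lines run anywhere; in real use you would pass your own path, such as one beginning with the tilde.

```r
# Hypothetical folder, created in a temporary location just for this sketch.
myDir <- file.path(tempdir(), "my-r-work")
dir.create(myDir, showWarnings = FALSE)

oldDir    <- setwd(myDir)   # setwd() returns the previous working directory
insideDir <- getwd()        # confirm where R now thinks it is
print(insideDir)

path.expand("~")            # what the tilde stands for on this machine

setwd(oldDir)               # go back to where we started
```

Keeping the value returned by setwd() is a convenient way to return to the earlier directory afterwards.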
To make sure you’ve got things under control, save this, i.e., mouse to “File/New File/R Script”, copy the one line of code we’ve just executed into the newly-appeared top left-hand window in RStudio (that new window will typically be called “Untitled1”), and then choose “File/Save As …” in RStudio. I’m saving this as the R script e1.R. This will be a first R program, containing just the one setwd() line. I know I’m going to want to be adding to this R program to do my analysis. But for now I just want to make sure, if I can help it, that my work doesn’t go away unexpectedly.
If you look at this working directory now on your machine, you’ll see it has at least the file e1.R, or whatever you decided to call your R script. You can take that peek using Windows Explorer or a bash Terminal or the Finder… whatever. You can also get this same information from within RStudio by keying a directory-listing command into the Console and hitting return. You should see a listing of the directory that you’ve setwd’d to, including at least the file e1.R (and whatever else might be there). So, perhaps something like this:
"e1.R" "WB-GDP-cleaned-DQ.xls"
(WB-GDP-cleaned-DQ.xls is the Excel spreadsheet with which I happen to be working.)
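One way to get such a listing is the base R function list.files(); whether this is the exact command used for the listing above is an assumption. The sketch below builds a scratch folder (a made-up name) containing an empty e1.R so that it runs anywhere.

```r
# Scratch folder with an empty script in it, purely for illustration.
scratchDir <- file.path(tempdir(), "listing-demo")
dir.create(scratchDir, showWarnings = FALSE)
file.create(file.path(scratchDir, "e1.R"))

# List what is in that folder; the result is a character vector of file names.
listing <- list.files(scratchDir)
print(listing)
```

Called with no arguments, list.files() lists the current working directory, i.e., wherever setwd() last pointed.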
You can execute R code by keying it directly into the RStudio console or, more typically, by opening an R script (such as e1.R, which is just plain text that you can edit in any text editor) from RStudio, making sure your RStudio focus is on that R Script panel, and then going “Code/Run Region/Run All”. When you do the latter, you’ll see the RStudio Console automatically stepping through your code.
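A script can also be run from the Console with the base R function source(), which is roughly what the menu route does for you. The sketch below writes a tiny two-line script (a made-up file name and made-up contents) into a temporary folder and runs it.

```r
# Write a small throwaway script; the file name and contents are illustrative.
scriptPath <- file.path(tempdir(), "e1-demo.R")
writeLines(c("demoTotal <- 1 + 1",
             "print(demoTotal)"),
           scriptPath)

# Run every line of the script; echo = TRUE shows each line as it is executed.
source(scriptPath, echo = TRUE)
```

With your own work, the call would simply be source("e1.R") once the working directory is set to the folder holding that script.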
Now go get a drink, stretch your legs, do some taiji.
DATA IN DATAFRAMES
The key object that we will use to hold data is what R calls a dataframe. A dataframe is a 2-dimensional array but, like most modern things on computers, a dataframe can hold text, numbers, items of logic, calendar dates (i.e., not just as numbers but recognising the structure of quarters, months, and days), and possibly even more complicated objects as its entries. Different columns can hold different kinds of entries, so these types mix freely within one dataframe. (Matrices of just numbers are very last-century.)
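To make this concrete, here is a small hand-built dataframe. All names and values in it are made up for illustration and are not taken from any real dataset.

```r
# Each column has its own type: text, numbers, logic, and calendar dates.
example.DF <- data.frame(
  label    = c("A", "B", "C"),
  amount   = c(1.5, 2.25, 3),
  flagged  = c(FALSE, FALSE, TRUE),
  observed = as.Date(c("2014-01-01", "2014-04-01", "2014-07-01")),
  stringsAsFactors = FALSE   # keep the text column as plain text
)

str(example.DF)              # compact summary: one line per column, with its type
sapply(example.DF, class)    # the type of each column
months(example.DF$observed)  # dates know about months, not just day counts
```

The str() call is a handy first look at any dataframe, including ones read in from a spreadsheet.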
Among other reasons the dataframe is key for our work is that a dataframe is what R builds when it reads in an Excel spreadsheet. So, for instance, if we have a spreadsheet 2014.01-Poverty+Growth-DQ.xlsx in the folder ~/Dropbox/1/j/data/Global-Distribution, we can read the data in it, in its different sheets, into different dataframes:
library(gdata)   # provides read.xls()
## Warning: package 'gdata' was built under R version 3.1.1
setwd("~/Dropbox/1/j/data/Global-Distribution/")
theDataXLS      <- "2014.01-Poverty+Growth-DQ.xlsx"                  # the spreadsheet file
Country.Info.DF <- read.xls(theDataXLS, sheet="Country-Info")       # one dataframe per sheet
World.Pov.DF    <- read.xls(theDataXLS, sheet="WB-Pov")
World.GNI.DF    <- read.xls(theDataXLS, sheet="WB-GNI-pc")
From the code just run and the earlier chunk, you’ll notice that I can use spreadsheets saved in either “.xls” or “.xlsx” formats: the code in gdata takes into account which of the two I happen to be using when I call read.xls().
Unlike, say, computer systems that need a specific file extension to tell them what kind of a file is being used, R doesn’t care what I name the objects I create within it. Nonetheless, although of course you don’t have to do this, I like putting “.DF” at the ends of the names of my dataframes as doing so helps me remember what they are.
Also, because I often need to read the R code I’ve written and to understand its logic quickly, I’m a little obsessive about how my code is formatted. So, in the preceding I’ve lined up the assignment <- symbols. Again, not everyone needs to do this and most of the time R simply doesn’t care how its code looks.
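If a spreadsheet read like the one above fails, it can help to check the pieces it relies on: Perl, the gdata library, and the file itself. The following is a rough sketch; readyToReadXLS is a hypothetical helper name introduced here, built only from standard R functions.

```r
# Hypothetical helper: report whether the pieces read.xls() relies on are in place.
readyToReadXLS <- function(xlsFile) {
  c(perlFound  = unname(nzchar(Sys.which("perl"))),            # is Perl visible to R?
    gdataFound = requireNamespace("gdata", quietly = TRUE),    # is gdata installed?
    fileFound  = file.exists(xlsFile))                         # is the file where R is looking?
}

checks <- readyToReadXLS("2014.01-Poverty+Growth-DQ.xlsx")
print(checks)
```

Any FALSE in the result points to the step worth revisiting: installing Perl, installing gdata, or setting the working directory.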
This document has provided brief notes as a quick guide for someone starting to use R (or coming back to it).
Modern computing platforms and R allow multiple pathways to achieve any given end goal. The setup in this document prepares a system that ends up looking like mine; others, however, might prefer a different organizational structure for their work.
- R Cheatsheet
- R Programming Wikibook
- Chang, Winston. Cookbook for R
- Crawley, Michael. 2012. The R Book or its earlier first edition
- Kabacoff, Rob. 2012. Quick-R
- The R Manuals
- R Tutorial: An R Introduction to Statistics
- R Tutorial: Introduction
- R for Econometrics