Global Economy Data
D. Quah
Economics and International Development, LSE
November 2014 (Revised January 2016)
Unless you constantly work with just a single dataset, one of the more significant bottlenecks in empirical projects is getting the data into a form where you can usefully query them. This writeup describes the data I regularly use, manipulate, and need to collect (although not regularly enough that I remember every detail about them – hence the need for this document).
Here I provide R code snippets to get from a number of originating databases – the Maddison Project; the World Bank; the Penn World Tables; the IMF generally, but focusing on the IMF World Economic Outlook; Polity IV; inequality datasets; author-provided data from key published papers; and so on – to where I can then ask the questions I’m interested in. My projects then pretty much always start with my returning first to this writeup and just copying R code out of it.
By its nature this writeup is never finished. When I encounter interesting data that are not one-offs, but that I will be using consistently, their management then appears here.
Obviously, not everyone will work the way I do, and not everyone will want to use these same data. But I hope the combination of R, knitr, and ideas from literate programming might help others similarly concerned with presenting their empirical work in a way that is easier for others to replicate and reproduce.
Setting up
I will consider these datasets in turn in the following sections. First, however, I load for subsequent use a collection of handy R libraries. If you just want to know about the data, you can skip the remainder of this section and just head on over to the section describing the dataset you want to know more about.
library(gdata)
library(ggplot2)
In the sequel, some sections might needlessly re-load these libraries. Doing so is harmless, and keeping the code there might help users who will just cut and paste from here into their own projects.
For the aesthetics I find useful in charts, I set up graphics themes, one for each different kind of plot:
myTStheme <- theme_classic() +
theme(
plot.title=element_text(size=rel(1.5)),
legend.title=element_text(size=rel(1.5)),
legend.text=element_text(size=rel(1.5)),
legend.position=c(1,0), legend.justification=c(1,0),
axis.text=element_text(size=rel(1.5)),
axis.title=element_text(size=rel(1.5)),
axis.title.x=element_blank(),
axis.title.y=element_blank()
)
Then I have collected data and R code routines into their own directories so I can re-use them conveniently across different projects:
myDataDir <- file.path("~", "Dropbox", "1", "j", "Data")
myRoutinesDir <- file.path("~", "Dropbox", "1", "j", "Code", "Routines")
Where you put your own data and routines will differ from these, so just set the file.path values to what you want. Alternatively (which is how I do it), you can put these assignments in your .Rprofile, so they need not appear in this file but are executed whenever you invoke R.
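For instance, a minimal .Rprofile fragment along these lines (the directory layout shown is mine; substitute your own):

```r
# ~/.Rprofile -- run at the start of every R session,
# so these path variables are always available.
myDataDir     <- file.path("~", "Dropbox", "1", "j", "Data")
myRoutinesDir <- file.path("~", "Dropbox", "1", "j", "Code", "Routines")
```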
The Maddison Project
The Maddison Project data comprise the now-standard empirical estimates to study economic growth over the very long run. These data are provided as an Excel spreadsheet on the Project’s website; since December 2015 they have also been made available as an R library. Further below, in presenting manipulations on Penn World Tables data, I show how to download and use data that have been packaged up in R libraries more generally.
With the R library, those who use R can proceed directly to data analysis. The Excel spreadsheet, however, might well remain the standard source for many others, including those who want to see the data directly. Unfortunately, the spreadsheet presents the information in a way that is more useful visually than for data manipulation and analysis. To use it in data analysis, one needs to go through something like the following to put the data into usable form.
library(gdata)
library(reshape2)
library(stringr)
theMaddisonXLS <- file.path(myDataDir, "Maddison-Project", "mpd_2013-01.xlsx")
hold.DF <- read.xls(theMaddisonXLS, skip=1, stringsAsFactors=FALSE)
colNames <- as.character(hold.DF[1, ])    # first data row carries the economy names
colNames[1] <- "Year"
new.DF <- hold.DF[-1, ]                   # drop that names row from the data proper
names(new.DF) <- colNames
MaddP.DF <- melt(new.DF, id.vars="Year")  # wide to long: one row per economy-year
rm(new.DF, hold.DF, colNames)
MaddP.DF$value <- as.numeric(gsub(",", "", MaddP.DF$value))  # strip thousands commas
MaddP.DF <- MaddP.DF[!is.na(MaddP.DF$value), ]
names(MaddP.DF)[2] <- "Economy"
names(MaddP.DF)[3] <- "perCapitaGDP"
MaddP.DF$logPerCapGDP <- log(MaddP.DF$perCapitaGDP)
MaddP.DF$Economy <- str_trim(MaddP.DF$Economy, side="both")  # remove stray whitespace
detach("package:stringr")
detach("package:reshape2")
detach("package:gdata")
(Why so elaborate? Check that if we didn’t trim whitespace with str_trim(), we wouldn’t get a match for "Sweden" in the codechunk to follow. Astounding but true. Instead, we would have had to match "Sweden\32", i.e., with the invisible blank at the end of the name. Similarly "Denmark\32", "Finland\32", "Germany\32", and so on; but, no, not "France". That last one has been entered without a trailing blank. Also, if we hadn’t executed the gsub() to remove all ","s, then entries such as "1,218" would be unrecognizable as numbers and would instead appear as just NA. Taking out the ","s and then converting to numbers with as.numeric() are the operations needed to make these data manipulable for statistical analysis. Yes, spreadsheets and casual hand-editing are excellent for seeing data directly, but they’re dangerous things in computer software.)
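A toy illustration (made-up strings, not the actual spreadsheet contents) of what the str_trim() and gsub()/as.numeric() steps are doing:

```r
library(stringr)

str_trim("Sweden ", side = "both")   # "Sweden": the invisible trailing blank is gone
as.numeric("1,218")                  # NA: the thousands comma defeats the conversion
as.numeric(gsub(",", "", "1,218"))   # 1218: strip the commas first, then convert
```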
Since I knew I would want to use these data repeatedly and I didn’t want to keep running the codechunk above, I saved my own copy of the Maddison Project GDP data in R’s native format:
myMaddP.file <- file.path(myDataDir, "Maddison-Project", "maddp-201301-DQ.rds")
saveRDS(MaddP.DF, file=myMaddP.file)
(This is only for my personal use so I’m not packaging it up as a library. But of course if you do want the R library version, again you can get that for yourself.)
When I now need to use these data I no longer need to do all the stripping and cleaning after (slowly) reading a spreadsheet as above. Instead I just go:
MaddP.DF <- readRDS(myMaddP.file)
rm(myMaddP.file)
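This caching pattern can be wrapped so the slow spreadsheet path runs only when the .rds file is missing; a sketch:

```r
myMaddP.file <- file.path(myDataDir, "Maddison-Project", "maddp-201301-DQ.rds")
if (file.exists(myMaddP.file)) {
  MaddP.DF <- readRDS(myMaddP.file)   # fast path: read R's native format
} else {
  # slow path: run the spreadsheet-cleaning codechunk above, then cache the result
  saveRDS(MaddP.DF, file = myMaddP.file)
}
```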
so that, for instance, to get growth rates:
MaddP.DF$annGrowth <- NA
for (anEconomy in unique(MaddP.DF$Economy)) {
theYears <- MaddP.DF[MaddP.DF$Economy==anEconomy, ]$Year
logPCGDP <- MaddP.DF[MaddP.DF$Economy==anEconomy, ]$logPerCapGDP
theAnnGr <- rep(NA, length(logPCGDP))
for (jLoop in 2:length(theAnnGr)) {
if (theYears[jLoop-1] == theYears[jLoop]-1) {
theAnnGr[jLoop] <- logPCGDP[jLoop] - logPCGDP[jLoop-1]
}
}
# Change to percent and then move into dataframe
MaddP.DF[MaddP.DF$Economy==anEconomy, ]$annGrowth <- 100.0 * theAnnGr
rm(theAnnGr)
}
(For those who know R: notice I can’t vectorise the inner loop using, say, diff, as I need to check that the data are available sequentially in time.)
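That said, the explicit inner loop can be avoided while keeping the consecutive-year check, by combining diff() with a mask; a sketch, assuming rows are sorted by Year within each economy (as they are here):

```r
for (anEconomy in unique(MaddP.DF$Economy)) {
  idx <- MaddP.DF$Economy == anEconomy
  gr  <- c(NA, 100.0 * diff(MaddP.DF[idx, ]$logPerCapGDP))  # log-difference growth, in percent
  ok  <- c(NA, diff(MaddP.DF[idx, ]$Year) == 1)             # TRUE only when years are consecutive
  MaddP.DF[idx, ]$annGrowth <- ifelse(ok, gr, NA)           # mask out growth across gaps
}
```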
What economies are we working with here?
unique(MaddP.DF$Economy)
## [1] "Austria"
## [2] "Belgium"
## [3] "Denmark"
## [4] "Finland"
## [5] "France"
## [6] "Germany"
## [7] "(Centre- North) Italy"
## [8] "Holland/ Netherlands"
## [9] "Norway"
## [10] "Sweden"
## [11] "Switzerland"
## [12] "England/GB/UK"
## [13] "12 W. Europe"
## [14] "Ireland"
## [15] "Greece"
## [16] "Portugal"
## [17] "Spain"
## [18] "14 small WEC"
## [19] "30 W. Europe"
## [20] "Australia"
## [21] "N. Zealand"
## [22] "Canada"
## [23] "USA"
## [24] "W. Offshoots"
## [25] "Albania"
## [26] "Bulgaria"
## [27] "Czecho-slovakia"
## [28] "Hungary"
## [29] "Poland"
## [30] "Romania"
## [31] "Yugoslavia"
## [32] "7 E. Europe"
## [33] "Bosnia"
## [34] "Croatia"
## [35] "Macedonia"
## [36] "Slovenia"
## [37] "Montenegro"
## [38] "Serbia"
## [39] "Kosovo"
## [40] "F. Yugoslavia"
## [41] "Czech Rep."
## [42] "Slovakia"
## [43] "F. Czecho-slovakia"
## [44] "Armenia"
## [45] "Azerbaijan"
## [46] "Belarus"
## [47] "Estonia"
## [48] "Georgia"
## [49] "Kazakhstan"
## [50] "Kyrgyzstan"
## [51] "Latvia"
## [52] "Lithuania"
## [53] "Moldova"
## [54] "Russia"
## [55] "Tajikistan"
## [56] "Turk-menistan"
## [57] "Ukraine"
## [58] "Uzbekistan"
## [59] "F. USSR"
## [60] "Argentina"
## [61] "Brazil"
## [62] "Chile"
## [63] "Colombia"
## [64] "Mexico"
## [65] "Peru"
## [66] "Uruguay"
## [67] "Venezuela"
## [68] "8 L. America"
## [69] "Bolivia"
## [70] "Costa Rica"
## [71] "Cuba"
## [72] "Dominican Rep."
## [73] "Ecuador"
## [74] "El Salvador"
## [75] "Guatemala"
## [76] "Haïti"
## [77] "Honduras"
## [78] "Jamaica"
## [79] "Nicaragua"
## [80] "Panama"
## [81] "Paraguay"
## [82] "Puerto Rico"
## [83] "T. & Tobago"
## [84] "15 L. America"
## [85] "21 Caribbean"
## [86] "L. America"
## [87] "China"
## [88] "India"
## [89] "Indonesia (Java before 1880)"
## [90] "Japan"
## [91] "Philippines"
## [92] "S. Korea"
## [93] "Thailand"
## [94] "Taiwan"
## [95] "Bangladesh"
## [96] "Burma"
## [97] "Hong Kong"
## [98] "Malaysia"
## [99] "Nepal"
## [100] "Pakistan"
## [101] "Singapore"
## [102] "Sri Lanka"
## [103] "16 E. Asia"
## [104] "Afghanistan"
## [105] "Cambodia"
## [106] "Laos"
## [107] "Mongolia"
## [108] "North Korea"
## [109] "Vietnam"
## [110] "24 Sm. E. Asia"
## [111] "30 E. Asia"
## [112] "Bahrain"
## [113] "Iran"
## [114] "Iraq"
## [115] "Israel"
## [116] "Jordan"
## [117] "Kuwait"
## [118] "Lebanon"
## [119] "Oman"
## [120] "Qatar"
## [121] "Saudi Arabia"
## [122] "Syria"
## [123] ""
## [124] "UAE"
## [125] "Yemen"
## [126] "W. Bank & Gaza"
## [127] "15 W. Asia"
## [128] "Asia"
## [129] "Algeria"
## [130] "Angola"
## [131] "Benin"
## [132] "Botswana"
## [133] "Burkina Faso"
## [134] "Burundi"
## [135] "Cameroon"
## [136] "Cape Verde"
## [137] "Centr. Afr. Rep."
## [138] "Chad"
## [139] "Comoro Islands"
## [140] "Congo 'Brazzaville'"
## [141] "Côte d'Ivoire"
## [142] "Djibouti"
## [143] "Egypt"
## [144] "Equatorial Guinea"
## [145] "Eritrea & Ethiopia"
## [146] "Gabon"
## [147] "Gambia"
## [148] "Ghana"
## [149] "Guinea"
## [150] "Guinea Bissau"
## [151] "Kenya"
## [152] "Lesotho"
## [153] "Liberia"
## [154] "Libya"
## [155] "Madagascar"
## [156] "Malawi"
## [157] "Mali"
## [158] "Mauritania"
## [159] "Mauritius"
## [160] "Morocco"
## [161] "Mozambique"
## [162] "Namibia"
## [163] "Niger"
## [164] "Nigeria"
## [165] "Rwanda"
## [166] "Sao Tomé & Principe"
## [167] "Senegal"
## [168] "Seychelles"
## [169] "Sierra Leone"
## [170] "Somalia"
## [171] "Cape Colony/ South Africa"
## [172] "Sudan"
## [173] "Swaziland"
## [174] "Tanzania"
## [175] "Togo"
## [176] "Tunisia"
## [177] "Uganda"
## [178] "Congo-Kinshasa"
## [179] "Zambia"
## [180] "Zimbabwe"
## [181] "3 Small Afr."
## [182] "Total Africa"
## [183] "Total World"
A bit of a mess, isn’t it? And that’s even after we’ve already done a str_trim().
I am in awe of the amount of work that goes into constructing these Maddison Project data. These researchers have my greatest respect. But the mess above is what happens when authors use names they make up, like “England/GB/UK”, “Holland/ Netherlands”, “(Centre- North) Italy”, “14 small WEC”, or “3 Small Afr.”; or when they insert peculiar characters like “&” or random invisible whitespace.
(Without ISO standardisation, of course, it’s inevitable we have to make things up. Still.)
It’s bad enough when we have to guess what these names mean in a spreadsheet; trying to write computer code to select things systematically is almost impossible.
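One workable defence is a small, hand-maintained lookup table from the Maddison names to ISO codes, so downstream code selects by code rather than by the made-up names. An illustrative fragment (not a complete mapping):

```r
maddisonToISO <- data.frame(
  Economy = c("USA", "England/GB/UK", "Holland/ Netherlands", "(Centre- North) Italy"),
  iso3    = c("USA", "GBR", "NLD", "ITA"),
  stringsAsFactors = FALSE
)
# Select rows of MaddP.DF by ISO code instead of by the raw names:
wanted <- maddisonToISO[maddisonToISO$iso3 %in% c("USA", "GBR"), "Economy"]
subset(MaddP.DF, Economy %in% wanted)
```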
But having done our best to clean these data, take a look at some selected growth experiences.
In the Appendix I set up R code to do this conveniently; that code will be re-used subsequently as well.
Using that code now, check out these four economies from 1870 to 2010:
theBegSmpl <- 1870
theEndSmpl <- 2010
theEconomies <- c("USA", "England/GB/UK", "France", "Sweden")
theSeries <- "logPerCapGDP"
thisTitle <- "log Per Capita GDP in constant 1990 Int. GK$"
source(file=file.path(myRoutinesDir, "multplot-maddp.R"), local=TRUE, echo=TRUE)
##
## > getSeries <- c("Year", "Economy", theSeries)
##
## > theAES <- aes_string(x = "Year", y = theSeries, group = "Economy",
## + colour = "Economy")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy %in% theEconomies) &
## + (MaddP.DF$Year >= theBegSmpl) & (MaddP.DF$Year <= theEndSmpl),
## + getSeries]
##
## > ggplot(data = this.DF, theAES) + geom_line(size = 2) +
## + myTStheme + ggtitle(thisTitle)
##
## > rm(this.DF, theAES, getSeries)
rm(thisTitle, theSeries, theEconomies, theEndSmpl, theBegSmpl)
To structure this information more clearly, I eyeball an extrapolated trend in these per capita income data. As previously, I provide in the Appendix the R code to do this. Here, I just call that code after setting up the things I want to see.
Begin with US data:
theBegFit <- 1870
theEndFit <- 1980
theEndSmp <- 2010
theEconomy <- "USA"
source(file=file.path(myRoutinesDir, "eyetrend-maddp.R"), local=TRUE, echo=TRUE)
##
## > olsFIT <- lm(logPerCapGDP ~ Year, data = MaddP.DF[(MaddP.DF$Economy ==
## + theEconomy) & (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <=
## + .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": log Per Capita GDP in constant 1990 Int. GK$")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy == theEconomy) &
## + (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <= theEndSmp),
## + c("Year", "perCapi ..." ... [TRUNCATED]
##
## > ggplot(data = this.DF, aes(x = Year, y = logPerCapGDP)) +
## + geom_line(size = 2) + geom_segment(data = this.DF, aes(x = theBegFit,
## + xend = .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": Per Capita GDP in constant 1990 Int. GK$")
##
## > expTrendFitted <- function(x) {
## + ifelse(x >= theBegFit & x <= theEndFit, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > expTrendExtrap <- function(x) {
## + ifelse(x >= theEndFit + 1 & x <= theEndSmp, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > ggplot(data = this.DF, aes(x = Year, y = perCapitaGDP)) +
## + geom_line(size = 2) + stat_function(fun = expTrendFitted,
## + linetype = 1, colo .... [TRUNCATED]
##
## > rm(olsFIT, expTrendFitted, expTrendExtrap)
rm(theBegFit, theEndFit, theEndSmp, theEconomy)
Presented here are both the fitted linear trend for the log of US per capita GDP and the resulting exponential trend for the original series. The solid line is the fitted trend; the dashed line, the extrapolation.
Remarkably, a smooth exponential trend, fitted from 1870 through as early as 1980, gives a reasonable description of the out-of-sample post-1980 behaviour of US per capita GDP.
Do the same for China but now beginning in 1950 as it’s from then that the Maddison Project data provide a usefully uninterrupted sequence:
theBegFit <- 1950
theEndFit <- 1980
theEndSmp <- 2010
theEconomy <- "China"
source(file=file.path(myRoutinesDir, "eyetrend-maddp.R"), local=TRUE, echo=TRUE)
##
## > olsFIT <- lm(logPerCapGDP ~ Year, data = MaddP.DF[(MaddP.DF$Economy ==
## + theEconomy) & (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <=
## + .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": log Per Capita GDP in constant 1990 Int. GK$")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy == theEconomy) &
## + (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <= theEndSmp),
## + c("Year", "perCapi ..." ... [TRUNCATED]
##
## > ggplot(data = this.DF, aes(x = Year, y = logPerCapGDP)) +
## + geom_line(size = 2) + geom_segment(data = this.DF, aes(x = theBegFit,
## + xend = .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": Per Capita GDP in constant 1990 Int. GK$")
##
## > expTrendFitted <- function(x) {
## + ifelse(x >= theBegFit & x <= theEndFit, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > expTrendExtrap <- function(x) {
## + ifelse(x >= theEndFit + 1 & x <= theEndSmp, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > ggplot(data = this.DF, aes(x = Year, y = perCapitaGDP)) +
## + geom_line(size = 2) + stat_function(fun = expTrendFitted,
## + linetype = 1, colo .... [TRUNCATED]
##
## > rm(olsFIT, expTrendFitted, expTrendExtrap)
rm(theEconomy, theEndSmp, theEndFit, theBegFit)
In stark contrast to the US, China’s per capita GDP follows post-1980 a completely different trajectory from its pre-1980 history. This, of course, is no surprise to anyone even vaguely aware of global economic developments. The value of the calculation is to quantify how large the change is that has occurred: if anyone thought growth trends were slow and difficult to change, China provides a striking and positive counter-example.
Finally, for comparison, let’s do this for the UK:
theBegFit <- 1950
theEndFit <- 1980
theEndSmp <- 2010
theEconomy <- "England/GB/UK"
source(file=file.path(myRoutinesDir, "eyetrend-maddp.R"), local=TRUE, echo=TRUE)
##
## > olsFIT <- lm(logPerCapGDP ~ Year, data = MaddP.DF[(MaddP.DF$Economy ==
## + theEconomy) & (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <=
## + .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": log Per Capita GDP in constant 1990 Int. GK$")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy == theEconomy) &
## + (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <= theEndSmp),
## + c("Year", "perCapi ..." ... [TRUNCATED]
##
## > ggplot(data = this.DF, aes(x = Year, y = logPerCapGDP)) +
## + geom_line(size = 2) + geom_segment(data = this.DF, aes(x = theBegFit,
## + xend = .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": Per Capita GDP in constant 1990 Int. GK$")
##
## > expTrendFitted <- function(x) {
## + ifelse(x >= theBegFit & x <= theEndFit, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > expTrendExtrap <- function(x) {
## + ifelse(x >= theEndFit + 1 & x <= theEndSmp, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > ggplot(data = this.DF, aes(x = Year, y = perCapitaGDP)) +
## + geom_line(size = 2) + stat_function(fun = expTrendFitted,
## + linetype = 1, colo .... [TRUNCATED]
##
## > rm(olsFIT, expTrendFitted, expTrendExtrap)
rm(theEconomy, theEndSmp, theEndFit, theBegFit)
Get a final sense of the difference here by putting all these on the same graph.
theBegSmpl <- 1950
theEndSmpl <- 2010
theEconomies <- c("USA", "England/GB/UK", "China")
theSeries <- "logPerCapGDP"
thisTitle <- "log Per Capita GDP in constant 1990 Int. GK$"
source(file=file.path(myRoutinesDir, "multplot-maddp.R"), local=TRUE, echo=TRUE)
##
## > getSeries <- c("Year", "Economy", theSeries)
##
## > theAES <- aes_string(x = "Year", y = theSeries, group = "Economy",
## + colour = "Economy")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy %in% theEconomies) &
## + (MaddP.DF$Year >= theBegSmpl) & (MaddP.DF$Year <= theEndSmpl),
## + getSeries]
##
## > ggplot(data = this.DF, theAES) + geom_line(size = 2) +
## + myTStheme + ggtitle(thisTitle)
##
## > rm(this.DF, theAES, getSeries)
rm(thisTitle, theSeries, theEconomies, theEndSmpl, theBegSmpl)
A more useful perspective on the size of these cross-country differences comes from the levels of the series themselves, not their logs.
theBegSmpl <- 1950
theEndSmpl <- 2010
theEconomies <- c("USA", "England/GB/UK", "China")
theSeries <- "perCapitaGDP"
thisTitle <- "Per Capita GDP in constant 1990 Int. GK$"
source(file=file.path(myRoutinesDir, "multplot-maddp.R"), local=TRUE, echo=TRUE)
##
## > getSeries <- c("Year", "Economy", theSeries)
##
## > theAES <- aes_string(x = "Year", y = theSeries, group = "Economy",
## + colour = "Economy")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy %in% theEconomies) &
## + (MaddP.DF$Year >= theBegSmpl) & (MaddP.DF$Year <= theEndSmpl),
## + getSeries]
##
## > ggplot(data = this.DF, theAES) + geom_line(size = 2) +
## + myTStheme + ggtitle(thisTitle)
##
## > rm(this.DF, theAES, getSeries)
rm(thisTitle, theSeries, theEconomies, theEndSmpl, theBegSmpl)
Remember, however, that this is for per capita GDP and obviously therefore does not take into account the sizes of the different populations.
World Development Indicators
The World Bank’s World Development Indicators (WDI) are available as one large Excel ZIP file, as one large CSV ZIP file, and through online query.
[To include here – R Code I had used for my DV409 course, for students in International Development at LSE]
Penn World Tables
The Penn World Tables provide annual economic data on incomes, outputs, inputs, and productivity across more than 150 economies beginning in 1950. The project, begun by Robert Summers, Alan Heston, and Irving Kravis, has now been taken over by a worldwide team of researchers; Feenstra, Robert C., Robert Inklaar, and Marcel P. Timmer (2013), “The Next Generation of the Penn World Table”, currently provide the regular updates. The project name, however, obviously still shows where it originated.
Penn World Tables (PWT) version 8.0 data are available as spreadsheets created from dynamic queries on the site. The results from such queries also contain extensive descriptions on the assumptions and procedures used to construct these data. So we could always craft our requests to that site directly.
As an alternative, R users from around the world have assembled an R dataset collecting together all the PWT data, and have placed it on R servers. So we can instead just use that directly. In this approach – as with much of modern computing – data are no different from executable library code, so we can install our own private copy of the PWT data as an R package. As with any R library, we only need to install the PWT data once on whatever machine we want to use.
Key in or select pwt8 in RStudio’s package installer – or, equivalently, run install.packages("pwt8") at the console – and let R install the data.
The pwt8 manual at the site gives a compact description and listing of what’s in it. By loading this dataset, exactly as we would an R library of code that we might run, we immediately have access to all the PWT8.0 variables:
library("pwt8")
data("pwt8.0")
The R documentation describes pwt8.0 as a dataframe of 10,354 observations on 39 variables. To understand this, remember that in R terminology a dataframe is a 2-dimensional array. However, like most modern objects on computers, a dataframe can hold text, numbers, logical values, and possibly even more complicated objects as its entries, freely intermingled. Each variable has 10,354 observations (167 economies over 62 years): many of these might, of course, be NA (not available), but in principle we have data on 167 economies for the 62 years 1950 through 2011.
Create our own dataframe and put in it, among other information, per capita GDP (measured in thousands of constant 2005 PPP-adjusted US$):
ourOwn.DF <- data.frame(country=pwt8.0$country,
                        isocode=pwt8.0$isocode,
                        year=pwt8.0$year)
ourOwn.DF$pc.GDP <- (pwt8.0$rgdpe / pwt8.0$pop)/1000.0
Our dataframe ourOwn.DF now contains in its first four columns the variables country, isocode, year, and per capita GDP pc.GDP. We can see the beginnings of those columns by asking for the dataframe’s structure,
str(ourOwn.DF)
## 'data.frame': 10354 obs. of 4 variables:
## $ country: Factor w/ 167 levels "Angola","Albania",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ isocode: Factor w/ 167 levels "AGO","ALB","ARG",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 ...
## $ pc.GDP : num NA NA NA NA NA NA NA NA NA NA ...
R doesn’t try to print out everything, just enough of the start of those columns that we know things are as we expect.
PWT labels economies with ISO codes (ISO 3166-1 alpha-3) – unfortunately, different from the World Bank’s country codes. We can see what these isocodes are by:
unique(ourOwn.DF$isocode)
## [1] AGO ALB ARG ARM ATG AUS AUT AZE BDI BEL BEN BFA BGD BGR BHR BHS BIH
## [18] BLR BLZ BMU BOL BRA BRB BRN BTN BWA CAF CAN CHE CHL CHN CIV CMR COD
## [35] COG COL COM CPV CRI CYP CZE DEU DJI DMA DNK DOM ECU EGY ESP EST ETH
## [52] FIN FJI FRA GAB GBR GEO GHA GIN GMB GNB GNQ GRC GRD GTM HKG HND HRV
## [69] HUN IDN IND IRL IRN IRQ ISL ISR ITA JAM JOR JPN KAZ KEN KGZ KHM KNA
## [86] KOR KWT LAO LBN LBR LCA LKA LSO LTU LUX LVA MAC MAR MDA MDG MDV MEX
## [103] MKD MLI MLT MNE MNG MOZ MRT MUS MWI MYS NAM NER NGA NLD NOR NPL NZL
## [120] OMN PAK PAN PER PHL POL PRT PRY QAT ROU RUS RWA SAU SDN SEN SGP SLE
## [137] SLV SRB STP SUR SVK SVN SWE SWZ SYR TCD TGO THA TJK TKM TTO TUN TUR
## [154] TWN TZA UGA UKR URY USA UZB VCT VEN VNM YEM ZAF ZMB ZWE
## 167 Levels: AGO ALB ARG ARM ATG AUS AUT AZE BDI BEL BEN BFA BGD BGR ... ZWE
so that, explicitly, countries and isocodes can be seen by (the relatively obscure):
by(ourOwn.DF, ourOwn.DF$isocode, FUN = function(a.DF) {
  a.DF[1, c("country", "isocode")]
})
or, perhaps more transparently, since there is only ever one “2011” observation for each economy:
subset(ourOwn.DF, subset=(ourOwn.DF$year == "2011"), c("isocode", "country"))
## isocode country
## 62 AGO Angola
## 124 ALB Albania
## 186 ARG Argentina
## 248 ARM Armenia
## 310 ATG Antigua and Barbuda
## 372 AUS Australia
## 434 AUT Austria
## 496 AZE Azerbaijan
## 558 BDI Burundi
## 620 BEL Belgium
## 682 BEN Benin
## 744 BFA Burkina Faso
## 806 BGD Bangladesh
## 868 BGR Bulgaria
## 930 BHR Bahrain
## 992 BHS Bahamas
## 1054 BIH Bosnia and Herzegovina
## 1116 BLR Belarus
## 1178 BLZ Belize
## 1240 BMU Bermuda
## 1302 BOL Bolivia
## 1364 BRA Brazil
## 1426 BRB Barbados
## 1488 BRN Brunei
## 1550 BTN Bhutan
## 1612 BWA Botswana
## 1674 CAF Central African Republic
## 1736 CAN Canada
## 1798 CHE Switzerland
## 1860 CHL Chile
## 1922 CHN China
## 1984 CIV Cote d'Ivoire
## 2046 CMR Cameroon
## 2108 COD Congo, Democratic Republic
## 2170 COG Congo, Republic of
## 2232 COL Colombia
## 2294 COM Comoros
## 2356 CPV Cape Verde
## 2418 CRI Costa Rica
## 2480 CYP Cyprus
## 2542 CZE Czech Republic
## 2604 DEU Germany
## 2666 DJI Djibouti
## 2728 DMA Dominica
## 2790 DNK Denmark
## 2852 DOM Dominican Republic
## 2914 ECU Ecuador
## 2976 EGY Egypt
## 3038 ESP Spain
## 3100 EST Estonia
## 3162 ETH Ethiopia
## 3224 FIN Finland
## 3286 FJI Fiji
## 3348 FRA France
## 3410 GAB Gabon
## 3472 GBR United Kingdom
## 3534 GEO Georgia
## 3596 GHA Ghana
## 3658 GIN Guinea
## 3720 GMB Gambia, The
## 3782 GNB Guinea-Bissau
## 3844 GNQ Equatorial Guinea
## 3906 GRC Greece
## 3968 GRD Grenada
## 4030 GTM Guatemala
## 4092 HKG Hong Kong
## 4154 HND Honduras
## 4216 HRV Croatia
## 4278 HUN Hungary
## 4340 IDN Indonesia
## 4402 IND India
## 4464 IRL Ireland
## 4526 IRN Iran
## 4588 IRQ Iraq
## 4650 ISL Iceland
## 4712 ISR Israel
## 4774 ITA Italy
## 4836 JAM Jamaica
## 4898 JOR Jordan
## 4960 JPN Japan
## 5022 KAZ Kazakhstan
## 5084 KEN Kenya
## 5146 KGZ Kyrgyzstan
## 5208 KHM Cambodia
## 5270 KNA St. Kitts & Nevis
## 5332 KOR Korea, Republic of
## 5394 KWT Kuwait
## 5456 LAO Laos
## 5518 LBN Lebanon
## 5580 LBR Liberia
## 5642 LCA St. Lucia
## 5704 LKA Sri Lanka
## 5766 LSO Lesotho
## 5828 LTU Lithuania
## 5890 LUX Luxembourg
## 5952 LVA Latvia
## 6014 MAC Macao
## 6076 MAR Morocco
## 6138 MDA Moldova
## 6200 MDG Madagascar
## 6262 MDV Maldives
## 6324 MEX Mexico
## 6386 MKD Macedonia
## 6448 MLI Mali
## 6510 MLT Malta
## 6572 MNE Montenegro
## 6634 MNG Mongolia
## 6696 MOZ Mozambique
## 6758 MRT Mauritania
## 6820 MUS Mauritius
## 6882 MWI Malawi
## 6944 MYS Malaysia
## 7006 NAM Namibia
## 7068 NER Niger
## 7130 NGA Nigeria
## 7192 NLD Netherlands
## 7254 NOR Norway
## 7316 NPL Nepal
## 7378 NZL New Zealand
## 7440 OMN Oman
## 7502 PAK Pakistan
## 7564 PAN Panama
## 7626 PER Peru
## 7688 PHL Philippines
## 7750 POL Poland
## 7812 PRT Portugal
## 7874 PRY Paraguay
## 7936 QAT Qatar
## 7998 ROU Romania
## 8060 RUS Russia
## 8122 RWA Rwanda
## 8184 SAU Saudi Arabia
## 8246 SDN Sudan
## 8308 SEN Senegal
## 8370 SGP Singapore
## 8432 SLE Sierra Leone
## 8494 SLV El Salvador
## 8556 SRB Serbia
## 8618 STP Sao Tome and Principe
## 8680 SUR Suriname
## 8742 SVK Slovak Republic
## 8804 SVN Slovenia
## 8866 SWE Sweden
## 8928 SWZ Swaziland
## 8990 SYR Syria
## 9052 TCD Chad
## 9114 TGO Togo
## 9176 THA Thailand
## 9238 TJK Tajikistan
## 9300 TKM Turkmenistan
## 9362 TTO Trinidad & Tobago
## 9424 TUN Tunisia
## 9486 TUR Turkey
## 9548 TWN Taiwan
## 9610 TZA Tanzania
## 9672 UGA Uganda
## 9734 UKR Ukraine
## 9796 URY Uruguay
## 9858 USA United States of America
## 9920 UZB Uzbekistan
## 9982 VCT St. Vincent & Grenadines
## 10044 VEN Venezuela
## 10106 VNM Vietnam
## 10168 YEM Yemen
## 10230 ZAF South Africa
## 10292 ZMB Zambia
## 10354 ZWE Zimbabwe
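When these PWT data need to be merged with World Bank data, one convenient route (not one I use in this writeup) is the countrycode package, which converts between coding schemes:

```r
library(countrycode)  # available on CRAN
# Map PWT's ISO 3166-1 alpha-3 codes onto World Bank country codes:
ourOwn.DF$wbcode <- countrycode(as.character(ourOwn.DF$isocode),
                                origin = "iso3c", destination = "wb")
```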
You can also take a look for selected economies of the data that we have just created:
ourOwn.DF[ourOwn.DF$isocode %in% c("CHN", "USA"), c("isocode", "year", "pc.GDP")]
## isocode year pc.GDP
## 1861 CHN 1950 NA
## 1862 CHN 1951 NA
## 1863 CHN 1952 0.614
## 1864 CHN 1953 0.679
## 1865 CHN 1954 0.688
## 1866 CHN 1955 0.705
## 1867 CHN 1956 0.742
## 1868 CHN 1957 0.793
## 1869 CHN 1958 0.912
## 1870 CHN 1959 0.992
## 1871 CHN 1960 0.928
## 1872 CHN 1961 0.588
## 1873 CHN 1962 0.603
## 1874 CHN 1963 0.682
## 1875 CHN 1964 0.772
## 1876 CHN 1965 0.842
## 1877 CHN 1966 0.910
## 1878 CHN 1967 0.799
## 1879 CHN 1968 0.749
## 1880 CHN 1969 0.827
## 1881 CHN 1970 0.967
## 1882 CHN 1971 1.006
## 1883 CHN 1972 0.976
## 1884 CHN 1973 1.043
## 1885 CHN 1974 1.044
## 1886 CHN 1975 1.090
## 1887 CHN 1976 1.065
## 1888 CHN 1977 1.089
## 1889 CHN 1978 1.234
## 1890 CHN 1979 1.296
## 1891 CHN 1980 1.324
## 1892 CHN 1981 1.368
## 1893 CHN 1982 1.475
## 1894 CHN 1983 1.556
## 1895 CHN 1984 1.858
## 1896 CHN 1985 2.005
## 1897 CHN 1986 2.083
## 1898 CHN 1987 2.164
## 1899 CHN 1988 2.111
## 1900 CHN 1989 1.966
## 1901 CHN 1990 2.041
## 1902 CHN 1991 2.138
## 1903 CHN 1992 2.297
## 1904 CHN 1993 2.548
## 1905 CHN 1994 2.742
## 1906 CHN 1995 3.058
## 1907 CHN 1996 3.132
## 1908 CHN 1997 3.296
## 1909 CHN 1998 3.239
## 1910 CHN 1999 3.371
## 1911 CHN 2000 3.533
## 1912 CHN 2001 3.753
## 1913 CHN 2002 4.137
## 1914 CHN 2003 4.451
## 1915 CHN 2004 4.880
## 1916 CHN 2005 5.342
## 1917 CHN 2006 5.973
## 1918 CHN 2007 6.610
## 1919 CHN 2008 6.721
## 1920 CHN 2009 7.189
## 1921 CHN 2010 7.679
## 1922 CHN 2011 8.069
## 9797 USA 1950 12.802
## 9798 USA 1951 13.387
## 9799 USA 1952 13.621
## 9800 USA 1953 14.032
## 9801 USA 1954 13.740
## 9802 USA 1955 14.552
## 9803 USA 1956 14.599
## 9804 USA 1957 14.641
## 9805 USA 1958 14.284
## 9806 USA 1959 15.072
## 9807 USA 1960 15.220
## 9808 USA 1961 15.323
## 9809 USA 1962 16.028
## 9810 USA 1963 16.495
## 9811 USA 1964 17.236
## 9812 USA 1965 18.176
## 9813 USA 1966 19.142
## 9814 USA 1967 19.412
## 9815 USA 1968 20.188
## 9816 USA 1969 20.667
## 9817 USA 1970 20.495
## 9818 USA 1971 21.046
## 9819 USA 1972 22.063
## 9820 USA 1973 23.183
## 9821 USA 1974 22.541
## 9822 USA 1975 22.239
## 9823 USA 1976 23.324
## 9824 USA 1977 24.150
## 9825 USA 1978 25.303
## 9826 USA 1979 25.740
## 9827 USA 1980 25.021
## 9828 USA 1981 25.481
## 9829 USA 1982 24.721
## 9830 USA 1983 25.725
## 9831 USA 1984 27.528
## 9832 USA 1985 28.377
## 9833 USA 1986 28.981
## 9834 USA 1987 29.499
## 9835 USA 1988 30.399
## 9836 USA 1989 31.189
## 9837 USA 1990 31.344
## 9838 USA 1991 30.984
## 9839 USA 1992 31.798
## 9840 USA 1993 32.537
## 9841 USA 1994 33.683
## 9842 USA 1995 34.211
## 9843 USA 1996 35.225
## 9844 USA 1997 36.567
## 9845 USA 1998 37.978
## 9846 USA 1999 39.382
## 9847 USA 2000 40.489
## 9848 USA 2001 40.522
## 9849 USA 2002 40.823
## 9850 USA 2003 41.404
## 9851 USA 2004 42.449
## 9852 USA 2005 43.212
## 9853 USA 2006 43.954
## 9854 USA 2007 44.372
## 9855 USA 2008 43.237
## 9856 USA 2009 41.728
## 9857 USA 2010 42.287
## 9858 USA 2011 42.646
so we see in these data that China’s per capita GDP just breached P$8,000 in 2011. By contrast, the US had already, by 1950 – the beginning of the sample – achieved better than 150% of China’s 2011 per capita GDP. In 2011, US per capita GDP exceeded five times China’s.
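A quick check of that arithmetic against the dataframe just built:

```r
chn2011 <- ourOwn.DF[ourOwn.DF$isocode == "CHN" & ourOwn.DF$year == 2011, "pc.GDP"]
usa1950 <- ourOwn.DF[ourOwn.DF$isocode == "USA" & ourOwn.DF$year == 1950, "pc.GDP"]
usa2011 <- ourOwn.DF[ourOwn.DF$isocode == "USA" & ourOwn.DF$year == 2011, "pc.GDP"]
usa1950 / chn2011   # about 1.6: the US in 1950 vs China in 2011
usa2011 / chn2011   # better than 5
```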
Also, we can check out the per capita GDP of a selection of economies of interest, looking at the numbers directly and then producing a graph of them.
This next instruction would serve up the numbers so we can look at them directly but to save space I don’t show its output:
ourOwn.DF[ourOwn.DF$isocode %in% c("GBR", "USA", "SGP"),
c("year", "isocode", "pc.GDP")]
Produce next the desired graph, using the myTStheme aesthetic I defined at the beginning of this document:
thisTitle <- "GBR, USA, SGP per capita GDP at PPP"
ggplot(ourOwn.DF[ourOwn.DF$isocode %in% c("GBR", "USA", "SGP"), c("year", "isocode", "pc.GDP")],
aes(x=year, y=pc.GDP, group=isocode, colour=isocode)) +
myTStheme + geom_line(size=2) + ggtitle(thisTitle)
The first instruction says: concentrate on that part of our dataframe whose isocodes are “GBR” (Great Britain), “USA” (the US), or “SGP” (Singapore), and pick out the columns “year”, “isocode”, and “pc.GDP” that go with those isocodes. This does no more than give us a peek inside our dataframe ourOwn.DF, but it helps reassure us that everything is OK.
The ggplot() instruction re-creates on the fly a dataframe (one that will go away once the instruction finishes) exactly the same as the one we were just looking at in the previous instruction. It then draws a line graph (using geom_line()) where the X-axis is the year variable and the Y-axis is pc.GDP, grouping the observations by isocode (so the “USA” observations all go together, and the “SGP” ones similarly), and using colours specific to each isocode.
(And so, yes, according to these data Singapore’s citizens, in purchasing power parity and on average, have grown richer than the US’s.)
I used a related chart, showing the performance of a couple of other East Asian economies, in Chinese Lessons: Singapore’s Epic Regression to the Mean (Nov 2014).
ggplot(ourOwn.DF[ourOwn.DF$isocode %in% c("USA", "SGP", "TWN", "KOR"), c("year", "isocode", "pc.GDP")],
aes(x=year, y=pc.GDP, group=isocode, colour=isocode)) +
myTStheme + geom_line(size=2)
Get better resolution on this information by looking at ratios relative to US:
ourOwn.DF$relUSA <- rep(NA, nrow(ourOwn.DF))
tmpUSA <- ourOwn.DF[ourOwn.DF$isocode=="USA",]$pc.GDP
for (anISOcode in unique(ourOwn.DF$isocode)) {
if (anISOcode!="USA") {
propUSA <- ourOwn.DF[ourOwn.DF$isocode==anISOcode,]$pc.GDP / tmpUSA
ourOwn.DF[ourOwn.DF$isocode==anISOcode,]$relUSA <- propUSA
}
}
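The loop above can also be written without explicit looping. Here is a minimal base-R sketch on a toy dataframe (toy.DF and its numbers are made up for illustration), using match() to line each row up against the USA series for the same year:

```r
# Toy stand-in for ourOwn.DF: one row per isocode-year
toy.DF <- data.frame(isocode = rep(c("USA", "SGP"), each = 3),
                     year    = rep(2000:2002, times = 2),
                     pc.GDP  = c(40, 41, 42, 30, 33, 36))
# Pull out the USA rows once, then look up the USA value for each row's year.
# (Unlike the loop, USA rows come out as 1 here rather than NA.)
usaRows <- toy.DF[toy.DF$isocode == "USA", ]
toy.DF$relUSA <- toy.DF$pc.GDP / usaRows$pc.GDP[match(toy.DF$year, usaRows$year)]
```

The match() lookup also copes correctly if some economy is missing years that the USA has, which the position-based loop quietly assumes away.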
Averaged over 1960–1964 and 2007–2010 (the windows picked out by the code below), relative income levels (in percent) were:
for (anISOcode in c("SGP", "TWN", "KOR")) {
cat(sprintf("%s %5.2f %5.2f\n", anISOcode,
100*mean(ourOwn.DF$relUSA[ourOwn.DF$isocode==anISOcode &
ourOwn.DF$year>1959 & ourOwn.DF$year<1965]),
100*mean(ourOwn.DF$relUSA[ourOwn.DF$isocode==anISOcode &
ourOwn.DF$year>2006 & ourOwn.DF$year<2011])))
}
## SGP 16.48 114.55
## TWN 13.22 63.96
## KOR 7.04 60.87
If you wish, before proceeding, you can now experiment with looking at different economies’ per capita GDP by varying the ggplot() call above. It’s impractical, however, to graph the per capita GDP of all 167 economies: Well, you can do so, of course, but it’s unclear what to make of the resulting wash of colored ink. That, however, has not prevented a number of well-known researchers from presenting exactly that.
IMF
A perennial question arising in timeseries is what to make of the difference between deeper, underlying long-run secular movements and short-run (quarter by quarter, or even year by year) directly observable fluctuations.
I use this background question to motivate the extraction and manipulation of IMF World Economic Outlook data. In particular, I retrace the steps I used to generate the long-run, short-run comparisons in “Convergence Determines Governance” (Nov 2014).
This trend/cycle distinction arises in many interesting situations when working with dynamic data. Our motivation here comes specifically from continuing the previous discussion on economic growth. We examine dynamic income patterns across advanced and emerging economies, taking the opportunity to unpack an additional useful dataset, namely that presented in the IMF’s World Economic Outlook (October 2015).
Download the “By Countries” and “By Country Groups” files from the IMF provider page, and use Excel to convert them to .xlsx format. (Incidentally, one of the most common questions about IMF data is what IMF means by “Country Groups”. This listing for the October 2015 WEO report gives the answer.) I’m putting these files in my directory
file.path(myDataDir, "IMF-WEO", "2015.10")
and that’s what my R code will point to below. Again, you’ll want to modify those names accordingly for your own machine.
If you peek inside the spreadsheets, you’ll see that IMF decided to present their data vertically by variable (and country, country groups, and so on), and horizontally by year. If each variable has its values running down a column—which is the R convention—the IMF data are organised in the following “panel data” way. The spreadsheet contains, among many others, a variable named “1980”: that variable takes a certain value for the observation “USA GDP”, another for the observation labelled “Singapore Investment”, and so on. This isn’t particularly convenient for timeseries work. So we reshape these data using R.
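To make the reshape concrete, here is a small sketch using base R’s reshape() on a hypothetical wide dataframe (the country labels and numbers here are toy values, not IMF data): one column per year becomes one row per country-year.

```r
# Toy wide layout mimicking the IMF convention: years run across columns
wide.DF <- data.frame(Country = c("USA", "SGP"),
                      `1980` = c(1.0, 2.0),
                      `1981` = c(1.5, 2.5),
                      check.names = FALSE)
# Pivot to long: one GDP observation per Country-year row
long.DF <- reshape(wide.DF, direction = "long",
                   varying = c("1980", "1981"), v.names = "GDP",
                   timevar = "year", times = c(1980, 1981),
                   idvar = "Country")
```

The reshape2 package’s melt(), used for the Maddison data later in this document, achieves the same thing with a terser call.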
A further unfortunate wrinkle: the format the IMF decided to use differs across the two critical spreadsheet files. For countries, the file WEOOct2015all.xlsx includes an extra column for ISO code (which is of course useful) but requires the coder to be wary when peeling off data. Below I will use this ISO information to refer to individual countries, so the code will be written to preserve it.
First read in everything and then keep the series we want. Here it’ll be GDP, or in IMF language, the “WEO Subject Code” that is “NGDPD”, measured in billions of US dollars. Most computer manipulations eschew whitespace like blanks or spaces in names—it’s difficult to tell when one name ends and something else begins. To help us, R quietly goes ahead and replaces the IMF label with a name that substitutes periods “.” for spaces.
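You can see this renaming rule directly with make.names(), which is the function R’s readers apply to column headers (syntactically invalid characters become periods, and names starting with a digit get an “X” prefix):

```r
# Spaces in the IMF header become periods in the R column name
make.names("WEO Subject Code")  # "WEO.Subject.Code"
# Year headers get an "X" prefix, since R names cannot start with a digit
make.names("1980")              # "X1980"
```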
I’d earlier, for illustration, detached some libraries, so I just need to put them back:
library(gdata)
library(ggplot2)
If you’re going to be working with cross-country data later, having some systematic way to refer to countries will be convenient. By systematic I mean not an English (or Chinese or German or Russian) name but something that will appear consistently in international databases. ISO codes are a good option and you might want to memorise them—or at least have some ready reference chart for them. ISO codes will be what I use to pull out countries next. Finally, reading from spreadsheets is a slow process generally. So, just as with the Maddison Project data, I’m going to save the data read in into R’s native format that I subsequently use instead.
theDataXLS <- file.path(myDataDir, "IMF-WEO", "2015.10", "WEOOct2015alla.xlsx")
holdAggr.DF <- read.xls(theDataXLS, sheet="WEOOct2015alla", stringsAsFactors=FALSE)
rm(theDataXLS)
theDataXLS <- file.path(myDataDir, "IMF-WEO", "2015.10", "WEOOct2015all.xlsx")
holdIndv.DF <- read.xls(theDataXLS, sheet="WEOOct2015all", stringsAsFactors=FALSE)
rm(theDataXLS)
myIMFweoAggr.file <- file.path(myDataDir, "IMF-WEO", "2015.10", "WEOOct2015Aggr.rds")
myIMFweoIndv.file <- file.path(myDataDir, "IMF-WEO", "2015.10", "WEOOct2015Indv.rds")
saveRDS(holdAggr.DF, file=myIMFweoAggr.file)
saveRDS(holdIndv.DF, file=myIMFweoIndv.file)
When I want to use these IMF WEO data subsequently, instead of reading spreadsheets slowly, I just do
holdAggr.DF <- readRDS(myIMFweoAggr.file)
holdIndv.DF <- readRDS(myIMFweoIndv.file)
rm(myIMFweoAggr.file, myIMFweoIndv.file)
gdpIMFaggr.DF <- holdAggr.DF[holdAggr.DF$WEO.Subject.Code == "NGDPD",]
gdpIMFindv.DF <- holdIndv.DF[holdIndv.DF$WEO.Subject.Code == "NGDPD" &
holdIndv.DF$ISO %in% c("CHN", "KOR", "SGP", "TWN", "USA"), ]
rm(holdAggr.DF, holdIndv.DF)
Thus far we have been able to do what we want keeping to just R’s basic data.frame class. Indeed, R comes with basic method functions that understand data.frames, and thus can analyse and manipulate data contained there. However, more extensive timeseries work is better done using classes specialised to manipulate timeseries data, and for which more finely-tuned method functions are available. R contains several possible special data classes (including ts, mts, timeSeries, xts, and so on) for timeseries. Eric Zivot describes these in useful detail, and suggests why an applied researcher might choose one or another.
For our purposes hereafter, when we need timeseries specifics, we will use the zoo (Z’s ordered observations) class (Zeileis and Grothendieck, 2005). In my view it is this class that displays the right tradeoff between ease of use and flexibility, not least for those researchers working with financial timeseries data: zoo critically adds, over the standard ts and mts classes, the ability to manage data whose indexes are irregularly spaced, i.e., don’t come in just annual, quarterly, or monthly frequencies. Such data might be perhaps spatial or perhaps drawn from a continuous underlying time record but just recorded at specific points. A researcher might want to work with exchange rates that are continuously traded throughout the day but only recorded at particular instants. For working with annual data this specific advantage—handling irregular but ordered indexes—does not yet make a difference, but we might as well get used to zoo now: we’re going to need to put IMF’s reorganised data somewhere useful in any case.
A zoo object, like a data.frame, is just a two-dimensional array, and can be added to, extracted from, have parts removed, and generally be manipulated using much the same conventions as a data.frame. Its contents too run down columns, each column making up a single variable.
Two differences from data.frames are key. First, a zoo object knows about timeseries structure and methods; this is good. Second, its body comprises just numbers; this needn’t be bad, until you want to analyse something other than numbers.
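A tiny sketch of both points: construct a zoo series on an irregular index (something ts cannot represent), and note that coredata() hands back the plain numeric body (the values here are made up):

```r
library(zoo)

# Three observations at irregularly spaced "years": ts could not hold this
irr.oo <- zoo(c(1.10, 1.25, 1.15), order.by = c(2001, 2003, 2008))
index(irr.oo)     # the ordered index: 2001 2003 2008
coredata(irr.oo)  # the numeric body, stripped of timeseries structure
```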
library(zoo)
library(dynlm)
begYear <- 1980
endYear <- 2015
nmbYears <- endYear - begYear + 1
Because a data.frame is typically more general than just numbers, we have to take added care when moving contents between data.frame and zoo objects. I use the R function sapply, together with as.numeric and gsub below, to achieve the proper translation, but there are other ways to do this. Again, if we didn’t apply gsub, human-readable big numbers with “,”s in them would not be interpreted as numbers but simply coerced to NA.
In this application I have cheated by looking into the IMF spreadsheets for where the numerical data sit, rather than automating the procedure more elegantly. But the resulting code below is at least short and transparent. Simply adjust the next few lines appropriately when you use this on IMF data that might have had their formats altered. In software code whenever you see a number other than “0” or “1” and it’s not part of a name, you need to think about why something so special appears in what should otherwise be general. (So watch out on the “9” and “10” below.)
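If you would rather not hard-code the 9 and the 10, one hedged alternative is to locate the year columns by their names. The sketch below assumes, as R’s readers produce, that year columns arrive named “X1980”, “X1981”, and so on (toyNames is made up for illustration):

```r
# Stand-in for names(gdpIMFindv.DF): metadata columns, then year columns
toyNames <- c("ISO", "WEO.Subject.Code", "Scale", "X1980", "X1981", "X1982")
# Match four-digit years, with or without R's "X" prefix
yearCols <- grep("^X?[0-9]{4}$", toyNames)  # 4 5 6
```

The result, yearCols, then replaces the hand-counted begColumn:(begColumn+nmbYears-1) range, and survives the IMF adding or dropping a metadata column.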
begColumn <- 9
tmpDatMtr <- sapply(gdpIMFaggr.DF[, begColumn:(begColumn+nmbYears-1)],
function (x) {as.numeric(gsub(",", "", as.character(x)))})
begColumn <- 10
tmpHldMtr <- sapply(gdpIMFindv.DF[, begColumn:(begColumn+nmbYears-1)],
function (x) {as.numeric(gsub(",", "", as.character(x)))})
tmpDatMtr <- rbind(tmpDatMtr, tmpHldMtr)
imfGDP.oo <- zoo (t(tmpDatMtr), c(begYear:endYear))
names(imfGDP.oo) <- c(gdpIMFaggr.DF$Country.Group.Name, gdpIMFindv.DF$ISO)
rm(gdpIMFaggr.DF, gdpIMFindv.DF, tmpDatMtr, tmpHldMtr, begColumn)
For aggregates the IMF gives descriptions (such as “Middle East and North Africa”) and corresponding cryptic WEO Country Group Codes (603) but no really convenient label to use in coding. So we give up and just remember which columns in imfGDP.oo refer to which country groupings from the output of
print(names(imfGDP.oo))
## [1] "World"
## [2] "Advanced economies"
## [3] "Euro area "
## [4] "Major advanced economies (G7)"
## [5] "Other advanced economies (Advanced economies excluding G7 and euro area)"
## [6] "European Union"
## [7] "Emerging market and developing economies"
## [8] "Commonwealth of Independent States"
## [9] "Emerging and developing Asia"
## [10] "Emerging and developing Europe"
## [11] "ASEAN-5"
## [12] "Latin America and the Caribbean"
## [13] "Middle East, North Africa, Afghanistan, and Pakistan"
## [14] "Middle East and North Africa"
## [15] "Sub-Saharan Africa"
## [16] "CHN"
## [17] "KOR"
## [18] "SGP"
## [19] "TWN"
## [20] "USA"
Having seen that output, trash some series in specific columns, and clean up some names that we are more likely to want to use. At least the ISO codes, relatively memorable and convenient to use, can remain unchanged.
names(imfGDP.oo)[2] <- "Advanced.Economies"
names(imfGDP.oo)[3] <- "Euro.Area"
names(imfGDP.oo)[4] <- "G7"
names(imfGDP.oo)[6] <- "EU"
names(imfGDP.oo)[7] <- "EMDE"
names(imfGDP.oo)[14] <- "MENA"
imfGDP.oo <- imfGDP.oo[, c(-5, -8:-10, -12, -13, -15)]
Now construct vectors of the economies and groupings to which we will want to pay closer attention:
indWrldG7EMDE <- match(c("World", "G7", "EMDE"), names(imfGDP.oo))
indG7EMDE <- match(c("G7", "EMDE"), names(imfGDP.oo))
indUSACHN <- match(c("USA", "CHN"), names(imfGDP.oo))
Using match here to create index vectors is much more convenient and less error-prone than trying to remember which column in the zoo object imfGDP.oo corresponds to what aggregate grouping or economy.
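A toy illustration of the point (theNames here is made up, mimicking the renamed columns above):

```r
# Hypothetical column names after the renaming step
theNames <- c("World", "Advanced.Economies", "Euro.Area", "G7", "EMDE", "MENA")
# match() returns the positions of the requested names, in the order requested
idx <- match(c("G7", "EMDE"), theNames)  # 4 5
```

If a name is mistyped, match() returns NA for that entry rather than silently pointing at the wrong column, which is exactly the failure mode you want.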
Take a look at some of what we’ve put together:
thisTitle <- "World, G7, and Emerging Economies"
autoplot(imfGDP.oo[, indWrldG7EMDE], facets=NULL) +
geom_line(size=2) + myTStheme + ggtitle(thisTitle)
thisTitle <- "G7 and Emerging Economies"
autoplot(imfGDP.oo[, indG7EMDE], facets=NULL) +
geom_line(size=2) + myTStheme + ggtitle(thisTitle)
thisTitle <- "US and China"
autoplot(imfGDP.oo[, indUSACHN], facets=NULL) +
geom_line(size=2) + myTStheme + ggtitle(thisTitle)
As an exercise, I re-create here that extrapolated-trend graph with China data:
imfGDP.oo$logCHN <- log(imfGDP.oo$CHN)
thisTitle <- "China"
theBegFit <- 1980
theEndFit <- 2000
theEndSmp <- 2015
olsFIT <- dynlm(logCHN ~ trend(imfGDP.oo), data = imfGDP.oo,
start=theBegFit, end=theEndFit)
expTrendFitted <- function(x) {ifelse (x>=theBegFit & x<=theEndFit,
exp(coef(olsFIT)[1] + ((x-theBegFit) * coef(olsFIT)[2])), NA)
}
expTrendExtrap <- function(x) {ifelse (x>=theEndFit+1 & x<=theEndSmp,
exp(coef(olsFIT)[1] + ((x-theBegFit) * coef(olsFIT)[2])), NA)
}
thisTitle <- paste0(thisTitle, " GDP (US$bn, Market Exchange Rates)")
autoplot(imfGDP.oo$CHN, facets=NULL) +
stat_function(fun=expTrendFitted, linetype=1, colour="blue", size=1.1) +
stat_function(fun=expTrendExtrap, linetype=2, colour="blue", size=1.05) +
geom_line(size=2) + myTStheme + ggtitle(thisTitle)
One of the most striking features in these data is how “Emerging Markets and Developing Economies” have assumed a dramatically larger footprint in the global economy. To be clear, the G7 economies, at least from the visual perspective in these graphs, have not slowed dramatically in their collective growth trajectory. Instead, it is that the emerging markets have just grown so much faster since the early 2000s.
thisTitle <- "Underlying Trends (5-year moving average). Trillions current US$"
tmpGDP.oo <- rollmean(imfGDP.oo[, indG7EMDE], 5, align="center")
autoplot(tmpGDP.oo/1000, facets=NULL) +
geom_line(size=2) + myTStheme + ggtitle(thisTitle)
rm(tmpGDP.oo)
thisTitle <- "G7-Emerging Economies gap as fraction of G7 GDP"
autoplot((imfGDP.oo$G7 - imfGDP.oo$EMDE) / imfGDP.oo$G7, facets=NULL) +
geom_line(size=2, colour="dark blue") + myTStheme + ggtitle(thisTitle)
Some of that catch-up is of course due simply (i.e., arithmetically) to China. But the graph comparing aggregate GDP for China and the US shows that that can’t be the entire explanation.
For readers accustomed to thinking of cross-country comparison in per capita GDP, it is useful to remember why these aggregate GDP statistics are useful. Obviously, they wouldn’t be what someone would want to look at for, say, convergence in a neoclassical growth model. However, they are exactly what someone would need to assess shifts in the global balance of power—economic initially of course but then perhaps more generally. It is these statistics someone would want to use to gauge the capacity of different economies or groupings to drive or drag down global economic performance, or to measure needs and functions for appropriate global governance.
For the same reason, the analysis here looks not at GDP corrected for purchasing power parity but instead at market exchange rates. Again, it is these latter that matter for evaluating contribution to the global economy and for assessing global power shifts, while purchasing power parity adjustment is useful instead for estimating residents’ well-being.
A recurrent question given this perspective is: can the emerging economies continue to grow if the advanced ones stagnate? The preceding might suggest yes. Nonetheless, the following calculations on growth rates are often taken to suggest that the emerging economies might slow if the G7 undergoes a more prolonged secular stagnation:
imfGDP.oo$G7g <- diff(log(imfGDP.oo$G7))
imfGDP.oo$EMDEg <- diff(log(imfGDP.oo$EMDE))
indG7EMDEg <- match(c("G7g", "EMDEg"), names(imfGDP.oo))
thisTitle <- "G7 and Emerging Economy annual growth rates"
autoplot(imfGDP.oo[, indG7EMDEg], facets=NULL) +
geom_line(size=2) + myTStheme + ggtitle(thisTitle)
Notice how growth rates across the two groups seem, over time, to have moved closer and closer in sync with each other. We can confirm this visual impression by calculating cross-correlations:
earlySample <- as.character(seq(1981, 2000))
laterSample <- as.character(seq(2001, 2015))
cor (imfGDP.oo[earlySample, indG7EMDEg])
## G7g EMDEg
## G7g 1.000 0.153
## EMDEg 0.153 1.000
cor (imfGDP.oo[laterSample, indG7EMDEg])
## G7g EMDEg
## G7g 1.000 0.787
## EMDEg 0.787 1.000
In the early part of the sample, up through 2000, the cross-correlation between growth rates in the G7 and the Emerging Market and Developing Economies was about 0.15. Towards the end of the sample, after 2000, that statistic had risen more than four-fold, to nearly 0.8. However, despite this higher cyclical correlation (ever “tighter coupling”), the long-term trend behaviour—as shown dramatically in the graphs—shows growth occurring in the emerging markets even without a corresponding speed-up in the G7. Some might even say this last feature shows “decoupling”.
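Rather than splitting the sample at 2000, one could trace the coupling continuously with a rolling-window correlation. This sketch runs on simulated stand-ins for the two growth series; the real calculation would substitute imfGDP.oo$G7g and imfGDP.oo$EMDEg (zoo’s rollapply offers a tidier route once the data already live in a zoo object):

```r
set.seed(1)
# Simulated stand-ins for 36 annual growth observations
g7g   <- rnorm(36)
emdeg <- 0.5 * g7g + rnorm(36)
# Correlation over a 10-year window ending at each observation 10..36
rollCor <- sapply(10:36, function(i) cor(g7g[(i - 9):i], emdeg[(i - 9):i]))
```

Plotting rollCor against the window-end year shows the coupling as a path rather than a two-point before/after comparison.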
Polity IV
No one – least of all its impeccably conscientious creators and maintainers – pretends that ideas as complicated as democracy or autocracy can be summarised in a single numerical index. Nonetheless, without some basis to start the discussion, we would just be making things up as we go along, and making little progress. The Polity IV data give us that substantial first step.
The splashpage contains this disclaimer:
I heartily recommend this form of words for all projects financed by governments of nation states.
(TBC – What I did with this for my Middle-Income Trap project.)
CONCLUSION
This writeup has described, for R manipulation, the access and use of a number of datasets central to studying the global economy. By its nature, this document is never finished. Every time the author finds and uses a new dataset valuable for our understanding of the world economy, that dataset’s description and manipulation appear here.
APPENDIX
This is the code chunk for plotting multiple series neatly:
read_chunk(file.path(myRoutinesDir, "multplot-maddp.R"))
# This works only for my Maddison Project DF;
# I haven't found it worth coding more generally
# Thu Jan 07 16:33:21 2016 - Danny Quah
getSeries <- c("Year", "Economy", theSeries)
theAES <- aes_string(x="Year", y=theSeries, group="Economy",
colour="Economy")
this.DF <-
MaddP.DF[(MaddP.DF$Economy %in% theEconomies) &
(MaddP.DF$Year >= theBegSmpl) & (MaddP.DF$Year <= theEndSmpl),
getSeries]
ggplot(data=this.DF, theAES) + geom_line(size=2) +
myTStheme + ggtitle(thisTitle)
rm(this.DF, theAES, getSeries)
This is the code chunk that implements the eyeballing-trend operation.
read_chunk(file.path(myRoutinesDir, "eyetrend-maddp.R"))
# This works only for my Maddison Project DF;
# I haven't found it worth coding more generally
# Thu Jan 07 16:33:21 2016 - Danny Quah
olsFIT <- lm(logPerCapGDP ~ Year,
data=MaddP.DF[(MaddP.DF$Economy == theEconomy) &
(MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <= theEndFit), ])
thisTitle <- paste0(theEconomy,
": log Per Capita GDP in constant 1990 Int. GK$")
this.DF <-
MaddP.DF[(MaddP.DF$Economy == theEconomy) &
(MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <= theEndSmp),
c("Year", "perCapitaGDP", "logPerCapGDP")]
ggplot(data=this.DF, aes(x=Year, y=logPerCapGDP)) + geom_line(size=2) +
geom_segment(data=this.DF, aes(x=theBegFit, xend=theEndFit,
y=coef(olsFIT)[1]+coef(olsFIT)[2]*theBegFit,
yend=coef(olsFIT)[1]+coef(olsFIT)[2]*theEndFit),
linetype=1, colour="blue", size=1.1) +
geom_segment(data=this.DF, aes(x=theEndFit+1, xend=theEndSmp,
y=coef(olsFIT)[1]+coef(olsFIT)[2]*(theEndFit+1),
yend=coef(olsFIT)[1]+coef(olsFIT)[2]*theEndSmp),
linetype=2, colour="blue", size=1.05) +
myTStheme + ggtitle(thisTitle)
#
thisTitle <- paste0(theEconomy,
": Per Capita GDP in constant 1990 Int. GK$")
expTrendFitted <- function(x) {ifelse (x>=theBegFit & x<=theEndFit,
exp(coef(olsFIT)[1] + (x * coef(olsFIT)[2])), NA)
}
expTrendExtrap <- function(x) {ifelse (x>=theEndFit+1 & x<=theEndSmp,
exp(coef(olsFIT)[1] + (x * coef(olsFIT)[2])), NA)
}
ggplot(data=this.DF, aes(x=Year, y=perCapitaGDP)) + geom_line(size=2) +
stat_function(fun=expTrendFitted, linetype=1,
colour="blue", size=1.1) +
stat_function(fun=expTrendExtrap, linetype=2,
colour="blue", size=1.05) +
myTStheme + ggtitle(thisTitle)
rm(olsFIT, expTrendFitted, expTrendExtrap)
Long Run World Economic Growth
Economics and International Development, LSE
Background Common to Projects
October 2014
D. Quah
This writeup summarises some useful information on long-run world economic growth using data from the Maddison Project.
library(knitr)
opts_chunk$set(echo=TRUE, tidy=FALSE, warning=FALSE)
setwd("~/Dropbox/1/j/Code/2014.03-World-Growth")
The Maddison Project provides the now-standard data to study comparative economic growth over the very long run. These data are provided as an Excel spreadsheet. Unfortunately, that information is given in a way that is more useful visually than for data manipulation and analysis. So, typically, one would need to go through the following to put the data into a more usable form.
library(ggplot2)
library(gdata)
## gdata: read.xls support for 'XLS' (Excel 97-2004) files ENABLED.
##
## gdata: read.xls support for 'XLSX' (Excel 2007+) files ENABLED.
##
## Attaching package: 'gdata'
##
## The following object is masked from 'package:stats':
##
## nobs
##
## The following object is masked from 'package:utils':
##
## object.size
library(reshape2)
library(stringr)
theMaddisonXLS <- "~/Dropbox/1/j/data/Maddison-Project/mpd_2013-01.xlsx"
hold.DF <- read.xls(theMaddisonXLS, skip=1, stringsAsFactors=FALSE)
colNames <- as.character(hold.DF[1, ])
colNames[1] <- "Year"
new.DF <- hold.DF[-1, ]
names(new.DF) <- colNames
MaddP.DF <- melt(new.DF, id.vars="Year")
rm(new.DF, hold.DF, colNames)
MaddP.DF$value <- as.numeric(MaddP.DF$value)
MaddP.DF <- MaddP.DF[!is.na(MaddP.DF$value), ]
names(MaddP.DF)[2] <- "Economy"
names(MaddP.DF)[3] <- "perCapitaGDP"
MaddP.DF$logPerCapGDP <- log(MaddP.DF$perCapitaGDP)
MaddP.DF$Economy <- str_trim(MaddP.DF$Economy, side="both")
detach("package:stringr")
detach("package:reshape2")
detach("package:gdata")
(You can check that if we didn’t do this trimming of whitespace with str_trim(), we wouldn’t get a match for “Sweden” in the code chunk to follow. Astounding but true: spreadsheets and casual hand-editing are dangerous things to mix and then have lying around on a computer.)
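The failure mode is easy to reproduce in base R (trimws() is base R’s analogue of stringr’s str_trim()):

```r
# Hand-edited spreadsheets leave stray whitespace around names
messy <- c(" Sweden", "France ")
"Sweden" %in% messy           # FALSE: the stray blank blocks the match
"Sweden" %in% trimws(messy)   # TRUE once the whitespace is trimmed
```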
Since I knew I would want to use these data repeatedly and I didn’t want to keep running the code chunk above, I saved my own copy of the Maddison Project GDP data in R’s native format:
myMaddP.file <- "~/Dropbox/1/j/data/Maddison-Project/maddp-201301-DQ.rds"
saveRDS(MaddP.DF, file=myMaddP.file)
(This is just for my personal use so I’m not packaging it up as a library.)
When I now need to use these data I no longer need to do all the stripping and cleaning after (slowly) reading a spreadsheet as above. Instead I just go:
MaddP.DF <- readRDS(myMaddP.file)
and to get growth rates to study subsequently:
MaddP.DF$annGrowth <- NA
for (anEconomy in unique(MaddP.DF$Economy)) {
theYears <- MaddP.DF[MaddP.DF$Economy==anEconomy, ]$Year
logPCGDP <- MaddP.DF[MaddP.DF$Economy==anEconomy, ]$logPerCapGDP
theAnnGr <- rep(NA, length(logPCGDP))
for (jLoop in 2:length(theAnnGr)) {
if (theYears[jLoop-1] == theYears[jLoop]-1) {
theAnnGr[jLoop] <- logPCGDP[jLoop] - logPCGDP[jLoop-1]
}
}
# Change to percent and then move into dataframe
MaddP.DF[MaddP.DF$Economy==anEconomy, ]$annGrowth <- 100.0 * theAnnGr
rm(theAnnGr)
}
(For those who know R: notice I can’t simply vectorise the inner loop using, say, diff, as I need to check that the data are available sequentially in time.)
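For completeness, here is one way diff can still be used per economy, at the cost of a second diff on the years to blank out any growth rate computed across a gap (toy numbers for illustration):

```r
toyYears <- c(1950, 1951, 1953, 1954)           # note the missing 1952
toyLogPC <- log(c(100, 103, 110, 112))
# Naive log-differences in percent, NA-padded to keep lengths aligned
theAnnGr <- c(NA, 100 * diff(toyLogPC))
# Blank out any "growth rate" straddling a gap in the year sequence
theAnnGr[c(FALSE, diff(toyYears) != 1)] <- NA
```

Applied economy by economy, this reproduces the inner loop above without element-wise iteration.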
What economies are we working with here?
unique(MaddP.DF$Economy)
## [1] "Austria"
## [2] "Belgium"
## [3] "Denmark"
## [4] "Finland"
## [5] "France"
## [6] "Germany"
## [7] "(Centre- North) Italy"
## [8] "Holland/ Netherlands"
## [9] "Norway"
## [10] "Sweden"
## [11] "Switzerland"
## [12] "England/GB/UK"
## [13] "12 W. Europe"
## [14] "Ireland"
## [15] "Greece"
## [16] "Portugal"
## [17] "Spain"
## [18] "14 small WEC"
## [19] "30 W. Europe"
## [20] "Australia"
## [21] "N. Zealand"
## [22] "Canada"
## [23] "USA"
## [24] "W. Offshoots"
## [25] "Albania"
## [26] "Bulgaria"
## [27] "Czecho-slovakia"
## [28] "Hungary"
## [29] "Poland"
## [30] "Romania"
## [31] "Yugoslavia"
## [32] "7 E. Europe"
## [33] "Bosnia"
## [34] "Croatia"
## [35] "Macedonia"
## [36] "Slovenia"
## [37] "Montenegro"
## [38] "Serbia"
## [39] "Kosovo"
## [40] "F. Yugoslavia"
## [41] "Czech Rep."
## [42] "Slovakia"
## [43] "F. Czecho-slovakia"
## [44] "Armenia"
## [45] "Azerbaijan"
## [46] "Belarus"
## [47] "Estonia"
## [48] "Georgia"
## [49] "Kazakhstan"
## [50] "Kyrgyzstan"
## [51] "Latvia"
## [52] "Lithuania"
## [53] "Moldova"
## [54] "Russia"
## [55] "Tajikistan"
## [56] "Turk-menistan"
## [57] "Ukraine"
## [58] "Uzbekistan"
## [59] "F. USSR"
## [60] "Argentina"
## [61] "Brazil"
## [62] "Chile"
## [63] "Colombia"
## [64] "Mexico"
## [65] "Peru"
## [66] "Uruguay"
## [67] "Venezuela"
## [68] "8 L. America"
## [69] "Bolivia"
## [70] "Costa Rica"
## [71] "Cuba"
## [72] "Dominican Rep."
## [73] "Ecuador"
## [74] "El Salvador"
## [75] "Guatemala"
## [76] "Haïti"
## [77] "Honduras"
## [78] "Jamaica"
## [79] "Nicaragua"
## [80] "Panama"
## [81] "Paraguay"
## [82] "Puerto Rico"
## [83] "T. & Tobago"
## [84] "15 L. America"
## [85] "21 Caribbean"
## [86] "L. America"
## [87] "China"
## [88] "India"
## [89] "Indonesia (Java before 1880)"
## [90] "Japan"
## [91] "Philippines"
## [92] "S. Korea"
## [93] "Thailand"
## [94] "Taiwan"
## [95] "Bangladesh"
## [96] "Burma"
## [97] "Hong Kong"
## [98] "Malaysia"
## [99] "Nepal"
## [100] "Pakistan"
## [101] "Singapore"
## [102] "Sri Lanka"
## [103] "16 E. Asia"
## [104] "Afghanistan"
## [105] "Cambodia"
## [106] "Laos"
## [107] "Mongolia"
## [108] "North Korea"
## [109] "Vietnam"
## [110] "24 Sm. E. Asia"
## [111] "30 E. Asia"
## [112] "Bahrain"
## [113] "Iran"
## [114] "Iraq"
## [115] "Israel"
## [116] "Jordan"
## [117] "Kuwait"
## [118] "Lebanon"
## [119] "Oman"
## [120] "Qatar"
## [121] "Saudi Arabia"
## [122] "Syria"
## [123] "NA"
## [124] "UAE"
## [125] "Yemen"
## [126] "W. Bank & Gaza"
## [127] "15 W. Asia"
## [128] "Asia"
## [129] "Algeria"
## [130] "Angola"
## [131] "Benin"
## [132] "Botswana"
## [133] "Burkina Faso"
## [134] "Burundi"
## [135] "Cameroon"
## [136] "Cape Verde"
## [137] "Centr. Afr. Rep."
## [138] "Chad"
## [139] "Comoro Islands"
## [140] "Congo 'Brazzaville'"
## [141] "Côte d'Ivoire"
## [142] "Djibouti"
## [143] "Egypt"
## [144] "Equatorial Guinea"
## [145] "Eritrea & Ethiopia"
## [146] "Gabon"
## [147] "Gambia"
## [148] "Ghana"
## [149] "Guinea"
## [150] "Guinea Bissau"
## [151] "Kenya"
## [152] "Lesotho"
## [153] "Liberia"
## [154] "Libya"
## [155] "Madagascar"
## [156] "Malawi"
## [157] "Mali"
## [158] "Mauritania"
## [159] "Mauritius"
## [160] "Morocco"
## [161] "Mozambique"
## [162] "Namibia"
## [163] "Niger"
## [164] "Nigeria"
## [165] "Rwanda"
## [166] "Sao Tomé & Principe"
## [167] "Senegal"
## [168] "Seychelles"
## [169] "Sierra Leone"
## [170] "Somalia"
## [171] "Cape Colony/ South Africa"
## [172] "Sudan"
## [173] "Swaziland"
## [174] "Tanzania"
## [175] "Togo"
## [176] "Tunisia"
## [177] "Uganda"
## [178] "Congo-Kinshasa"
## [179] "Zambia"
## [180] "Zimbabwe"
## [181] "3 Small Afr."
## [182] "Total Africa"
## [183] "Total World"
A bit of a mess, isn’t it? And this is after we’ve already done a str_trim().
Look, I am in awe of the amount of work that has gone into constructing these Maddison Project data. These researchers have my greatest respect. But the mess above is what happens when authors use names they go around making up, like “England/GB/UK”, “Holland/ Netherlands”, “(Centre- North) Italy”, “14 small WEC”, or “3 Small Afr.”; or when they insert peculiar characters like “&” or random invisible whitespace or other non-ASCII characters.
(Without ISO standardisation, of course, it’s inevitable we have to make things up. Still.)
It’s bad enough when we have to guess what these names mean in a spreadsheet; trying to write computer code to select things systematically from this is almost impossible.
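One workable defence is to maintain, once and for all, your own lookup table from the Maddison Project’s idiosyncratic names to ISO codes, and merge it into the dataframe. A minimal sketch (the mapping shown is deliberately tiny, and the GDP numbers are placeholders):

```r
# Hand-maintained mapping from Maddison names to ISO alpha-3 codes
isoMap.DF <- data.frame(Economy = c("England/GB/UK", "Holland/ Netherlands", "USA"),
                        isocode = c("GBR", "NLD", "USA"),
                        stringsAsFactors = FALSE)
# Toy stand-in for MaddP.DF
toy.DF <- data.frame(Economy = c("USA", "England/GB/UK"),
                     perCapitaGDP = c(1, 2),
                     stringsAsFactors = FALSE)
# Left-join: every row keeps its data, gaining a systematic isocode column
merged.DF <- merge(toy.DF, isoMap.DF, by = "Economy", all.x = TRUE)
```

Once the table exists, all subsequent selection can go through isocode, and any economy the table misses surfaces immediately as an NA.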
But having done our best to clean these data, take a look at some selected growth experiences. For convenience and aesthetics, set up a theme for the charts to come.
myTStheme <- theme_classic() +
theme(
plot.title=element_text(size=rel(1.5)),
legend.title=element_text(size=rel(1.5)),
legend.text=element_text(size=rel(1.5)),
legend.position=c(1,0), legend.justification=c(1,0),
axis.text=element_text(size=rel(1.5)),
axis.title=element_text(size=rel(1.5)),
axis.title.y=element_blank()
)
In the Appendix I set up R code to do this conveniently; that code will be re-used subsequently as well.
Using that code now, check out these four economies from 1870 to 2010:
theBegSmpl <- 1870
theEndSmpl <- 2010
theEconomies <- c("USA", "England/GB/UK", "France", "Sweden")
theSeries <- "logPerCapGDP"
thisTitle <- "log Per Capita GDP in constant 1990 Int. GK$"
source(file="./multplot.R", local=TRUE, echo=TRUE)
##
## > getSeries <- c("Year", "Economy", theSeries)
##
## > theAES <- aes_string(x = "Year", y = theSeries, group = "Economy",
## + colour = "Economy")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy %in% theEconomies) &
## + (MaddP.DF$Year >= theBegSmpl) & (MaddP.DF$Year <= theEndSmpl),
## + getSeries]
##
## > ggplot(data = this.DF, theAES) + geom_line(size = 2) +
## + myTStheme + ggtitle(thisTitle)
##
## > rm(this.DF, theAES, getSeries)
rm(thisTitle, theSeries, theEconomies, theEndSmpl, theBegSmpl)
To structure this information more clearly, I seek to eyeball an extrapolated trend in these per capita income data. As previously, I provide in the Appendix the R code to do this. Here, I just call that code after setting up the things I want to see.
Begin with US data:
theBegFit <- 1870
theEndFit <- 1980
theEndSmp <- 2010
theEconomy <- "USA"
source(file="./eyetrend.R", local=TRUE, echo=TRUE)
##
## > olsFIT <- lm(logPerCapGDP ~ Year, data = MaddP.DF[(MaddP.DF$Economy ==
## + theEconomy) & (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <=
## + .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": log Per Capita GDP in constant 1990 Int. GK$")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy == theEconomy) &
## + (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <= theEndSmp),
## + c("Year", "perCapi ..." ... [TRUNCATED]
##
## > ggplot(data = this.DF, aes(x = Year, y = logPerCapGDP)) +
## + geom_line(size = 2) + geom_segment(data = this.DF, aes(x = theBegFit,
## + xend = .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": Per Capita GDP in constant 1990 Int. GK$")
##
## > expTrendFitted <- function(x) {
## + ifelse(x >= theBegFit & x <= theEndFit, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > expTrendExtrap <- function(x) {
## + ifelse(x >= theEndFit + 1 & x <= theEndSmp, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > ggplot(data = this.DF, aes(x = Year, y = perCapitaGDP)) +
## + geom_line(size = 2) + stat_function(data = this.DF, fun = expTrendFitted,
## + li .... [TRUNCATED]
##
## > rm(olsFIT, expTrendFitted, expTrendExtrap)
rm(theBegFit, theEndFit, theEndSmp, theEconomy)
where presented are both the fitted linear trend for the log of US per capita GDP, and the resulting exponential trend for the original series. The solid line is the fitted trend; the dashed line the extrapolation.
Remarkably, a smooth exponential trend, fitted on data from 1870 through only 1980, gives a reasonable description of the out-of-sample post-1980 behaviour of US per capita GDP.
Do the same for China but now beginning in 1950 as it’s from then that the Maddison Project data provide a usefully uninterrupted sequence:
theBegFit <- 1950
theEndFit <- 1980
theEndSmp <- 2010
theEconomy <- "China"
source(file="./eyetrend.R", local=TRUE, echo=TRUE)
##
## > olsFIT <- lm(logPerCapGDP ~ Year, data = MaddP.DF[(MaddP.DF$Economy ==
## + theEconomy) & (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <=
## + .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": log Per Capita GDP in constant 1990 Int. GK$")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy == theEconomy) &
## + (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <= theEndSmp),
## + c("Year", "perCapi ..." ... [TRUNCATED]
##
## > ggplot(data = this.DF, aes(x = Year, y = logPerCapGDP)) +
## + geom_line(size = 2) + geom_segment(data = this.DF, aes(x = theBegFit,
## + xend = .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": Per Capita GDP in constant 1990 Int. GK$")
##
## > expTrendFitted <- function(x) {
## + ifelse(x >= theBegFit & x <= theEndFit, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > expTrendExtrap <- function(x) {
## + ifelse(x >= theEndFit + 1 & x <= theEndSmp, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > ggplot(data = this.DF, aes(x = Year, y = perCapitaGDP)) +
## + geom_line(size = 2) + stat_function(data = this.DF, fun = expTrendFitted,
## + li .... [TRUNCATED]
##
## > rm(olsFIT, expTrendFitted, expTrendExtrap)
rm(theEconomy, theEndSmp, theEndFit, theBegFit)
In stark contrast to the US, China’s post-1980 per capita GDP follows a completely different trajectory from its pre-1980 history. This, of course, is no surprise to anyone even vaguely aware of global economic developments. The value of the calculation is in quantifying how large a change has occurred: if anyone thought growth trends were slow and difficult to change, China provides a striking and positive counter-example.
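That quantification can be made concrete. Here is a minimal self-contained sketch, with synthetic data standing in for MaddP.DF (so the magnitudes are illustrative only), of how far above its old extrapolated trend an economy ends up:

```r
# Synthetic stand-in for MaddP.DF: a log per capita GDP series that,
# like China's, switches to faster growth after 1980.
years  <- 1950:2010
logY   <- ifelse(years <= 1980,
                 1 + 0.02 * (years - 1950),               # 2% trend growth pre-1980
                 1 + 0.02 * 30 + 0.07 * (years - 1980))   # 7% trend growth after
syn.DF <- data.frame(Year = years, logPerCapGDP = logY)

# Fit the trend on 1950-1980 only, as eyetrend.R does.
olsFIT <- lm(logPerCapGDP ~ Year, data = syn.DF[syn.DF$Year <= 1980, ])

# Ratio of actual to trend-extrapolated income in 2010: how far above
# its old trend the economy ends up.
trend2010 <- predict(olsFIT, newdata = data.frame(Year = 2010))
ratio     <- exp(syn.DF$logPerCapGDP[syn.DF$Year == 2010] - trend2010)
unname(ratio)   # about exp(1.5), i.e. roughly 4.5 times the old trend
```

On the real MaddP.DF the same two lines after the lm() call deliver the corresponding ratio for China.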
Finally, for comparison, let’s do this for the UK:
theBegFit <- 1950
theEndFit <- 1980
theEndSmp <- 2010
theEconomy <- "England/GB/UK"
source(file="./eyetrend.R", local=TRUE, echo=TRUE)
##
## > olsFIT <- lm(logPerCapGDP ~ Year, data = MaddP.DF[(MaddP.DF$Economy ==
## + theEconomy) & (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <=
## + .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": log Per Capita GDP in constant 1990 Int. GK$")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy == theEconomy) &
## + (MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <= theEndSmp),
## + c("Year", "perCapi ..." ... [TRUNCATED]
##
## > ggplot(data = this.DF, aes(x = Year, y = logPerCapGDP)) +
## + geom_line(size = 2) + geom_segment(data = this.DF, aes(x = theBegFit,
## + xend = .... [TRUNCATED]
##
## > thisTitle <- paste0(theEconomy, ": Per Capita GDP in constant 1990 Int. GK$")
##
## > expTrendFitted <- function(x) {
## + ifelse(x >= theBegFit & x <= theEndFit, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > expTrendExtrap <- function(x) {
## + ifelse(x >= theEndFit + 1 & x <= theEndSmp, exp(coef(olsFIT)[1] +
## + (x * coef(olsFIT)[2])), NA)
## + }
##
## > ggplot(data = this.DF, aes(x = Year, y = perCapitaGDP)) +
## + geom_line(size = 2) + stat_function(data = this.DF, fun = expTrendFitted,
## + li .... [TRUNCATED]
##
## > rm(olsFIT, expTrendFitted, expTrendExtrap)
rm(theEconomy, theEndSmp, theEndFit, theBegFit)
Get a final sense of the difference here by putting all these on the same graph.
theBegSmpl <- 1950
theEndSmpl <- 2010
theEconomies <- c("USA", "England/GB/UK", "China")
theSeries <- "logPerCapGDP"
thisTitle <- "log Per Capita GDP in constant 1990 Int. GK$"
source(file="./multplot.R", local=TRUE, echo=TRUE)
##
## > getSeries <- c("Year", "Economy", theSeries)
##
## > theAES <- aes_string(x = "Year", y = theSeries, group = "Economy",
## + colour = "Economy")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy %in% theEconomies) &
## + (MaddP.DF$Year >= theBegSmpl) & (MaddP.DF$Year <= theEndSmpl),
## + getSeries]
##
## > ggplot(data = this.DF, theAES) + geom_line(size = 2) +
## + myTStheme + ggtitle(thisTitle)
##
## > rm(this.DF, theAES, getSeries)
rm(thisTitle, theSeries, theEconomies, theEndSmpl, theBegSmpl)
A more useful perspective on the size of these cross-country differences comes from the levels of the series themselves, not their logs.
myTStheme <- theme_classic() +
theme(
plot.title=element_text(size=rel(1.5)),
legend.title=element_text(size=rel(1.5)),
legend.text=element_text(size=rel(1.5)),
legend.position=c(0.45,0.6), legend.justification=c(1,0),
axis.text=element_text(size=rel(1.5)),
axis.title=element_text(size=rel(1.5)),
axis.title.y=element_blank()
)
theBegSmpl <- 1950
theEndSmpl <- 2010
theEconomies <- c("USA", "England/GB/UK", "China")
theSeries <- "perCapitaGDP"
thisTitle <- "Per Capita GDP in constant 1990 Int. GK$"
source(file="./multplot.R", local=TRUE, echo=TRUE)
##
## > getSeries <- c("Year", "Economy", theSeries)
##
## > theAES <- aes_string(x = "Year", y = theSeries, group = "Economy",
## + colour = "Economy")
##
## > this.DF <- MaddP.DF[(MaddP.DF$Economy %in% theEconomies) &
## + (MaddP.DF$Year >= theBegSmpl) & (MaddP.DF$Year <= theEndSmpl),
## + getSeries]
##
## > ggplot(data = this.DF, theAES) + geom_line(size = 2) +
## + myTStheme + ggtitle(thisTitle)
##
## > rm(this.DF, theAES, getSeries)
rm(thisTitle, theSeries, theEconomies, theEndSmpl, theBegSmpl)
Remember, however, that this is per capita GDP and therefore obviously does not take into account the sizes of the different populations.
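If a population series is to hand, the per capita figures rescale into aggregate GDP. The Maddison Project does publish population counts, but the Population column below, and the numbers in it, are illustrative assumptions rather than values taken from MaddP.DF:

```r
# Toy stand-in for MaddP.DF with an assumed Population column; in real
# data, aggregate GDP is just per capita GDP times population.
toy.DF <- data.frame(
  Economy      = c("USA", "China"),
  Year         = c(2010L, 2010L),
  perCapitaGDP = c(30000, 8000),    # illustrative 1990 Int. GK$ values
  Population   = c(310e6, 1340e6)   # illustrative head counts
)
toy.DF$totalGDP <- toy.DF$perCapitaGDP * toy.DF$Population
toy.DF[, c("Economy", "totalGDP")]
```

In aggregate terms the gap between the two economies is then far narrower than the per capita comparison suggests.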
APPENDIX
This is the code chunk for plotting multiple series neatly:
read_chunk("./multplot.R")
getSeries <- c("Year", "Economy", theSeries)
theAES <- aes_string(x="Year", y=theSeries, group="Economy",
colour="Economy")
this.DF <-
MaddP.DF[(MaddP.DF$Economy %in% theEconomies) &
(MaddP.DF$Year >= theBegSmpl) & (MaddP.DF$Year <= theEndSmpl),
getSeries]
ggplot(data=this.DF, theAES) + geom_line(size=2) +
myTStheme + ggtitle(thisTitle)
rm(this.DF, theAES, getSeries)
This is the code chunk that implements the eyeballing-trend operation.
read_chunk("./eyetrend.R")
olsFIT <- lm(logPerCapGDP ~ Year,
data=MaddP.DF[(MaddP.DF$Economy == theEconomy) &
(MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <= theEndFit), ])
thisTitle <- paste0(theEconomy,
": log Per Capita GDP in constant 1990 Int. GK$")
this.DF <-
MaddP.DF[(MaddP.DF$Economy == theEconomy) &
(MaddP.DF$Year >= theBegFit) & (MaddP.DF$Year <= theEndSmp),
c("Year", "perCapitaGDP", "logPerCapGDP")]
ggplot(data=this.DF, aes(x=Year, y=logPerCapGDP)) + geom_line(size=2) +
geom_segment(data=this.DF, aes(x=theBegFit, xend=theEndFit,
y=coef(olsFIT)[1]+coef(olsFIT)[2]*theBegFit,
yend=coef(olsFIT)[1]+coef(olsFIT)[2]*theEndFit),
linetype=1, colour="blue", size=1.1) +
geom_segment(data=this.DF, aes(x=theEndFit+1, xend=theEndSmp,
y=coef(olsFIT)[1]+coef(olsFIT)[2]*(theEndFit+1),
yend=coef(olsFIT)[1]+coef(olsFIT)[2]*theEndSmp),
linetype=2, colour="blue", size=1.05) +
myTStheme + ggtitle(thisTitle)
#
thisTitle <- paste0(theEconomy,
": Per Capita GDP in constant 1990 Int. GK$")
expTrendFitted <- function(x) {ifelse (x>=theBegFit & x<=theEndFit,
exp(coef(olsFIT)[1] + (x * coef(olsFIT)[2])), NA)
}
expTrendExtrap <- function(x) {ifelse (x>=theEndFit+1 & x<=theEndSmp,
exp(coef(olsFIT)[1] + (x * coef(olsFIT)[2])), NA)
}
ggplot(data=this.DF, aes(x=Year, y=perCapitaGDP)) + geom_line(size=2) +
stat_function(data=this.DF, fun=expTrendFitted, linetype=1,
colour="blue", size=1.1) +
stat_function(data=this.DF, fun=expTrendExtrap, linetype=2,
colour="blue", size=1.05) +
myTStheme + ggtitle(thisTitle)
rm(olsFIT, expTrendFitted, expTrendExtrap)
Quickly Starting on R
Short notes upon (re) starting to use R (also here)
D. Quah
Economics and International Development, LSE
October 2014
This document is a short guide for someone starting R, or just coming back to it. The writeup is terse and does not seek to explain matters fully. Instead, it is intended as a quick reference for the reader to get going fast and then on to other work.
In this writeup you will:
- install R on your own computer (unless you have already done so);
- install RStudio (optional but useful);
- do one quick basic operation in R to make sure your system is now capable of running R.
MOTIVATION
I like to use R because:
- R is open source software, and is freely available; you don’t have to be logged into the LSE (or your work) network to use it; you can use it without being connected to the Internet; you can perform your research while on a long flight or on the beach; you can freely install and use R on as many machines as you like;
- a large community of scientists across many disciplines works with R regularly, posting online questions, answers, and experiences (e.g., “R vs Stata: .. Datasets”, “R – a second language”, and many others);
- R is a language besides being statistical software, so you can extend R to pretty much any application your mind can imagine;
- R encourages open science—literate programming and reproducible research—and thus makes convenient the replication of empirical findings;
- worldwide, network servers from Argentina and Colombia through Vietnam and New Zealand carry its latest versions;
- in poorer societies these features of R promote research and human capital accumulation, so public spending can then usefully go elsewhere rather than on costly licensing arrangements.
- R is constantly being improved.
Convenient summaries of R commands are available (e.g., the cheatsheet or the Wikibook) but, of course, won’t necessarily be the best way to start learning the software. Books on R (e.g., Michael Crawley 2012 or its earlier first edition) are similarly useful as references but, again, might not always be where someone should head first to get going quickly.
Instead, what I’ve found useful to start is simply to cut and paste from what other people have already written that is most closely related to what I want to do. I intend the exercises that follow to give you that kind of a base so you can then get going on your own research. There is nothing holy or admirable or morally uplifting about writing code from scratch when others have already done so. Our primary goal on this journey is to find out things about the world; aesthetics are secondary.
Before plunging in, some points that many first-time users might not routinely think about:
- To run lines of code, you have to be totally obsessive about getting things exactly right [sometimes you’re lucky and things work anyway even if you slip up, but it’s best not to rely on that].
- If something appears in quotes, i.e., like “[…]”, make sure you reproduce those quotes exactly: double quotes (") are different from single quotes ('), and the curly quotes that word processors produce will not work in R code at all. Use the right ones.
- If a name or a command is UPPER CASE or lower case, make sure that’s exactly how you type it. R distinguishes case.
- Sometimes R chatters back at you, with no action required back from you. Sometimes it tells you something you need to fix. Either way, pay attention to what it says, even if only to ignore it after you get the meaning.
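The case-sensitivity point, for instance, is easy to verify directly in the Console:

```r
# R distinguishes case: the built-in mean() exists, Mean() does not.
exists("mean")   # TRUE
exists("Mean")   # FALSE -- calling Mean(1:3) would stop with an error
mean(1:3)        # 2
```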
R, RStudio, and Perl
R is the core collection of routines for statistical computing, while RStudio provides a convenient front end to R. The way that I operate, RStudio works best for me. Others might prefer to engage with R directly, or use a different front-end environment to R.
For many things I do, it is convenient to have R draw on the added functionality of the (separate) Perl language. As one example, to read data from Excel spreadsheets, R uses Perl modules, previously written by others and made freely available.
Therefore, for some, R alone suffices; for me, I want all three.
(Others might wish to use R in tandem with yet other additional software. They’re free to do so.)
Install Perl
If you’re on Mac OS X or Linux, you can skip this: you already have Perl on your machine. If, however, you’re on Windows, point your Internet browser here and download and install Strawberry Perl for Windows.
Install R
For R, point your Internet browser at this landing page. This gives you information on R generally and shows you a link to download R. Go there and select from the list a CRAN Mirror nearest you. I chose the one at Imperial College but it doesn’t much matter: they all work the same way. If you’re reading this document from Beijing, say, you might of course want to choose a different CRAN Mirror. Once you’ve selected the mirror, choose “Download R for Windows”, “Download R for (Mac) OS X”, or “Download R for Linux”—depending on your system. Run that file to install R on your machine.
Install RStudio
Again, this is optional but RStudio provides a clean and convenient interface to R. Point your browser here and choose the version appropriate for your platform. The website guesses what you’re going to need and serves that up as its lead recommendation. If, however, the website gets it wrong, go ahead and choose what will work for you. Download and install.
(If you really want to get fancy, you can select the RStudio Server but that’s only if you run your own Linux server, in which case you likely shouldn’t even be reading this document.)
Running R
With a fresh R on your computer, depending on what you want to do, you will need to install some libraries first—but you’ll only need to do this once. Here, we want to read data from an Excel spreadsheet into R, so we need to augment R with some libraries.
Fire up RStudio and go into “Tools/Install Packages” (i.e., mouse over to the “Tools” menu item, click on it, and activate the “Install Packages…” entry). You’ll see that some defaults have already been filled in: if you know you want alternative values to these defaults, go ahead and plug in those values; otherwise, just leave them. Type gdata into “Packages (separate multiple…)”. Make sure “Install dependencies” is activated, and then press “Install”. This gdata library is what will let R read data in Excel spreadsheets. To install this library, your machine needs to be online, as R will reach out to its servers, wherever they are, to retrieve this code and then install the library on your machine.
In RStudio (if you’re there still, or if you’ve just come back to this, start up RStudio and then) set your working directory by “Session/Set Working Directory…/Choose Directory”. This is one way to do it; alternatively, you can hit return after keying into the panel labelled Console the line of R code:
setwd("~/Dropbox/1/t/courses/ec402/2013t14/w")
where, obviously, you substitute for the phrase in green your own working directory. The tilde ~ denotes your home directory, wherever that might be. (Mine is either C:/Users/DQUAH or /home/dquah, depending on whether I happen to be using Windows or Linux right at that moment. The nice thing about using the tilde is that my code then works the same regardless where I am.) Alternatively, you can copy and paste that preceding line into your own RStudio Console, edit the relevant clause with your keyboard, and then hit return. You can do this for any of the chunks of R code that follow.
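To check what the tilde stands for on your own machine, ask R directly (nothing here is specific to my setup):

```r
path.expand("~")                   # what ~ resolves to on this machine
getwd()                            # the current working directory
# Build paths portably instead of pasting slashes by hand:
file.path("~", "Dropbox", "work")  # "~/Dropbox/work"
```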
To make sure you’ve got things under control, save this: mouse to “File/New File/R Script”, copy the one line of code we’ve just executed into the newly-appeared top left-hand window in RStudio (that new window will typically be called “Untitled1”), and then go “File/Save As…” in RStudio. I’m saving this as the R script e1.R.
This will be a first R program, containing just the one setwd() line. I know I’m going to want to be adding to this R program to do my analysis. But for now I just want to make sure, if I can help it, that my work doesn’t go away unexpectedly.
If you look at this working directory now on your machine, you’ll see it has at least the file e1.R, or whatever you decided to call your R script. You can take that peek using Windows Explorer or a bash Terminal or the Finder… whatever. You can also get this same information from within RStudio by hitting return after keying into the Console (i.e., by executing the line):
dir()
You should see a listing of the directory that you’ve setwd’d to, including at least the file e1.R (and whatever else might be there). So, perhaps something like this:
[1] "e1.R" "WB-GDP-cleaned-DQ.xls"
(where WB-GDP-cleaned-DQ.xls is the Excel spreadsheet with which I happen to be working.)
You can execute R code by keying it directly into the RStudio Console or, more typically, by opening an R script (such as e1.R — which is just plain text that you can edit in any text editor) from RStudio, making sure your RStudio focus is on that R script panel, and then going “Code/Run Region/Run All”. When you do the latter, you’ll see the RStudio Console automatically stepping through your code.
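The menu route and the Console route come to the same thing: from the Console, source() runs every line of a script, just as “Code/Run Region/Run All” does. A sketch with a throwaway script:

```r
# Write a one-line script to a temporary file, then run the whole file.
scriptFile <- tempfile(fileext = ".R")
writeLines("ans <- 6 * 7", scriptFile)
source(scriptFile, echo = TRUE)   # echo=TRUE prints each line as it runs
ans                               # 42
```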
Now go get a drink, stretch your legs, do some taiji.
DATA IN DATAFRAMES
The key object that we will use to hold data is what R calls a dataframe.
A dataframe is a 2-dimensional array but like most modern things on computers, a dataframe can hold text, numbers, items of logic, calendar dates (i.e., not just as numbers but recognising the structure of quarters, months, and days), and possibly even more complicated objects as its entries, all freely intermingled. (Matrices of just numbers are very last-century.)
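A quick sketch of that intermingling, with text, integers, logicals, and proper calendar dates side by side in one dataframe:

```r
mixed.DF <- data.frame(
  Economy  = c("USA", "China"),                        # text
  Year     = c(2010L, 1978L),                          # integers
  inOECD   = c(TRUE, FALSE),                           # logicals
  asOfDate = as.Date(c("2010-12-31", "1978-12-18")),   # calendar dates
  stringsAsFactors = FALSE                             # keep text as text
)
str(mixed.DF)   # each column keeps its own type
```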
Among other reasons the dataframe is key for our work is that a dataframe is what R builds when it reads in an Excel spreadsheet. So, for instance, if we have a spreadsheet 2014.01-Poverty+Growth-DQ.xlsx in the folder ~/Dropbox/1/j/data/Global-Distribution, we can read the data in it, in its different sheets, into different dataframes:
library(gdata)
## Warning: package 'gdata' was built under R version 3.1.1
setwd("~/Dropbox/1/j/data/Global-Distribution/")
theDataXLS <- "2014.01-Poverty+Growth-DQ.xlsx"
Country.Info.DF <- read.xls(theDataXLS, sheet="Country-Info")
World.Pov.DF <- read.xls(theDataXLS, sheet="WB-Pov")
World.GNI.DF <- read.xls(theDataXLS, sheet="WB-GNI-pc")
From the code just run and the earlier chunk, you’ll notice that I can use spreadsheets saved in either “.xls” or “.xlsx” formats: the code in gdata takes into account which of the two I happen to be using when I call read.xls().
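The dispatch is driven by the file extension, which base R’s tools package can show directly (the filenames are just the examples above):

```r
library(tools)
file_ext("2014.01-Poverty+Growth-DQ.xlsx")   # "xlsx"
file_ext("WB-GDP-cleaned-DQ.xls")            # "xls"
# read.xls() in gdata handles either, converting via Perl behind the scenes.
```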
Unlike, say, computer systems that need a specific file extension to tell them what kind of file is being used, R doesn’t care what I name the objects I create within it. Nonetheless, although of course you don’t have to do this, I like putting “.DF” at the ends of the names of my dataframes, as doing so helps me remember what they are.
Also, because I often need to read the R code I’ve written and to understand its logic quickly, I’m a little obsessive about how my code is formatted. So, in the preceding I’ve lined up the assignment <- symbols. Again, not everyone needs to do this and most of the time R simply doesn’t care how its code looks.
CONCLUSION
This document has provided brief notes as a quick guide for someone starting to use R (or coming back to it).
Modern computing platforms and R allow multiple pathways to achieve any given end goal. The setup in this document prepares a system that ends up looking like mine; others, however, might prefer a different organizational structure for their work.
REFERENCES
- R Cheatsheet
- R Programming Wikibook
- Chang, Winston. Cookbook for R
- Crawley, Michael. 2012. The R Book or its earlier first edition
- Kabacoff, Rob. 2012. Quick-R
- The R Manuals
- R Tutorial: An R Introduction to Statistics
- R Tutorial: Introduction
- R for Econometrics
