4 Manipulating data

At this point, you can set up an R project, import a data file into RStudio, and quickly explore the dataset’s structure, which is typically the first step in any analysis. If the dataset is well-structured, the next crucial step is to manipulate the data in R to obtain the information we need. Often, we are not interested in the entire dataset; instead, we focus on selecting specific variables, analyzing certain groups, and observing how these variables vary across those groups

With R, there are several ways to perform these operations. Firstly, you can use R’s base functions, without having to load a specific package. An alternative approach is based on the tidyverse, the package suite developed by Hadley Wickham with a special syntax. Here, we’ll concentrate on the tidyverse functions, which I feel are easier and more intuitive to use.

4.1 Introduction to dplyr functions

To manipulate data, we will use a package from the tidyverse called dplyr, comprising a comprehensive suite of exceptionally useful functions. To familiarize ourselves with its usage, we will explore the Quality of Government Environmental Indicators dataset. To do this, we will need first to load the tidyverse and import the data with the haven package.

# Load the tidyverse

library(tidyverse)

── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0.9000 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(haven)

qog <- read_dta("data/qog_env.dta") 

head(qog)

# A tibble: 6 × 414
  cname       ccode  year cname_qog   ccode_qog ccodealp ccodealp_year ccodecow
  <chr>       <dbl> <dbl> <chr>           <dbl> <chr>    <chr>            <dbl>
1 Afghanistan     4  1946 Afghanistan         4 AFG      AFG46              700
2 Afghanistan     4  1947 Afghanistan         4 AFG      AFG47              700
3 Afghanistan     4  1948 Afghanistan         4 AFG      AFG48              700
4 Afghanistan     4  1949 Afghanistan         4 AFG      AFG49              700
5 Afghanistan     4  1950 Afghanistan         4 AFG      AFG50              700
6 Afghanistan     4  1951 Afghanistan         4 AFG      AFG51              700
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
#   act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
#   ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
#   ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
#   ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
#   ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>,
#   edgar_ch4 <dbl>, edgar_co <dbl>, edgar_co2gdp <dbl>, edgar_co2pc <dbl>, …

4.1.1 Count the observations of groups with `count()`

One basic operation in data manipulation is to count the number of observations of different values in a variable. The dplyr::count() function is used for this purpose.

In Exercise 1, we observed that the qog dataset has a country-year structure. For each country, there is a series of indicators, with one value per year. The dataset includes a year variable and a cname variable representing the country.

We might want to know how many country observations we have for each year. To do this, we can use count() with the year variable, that will give us the number of observations for each year. If we look at the results, we see that the dataset has observations from 1946 to 2020 with an increasing number of countries in the dataset over time.

count(qog, year)

# A tibble: 75 × 2
    year     n
   <dbl> <int>
 1  1946    75
 2  1947    76
 3  1948    82
 4  1949    85
 5  1950    88
 6  1951    88
 7  1952    89
 8  1953    89
 9  1954    91
10  1955    94
# ℹ 65 more rows

table(qog$year) # Alternative way to do it


1946 1947 1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 
  75   76   82   85   88   88   89   89   91   94   96   97   97   98  103  118 
1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 
 119  125  127  131  135  138  141  143  144  146  149  149  151  154  160  160 
1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 
 160  165  166  167  168  168  170  170  170  172  172  172  172  173  189  192 
1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 
 192  193  193  193  193  193  193  193  194  194  194  194  195  195  195  195 
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 
 195  195  195  195  195  194  194  194  194  194  194

If we want to cross the year and cname variables, we can use the count() function with both variables. This will give us the number of observations for each year and country. In this case, we see that we have only one observation for each country-year combination.

count(qog, year, cname)

# A tibble: 11,722 × 3
    year cname                                n
   <dbl> <chr>                            <int>
 1  1946 Afghanistan                          1
 2  1946 Albania                              1
 3  1946 Andorra                              1
 4  1946 Antigua and Barbuda                  1
 5  1946 Argentina                            1
 6  1946 Australia                            1
 7  1946 Belgium                              1
 8  1946 Bhutan                               1
 9  1946 Bolivia (Plurinational State of)     1
10  1946 Brazil                               1
# ℹ 11,712 more rows

4.1.2 `filter()` observations

We may want to use only a subset of countries and get a smaller versions of the dataset. And the tidyverse has one for this, which is called filter() We want to select specific rows/observations of the dataset based on a specific conditions. To filter, we will use extensively boolean operators.

Operator	Description
`==`	equal
`!=`	not equal
`<`	less than
`<=`	less than or equal
`>`	greater than
`>=`	greater than or equal
`&`	and
`\|`	or
`!`	not
`%in%`	in the set

For instance, I might be only interested in environmental indicators for France. In that case I could filter the whole dataset with only the observations that has France as country variable.

filter(qog, cname == "France")

# A tibble: 75 × 414
   cname  ccode  year cname_qog      ccode_qog ccodealp ccodealp_year ccodecow
   <chr>  <dbl> <dbl> <chr>              <dbl> <chr>    <chr>            <dbl>
 1 France   250  1963 France (1963-)       250 FRA      FRA63              220
 2 France   250  1964 France (1963-)       250 FRA      FRA64              220
 3 France   250  1965 France (1963-)       250 FRA      FRA65              220
 4 France   250  1966 France (1963-)       250 FRA      FRA66              220
 5 France   250  1967 France (1963-)       250 FRA      FRA67              220
 6 France   250  1968 France (1963-)       250 FRA      FRA68              220
 7 France   250  1969 France (1963-)       250 FRA      FRA69              220
 8 France   250  1970 France (1963-)       250 FRA      FRA70              220
 9 France   250  1971 France (1963-)       250 FRA      FRA71              220
10 France   250  1972 France (1963-)       250 FRA      FRA72              220
# ℹ 65 more rows
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
#   act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
#   ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
#   ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
#   ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
#   ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>, …

# If I want to keep all the countries except France
filter(qog, cname != "France")

# A tibble: 11,647 × 414
   cname       ccode  year cname_qog   ccode_qog ccodealp ccodealp_year ccodecow
   <chr>       <dbl> <dbl> <chr>           <dbl> <chr>    <chr>            <dbl>
 1 Afghanistan     4  1946 Afghanistan         4 AFG      AFG46              700
 2 Afghanistan     4  1947 Afghanistan         4 AFG      AFG47              700
 3 Afghanistan     4  1948 Afghanistan         4 AFG      AFG48              700
 4 Afghanistan     4  1949 Afghanistan         4 AFG      AFG49              700
 5 Afghanistan     4  1950 Afghanistan         4 AFG      AFG50              700
 6 Afghanistan     4  1951 Afghanistan         4 AFG      AFG51              700
 7 Afghanistan     4  1952 Afghanistan         4 AFG      AFG52              700
 8 Afghanistan     4  1953 Afghanistan         4 AFG      AFG53              700
 9 Afghanistan     4  1954 Afghanistan         4 AFG      AFG54              700
10 Afghanistan     4  1955 Afghanistan         4 AFG      AFG55              700
# ℹ 11,637 more rows
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
#   act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
#   ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
#   ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
#   ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
#   ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>, …

# Filter for multiple countries
filter(qog, cname %in% c("France", "Norway", "Spain", "Denmark"))

# A tibble: 300 × 414
   cname   ccode  year cname_qog ccode_qog ccodealp ccodealp_year ccodecow
   <chr>   <dbl> <dbl> <chr>         <dbl> <chr>    <chr>            <dbl>
 1 Denmark   208  1946 Denmark         208 DNK      DNK46              390
 2 Denmark   208  1947 Denmark         208 DNK      DNK47              390
 3 Denmark   208  1948 Denmark         208 DNK      DNK48              390
 4 Denmark   208  1949 Denmark         208 DNK      DNK49              390
 5 Denmark   208  1950 Denmark         208 DNK      DNK50              390
 6 Denmark   208  1951 Denmark         208 DNK      DNK51              390
 7 Denmark   208  1952 Denmark         208 DNK      DNK52              390
 8 Denmark   208  1953 Denmark         208 DNK      DNK53              390
 9 Denmark   208  1954 Denmark         208 DNK      DNK54              390
10 Denmark   208  1955 Denmark         208 DNK      DNK55              390
# ℹ 290 more rows
# ℹ 406 more variables: ccodevdem <dbl>, cname_year <chr>, version <chr>,
#   act_act <dbl>, as_rifr <dbl>, as_ws <dbl>, bti_envc <dbl>, ccci_coop <dbl>,
#   ccci_em <dbl>, ccci_fin <dbl>, ccci_kyoto <dbl>, ccci_rep <dbl>,
#   ccci_unfccc <dbl>, cckp_rain <dbl>, cckp_temp <dbl>, ccl_exepp <dbl>,
#   ccl_leglp <dbl>, ccl_lpp <dbl>, ccl_mitlpp <dbl>, ccl_nexep <dbl>,
#   ccl_nlegl <dbl>, ccl_nlp <dbl>, ccl_nmitlp <dbl>, edgar_bc <dbl>, …

# Filter for countries and year and save in a new object ! 
qog_subset <- filter(qog, cname %in% c("France", "Norway", "Spain", "Denmark"), year > 1990)

count(qog_subset, cname)

# A tibble: 4 × 2
  cname       n
  <chr>   <int>
1 Denmark    30
2 France     30
3 Norway     30
4 Spain      30

Often however, we need several operations to be done together. To do so with the tidyverse functions, we can use something that is called the pipe : |>. You will quickly understand what the pipe is, we use it to chain instructions just as in a recipe. Depending on the version of R, you might also use/see this %>%. The shortcut for it is Ctrl + Shift + M or Cmd + Shift + M on Mac.

Let’s say we want to count the number of observations we have on the variable wvs_pedp that measure the share of the population participating in environmental protests for Sweden. We can use the pipe to do it by chaining the filter() and count() functions rather than doing it separately in different steps.

qog |> 
  # Keep only countries 
  filter(cname == "Sweden") |> 
  # Count the number of observations for each country
  count(wvs_pedp) # Use Sort = TRUE if you want to sort the results

# A tibble: 2 × 2
  wvs_pedp     n
     <dbl> <int>
1     5.22     1
2    NA       74

4.1.3 Your turn

Count the countries and year where the share of the population participating in environmental protests was greater than 10 (measured by the wvs_pedp variable).

4.2 `select()` variables

While filter() is useful for keeping only certain groups of rows, select() is used, as its name suggests, to select certain variables (columns) from our dataframe.

Let’s say I’m interested in CO2 emissions per capita, their overall level and how they vary over time and by country. For this, I would need only three variables that are present in the dataset: year, cname and the variable wdi_co2 which comes from the World Bank’s World Development Indicators. To do this, I use select() and just specify which variables of the dataset I want to select.

qog_subset |> 
  select(cname, year,  wdi_co2)

# A tibble: 120 × 3
   cname    year wdi_co2
   <chr>   <dbl>   <dbl>
 1 Denmark  1991   11.7 
 2 Denmark  1992   10.5 
 3 Denmark  1993   11.0 
 4 Denmark  1994   11.7 
 5 Denmark  1995   10.9 
 6 Denmark  1996   13.7 
 7 Denmark  1997   11.6 
 8 Denmark  1998   11.2 
 9 Denmark  1999   10.4 
10 Denmark  2000    9.61
# ℹ 110 more rows

4.3 Calculating statistics by group with `group_by()` and `summarise()`

To compute a first set of descriptive statistics, we could look at the central indicators of the co2 emissions variable (wdi_co2).

mean(qog_subset$wdi_co2, na.rm = TRUE) # Compute the mean

[1] 7.666003

median(qog_subset$wdi_co2, na.rm = TRUE) # Compute the median

[1] 7.496984

sd(qog_subset$wdi_co2, na.rm = TRUE) # Compute the standard deviation

[1] 2.003383

However, we are often interested in how these indicators vary across groups such as country or year. To do so, use the summarise() function in R that allows you to compute new variables by groups. To choose grouping variable, we use first group_by() where we specify for which group we want to compute something. Here I use this function to compute the mean, the median, the standard deviation and the first and third quartile of co2 emissions per capita for each country. This gives us a new tibble with all of the summary information.

qog_stats <- qog |> 
   group_by(cname) |> 
  summarise(mean_co2 = mean(wdi_co2, na.rm = TRUE), 
            median_co2 = median(wdi_co2, na.rm = TRUE),
            sd_co2 = sd(wdi_co2, na.rm = TRUE))

qog_stats

# A tibble: 204 × 4
   cname               mean_co2 median_co2 sd_co2
   <chr>                  <dbl>      <dbl>  <dbl>
 1 Afghanistan            0.148      0.135 0.0892
 2 Albania                1.65       1.56  0.649 
 3 Algeria                2.62       2.98  0.981 
 4 Andorra                6.82       6.75  0.700 
 5 Angola                 0.754      0.582 0.354 
 6 Antigua and Barbuda    4.71       4.61  0.986 
 7 Argentina              3.69       3.67  0.615 
 8 Armenia                1.37       1.46  0.386 
 9 Australia             14.6       15.4   2.71  
10 Austria                7.19       7.39  1.08  
# ℹ 194 more rows

It is also possible to sort the results to gain a clearer idea of which countries have the highest average CO2 emissions per capita. To achieve this, we utilize the arrange() function, specifying the variable by which we intend to sort.

qog_stats |> 
  arrange(-median_co2)

# A tibble: 204 × 4
   cname                    mean_co2 median_co2 sd_co2
   <chr>                       <dbl>      <dbl>  <dbl>
 1 Qatar                        51.5       54.6  14.8 
 2 United Arab Emirates         33.6       29.2  14.0 
 3 Kuwait                       28.1       27.4  12.9 
 4 Luxembourg                   27.3       25.0   7.41
 5 Bahrain                      23.9       23.7   2.71
 6 United States of America     19.0       19.3   1.76
 7 Brunei Darussalam            17.4       16.5   4.65
 8 Canada                       15.8       16.2   1.84
 9 Australia                    14.6       15.4   2.71
10 Trinidad and Tobago          17.8       15.3   9.86
# ℹ 194 more rows

qog_stats |> 
  arrange(mean_co2)

# A tibble: 204 × 4
   cname                    mean_co2 median_co2 sd_co2
   <chr>                       <dbl>      <dbl>  <dbl>
 1 Burundi                    0.0324     0.0354 0.0112
 2 Chad                       0.0516     0.0464 0.0186
 3 Ethiopia                   0.0574     0.0546 0.0264
 4 Rwanda                     0.0624     0.0645 0.0298
 5 Central African Republic   0.0678     0.0643 0.0154
 6 Mali                       0.0679     0.0523 0.0402
 7 Niger                      0.0722     0.0637 0.0339
 8 Burkina Faso               0.0725     0.0662 0.0447
 9 Somalia                    0.0762     0.0620 0.0346
10 Uganda                     0.0797     0.0632 0.0353
# ℹ 194 more rows

4.4 Your turn

When were the CO2 emissions per capita the lowest in Saudi Arabia?

4.4.1 Mutate

We can also create new variables based on other ones with mutate(). Let’s say I want to compute the growth rate of co2 emissions per capita every year. For this, I will use mutate() to create new variables.

qog |> 
  select(cname, year, wdi_fossil) |> 
  mutate(dataset = "QOG dataset")

# A tibble: 11,722 × 4
   cname        year wdi_fossil dataset    
   <chr>       <dbl>      <dbl> <chr>      
 1 Afghanistan  1946         NA QOG dataset
 2 Afghanistan  1947         NA QOG dataset
 3 Afghanistan  1948         NA QOG dataset
 4 Afghanistan  1949         NA QOG dataset
 5 Afghanistan  1950         NA QOG dataset
 6 Afghanistan  1951         NA QOG dataset
 7 Afghanistan  1952         NA QOG dataset
 8 Afghanistan  1953         NA QOG dataset
 9 Afghanistan  1954         NA QOG dataset
10 Afghanistan  1955         NA QOG dataset
# ℹ 11,712 more rows

To compute the growth rate, we need for each year the emissions per capita of that year and those from the previous year, which I access with lag().

qog_co2_growth <- qog |>
  filter(cname == "France") |>
  select(cname, year, wdi_co2) |> 
  mutate(co2_lag = lag(wdi_co2),
        co2_growth = (wdi_co2 - co2_lag) / co2_lag*100)

qog_co2_growth

# A tibble: 75 × 5
   cname   year wdi_co2 co2_lag co2_growth
   <chr>  <dbl>   <dbl>   <dbl>      <dbl>
 1 France  1963    6.88   NA        NA    
 2 France  1964    7.01    6.88      2.02 
 3 France  1965    7.06    7.01      0.695
 4 France  1966    6.90    7.06     -2.32 
 5 France  1967    7.33    6.90      6.19 
 6 France  1968    7.52    7.33      2.59 
 7 France  1969    8.01    7.52      6.64 
 8 France  1970    8.45    8.01      5.42 
 9 France  1971    8.83    8.45      4.50 
10 France  1972    9.11    8.83      3.13 
# ℹ 65 more rows

4.5 Recode values with `case_when`

Mutate() is also widely used to recode variable. Here we create a new trajectory variable and we use case_when() to recode whether there co2_emissions are growing or not for a given year.

qog_co2_growth <- qog_co2_growth |> 
  filter(cname == "France") |>
  mutate(
    trajectory = case_when(
      co2_growth > 0 ~ "Bad",
      co2_growth < 0 ~ "Good"
    )
  )
qog_co2_growth

# A tibble: 75 × 6
   cname   year wdi_co2 co2_lag co2_growth trajectory
   <chr>  <dbl>   <dbl>   <dbl>      <dbl> <chr>     
 1 France  1963    6.88   NA        NA     <NA>      
 2 France  1964    7.01    6.88      2.02  Bad       
 3 France  1965    7.06    7.01      0.695 Bad       
 4 France  1966    6.90    7.06     -2.32  Good      
 5 France  1967    7.33    6.90      6.19  Bad       
 6 France  1968    7.52    7.33      2.59  Bad       
 7 France  1969    8.01    7.52      6.64  Bad       
 8 France  1970    8.45    8.01      5.42  Bad       
 9 France  1971    8.83    8.45      4.50  Bad       
10 France  1972    9.11    8.83      3.13  Bad       
# ℹ 65 more rows

qog_co2_growth |> 
  count(trajectory)

# A tibble: 3 × 2
  trajectory     n
  <chr>      <int>
1 Bad           22
2 Good          31
3 <NA>          22

4.5.1 Your turn !

Create a century variable that indicates whether the year is in the 20th or 21st century. Find the countries that have the highest total number of deaths due to natural disasters for each century, as measured by the emdat_ndeath variable.

4.1 Introduction to dplyr functions

4.1.1 Count the observations of groups with count()

4.1.2 filter() observations