5  Reshaping data

The analysis of your data is usually not the most time-consuming part of your project; data wrangling is. Your data must be cleaned before it can be analyzed. This means having a tidy dataset, where each variable is a column, each observation is a row, and each cell contains a single value. Sometimes, the datasets you find are not tidy, and you will need to reshape them. In other cases, you may encounter datasets organized differently, representing the same data in a less useful format for analysis. Here, we focus on how to combine and reshape different datasets to use them properly. I will use polling data from various political parties.

library(tidyverse)

german_polls <- read_csv("data/german_polls.csv")

5.1 Reshaping with pivot_longer() and pivot_wider()

Datasets can be in long (many rows, few columns) or wide formats (few rows, many columns). Depending on our unit of analysis, we can reshape our datasets into different formats. In the german_polls dataset, the vote intentions for each party are stored in different columns. However, we could also structure the data with one column for parties and another for voting intention.

german_long <- german_polls |>
  pivot_longer(
    # Select which columns to pivot
    cols = c(Union, FDP, LINKE, SPD, PIRATEN, AfD, GRUENE),
    # Choose a name for the new column with all of the parties
    names_to = "party",
    # Choose a name for the new column with the vote intentions
    values_to = "intention"
  )

german_long
# A tibble: 24,115 × 6
   date       firm      date_from  sample_size party   intention
   <date>     <chr>     <date>           <dbl> <chr>       <dbl>
 1 2005-09-22 FG Wahlen 2005-09-20        1345 Union          37
 2 2005-09-22 FG Wahlen 2005-09-20        1345 FDP             8
 3 2005-09-22 FG Wahlen 2005-09-20        1345 LINKE           8
 4 2005-09-22 FG Wahlen 2005-09-20        1345 SPD            35
 5 2005-09-22 FG Wahlen 2005-09-20        1345 PIRATEN        NA
 6 2005-09-22 FG Wahlen 2005-09-20        1345 AfD            NA
 7 2005-09-22 FG Wahlen 2005-09-20        1345 GRUENE          8
 8 2005-10-06 FG Wahlen 2005-10-04        1259 Union          39
 9 2005-10-06 FG Wahlen 2005-10-04        1259 FDP             7
10 2005-10-06 FG Wahlen 2005-10-04        1259 LINKE           9
# ℹ 24,105 more rows
german_long |> 
  group_by(date, party) |> 
  summarise(intention = mean(intention, na.rm = T))
`summarise()` has grouped output by 'date'. You can override using the
`.groups` argument.
# A tibble: 18,788 × 3
# Groups:   date [2,684]
   date       party   intention
   <date>     <chr>       <dbl>
 1 2005-09-22 AfD           NaN
 2 2005-09-22 FDP             8
 3 2005-09-22 GRUENE          8
 4 2005-09-22 LINKE           8
 5 2005-09-22 PIRATEN       NaN
 6 2005-09-22 SPD            35
 7 2005-09-22 Union          37
 8 2005-10-06 AfD           NaN
 9 2005-10-06 FDP             7
10 2005-10-06 GRUENE          7
# ℹ 18,778 more rows

We can also reshape the dataset to put it back on wide format.

german_wide <- german_long |> 
  pivot_wider(
    names_from = party,
    values_from = intention
  ) |> 
  unnest()

german_wide
# A tibble: 3,445 × 11
   date       firm  date_from  sample_size Union   FDP LINKE   SPD PIRATEN   AfD
   <date>     <chr> <date>           <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>
 1 2005-09-22 FG W… 2005-09-20        1345    37     8     8    35      NA    NA
 2 2005-10-06 FG W… 2005-10-04        1259    39     7     9    34      NA    NA
 3 2005-10-13 FG W… 2005-10-11        1280    38     8     8    35      NA    NA
 4 2005-10-27 FG W… 2005-10-25        1269    37     9     8    35      NA    NA
 5 2005-11-10 FG W… 2005-11-08        1230    36     9     9    33      NA    NA
 6 2005-11-24 FG W… 2005-11-22        1298    37     9     8    34      NA    NA
 7 2005-12-08 FG W… 2005-12-06        1237    38     9     8    34      NA    NA
 8 2006-01-12 FG W… 2006-01-10        1249    39     9     8    33      NA    NA
 9 2006-01-26 FG W… 2006-01-24        1279    41     8     8    33      NA    NA
10 2006-02-16 FG W… 2006-02-14        1260    41     8     7    32      NA    NA
# ℹ 3,435 more rows
# ℹ 1 more variable: GRUENE <dbl>

5.2 Combining with bind_rows

In many cases, you want to combine datasets having the same structure and mostly the same variables but different observations (rows). It may be the same data from different years, different countries etc. Here, I want to combine the data I have on Germany with similar data on spanish polls. To do so, I first, import this dataset and convert it also to a long format.

spain_polls <- read_csv("data/spanish_polls.csv")
Rows: 1414 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (1): firm
dbl  (19): PP, PSOE, ERC, PNVEAJ, CC, BNG, Cs, VOX, Podemos, EHBildu, PACMA,...
date  (2): date, date_from

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spain_long <- spain_polls |>
  pivot_longer(
    # Select which columns NOT to pivot
    cols = -c(date, firm, date_from, sample_size),
    # Choose a name for the new column with all of the parties
    names_to = "party",
    # Choose a name for the new column with the vote intentions
    values_to = "intention"
  )

spain_long
# A tibble: 25,452 × 6
   date       firm        date_from  sample_size party   intention
   <date>     <chr>       <date>           <dbl> <chr>       <dbl>
 1 2011-12-15 Metroscopia 2011-12-14        1000 PP           44.9
 2 2011-12-15 Metroscopia 2011-12-14        1000 PSOE         28.4
 3 2011-12-15 Metroscopia 2011-12-14        1000 ERC          NA  
 4 2011-12-15 Metroscopia 2011-12-14        1000 PNVEAJ       NA  
 5 2011-12-15 Metroscopia 2011-12-14        1000 CC           NA  
 6 2011-12-15 Metroscopia 2011-12-14        1000 BNG          NA  
 7 2011-12-15 Metroscopia 2011-12-14        1000 Cs           NA  
 8 2011-12-15 Metroscopia 2011-12-14        1000 VOX          NA  
 9 2011-12-15 Metroscopia 2011-12-14        1000 Podemos      NA  
10 2011-12-15 Metroscopia 2011-12-14        1000 EHBildu      NA  
# ℹ 25,442 more rows

My goal now is to combine them together. But before, I add a country variable on the two datasets to keep track of which country the data is from.

spain_long <- spain_long |> mutate(country = "Spain")
german_long <- german_long |> mutate(country = "Germany")

Now, I combine the two datasets with the bind_rows function that comes from the dplyr package. The functions allows you to combine two datasets with the same structure by stacking them on top of each other. The resulting dataset will have all the rows from the first dataset, followed by all the rows from the second dataset. The datasets will be combined by columns, so the columns must have the same names and types.

polls <- bind_rows(spain_long, german_long)