Introduction to Statistical Computing - dplyr

👨‍💻 Eugene Hickey @ Atlantic Technological University 👨‍💻

  • eugene.hickey@tudublin.ie
  • @eugene100hickey
  • github.com/eugene100hickey
  • www.fizzics.ie

Week Three - Manipulating Data

  • Once we get our data, we have to do stuff with it

  • Often means recasting the shape of our data

  • The dplyr library is a key player here

  • six verbs

    • filter() chooses some of the rows

    • select() chooses some of the columns

    • mutate() makes new columns

    • arrange() sorts the rows by some values

    • group_by() puts rows together

    • summarise() makes new rows based on group_by()

The Pipe - %>%

  • neat feature, takes a bit of getting used to
  • but makes life way simpler and code more readable
  • chains operations together
  • read it in your head as “and then…”
  • alternatives are, well, ugly. And unforgiving.
  • also can look like this: |> or | >
  • keyboard shortcut Ctrl plus SHFT plus m

The Pipe - %>% continued

  • we’ve used it extensively so far, and can go even further
  • basic idea is that you send the output from one line into the first (unnamed) argument of the next
  • keyboard shortcut, one worth knowing, is Ctrl + Shift + M
  • if you’re really interested, other pipes such as the T-pipe (%T%) and the dollar pipe (%$%)

filter

  • used to choose some rows from a dataframe
  • pass in the dataframe and some logical condition
    • could be == (note the double equals), for characters and <, > for numerics
    • also <=, >=, ! (for NOT)
    • can use between()
    • I like %in%
  • can have multiple logical conditions in the same filter
# you'll need the libraries: tidyverse, gapminder, and knitr
gapminder::gapminder %>% 
  filter(continent == "Africa", 
         between(pop, 20e6, 50e6), 
         year %in% c(1952:1964)) %>% 
  gt::gt()
country continent year lifeExp pop gdpPercap
Egypt Africa 1952 41.893 22223309 1418.8224
Egypt Africa 1957 44.444 25009741 1458.9153
Egypt Africa 1962 46.992 28173309 1693.3359
Ethiopia Africa 1952 34.078 20860941 362.1463
Ethiopia Africa 1957 36.667 22815614 378.9042
Ethiopia Africa 1962 40.059 25145372 419.4564
Nigeria Africa 1952 36.324 33119096 1077.2819
Nigeria Africa 1957 37.802 37173340 1100.5926
Nigeria Africa 1962 39.360 41871351 1150.9275

@Allison_Horst

select()

  • used to pick out columns from a dataframe
  • pass in the dataframe and one or more columns
    • can deselect columns with a minus sign
gapminder::gapminder %>% 
  select(country, year, lifeExp) %>% 
  head() %>% 
  gt::gt()
country year lifeExp
Afghanistan 1952 28.801
Afghanistan 1957 30.332
Afghanistan 1962 31.997
Afghanistan 1967 34.020
Afghanistan 1972 36.088
Afghanistan 1977 38.438
gapminder::gapminder %>% 
  select(-c(continent, pop)) %>% 
  head() %>% gt::gt()
country year lifeExp gdpPercap
Afghanistan 1952 28.801 779.4453
Afghanistan 1957 30.332 820.8530
Afghanistan 1962 31.997 853.1007
Afghanistan 1967 34.020 836.1971
Afghanistan 1972 36.088 739.9811
Afghanistan 1977 38.438 786.1134

mutate

  • makes new columns
  • same number of rows
  • pass in dataframe and instructions

@Allison_Horst
gapminder::gapminder %>% 
  select(-continent, -lifeExp) |> 
  mutate(total_gdp_billions = pop * gdpPercap / 1e9) %>% 
  head() %>% gt::gt()
country year pop gdpPercap total_gdp_billions
Afghanistan 1952 8425333 779.4453 6.567086
Afghanistan 1957 9240934 820.8530 7.585449
Afghanistan 1962 10267083 853.1007 8.758856
Afghanistan 1967 11537966 836.1971 9.648014
Afghanistan 1972 13079460 739.9811 9.678553
Afghanistan 1977 14880372 786.1134 11.697659

arrange

  • used to order columns
    • normally increasing, use desc() to reverse
gapminder::gapminder %>% 
  select(country, year, pop) %>% 
  arrange(pop) %>% 
  head() %>% gt::gt()
country year pop
Sao Tome and Principe 1952 60011
Sao Tome and Principe 1957 61325
Djibouti 1952 63149
Sao Tome and Principe 1962 65345
Sao Tome and Principe 1967 70787
Djibouti 1957 71851
gapminder::gapminder %>% 
  select(country, year, pop) %>% 
  arrange(desc(pop)) %>% 
  head() %>% gt::gt()
country year pop
China 2007 1318683096
China 2002 1280400000
China 1997 1230075000
China 1992 1164970000
India 2007 1110396331
China 1987 1084035000

group_by and summarise

  • always(ish) go hand in hand
  • group_by() reduces number of rows
  • summarise() makes new columns
  • always use ungroup() when you’re finished
@Allison_Horst
gapminder::gapminder %>% 
  group_by(continent, year) %>% 
  summarise(mean_lifeExp = mean(lifeExp)) %>% ungroup() %>% 
  ggplot(aes(year, mean_lifeExp, col = continent)) + 
  geom_line() + geom_point() + theme_clean()

a bit more dplyr

  • the function distinct()

    • gets rid of duplicate rows
  • the function rename() changes names of columns

  • the function relocate() changes order of columns

  • and left_join() links dataframes together (a.k.a. SQL)

left_join()

band_members |> gt::gt() |> gt::tab_options(table.font.size = 24)
name band
Mick Stones
John Beatles
Paul Beatles
band_instruments |> gt::gt() |> gt::tab_options(table.font.size = 24)
name plays
John guitar
Paul bass
Keith guitar
band_members |> left_join(band_instruments) |> gt::gt() |> gt::tab_options(table.font.size = 24)
name band plays
Mick Stones NA
John Beatles guitar
Paul Beatles bass

Wide and Long Dataframe Formats

  • pivot_longer() goes from wide to long
  • pivot_wider() goes from long to wide

Tidy Data (the tidyr package)

  • idea of tidy data
    • each variable must have it’s own column
    • each observation it’s own row
    • each value it’s own cell

Working with Excel

  • Avoid using multiple tables within one spreadsheet.

  • Avoid spreading data across multiple tabs (but do use a new tab to record data cleaning or manipulations).

  • Record zeros as zeros.

  • Use an appropriate null value to record missing data.

  • Don’t use formatting to convey information or to make your spreadsheet look pretty.

  • Place comments in a separate column.

  • Record units in column headers.

  • Include only one piece of information in a cell.

  • Avoid spaces, numbers and special characters in column headers.

  • Avoid special characters in your data.

  • Record metadata in a separate plain text file.

penguins %>% select(-body_mass_g) |> 
  pivot_longer(cols = -c(species, island, sex, year), names_to = "Parameter", values_to = "Measurement") %>% 
  head() %>% gt::gt()
species island sex year Parameter Measurement
Adelie Torgersen male 2007 bill_length_mm 39.1
Adelie Torgersen male 2007 bill_depth_mm 18.7
Adelie Torgersen male 2007 flipper_length_mm 181.0
Adelie Torgersen female 2007 bill_length_mm 39.5
Adelie Torgersen female 2007 bill_depth_mm 17.4
Adelie Torgersen female 2007 flipper_length_mm 186.0

@Allison_Horst

@Allison_Horst

@Allison_Horst

@Allison_Horst

@Allison_Horst

@Allison_Horst

@Allison_Horst
gapminder::gapminder %>% select(country, year, lifeExp) %>% 
  pivot_wider(names_from = "year", values_from = "lifeExp") %>% 
  head(15) %>% gt::gt()
country 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Afghanistan 28.801 30.332 31.997 34.020 36.088 38.438 39.854 40.822 41.674 41.763 42.129 43.828
Albania 55.230 59.280 64.820 66.220 67.690 68.930 70.420 72.000 71.581 72.950 75.651 76.423
Algeria 43.077 45.685 48.303 51.407 54.518 58.014 61.368 65.799 67.744 69.152 70.994 72.301
Angola 30.015 31.999 34.000 35.985 37.928 39.483 39.942 39.906 40.647 40.963 41.003 42.731
Argentina 62.485 64.399 65.142 65.634 67.065 68.481 69.942 70.774 71.868 73.275 74.340 75.320
Australia 69.120 70.330 70.930 71.100 71.930 73.490 74.740 76.320 77.560 78.830 80.370 81.235
Austria 66.800 67.480 69.540 70.140 70.630 72.170 73.180 74.940 76.040 77.510 78.980 79.829
Bahrain 50.939 53.832 56.923 59.923 63.300 65.593 69.052 70.750 72.601 73.925 74.795 75.635
Bangladesh 37.484 39.348 41.216 43.453 45.252 46.923 50.009 52.819 56.018 59.412 62.013 64.062
Belgium 68.000 69.240 70.250 70.940 71.440 72.800 73.930 75.350 76.460 77.530 78.320 79.441
Benin 38.223 40.358 42.618 44.885 47.014 49.190 50.904 52.337 53.919 54.777 54.406 56.728
Bolivia 40.414 41.890 43.428 45.032 46.714 50.023 53.859 57.251 59.957 62.050 63.883 65.554
Bosnia and Herzegovina 53.820 58.450 61.930 64.790 67.450 69.860 70.690 71.140 72.178 73.244 74.090 74.852
Botswana 47.622 49.618 51.520 53.298 56.024 59.319 61.484 63.622 62.745 52.556 46.634 50.728
Brazil 50.917 53.285 55.665 57.632 59.504 61.489 63.336 65.205 67.057 69.388 71.006 72.390

Workshop - Week Three

Perform the Following Tasks:

1

Take the us_contagious_diseases dataset from the dslabs library. filter() the dataset for the disease Smallpox in the state of Wisconsin. This should give you 25 rows.

2

Take the research_funding_rates dataset from the dslabs library. Use select() to print out the dataset with only the columns discipline, success_rates_men, and success_rates_women. The dataframe should look as below:

discipline success_rates_men success_rates_women
Chemical sciences 26.5 25.6
Physical sciences 19.3 23.1
Physics 26.9 22.2
Humanities 14.3 19.3
Technical sciences 15.9 21.0
Interdisciplinary 11.4 21.8
Earth/life sciences 24.4 14.3
Social sciences 15.3 11.5
Medical sciences 18.8 11.2

3

Take the table from problem 2 and use arrange(desc()) to modify it so that the rows are ordered by decreasing values of applications_total

discipline success_rates_men success_rates_women
Social sciences 15.3 11.5
Medical sciences 18.8 11.2
Humanities 14.3 19.3
Earth/life sciences 24.4 14.3
Technical sciences 15.9 21.0
Interdisciplinary 11.4 21.8
Physical sciences 19.3 23.1
Chemical sciences 26.5 25.6
Physics 26.9 22.2

4

Again, take the research_funding_rates dataset from the dslabs library. Make a new column using mutate() that shows the difference in success rate between men and women for each discipline. Print out the dataset with only the columns discipline and gender_difference, ordered by success rate_gender_difference. The dataframe should look as below:

discipline gender_difference
Earth/life sciences 10.1
Medical sciences 7.6
Physics 4.7
Social sciences 3.8
Chemical sciences 0.9
Physical sciences -3.8
Humanities -5.0
Technical sciences -5.1
Interdisciplinary -10.4

5

Take the polls_us_election_2016 dataset from the dslabs library. group_by() the grade column and summarise() to calculate the average sample size for each grade. arrange(desc()) the table in decreasing average_sample_size. The dataframe should look as below:

grade average_sample_size
A+ 733
A 944
A- 759
B+ 896
B 974
B- 1263
C+ 972
C 717
C- 1870
D 1097

Assignments - Week Three

  1. Complete week three moodle quiz

  2. Complete swirl() exercises

  • install.packages("swirl")

  • library(swirl)

  • install_course("Getting and Cleaning Data")

  • swirl()

  • choose course Getting and Cleaning Data

  • do the exercises 2 (Grouping and Chaining with dplyr) and 4 (Dates and Times with lubridate)

    • note, because of time zone issues, you might need a skip() command in the later around the 55% mark
  • email a screen shot of the end of the lesson to eugene.hickey@associate.atu.ie

  • it’ll look a bit like screen capture here