Introduction to Statistical Computing - dplyr

👨‍💻 Eugene Hickey @ Atlantic Technological University 👨‍💻

eugene.hickey@tudublin.ie
@eugene100hickey
github.com/eugene100hickey
www.fizzics.ie

Week Three - Manipulating Data

Once we get our data, we have to do stuff with it
Often means recasting the shape of our data
The dplyr library is a key player here
six verbs
- filter() chooses some of the rows
- select() chooses some of the columns
- mutate() makes new columns
- arrange() sorts the rows by some values
- group_by() puts rows together
- summarise() makes new rows based on group_by()

The Pipe - %>%

neat feature, takes a bit of getting used to
but makes life way simpler and code more readable
chains operations together
read it in your head as “and then…”
alternatives are, well, ugly. And unforgiving.
also can look like this: |> or | >
keyboard shortcut Ctrl plus SHFT plus m

The Pipe - %>% continued

we’ve used it extensively so far, and can go even further
basic idea is that you send the output from one line into the first (unnamed) argument of the next
keyboard shortcut, one worth knowing, is Ctrl + Shift + M
if you’re really interested, other pipes such as the T-pipe (%T%) and the dollar pipe (%$%)

filter

used to choose some rows from a dataframe
pass in the dataframe and some logical condition
- could be == (note the double equals), for characters and <, > for numerics
- also <=, >=, ! (for NOT)
- can use between()
- I like %in%
can have multiple logical conditions in the same filter

# you'll need the libraries: tidyverse, gapminder, and knitr
gapminder::gapminder %>% 
  filter(continent == "Africa", 
         between(pop, 20e6, 50e6), 
         year %in% c(1952:1964)) %>% 
  gt::gt()

country	continent	year	lifeExp	pop	gdpPercap
Egypt	Africa	1952	41.893	22223309	1418.8224
Egypt	Africa	1957	44.444	25009741	1458.9153
Egypt	Africa	1962	46.992	28173309	1693.3359
Ethiopia	Africa	1952	34.078	20860941	362.1463
Ethiopia	Africa	1957	36.667	22815614	378.9042
Ethiopia	Africa	1962	40.059	25145372	419.4564
Nigeria	Africa	1952	36.324	33119096	1077.2819
Nigeria	Africa	1957	37.802	37173340	1100.5926
Nigeria	Africa	1962	39.360	41871351	1150.9275

select()

used to pick out columns from a dataframe
pass in the dataframe and one or more columns
- can deselect columns with a minus sign

gapminder::gapminder %>% 
  select(country, year, lifeExp) %>% 
  head() %>% 
  gt::gt()

country	year	lifeExp
Afghanistan	1952	28.801
Afghanistan	1957	30.332
Afghanistan	1962	31.997
Afghanistan	1967	34.020
Afghanistan	1972	36.088
Afghanistan	1977	38.438

gapminder::gapminder %>% 
  select(-c(continent, pop)) %>% 
  head() %>% gt::gt()

country	year	lifeExp	gdpPercap
Afghanistan	1952	28.801	779.4453
Afghanistan	1957	30.332	820.8530
Afghanistan	1962	31.997	853.1007
Afghanistan	1967	34.020	836.1971
Afghanistan	1972	36.088	739.9811
Afghanistan	1977	38.438	786.1134

mutate

makes new columns
same number of rows
pass in dataframe and instructions

gapminder::gapminder %>% 
  select(-continent, -lifeExp) |> 
  mutate(total_gdp_billions = pop * gdpPercap / 1e9) %>% 
  head() %>% gt::gt()

country	year	pop	gdpPercap	total_gdp_billions
Afghanistan	1952	8425333	779.4453	6.567086
Afghanistan	1957	9240934	820.8530	7.585449
Afghanistan	1962	10267083	853.1007	8.758856
Afghanistan	1967	11537966	836.1971	9.648014
Afghanistan	1972	13079460	739.9811	9.678553
Afghanistan	1977	14880372	786.1134	11.697659

arrange

used to order columns
- normally increasing, use desc() to reverse

gapminder::gapminder %>% 
  select(country, year, pop) %>% 
  arrange(pop) %>% 
  head() %>% gt::gt()

country	year	pop
Sao Tome and Principe	1952	60011
Sao Tome and Principe	1957	61325
Djibouti	1952	63149
Sao Tome and Principe	1962	65345
Sao Tome and Principe	1967	70787
Djibouti	1957	71851

gapminder::gapminder %>% 
  select(country, year, pop) %>% 
  arrange(desc(pop)) %>% 
  head() %>% gt::gt()

country	year	pop
China	2007	1318683096
China	2002	1280400000
China	1997	1230075000
China	1992	1164970000
India	2007	1110396331
China	1987	1084035000

group_by and summarise

always(ish) go hand in hand
group_by() reduces number of rows
summarise() makes new columns
always use ungroup() when you’re finished

gapminder::gapminder %>% 
  group_by(continent, year) %>% 
  summarise(mean_lifeExp = mean(lifeExp)) %>% ungroup() %>% 
  ggplot(aes(year, mean_lifeExp, col = continent)) + 
  geom_line() + geom_point() + theme_clean()

a bit more dplyr

the function distinct()
- gets rid of duplicate rows
the function rename() changes names of columns
the function relocate() changes order of columns
and left_join() links dataframes together (a.k.a. SQL)

`left_join()`

band_members |> gt::gt() |> gt::tab_options(table.font.size = 24)

name	band
Mick	Stones
John	Beatles
Paul	Beatles

band_instruments |> gt::gt() |> gt::tab_options(table.font.size = 24)

name	plays
John	guitar
Paul	bass
Keith	guitar

band_members |> left_join(band_instruments) |> gt::gt() |> gt::tab_options(table.font.size = 24)

name	band	plays
Mick	Stones	NA
John	Beatles	guitar
Paul	Beatles	bass

Wide and Long Dataframe Formats

pivot_longer() goes from wide to long
pivot_wider() goes from long to wide

Tidy Data (the `tidyr` package)

idea of tidy data
- each variable must have it’s own column
- each observation it’s own row
- each value it’s own cell

Working with Excel

Avoid using multiple tables within one spreadsheet.
Avoid spreading data across multiple tabs (but do use a new tab to record data cleaning or manipulations).
Record zeros as zeros.
Use an appropriate null value to record missing data.
Don’t use formatting to convey information or to make your spreadsheet look pretty.
Place comments in a separate column.
Record units in column headers.
Include only one piece of information in a cell.
Avoid spaces, numbers and special characters in column headers.
Avoid special characters in your data.
Record metadata in a separate plain text file.

penguins %>% select(-body_mass_g) |> 
  pivot_longer(cols = -c(species, island, sex, year), names_to = "Parameter", values_to = "Measurement") %>% 
  head() %>% gt::gt()

species	island	sex	year	Parameter	Measurement
Adelie	Torgersen	male	2007	bill_length_mm	39.1
Adelie	Torgersen	male	2007	bill_depth_mm	18.7
Adelie	Torgersen	male	2007	flipper_length_mm	181.0
Adelie	Torgersen	female	2007	bill_length_mm	39.5
Adelie	Torgersen	female	2007	bill_depth_mm	17.4
Adelie	Torgersen	female	2007	flipper_length_mm	186.0

gapminder::gapminder %>% select(country, year, lifeExp) %>% 
  pivot_wider(names_from = "year", values_from = "lifeExp") %>% 
  head(15) %>% gt::gt()

country	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
Afghanistan	28.801	30.332	31.997	34.020	36.088	38.438	39.854	40.822	41.674	41.763	42.129	43.828
Albania	55.230	59.280	64.820	66.220	67.690	68.930	70.420	72.000	71.581	72.950	75.651	76.423
Algeria	43.077	45.685	48.303	51.407	54.518	58.014	61.368	65.799	67.744	69.152	70.994	72.301
Angola	30.015	31.999	34.000	35.985	37.928	39.483	39.942	39.906	40.647	40.963	41.003	42.731
Argentina	62.485	64.399	65.142	65.634	67.065	68.481	69.942	70.774	71.868	73.275	74.340	75.320
Australia	69.120	70.330	70.930	71.100	71.930	73.490	74.740	76.320	77.560	78.830	80.370	81.235
Austria	66.800	67.480	69.540	70.140	70.630	72.170	73.180	74.940	76.040	77.510	78.980	79.829
Bahrain	50.939	53.832	56.923	59.923	63.300	65.593	69.052	70.750	72.601	73.925	74.795	75.635
Bangladesh	37.484	39.348	41.216	43.453	45.252	46.923	50.009	52.819	56.018	59.412	62.013	64.062
Belgium	68.000	69.240	70.250	70.940	71.440	72.800	73.930	75.350	76.460	77.530	78.320	79.441
Benin	38.223	40.358	42.618	44.885	47.014	49.190	50.904	52.337	53.919	54.777	54.406	56.728
Bolivia	40.414	41.890	43.428	45.032	46.714	50.023	53.859	57.251	59.957	62.050	63.883	65.554
Bosnia and Herzegovina	53.820	58.450	61.930	64.790	67.450	69.860	70.690	71.140	72.178	73.244	74.090	74.852
Botswana	47.622	49.618	51.520	53.298	56.024	59.319	61.484	63.622	62.745	52.556	46.634	50.728
Brazil	50.917	53.285	55.665	57.632	59.504	61.489	63.336	65.205	67.057	69.388	71.006	72.390

Workshop - Week Three

Perform the Following Tasks:

Take the us_contagious_diseases dataset from the dslabs library. filter() the dataset for the disease Smallpox in the state of Wisconsin. This should give you 25 rows.

Take the research_funding_rates dataset from the dslabs library. Use select() to print out the dataset with only the columns discipline, success_rates_men, and success_rates_women. The dataframe should look as below:

discipline	success_rates_men	success_rates_women
Chemical sciences	26.5	25.6
Physical sciences	19.3	23.1
Physics	26.9	22.2
Humanities	14.3	19.3
Technical sciences	15.9	21.0
Interdisciplinary	11.4	21.8
Earth/life sciences	24.4	14.3
Social sciences	15.3	11.5
Medical sciences	18.8	11.2

Take the table from problem 2 and use arrange(desc()) to modify it so that the rows are ordered by decreasing values of applications_total

discipline	success_rates_men	success_rates_women
Social sciences	15.3	11.5
Medical sciences	18.8	11.2
Humanities	14.3	19.3
Earth/life sciences	24.4	14.3
Technical sciences	15.9	21.0
Interdisciplinary	11.4	21.8
Physical sciences	19.3	23.1
Chemical sciences	26.5	25.6
Physics	26.9	22.2

Again, take the research_funding_rates dataset from the dslabs library. Make a new column using mutate() that shows the difference in success rate between men and women for each discipline. Print out the dataset with only the columns discipline and gender_difference, ordered by success rate_gender_difference. The dataframe should look as below:

discipline	gender_difference
Earth/life sciences	10.1
Medical sciences	7.6
Physics	4.7
Social sciences	3.8
Chemical sciences	0.9
Physical sciences	-3.8
Humanities	-5.0
Technical sciences	-5.1
Interdisciplinary	-10.4

Take the polls_us_election_2016 dataset from the dslabs library. group_by() the grade column and summarise() to calculate the average sample size for each grade. arrange(desc()) the table in decreasing average_sample_size. The dataframe should look as below:

grade	average_sample_size
A+	733
A	944
A-	759
B+	896
B	974
B-	1263
C+	972
C	717
C-	1870
D	1097

Assignments - Week Three

Complete week three moodle quiz
Complete swirl() exercises

install.packages("swirl")
library(swirl)
install_course("Getting and Cleaning Data")
swirl()
choose course Getting and Cleaning Data
do the exercises 2 (Grouping and Chaining with dplyr) and 4 (Dates and Times with lubridate)
- note, because of time zone issues, you might need a skip() command in the later around the 55% mark

email a screen shot of the end of the lesson to eugene.hickey@associate.atu.ie
it’ll look a bit like screen capture here