Tidy data and data wrangling
🔧

1 / 29

NC bike crashes

2 / 29

Tidy data

3 / 29

Tidy data

Happy families are all alike; every unhappy family is unhappy in its own way.

Leo Tolstoy

Characteristics of tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.
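
As a rough illustration of the three characteristics, the sketch below reshapes a small untidy table into a tidy one. The table and its counts are invented for this example (they are not taken from ncbikecrash), and it assumes the tibble, dplyr, and tidyr packages are installed.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Made-up counts for illustration only; not taken from ncbikecrash.
# Untidy: the variable "year" is spread across the column names.
untidy <- tribble(
  ~county,  ~`2013`, ~`2014`,
  "Durham", 40,      45,
  "Wake",   90,      85
)

# Tidy: each variable (county, year, crashes) forms a column and
# each county-year observation forms a row.
tidy <- untidy %>%
  pivot_longer(
    cols      = c(`2013`, `2014`),
    names_to  = "year",
    values_to = "crashes"
  )

tidy
```

The tidy version has one row per county and year, which is the shape that verbs such as filter and group_by expect.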

Characteristics of untidy data:

!@#$%^&*()

4 / 29

Pipes

5 / 29

Where does the name come from?

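The pipe operator is implemented in the magrittr package and is pronounced "and then". The package name is a nod to the painter René Magritte, whose painting The Treachery of Images carries the caption "Ceci n'est pas une pipe" ("This is not a pipe").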

6 / 29

Review: How does a pipe work?

  • Think about the following sequence of actions: find keys, start car, drive to campus, park.
  • Expressed as a set of nested functions in R pseudocode, this would look like:
park(drive(start_car(find("keys")), to = "campus"))
  • Writing it out using pipes gives it a more natural (and easier to read) structure:
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
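
The car example is pseudocode, so here is a small runnable counterpart using base R functions. It is only a sketch, assuming the magrittr package is installed; the vector x is an arbitrary choice for illustration.

```r
library(magrittr)

x <- c(1, 4, 9)

# Nested calls are read from the inside out.
nested <- round(sqrt(sum(x)), digits = 1)

# Piped calls are read top to bottom: sum, and then sqrt, and then round.
piped <- x %>%
  sum() %>%
  sqrt() %>%
  round(digits = 1)

identical(nested, piped)
```

Both forms compute the same value; the pipe only changes the order in which the steps are written.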
7 / 29

What about other arguments?

To send the result to an argument other than the first one, or to use the previous result in several arguments, refer to it with the placeholder . (a dot):

starwars %>%
  filter(species == "Human") %>%
  lm(mass ~ height, data = .)
##
## Call:
## lm(formula = mass ~ height, data = .)
##
## Coefficients:
## (Intercept) height
## -116.58 1.11
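
To check what the dot does, the sketch below fits the same kind of model with and without the pipe. It uses the built-in mtcars data as a stand-in for starwars (an assumption made so the example runs without extra data) and assumes dplyr is installed.

```r
library(dplyr)

# mtcars ships with R; it stands in for starwars here.
fit_piped <- mtcars %>%
  filter(cyl == 4) %>%
  lm(mpg ~ wt, data = .)   # the dot is the filtered data frame

# The same model without the pipe, for comparison.
four_cyl   <- filter(mtcars, cyl == 4)
fit_direct <- lm(mpg ~ wt, data = four_cyl)

all.equal(coef(fit_piped), coef(fit_direct))
```

Because the dot appears as a direct argument, magrittr passes the piped value to data only and does not also insert it as the first argument.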
8 / 29

Data wrangling

9 / 29

Bike crashes in NC 2007 - 2014

The dataset is in the dsbox package:

library(dsbox)
ncbikecrash
10 / 29

Variables

View the names of the variables with names():

names(ncbikecrash)
## [1] "object_id" "city" "county"
## [4] "region" "development" "locality"
## [7] "on_road" "rural_urban" "speed_limit"
## [10] "traffic_control" "weather" "workzone"
## [13] "bike_age" "bike_age_group" "bike_alcohol"
## [16] "bike_alcohol_drugs" "bike_direction" "bike_injury"
## [19] "bike_position" "bike_race" "bike_sex"
## [22] "driver_age" "driver_age_group" "driver_alcohol"
## [25] "driver_alcohol_drugs" "driver_est_speed" "driver_injury"
## [28] "driver_race" "driver_sex" "driver_vehicle_type"
## [31] "crash_alcohol" "crash_date" "crash_day"
## [34] "crash_group" "crash_hour" "crash_location"
## [37] "crash_month" "crash_severity" "crash_time"
## [40] "crash_type" "crash_year" "ambulance_req"
## [43] "hit_run" "light_condition" "road_character"
## [46] "road_class" "road_condition" "road_configuration"
## [49] "road_defects" "road_feature" "road_surface"
## [52] "num_bikes_ai" "num_bikes_bi" "num_bikes_ci"
## [55] "num_bikes_ki" "num_bikes_no" "num_bikes_to"
## [58] "num_bikes_ui" "num_lanes" "num_units"
## [61] "distance_mi_from" "frm_road" "rte_invd_cd"
## [64] "towrd_road" "geo_point" "geo_shape"
11 / 29

Variables

See detailed descriptions with ?ncbikecrash.

12 / 29

Viewing your data

  • In the Environment pane, after loading the data with data(ncbikecrash), click the name of the data frame to open it in the data viewer
  • Use the glimpse function to take a peek
glimpse(ncbikecrash)
## Observations: 7,467
## Variables: 66
## $ object_id <int> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1675, …
## $ city <chr> "None - Rural Crash", "Henderson", "None - Rural…
## $ county <chr> "Wayne", "Vance", "Lincoln", "Columbus", "New Ha…
## $ region <chr> "Coastal", "Piedmont", "Piedmont", "Coastal", "C…
## $ development <chr> "Farms, Woods, Pastures", "Residential", "Farms,…
## $ locality <chr> "Rural (<30% Developed)", "Mixed (30% To 70% Dev…
## $ on_road <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURKHEAD …
## $ rural_urban <chr> "Rural", "Urban", "Rural", "Urban", "Urban", "Ru…
## $ speed_limit <chr> "50 - 55 MPH", "30 - 35 MPH", "50 - 55 MPH", …
## $ traffic_control <chr> "No Control Present", "Stop Sign", "Double Yello…
## $ weather <chr> "Clear", "Clear", "Clear", "Rain", "Clear", "Clo…
## $ workzone <chr> "No", "No", "No", "No", "No", "No", "No", "No", …
## $ bike_age <chr> "52", "66", "33", "52", "22", "15", "41", "14", …
## $ bike_age_group <chr> "50-59", "60-69", "30-39", "50-59", "20-24", "11…
## $ bike_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", "No",…
## $ bike_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bike_direction <chr> "With Traffic", "With Traffic", "With Traffic", …
## $ bike_injury <chr> "B: Evident Injury", "C: Possible Injury", "C: P…
## $ bike_position <chr> "Bike Lane / Paved Shoulder", "Travel Lane", "Tr…
## $ bike_race <chr> "Black", "Black", "White", "Black", "White", "Na…
## $ bike_sex <chr> "Male", "Male", "Male", "Male", "Female", "Male"…
## $ driver_age <chr> "34", NA, "37", "55", "25", "17", NA, "50", "32"…
## $ driver_age_group <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-19", …
## $ driver_alcohol <chr> "No", "Missing", "No", "No", "No", "No", "Missin…
## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ driver_est_speed <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-15 mph…
## $ driver_injury <chr> "O: No Injury", "Unknown Injury", "O: No Injury"…
## $ driver_race <chr> "White", "Unknown/Missing", "Hispanic", "Black",…
## $ driver_sex <chr> "Male", NA, "Female", "Male", "Male", "Female", …
## $ driver_vehicle_type <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "Passe…
## $ crash_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", "No",…
## $ crash_date <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14DEC201…
## $ crash_day <chr> "Wednesday", "Wednesday", "Sunday", "Saturday", …
## $ crash_group <chr> "Motorist Overtaking Bicyclist", "Bicyclist Fail…
## $ crash_hour <int> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22, 12,…
## $ crash_location <chr> "Non-Intersection", "Intersection", "Non-Interse…
## $ crash_month <chr> "December", "November", "November", "December", …
## $ crash_severity <chr> "B: Evident Injury", "C: Possible Injury", "C: P…
## $ crash_time <time> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13:27:0…
## $ crash_type <chr> "Motorist Overtaking - Undetected Bicyclist", "B…
## $ crash_year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ ambulance_req <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ hit_run <chr> "No", "Yes", "No", "No", "No", "No", "Yes", "No"…
## $ light_condition <chr> "Dark - Roadway Not Lighted", NA, "Dark - Roadwa…
## $ road_character <chr> "Straight - Level", "Straight - Level", "Straigh…
## $ road_class <chr> "State Secondary Route", "Local Street", "US Rou…
## $ road_condition <chr> "Dry", "Dry", "Dry", "Water (Standing, Moving)",…
## $ road_configuration <chr> "Two-Way, Not Divided", "Two-Way, Divided, Unpro…
## $ road_defects <chr> "None", NA, "None", "None", "None", "None", "Non…
## $ road_feature <chr> "No Special Feature", "T-Intersection", "No Spec…
## $ road_surface <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth Asph…
## $ num_bikes_ai <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num_bikes_bi <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num_bikes_ci <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num_bikes_ki <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num_bikes_no <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num_bikes_to <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num_bikes_ui <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ num_lanes <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", "8 la…
## $ num_units <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ distance_mi_from <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"…
## $ frm_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ rte_invd_cd <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ towrd_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ geo_point <chr> "35.3336070056, -77.9955023901", "36.3151872016,…
## $ geo_shape <chr> "{\"type\": \"Point\", \"coordinates\": [-77.995…
13 / 29

A Grammar of Data Manipulation

dplyr is built around functions that act as verbs for manipulating data frames.

  • filter: pick rows matching criteria
  • slice: pick rows using index(es)
  • select: pick columns by name
  • pull: grab a column as a vector
  • arrange: reorder rows
  • mutate: add new variables
  • distinct: filter for unique rows
  • sample_n / sample_frac: randomly sample rows
  • summarise: reduce variables to values
  • ... (many more)
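
A few of the verbs in this list can be tried on a toy data frame, as sketched below. The rows are invented for illustration and only borrow three column names from ncbikecrash; dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up rows that borrow three column names from ncbikecrash.
crashes <- tibble(
  county     = c("Durham", "Wake", "Durham", "Orange"),
  crash_hour = c(17L, 8L, 21L, 13L),
  hit_run    = c("No", "Yes", "No", "No")
)

first_two <- crashes %>% slice(1:2)                 # rows 1 and 2, by position
hours     <- crashes %>% pull(crash_hour)           # a plain integer vector
latest    <- crashes %>% arrange(desc(crash_hour))  # latest hour first
counties  <- crashes %>% distinct(county)           # one row per county
```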
14 / 29

dplyr rules for functions

  • First argument is always a data frame
  • Subsequent arguments say what to do with that data frame
  • Always return a data frame (pull, which returns a vector, is the exception)
  • Don't modify the input in place
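
These rules are what make the verbs chainable. The following sketch checks the first and third rules on the built-in mtcars data, which is used here only as a convenient stand-in; dplyr is assumed to be installed.

```r
library(dplyr)

# The first argument is a data frame, so these two calls are equivalent.
direct <- filter(mtcars, cyl == 6)
piped  <- mtcars %>% filter(cyl == 6)
same_result <- identical(direct, piped)

# The result is a data frame, so it can be handed to the next verb.
returns_df <- is.data.frame(piped)

# pull() is the exception among the verbs listed earlier: it returns a vector.
pull_is_df <- is.data.frame(pull(mtcars, mpg))
```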
15 / 29

A note on piping and layering

  • The %>% operator in dplyr functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.
  • The + operator in ggplot2 functions is used for "layering". This means you create the plot in layers, separated by +.
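
The two operators often appear in the same block of code. The sketch below, which assumes dplyr and ggplot2 are installed and uses the built-in mtcars data as a stand-in, pipes a filtered data frame into ggplot() and then adds layers with +.

```r
library(dplyr)
library(ggplot2)

p <- mtcars %>%
  filter(cyl == 4) %>%              # pipe: the data flows into the next call
  ggplot(aes(x = wt, y = mpg)) +    # from here on, layers are added with +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

p
```

A common slip is to keep typing %>% after the ggplot() call; once the plot has been started, each new layer is added with +.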
16 / 29

filter to select a subset of rows

for crashes in Durham County

ncbikecrash %>%
  filter(county == "Durham")
## # A tibble: 340 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2452 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 2 2441 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 3 2466 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 4 549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban
## 5 598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban
## 6 603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban
## 7 3974 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 8 7134 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 9 1670 Durh… Durham Piedm… Commercial Urban (… INFINI… Urban
## 10 1773 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## # … with 330 more rows, and 58 more variables: speed_limit <chr>,
## # traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>,
## # bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>,
## # bike_direction <chr>, bike_injury <chr>, bike_position <chr>,
## # bike_race <chr>, bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, driver_est_speed <chr>,
## # driver_injury <chr>, driver_race <chr>, driver_sex <chr>,
## # driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>,
## # crash_day <chr>, crash_group <chr>, crash_hour <int>, crash_location <chr>,
## # crash_month <chr>, crash_severity <chr>, crash_time <time>,
## # crash_type <chr>, crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, frm_road <chr>,
## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr>
17 / 29

filter for many conditions at once

for crashes in Durham County where the biker was 0-5 years old

ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5")
## # A tibble: 4 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 4062 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 2 414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban
## 3 3016 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 4 1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>,
## # driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>,
## # driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>,
## # driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>,
## # crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <time>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, frm_road <chr>,
## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr>
18 / 29

Logical operators in R

operator    definition                  operator       definition
<           less than                   x | y          x OR y
<=          less than or equal to       is.na(x)       test if x is NA
>           greater than                !is.na(x)      test if x is not NA
>=          greater than or equal to    x %in% y       test if x is in y
==          exactly equal to            !(x %in% y)    test if x is not in y
!=          not equal to                !x             not x
x & y       x AND y
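
Several of these operators are shown inside filter() in the sketch below. The rows are invented for illustration and only borrow column names from ncbikecrash; dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up rows that borrow column names from ncbikecrash.
crashes <- tibble(
  county     = c("Durham", "Wake", "Orange", "Durham", "Wake"),
  crash_hour = c(7L, 18L, 22L, 15L, 9L),
  driver_age = c("34", NA, "50", "17", NA)
)

# Conditions separated by commas are combined with AND, the same as &.
with_comma <- crashes %>% filter(county == "Durham", crash_hour >= 12)
with_and   <- crashes %>% filter(county == "Durham" & crash_hour >= 12)

# %in% tests membership in a set of values.
durham_or_orange <- crashes %>% filter(county %in% c("Durham", "Orange"))

# | keeps rows where at least one condition holds.
early_or_late <- crashes %>% filter(crash_hour < 8 | crash_hour > 20)

# !is.na() keeps rows where driver_age is recorded.
known_age <- crashes %>% filter(!is.na(driver_age))
```

Note that filter() drops rows where a condition evaluates to NA, which is why missing values are tested with is.na() rather than == NA.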
19 / 29

select to keep variables

ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5") %>%
  select(locality, speed_limit)
## # A tibble: 4 x 2
## locality speed_limit
## <chr> <chr>
## 1 Urban (>70% Developed) 30 - 35 MPH
## 2 Urban (>70% Developed) 5 - 15 MPH
## 3 Urban (>70% Developed) 20 - 25 MPH
## 4 Urban (>70% Developed) 20 - 25 MPH
20 / 29

select to exclude variables

ncbikecrash %>%
  select(-object_id)
## # A tibble: 7,467 x 65
## city county region development locality on_road rural_urban speed_limit
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural 50 - 55 M…
## 2 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban 30 - 35 M…
## 3 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural 50 - 55 M…
## 4 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban 30 - 35 M…
## 5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban <NA>
## 6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural 50 - 55 M…
## 7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural 30 - 35 M…
## 8 Rale… Wake Piedm… Commercial Urban (… PERSON… Urban 30 - 35 M…
## 9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban 30 - 35 M…
## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban 20 - 25 M…
## # … with 7,457 more rows, and 57 more variables: traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>,
## # driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>,
## # driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>,
## # driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>,
## # crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <time>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, frm_road <chr>,
## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr>
21 / 29

select a range of variables

ncbikecrash %>%
  select(city:locality)
## # A tibble: 7,467 x 5
## city county region development locality
## <chr> <chr> <chr> <chr> <chr>
## 1 None - Rural C… Wayne Coastal Farms, Woods, Pas… Rural (<30% Developed)
## 2 Henderson Vance Piedmont Residential Mixed (30% To 70% Dev…
## 3 None - Rural C… Lincoln Piedmont Farms, Woods, Pas… Rural (<30% Developed)
## 4 Whiteville Columbus Coastal Commercial Urban (>70% Developed)
## 5 Wilmington New Hanov… Coastal Residential Urban (>70% Developed)
## 6 None - Rural C… Robeson Coastal Farms, Woods, Pas… Rural (<30% Developed)
## 7 None - Rural C… Richmond Piedmont Residential Mixed (30% To 70% Dev…
## 8 Raleigh Wake Piedmont Commercial Urban (>70% Developed)
## 9 Whiteville Columbus Coastal Residential Rural (<30% Developed)
## 10 New Bern Craven Coastal Residential Urban (>70% Developed)
## # … with 7,457 more rows
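
The different ways of selecting columns can be compared side by side on a toy data frame. The sketch below uses a made-up single row with a few ncbikecrash column names, and adds starts_with(), a selection helper that the slides do not cover; dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up single-row tibble using a few ncbikecrash column names.
crashes <- tibble(
  object_id  = 1L,
  city       = "Durham",
  county     = "Durham",
  bike_age   = "22",
  bike_sex   = "Female",
  driver_age = "34"
)

by_range  <- crashes %>% select(city:bike_age)            # contiguous columns
by_prefix <- crashes %>% select(starts_with("bike_"))     # columns sharing a prefix
dropped   <- crashes %>% select(-object_id, -driver_age)  # exclude several columns
```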
22 / 29

summarize to reduce variables to values

ncbikecrash %>%
  summarize(avg_hr = mean(crash_hour))
## # A tibble: 1 x 1
## avg_hr
## <dbl>
## 1 14.7
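
summarize can compute several values at once, and missing values need some care. The sketch below uses a made-up crash_hour column with one NA to show the difference; the values are invented, and dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up hours, one of them missing.
crashes <- tibble(crash_hour = c(6L, 20L, NA, 18L))

hour_summary <- crashes %>%
  summarize(
    n_rows     = n(),                            # number of rows
    avg_hr     = mean(crash_hour),               # NA, because one value is missing
    avg_hr_obs = mean(crash_hour, na.rm = TRUE)  # mean of the observed values
  )

hour_summary
```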
23 / 29

group_by to do calculations on groups

ncbikecrash %>%
  group_by(hit_run) %>%
  summarise(avg_hr = mean(crash_hour))
## # A tibble: 2 x 2
## hit_run avg_hr
## <chr> <dbl>
## 1 No 14.6
## 2 Yes 15.0
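
The same pattern works for any per-group calculation, including counts. The sketch below uses made-up rows (not the real data) to show that group_by() followed by summarise(n = n()) gives the same counts as the count() shortcut; dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up rows for illustration.
crashes <- tibble(
  hit_run    = c("No", "Yes", "No", "No", "Yes"),
  crash_hour = c(8L, 23L, 17L, 14L, 2L)
)

# One output row per group, with as many summaries as needed.
by_hit_run <- crashes %>%
  group_by(hit_run) %>%
  summarise(n = n(), avg_hr = mean(crash_hour))

# count() is a shortcut for group_by() followed by summarise(n = n()).
counted <- crashes %>% count(hit_run)
```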
24 / 29

mutate to add new variables

ncbikecrash %>%
  mutate(weather_clear = case_when(
    weather == "Clear" ~ "Yes",
    TRUE ~ "No"
  ))
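
One detail worth knowing about the TRUE ~ "No" fallback is how it treats missing values. The sketch below compares case_when() with if_else() on a made-up weather column that contains an NA; the data are invented for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up weather values, one of them missing.
weather_demo <- tibble(weather = c("Clear", "Rain", NA)) %>%
  mutate(
    # The TRUE ~ "No" fallback also catches the missing value.
    clear_case_when = case_when(
      weather == "Clear" ~ "Yes",
      TRUE ~ "No"
    ),
    # if_else() keeps the missing value as NA.
    clear_if_else = if_else(weather == "Clear", "Yes", "No")
  )

weather_demo
```

Which behaviour is appropriate depends on whether a missing weather report should count as "not clear" or stay missing.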

25 / 29

"Save" when you mutate

Most often when you define a new variable with mutate you'll also want to save the resulting data frame, often by writing over the original data frame.

ncbikecrash <- ncbikecrash %>%
  mutate(weather_clear = case_when(
    weather == "Clear" ~ "Yes",
    TRUE ~ "No"
  ))
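
The reason for saving is that mutate returns a new data frame and leaves the original untouched. The sketch below checks this on a made-up two-row data frame; the logical weather_clear column is a simplification of the Yes/No version in the slides, and dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up data for illustration.
crashes <- tibble(weather = c("Clear", "Rain"))

# Without assignment the new column is printed but not kept.
crashes %>% mutate(weather_clear = weather == "Clear")
has_col_before <- "weather_clear" %in% names(crashes)

# Assigning the result back keeps the column.
crashes <- crashes %>% mutate(weather_clear = weather == "Clear")
has_col_after <- "weather_clear" %in% names(crashes)
```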
26 / 29

"Save" when you mutate

The assignment can also go at the end of the pipeline with the rightward assignment operator ->, so the code reads in the order it runs.

ncbikecrash %>%
  mutate(weather_clear = case_when(
    weather == "Clear" ~ "Yes",
    TRUE ~ "No"
  )) -> ncbikecrash
27 / 29

Check before you move on

ncbikecrash %>%
  count(weather_clear, weather)
## # A tibble: 6 x 3
## weather_clear weather n
## <chr> <chr> <int>
## 1 No Cloudy 951
## 2 No Fog, Smog, Smoke 17
## 3 No Other 15
## 4 No Rain 295
## 5 No Snow, Sleet, Hail, Freezing Rain/Drizzle 8
## 6 Yes Clear 6181
28 / 29

NC bike crashes

  • Go to the sta-363-s20 GitHub organization and search for appex-02-nc-bike-crashes
  • Clone this repository into RStudio Cloud
  • Complete the exercises
  • Commit your changes
  • Push them back to GitHub
29 / 29
