class: center, middle, inverse, title-slide # Tidy data and data wrangling
🔧 --- layout: true <div class="my-footer"> <span> <img src = "img/dsbox-logo.png" width = "30"> </img> Slides adapted from <a href="https://datasciencebox.org" target="_blank">datasciencebox.org</a> by Dr. Lucy D'Agostino McGowan </span> </div> --- ## <i class="fas fa-laptop"></i> `NC bike crashes` - Go to the [sta-363-s20 GitHub organization](https://github.com/sta-363-s20) and search for `appex-02-nc-bike-crashes` - Clone this repository into RStudio Cloud --- class: center, middle # Tidy data --- ## Tidy data >Happy families are all alike; every unhappy family is unhappy in its own way. > >Leo Tolstoy -- .pull-left[ **Characteristics of tidy data:** - Each variable forms a column. - Each observation forms a row. - Each type of observational unit forms a table. ] -- .pull-right[ **Characteristics of untidy data:** !@#$%^&*() ] --- class: center, middle # Pipes --- ## Where does the name come from? The pipe operator is implemented in the package **magrittr**, it's pronounced "and then". .pull-left[ ![pipe](img/02/magritte.jpg) ] .pull-right[ ![magrittr](img/02/magrittr.jpg) ] --- ## Review: How does a pipe work? - You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park. - Expressed as a set of nested functions in R pseudocode this would look like: ```r park(drive(start_car(find("keys")), to = "campus")) ``` - Writing it out using pipes give it a more natural (and easier to read) structure: ```r find("keys") %>% start_car() %>% drive(to = "campus") %>% park() ``` --- ## What about other arguments? To send results to a function argument other than first one or to use the previous result for multiple arguments, use `.`: ```r starwars %>% filter(species == "Human") %>% lm(mass ~ height, data = .) ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = .) ## ## Coefficients: ## (Intercept) height ## -116.58 1.11 ``` --- # Data wrangling ![](img/02/dplyr_wrangling.png) .my-footer[ <font size="2"> Artwork by @allison_horst </font> ] --- ## Bike crashes in NC 2007 - 2014 The dataset is in the **dsbox** package: ```r library(dsbox) ncbikecrash ``` --- ## Variables View the names of variables via .small[ ```r names(ncbikecrash) ``` ``` ## [1] "object_id" "city" "county" ## [4] "region" "development" "locality" ## [7] "on_road" "rural_urban" "speed_limit" ## [10] "traffic_control" "weather" "workzone" ## [13] "bike_age" "bike_age_group" "bike_alcohol" ## [16] "bike_alcohol_drugs" "bike_direction" "bike_injury" ## [19] "bike_position" "bike_race" "bike_sex" ## [22] "driver_age" "driver_age_group" "driver_alcohol" ## [25] "driver_alcohol_drugs" "driver_est_speed" "driver_injury" ## [28] "driver_race" "driver_sex" "driver_vehicle_type" ## [31] "crash_alcohol" "crash_date" "crash_day" ## [34] "crash_group" "crash_hour" "crash_location" ## [37] "crash_month" "crash_severity" "crash_time" ## [40] "crash_type" "crash_year" "ambulance_req" ## [43] "hit_run" "light_condition" "road_character" ## [46] "road_class" "road_condition" "road_configuration" ## [49] "road_defects" "road_feature" "road_surface" ## [52] "num_bikes_ai" "num_bikes_bi" "num_bikes_ci" ## [55] "num_bikes_ki" "num_bikes_no" "num_bikes_to" ## [58] "num_bikes_ui" "num_lanes" "num_units" ## [61] "distance_mi_from" "frm_road" "rte_invd_cd" ## [64] "towrd_road" "geo_point" "geo_shape" ``` ] --- ## Variables See detailed descriptions with `?ncbikecrash`. ![](img/02/bike-help.png) --- ## Viewing your data - In the Environment, after loading with `data(ncbikecrash)`, click on the name of the data frame to view it in the data viewer - Use the `glimpse` function to take a peek ```r glimpse(ncbikecrash) ``` ``` ## Observations: 7,467 ## Variables: 66 ## $ object_id <int> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1675, … ## $ city <chr> "None - Rural Crash", "Henderson", "None - Rural… ## $ county <chr> "Wayne", "Vance", "Lincoln", "Columbus", "New Ha… ## $ region <chr> "Coastal", "Piedmont", "Piedmont", "Coastal", "C… ## $ development <chr> "Farms, Woods, Pastures", "Residential", "Farms,… ## $ locality <chr> "Rural (<30% Developed)", "Mixed (30% To 70% Dev… ## $ on_road <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURKHEAD … ## $ rural_urban <chr> "Rural", "Urban", "Rural", "Urban", "Urban", "Ru… ## $ speed_limit <chr> "50 - 55 MPH", "30 - 35 MPH", "50 - 55 MPH", … ## $ traffic_control <chr> "No Control Present", "Stop Sign", "Double Yello… ## $ weather <chr> "Clear", "Clear", "Clear", "Rain", "Clear", "Clo… ## $ workzone <chr> "No", "No", "No", "No", "No", "No", "No", "No", … ## $ bike_age <chr> "52", "66", "33", "52", "22", "15", "41", "14", … ## $ bike_age_group <chr> "50-59", "60-69", "30-39", "50-59", "20-24", "11… ## $ bike_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", "No",… ## $ bike_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ bike_direction <chr> "With Traffic", "With Traffic", "With Traffic", … ## $ bike_injury <chr> "B: Evident Injury", "C: Possible Injury", "C: P… ## $ bike_position <chr> "Bike Lane / Paved Shoulder", "Travel Lane", "Tr… ## $ bike_race <chr> "Black", "Black", "White", "Black", "White", "Na… ## $ bike_sex <chr> "Male", "Male", "Male", "Male", "Female", "Male"… ## $ driver_age <chr> "34", NA, "37", "55", "25", "17", NA, "50", "32"… ## $ driver_age_group <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-19", … ## $ driver_alcohol <chr> "No", "Missing", "No", "No", "No", "No", "Missin… ## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ driver_est_speed <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-15 mph… ## $ driver_injury <chr> "O: No Injury", "Unknown Injury", "O: No Injury"… ## $ driver_race <chr> "White", "Unknown/Missing", "Hispanic", "Black",… ## $ driver_sex <chr> "Male", NA, "Female", "Male", "Male", "Female", … ## $ driver_vehicle_type <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "Passe… ## $ crash_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", "No",… ## $ crash_date <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14DEC201… ## $ crash_day <chr> "Wednesday", "Wednesday", "Sunday", "Saturday", … ## $ crash_group <chr> "Motorist Overtaking Bicyclist", "Bicyclist Fail… ## $ crash_hour <int> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22, 12,… ## $ crash_location <chr> "Non-Intersection", "Intersection", "Non-Interse… ## $ crash_month <chr> "December", "November", "November", "December", … ## $ crash_severity <chr> "B: Evident Injury", "C: Possible Injury", "C: P… ## $ crash_time <time> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13:27:0… ## $ crash_type <chr> "Motorist Overtaking - Undetected Bicyclist", "B… ## $ crash_year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, … ## $ ambulance_req <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", … ## $ hit_run <chr> "No", "Yes", "No", "No", "No", "No", "Yes", "No"… ## $ light_condition <chr> "Dark - Roadway Not Lighted", NA, "Dark - Roadwa… ## $ road_character <chr> "Straight - Level", "Straight - Level", "Straigh… ## $ road_class <chr> "State Secondary Route", "Local Street", "US Rou… ## $ road_condition <chr> "Dry", "Dry", "Dry", "Water (Standing, Moving)",… ## $ road_configuration <chr> "Two-Way, Not Divided", "Two-Way, Divided, Unpro… ## $ road_defects <chr> "None", NA, "None", "None", "None", "None", "Non… ## $ road_feature <chr> "No Special Feature", "T-Intersection", "No Spec… ## $ road_surface <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth Asph… ## $ num_bikes_ai <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ num_bikes_bi <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ num_bikes_ci <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ num_bikes_ki <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ num_bikes_no <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ num_bikes_to <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ num_bikes_ui <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ num_lanes <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", "8 la… ## $ num_units <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, … ## $ distance_mi_from <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"… ## $ frm_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ rte_invd_cd <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ towrd_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ geo_point <chr> "35.3336070056, -77.9955023901", "36.3151872016,… ## $ geo_shape <chr> "{\"type\": \"Point\", \"coordinates\": [-77.995… ``` --- ## A Grammar of Data Manipulation **dplyr** is based on the concepts of functions as verbs that manipulate data frames. .pull-left[ ![](img/02/dplyr-part-of-tidyverse.png) ] .pull-right[ .midi[ - `filter`: pick rows matching criteria - `slice`: pick rows using index(es) - `select`: pick columns by name - `pull`: grab a column as a vector - `arrange`: reorder rows - `mutate`: add new variables - `distinct`: filter for unique rows - `sample_n` / `sample_frac`: randomly sample rows - `summarise`: reduce variables to values - ... (many more) ] ] --- ## **dplyr** rules for functions - First argument is *always* a data frame - Subsequent arguments say what to do with that data frame - Always return a data frame - Doesn't modify in place --- ## A note on piping and layering - The `%>%` operator in **dplyr** functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code. -- - The `+` operator in **ggplot2** functions is used for "layering". This means you create the plot in layers, separated by `+`. --- ## `filter` to select a subset of rows for crashes in Durham County .small[ ```r ncbikecrash %>% * filter(county == "Durham") ``` ``` ## # A tibble: 340 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 2452 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 2 2441 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 3 2466 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 4 549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban ## 5 598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban ## 6 603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban ## 7 3974 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 8 7134 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 9 1670 Durh… Durham Piedm… Commercial Urban (… INFINI… Urban ## 10 1773 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## # … with 330 more rows, and 58 more variables: speed_limit <chr>, ## # traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>, ## # bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>, ## # bike_direction <chr>, bike_injury <chr>, bike_position <chr>, ## # bike_race <chr>, bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, driver_est_speed <chr>, ## # driver_injury <chr>, driver_race <chr>, driver_sex <chr>, ## # driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>, ## # crash_day <chr>, crash_group <chr>, crash_hour <int>, crash_location <chr>, ## # crash_month <chr>, crash_severity <chr>, crash_time <time>, ## # crash_type <chr>, crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, frm_road <chr>, ## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr> ``` ] --- ## `filter` for many conditions at once for crashes in Durham County where biker was 0-5 years old .small[ ```r ncbikecrash %>% filter(county == "Durham", bike_age_group == "0-5") ``` ``` ## # A tibble: 4 x 66 ## object_id city county region development locality on_road rural_urban ## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 4062 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 2 414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban ## 3 3016 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 4 1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban ## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>, ## # driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>, ## # driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>, ## # driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>, ## # crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, frm_road <chr>, ## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr> ``` ] --- ## Logical operators in R operator | definition || operator | definition ------------|------------------------------||--------------|---------------- `<` | less than ||`x` | `y` | `x` OR `y` `<=` | less than or equal to ||`is.na(x)` | test if `x` is `NA` `>` | greater than ||`!is.na(x)` | test if `x` is not `NA` `>=` | greater than or equal to ||`x %in% y` | test if `x` is in `y` `==` | exactly equal to ||`!(x %in% y)` | test if `x` is not in `y` `!=` | not equal to ||`!x` | not `x` `x & y` | `x` AND `y` || | --- ## `select` to keep variables .small[ ```r ncbikecrash %>% filter(county == "Durham", bike_age_group == "0-5") %>% select(locality, speed_limit) ``` ``` ## # A tibble: 4 x 2 ## locality speed_limit ## <chr> <chr> ## 1 Urban (>70% Developed) 30 - 35 MPH ## 2 Urban (>70% Developed) 5 - 15 MPH ## 3 Urban (>70% Developed) 20 - 25 MPH ## 4 Urban (>70% Developed) 20 - 25 MPH ``` ] --- ## `select` to exclude variables .small[ ```r ncbikecrash %>% select(-object_id) ``` ``` ## # A tibble: 7,467 x 65 ## city county region development locality on_road rural_urban speed_limit ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural 50 - 55 M… ## 2 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban 30 - 35 M… ## 3 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural 50 - 55 M… ## 4 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban 30 - 35 M… ## 5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban <NA> ## 6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural 50 - 55 M… ## 7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural 30 - 35 M… ## 8 Rale… Wake Piedm… Commercial Urban (… PERSON… Urban 30 - 35 M… ## 9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban 30 - 35 M… ## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban 20 - 25 M… ## # … with 7,457 more rows, and 57 more variables: traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>, ## # driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>, ## # driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>, ## # driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>, ## # crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <int>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <int>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>, ## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>, ## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>, ## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>, frm_road <chr>, ## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr> ``` ] --- ## `select` a range of variables .small[ ```r ncbikecrash %>% select(city:locality) ``` ``` ## # A tibble: 7,467 x 5 ## city county region development locality ## <chr> <chr> <chr> <chr> <chr> ## 1 None - Rural C… Wayne Coastal Farms, Woods, Pas… Rural (<30% Developed) ## 2 Henderson Vance Piedmont Residential Mixed (30% To 70% Dev… ## 3 None - Rural C… Lincoln Piedmont Farms, Woods, Pas… Rural (<30% Developed) ## 4 Whiteville Columbus Coastal Commercial Urban (>70% Developed) ## 5 Wilmington New Hanov… Coastal Residential Urban (>70% Developed) ## 6 None - Rural C… Robeson Coastal Farms, Woods, Pas… Rural (<30% Developed) ## 7 None - Rural C… Richmond Piedmont Residential Mixed (30% To 70% Dev… ## 8 Raleigh Wake Piedmont Commercial Urban (>70% Developed) ## 9 Whiteville Columbus Coastal Residential Rural (<30% Developed) ## 10 New Bern Craven Coastal Residential Urban (>70% Developed) ## # … with 7,457 more rows ``` ] --- ## `summarize` to reduce variables to values .small[ ```r ncbikecrash %>% summarize(avg_hr = mean(crash_hour)) ``` ``` ## # A tibble: 1 x 1 ## avg_hr ## <dbl> ## 1 14.7 ``` ] --- ## `group_by` to do calculations on groups .small[ ```r ncbikecrash %>% group_by(hit_run) %>% summarise(avg_hr = mean(crash_hour)) ``` ``` ## # A tibble: 2 x 2 ## hit_run avg_hr ## <chr> <dbl> ## 1 No 14.6 ## 2 Yes 15.0 ``` ] --- ## `mutate` to add new variables .small[ ```r ncbikecrash %>% mutate(weather_clear = case_when( weather == "Clear" ~ "Yes", TRUE ~ "No" )) ``` ] <img height="400" src="img/02/dplyr_mutate.png"> </img> .my-footer[ <font size="2"> Artwork by @allison_horst </font> ] --- ## "Save" when you `mutate` Most often when you define a new variable with `mutate` you'll also want to save the resulting data frame, often by writing over the original data frame. .small[ ```r *ncbikecrash <- ncbikecrash %>% mutate(weather_clear = case_when( weather == "Clear" ~ "Yes", TRUE ~ "No" )) ``` ] --- ## "Save" when you `mutate` Most often when you define a new variable with `mutate` you'll also want to save the resulting data frame, often by writing over the original data frame. .small[ ```r ncbikecrash %>% mutate(weather_clear = case_when( weather == "Clear" ~ "Yes", TRUE ~ "No" * )) -> ncbikecrash ``` ] --- ## Check before you move on .small[ ```r ncbikecrash %>% count(weather_clear, weather) ``` ``` ## # A tibble: 6 x 3 ## weather_clear weather n ## <chr> <chr> <int> ## 1 No Cloudy 951 ## 2 No Fog, Smog, Smoke 17 ## 3 No Other 15 ## 4 No Rain 295 ## 5 No Snow, Sleet, Hail, Freezing Rain/Drizzle 8 ## 6 Yes Clear 6181 ``` ] --- ## <i class="fas fa-laptop"></i> `NC bike crashes` - Go to the [sta-363-s20 GitHub organization](https://github.com/sta-363-s20) and search for `appex-02-nc-bike-crashes` - Clone this repository into RStudio Cloud - Complete the exercises - **Commit** your changes - **Push** them back to GitHub