Welcome to Statistical Learning

Dr. D’Agostino McGowan

👋

Lucy D'Agostino McGowan

  mcgowald@wfu.edu
  Manchester 342
  Wed 10:00-11:00a, Thu 10:00-11:00a

Everything you will need will be posted at:

bit.ly/sta-363-s20

Statistical Learning Problems

  • Identify risk factors for breast cancer
  • Customize an email spam detection system
    • Data: 4601 labeled emails sent to George, who works at HP Labs
    • Input features: frequencies of words and punctuation

             george     you      hp    free       !     edu  remove
      spam     0.00    2.26    0.02    0.52    0.51    0.01    0.28
      email    2.27    1.27    0.90    0.07    0.11    0.29    0.01
  • Identify numbers in handwritten zip code
  • Establish the relationship between variables in population survey data
    • Data: income survey data for males from the central Atlantic region of the US, 2009
  • Classify pixels of an image
    • Usage ∈ {red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil}
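To make the spam example from this list concrete, here is a minimal sketch that assigns a message to whichever class has the closer average profile, using the averages from the spam/email table. The nearest-average rule and the two example messages are assumptions chosen for illustration; they are not the method used in the course materials.

```python
import math

# Average word/character frequencies per class, copied from the spam table.
FEATURES = ["george", "you", "hp", "free", "!", "edu", "remove"]
CLASS_MEANS = {
    "spam":  [0.00, 2.26, 0.02, 0.52, 0.51, 0.01, 0.28],
    "email": [2.27, 1.27, 0.90, 0.07, 0.11, 0.29, 0.01],
}

def classify(freqs):
    """Return the class whose average profile is closest in Euclidean distance."""
    return min(CLASS_MEANS, key=lambda label: math.dist(freqs, CLASS_MEANS[label]))

# Hypothetical message: mentions "george" and "hp" often, no "free" or "!".
work_message = [2.0, 1.0, 1.0, 0.0, 0.0, 0.5, 0.0]
# Hypothetical message: heavy on "you", "free", "!" and "remove".
promo_message = [0.0, 3.0, 0.0, 1.0, 1.0, 0.0, 0.5]

print(classify(work_message))
print(classify(promo_message))
```

A real spam filter would be trained on all 4601 labeled emails rather than on two rows of averages; the point here is only that word frequencies are the inputs and the spam/email label is the outcome.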

✌️ types of statistical learning

  • Supervised Learning
  • Unsupervised Learning

Supervised Learning

  • Outcome variable: Y (dependent variable, response, target)
  • Predictors: a vector of p predictors, X (inputs, regressors, covariates, features, independent variables)
  • In the regression problem, Y is quantitative (e.g., price, blood pressure)
  • In the classification problem, Y takes values in a finite, unordered set (survived/died, digit 0-9, cancer class of tissue sample)
  • We have training data (x_1, y_1), ..., (x_N, y_N). These are observations (examples, instances) of these measurements
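The distinction between the regression and classification problems can be sketched on a tiny training set. All data values below are made up, and the two methods (a least-squares line and a one-nearest-neighbor rule) are assumptions picked for brevity, not methods named on these slides.

```python
# Toy training data (x_i, y_i); every value is invented for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y_quant = [2.1, 3.9, 6.2, 7.8]             # quantitative outcome: regression
y_class = ["low", "low", "high", "high"]   # unordered labels: classification

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def nearest_neighbor(xs, labels, x_new):
    """Predict the label of the training point closest to x_new."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x_new))
    return labels[i]

slope, intercept = fit_line(x, y_quant)
print(slope * 5.0 + intercept)             # regression prediction at x = 5
print(nearest_neighbor(x, y_class, 3.6))   # classification prediction at x = 3.6
```

In both cases the training pairs are used to build a rule that maps a new x to a predicted Y; only the type of Y differs.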

What do you think are some objectives here?

Objectives

  • Accurately predict unseen test cases
  • Understand which inputs affect the outcome, and how
  • Assess the quality of our predictions and inferences

Unsupervised Learning

  • No outcome variable, just a set of predictors (features) measured on a set of samples
  • The objective is fuzzier: find groups of samples that behave similarly, find features that behave similarly, or find linear combinations of features with the most variation
  • It is difficult to know how well you are doing
  • different from supervised learning, but can be useful as a pre-processing step for supervised learning
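As one way to picture "find groups of samples that behave similarly", here is a minimal sketch of k-means clustering on made-up data. k-means is an assumption chosen for illustration; the slides do not name a specific method, and the samples and starting centers are hypothetical.

```python
import math

def k_means(points, centers, n_iter=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[k]
            for k, g in enumerate(groups)
        ]
    return centers, groups

# Made-up samples with two features each; no outcome variable is involved.
samples = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, groups = k_means(samples, centers=[samples[0], samples[3]])

print(centers)
print(groups)
```

Note that nothing in the code checks whether the groups are "right": with no outcome variable there is no label to compare against, which is why assessing the result is harder than in supervised learning.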

Let's go!

Create a GitHub account

Go to https://github.com/, and create an account (unless you already have one). Tips for selecting a username [1]:

  • Incorporate your actual name! People like to know who they’re dealing with. Also makes your username easier for people to guess or remember.
  • Reuse your username from other contexts if you can, e.g., Twitter or Slack.
  • Pick a username you will be comfortable revealing to your future boss.
  • Shorter is better than longer.
  • Be as unique as possible in as few characters as possible. In some settings GitHub auto-completes or suggests usernames.
  • Make it timeless. Don’t highlight your current university, employer, or place of residence.
  • Avoid words laden with special meaning in programming, like NA.

[1] Source: Happy git with R by Jenny Bryan.

Once done, place a green sticky on your laptop. If you have questions, place a pink sticky.

Let me know your GitHub name

bit.ly/sta-363-s20-ghsurvey

Once done, place a green sticky on your laptop. If you have questions, place a pink sticky.

Join RStudio.cloud

Go to bit.ly/sta-363-s20-rstudio-join and sign up.

Once done, place a green sticky on your laptop. If you have questions, place a pink sticky.

Create a classification model

  • Once you log on to RStudio Cloud, click on this course's workspace, "STA 363 - S20", then click "Projects".
  • You should see a project called zipcode; click it.
  • In the Files pane in the bottom right corner, find the file called zipcode.Rmd. Open it, and then click the "Knit" button. You will likely see a pop-up error; click "Try Again".
  • Go back to the file, change the name at the top to yours (in the YAML; we'll talk about what this means later), and knit again.
  • Then scroll to the recipe chunk, below the text "First we create a recipe". Instead of creating a model to classify whether the number is 0 or not, create a model to predict whether the number is 1. Knit again, and voilà!
Once done, place a green sticky on your laptop. If you have questions, place a pink sticky.

Let's take a tour - class website

  • Concepts introduced:
    • How to find slides
    • How to find assignments
    • How to find RStudio Cloud
    • How to get help
    • How to find policies

Course structure and policies

Class meetings

  • Interactive
  • Some lectures, lots of learn-by-doing
  • Bring your laptop to class every day

Diversity & Inclusiveness:

  • Intent: It is my intent that students from all backgrounds and perspectives be well served by this course, that students' learning needs be addressed both in and out of class, and that the diversity that students bring to this class be viewed as a resource, strength, and benefit. It is my intent to present materials and activities that are respectful of diversity: gender identity, sexuality, disability, age, socioeconomic status, ethnicity, race, nationality, religion, and culture. Let me know ways to improve the effectiveness of the course for you personally, or for other students or student groups.
  • If you have a name and/or set of pronouns that differ from those that appear in your official Wake Forest records, please let me know!
  • If you feel like your performance in the class is being impacted by your experiences outside of class, please don't hesitate to come and talk with me. I want to be a resource for you. If you prefer to speak with someone outside of the course, your academic dean is an excellent resource.
  • I (like many people) am still in the process of learning about diverse perspectives and identities. If something was said in class (by anyone) that made you feel uncomfortable, please talk to me about it.

Disability Policy

Students with disabilities who believe that they may need accommodations in the class are encouraged to contact the Learning Assistance Center & Disability Services at 336-758-5929 or lacds@wfu.edu as soon as possible, to better ensure that such accommodations are implemented in a timely fashion.

How to get help

All course discussion will be via GitHub: sta-363-s20/community

  • See course policies for tips on posting questions.
  • For personal and grade related questions, use email.

  • Go to https://github.com/sta-363-s20/community and bookmark this page.
  • In the issues tab find the issue created by Dr. D'Agostino McGowan (@LucyMcGowan) and click on it to view it.
  • Respond to it with a hello, or something else. Feel free to add some code-formatted text (inline, surrounded by single backticks, or on its own lines, surrounded by three backticks) or a hyperlink. Or try bolding or italicizing. You could also tag someone if you know their GitHub username. Your post doesn't have to be meaningful.
  • Hit Comment when you're done.
  • Read the "How to get help" section on the course policies page.

Study sessions

  • Monday evenings - location TBD

Math & Stats center

Academic integrity

Adhere to the Wake Forest Honor Code. Academic dishonesty will not be tolerated.

Sharing/reusing code

  • There are many online resources for sharing code (for example, StackOverflow) - you may use these resources but must explicitly cite where you have obtained code (both code you used directly and "paraphrased" code / code used as inspiration). Any reused code that is not explicitly cited will be treated as plagiarism.
  • You may discuss the content of assignments with others in this class. If you do so, please acknowledge your collaborator(s) at the top of your assignment, for example: "Collaborators: Gertrude Cox, Florence Nightingale David". Failure to acknowledge collaborators will result in a grade of 0. You may not copy code and/or answers directly from another student. If you copy someone else's work, both parties will receive a grade of 0.
  • Rather than copying someone else's work, ask for help. You are not alone in this course!

Course components:

  • Application exercises: Usually start in class and finish in teams by the next class period, check/no check
  • Reading assessments: no make-ups, lowest two will be dropped
  • Homework: on your own, lowest score dropped
  • Lab: start in class, lowest score dropped
  • Exams: 2 in-class midterms, 1 take-home midterm

Grading

Component                                Weight
Participation & application exercises    5%
Reading assessments                      10%
Homework                                 10%
Labs                                     10%
Midterm exam 1                           25%
Midterm exam 2                           25%
Midterm exam 3                           15%
  • Class attendance is a firm expectation; frequent absences or tardiness will be considered a legitimate cause for grade reduction.
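The weights in the grading table sum to 100%, so a course grade can be sketched as a weighted average. This is a minimal sketch under stated assumptions: each component score is on a 0-100 scale, the example scores are hypothetical, and dropped lowest scores and attendance-based reductions are not modeled.

```python
# Weights from the grading table, in percent.
WEIGHTS = {
    "Participation & application exercises": 5,
    "Reading assessments": 10,
    "Homework": 10,
    "Labs": 10,
    "Midterm exam 1": 25,
    "Midterm exam 2": 25,
    "Midterm exam 3": 15,
}

def course_grade(scores):
    """Weighted average of component scores, each on a 0-100 scale."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight

# Hypothetical scores for one student.
example_scores = {
    "Participation & application exercises": 100,
    "Reading assessments": 90,
    "Homework": 80,
    "Labs": 90,
    "Midterm exam 1": 80,
    "Midterm exam 2": 90,
    "Midterm exam 3": 100,
}

print(sum(WEIGHTS.values()))
print(course_grade(example_scores))
```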

Late/missed work policy

  • Late work policy for homework and lab assignments:
    • late, but within 24 hours of due date/time: -50%
    • any later: no credit
  • Late work will not be accepted for the take-home Midterm exam 3.
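The homework and lab late policy above can be written as a small rule. This sketch assumes that "-50%" means the earned score is halved and that exactly 24 hours late still counts as "within 24 hours"; neither detail is stated on the slide, and the function name is hypothetical. It does not apply to the take-home Midterm exam 3, for which late work is not accepted.

```python
def late_credit(score, hours_late):
    """Apply the homework/lab late policy to an earned score.

    Assumption: "-50%" is read as halving the earned score.
    """
    if hours_late <= 0:
        return float(score)   # on time: full credit
    if hours_late <= 24:
        return score * 0.5    # late, but within 24 hours of the due time
    return 0.0                # any later: no credit

print(late_credit(90, 0))
print(late_credit(90, 3))
print(late_credit(90, 30))
```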

Other policies

  • Please refrain from texting or using your computer for anything other than coursework during class.
  • You must be in class on a day when you're scheduled to present; there are no make-ups for presentations.
  • Regrade requests must be made within 1 week of when the assignment is returned.

Intros

  • name
  • major / intended major
  • what you hope to get out of this class OR fun fact

RStudio Cloud

  • If you had issues creating your RStudio Cloud account, opening the project, or running the analysis, stick around to try it again.
  • If RStudio Cloud worked for you and you were able to run the analysis, you're free to leave.