Lab 02 - LDA and QDA

Due: Monday 2020-02-10 at 11:59pm

Introduction

In this lab we are going to classify speech data using linear discriminant analysis and quadratic discriminant analysis. A few reminders:

Getting started

Packages

In this lab we will work with three packages: tidyverse which is a collection of packages for doing data analysis in a “tidy” way, tidymodels which is a collection of packages for doing statistical analysis, and MASS a package that contains the functions to fit the models.

Install these packages by running the following in the console.

install.packages("tidyverse")
install.packages("tidymodels")
install.packages("MASS")

Now that the necessary packages are installed, you should be able to Knit your document and see the results.

If you’d like to run your code in the Console as well you’ll also need to load the packages there. To do so, run the following in the console.

library(tidyverse) 
library(tidymodels)
library(MASS)

Note that the packages are also loaded with the same commands in your R Markdown document.

Housekeeping

Git configuration

Your email address is the address tied to your GitHub account and your name should be first and last name.

To confirm that the changes have been implemented, run the following

Password caching

If you would like your git password cached for a week for this project, type the following in the Terminal:

Project name:

Currently your project is called Untitled Project. Update the name of your project to be “Lab 02 - LDA and QDA”.

Warm up

Before we introduce the data, let’s warm up with some simple exercises.

YAML:

Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.

Commiting and pushing changes:

Data

The first few exercises will involve using a data frame that you will enter based on a table.

The remainder of the exercises will be based on speech data. These data arose from a collaboration between Andreas Buja, Werner Stuetzle and Martin Maechler. The data were extracted from the TIMIT database (TIMIT Acoustic-Phonetic Continuous Speech Corpus, NTIS, US Dept of Commerce) which is a widely used resource for research in speech recognition. A dataset was formed by selecting five phonemes for classification based on digitized speech from this database. The phonemes are transcribed as follows: “sh” as in “she”, “dcl” as in “dark”, “iy” as the vowel in “she”, “aa” as the vowel in “dark”, and “ao” as the first vowel in “water”. From continuous speech of 50 male speakers, 4509 speech frames of 32 msec duration were selected, approximately 2 examples of each phoneme from each speaker.

The data contain 256 columns labelled x.1 - x.256 and a response column labelled g. The response column, g has five classes: ao, aa, iy, dcl, and sh. The predictors, x.1 - x.256 consist of logged periograms at various frequencies. A periodogram is essentially an estimate of the spectral density of a signal; in this case we are using sound data from subjects to predict what sound they were saying. Here is a plot of one of the observations:

Here is a plot of all of the observations, split by the response variable, the sound they were saying:

Exercises

By hand

  1. Create a data frame containing two columns, \(x\) and \(y\), using the data in the table below.
x y
2.1 1
1.7 1
2.0 1
3.8 1
4.2 1
0.8 2
1.1 2
1.3 2
1.5 2
3.8 2
  1. Using the data created in Exercise 1, calculate the discriminant scores for \(x = 3\) without using the lda() function. What class would you classify this observation into?

  2. Using the lda() function from the MASS package, check your work for Exercise 2.

Speech data

  1. Load the data using read_csv. There are two data frames, train and test. You will fit the models on train and evaluate the models using test. How many variables are in each data frame? How many observations?

Hint: Since we read this data in from a .csv file, we cannot rely on the ? to find out more about these data frames. You can still use the glimpse() function (in the Console!).

  1. Using the lda() function from the MASS package, perform a linear discriminant analysis predicting g, the outcome variable for the phoneme, from the rest of the variables using the train data. Create a visualization of LD1 versus LD2 colored by the outcome variable.

Hint: When there are lots of variables and you intend to include them all in a model, you can use . as a shortcut in R. For example, lm(y ~ ., data = df) would include all variables except y in the model.

  1. Using the model created from Exercise 5, use the predict() function to capture the predicted class for each observation in the test data frame. Add this column to the test data frame using the mutate() function.

Below is some code to get you started.

  1. Plot a confusion matrix using the test data frame. How did this model perform?

  2. Calculate the accuracy of the linear discriminant analysis model in the train data frame and in the test data frame. How do they compare?

  3. Now we are going to perform quadratic discriminant analysis. Using the qda() function, predict g using all remaining variables in the train data frame. Predict the class of all observations in the test data frame and plot a confusion matrix. Calculate the accuracy of the quadratic discriminant analysis model in the train data frame and the test data frame. How does this compare to the linear discriminant analysis model?

  4. If you only used the accuracy from the train data frame, which model would you have chosen? If you use the accuracy from the test data frame, which model do you choose?

Conceptual questions

  1. If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

  2. If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

  3. In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?