Due: Monday 2020-02-10 at 11:59pm
In this lab we are going to classify speech data using linear discriminant analysis and quadratic discriminant analysis. A few reminders:
lab-02-lda-qda-YOUR-GITHUB-HANDLE. This repo contains a template you can build on to complete your assignment.
In this lab we will work with three packages: tidyverse, which is a collection of packages for doing data analysis in a “tidy” way; tidymodels, which is a collection of packages for doing statistical analysis; and MASS, a package that contains the functions to fit the models.
Install these packages by running the following in the console.
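A minimal sketch of the install step, assuming all three packages are installed from CRAN with a single call (you only need to do this once per machine):

```r
# Run once in the Console; all three packages are assumed to come from CRAN.
install.packages(c("tidyverse", "tidymodels", "MASS"))
```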
Now that the necessary packages are installed, you should be able to Knit your document and see the results.
If you’d like to run your code in the Console as well, you’ll also need to load the packages there. To do so, run the following in the Console.
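A sketch of the loading step, assuming the three packages named above are already installed:

```r
# Load the packages used in this lab.
library(tidyverse)
library(tidymodels)
library(MASS)
```

Because MASS is loaded last here, its select() function masks the one from dplyr; if you need the dplyr version later, you can call it as dplyr::select().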
Note that the packages are also loaded with the same commands in your R Markdown document.
Your email address is the address tied to your GitHub account and your name should be first and last name.
To confirm that the changes have been implemented, run the following:
If you would like your git password cached for a week for this project, type the following in the Terminal:
Currently your project is called Untitled Project. Update the name of your project to be “Lab 02 - LDA and QDA”.
Before we introduce the data, let’s warm up with some simple exercises.
Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.
The first few exercises will involve using a data frame that you will enter based on a table.
The remainder of the exercises will be based on speech data. These data arose from a collaboration between Andreas Buja, Werner Stuetzle and Martin Maechler. The data were extracted from the TIMIT database (TIMIT Acoustic-Phonetic Continuous Speech Corpus, NTIS, US Dept of Commerce) which is a widely used resource for research in speech recognition. A dataset was formed by selecting five phonemes for classification based on digitized speech from this database. The phonemes are transcribed as follows: “sh” as in “she”, “dcl” as in “dark”, “iy” as the vowel in “she”, “aa” as the vowel in “dark”, and “ao” as the first vowel in “water”. From continuous speech of 50 male speakers, 4509 speech frames of 32 msec duration were selected, approximately 2 examples of each phoneme from each speaker.
The data contain 256 columns labelled x.1 - x.256 and a response column labelled g. The response column, g, has five classes: ao, aa, iy, dcl, and sh. The predictors, x.1 - x.256, consist of logged periodograms at various frequencies. A periodogram is essentially an estimate of the spectral density of a signal; in this case we are using sound data from subjects to predict what sound they were saying. Here is a plot of one of the observations:
Here is a plot of all of the observations, split by the response variable, the sound they were saying:
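To make the idea of a log-periodogram concrete, here is a small sketch on a synthetic signal. It is not how the lab data were produced: the 16 kHz sampling rate, the 512-sample frame, the noisy 1 kHz tone, and the way the 256 frequencies are picked out are all assumptions made for illustration.

```r
# Assumed frame: 512 samples at 16 kHz (32 ms). A noisy 1 kHz tone stands in
# for real speech.
set.seed(1)
fs <- 16000
n  <- 512
time_s <- (0:(n - 1)) / fs
signal <- sin(2 * pi * 1000 * time_s) + rnorm(n, sd = 0.1)

# Raw periodogram: squared modulus of the discrete Fourier transform, over n.
pgram <- Mod(fft(signal))^2 / n

# Keep 256 positive frequencies (dropping the zero-frequency term) and take
# logs, giving one vector shaped like x.1 - x.256 (assumed mapping).
log_pgram <- log(pgram[2:(n / 2 + 1)])
freq_hz   <- (1:(n / 2)) * fs / n

plot(freq_hz, log_pgram, type = "l",
     xlab = "Frequency (Hz)", ylab = "Log-periodogram")

# Frequency at which the log-periodogram peaks
freq_hz[which.max(log_pgram)]
```

The peak sits at the frequency of the tone, which is the sense in which the periodogram estimates where the signal's power is concentrated.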
x | y |
---|---|
2.1 | 1 |
1.7 | 1 |
2.0 | 1 |
3.8 | 1 |
4.2 | 1 |
0.8 | 2 |
1.1 | 2 |
1.3 | 2 |
1.5 | 2 |
3.8 | 2 |
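One way to enter this table as a data frame is sketched below. The name df is a placeholder, and base R's data.frame() is used so the snippet runs without extra packages; tibble() from the tidyverse would work the same way.

```r
# Enter the table by hand; y is a class label, so it is stored as a factor.
df <- data.frame(
  x = c(2.1, 1.7, 2.0, 3.8, 4.2, 0.8, 1.1, 1.3, 1.5, 3.8),
  y = factor(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))
)

str(df)       # check the column types
table(df$y)   # number of observations in each class
```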
Using the data created in Exercise 1, calculate the discriminant scores for \(x = 3\) without using the lda() function. What class would you classify this observation into?
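A sketch of one way to organize this calculation is below. It assumes the usual one-predictor LDA discriminant score, with class means, class proportions as priors, and a pooled variance that divides by n - K. The names df, mu, prior, sigma2, and discriminant_score are placeholders.

```r
df <- data.frame(
  x = c(2.1, 1.7, 2.0, 3.8, 4.2, 0.8, 1.1, 1.3, 1.5, 3.8),
  y = factor(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))
)

# Class means, class proportions (priors), and pooled within-class variance.
mu     <- sapply(split(df$x, df$y), mean)
prior  <- c(table(df$y)) / nrow(df)
K      <- nlevels(df$y)
sigma2 <- sum((df$x - mu[as.character(df$y)])^2) / (nrow(df) - K)

# delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 * sigma^2) + log(pi_k)
discriminant_score <- function(x0) {
  x0 * mu / sigma2 - mu^2 / (2 * sigma2) + log(prior)
}

scores <- discriminant_score(3)
scores                      # one score per class
names(which.max(scores))    # class with the largest score
```

The observation is assigned to whichever class has the largest discriminant score.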
Using the lda() function from the MASS package, check your work for Exercise 2.
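A minimal sketch of that check, assuming the table was entered as a data frame named df (a placeholder name) with y stored as a factor:

```r
library(MASS)

df <- data.frame(
  x = c(2.1, 1.7, 2.0, 3.8, 4.2, 0.8, 1.1, 1.3, 1.5, 3.8),
  y = factor(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))
)

# Fit LDA with x as the only predictor.
fit <- lda(y ~ x, data = df)

# Predict for a new observation at x = 3.
pred <- predict(fit, newdata = data.frame(x = 3))
pred$class       # predicted class
pred$posterior   # posterior probability of each class
```

predict() reports posterior probabilities rather than discriminant scores. The two agree on the classification, since the class with the largest discriminant score is also the class with the largest posterior probability.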
The data are read in with read_csv. There are two data frames, train and test. You will fit the models on train and evaluate the models using test. How many variables are in each data frame? How many observations?
Hint: Since we read this data in from a .csv file, we cannot rely on the ? help operator to find out more about these data frames. You can still use the glimpse() function (in the Console!).
Using the lda() function from the MASS package, perform a linear discriminant analysis predicting g, the outcome variable for the phoneme, from the rest of the variables using the train data. Create a visualization of LD1 versus LD2 colored by the outcome variable.
Hint: When there are lots of variables and you intend to include them all in a model, you can use . as a shortcut in R. For example, lm(y ~ ., data = df) would include all variables except y in the model.
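The pattern can be sketched on a built-in data set. Here iris stands in for train and Species plays the role of g; both are stand-ins, not the lab data, and base graphics are used so the snippet runs without extra packages.

```r
library(MASS)

# Species ~ . uses every other column of iris as a predictor.
fit <- lda(Species ~ ., data = iris)

# predict() returns the discriminant scores in $x, one column per LD.
scores <- data.frame(predict(fit)$x, Species = iris$Species)

# LD1 versus LD2, colored by the outcome variable.
plot(scores$LD1, scores$LD2, col = as.integer(scores$Species), pch = 19,
     xlab = "LD1", ylab = "LD2")
legend("topright", legend = levels(scores$Species),
       col = seq_along(levels(scores$Species)), pch = 19)
```

With ggplot2 loaded, the same picture could be drawn with ggplot(scores, aes(LD1, LD2, color = Species)) + geom_point().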
Use the predict() function to capture the predicted class for each observation in the test data frame. Add this column to the test data frame using the mutate() function.
Below is some code to get you started.
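A sketch of the predict-then-mutate pattern, using a random split of the built-in iris data as a stand-in for train and test. The split, the model name fit, and the outcome Species are placeholders; in the lab, the outcome is g and the data frames are already provided.

```r
library(MASS)
library(dplyr)

# Stand-in data: hold out 50 rows of iris as a test set.
set.seed(1)
test_rows <- sample(nrow(iris), 50)
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

fit <- lda(Species ~ ., data = train)

# predict() returns a list; $class holds the predicted label for each row.
test <- test %>%
  mutate(class = predict(fit, newdata = test)$class)

head(test)
```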
Plot a confusion matrix using the test data frame. How did this model perform?
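One simple way to build and plot a confusion matrix is sketched below with base R's table() and mosaicplot(). As before, a split of iris stands in for the lab data, and the object names are placeholders.

```r
library(MASS)

# Stand-in data: hold out 50 rows of iris as a test set.
set.seed(1)
test_rows <- sample(nrow(iris), 50)
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

fit <- lda(Species ~ ., data = train)
predicted <- predict(fit, newdata = test)$class

# Rows are predicted classes, columns are true classes.
conf <- table(Predicted = predicted, Truth = test$Species)
conf

# Tile areas are proportional to the counts in each cell.
mosaicplot(conf, main = "LDA confusion matrix (test data)", color = TRUE)
```

Counts on the diagonal are correct classifications; off-diagonal counts show which classes the model confuses with each other.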
Calculate the accuracy of the linear discriminant analysis model in the train data frame and in the test data frame. How do they compare?
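Accuracy is the proportion of observations whose predicted class matches the true class. A sketch with a small helper, again using a split of iris as a stand-in (the helper name accuracy and the outcome Species are placeholders):

```r
library(MASS)

# Stand-in data: hold out 50 rows of iris as a test set.
set.seed(1)
test_rows <- sample(nrow(iris), 50)
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

fit <- lda(Species ~ ., data = train)

# Proportion of rows where the predicted class equals the true class.
accuracy <- function(model, data) {
  mean(predict(model, newdata = data)$class == data$Species)
}

train_acc <- accuracy(fit, train)
test_acc  <- accuracy(fit, test)
c(train = train_acc, test = test_acc)
```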
Now we are going to perform quadratic discriminant analysis. Using the qda() function, predict g using all remaining variables in the train data frame. Predict the class of all observations in the test data frame and plot a confusion matrix. Calculate the accuracy of the quadratic discriminant analysis model in the train data frame and the test data frame. How does this compare to the linear discriminant analysis model?
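The qda() function in MASS takes the same formula interface as lda(), so the workflow carries over directly. A sketch on the same iris stand-in (the object names and the outcome Species are placeholders):

```r
library(MASS)

# Stand-in data: hold out 50 rows of iris as a test set.
set.seed(1)
test_rows <- sample(nrow(iris), 50)
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]

fit_lda <- lda(Species ~ ., data = train)
fit_qda <- qda(Species ~ ., data = train)

# Confusion matrix for QDA on the test data.
qda_conf <- table(Predicted = predict(fit_qda, newdata = test)$class,
                  Truth = test$Species)
mosaicplot(qda_conf, main = "QDA confusion matrix (test data)", color = TRUE)

# Proportion of rows where the predicted class equals the true class.
accuracy <- function(model, data) {
  mean(predict(model, newdata = data)$class == data$Species)
}

# Train and test accuracy for both models, side by side.
results <- data.frame(
  model = c("LDA", "QDA"),
  train = c(accuracy(fit_lda, train), accuracy(fit_qda, train)),
  test  = c(accuracy(fit_lda, test),  accuracy(fit_qda, test))
)
results
```

QDA estimates a separate covariance matrix for each class, so each class in the training data needs more observations than there are predictors for the fit to succeed.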
If you only used the accuracy from the train data frame, which model would you have chosen? If you use the accuracy from the test data frame, which model do you choose?
If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?
If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?
In general, as the sample size n increases, do we expect the test prediction accuracy of QDA relative to LDA to improve, decline, or be unchanged? Why?