Setting a Good Threshold for a Classifier (Dropout as Bayesian Approximation)
I was facing an issue today in my speaker identification pipeline. The pipeline was giving different results when the input data was duplicated many times over, which definitely seemed strange.
The pipeline identifies the names of speakers in a video given a database of known speaker audio and their names. There are a few preparation steps before the core part of the pipeline. Amazon Transcribe is first applied to the videos to figure out the different unique speaker tracks. Each speaker track is then split into chunks of a few seconds each, and each chunk is run through a pre-trained neural network model that takes in audio and outputs high-level embeddings in speaker space. The model is trained to plot audio samples from the same speaker close together and audio samples from different speakers further apart. This embedding makes the speaker identification much easier.
The central part of my model is a neural network that is trained to discriminate between embeddings from different speakers. Critically, it has a measure of confidence using dropout as a Bayesian approximation. This means that not only do I get the probability that a speaker track is person X, but I also get a measure of confidence in that prediction. If the confidence is low, then I can say that this speaker track is from an unknown person. Without the measure of confidence, the model would tend to label a speaker track with a known person’s name even when it should not. For those in the know, this partly has to do with the softmax function at the end of the neural network, where predictions for different classes must sum to 1.
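As a rough illustration of dropout as a Bayesian approximation: keep dropout active at test time, run many stochastic forward passes, and use the spread across passes as a confidence measure. The toy two-layer network, weights, and class count below are made-up stand-ins for the real model.

```python
# Minimal MC-dropout sketch, assuming a hypothetical 2-layer network
# mapping 512-dim speaker embeddings to 3 known speakers.
import numpy as np

rng = np.random.default_rng(0)

# Fabricated "trained" weights for illustration only.
W1 = rng.normal(size=(512, 64))
W2 = rng.normal(size=(64, 3))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(x, drop_p=0.5):
    h = np.maximum(x @ W1, 0)            # ReLU hidden layer
    mask = rng.random(h.shape) > drop_p  # dropout stays ON at test time
    h = h * mask / (1.0 - drop_p)        # inverted-dropout scaling
    return softmax(h @ W2)

def mc_dropout_predict(x, n_passes=100):
    # Many stochastic passes; mean is the prediction, the spread
    # across passes reflects the model's uncertainty.
    probs = np.stack([forward(x) for _ in range(n_passes)])
    return probs.mean(axis=0), probs.std(axis=0)

x = rng.normal(size=512)                 # one 3 s clip's embedding
mean_prob, uncertainty = mc_dropout_predict(x)
```

A prediction with a high mean probability but a large spread across passes is exactly the case where the pipeline would rather say "unknown" than commit to a name.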
I figure out a good threshold for determining a known versus unknown speaker by taking an independent set of speakers. I pick a probability threshold such that the model labels these out-of-sample speakers as unknown 90-95% of the time. This way, if a speaker is brand new, the vast majority of the time it will be labeled as unknown.
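The calibration step can be sketched as taking a quantile of the model's maximum class probabilities on the held-out speakers. The probabilities below are fabricated for illustration; in the real pipeline they would come from the classifier.

```python
# Sketch of picking the unknown-speaker threshold from held-out speakers
# who are NOT in the database; the array is fabricated stand-in data.
import numpy as np

rng = np.random.default_rng(1)

# Max predicted probability for each out-of-sample clip.
oos_max_probs = rng.beta(a=5, b=2, size=1000)

# Choose the threshold so ~90% of brand-new speakers fall below it
# and are therefore labeled "unknown".
threshold = np.quantile(oos_max_probs, 0.90)

unknown_rate = (oos_max_probs < threshold).mean()
```

Anything at or above the threshold gets a known speaker's name; anything below it is labeled unknown.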
The general pipeline is the following:
- input audio
- -> extract 3s audio clips for each speaker
- -> dimensionality reduction (512 dims to X dims, where X usually ends up in the 100s)
- -> training a model to discriminate between different speaker embeddings
- -> determining a good threshold for the model
- -> predicting the speaker identity of new embeddings
Testing It Out
I tried to run my model with 2 training videos (3 speakers) and 1 test video (1 speaker from the training video). The results showed that the probability of correctly identifying the speaker was high (average was almost 1 and 90% of the probabilities were close to 1). However, when considering the confidence estimates for each audio sample, those estimates for the test video were low. Remember that for each speaker, we have many 3s audio samples. For the test speaker, we can make a prediction for each audio sample, and if we do this, we find that around 80% of the samples were labeled as unknown.
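The per-sample unknown rate described above is just the fraction of 3 s clips whose confidence falls below the calibrated threshold. Both arrays below are fabricated placeholders, with the threshold set near 1 as observed.

```python
# Sketch of the per-sample "unknown" rate: each 3 s clip gets a
# confidence value, and clips below the threshold are labeled unknown.
import numpy as np

rng = np.random.default_rng(2)

threshold = 0.98                                 # calibrated threshold, near 1
confidences = rng.uniform(0.9, 1.0, size=200)    # one value per test clip

is_unknown = confidences < threshold
unknown_fraction = is_unknown.mean()
```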
This is especially strange given that the test video is actually the same as one of the training videos! And, what’s interesting is that if I duplicate the 2 training videos to make it something like 10 videos, then way fewer of the test audio samples are labeled as unknown, around 25% now.
Removing the PCA Step
Currently, I apply dimensionality reduction to the embeddings. This is because I noticed that the results of the neural network were very noisy, meaning the output from run to run could vary widely. I think this might be due to runs where we only had 1 or a few videos and hence had way too few observations for the neural network to properly learn. When I applied a PCA and took the top 99% of variance, I found the results from run to run were more stable. Most of the time, the PCA took the 512 input dimensions from the embedding space and compressed them to ~100-200 dimensions, which seems quite striking to me.
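The "top 99% of variance" step can be sketched with a plain SVD-based PCA: sort components by explained variance and keep just enough to cross 99%. The random embeddings below are stand-ins for the real 512-dim ones, so the number of kept components here will not match the ~100-200 seen in practice.

```python
# PCA keeping enough components to explain 99% of the variance,
# using plain NumPy SVD; X is fabricated stand-in embedding data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 512))          # 500 clips x 512-dim embeddings

Xc = X - X.mean(axis=0)                  # center before PCA
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = (S**2) / (S**2).sum()        # variance ratio per component
n_keep = np.searchsorted(np.cumsum(explained), 0.99) + 1

X_reduced = Xc @ Vt[:n_keep].T           # project onto kept components
```

With scikit-learn, `PCA(n_components=0.99)` does the same selection.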
This PCA is fit on a combination of the training, unknown (that’s the out-of-sample data I have), and test data. Hence, I wondered whether adding duplicate copies of the training data biases the PCA toward extracting the features most relevant to the training speakers, which would explain the effect seen above.
Indeed, that is what I found. When I removed the PCA step, even duplicating the input data did not have an effect on what was known or unknown. This solves one issue but leaves open the other question: why are so many of the test audio samples being identified as unknown?
Why is the threshold so high?
When looking at the thresholds, they are close to 1 for our target speaker X. That means that a probability value for speaker X in the test set would need to be even higher than the threshold to be correctly classified. Anything below the threshold means that the input sample is unknown.
Changing the PCA
I can fit the PCA only on the training data rather than on the unknown and test samples as well. This is probably how it should be.
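Fitting on the training data only means the mean and components come from training embeddings, and the unknown/test embeddings are merely projected with them, so test data cannot influence the learned subspace. A sketch with fabricated stand-in data:

```python
# PCA fit on training embeddings only; test data is projected with
# the training-fit mean and components. All data here is fabricated.
import numpy as np

rng = np.random.default_rng(4)
X_train = rng.normal(size=(300, 512))
X_test = rng.normal(size=(100, 512))

mean = X_train.mean(axis=0)              # statistics from training only
_, S, Vt = np.linalg.svd(X_train - mean, full_matrices=False)

explained = (S**2) / (S**2).sum()
n_keep = np.searchsorted(np.cumsum(explained), 0.99) + 1
components = Vt[:n_keep]

# Same centering and projection applied to held-out data.
X_test_reduced = (X_test - mean) @ components.T
```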
Interestingly, if I do this, then you get better accuracy (fewer unknowns), but still below an adjusted probability of 0.5 (adjusted means taking the probability and scaling it by the fraction that’s unknown). And, adding more data to the mix improves the prediction.
Oh wait, doing this makes the results much more unstable. So it seems that the combined PCA helps to stabilize things.