I don't feel bad to confess this that ROC curve, AUC, True-positive and related terms took quite some time for me to understand. If today I contemplate on the reasons why I found this topic confusing. The first would be there are not many resources that explains intuitively what these mean. They just jump to the terms and the mathematical formula for them. The second being I had not used them even in my project work. You see the project work is never enough for all your learnings.
In this post, I will try to explain my understanding both intuitively and mathematically.
I will illustrate this concept with the help of an example.
There is a bank say SBI that wants to understand which of its future customers would default on a loan granted by the bank. The bank would already have historical data from the past years that says how many of the customers have defaulted, what type of customer were they and many other information about the past loans.
It is very rare that customers default. To give an example, say the bank has data having 1000 rows that contains information like age, gender, income, marital-status, other related columns and then a variable named default that says whether the customer has defaulted or not.
As I mentioned earlier, very few customers default on their loans. So a realistic example would be say 100 customers defaulting out of a total of 1000 customers.
The bank is interested in knowing if a customer would default or not. This is a typical binary classification problem. The bank would really be interested in the customers who are likely to default. Let us now try to understand terms like True-positive, True-negative, False-positive and other related terms.
There are two levels in the default variable that we are trying to predict-default and not-default. One has to define first whether default will be treated as positive or not-default as positive. It is just a convention. Normally, the class of interest is treated as positive. So in the bank's case, what do you think we should take as positive? You might have guessed it correctly, we will treat default as positive and not-default as negative.
We train a binary classification model on this dataset having 1000 rows. The predictions that the model makes for the training data will either be correct or incorrect.
There will be two cases for incorrect predictions
- Predicting positive when the actual was negative i.e classifying a customer as default when in actual he is not-default
- Predicting negative when the actual was positive i.e classifying a customer as not-default when in actual he is default
There will be two cases for correct predictions
- Predicting positive when the actual was positive i.e correctly classifying default as default
- Predicting negative when the actual was negative i.e correctly classifying not-default as not-default
Now, FP, FN are incorrect predictions (notice False in the name) as the name suggests and TP, TN are the correct predictions. Don't be in a hurry here. Take some time to digest these 2 lettered acronyms. Read them loud. Take a notebook and write them down on your own.
Once we are comfortable with these terms we will discuss about something called confusion-matrix. Don't get confused yet. If you understood TP, TN, FP, FN then confusion-matrix is just a matrix having these values. The diagonal elements contain the count of correct predictions (TP, TN) whereas the off-diagonal contain the count of incorrect predictions (FP, FN). The rows are the predicted and the columns are the actual. This looks something like this.
Why we need TPR, FPR if we already have mis-classification error?
The data I describe above is a typical case of imbalanced data wherein one of the class is having majority of observations (90% non-defaulters (negatives) in our data) and the remaining class is a minority (only 10% defaulters (positives)). In such cases, the predictions on new dataset will be skewed towards negatives i.e the model will classify a lot of defaulters (positives) into negatives. The bank can't afford to have such predictions. The bank wants to know for sure the defaulters (positives). Imagine, the loss to bank if the model classify a probable defaulter (positive) to non-defaulter.
In such cases, accuracy corresponding to mis-classification alone is not acceptable. The bank would be more interested in correctly classifying the positives into positive i.e the bank wants to classify the defaulters into defaulters without fail.
Comes into picture TPR, TNR, FPR, FNR. These 3 lettered acronyms are nothing but the rates of TP, TN, FP and FN respectively. Below is the formula. To digest the formula, let's move to our data having 1000 rows - 100 defaulters (positives) and 900 non-defaulters(negatives). Suppose we employed a logistic regression that classified 80 defaulters correctly and incorrectly classified 90 non-defaulters as defaulters. TPR = 80 / 100 = 0.8 FPR = 90/ 900 = 0.1
How does the TPR and FPR gets calculated?
Whenever you do any classification, the model always gives you probabilities of each observation getting classified in each of the classes.
Based on what cut-off you choose, you will get different predictions for the data and hence different TPR and FPR overall. You can choose whatever probability cut-off between [0, 1] and you will get different tuples of TPR and FPR.
TPR and FPR are be generated for each of the probability value one chooses. And these values are then plotted on an ROC curve.
One can plot these tuples (probability, FPR, TPR) on a graph. You know what this graph is called? ROC-Receiver Operating Characteristics. There is a trade-off between TPR and FPR. Depending on the requirement one can choose the probability cut-off that best fulfills their purpose. For instance, in the bank's case, the bank wants to not miss a single defaulter(positives) i.e the bank wants a higher TPR. The ROC curve looks something like this.
Alright, all this is clear to me.
What about AUC?
AUC is nothing but the area under curve of ROC curve. Let's say we built a logistic regression model that gave us the probabilities for each row. Now we try with probability cut-offs from 0.1 to 1.0 with step size of 0.1 i.e we would have 10 probabilites to try with and corresponding to each of the 10 values we would have corresponding (FPR, TPR). If we plot these values on a graph we would get a graph having 10 points. This 10 point graph is what we call an ROC curve and the area under this graph is called AUC.
The AUC is a common evaluation metric for binary classification problems. Consider a plot of the true positive rate vs the false positive rate as the threshold value for classifying an item as 0 or is increased from 0 to 1: if the classifier is very good, the true positive rate will increase quickly and the area under the curve will be close to 1. If the classifier is no better than random guessing, the true positive rate will increase linearly with the false positive rate and the area under the curve will be around 0.5.
One characteristic of the AUC is that it is independent of the fraction of the test population which is class 0 or class 1: this makes the AUC useful for evaluating the performance of classifiers on unbalanced data sets.
The larger the area the better. If we have to choose between two classifiers having different AUCs, we choose the one having larger AUC.
How do you decide what probability cut-off should you choose to classify them into either of the classes?
Some say they take a cut-off of 0.5 i.e observation having probability greater than 0.5 will be classified as positive or else negative. Do you see the problem in here? Try thinking.
Come into picture ROC curve. Look at the ROC curve and depending on what value of TPR or FPR you want from your model, you take probability corresponding to that point.
ROC, AUC can easily be plotted and calculated using modern analytical tools like R or Python. But I would suggest for better understanding of this topic, try writing your own code.
I will make an attempt of the same soon and will share it on this space. Keep learning and sharing.
Did you find the article useful? If you did, share your thoughts on the topic in the comments.