Notes from data science training EP 7: Metrics – is it qualified?
After playing with supervised ML algorithms for regression and classification, it is time to validate how strong our models are.
Technical terms
- Bias
Biased data deviates from the “normal” values.
For example, we found the number 100 in an employees’ age list even though everyone should be under 50 years old.
- Variance
Variance means the data spreads beyond the “trending” values.
For example, we found overtime (OT) data for a factory department ranging from 0 – 20 hours when it should be narrower, such as 0 – 5 hours. This spread makes our predictions more difficult.
- Overfitting
This happens when our model tries to capture every single data point.
As a result, the model has low bias but high variance. A decision tree is an example of this case.
- Underfitting
This is the opposite of overfitting: the model captures only the general trend of the data.
The model has high bias but low variance. Regressions are examples. (A quick sketch of both cases follows the source link below.)
Source: Understanding the Bias-Variance Tradeoff
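To see the bias–variance trade-off in numbers, here is a minimal sketch (my own toy example, not from the training material) that fits a deep decision tree and a plain linear regression on the same noisy curve and compares their train/test scores:

```python
# A toy comparison (assumed example): a deep decision tree tends to overfit
# (low bias, high variance), a straight line tends to underfit (high bias, low variance).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 5, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)  # noisy sine wave

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Deep tree: near-perfect train score but a weaker test score (overfitting).
tree = DecisionTreeRegressor(max_depth=None).fit(X_train, y_train)
print("tree train/test r2:", tree.score(X_train, y_train), tree.score(X_test, y_test))

# Straight line on a curved pattern: similar, modest scores on both sets (underfitting).
line = LinearRegression().fit(X_train, y_train)
print("line train/test r2:", line.score(X_train, y_train), line.score(X_test, y_test))
```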
What tools can we use to test models?
Regressors’ benchmarks
Here is a recap from EP 4:
- \(r^2\) score
Compares predictions to the real results. Higher is better, with a maximum of 1.
It indicates how well our model performs compared to the baseline (refer to the dummy model in the last episode).
- \(MedAE\) or Median Absolute Error
The median of the absolute errors. Lower is better.
- \(MAE\) or Mean Absolute Error
The average of the absolute errors. Lower is better.
A large gap between MAE and MedAE hints that outliers are present in the data.
- \(MSE\) or Mean Squared Error
The average of the squared errors. Lower is better.
Because it squares each error, it shows how much large errors pull our model away from normally distributed residuals.
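As a quick reference, here is a short sketch (using made-up numbers, not data from the episode) that computes all four regression metrics with scikit-learn:

```python
# Hypothetical true values and predictions, just to show the metric calls.
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error, r2_score)

y_true = [3.0, 5.0, 2.5, 7.0, 4.2]
y_pred = [2.8, 5.4, 2.9, 6.1, 4.0]

print("r2   :", r2_score(y_true, y_pred))               # higher is better, max 1
print("MedAE:", median_absolute_error(y_true, y_pred))  # lower is better
print("MAE  :", mean_absolute_error(y_true, y_pred))    # lower is better
print("MSE  :", mean_squared_error(y_true, y_pred))     # lower is better, punishes big errors
```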
Classifiers’ benchmarks
- Accuracy score
The ratio of correct predictions to the total number of predictions.
Worst at 0 and best at 1.
- Precision score
The ratio of correct positive predictions to the total number of positive predictions.
Worst at 0 and best at 1.
- Recall score
The ratio of correct positive predictions to the number of real positive samples.
Worst at 0 and best at 1.
- \(F_1\) score
Calculated by the formula \(F_1 = (\frac{Precision^{-1} + Recall^{-1}}{2})^{-1} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}\).
It is used in most real-world problems because it stays informative even when the real data contains a large number of negatives.
Worst at 0 and best at 1.
- Confusion matrix
This matrix counts each combination of predicted and real labels. An alternative is pd.crosstab(), which (with its normalize option) displays the counts as proportions instead.
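Here is a companion sketch (again with made-up labels) showing the classifier metrics and both flavours of the confusion matrix:

```python
# Hypothetical binary labels and predictions, just to show the metric calls.
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))  # rows = real labels, columns = predictions

# pandas alternative: normalize=True turns the counts into proportions.
print(pd.crosstab(pd.Series(y_true, name="real"),
                  pd.Series(y_pred, name="predicted"),
                  normalize=True))
```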
Model Selection
How can we build a model that generates the best scores?
One solution is the sklearn.model_selection.GridSearchCV() class.
For example, we defined a GridSearchCV() with a parameter grid where “criterion” is “gini” and “max_depth” starts at 3 and goes up to, but not including, 10. “cv” is left at the library’s default cross-validation setting (5 folds in newer versions at the time of writing).
As a result, the sample model performs best when “max_depth” is 3, as shown in .best_params_, and .best_score_ reaches about 0.76. A minimal sketch of this setup is shown below.
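The sketch assumes a DecisionTreeClassifier and the toy Iris dataset, since the notes do not say which model or data were used; the exact best score will differ from the 0.76 mentioned above.

```python
# A hedged sketch: grid search over a decision tree's criterion and max_depth.
# The classifier and dataset here are assumptions, not the original example.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    "criterion": ["gini"],      # fixed to "gini" as in the notes
    "max_depth": range(3, 10),  # starts at 3, stays below 10
}

# cv is left unset, so the library default (5-fold CV in recent versions) is used.
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid)
search.fit(X, y)

print(search.best_params_)  # the best parameter combination found
print(search.best_score_)   # its mean cross-validated score
```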
These are the basic model evaluations, and I found plenty of other methods while researching the topic. Feel free to try new ways to test your models and share them with me. 😄
See ya next time.
Bye~