Notes from data science training EP 4: Scikit-learn & Linear Regression – Linear trends
Linear regression: the calculation that draws a straight line through your data.
In EP 3 we learned how to create some graphs. This episode we are going to analyze data in a serious way.
One of the basics of data science is linear regression: drawing a straight line \(Y = aX + b\) that passes through, or as close as possible to, the most points in the plane.
The figure above is not a very good line. We call the points that fall far from the line "outliers"; they are the errors. When there are too many outliers, either our line does not fit well enough, or the data contain so many errors that no single line can describe them.
Here are a few ways to fit a good line.
Theil-Sen estimator
The Theil-Sen estimator picks random pairs of points, draws a line through each pair, and in the end takes the median of those lines. Its benefit is speed, but when too many of the points are outliers it produces inaccurate results.
RANSAC algorithm
RANSAC stands for RAndom SAmple Consensus. It looks for the best line, the one that passes through (or close to) the maximum number of points.
The algorithm depends on the slope, \(slope=\frac{y_1-y_2}{x_1-x_2}\), which means it is robust to outliers in the Y direction but not to outliers in the X direction.
RANSAC is faster than Theil-Sen and scales better with the number of samples.
Huber regression
Huber regression uses a threshold \(\epsilon\) (epsilon), greater than 1.0: samples whose error exceeds epsilon are treated as outliers and given a linear (instead of squared) loss, so they pull the fitted line less.
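For reference, here is the Huber loss as given in the scikit-learn documentation; residuals smaller than \(\epsilon\) are squared, larger ones are penalized only linearly:

\[
H_{\epsilon}(z) =
\begin{cases}
z^2 & \text{if } |z| < \epsilon \\
2\epsilon|z| - \epsilon^2 & \text{otherwise}
\end{cases}
\]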
Huber is faster than the first two.
That was the lecture part. Now let's go write code in Jupyter.
Scikit-learn
Introducing sklearn, or the scikit-learn library: a great tool for data analysis and prediction.
We import sklearn.linear_model, which is a collection of linear regression models, and sklearn.model_selection, which we will use to split the data, as sketched below.
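A minimal sketch of those imports (pandas is assumed here for loading the data):

```python
import pandas as pd                # assumed for loading the CSV
import sklearn.linear_model        # TheilSenRegressor, RANSACRegressor, HuberRegressor
import sklearn.model_selection     # train_test_split
```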
We will try the Titanic data, using the columns "Pclass", "Age", and "Fare". We want to predict "Fare" from "Pclass" and "Age", so we assign x as the latter two and y as "Fare".
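A sketch of the data preparation, assuming the Kaggle Titanic train.csv; rows with missing values are dropped so the regressors receive clean numbers:

```python
# Load the Titanic data and keep only the three columns we need.
df = pd.read_csv("train.csv")[["Pclass", "Age", "Fare"]].dropna()

x = df[["Pclass", "Age"]]  # features: passenger class and age
y = df[["Fare"]]           # target: ticket fare
```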
Run sklearn.model_selection.train_test_split() to split both x and y into two groups each, a training group and a testing group, with the testing group at 10% of the data (test_size = 0.1).
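The split itself, as a sketch (pass random_state=... if you want reproducible groups):

```python
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    x, y, test_size=0.1  # 10% of the rows become the testing group
)
```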
Scikit-learn with Theil-Sen
The data is now prepared. Let's go for Theil-Sen first.
We create a TheilSenRegressor object, run fit() with the training groups of x and y, then predict() with the testing group of x, and… Gotcha! We've got the predicted result of Theil-Sen.
We can read the results as below:
- coef_ is the slope, the \(a\) from \(Y = aX + b\)
- intercept_ is the \(b\)
The formula from the Theil-Sen estimator is \(fare=-13.19\times Pclass - 0.04\times age + 51.49\).
We use y_train.values.ravel() to fix a data type issue: fit() expects a 1-D array, while y_train is a one-column DataFrame.
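Putting the Theil-Sen steps together as a sketch, with variable names carried over from the snippets above:

```python
# Fit Theil-Sen on the training group; ravel() flattens the
# one-column DataFrame into the 1-D array that fit() expects.
theil = sklearn.linear_model.TheilSenRegressor()
theil.fit(x_train, y_train.values.ravel())

# Predict fares for the testing group.
theil_pred = theil.predict(x_test)

print(theil.coef_)       # slopes a for Pclass and Age
print(theil.intercept_)  # intercept b
```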
Scikit-learn with RANSAC
Second, RANSAC. Create a RANSACRegressor() and repeat the same steps; this time we get the formula \(fare=-10.86\times Pclass + 0.02\times age + 40.80\).
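The same steps as a sketch; note that RANSAC wraps an inner estimator, so the fitted line is read from estimator_:

```python
# Fit RANSAC on the training group.
ransac = sklearn.linear_model.RANSACRegressor()
ransac.fit(x_train, y_train.values.ravel())

ransac_pred = ransac.predict(x_test)

print(ransac.estimator_.coef_)       # slopes of the consensus line
print(ransac.estimator_.intercept_)  # its intercept
```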
Scikit-learn with Huber
Last one: Huber, as HuberRegressor().
We got \(fare=-21.23\times Pclass - 0.25\times age + 79.99\).
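A sketch of the Huber steps (epsilon defaults to 1.35):

```python
# Fit Huber on the training group.
huber = sklearn.linear_model.HuberRegressor()
huber.fit(x_train, y_train.values.ravel())

huber_pred = huber.predict(x_test)

print(huber.coef_, huber.intercept_)  # slopes and intercept
```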
Comparison
We've got all three, so it's time to plot. The x-axis is the real value, i.e. the testing group of y, and the y-axis is the predicted result.
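A sketch of that plot with matplotlib (assumed here), reusing the predictions from the snippets above:

```python
import matplotlib.pyplot as plt

y_true = y_test.values.ravel()  # actual fares from the testing group

# Actual fare on the x-axis, predicted fare on the y-axis.
plt.scatter(y_true, theil_pred, label="Theil-Sen")
plt.scatter(y_true, ransac_pred, label="RANSAC")
plt.scatter(y_true, huber_pred, label="Huber")
plt.xlabel("actual fare")
plt.ylabel("predicted fare")
plt.legend()
plt.show()
```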
Metrics
scikit-learn provides sklearn.metrics for evaluating predictions. This time we use these three, with a sketch after the list:
- r2_score(): \(r^2\) is the coefficient of determination. Higher is better.
- median_absolute_error(): \(MedAE\) is the median of the errors between prediction and actual. Lower is better.
- mean_absolute_error(): \(MAE\) is the mean of the errors between prediction and actual. Lower is better.
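A sketch of computing the three metrics for Theil-Sen (repeat with ransac_pred and huber_pred for the other two):

```python
import sklearn.metrics

y_true = y_test.values.ravel()
print(sklearn.metrics.r2_score(y_true, theil_pred))               # higher is better
print(sklearn.metrics.median_absolute_error(y_true, theil_pred))  # lower is better
print(sklearn.metrics.mean_absolute_error(y_true, theil_pred))    # lower is better
```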
This episode suddenly hit us with a lot of mathematics. LOL.
Let’s see what’s next.
See ya. Bye~