Notes from data science training EP 4: Scikit-learn & Linear Regression – Linear trends
Linear regression: the calculation that draws a straight line through your data.
In EP 3 we learned how to create some graphs. This episode we are going to analyze data in a serious way.
One of the basics of data science is linear regression: drawing a straight line \(Y = aX + b\) that passes through, or as close as possible to, the most points in the plane.
The figure above is not a very good line. We call the points that fall far from the line "outliers"; they are the errors. When there are too many outliers, either our line does not fit well enough, or the data contain so many errors that no single line can describe them.
Here are a few ways to fit a good line.
Theil-Sen estimator
The Theil-Sen estimator picks random pairs of points, draws a line through each pair, and in the end takes the median of those lines. Its benefit is speed, but when too many of the points are outliers it produces inaccurate results.
RANSAC algorithm
RANSAC stands for RAndom SAmple Consensus. It looks for the best line, the one that passes through (or close to) the maximum number of points.
The algorithm depends on the slope, \(slope=\frac{y_1-y_2}{x_1-x_2}\), which means it is robust to outliers in the Y direction but not to outliers in the X direction.
RANSAC is faster than Theil-Sen and scales better with the number of samples.
Huber regression
Huber regression uses a threshold \(\epsilon\) (epsilon), greater than 1.0: samples whose error exceeds epsilon are treated as outliers and given a linear (instead of squared) loss, so they pull the fitted line less.
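For reference, here is the Huber loss as given in the scikit-learn documentation; residuals smaller than \(\epsilon\) are squared, larger ones are penalized only linearly:

\[
H_{\epsilon}(z) =
\begin{cases}
z^2 & \text{if } |z| < \epsilon \\
2\epsilon|z| - \epsilon^2 & \text{otherwise}
\end{cases}
\]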
Huber is faster than the first two.
That was the lecture part. Now let's go write code in Jupyter.
Scikit-learn
Introducing sklearn, or the scikit-learn library: a great tool for data analysis and prediction.
We import sklearn.linear_model, which is a collection of linear regression models, and sklearn.model_selection, which we will use to split the data, as sketched below.
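A minimal sketch of those imports (pandas is assumed here for loading the data):

```python
import pandas as pd                # assumed for loading the CSV
import sklearn.linear_model        # TheilSenRegressor, RANSACRegressor, HuberRegressor
import sklearn.model_selection     # train_test_split
```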
We will try the Titanic data, using the columns "Pclass", "Age", and "Fare". We want to predict "Fare" from "Pclass" and "Age", so we assign x as the latter two and y as "Fare".
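A sketch of the data preparation, assuming the Kaggle Titanic train.csv; rows with missing values are dropped so the regressors receive clean numbers:

```python
# Load the Titanic data and keep only the three columns we need.
df = pd.read_csv("train.csv")[["Pclass", "Age", "Fare"]].dropna()

x = df[["Pclass", "Age"]]  # features: passenger class and age
y = df[["Fare"]]           # target: ticket fare
```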
Run sklearn.model_selection.train_test_split() to split both x and y into two groups each, a training group and a testing group, with the testing group at 10% of the data (test_size = 0.1).
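The split itself, as a sketch (pass random_state=... if you want reproducible groups):

```python
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    x, y, test_size=0.1  # 10% of the rows become the testing group
)
```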
Scikit-learn with Theil-Sen
The data is now prepared. Let's go for Theil-Sen first.
We create a TheilSenRegressor object, run fit() with the training groups of x and y, then predict() with the testing group of x, and… Gotcha! We've got the predicted result of Theil-Sen.
We can read the results as below:
- coef_ is the slope, the \(a\) from \(Y = aX + b\)
- intercept_ is the \(b\)
The formula from the Theil-Sen estimator is \(fare=-13.19\times Pclass - 0.04\times age + 51.49\).
We use y_train.values.ravel() to fix a data type issue: fit() expects a 1-D array, while y_train is a one-column DataFrame.
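Putting the Theil-Sen steps together as a sketch, with variable names carried over from the snippets above:

```python
# Fit Theil-Sen on the training group; ravel() flattens the
# one-column DataFrame into the 1-D array that fit() expects.
theil = sklearn.linear_model.TheilSenRegressor()
theil.fit(x_train, y_train.values.ravel())

# Predict fares for the testing group.
theil_pred = theil.predict(x_test)

print(theil.coef_)       # slopes a for Pclass and Age
print(theil.intercept_)  # intercept b
```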
Scikit-learn with RANSAC
Second, RANSAC. Create a RANSACRegressor() and repeat the same steps; this time we get the formula \(fare=-10.86\times Pclass + 0.02\times age + 40.80\).
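The same steps as a sketch; note that RANSAC wraps an inner estimator, so the fitted line is read from estimator_:

```python
# Fit RANSAC on the training group.
ransac = sklearn.linear_model.RANSACRegressor()
ransac.fit(x_train, y_train.values.ravel())

ransac_pred = ransac.predict(x_test)

print(ransac.estimator_.coef_)       # slopes of the consensus line
print(ransac.estimator_.intercept_)  # its intercept
```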
Scikit-learn with Huber
Last one: Huber, as HuberRegressor().
We got \(fare=-21.23\times Pclass - 0.25\times age + 79.99\).
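A sketch of the Huber steps (epsilon defaults to 1.35):

```python
# Fit Huber on the training group.
huber = sklearn.linear_model.HuberRegressor()
huber.fit(x_train, y_train.values.ravel())

huber_pred = huber.predict(x_test)

print(huber.coef_, huber.intercept_)  # slopes and intercept
```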
Comparison
We've got all three, so it's time to plot. The x-axis is the real value, i.e. the testing group of y, and the y-axis is the predicted result.
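A sketch of that plot with matplotlib (assumed here), reusing the predictions from the snippets above:

```python
import matplotlib.pyplot as plt

y_true = y_test.values.ravel()  # actual fares from the testing group

# Actual fare on the x-axis, predicted fare on the y-axis.
plt.scatter(y_true, theil_pred, label="Theil-Sen")
plt.scatter(y_true, ransac_pred, label="RANSAC")
plt.scatter(y_true, huber_pred, label="Huber")
plt.xlabel("actual fare")
plt.ylabel("predicted fare")
plt.legend()
plt.show()
```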
Metrics
scikit-learn provides sklearn.metrics for evaluating predictions. This time we use these three, with a sketch after the list:
- r2_score(): \(r^2\) is the coefficient of determination. Higher is better.
- median_absolute_error(): \(MedAE\) is the median of the errors between prediction and actual. Lower is better.
- mean_absolute_error(): \(MAE\) is the mean of the errors between prediction and actual. Lower is better.
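A sketch of computing the three metrics for Theil-Sen (repeat with ransac_pred and huber_pred for the other two):

```python
import sklearn.metrics

y_true = y_test.values.ravel()
print(sklearn.metrics.r2_score(y_true, theil_pred))               # higher is better
print(sklearn.metrics.median_absolute_error(y_true, theil_pred))  # lower is better
print(sklearn.metrics.mean_absolute_error(y_true, theil_pred))    # lower is better
```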
This episode suddenly hit us with a lot of mathematics. LOL.
Let’s see what’s next.
See ya. Bye~