Note of data science training EP 11: NLP & Spacy – Languages are borderless

Computers are capable of learning human languages.



Natural Language Processing (NLP)

It is the methodology for translating human languages into datasets that we can analyse. For instance, “I Love You” can be interpreted as “positive”, “romantic”, or “sentimental”.

One basic term is “tokenization”, which means splitting a piece of text into groups of words. We understand what we hear by combining the meanings of all the words, and so does a computer.

Python has many libraries for this task. One is Spacy.


Spacy

This problem is from my final project: predicting ratings from cartoons’ names. The steps are to split the names into tokens, transform them into numbers, and use a Random Forest estimator as the predictor.

Let’s go.


1. Install

Find the Spacy package here.
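For reference, these are the standard install commands from Spacy's documentation (the second one downloads the English model used in step 4):

pip install spacy
python -m spacy download en_core_web_sm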


2. Prepare a dataset

The dataset is from Kaggle via this link.

[Screenshot: dataset preview]


3. Import libraries and files

Import Pandas and read the CSV file with .read_csv().

[Screenshot: pd.read_csv]
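A minimal sketch of this step, assuming the Kaggle file is saved as anime.csv (the file name is my assumption):

import pandas as pd

# Load the Kaggle anime dataset (file name assumed)
anime = pd.read_csv('anime.csv')
anime.head()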


4. Import Spacy

As the dataset is in English, we have to download the Spacy model "en_core_web_sm" and load it with spacy.load(), which gives us a class object.

At this step, we can use that object to tokenize (split text into words), as in the figure below.

[Screenshot: spacy.load]

We can display each token’s text with .text and its part of speech with .pos_.

[Screenshot: tokenized output]
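A minimal sketch of this step (the variable name processor is my choice):

import spacy

# Requires the model downloaded in step 1
processor = spacy.load('en_core_web_sm')

# Tokenize a sentence and print each token with its part of speech
for token in processor('I Love You'):
    print(token.text, token.pos_)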


5. Custom tokenization

We don’t want special characters, only letters and numbers, so we improve the tokenizer with a regular expression in the method below.

import re

def splitter(val, processor):
    # Keep only runs of digits and letters, lowercased
    pattern = r'[0-9a-zA-Z]+'
    return [r.group().lower() for r in re.finditer(pattern, processor(val).text)]

[Screenshot: splitter setup]

[0-9a-zA-Z]+ captures only digits (0–9), lowercase letters (a–z), and uppercase letters (A–Z). The plus sign means each capture is one character or more.


6. Tokenize them all

OK, now we tokenize all the names.

# Tokenize every name with the custom splitter
pattern_splitter = [splitter(n, processor) for n in anime.name]
pattern_splitter

[Screenshot: calling splitter]

Then we add the tokenized values in a new column, “name_token”.

# Store the token lists in the new column
anime.loc[:, 'name_token'] = pd.Series(pattern_splitter)
anime

[Screenshot: output]


7. Cleanse before use

As rating is the value we want to predict, we have to remove rows with missing ratings here.

[Screenshot: cleansing]
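A minimal sketch of the cleansing step, assuming it simply drops rows with a missing rating:

# Drop rows where "rating" is missing (assumed cleansing logic)
anime = anime.dropna(subset=['rating'])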


8. Make train and test sets

From all 12,064 rows, we separate them into a train set and a test set. We allocate 70% to the train set here.

[Screenshot: train/test split]
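A minimal sketch with scikit-learn's train_test_split (the random_state is my addition, for reproducibility):

from sklearn.model_selection import train_test_split

# 70% of rows go to the train set, the rest to the test set
train, test = train_test_split(anime, train_size=0.7, random_state=42)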


9. Vectorizer

A vectorizer in Scikit-learn transforms words into a matrix. TfidfVectorizer applies the TF-IDF formula, which weighs how frequent each word is in a document against how common it is across all documents.

[Image: TF-IDF formula]
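For reference, this is scikit-learn's default (smoothed) formulation, where tf(t, d) is the count of term t in document d, n is the number of documents, and df(t) is the number of documents containing t:

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1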

First, create a TfidfVectorizer object.

[Screenshot: creating the TF-IDF vectorizer]

Run .fit_transform() on the train set to learn the vocabulary and build the matrix, then run .transform() on the test set.

[Screenshot: fit_transform]
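A minimal sketch of this step; since “name_token” holds lists of words, they are joined back into strings for the vectorizer (the variable names are my choices):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
# Learn the vocabulary from the train set, then reuse it on the test set
X_train = vectorizer.fit_transform(train['name_token'].str.join(' '))
X_test = vectorizer.transform(test['name_token'].str.join(' '))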


10. Random Forest

Now it is time to train. Start by creating a Regressor.

[Screenshot: RandomForestRegressor]

Assign “y” as the rating column of the train set.

[Screenshot: preparing y]

Finally, run .fit() with the matrix and “y”. Now we have a trained estimator.

[Screenshot: fit]
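A minimal sketch of this section (the hyperparameters are my assumptions):

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=42)
y = train['rating']
# Train on the TF-IDF matrix from step 9
model.fit(X_train, y)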


11. Scores of Random Forest

After that, we have to score the estimator. Here we get MSE = 1.64.

[Screenshot: metrics]
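A sketch of the scoring step, assuming mean squared error on the test set:

from sklearn.metrics import mean_squared_error

predictions = model.predict(X_test)
# The post reports an MSE of about 1.64
print(mean_squared_error(test['rating'], predictions))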

Try comparing the predicted and real ratings.

[Screenshot: comparison]

Then plot a graph. It suggests that there is little relationship between a cartoon’s name and its rating. Anyway, the prediction results are OK.

[Screenshot: plot]


12. Interesting features

We can get the feature rankings from the Random Forest’s .feature_importances_ and the feature names from the vectorizer’s .get_feature_names().

Use them together to find which words matter most to the predicted rating.

[Screenshot: important features]
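A minimal sketch of pairing the two (note: newer scikit-learn versions rename .get_feature_names() to .get_feature_names_out()):

# Rank the vectorizer's words by the forest's importance scores
importances = pd.DataFrame({
    'feature': vectorizer.get_feature_names(),
    'importance': model.feature_importances_,
}).sort_values('importance', ascending=False)
importances.head(10)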

This is a DataFrame of feature names and feature importances.

[Screenshot: feature importance DataFrame]


13. Linear Regression version

We were also curious about Linear Regression. As a result, its MSE = 2.98, which is higher than the Random Forest’s.

OK, this one is worse.

[Screenshot: linear regression]
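A minimal sketch with the same features and a different estimator (variable names follow the earlier steps):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

linear = LinearRegression()
linear.fit(X_train, train['rating'])
# The post reports an MSE of about 2.98
print(mean_squared_error(test['rating'], linear.predict(X_test)))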


NLP with Thai language

The teacher recommended pythainlp. This library can process Thai text in a similar style to Spacy.
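A minimal sketch with pythainlp's word_tokenize (the sample sentence, “hello” in Thai, is my choice):

from pythainlp import word_tokenize

# Thai text has no spaces between words, so tokenization is essential
print(word_tokenize('สวัสดีครับ'))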


This blog is just an introduction. We can go further by learning Content Classification, Sentiment Analysis, etc.

See you next time, Bye.

This post is licensed under CC BY 4.0 by the author.