REGEX is sexy
Regular expressions are a tool for processing text, and they are supported in most programming languages. Many people find them troublesome to work with, and yes, I think so too.
However, once we learn the concepts and rules, they become familiar, and we can use them happily and comfortably in every project.
Concept of REGEX
REGEX stands for REGular EXpression. A REGEX is itself a string, written as a pattern for string or text processing methods.
These are the main concepts of REGEX:
- We select what may appear at each position in a string.
- Every character belongs to a class.
- A quantifier defines how many characters of the same class appear in a row.
- When there are choices, we use alternation.
- Anchors mark the beginnings or endings of words or strings.
- We apply escape characters where needed.
- Capture groups select parts of the text.
Class
ASCII is the basic character set covered in programming 101.
REGEX builds on it to define character classes:
- Digits are `\d`. Non-digits are `\D`.
- Word characters (English letters, digits, and the underscore) are `\w`, otherwise `\W`.
- Whitespace characters are `\s`, otherwise `\S`.
- For any character, we put `.` (dot). By default it does not match a newline.
- Line breaks are `\R` (think “Return”), otherwise `\N`. These two are supported only by some engines, such as PCRE.
- Characters of other languages are known as Unicode characters. There are `\p{language}` classes, for example `\p{Thai}`.
For more information, please visit regular-expressions.info/unicode.
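As a quick illustration of these classes, here is a minimal sketch using Python's standard `re` module. The sample strings and variable names are made up for this example, and the engine-dependent classes (`\R`, `\N`, `\p{...}`) are left out on purpose.

```python
import re

# Hypothetical sample text, chosen only for illustration
sample = "Room 42, floor_3\tEast"

digits = re.findall(r"\d", sample)          # every single digit
word_runs = re.findall(r"\w+", sample)      # runs of letters, digits, underscores
whitespace = re.findall(r"\s", sample)      # two spaces and one tab
non_digits = re.findall(r"\D+", "a1b22c")   # runs of anything but digits
any_char = re.findall(r".", "a\nb")         # dot skips the newline by default

print(digits)      # ['4', '2', '3']
print(word_runs)   # ['Room', '42', 'floor_3', 'East']
print(non_digits)  # ['a', 'b', 'c']
print(any_char)    # ['a', 'b']
```

Note that `\w+` keeps `floor_3` as one piece, because digits and the underscore count as word characters.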
Sometimes we need a list of characters, so we apply `[]`.
- To select only A, B, C, or D, use `[ABCD]`.
- To select anything but A, B, C, and D, use `[^ABCD]`.
- To select a range, such as the letters ‘a’ to ‘x’, use `[a-x]`.
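The three bracket forms can be tried out as follows. This is only a sketch with Python's standard `re` module, and the input strings are invented for the example.

```python
import re

# A list of allowed characters
only_abcd = re.findall(r"[ABCD]", "ABCDEFG")

# A negated list: anything except A, B, C, or D
not_abcd = re.findall(r"[^ABCD]+", "ABxyCD")

# A range from 'a' to 'x' ('y' and 'z' fall outside it)
a_to_x = re.findall(r"[a-x]", "wxyz")

print(only_abcd)  # ['A', 'B', 'C', 'D']
print(not_abcd)   # ['xy']
print(a_to_x)     # ['w', 'x']
```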
Some classes can be rewritten with `[]`.
- `\d` can be replaced with `[0-9]`.
- `\D` can be replaced with `[^0-9]`.
- `\w` can be replaced with `[a-zA-Z0-9_]`.
- `\W` can be replaced with `[^a-zA-Z0-9_]`.
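To check these equivalences, the sketch below compares a shorthand class with its bracket form in Python's standard `re` module. It assumes ASCII-only matching (the `re.ASCII` flag), because without it Python's `\d` and `\w` also accept non-ASCII digits and letters. The sample string is made up.

```python
import re

sample = "user_01 ok!"

# \d versus [0-9]
d_shorthand = re.findall(r"\d", sample, re.ASCII)
d_bracket = re.findall(r"[0-9]", sample)

# \w versus [a-zA-Z0-9_]
w_shorthand = re.findall(r"\w+", sample, re.ASCII)
w_bracket = re.findall(r"[a-zA-Z0-9_]+", sample)

# Letters only is a narrower class than \w
letters_only = re.findall(r"[a-zA-Z]+", sample)

print(d_shorthand == d_bracket)  # True
print(w_shorthand == w_bracket)  # True
print(w_shorthand)               # ['user_01', 'ok']
print(letters_only)              # ['user', 'ok']
```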
Quantifier
With the classes selected, we can now define how many characters to match.
Least | Most | Add this after the class |
---|---|---|
0 | Any | * |
1 | Any | + |
0 | 1 | ? |
3 | 3 | {3} |
3 | 9 | {3,9} |
3 | Any | {3,} |
For instance, a text that consists of 3 digits followed by any number of word characters (letters, digits, or underscores) can be written as `\d{3}\w*`.
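Each row of the table can be checked with a small helper. The following is a sketch using Python's standard `re` module; `fully_matches` is a hypothetical helper name introduced here, and the test strings are invented.

```python
import re

def fully_matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

star_empty = fully_matches(r"a*", "")                   # 0 or more
plus_empty = fully_matches(r"a+", "")                   # 1 or more
optional_u = fully_matches(r"colou?r", "color")         # 0 or 1
exactly_three = fully_matches(r"\d{3}", "1234")         # exactly 3
three_to_nine = fully_matches(r"\d{3,9}", "123456789")  # 3 to 9
three_or_more = fully_matches(r"\d{3,}", "1234567890")  # 3 or more

# The pattern from the text: 3 digits, then any number of word characters
print(fully_matches(r"\d{3}\w*", "123abc"))  # True
print(fully_matches(r"\d{3}\w*", "123"))     # True
print(fully_matches(r"\d{3}\w*", "12abc"))   # False
```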
Alternation
Put `|` between choices.
- `a|b` means either `a` or `b`.
- `cat|dog` means either `cat` or `dog`.
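Here is a small sketch of choosing between options with `|`, again using Python's standard `re` module. The sentence being searched and the `gr(?:a|e)y` pattern are examples added here, not part of the original article.

```python
import re

# Either "cat" or "dog", wherever it appears
pets = re.findall(r"cat|dog", "cat, dog, bird")

# Parentheses limit how far the choice reaches: "gray" or "grey"
greys = re.findall(r"gr(?:a|e)y", "gray grey groy")

# A single-character choice
a_or_b = re.fullmatch(r"a|b", "c")

print(pets)    # ['cat', 'dog']
print(greys)   # ['gray', 'grey']
print(a_or_b)  # None
```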
Anchor
An anchor represents the beginning or ending of a word or a text.
- The beginning of a line is represented by `^`. `^a` means the line starts with `a`.
- The end of a line is `$`. `z$` means `z` is the last character of that line.
- The end of a word can be matched with `\b`, from “boundary”. It matches at a position between a word character and a non-word character (or the edge of the text), so it also matches where a word begins. For example, `x\b` will target the `x` in `x.`, `x;`, and `x!`, but not the `x` in `xa` or the first `x` in `xx`.
- A position that is not a word boundary can be matched with `\B`. For example, `x\B` will target the `x` in `xa` and `xz` and the first `x` in `xx`, but not the `x` in `x.` or `x+`.
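The anchors can be seen in action in the sketch below, written with Python's standard `re` module. The sample strings are invented, and the start positions of the matches are collected so it is clear which `x` was targeted.

```python
import re

starts_with_a = re.search(r"^a", "apple") is not None
ends_with_z = re.search(r"z$", "jazz") is not None

# Positions: x=0 .=1 space=2 x=3 a=4 space=5 x=6 !=7 space=8 x=9 x=10
sample = "x. xa x! xx"

# x followed by a boundary (non-word character or end of text)
at_boundary = [m.start() for m in re.finditer(r"x\b", sample)]

# x followed by another word character
not_at_boundary = [m.start() for m in re.finditer(r"x\B", sample)]

# With re.MULTILINE, ^ matches at the start of every line
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

print(starts_with_a, ends_with_z)  # True True
print(at_boundary)                 # [0, 6, 10]
print(not_at_boundary)             # [3, 9]
print(line_starts)                 # ['one', 'two']
```

The last `x` of `xx` sits at the end of the text, which counts as a boundary, so it is matched by `x\b` and not by `x\B`.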
Escape characters
Add a backslash `\` before the character.
- `*` becomes `\*`.
- `.` becomes `\.`.
- `$` becomes `\$`, and so on.
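The effect of escaping can be sketched with Python's standard `re` module. The strings are made up for the example, and `re.escape` is shown as one convenient way to escape a whole literal rather than adding each backslash by hand.

```python
import re

# Unescaped, + is a quantifier, so "1+1" means "one or more 1s, then a 1"
unescaped = re.search(r"1+1", "1+1=2")

# Escaped, \+ is a literal plus sign
escaped = re.search(r"1\+1", "1+1=2")

# Unescaped, the dot accepts any character
dot_is_wild = re.fullmatch(r"3.50", "3x50") is not None

# re.escape adds the backslashes needed to match a literal string
literal = "3.50$"
literal_pattern = re.escape(literal)
matches_itself = re.fullmatch(literal_pattern, literal) is not None
matches_other = re.fullmatch(literal_pattern, "3x50$") is not None

print(unescaped)       # None
print(escaped.group()) # 1+1
print(dot_is_wild)     # True
print(matches_itself)  # True
print(matches_other)   # False
```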
Capture group
Wrap part of a REGEX in parentheses to make a capture group. The following syntaxes then become available.
- Refer to a capture group with `\index`, where index is the number of that group. For example, `(a|b|c)\1` has a capture group selecting the letter “a”, “b”, or “c” as the 1st group, plus `\1` as a reference to whatever the 1st group matched. The result should be one of `aa`, `bb`, or `cc`.
- Refer to a capture group by name. We need to name the capture group first. For example, `(?'x1'(a|b|c))\k'x1'` gives the same result as above, but now uses the name x1. The naming syntax differs between engines.
- Capture groups are also useful for substitution and extraction. For instance, substitute every digit with an “x”, or extract every digit that is followed by the letter “a” from the given texts.
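The three uses of capture groups can be sketched with Python's standard `re` module. One assumption to flag: Python spells named groups as `(?P<name>...)` and refers back to them with `(?P=name)`, which is different from the quote-style syntax shown in the text. The input strings are invented.

```python
import re

# Numbered backreference: the same letter twice
doubled = [m.group(0) for m in re.finditer(r"(a|b|c)\1", "aa ab bb cc ca")]

# Named group, using Python's (?P<name>...) and (?P=name) syntax
named = [m.group("x1") for m in re.finditer(r"(?P<x1>a|b|c)(?P=x1)", "aa ab bb")]

# Substitution: every digit becomes "x"
masked = re.sub(r"\d", "x", "tel 0812")

# Extraction: only the digit part of "digit followed by a"
digits_before_a = re.findall(r"(\d)a", "1a 2b 3a")

print(doubled)          # ['aa', 'bb', 'cc']
print(named)            # ['a', 'b']
print(masked)           # tel xxxx
print(digits_before_a)  # ['1', '3']
```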
Tools
These are the tools I use to check REGEX strings before running them in my jobs.
Real cases
SQL on Google BigQuery
Google BigQuery supports REGEX well, as in the example below.
WITH test_set AS (
  -- Sample inputs to classify
  SELECT ["+66876543210", "0812345678", "9876543210987",
          "[email protected]", "[email protected]",
          "[email protected]",
          "1234567890123", "9876543210987", "#iphone12mini", "#รักเธอที่สุด"
         ] AS text
)
SELECT text,
  -- 10 digits starting with 06, 08, or 09
  regexp_contains(text, r'^0[689]\d{8}$') as is_mobile,
  regexp_contains(text, r'[\d\w\-_\.]+\@[\d\w\-_\.]+\..*') as is_email,
  -- exactly 13 digits
  regexp_contains(text, r'^\d{13}$') as is_thaiid,
  -- "#" followed by Thai or word characters
  regexp_contains(text, r'#[\p{Thai}\w\d_]+') as is_hashtag
FROM test_set, unnest(text) text
This diagram illustrates how each REGEX maps onto a sample string.
This is the result of the query above. A `true` value shows which type, as indicated by each REGEX, the `text` belongs to.
Result of REGEX functions in Google BigQuery
Python script
The second case is Python. I prefer the library `regex`, which is more flexible than the standard library `re`. `re` doesn’t support the class `\p{language}` that this case needs.
import regex  # third-party library (pip install regex); supports \p{Thai}

test_set = ["+66876543210", "0812345678", "9876543210987",
            "[email protected]", "[email protected]",
            "[email protected]", "1234567890123",
            "9876543210987", "#iphone12mini", "#รักเธอที่สุด"]

# Raw strings keep the backslashes intact for the REGEX engine
rgx_mobile = r"^0[689]\d{8}$"
rgx_email = r"[\d\w\-_\.]+\@[\d\w\-_\.]+\..*"
rgx_thaiid = r"^\d{13}$"
rgx_hashtag = r"#[\p{Thai}\w\d_]+"

# regex.match() only matches from the start of each text
for t in test_set:
    if regex.match(rgx_mobile, t) is not None:
        print(t, "is mobile")
    elif regex.match(rgx_email, t) is not None:
        print(t, "is email")
    elif regex.match(rgx_thaiid, t) is not None:
        print(t, "is thaiid")
    elif regex.match(rgx_hashtag, t) is not None:
        print(t, "is hashtag")
    else:
        print(t, "is others")
Result of REGEX methods in Python
Be careful
As mentioned above, REGEX has well-defined patterns, but we need to be aware of which REGEX engine we are using, because different engines may not be compatible with our REGEX strings.
Python has the libraries `re` and `regex`, while Google BigQuery functions rely on Google's `re2` library, whose syntax is also the one used by Golang.
For more information about REGEX engines, please read Comparison of regular expression engines.
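To make the engine difference concrete, the sketch below asks Python's standard `re` module whether it can compile `\p{Thai}`, and then falls back to an explicit code point range. The helper name `compiles` is hypothetical, and the range `\u0E00-\u0E7F` (the Unicode Thai block) is a stand-in I am assuming is close enough to `\p{Thai}` for this hashtag example. It is not an exact equivalent.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True when the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Expected to be False on the Python versions I am aware of,
# because re rejects the \p escape
supports_p_thai = compiles(r"\p{Thai}")
print(supports_p_thai)

# Fallback for re: spell out the Thai block as a code point range
rgx_hashtag_re = r"#[\u0E00-\u0E7F\w]+"
thai_tag = re.fullmatch(rgx_hashtag_re, "#รักเธอที่สุด") is not None
ascii_tag = re.fullmatch(rgx_hashtag_re, "#iphone12mini") is not None
no_tag = re.fullmatch(rgx_hashtag_re, "0812345678") is not None

print(thai_tag, ascii_tag, no_tag)  # True True False
```

Checking a pattern against the target engine like this, before running a real job, catches incompatibilities early.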
I’m quite certain that we coders will all spend time working with REGEX.