REGEX is sexy

Regular Expression is a tool for processing text. It’s popular in most programming languages. Many people find it troublesome to work with, and yes, I think so too.

However, once we learn the concepts and rules and become familiar with them, we will be happy and comfortable using it in every project.


Concept of REGEX

REGEX stands for REGular EXpression. It is written as a string and is designed for searching and processing text.

These are the main concepts of REGEX:

  1. Decide what may appear at each position in a string.
  2. Every character belongs to a class.
  3. Define a quantifier for how many consecutive characters of the same class to match.
  4. When there are choices, use alternation.
  5. Add anchors for the beginning or end of words or strings.
  6. Apply escape characters if needed.
  7. Select parts of the text with capture groups.

Class

ASCII is the basic character set we meet in programming 101.

REGEX builds on it to define character classes:

  • Digits are \d. Non-digits are \D.
  • Word characters (English letters, digits, and the underscore) are \w, otherwise \W.
  • Whitespace characters are \s, otherwise \S.
  • For any character, we put . (dot). In most engines it does not match a line break by default.
  • Line breaks are \R and non-line-break characters are \N, in engines that support them such as PCRE.
  • Characters of other languages are handled as Unicode characters. There is \p{language}, for example \p{Thai}, in engines that support it.
    For more info, please visit regular-expressions.info/unicode.
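The shorthand classes above can be tried out quickly. This is a minimal sketch using Python's standard re module; the sample strings are made up for illustration, and the re.ASCII flag is an assumption added here to keep \d, \w, and \s limited to ASCII characters.

```python
import re

sample = "Room 42, floor 7\n"  # hypothetical sample text

# re.ASCII keeps the shorthand classes limited to ASCII characters.
digits = re.findall(r"\d", sample, flags=re.ASCII)          # single digits
words = re.findall(r"\w+", sample, flags=re.ASCII)          # runs of word characters
spaces = re.findall(r"\s", sample, flags=re.ASCII)          # whitespace, including the newline
non_digits = re.findall(r"\D+", "a1b22c", flags=re.ASCII)   # runs of non-digits
any_chars = re.findall(r".", "a\nb")                        # dot skips the newline by default

print(digits)      # ['4', '2', '7']
print(words)       # ['Room', '42', 'floor', '7']
print(non_digits)  # ['a', 'b', 'c']
print(any_chars)   # ['a', 'b']
```

Note that \w picks up "42" and "7" as well, since digits count as word characters.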

Sometimes we need a specific list of characters. For that, use [].

  • To select only A, B, C, or D, use [ABCD].
  • To select anything but A, B, C, and D, use [^ABCD].
  • To select a range such as the letters ‘a’ to ‘x’, use [a-x].
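As a small illustration of the three bracket forms, here is a sketch with Python's standard re module. The input strings are invented for the example.

```python
import re

text = "ABCDEF abcxyz"  # hypothetical sample text

only_abcd = re.findall(r"[ABCD]", text)     # any one of A, B, C, D
not_abcd = re.findall(r"[^ABCD]", "ABXD")   # anything except A, B, C, D
a_to_x = re.findall(r"[a-x]", text)         # lowercase letters from a to x

print(only_abcd)  # ['A', 'B', 'C', 'D']
print(not_abcd)   # ['X']
print(a_to_x)     # ['a', 'b', 'c', 'x']
```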

We can rewrite some classes with [] (for ASCII text).

  • \d can be replaced with [0-9].
  • \D can be replaced with [^0-9].
  • \w can be replaced with [a-zA-Z0-9_].
  • \W can be replaced with [^a-zA-Z0-9_].
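These equivalences can be checked with a short sketch in Python's standard re module. Two assumptions are made here: the re.ASCII flag is used so that the shorthand classes do not also match non-ASCII digits and letters, and \w is taken to cover digits and the underscore as well as letters, which is how Python defines it.

```python
import re

sample = "user_01 scored 95%!"  # hypothetical sample text

# Compare each shorthand class (in ASCII mode) with its bracket form.
same_digits = re.findall(r"\d", sample, flags=re.ASCII) == re.findall(r"[0-9]", sample)
same_non_digits = re.findall(r"\D", sample, flags=re.ASCII) == re.findall(r"[^0-9]", sample)
same_word = re.findall(r"\w", sample, flags=re.ASCII) == re.findall(r"[a-zA-Z0-9_]", sample)
same_non_word = re.findall(r"\W", sample, flags=re.ASCII) == re.findall(r"[^a-zA-Z0-9_]", sample)

print(same_digits, same_non_digits, same_word, same_non_word)  # True True True True
```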

Quantifier

Once the classes are selected, we can define how many characters of each to match.

Least   Most   Add this after the class
0       Any    *
1       Any    +
0       1      ?
3       3      {3}
3       9      {3,9}
3       Any    {3,}

For instance, a text consisting of 3 digits followed by word characters of any length can be written as \d{3}\w*.
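Each quantifier can be exercised against strings of different lengths. The sketch below uses Python's standard re module; the helper name full is hypothetical and simply wraps re.fullmatch so that the whole string must match.

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

star = [full(r"a*", s) for s in ["", "a", "aaa"]]            # 0 or more
plus = [full(r"a+", s) for s in ["", "a", "aaa"]]            # 1 or more
optional = [full(r"a?", s) for s in ["", "a", "aa"]]         # 0 or 1
exact = [full(r"a{3}", s) for s in ["aa", "aaa", "aaaa"]]    # exactly 3
ranged = [full(r"a{3,9}", s) for s in ["aa", "aaa", "a" * 9, "a" * 10]]  # 3 to 9
at_least = [full(r"a{3,}", s) for s in ["aa", "aaa", "a" * 20]]          # 3 or more

# The example from the text: 3 digits, then any number of word characters.
example = full(r"\d{3}\w*", "123abc")

print(star, plus, optional, exact, ranged, at_least, example)
```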


Alternation

Put | between choices.

  • a|b means either a or b.
  • cat|dog means either cat or dog.
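A short sketch of alternation in Python's standard re module, with made-up input text. The second part is an addition not covered above: | applies to everything on either side of it, so parentheses are needed to limit the choice to part of a pattern.

```python
import re

# Find every occurrence of either word; "catalog" also contains "cat".
pets = re.findall(r"cat|dog", "cat, dog, bird, catalog")
print(pets)  # ['cat', 'dog', 'cat']

# Without grouping, "gra|ey" means "gra" or "ey", not "gray" or "grey".
ungrouped = re.fullmatch(r"gra|ey", "grey") is not None
grouped = re.fullmatch(r"gr(a|e)y", "grey") is not None
print(ungrouped, grouped)  # False True
```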

Anchor

An anchor represents the beginning or end of a word, line, or text.

  • The beginning of the line is represented by ^.
    ^a means the line starts with a.
  • The end of the line is $.
    z$ means z is the last character of that line.
  • Word boundaries are matched with \b, from “boundary”.
    It matches a position that has a word character on one side and a non-word character (or the start or end of the text) on the other.
    For example, x\b will match the x in x., x;, x! but not the x in xa or the first x in xx.
  • Positions that are not word boundaries are matched with \B.
    For example, x\B will match the x in xa, xz, and the first x in xx, but not the x in x. or x+.
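The anchors can be checked with a minimal sketch in Python's standard re module. The test strings follow the examples above; the re.MULTILINE line is an addition here, since in Python ^ and $ match only at the start and end of the whole text unless that flag is set.

```python
import re

starts_with_a = re.search(r"^a", "apple") is not None   # line starts with a
ends_with_z = re.search(r"z$", "jazz") is not None      # z is the last character

# x followed by a non-word character (a boundary) versus x followed by a word character.
boundary = [re.search(r"x\b", s) is not None for s in ["x.", "x;", "x!", "xa"]]
non_boundary = [re.search(r"x\B", s) is not None for s in ["xa", "xz", "x.", "x+"]]

# With re.MULTILINE, ^ matches at the start of every line.
line_starts = re.findall(r"^\w", "ab\ncd", flags=re.MULTILINE)

print(starts_with_a, ends_with_z)  # True True
print(boundary)                    # [True, True, True, False]
print(non_boundary)                # [True, True, False, False]
print(line_starts)                 # ['a', 'c']
```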

Escape characters

Add a backslash \ before a special character to match it literally.

  • * will be \*.
  • . will be \..
  • $ will be \$

and so on.
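Here is a small sketch of escaping in Python's standard re module, using invented sample text. It also shows re.escape, a standard-library helper that adds the backslashes for us; that helper is not mentioned above and is included as a convenience.

```python
import re

# \$ and \. match a literal dollar sign and a literal dot.
price = re.findall(r"\$\d+\.\d{2}", "Total: $12.50 (was $15.00)")
print(price)  # ['$12.50', '$15.00']

# An unescaped dot matches any character, so "125" also matches.
unescaped = re.findall(r"1.5", "1.5 and 125")
escaped = re.findall(r"1\.5", "1.5 and 125")
print(unescaped)  # ['1.5', '125']
print(escaped)    # ['1.5']

# re.escape() builds a pattern that matches the given text literally.
auto = re.escape("a*b.c$")
matches_literal = re.fullmatch(auto, "a*b.c$") is not None
print(matches_literal)  # True
```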


Capture group

Wrap part of the REGEX in parentheses to make a capture group. This enables the following syntax:

  • Refer to the capture group using \index, where index is the number of that group.
    Let’s say (a|b|c)\1 means there is a capture group selecting the letter “a”, “b”, or “c” as the 1st group, plus \1 as the reference to the result of the 1st group. The result should be one of aa, bb, or cc.
  • Refer to the capture group using its name. We need to name the capture group first.
    For example, (?'x1'(a|b|c))\k'x1' gives the same result as above, but now we’re using the name x1. The exact syntax for named groups depends on the engine.
  • Use capture groups with the methods for substitution and extraction.
    For instance, substitute all digits with an “x”, or extract every digit followed by the letter “a” from a given text.
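The capture group ideas above can be sketched in Python's standard re module. One caveat: Python spells named groups as (?P<name>...) and refers back to them with (?P=name), so the pattern below is adapted to that syntax. The sample strings are made up for illustration.

```python
import re

candidates = ["aa", "bb", "cc", "ab"]

# Back-reference by index: the second character must repeat the first group.
doubled = [s for s in candidates if re.fullmatch(r"(a|b|c)\1", s)]

# Back-reference by name, in Python's named-group syntax.
named = [s for s in candidates if re.fullmatch(r"(?P<x1>a|b|c)(?P=x1)", s)]

# Substitution: replace every digit with an "x".
masked = re.sub(r"\d", "x", "call 0812345678")

# Extraction: with one capture group, findall returns only the captured digit.
digits_before_a = re.findall(r"(\d)a", "1a 2b 3a 44a")

print(doubled)          # ['aa', 'bb', 'cc']
print(named)            # ['aa', 'bb', 'cc']
print(masked)           # call xxxxxxxxxx
print(digits_before_a)  # ['1', '3', '4']
```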

Tools

These are the tools I use to check REGEX strings before running them in my jobs.


Real cases

SQL on Google BigQuery

Google BigQuery supports REGEX well, as in the example below.

WITH test_set AS (
  SELECT ["+66876543210", "0812345678", "9876543210987",
    "[email protected]", "[email protected]",
    "[email protected]",
    "1234567890123", "9876543210987", "#iphone12mini", "#รักเธอที่สุด"
    ] AS text 
)
SELECT text, 
  regexp_contains(text, r'^0[689]\d{8}$') as is_mobile,
  regexp_contains(text, r'[\d\w\-_\.]+\@[\d\w\-_\.]+\..*') as is_email,
  regexp_contains(text, r'^\d{13}$') as is_thaiid,
  regexp_contains(text, r'#[\p{Thai}\w\d_]+') as is_hashtag
FROM test_set, unnest(text) text

This diagram illustrates how each REGEX maps onto a sample string.

[Figure: REGEX diagrams]

This is the result of the query above. A true value shows which type each REGEX assigns to the text.

[Figure: Result of REGEX functions in Google BigQuery]

Python script

The second case is Python. I prefer the library regex, which is more flexible than the standard library re. In particular, re doesn’t support the \p{language} class needed for this case.

import regex  # third-party package (pip install regex); supports \p{...} classes

test_set = ["+66876543210", "0812345678", "9876543210987",
            "[email protected]", "[email protected]",
            "[email protected]", "1234567890123",
            "9876543210987", "#iphone12mini", "#รักเธอที่สุด"]

# Raw strings (r"...") pass the backslashes to the REGEX engine unchanged.
rgx_mobile = r"^0[689]\d{8}$"
rgx_email = r"[\d\w\-_\.]+\@[\d\w\-_\.]+\..*"
rgx_thaiid = r"^\d{13}$"
rgx_hashtag = r"#[\p{Thai}\w\d_]+"

# regex.match() only looks for a match at the beginning of the string.
# The first pattern that matches decides the label.
for t in test_set:
    if regex.match(rgx_mobile, t) is not None:
        print(t, "is mobile")
    elif regex.match(rgx_email, t) is not None:
        print(t, "is email")
    elif regex.match(rgx_thaiid, t) is not None:
        print(t, "is thaiid")
    elif regex.match(rgx_hashtag, t) is not None:
        print(t, "is hashtag")
    else:
        print(t, "is others")

[Figure: Result of REGEX methods in Python]


Be careful

As shown above, REGEX has well-defined patterns, but we need to be aware of which REGEX engine we use, because a different engine may not be compatible with our REGEX strings.

Python has the libraries re and regex, while Google BigQuery functions rely on RE2, Google's regular expression library.

For more info about REGEX engine, please read Comparison of regular expression engines.
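One way to see an engine difference is to check whether a pattern even compiles. The sketch below uses only Python's standard re module; the helper name compiles is hypothetical. It shows that the hashtag pattern with \p{Thai} is rejected by re, while the mobile pattern is accepted.

```python
import re

def compiles(pattern):
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

mobile_ok = compiles(r"^0[689]\d{8}$")        # plain classes and quantifiers
hashtag_ok = compiles(r"#[\p{Thai}\w\d_]+")   # \p{...} is not supported by re

print(mobile_ok, hashtag_ok)  # True False
```

A check like this is a cheap guard before moving a REGEX string from one engine to another.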


I’m pretty certain that we coders will all spend time working with REGEX.

This post is licensed under CC BY 4.0 by the author.