by KeyC0de Posted: Friday 04-06-2021, 00:12:30 --- Modified: Thursday 24-02-2022, 18:18:23

Regular Expression tutorial

Difficulty:

1/5

Regular expressions are patterns, written as sequences of characters and symbols, that describe text to search for (think CTRL + F) within a longer piece of text.

Here I summarize my personal notes on regexps in general and on how they are used in grep, Bash (linux shell), Python, Java, Javascript, PHP and C++.

For example, to create a pattern that matches only the phone numbers in this character sequence: 917-555-1234 and 646.867-5309 stop it!

use \d{3}[-.]\d{3}[-.]\d{4}

in the PCRE regexp dialect.
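
As a quick check, the same pattern can be tried with Python's re module, which accepts this syntax unchanged. This is only a minimal sketch; the names PHONE and find_phone_numbers are made up for the illustration.

```python
import re

# Pattern from the text: 3 digits, a dash or dot, 3 digits, a dash or dot, 4 digits.
PHONE = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

def find_phone_numbers(text):
    """Return every phone-number-like substring found in text."""
    return PHONE.findall(text)

sample = "917-555-1234 and 646.867-5309 stop it!"
print(find_phone_numbers(sample))  # ['917-555-1234', '646.867-5309']
```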

But what are all these cryptic runes and sequences? Let's find out. Read on!

Meta Characters

Flags (set outside the pattern, eg. /pattern/gim, they are not backslash escapes):
g : global matching, finds all matches in the entire text
i : case insensitive matching
m : multiline mode, ^ and $ also match at the beginning and end of each line

\G : matches at the position where the previous successful match ended
\A : matches at the very beginning of the text
\d : any digit \in {0,1,…,9}
\D : any non-digit
\w : any word character \in {A-Z, a-z, 0-9, _}
\W : any non-word character
\s : any whitespace eg. tab, space, newline
\S : any non-whitespace
. : any single character except a newline
\. : literal dot character (the backslash \ makes literal any metacharacter)
\[ , \] : literal brackets
(sequence) : matches sequence and “captures” it as a group
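
To see a few of these metacharacters in action, here is a small Python sketch. The sample strings and variable names are assumptions chosen for the illustration. Note that in Python case insensitivity is requested with the re.I flag.

```python
import re

text = "Room 42\tis open"

digits = re.findall(r"\d", text)             # every single digit
non_space_runs = re.findall(r"\S+", text)    # runs of non-whitespace
word_runs = re.findall(r"\w+", "snake_case, 9 lives")  # \w includes the underscore
literal_dots = re.findall(r"\.", "a.b.c")    # escaped dot matches only '.'
captured = re.search(r"is (open)", text).group(1)  # text captured by the group
ignore_case = re.findall(r"room", text, re.I)      # flag, not an escape

print(digits)          # ['4', '2']
print(non_space_runs)  # ['Room', '42', 'is', 'open']
print(word_runs)       # ['snake_case', '9', 'lives']
print(literal_dots)    # ['.', '.']
print(captured)        # open
print(ignore_case)     # ['Room']
```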

Quantifier Meta Characters

regexp{quantifier-meta-character} : quantifiers always go after the expression they apply to

* : 0 or more occurrences (greedy, matches as much as possible)
*? : non-greedy version of * (matches as little as possible, so separated strings are matched one by one)
+ : 1 or more occurrences
? : 0 or 1 occurrence (optional), eg. given this regex: colou?rs? : the u and s are optional in the search string, matches: colour, color, colours, colors
{min,max} : at least min and at most max occurrences
{min,} : at least min occurrences
{number} : exactly number occurrences, eg. \w{6} : matches 6 consecutive letter/number characters
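
The difference between greedy and non-greedy matching is easiest to see side by side. A minimal Python sketch follows; the sample strings and variable names are assumptions made for the illustration.

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the LAST </b>, so both tags end up in one match.
greedy = re.findall(r"<b>.*</b>", html)
# Non-greedy: .*? stops at the FIRST </b>, so each tag is matched separately.
lazy = re.findall(r"<b>.*?</b>", html)

# ? makes the preceding character optional.
spellings = re.findall(r"colou?rs?", "color colour colors colours")

# {min,max}: runs of 2 to 3 digits; the run of four digits yields one 3-digit match.
counted = re.findall(r"\d{2,3}", "1 22 333 4444")

print(greedy)     # ['<b>one</b> and <b>two</b>']
print(lazy)       # ['<b>one</b>', '<b>two</b>']
print(spellings)  # ['color', 'colour', 'colors', 'colours']
print(counted)    # ['22', '333', '444']
```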

Position Meta Characters

^ : beginning of the text (or of a line in multiline mode)
$ : end of the text (or of a line in multiline mode)
\b : word boundary
\B : non-word boundary

eg. \b\w{4,6}\b : all words that are 4 to 6 characters long (letters, digits or underscores)
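
A short Python sketch of the position metacharacters, using a made-up sentence (the variable names are illustrative only):

```python
import re

text = "The quick brown fox jumps over lazy dogs"

# Whole words of 4 to 6 word characters; 'The' and 'fox' are too short.
mid_words = re.findall(r"\b\w{4,6}\b", text)

starts_with_the = re.search(r"^The", text) is not None   # anchored at the beginning
ends_with_dogs = re.search(r"dogs$", text) is not None   # anchored at the end

# \B requires a NON-boundary: 'ow' inside 'brown' matches, 'ow' starting 'owl' does not.
inner_ow = re.findall(r"\Bow", "brown owl")

print(mid_words)  # ['quick', 'brown', 'jumps', 'over', 'lazy', 'dogs']
print(starts_with_the, ends_with_dogs)  # True True
print(inner_ow)   # ['ow']
```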

Character Class

A character class is the set of characters listed between square brackets; it matches exactly one character from that set.

eg. l[yi C]nk : matches an 'l', followed by one character that is a y, an i, a SPACE or a capital C, followed by “nk” - lynk, link, l nk, lCnk

[a-z] : any character (only one) between a and z
[0-5] : any single digit between 0 and 5
[^abc] : any single character except a or b or c. ^ is a special (negating) character inside a character class only when it is the first symbol in the character class; outside a class ^ anchors the match to the beginning

eg \b[A-Z][a-z]+\b : match a single capital letter, followed by 1 or more lower case letters in a word
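
The character class examples above can be checked with a few lines of Python. This is a sketch with made-up sample strings and variable names.

```python
import re

# One character out of {y, i, space, C} between 'l' and 'nk'; 'lonk' is rejected.
links = re.findall(r"l[yi C]nk", "lynk link l nk lCnk lonk")

# Negated class: any single character that is not a, b or c.
not_abc = re.findall(r"[^abc]", "abcd")

# A capital letter followed by one or more lower case letters, as a whole word.
capitalized = re.findall(r"\b[A-Z][a-z]+\b", "Alice met bob and Carol in NYC")

print(links)        # ['lynk', 'link', 'l nk', 'lCnk']
print(not_abc)      # ['d']
print(capitalized)  # ['Alice', 'Carol']
```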

Alternation

To match any of these email addresses: daniel@shiffman.net daniel.shiffman@gmail.com daniel.shiffman@nyu.edu

[\w.]+@\w+\.(net|com|edu) : one or more characters/numbers or dots, followed by an @ symbol, followed by one or more characters/numbers, followed by a literal dot (escaped as \.), followed by “net”, or “com”, or “edu”
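
A minimal Python sketch of the alternation, with the dot before the domain suffix escaped so that it matches a literal dot. The extra address nobody@example.org is an assumption added to show a non-match. Since the pattern contains a capturing group, re.finditer is used here so that the whole match can be read with group(0).

```python
import re

EMAIL = re.compile(r"[\w.]+@\w+\.(net|com|edu)")

text = ("daniel@shiffman.net daniel.shiffman@gmail.com "
        "daniel.shiffman@nyu.edu nobody@example.org")

# group(0) is the whole address, group(1) is the alternative that matched.
addresses = [m.group(0) for m in EMAIL.finditer(text)]
suffixes = [m.group(1) for m in EMAIL.finditer(text)]

print(addresses)
# ['daniel@shiffman.net', 'daniel.shiffman@gmail.com', 'daniel.shiffman@nyu.edu']
print(suffixes)  # ['net', 'com', 'edu']
```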

Groups

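Parentheses group part of a pattern and capture the text it matched, so that it can be retrieved or reused later.
\d{3}-(\d{3})-(\d{4}) : matches 3 numbers, a dash, 3 numbers, a dash and 4 numbers; the middle 3 numbers are captured as group 1 and the last 4 numbers as group 2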

Most regex flavors support up to 99 capturing groups (aka captures).
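
In Python the captured groups can be read back from the match object. A short sketch, assuming a made-up input string:

```python
import re

m = re.search(r"\d{3}-(\d{3})-(\d{4})", "call 917-555-1234 now")

whole = m.group(0)   # the entire match
first = m.group(1)   # text captured by the 1st pair of parentheses
second = m.group(2)  # text captured by the 2nd pair of parentheses
both = m.groups()    # all captured groups as a tuple

print(whole)   # 917-555-1234
print(first)   # 555
print(second)  # 1234
print(both)    # ('555', '1234')
```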

Back References

To refer again, within the same regexp, to a group you have already specified, use \# (\1 or \2 eg.) where # is the number of the group in the regexp. This is called back referencing.
Back references are also useful in “replacement” patterns, where the group is written as \# or $# depending on the flavor.

eg ([a-c])x\1x\1 matches axaxa, bxbxb or cxcxc
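
A small Python sketch of both uses: a back reference inside the pattern and one inside a replacement string. Python writes the replacement form as \1, \2; the sample inputs are made up.

```python
import re

pattern = re.compile(r"([a-c])x\1x\1")

# \1 must repeat exactly what group 1 captured.
same_letter = pattern.fullmatch("axaxa") is not None   # a, a, a
mixed_letters = pattern.fullmatch("axbxa") is not None  # b differs from captured a

# In a replacement, \1 and \2 insert the captured groups; here they are swapped.
swapped = re.sub(r"(\d{3})-(\d{4})", r"\2-\1", "555-1234")

print(same_letter)    # True
print(mixed_letters)  # False
print(swapped)        # 1234-555
```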

Lookahead & Lookbehind assertions

Negative Lookahead:

q(?!u) : matches a q that is not followed by a u. Only q is part of the match

Positive Lookahead:

q(?=u) : matches a q that is followed by a u. Only q is part of the match

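eg Monday\s(?=Wednesday) : matches Monday plus the following whitespace only when Wednesday comes right after; Wednesday itself is not part of the match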

Negative Lookbehind:

(?<!a)b : matches a b that is not preceded by an a

Positive Lookbehind:

(?<=a)b : matches a b that is preceded by an a
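
The four assertions can be compared in Python by looking at where each pattern matches. A minimal sketch; the helper name starts and the sample strings are assumptions for the illustration.

```python
import re

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

words = "Iraq quit qatar"   # q at index 3, 5 (followed by u) and 10
pairs = "ab cb b"           # b at index 1 (after a), 4 and 6

q_not_before_u = starts(r"q(?!u)", words)
q_before_u = starts(r"q(?=u)", words)
b_not_after_a = starts(r"(?<!a)b", pairs)
b_after_a = starts(r"(?<=a)b", pairs)

# The lookahead text is checked but not consumed.
monday = re.search(r"Monday\s(?=Wednesday)", "Monday Wednesday").group(0)

print(q_not_before_u, q_before_u)  # [3, 10] [5]
print(b_not_after_a, b_after_a)    # [4, 6] [1]
print(repr(monday))                # 'Monday '
```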

grep

grep -E[otherFlags] [pattern] [file] : displays lines matching a pattern
-n : display line numbers
-i : ignore case (case insensitive matching)
egrep [flags] [pattern] [file] : extended grep (same as grep -E)

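egrep '(one)|(two)' [file] : matches lines containing one OR two (it's similar to a character class, but for sequences of multiple characters). Quote the pattern so that the shell does not interpret | and the parentheses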

Bash (linux shell)

pattern="^[0-9]{8}$"
if [[ $date =~ $pattern ]]; then
    echo "date is valid"
fi

# =~ is the regexp-match operator; keep $pattern unquoted on its right-hand side, otherwise it is compared as a literal string

Python Regexps

import re
robj = re.compile(r'\d{3}-\d{3}-\d{4}')   # create regex object based on "compiled" query
matches = robj.search('Numbers:213-456-5678,432-901-9111') # match regex with string provided
matches.group(0) # the whole matched text, here '213-456-5678'; group(k) with k >= 1 picks the k-th captured group
# by default .search() returns 1st match only (or None when nothing matches)

Python regexp quantifiers are greedy by default, ie. in ambiguous situations they match the longest find.
To make a quantifier non-greedy append ? to it (*?, +?, ??, {min,max}?) eg r'...?'

robj.findall(string) # instead of .search(string) to return all matches
re.compile(regexp, re.I) # case insensitive matching
robj.sub(replacement, string) # replace every match in string with replacement (pass count=1 to replace the 1st match only)
robj.sub(lambda x: repls[x.group()], string) # replace each match with its own replacement - repls = {match1:repl1, match2:repl2}
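
The dictionary-based replacement on the last line can be sketched as follows. The contents of repls and the sample sentence are made up; building the pattern by joining the escaped keys with | is one possible approach, not the only one.

```python
import re

# Hypothetical replacement table: matched text -> replacement text.
repls = {"cat": "dog", "red": "blue"}

# One alternation of all keys; re.escape protects keys containing metacharacters.
robj = re.compile("|".join(re.escape(key) for key in repls))

sentence = "a red cat and a red hat"

# The lambda receives each match object and looks up its replacement.
replaced_all = robj.sub(lambda x: repls[x.group()], sentence)
# count=1 limits the substitution to the first match.
replaced_first = robj.sub(lambda x: repls[x.group()], sentence, count=1)

print(replaced_all)    # a blue dog and a blue hat
print(replaced_first)  # a blue cat and a red hat
```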

Java regexps

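Java's matches() methods (String.matches(), Matcher.matches()) try to match the entire string; use Matcher.find() to search for matches inside it.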

import java.util.regex.*;

String str = "whatever...";
Pattern pat = Pattern.compile("\\b[A-Za-z]\\b");	// a single letter standing alone as a word
Matcher matches = pat.matcher(str);

while (matches.find())
{
    if (matches.group().length() != 0)
    {
        System.out.println(matches.group());
    }
    System.out.println("Start index=" + matches.start() + " End index = " + matches.end());
}

// replace all whitespace characters with ", " in str:
Pattern pat2 = Pattern.compile("\\s");
Matcher matches2 = pat2.matcher(str);
System.out.println(matches2.replaceAll(", "));

Javascript regexps

regexp.test(expression)
// eg. document.write(/cats/i.test("Cats are funny.")) -> outputs true
expression.replace(regexp, replaceStr)
// eg. document.write("Cats are friendly".replace(/cats/gi,"dogs")) -> outputs dogs are friendly

PHP regexps

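$regexp = "/[a-z]/";
$field = "Undercover Brother";
preg_match($regexp, $field);   // returns 1 if the pattern matches, 0 if not
preg_replace($pattern, $replacement, $str);   // replace (the old ereg/eregi functions were removed in PHP 7)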

C++ regexps

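#include <iostream>
#include <regex>
#include <string>
using namespace std;

string str;
cin >> str;
regex re{"abc(a){2}", regex_constants::icase};	// case insensitive pattern
bool match = regex_match(str, re);	// true only if the entire string matches
bool match2 = regex_search(str, re);	// true if any part of the string matches
smatch matches;	// std::match_results for std::string
regex_search(str, matches, re);	// overload that takes matches as 2nd argument and fills it
// character class notes: [[:w:]] word + number matching, [[:d:]] number match, \. literal dot, [^cd] not c and not d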

Github

Github repository link.

Tags:

# regexp

# regular

# expression
