
Regular Expression tutorial


Difficulty:
1/5


Regular expressions are sequences of characters and symbols that describe a pattern to be searched for (think CTRL + F) within a longer piece of text.

Here I summarize my personal notes on regexps: awk, sed, grep, and the regexp support in C++, Java, Python, Bash (Linux shell), JavaScript, PHP etc.

For example, to create a pattern that matches only the phone numbers in this character sequence: 917-555-1234 and 646.867-5309 stop it!

use \d{3}[-.]\d{3}[-.]\d{4}

in PCRE regexp dialect.
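As a quick check, the pattern above can be tried out in Python, whose re module accepts the same syntax for this particular expression. This is only a minimal sketch; the variable names are mine.

```python
import re

# The sample text from above.
text = "917-555-1234 and 646.867-5309 stop it!"

# 3 digits, a dash or dot, 3 digits, a dash or dot, 4 digits.
phone_pattern = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

phone_numbers = phone_pattern.findall(text)
print(phone_numbers)  # ['917-555-1234', '646.867-5309']
```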

But what are all these cryptic runes and sequences? Let's find out. Read on!

Meta Characters

  • g : global matching flag, finds every match in the entire text (a flag written after the pattern, eg. /pattern/g, not an escape sequence)
  • \G : matches at the position where the previous successful match ended
  • i : case insensitive matching flag
  • m : multiline mode flag (^ and $ also match at the start and end of each line)
  • \A : matches the very start of the text
  • \d : any digit \in {0,1,…,9}
  • \D : any non-digit
  • \w : any word character \in {A-Z, a-z, 0-9, _}
  • \W : any non-word character
  • \s : any whitespace eg. tab, space, newline
  • \S : any non-whitespace
  • . : any single character (except a newline by default)
  • \. : literal dot character (the backslash \ makes literal any metacharacter)
  • \[ , \] : literal brackets only matching
  • (sequence) : matches sequence and “captures” it as a group
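A few of these meta characters in action, sketched in Python. The sample strings are made up for illustration, and the case insensitive flag is spelled re.IGNORECASE in Python rather than a trailing i.

```python
import re

sample = "Room 42\tis open"

digits = re.findall(r"\d", sample)               # ['4', '2']
non_space_runs = re.findall(r"\S+", sample)      # ['Room', '42', 'is', 'open']

# \w also accepts the underscore, but not the dash.
word_runs = re.findall(r"\w+", "snake_case-42")  # ['snake_case', '42']

# Case insensitive matching is a flag, not part of the pattern.
any_case = re.findall(r"room", sample, flags=re.IGNORECASE)  # ['Room']

# An unescaped dot matches any character, an escaped one only a dot.
any_char = re.findall(r".", "a.b")               # ['a', '.', 'b']
literal_dot = re.findall(r"\.", "a.b")           # ['.']
```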

Quantifier Meta Characters

regexp{quantifier-meta-character} : quantifiers always go right after the regexp they repeat

  • * : 0 or more occurrences (greedy wildcard)
  • *? : non-greedy version, matches as little as possible (so separated strings are matched separately)
  • + : 1 or more occurrences
  • ? : 0 or 1 occurrence, eg. given this regex: colou?rs? : the u and s are optional in the search string, matches: colour, color, colours, colors
  • {min,max} : matches at least min occurrences and at most max occurrences
  • {min,} : at least min occurrences
  • {number} : exactly number occurrences, eg \w{6} : finds 6 consecutive letter/number characters
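The difference between greedy and non-greedy quantifiers is easiest to see side by side. A small Python sketch, with sample strings of my own:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as it can, so everything ends up in one match.
greedy = re.findall(r"<.*>", html)   # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first closing bracket it can.
lazy = re.findall(r"<.*?>", html)    # ['<b>', '</b>', '<i>', '</i>']

# ? makes the preceding character optional.
spellings = re.findall(r"colou?rs?", "color colour colors colours")
# ['color', 'colour', 'colors', 'colours']

# {min,max}: whole numbers that are 2 or 3 digits long.
two_or_three = re.findall(r"\b\d{2,3}\b", "7 42 123 4567")  # ['42', '123']
```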

Position Meta Characters

  • $ : end of the text (or of a line in multiline mode)
  • \b : word boundary
  • \B : non-word boundary
  • ^ : beginning of the text (or of a line in multiline mode)

eg. \b\w{4,6}\b : all words between 4 & 6 characters long (letters or digits)
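The position meta characters can be tried out in Python like this. The sentence is an arbitrary example, and re.MULTILINE is Python's spelling of multiline mode.

```python
import re

sentence = "The quick brown fox jumps over lazy dogs"

# Whole words that are 4 to 6 characters long.
mid_words = re.findall(r"\b\w{4,6}\b", sentence)
# ['quick', 'brown', 'jumps', 'over', 'lazy', 'dogs']

starts_with_the = re.search(r"^The", sentence) is not None   # True
ends_with_dogs = re.search(r"dogs$", sentence) is not None   # True

# In multiline mode ^ also matches right after every newline.
first_words = re.findall(r"^\w+", "one two\nthree four", flags=re.MULTILINE)
# ['one', 'three']
```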

Character Class

A character class is the stuff that appears in between square brackets; it matches exactly one character out of the listed set.

eg. l[yi C]nk : matches sequences of characters that start with an 'l' character, continue with a y or an i or a SPACE or a capital C, and end with the two closing “nk” characters - lynk, link, l nk, lCnk

  • [a-z] : any character (only one) between a and z
  • [0-5] : any digit between 0 and 5
  • [^abc] : any character except a or b or c. ^ is a special (negating) character inside a character class only when it is the first symbol in the character class; outside the brackets, as in ^[abc], it anchors the match to the beginning

eg \b[A-Z][a-z]+\b : matches a whole word made of a single capital letter followed by 1 or more lower case letters
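The character class examples above, sketched in Python. The extra candidate "lank" and the sample sentence are my own additions to show what does not match.

```python
import re

candidates = "lynk link l nk lCnk lank"

# The class accepts y, i, a space or a capital C between 'l' and 'nk'.
class_hits = re.findall(r"l[yi C]nk", candidates)
# ['lynk', 'link', 'l nk', 'lCnk']

# A leading ^ inside the brackets negates the class.
not_abc = re.findall(r"[^abc]", "abcdef")   # ['d', 'e', 'f']

# One capital letter followed by lower case letters, as a whole word.
capitalized = re.findall(r"\b[A-Z][a-z]+\b", "Alice met Bob at NYU in Paris")
# ['Alice', 'Bob', 'Paris']
```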

Alternation

To match any of these email addresses: daniel@shiffman.net daniel.shiffman@gmail.com daniel.shiffman@nyu.edu

  • [\w.]+@\w+\.(net|com|edu) : one or more characters/numbers or dots followed by an @ symbol, followed by one or more characters/numbers, followed by a literal dot (escaped as \.) followed by “net”, or “com”, or “edu”
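A Python sketch of the email example, with the dot before the suffix escaped so that it only matches a literal dot. Two things here are my own choices: the extra .org address, added to show a non-match, and the (?:...) non-capturing form of the group, used so that findall returns the whole address instead of just the suffix.

```python
import re

addresses = (
    "daniel@shiffman.net daniel.shiffman@gmail.com "
    "daniel.shiffman@nyu.edu someone@example.org"
)

# (?:net|com|edu) is alternation: any one of the three suffixes.
email_pattern = re.compile(r"[\w.]+@\w+\.(?:net|com|edu)")

found = email_pattern.findall(addresses)
print(found)
# ['daniel@shiffman.net', 'daniel.shiffman@gmail.com', 'daniel.shiffman@nyu.edu']
```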

Groups

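Parentheses group part of a pattern and capture the text that part matched, so it can be picked out afterwards.
\d{3}-(\d{3})-(\d{4}) : matches 3 numbers, a dash, 3 numbers, a dash and 4 numbers; the second block of 3 numbers is captured as group 1 and the last 4 numbers as group 2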

Most regex flavors support up to 99 capturing groups (aka captures).
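In Python the captured groups can be read back from the match object. A minimal sketch using the phone number pattern from this section; the sample sentence is made up.

```python
import re

match = re.search(r"\d{3}-(\d{3})-(\d{4})", "Call 917-555-1234 today")

whole = match.group(0)       # '917-555-1234' (group 0 is the entire match)
middle = match.group(1)      # '555'
last_four = match.group(2)   # '1234'
captures = match.groups()    # ('555', '1234')
```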

Back References

Useful for “replacement” patterns (where the group is written \# or $#, depending on the flavor).
To refer again, inside a match, to a group you have already specified, use \# (eg. \1 or \2) where # is the number of the group in the regexp. This is called back referencing.

eg ([a-c])x\1x\1 matches axaxa, bxbxb or cxcxc
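Both uses of back references, inside the pattern and inside a replacement, can be sketched in Python. The candidate "axbxa" and the "hello world" swap are my own examples.

```python
import re

pattern = re.compile(r"([a-c])x\1x\1")

# \1 must repeat exactly what group 1 captured, so 'axbxa' fails.
candidates = ["axaxa", "bxbxb", "cxcxc", "axbxa"]
results = [pattern.fullmatch(c) is not None for c in candidates]
# [True, True, True, False]

# In a replacement string, \1 and \2 insert the captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")   # 'world hello'
```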

Lookahead & Lookbehind assertions

Negative Lookahead:

  • q(?!u) : matches a q that is not followed by a u. Only q is part of the match

Positive Lookahead:

  • q(?=u) : matches a q that is followed by a u. Only q is part of the match

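eg Monday\s(?=Wednesday) : matches “Monday” plus the following whitespace only when “Wednesday” comes right after; “Wednesday” itself is not part of the match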

Negative Lookbehind:

  • (?<!a)b : matches a b that is not preceded by an a

Positive Lookbehind:

  • (?<=a)b : matches a b that is preceded by an a
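The four assertions can be compared in Python by looking at where each one matches. The test strings are my own; the positions are 0-based indexes into them.

```python
import re

text = "Iraq quit qatar"

# q not followed by u: the q in "Iraq" and the q in "qatar".
neg_ahead = [m.start() for m in re.finditer(r"q(?!u)", text)]    # [3, 10]
# q followed by u: only the q in "quit".
pos_ahead = [m.start() for m in re.finditer(r"q(?=u)", text)]    # [5]

letters = "ab cb b"
neg_behind = [m.start() for m in re.finditer(r"(?<!a)b", letters)]   # [4, 6]
pos_behind = [m.start() for m in re.finditer(r"(?<=a)b", letters)]   # [1]

# The lookahead checks for "Wednesday" but does not consume it.
monday = re.search(r"Monday\s(?=Wednesday)", "Monday Wednesday").group()
# 'Monday '
```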

grep

  • grep -E [otherFlags] [pattern] [file] : displays the lines of file that match pattern (-E enables extended regexps)
  • -n : display line numbers
  • -i : ignore case
  • egrep [flags] [pattern] [file] : extended grep (same as grep -E)

egrep '(one)|(two)' [file] : matches lines containing one OR two (alternation chooses between whole sequences, much like a character class chooses between single characters)
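To make the behaviour concrete, here is a rough Python analogue of grep with the -n and -i flags. The grep_lines helper and its parameters are hypothetical names of my own, not part of any library, and it only covers the line-matching core of what grep does.

```python
import re

def grep_lines(pattern, lines, ignore_case=False):
    """Hypothetical helper: return (line number, line) for matching lines."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(pattern, flags)
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)   # like grep -n
        if compiled.search(line)                        # match anywhere in the line
    ]

lines = ["One fish", "red fish", "two fish", "blue fish"]
hits = grep_lines(r"(one)|(two)", lines, ignore_case=True)
print(hits)  # [(1, 'One fish'), (3, 'two fish')]
```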

Bash (linux shell)

date="20240131"   # example value, 8 digits
pattern="^[0-9]{8}$"
if [[ $date =~ $pattern ]]; then
    echo "date is valid"
fi

# =~ is the regexp-match operator
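For comparison, the same check sketched in Python. The function name and the sample values are my own; note that, like the Bash version, the pattern only verifies the shape (exactly 8 digits), not that the digits form a real calendar date.

```python
import re

DATE_PATTERN = re.compile(r"^[0-9]{8}$")

def is_valid_date(value):
    """Hypothetical helper: True when value is exactly 8 digits."""
    return DATE_PATTERN.search(value) is not None

checks = {v: is_valid_date(v) for v in ["20240131", "2024-01-31", "202401311"]}
print(checks)
# {'20240131': True, '2024-01-31': False, '202401311': False}
```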

Python Regexps

import re

robj = re.compile(r'\d{3}-\d{3}-\d{4}')   # create regex object based on "compiled" query
match = robj.search('Numbers:213-456-5678,432-901-9111') # match regex with string provided
match.group(0) # the whole match; group(k) picks the k-th capture group if the regex has any
# by default .search() returns 1st match only

Python regexps are greedy by default, ie. in ambiguous situations they match the longest find.
To make a quantifier non-greedy append ? to it, eg r'.*?'

robj.findall(string) # instead of .search(string) to return all matches
re.compile(regexp, re.I) # case insensitive matching
robj.sub(repl, string) # replace every match of the regex in string with repl (pass count=1 to replace the 1st match only)
robj.sub(lambda x: repls[x.group()], string) # replace multiple matches in string - repls = {match1:repl1, match2:repl2}
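The dictionary-based replacement is the least obvious of these calls, so here is a self-contained sketch. The repls contents and the sentence are invented for illustration; building the pattern by joining the escaped keys with | is one possible approach, not the only one.

```python
import re

# Text to find -> text to put in its place.
repls = {"cat": "dog", "red": "blue"}

# One alternation over all keys; re.escape guards against special characters.
pattern = re.compile("|".join(re.escape(key) for key in repls))

sentence = "the red cat saw a cat"

# The lambda looks up each matched text in the dictionary.
replaced_all = pattern.sub(lambda m: repls[m.group()], sentence)
# 'the blue dog saw a dog'

# count=1 stops after the first replacement.
replaced_first = pattern.sub(lambda m: repls[m.group()], sentence, count=1)
# 'the blue cat saw a cat'
```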

Java regexps

In Java, String.matches() and Matcher.matches() try to match the entire string; Matcher.find() searches for the pattern anywhere inside it.

import java.util.regex.*;

String str = "whatever...";
Pattern pat = Pattern.compile("\\b[A-Za-z]\\b");   // a single letter standing alone as a word
Matcher matches = pat.matcher(str);

while (matches.find())
{
    if (matches.group().length() != 0)
    {
        System.out.println(matches.group());
    }
    System.out.println("Start index=" + matches.start() + " End index = " + matches.end());
}

// replace all whitespace characters with ", " in str:
Pattern pat2 = Pattern.compile("\\s");
Matcher matches2 = pat2.matcher(str);
System.out.println(matches2.replaceAll(", "));

Javascript regexps

regexp.test(expression)
// eg. document.write(/cats/i.test("Cats are funny.")) -> outputs true
expression.replace(regexp, replaceStr)
// eg document.write("Cats are friendly".replace(/cats/gi,"dogs")) -> outputs dogs are friendly

PHP regexps

$regexp = "/[a-z]/";
$field = "Undercover Brother";
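preg_match($regexp, $field);   // returns 1 if $field matches, 0 otherwise
preg_replace($pattern, $replacement, $str);   // replace (the old ereg_replace()/eregi_replace() functions were removed in PHP 7)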

C++ regexps

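#include <iostream>
#include <regex>
#include <string>
using namespace std;

string str;
cin >> str;
regex re{"abc(a){2}", regex_constants::icase};	// [[:w:]] word + number matching
bool match = regex_match( str,
	re );	// the whole string must match; [[:d:]] number match
bool match2 = regex_search( str,
	re );	// matches anywhere in str; \. literal dot
smatch matches;	// match_results for std::string; [^cd] not c and not d
regex_search( str, matches, re );	// overload taking the match results as 2nd argument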


