Regular Expression tutorial
Regular expressions are sequences of characters and symbols that describe a pattern to be searched for (think CTRL + F) within a longer piece of text.
Here I summarize my personal notes on regexps in general and on their use in awk, sed, grep, C++, Java, Python, Bash (Linux shell), JavaScript, PHP etc.
For example, to create a pattern that matches only the numbers in this character sequence: 917-555-1234 and 646.867-5309 stop it!
use \d{3}[-.]\d{3}[-.]\d{4} in the PCRE regexp dialect.
But what are all these cryptic runes and sequences? Let's find out. Read on!
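As a quick check of the phone number pattern, here is a minimal Python sketch. Python's re module is not PCRE, but it accepts this particular pattern unchanged; the variable names are only illustrative.

```python
import re

text = "917-555-1234 and 646.867-5309 stop it!"
phone_pattern = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")

# findall returns every non-overlapping match as a string
numbers = phone_pattern.findall(text)
print(numbers)  # ['917-555-1234', '646.867-5309']
```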
Meta Characters
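Flags (modifiers): these are not backslash escapes but options set outside the pattern, eg. /pattern/gi
- g : global matching, finds all matches in the entire text
- i : case insensitive matching
- m : multiline mode, ^ and $ also match at the start and end of each line
Escapes and metacharacters:
- \G : matches at the position where the previous successful match ended
- \A : matches only at the very start of the text
- \d : any digit \in {0,1,…,9}
- \D : any non-digit
- \w : any word character \in {A-Z, a-z, 0-9, _}
- \W : any non-word character
- \s : any whitespace eg. tab, space, newline
- \S : any non-whitespace
- . : any single character except a newline
- \. : literal dot character (the backslash \ makes any metacharacter literal)
- \[ , \] : literal brackets
- (sequence) : matches sequence and “captures” it as a group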
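To see a few of these in action, here is a small Python sketch. It assumes Python's re module, where case insensitivity and multiline mode are passed as the flags re.I and re.M; the sample text is made up.

```python
import re

text = "Room 42\nroom 7"

digits = re.findall(r"\d+", text)                 # ['42', '7']
words = re.findall(r"\w+", text)                  # ['Room', '42', 'room', '7']

# re.I ignores case, re.M lets ^ match at the start of every line
rooms = re.findall(r"^room", text, re.I | re.M)   # ['Room', 'room']

# \A only matches at the very start of the text, even in multiline mode
first = re.findall(r"\Aroom", text, re.I | re.M)  # ['Room']

# A backslash turns the dot into a literal character
dots = re.findall(r"\.", "a.b.c")                 # ['.', '.']
```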
Quantifier Meta Characters
regexp{quantifier} : a quantifier always goes directly after the element it repeats
- * : 0 or more occurrences (greedy)
- *? : non-greedy (lazy) version, matches as few characters as possible
- + : 1 or more occurrences
- ? : 0 or 1 occurrence, eg. given this regex: colou?rs? : the u and s are optional in the search string, matches: colour, color, colours, colors
- {min,max} : at least min and at most max occurrences
- {min,} : at least min occurrences
- {number} : exactly number occurrences, eg \w{6} : finds every run of 6 word characters (letters, digits or underscore)
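The difference between optional, greedy, non-greedy and counted repetition is easiest to see on small inputs. A minimal Python sketch, with sample strings chosen only for illustration:

```python
import re

# ? makes the preceding element optional
spellings = re.findall(r"colou?rs?", "color colour colors colours")
# ['color', 'colour', 'colors', 'colours']

html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", html)   # ['<b>one</b><b>two</b>']
lazy = re.findall(r"<b>.*?</b>", html)    # ['<b>one</b>', '<b>two</b>']

# {min,max} bounds the number of repetitions
runs = re.findall(r"\d{2,3}", "1 22 333 4444")  # ['22', '333', '444']
```

The greedy version swallows everything up to the last closing tag, while the non-greedy one stops at the first.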
Position Meta Characters
- $ : end of the text (or of each line in multiline mode)
- \b : word boundary
- \B : non-word boundary
- ^ : beginning of the text (or of each line in multiline mode)
eg. \b\w{4,6}\b : all words of 4 to 6 word characters (letters, digits or underscore)
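The anchors can be tried out with a short Python sketch; the sentence used here is just a made-up sample.

```python
import re

text = "the quick brown fox jumped"

# \b on both sides keeps only whole words of 4 to 6 word characters
mid_words = re.findall(r"\b\w{4,6}\b", text)   # ['quick', 'brown', 'jumped']

starts = bool(re.search(r"^the", text))        # True: "the" is at the beginning
ends = bool(re.search(r"jumped$", text))       # True: "jumped" is at the end

# \B demands a position inside a word: "row" within "brown"
inner = re.findall(r"\Brow\B", text)           # ['row']
```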
Character Class
A character class is the set of characters listed between square brackets; it matches exactly one character from that set.
eg. l[yi C]nk : matches an 'l', followed by one character that is a y, an i, a SPACE or a capital C, followed by “nk” - lynk, link, l nk, lCnk
- [a-z] : any character (only one) between a and z
- [0-5] : any digit between 0 and 5
- [^abc] : any character except a or b or c. ^ negates a character class only when it is the first symbol inside the brackets; anywhere else in the class it is a literal ^
eg \b[A-Z][a-z]+\b : matches a whole word made of a single capital letter followed by 1 or more lower case letters
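A short Python sketch of character classes, including a negated class written with the ^ as the first symbol inside the brackets. The input strings are illustrative only.

```python
import re

links = re.findall(r"l[yi C]nk", "lynk link l nk lCnk lonk")
# ['lynk', 'link', 'l nk', 'lCnk']

# ^ as the first symbol inside the brackets negates the class
not_abc = re.findall(r"[^abc]", "abcde")   # ['d', 'e']

names = re.findall(r"\b[A-Z][a-z]+\b", "Alice met bob and Carol at NASA")
# ['Alice', 'Carol']
```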
Alternation
To match any of these email addresses:
daniel@shiffman.net
daniel.shiffman@gmail.com
daniel.shiffman@nyu.edu
- [\w.]+@\w+\.(net|com|edu) : one or more word characters or dots, followed by an @ symbol, followed by one or more word characters, followed by a literal dot (escaped as \.), followed by “net”, or “com”, or “edu”
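The alternation can be checked against the three addresses with a small Python sketch. It escapes the dot before the domain ending (\.) on the assumption that a literal dot is intended, and contrasts that with an unescaped dot, which matches any character.

```python
import re

emails = [
    "daniel@shiffman.net",
    "daniel.shiffman@gmail.com",
    "daniel.shiffman@nyu.edu",
]

strict = re.compile(r"[\w.]+@\w+\.(net|com|edu)")
loose = re.compile(r"[\w.]+@\w+.(net|com|edu)")   # unescaped dot

all_match = all(strict.fullmatch(e) is not None for e in emails)   # True
wrong_tld = strict.fullmatch("daniel@shiffman.org")                 # None

# The unescaped dot also accepts an arbitrary character in place of the "."
loose_hit = loose.fullmatch("daniel@shiffmanXnet") is not None      # True
strict_hit = strict.fullmatch("daniel@shiffmanXnet") is not None    # False
```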
Groups
Parentheses group part of a pattern and capture the text it matched, so that it can be retrieved separately.
\d{3}-(\d{3})-(\d{4}) : matches 3 digits, a dash, 3 digits, a dash and 4 digits; the second block of digits is captured as group 1 and the last block as group 2
Most regex flavors support up to 99 capturing groups (aka captures).
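In Python the captured groups are read back from the match object. A minimal sketch with a made-up input string:

```python
import re

m = re.search(r"\d{3}-(\d{3})-(\d{4})", "call 917-555-1234 now")

whole = m.group(0)    # '917-555-1234' (group 0 is the entire match)
middle = m.group(1)   # '555'
last = m.group(2)     # '1234'
both = m.groups()     # ('555', '1234')
```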
Back References
Useful for “replacement” patterns, where a captured group is written as \# or $# depending on the flavor.
To refer again, within the same pattern, to a group you have already specified, use \# (eg. \1 or \2), where # is the number of the group in the regexp. This is called back referencing.
eg ([a-c])x\1x\1 matches axaxa, bxbxb or cxcxc
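A short Python sketch of back references, both inside a pattern and inside a replacement string. Python writes the group reference as \1 in both places; the candidate strings are illustrative.

```python
import re

pattern = re.compile(r"([a-c])x\1x\1")
candidates = ["axaxa", "bxbxb", "cxcxc", "axbxa"]
hits = [s for s in candidates if pattern.fullmatch(s)]
# ['axaxa', 'bxbxb', 'cxcxc']  ("axbxa" fails: \1 must repeat the same letter)

# In a replacement string, \1 and \2 insert the captured groups
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")   # 'world hello'
```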
Lookahead & Lookbehind assertions
Negative Lookahead:
- q(?!u) : matches a q that is not followed by a u. Only q is part of the match
Positive Lookahead:
- q(?=u) : matches a q that is followed by a u. Only q is part of the match
eg Monday\s(?=Wednesday)
Negative Lookbehind:
- (?<!a)b : matches a b that is not preceded by an a
Positive Lookbehind:
- (?<=a)b : matches a b that is preceded by an a
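The four assertions can be compared side by side by recording where each match starts. A minimal Python sketch; the test strings are made up for illustration.

```python
import re

def starts(pattern, text):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# "Iraq quit": the q at index 3 is followed by a space, the q at index 5 by a u
neg_ahead = starts(r"q(?!u)", "Iraq quit")    # [3]
pos_ahead = starts(r"q(?=u)", "Iraq quit")    # [5]

# "ab cb b": the b at index 1 follows an a, those at 4 and 6 do not
neg_behind = starts(r"(?<!a)b", "ab cb b")    # [4, 6]
pos_behind = starts(r"(?<=a)b", "ab cb b")    # [1]

# The lookahead is checked but not consumed: only "Monday " is returned
monday = re.findall(r"Monday\s(?=Wednesday)", "Monday Wednesday")   # ['Monday ']
```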
grep
- grep -E[otherFlags] [pattern] [file] : displays the lines of file that match the extended regexp pattern
- -n : display line numbers
- -i : ignore case
- egrep [flags] [pattern] [file] : extended grep (same as grep -E)
egrep '(one)|(two)' [file] : matches lines containing one OR two (alternation, for alternatives longer than a single character). Quote the pattern so the shell does not interpret | and the parentheses.
Bash (linux shell)
pattern="^[0-9]{8}$"
# =~ is the regexp-match operator
if [[ $date =~ $pattern ]]; then
    echo "date is valid"
fi
Python Regexps
import re
robj = re.compile(r'\d{3}-\d{3}-\d{4}') # create regex object based on "compiled" query
matches = robj.search('Numbers:213-456-5678,432-901-9111') # match regex with string provided
matches.group(0) # the whole match; group(k) with k >= 1 picks the k-th capturing group, if the regexp has any
# by default .search() returns 1st match only
Python regexps are greedy by default, ie. in ambiguous situations they match the longest find.
To make a quantifier non-greedy append ? to it (eg *? or +?).
robj.findall(string) # instead of .search(string) to return all matches
re.compile(regexp, re.I) # case insensitive matching
robj.sub(replacement, string) # replace every match of the regex in string with replacement (pass count=1 for the 1st match only)
robj.sub(lambda x: repls[x.group()], string) # replace each match with its own replacement - repls = {matchedText1: repl1, matchedText2: repl2}
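The dictionary-based replacement can be tried with a small self-contained sketch. The repls mapping and the sample sentence are hypothetical, and building the pattern by joining the escaped dictionary keys is just one possible way to do it.

```python
import re

# Hypothetical mapping from matched text to its replacement
repls = {"cat": "dog", "red": "blue"}

# Build one alternation out of the (escaped) keys: cat|red
robj = re.compile("|".join(re.escape(key) for key in repls))

# The function receives each match object and returns its replacement
result = robj.sub(lambda m: repls[m.group()], "the red cat saw a cat")
# 'the blue dog saw a dog'

# count=1 limits the substitution to the first match
first_only = robj.sub("X", "the red cat", count=1)   # 'the X cat'
```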
Java regexps
Java's String.matches() and Matcher.matches() try to match the entire string; Matcher.find() looks for the pattern anywhere in it.
import java.util.regex.*;
String str = "whatever...";
Pattern pat = Pattern.compile("\\b[A-Za-z]\\b"); // a single letter standing alone as a word
Matcher matches = pat.matcher(str);
while (matches.find())
{
    if (matches.group().length() != 0)
    {
        System.out.println(matches.group());
    }
    System.out.println("Start index=" + matches.start() + " End index = " + matches.end());
}
// replace every whitespace character with ", " in str:
Pattern pat2 = Pattern.compile("\\s");
Matcher matches2 = pat2.matcher(str);
System.out.println(matches2.replaceAll(", "));
Javascript regexps
regexp.test(expression)
// eg. document.write(/cats/i.test("Cats are funny.")) -> outputs true
expression.replace(regexp, replaceStr)
// eg document.write("Cats are friendly".replace(/cats/gi,"dogs")) -> outputs dogs are friendly
PHP regexps
$regexp = "/[a-z]/";
$field = "Undercover Brother";
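preg_match($regexp, $field); // returns 1 if $field contains a match, 0 otherwise
preg_replace($pattern, $replacement, $str); // replace; the older ereg[i]_replace() functions were removed in PHP 7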
C++ regexps
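#include <iostream>
#include <regex>
#include <string>
using namespace std;
string str;
cin >> str;
regex re{"abc(a){2}", regex_constants::icase}; // case-insensitive pattern
bool match = regex_match(str, re);   // true only if the whole string matches
bool match2 = regex_search(str, re); // true if any part of the string matches
smatch matches; // std::match_results for strings; an overload takes it as 2nd argument: regex_search(str, matches, re)
// syntax notes: [[:w:]] word + number matching, [[:d:]] number match, \. literal dot, [^cd] not c and not d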
Github
Github repository link.