Regular Expressions#

Regular Expressions Concept#

A Regular Expression (aka RE or Regexp) is a pattern used to match or find strings. The basic concept is similar to the wildcards known from the shell: with an asterisk * or a question mark ? you can match files against a pattern, so cu*txt means everything starting with cu and ending with txt. We say that the pattern cu*txt matches cumail.txt and cucumber.txt.

If you know Regexp already (as a sysadmin you are probably using sed), you may be bored with the basics, but I consider this topic an extremely useful approach to string processing, especially when parsing logs, working with files, etc. Because of the tough syntax and poor human readability, REs are not easy to learn and use, so this chapter may be a bit challenging. Anyway, it is very good to be familiar with Regular Expressions, as shown in XKCD!

Syntax and Simple Matching#

In Ruby, a Regular Expression is a pattern between two slashes with an optional modifier: /pattern/modifier. Like everything else in Ruby, a Regexp is an object.

The simplest way to match a string is the matching operator =~, used with a string and an RE as operands: string =~ regexp. It returns the index of the first match of the pattern, or nil if there is no match.

'the empire strikes back' =~ /the/    # match at the first char
#=> 0

'the empire strikes back' =~ /ikes /  # note the space
#=> 14

'the empire strikes back' =~ /ikes  / # this pattern doesn't match
#=> nil                                   # because of double space

'the empire strikes back' =~ /i/      # single character is RE as well
#=> 7

There is an excellent web page to test your Regular Expressions: http://rubular.com. I strongly recommend developing your Regexps there.

The matching operator is often used in if-then-else statements, because Ruby treats everything except false and nil as true (so a match at index 0 still counts as true).

Another operator often used for matching is the case equality operator, the triple equal sign ===. It takes the Regexp as the left operand and the string as the right one, and always returns true or false.
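
A minimal sketch of this idiom follows. The helper name classify and the sample log messages are assumptions made up for illustration; they are not part of any library.

```ruby
# classify is a hypothetical helper name used only for this illustration.
def classify(log_line)
  # =~ returns an index (truthy, even when it is 0) or nil (falsy).
  if log_line =~ /Failed password/
    'suspicious'
  else
    'ok'
  end
end

puts classify('Failed password for root from 10.0.0.1')  # the match is at index 0, still truthy
puts classify('Accepted publickey for deploy')
```

Because only nil and false are falsy in Ruby, there is no need to compare the result of =~ with nil explicitly.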

/the/ === 'the empire strikes back'      # true, found at the beginning
#=> true
/empire/  === 'The Empire Strikes Back'  # Regexp operations are case-sensitive
#=> false
/empire/i === 'The Empire Strikes Back'  # unless you add modifier 'i'
#=> true

In the example above we introduced the modifier 'i'. It means that all operations on this Regular Expression are case-insensitive.

Note that the case equality operator is used in case-when statements, so it is very convenient to use an RE in such a statement:

case `uname -a`
when /linux/i, /freebsd/i
  puts 'Users database is in /etc/passwd'
when /darwin/i
  puts 'Users are kept in Open Directory'
end

Patterns#

So far we know how to match the string itself, but what about the wildcards, like asterisk in the shell? In Regular Expression, the approach is a bit different. Let’s start with the simplest patterns:

  • . matches any single character
  • ^ matches the beginning of the line (or the string, if it is one line long)
  • $ matches the end of the line (or the string, like above)
  • \A matches the beginning of the whole string (not only the line)
  • \z matches the end of the whole string (not only the line)

/Back$/ =~ 'The Empire Strikes Back'   # 'Back' found at the end of the line
#=> 19
/Back$/ =~ 'The Empire Strikes Back!'  # but not here - there is a character before the end
#=> nil
/Back.$/ =~ 'The Empire Strikes Back!' # dot is a pattern for ANY SINGLE character, so '!' matches
#=> 19
/^The Empire ....... Back$/ =~ 'The Empire Strikes Back'  # matched
#=> 0
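
The difference between the line anchors (^ and $) and the string anchors (\A and \z) only shows up on multi-line strings. Here is a small sketch, assuming a made-up two-line string:

```ruby
# A two-line string, made up for this illustration.
text = "first line\nsecond line"

p(text =~ /^second/)   # ^ also matches right after a newline    => 11
p(text =~ /\Asecond/)  # \A matches only at the very beginning   => nil
p(text =~ /line$/)     # $ also matches right before a newline   => 6
p(text =~ /line\z/)    # \z matches only at the very end         => 18
```

When validating a whole string (rather than searching line by line), \A and \z are usually the safer choice.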

Matching a single character is not especially useful. It would be good to have something similar to a wildcard to match more characters in the expression. Unlike in the shell, there is no universal "zero or more of any character" wildcard, and the approach is different: the asterisk * means zero or more occurrences of the preceding Regular Expression. The preceding Regexp may be any Regular Expression, in particular a single character. This sometimes confuses sysadmins, because res*conf does not match resolv.conf! It matches resconf, reconf and resssssssconf: zero or more occurrences of the preceding expression, in this case the single character 's'.

With this knowledge we can construct the equivalent of the shell wildcard. In Regular Expressions it is .* because the dot means any character, and the asterisk means zero or more occurrences of the preceding expression. A dot followed by an asterisk, .*, is therefore the Regular Expression equivalent of the * wildcard in the Unix shell.
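
The resolv.conf case can be checked directly in irb. This is only a short sketch using the file names mentioned above:

```ruby
p('resolv.conf' =~ /res*conf/)   # 's*' cannot cover 'olv.'            => nil
p('resconf'     =~ /res*conf/)   # one 's'                             => 0
p('reconf'      =~ /res*conf/)   # zero 's' characters is fine too     => 0
p('resolv.conf' =~ /res.*conf/)  # '.*' covers any run of characters   => 0
```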

Asterisk is not the only pattern repetition:

  • RE* matches zero or more occurrences of RE
  • RE+ matches one or more occurrences of RE
  • RE? matches zero or one occurrence of RE

'The Empire Strikes Back' =~ /The.*Back/  # match: zero or more occurrences of any character between the words
#=> 0
'The Empire Strikes Back' =~ /The.?Back/  # no match: there is more than one character between the words
#=> nil

'The Empire Strikes Back' =~ /Empire +Strikes/    # found one or more spaces between these words
#=> 4
'The Empire    Strikes Back' =~ /Empire +Strikes/ # as above
#=> 4
'TheEmpireStrikesBack' =~ /Empire +Strikes/       # there must be at least one space between the words
#=> nil
'TheEmpireStrikesBack' =~ /Empire *Strikes/       # asterisk allows zero occurrences
#=> 3

Square brackets are used to match any single character from a set. For example, [aeiouy] matches a single vowel, [0123456789] a digit, and [.,?!] one of these punctuation marks: .,?!. Notice that outside square brackets the dot and question mark (as well as the other metacharacters) must be escaped with a backslash because they are part of the Regular Expression syntax; inside the brackets they are taken literally.

Instead of typing [0123456789] you can use a range: [0-9]. The same works for letters.

  • [qwerty] matches any one of the characters 'q', 'w', 'e', 'r', 't' and 'y'
  • [0-9] matches range: any single digit
  • [^0-9] matches any character except the digits
  • [a-z] matches range: any single lowercase letter (from 'a' to 'z')
  • [A-Z] matches range: any single uppercase letter (from 'A' to 'Z')

'The Empire Strikes Back'.gsub(/[aeiouy]/, '*')        # replace all occurrences of lowercase vowels with a star
#=> "Th* Emp*r* Str*k*s B*ck"
'The Empire Strikes Back'.sub(/[aeiouy]/, '*')         # replace first match only
#=> "Th* Empire Strikes Back"
'The Empire Strikes Back'.gsub(/[A-Z][a-z]*/, 'Yoda')  # replace all capitalized words with a string
#=> "Yoda Yoda Yoda Yoda"
'The Empire Strikes Back'.scan(/[A-Z][a-z]*/)          # find and return all capitalized words
#=> ["The", "Empire", "Strikes", "Back"]

In the example above we introduced three useful String methods that use Regular Expressions:

  • sub(regexp, string), which replaces the first match of RE with the given string
  • gsub(regexp, string), which replaces all matches, not only the first one
  • scan(regexp), which finds all matches and returns an array containing the found strings
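
The same three methods combine naturally with ranges and negated character classes. Here is a brief sketch; the phone-like input string is made up for illustration:

```ruby
# Hypothetical input string, made up for this illustration.
phone = 'tel. +48 (22) 555-01-23'

p phone.gsub(/[^0-9]/, '')    # drop everything that is not a digit  => "48225550123"
p phone.scan(/[0-9]+/)        # every run of digits                  => ["48", "22", "555", "01", "23"]
p phone.sub(/[a-z]+/, 'TEL')  # only the first run of lowercase letters is replaced
```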

There are more Regular Expression patterns available; for the complete list take a look at the documentation: ri Regexp. Here is a list of a few useful patterns:

  • \s any single whitespace character (space, tab, newline)
  • \S any non-whitespace character (everything except space, tab, newline)
  • \d any digit (equivalent to [0-9])
  • \D any non-digit (equivalent to [^0-9])
  • \w any word character (letter, digit, underscore)
  • \W any non-word character (everything except letter, digit, underscore)

'The Empire Strikes Back'.gsub(/\s/, '.')           # replace all whitespace characters with dots
#=> "The.Empire.Strikes.Back"
'I have $33, and you $15.50.'.scan(/\$\d+\.?\d*/)   # search for the money (at most one dot)
#=> ["$33", "$15.50"]

# returns true if the given string is a real (floating point) number
def is_float?(str)
  # the whole string must be: at least one digit, exactly one dot, at least one digit;
  # \A and \z anchor the whole string, not just a single line
  /\A\d+\.\d+\z/.match?(str)   # match? needs Ruby 2.4+; on older versions use !!(str =~ ...)
end
is_float? '42.42'
#=> true
is_float? 'Yoda'
#=> false
is_float? '42'
#=> false
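
The shorthand classes from the list above can be compared side by side. This is only a sketch; the sample line is invented for illustration:

```ruby
# Hypothetical log fragment, made up for this illustration.
line = 'user_42 logged in at 19:20'

p line.scan(/\w+/)     # word characters stop at ':'        => ["user_42", "logged", "in", "at", "19", "20"]
p line.scan(/\S+/)     # non-whitespace keeps '19:20' whole => ["user_42", "logged", "in", "at", "19:20"]
p line.gsub(/\D/, '')  # remove every non-digit             => "421920"
```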

Capturing the Matched Strings#

Simple Regular Expression matching is often not enough, because it only checks whether the pattern fits the string. It would be nice to extract the matched substrings. To do so, use parentheses: the part of the string matched by the Regular Expression enclosed in parentheses becomes part of the output as a match group.

It is easier to understand with an example. Let's say we have a string with two amounts in USD. What we need to do is search for a number (which may contain a dot) preceded by the dollar sign; the RE /\$\d+\.?\d*/ does the trick. It is a dollar sign (escaped with a backslash, as a bare dollar means end of line), then one or more digits \d+, then zero or one dot (escaped as well), and at the end the optional digits. We want to output only the numbers, without the leading dollar sign, so we enclose in parentheses only the part of the RE that matches the number: /\$(\d+\.?\d*)/. There can be anything between the two amounts, so we put .* between them, and now we can extract the exact amounts from the string:

usd = /\$(\d+\.?\d*).*\$(\d+\.?\d*)/         # RE definition
#=> /\$(\d+\.?\d*).*\$(\d+\.?\d*)/

m = usd.match 'I have $33, and you $15.50.'  # match this string
#=> #<MatchData "$33, and you $15.50" 1:"33" 2:"15.50">
m[1]                                         # first match group
#=> "33"
m[2]                                         # second match group
#=> "15.50"

m = usd.match 'This movie costs $12.50 ($4 to rent).'
#=> #<MatchData "$12.50 ($4" 1:"12.50" 2:"4">

Notice the match(string) method on the Regexp object. It is similar to the match operator =~, but instead of the position of the first match it returns a MatchData object (or nil if there is no match), which contains the whole matched substring and the match groups. In the previous example there are two match groups, m[1] and m[2]. Notice that match groups are counted from 1, because m[0] contains the whole matched string.

Match groups do not have to be indexed by numbers; they can be named. The syntax for naming a group is (?<name>pattern) or (?'name'pattern). Then you can use this name to extract the string from the MatchData object. As an example, let's define a Regular Expression to extract data from the /etc/passwd file:
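
A short sketch of what the MatchData object offers, using a single-group pattern and a made-up sentence. MatchData#captures is a standard method that returns all groups as an array.

```ruby
m = /\$(\d+\.?\d*)/.match('The ticket costs $12.50 today')

p m[0]        # the whole matched string       => "$12.50"
p m[1]        # the first match group          => "12.50"
p m.captures  # all groups, without m[0]       => ["12.50"]
p m[1].to_f   # groups are strings; convert them when you need numbers

# match returns nil when the pattern does not fit, so check before indexing
p(/\$(\d+)/.match('no money here'))
```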

re = /(?'username'\w+):.*:\d+:\d+:(?'fullname'.*):.*:(?'shell'.+)/   # \d+ so that multi-digit IDs match too
#=> /(?'username'\w+):.*:\d+:\d+:(?'fullname'.*):.*:(?'shell'.+)/

m = re.match 'root:*:0:0:System Administrator:/var/root:/bin/sh'
#=> #<MatchData "root:*:0:0:System Administrator:/var/root:/bin/sh" username:"root" fullname:"System Administrator" shell:"/bin/sh">
m[:username]
#=> "root"

m = re.match 'daemon:*:1:1:System Services:/var/root:/usr/bin/false'
#=> #<MatchData "daemon:*:1:1:System Services:/var/root:/usr/bin/false" username:"daemon" fullname:"System Services" shell:"/usr/bin/false">
m[:shell]
#=> "/usr/bin/false"
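
Both naming syntaxes behave the same way. The sketch below uses the angle-bracket form on a made-up key=value string. MatchData#names is a standard method; the pattern and input are assumptions for illustration.

```ruby
re = /(?<key>\w+)=(?<value>\w+)/
m  = re.match('shell=zsh')

p m[:key]     # access by symbol                      => "shell"
p m['value']  # or by string                          => "zsh"
p m.names     # names of all groups                   => ["key", "value"]
p m[1]        # named groups are still numbered too   => "shell"
```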

Last Match#

There is a quite handy shortcut for Regular Expressions: the Regexp.last_match class method. It is updated after every match attempt and returns the MatchData object of the last match (or nil if the last attempt failed). Thus it can be used together with the operators:

passwd_re = /(?'username'\w+):.*:\d+:\d+:(?'fullname'.*):.*:(?'shell'.+)/

line = 'root:*:0:0:System Administrator:/var/root:/bin/sh'
if passwd_re =~ line
  puts "Shell for #{Regexp.last_match[:username]} is #{Regexp.last_match[:shell]}"
else
  puts "String is not passwd line!"
end

line = 'wheel:*:0:root'
case line
when passwd_re
  puts "Shell for #{Regexp.last_match[:username]} is #{Regexp.last_match[:shell]}"
when /(?'group'\w+):.*:(?'groupid'\d+):.*/
  puts "Group #{Regexp.last_match(:group)} has id: #{Regexp.last_match(:groupid)}"
else
  puts "String is not passwd or group line!"
end

The first part uses the match operator to determine whether the given line matches the passwd pattern, and if so it reads the MatchData values from last_match. The second part is more interesting: as you probably remember, Ruby uses the case equality operator === to compare objects in case-when statements. Because this operator is defined on Regexp as well, Regular Expressions work in case-when by default, for free.
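
Regexp.last_match also accepts a group number, which is handy for unnamed groups. Here is a minimal sketch; group_info is a hypothetical helper name and the pattern is a simplified group-file matcher invented for illustration.

```ruby
# group_info is a hypothetical helper used only for this illustration.
def group_info(line)
  # After a failed match Regexp.last_match is nil, so bail out early.
  return nil unless line =~ /(\w+):.*:(\d+):/

  # last_match(n) is a shortcut for last_match[n]
  [Regexp.last_match(1), Regexp.last_match(2)]
end

p group_info('wheel:*:0:root')    # => ["wheel", "0"]
p group_info('not a group line')  # => nil
```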

Real World Example: Apache Log#

Finally, let's try to parse real log file entries. Notice that the lines variable contains a few lines, each ending with a newline, so we need to split them and iterate over each line. We already know the split method, and it is not surprising that it can take an RE instead of a String as an argument.

re = /^(\d+\.\d+\.\d+\.\d+).*\[(.*)\] "(.*)" (\d*) (\d*) "(.*)" "(.*)"$/
lines = '93.200.119.68 - - [20/Jun/2014:19:20:46 +0200] "GET /images/hidr.jpg HTTP/1.1" 200 335 "http://www.tg.pl/iPhone/AppStoreReviews/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
93.200.119.68 - - [20/Jun/2014:19:20:46 +0200] "GET /favicon.ico HTTP/1.1" 404 209 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
93.200.119.68 - - [20/Jun/2014:19:20:56 +0200] "GET /iPhone/AppStoreReviews/AppStoreReviews.zip HTTP/1.1" 200 4167 "http://www.tg.pl/iPhone/AppStoreReviews/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
5.9.18.147 - - [20/Jun/2014:19:24:28 +0200] "GET /robots.txt HTTP/1.0" 404 208 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"
5.9.18.147 - - [20/Jun/2014:19:24:32 +0200] "GET / HTTP/1.0" 200 3713 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.5; http://www.majestic12.co.uk/bot.php?+)"
'

lines.split(/\n/).each do |line|                    # split log to array and iterate on it
  m = re.match line                                 # match the lines one by one
  puts "Request from #{m[1]} at #{m[2]}" if m       # display message if matched
end
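
The same idea can be taken one step further with named groups, for example to count responses per HTTP status code. The following is only a sketch under stated assumptions: LOG_RE, status_counts and the shortened sample lines are made up for illustration, and the pattern covers only the leading fields of the log format shown above.

```ruby
# LOG_RE and status_counts are hypothetical names used only in this sketch.
# The pattern names the fields up to the response size and ignores the rest.
LOG_RE = /^(?<ip>\d+\.\d+\.\d+\.\d+) .*\[(?<time>[^\]]*)\] "(?<request>[^"]*)" (?<status>\d+) (?<bytes>\d+)/

def status_counts(log)
  counts = Hash.new(0)            # unknown keys start at 0
  log.each_line do |line|
    m = LOG_RE.match(line)
    next unless m                 # skip lines that are not log entries
    counts[m[:status]] += 1
  end
  counts
end

# Shortened sample entries (user agent trimmed), for illustration only.
sample = <<~LOG
  93.200.119.68 - - [20/Jun/2014:19:20:46 +0200] "GET /favicon.ico HTTP/1.1" 404 209 "-" "Mozilla/5.0"
  5.9.18.147 - - [20/Jun/2014:19:24:32 +0200] "GET / HTTP/1.0" 200 3713 "-" "Mozilla/5.0"
  this line is not a log entry
LOG

status_counts(sample).each do |status, count|
  puts "#{status}: #{count}"
end
```

Using [^\]]* and [^"]* instead of .* inside the delimiters keeps each group from running past its closing bracket or quote.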

Summary#

There is much more to REs in Ruby; this chapter contains only the basics needed to start working with them. For detailed information visit http://ruby-doc.org/core-2.0.0/Regexp.html or use the embedded documentation: ri Regexp.

Regular Expressions need some practice and can be hard to debug, especially when they become long and complicated. Always use Rubular to create and debug your Regexps!