ObiLab - BASH Scripting

More than one way, but is it the right one?

"The value and utility of any experiment are determined by the fitness of the material to the purpose for which it is used...".
(Gregor Mendel)

Regular Expressions

Regular expressions are extremely useful for extracting information from text such as code, log files, spreadsheets, or even documents. While there is a lot of theory behind formal languages, the following lessons and examples explore the more practical uses of regular expressions so that you can use them as quickly as possible. The first thing to recognize when using regular expressions is that everything is essentially a character, and we are writing patterns to match a specific sequence of characters (also known as a string). Most patterns use normal ASCII, which includes letters, digits, punctuation and other symbols on your keyboard like %#$@!, but Unicode characters can also be used to match any type of international text.

Common Tokens


\    #Escape Character
abc…	#Letters
123… 	#Digits
\d 	#Any Digit
\D 	#Any Non-digit character
. 	#Any Character
\. 	#Period
[abc] 	#Only a, b, or c
[^abc] 	#Not a, b, nor c
[a-z] 	#Characters a to z
[0-9] 	#Numbers 0 to 9
\w 	#Any Alphanumeric character or underscore
\W 	#Any character that is not alphanumeric or underscore
{m} 	#m Repetitions
{m,n} 	#m to n Repetitions
* 	#Zero or more repetitions
+ 	#One or more repetitions
? 	#Optional character
\s 	#Any Whitespace
\S 	#Any Non-whitespace character
^…$ 	#Starts and ends
\n      #new line
\t      #tab
\0      #null character
(…) 	#Capture Group
(a(bc)) #Capture Sub-group
(.*) 	#Capture all
(abc|def) #Matches abc or def
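
The repetition tokens in the table (?, {m}, +) are not used in the examples that follow, so here is a minimal sketch of how they behave with grep -P. The sample lines and the helper name sample_lines are invented for illustration and are not part of the tutorial's sample file.

```shell
# Hypothetical sample lines, not taken from the tutorial's sample file.
sample_lines() {
    printf '%s\n' 'color' 'colour' 'ab12' 'ab1234' 'abc'
}

# ? makes the preceding character optional: matches color and colour
sample_lines | grep -P '^colou?r$'

# {4} requires exactly four digits: matches only ab1234
sample_lines | grep -P '^[a-z]+\d{4}$'

# + requires one or more digits: matches ab12 and ab1234
sample_lines | grep -P '^[a-z]+\d+$'
```

The ^ and $ anchors are included so that the whole line has to fit the pattern; without them grep would also print lines that merely contain a matching substring.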

Example 1: letters

Download this Sample File and use "grep -P" to identify the rows that match this text: "abcde", "abc", "abcd"




cat file.txt | grep -nP 'abc' # -P selects the Perl-compatible regex engine, which supports more syntax than the default basic mode or the extended mode (-E)
1:abc
2:abcdef
4:abcdefg

Example 2: digits

Use grep to identify the rows that match this text: "abc123xyz", "define '123'", "var g = 123;"




cat file.txt | grep -nP '123'
5:123
6:var g = 123;
7:define '123'

Example 3: rows with any digit inside


grep -Pn '\d' file.txt
5:123
6:var g = 123;
7:define '123'

Example 4: rows with any "white space" inside


grep -Pn '\s' file.txt
6:var g = 123;
7:define '123'

Regular expressions allow us to not just match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the special parentheses ( and ) metacharacters. Any subpattern inside a pair of parentheses will be captured as a group. In practice, this can be used to extract information like phone numbers or emails from all sorts of data.
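
As a rough sketch of extraction with groups, the snippet below uses sed -E to pull the variable name and the number out of a line such as "var g = 123;". grep only prints matching lines, so sed is swapped in here to show the captured text through the back-references \1 and \2. The function name extract_assignment is made up for this example, and because sed -E uses POSIX extended syntax, [0-9] stands in for \d.

```shell
# Print "NAME VALUE" for lines of the form: var NAME = NUMBER;
# Group 1 captures the variable name, group 2 captures the digits.
extract_assignment() {
    sed -nE 's/^var ([a-z]+) = ([0-9]+);$/\1 \2/p'
}

# Only the third line has the expected shape, so only "g 123" is printed.
printf '%s\n' 'abc' "define '123'" 'var g = 123;' | extract_assignment
```

The -n option together with the trailing p flag makes sed print only the lines where the substitution succeeded, so non-matching rows are dropped.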

Example 5: rows with file names of pictures, not of other files. So, .jpg, .png, but not .zip or .tar.gz


#this regex catches all rows that END with .png (the dot is escaped so that it matches a literal period),
#but we would like to match several extensions, e.g. jpg, png, tif
    grep -nP '\.png$' file.txt

#so we can use a group with the "or" operator
    grep -nP '\.(png|jpg|tif)$' file.txt
#Remember the truth table of the binary "or" operator: it is "true" if one or both expressions are true.
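
One detail worth checking in patterns like these is the dot: unescaped, it matches any character, so a pattern meant for ".png" can also match names that contain no period at all. The following sketch compares the two forms on invented file names (the names helper is hypothetical and not part of the sample file).

```shell
# Hypothetical file names; "notapng" contains no dot at all.
names() {
    printf '%s\n' 'photo.png' 'notapng' 'archive.zip'
}

# Unescaped dot: "." matches the "a" in "notapng", so two lines are printed
names | grep -P '.png$'

# Escaped dot: only the real .png file name is printed
names | grep -P '\.png$'
```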

Example 6: rows with file names of pictures starting with "foto", not of other regular files. So, foto_.jpg, foto__.png, but not image.jpg, foto.zip or .tar.gz



grep -nP '^foto\w+\.(png|jpg|tif)$' file.txt #\w+ allows one or more word characters after "foto"; \. matches a literal period

When you are working with complex data, you can easily find yourself having to extract multiple layers of information, which can result in nested groups. Generally, the results of the captured groups are in the order in which they are defined (in order by open parenthesis). The nested groups are read from left to right in the pattern, with the first capture group being the contents of the first parentheses group, etc.
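
To make the numbering concrete, here is a small sketch that prints each captured group using sed -E back-references. It borrows the "I love (java|C)" pattern from this lesson; the function name show_groups and the input lines are assumptions chosen for illustration.

```shell
# Group 1 is opened first, so it holds the whole sentence;
# group 2 is opened second and holds only the language.
show_groups() {
    sed -nE 's/^(I love (java|C))$/group1=[\1] group2=[\2]/p'
}

# "I love tea" does not match the pattern and produces no output.
printf '%s\n' 'I love java' 'I love C' 'I love tea' | show_groups
```

The outer parentheses open before the inner ones, which is why \1 refers to the full sentence and \2 to the nested alternative.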

Example 7: Match rows with nested group


grep -nP '^(I love)' file.txt #rows starting with "I love"

#but I would like to match only rows that name the programming languages "java" OR "C"
grep -nP '^(I love (java|C)$)' file.txt

Writing a good, optimized regex is not always easy, but regular expressions can often save us a lot of time.