Regular Expressions with Greedy Sed

Regular expressions are scary to many. I overcame that fear when I had to use them in my PHP scripting. Regular expressions are complex, and that complexity is exactly what makes them so flexible.

I found an old XML file on my drive that was 100 KB in size and consisted of a single line of text, which was horrible to work with. I used *sed* to insert newlines where they belonged and processed the file that way. That still wasn’t enough: I had to write a sed regular expression that would extract two pieces of data from a single line. Consider the following example:

<Element attrib1="prop1" attrib2="prop2" attrib3="prop3" attrib4="prop4"><Child><Child2 attrib5="EXTRACT1"/></Child><Sibling><Child3 attrib5="prop5"/></Sibling><Sibling2 attrib6="prop6">EXTRACT2</Sibling2></Element>

(Note that this is just an example, and not the actual XML data in the file)

What I wanted to do was extract the word EXTRACT1 together with EXTRACT2, the text that follows the last opening tag on the line. The final sed command I used was this:

sed -e 's#^.*attrib5="\([^"]*\)".*attrib5[^>]*><[^>]*><[^>]*>\([^<]*\)<.*$#\1 - \2#g' file
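To try the command without creating a file, the sample line can be piped straight into sed. This is only a minimal sketch: the variable names line and result are my own, and a plain hyphen is used as the separator in the replacement.

```shell
# The sample line from above, held in a variable (the name "line" is illustrative).
line='<Element attrib1="prop1" attrib2="prop2" attrib3="prop3" attrib4="prop4"><Child><Child2 attrib5="EXTRACT1"/></Child><Sibling><Child3 attrib5="prop5"/></Sibling><Sibling2 attrib6="prop6">EXTRACT2</Sibling2></Element>'

# Feed the line to sed on standard input instead of naming a file.
result=$(printf '%s\n' "$line" | sed -e 's#^.*attrib5="\([^"]*\)".*attrib5[^>]*><[^>]*><[^>]*>\([^<]*\)<.*$#\1 - \2#')

printf '%s\n' "$result"   # prints: EXTRACT1 - EXTRACT2
```

The trailing g flag is left off in this sketch. The expression is anchored at both ends with ^ and $, so it can match at most once per line and the flag makes no difference.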

Here is what all that means. To begin, here are the matches and their corresponding regular expression parts in a table:

Regular Expression    Matched Text
^.*                   <Element attrib1="prop1" attrib2="prop2" attrib3="prop3" attrib4="prop4"><Child><Child2
attrib5="             attrib5="
\([^"]*\)             EXTRACT1
".*attrib5            "/></Child><Sibling><Child3 attrib5
[^>]*>                ="prop5"/>
<[^>]*>               </Sibling>
<[^>]*>               <Sibling2 attrib6="prop6">
\([^<]*\)             EXTRACT2
<.*$                  </Sibling2></Element>
  • The first part, ^.*, matches from the start of the line through as much text as it can while still letting the rest of the expression match.
  • The attrib5=" matches exactly that string.
  • The \([^"]*\) is where it gets entertaining.
    • First of all, the \( and the \) at the end mark a capture group. By default sed uses basic regular expressions, where a bare parenthesis is a literal character and the backslashed form does the grouping. The captured text is what \1 and \2 refer to in the replacement.
    • The brackets in [^"]* form a bracket expression, and the leading ^ negates it, so it matches any character except a double quote. This is important because sed (as well as grep) is greedy in its matching: a .* takes the longest text it can, so it would run to the last double quote on the line rather than stop at the first. Since I want the match to stop when reaching a double quote (non-inclusive), I’m telling it to match everything except a double quote, which makes it stop at the first double quote it encounters. It actually took me quite a while to figure this out, but as you will see, this pattern keeps repeating in the regular expression.
  • The ".*attrib5 matches the closing quotation mark and everything after it up to the other occurrence of attrib5. This is important, again because sed is greedy: without it, the leading ^.* would run to the last attrib5 on the line and the first group would capture prop5 instead of EXTRACT1. Requiring a second attrib5 later in the line forces the first one in the expression onto the first one in the text.
  • The [^>]*> matches everything after attrib5 up to and including the first >.
  • The next two <[^>]*> each match one XML tag, including the opening <, the closing > and all the text between them.
  • The \([^<]*\)< captures the text up to the next <, then matches that <.
  • The .*$ then matches everything up to the end of the line.
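The greedy behaviour described in these bullets is easier to see on a smaller input. The sketch below is only an illustration: the two-attribute line in short is made up, the variable names are mine, and the -E variant assumes a sed that accepts that option (GNU and BSD sed do). It compares .* with [^"]* inside the group, then shows what happens to a trimmed copy of the sample line when the second attrib5 is left out of the expression.

```shell
# Hypothetical short input, used only for illustration.
short='a="one" b="two"'

# Greedy: .* inside the group runs to the LAST double quote on the line.
greedy=$(printf '%s\n' "$short" | sed 's/a="\(.*\)".*/\1/')
printf '%s\n' "$greedy"          # prints: one" b="two

# Bounded: [^"]* cannot cross a double quote, so it stops at the FIRST one.
bounded=$(printf '%s\n' "$short" | sed 's/a="\([^"]*\)".*/\1/')
printf '%s\n' "$bounded"         # prints: one

# Same capture in extended syntax, where bare parentheses do the grouping.
extended=$(printf '%s\n' "$short" | sed -E 's/a="([^"]*)".*/\1/')
printf '%s\n' "$extended"        # prints: one

# The sample line, trimmed to the part that matters here.
line='<Child2 attrib5="EXTRACT1"/></Child><Sibling><Child3 attrib5="prop5"/></Sibling>'

# Without a second attrib5, the leading ^.* swallows the first occurrence.
without_second=$(printf '%s\n' "$line" | sed 's#^.*attrib5="\([^"]*\)".*$#\1#')
printf '%s\n' "$without_second"  # prints: prop5

# With it, the first attrib5 in the expression is forced onto the first in the text.
with_second=$(printf '%s\n' "$line" | sed 's#^.*attrib5="\([^"]*\)".*attrib5.*$#\1#')
printf '%s\n' "$with_second"     # prints: EXTRACT1
```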

Here are some basic regular expression special symbols:

The ^, or the caret as it is called, has two meanings. As the first character inside brackets, as in [^"], it is a negation: match any character except the ones listed. At the very beginning of the regular expression, it means to match the beginning of the line.
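The two roles of the caret can be compared side by side. This is a small sketch on a made-up input string, with variable names of my own choosing.

```shell
# Caret at the start of the expression: anchor to the beginning of the line.
# Even with the g flag, only the first abc is replaced.
anchored=$(printf 'abcabc\n' | sed 's/^abc/X/g')
printf '%s\n' "$anchored"   # prints: Xabc

# Caret as the first character inside brackets: any character except those listed.
negated=$(printf 'abcabc\n' | sed 's/[^a]/X/g')
printf '%s\n' "$negated"    # prints: aXXaXX
```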

The *, also known as the star or asterisk (as it should be called), matches zero or more repetitions of the previous item. The previous item can be a single character, a bracket expression such as [^"], or a group enclosed in \( and \).
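A few short substitutions show what the star repeats and why matching nothing at all also counts as a match. The inputs are made up for illustration, and the variable names are mine.

```shell
# Star after a single character: as many of that character as possible.
single=$(printf 'aaab\n' | sed 's/a*/X/')
printf '%s\n' "$single"     # prints: Xb

# Zero repetitions also match: the empty match at the start of the line wins.
empty=$(printf 'id12345x\n' | sed 's/[0-9]*/N/')
printf '%s\n' "$empty"      # prints: Nid12345x

# Requiring one digit first makes the bracket expression plus star find the number.
digits=$(printf 'id12345x\n' | sed 's/[0-9][0-9]*/N/')
printf '%s\n' "$digits"     # prints: idNx

# Star after a group repeats the whole group.
grouped=$(printf 'ababc\n' | sed 's/\(ab\)*/X/')
printf '%s\n' "$grouped"    # prints: Xc
```

This is also why the main expression is safe with [^"]* and [^<]*: each one sits between literal characters that must match, so an empty repetition cannot end the match early.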