Don’t Be Greedy: 6 Tips for Better Regular Expressions

If you’re a programmer, you’ve no doubt been in this situation before: you’ve got to convert a bunch of data from one form to another.  Maybe you’ve got an enum that needs to be processed with a switch statement (and you don’t have ReSharper to do it for you!), or maybe it’s static strings that need to be put into a dictionary…or a million other possibilities.  If it’s only five or ten lines, you’d probably just retype it, or cut and paste: the rote approach.  But what if it’s twenty lines…or a hundred?  That’s when it’s worth it to get down and dirty with some regular expressions.

I had to do this just the other day: I had an HTML table containing a descriptive name and a product code, and some other (possibly empty) fields, and I needed to turn it into T-SQL ALTER statements.  I’ll use this example to explain some of my favorite regex tricks.

What I had was something like this (I’ve substituted \t for the actual tabs):

Widget Alpha -- 50', Blue\tWG-A-50B\tSome Field\tLast Field
Widget Beta -- 10', Red\tWG-B-10R\t \tLastField
[...150 more lines like this]

And what I needed was this:

ALTER Product SET Name='Widget Alpha -- 50'', Blue' WHERE Model='WG-A-50B';
ALTER Product SET Name='Widget Beta -- 10'', Red' WHERE Model='WG-B-10R';

I did this in Visual Studio 2010, which uses its own brand of regular expressions, but I’ll make sure to point out where I’m using Microsoft-specific regex syntax.

 

1. Don’t try to do everything in one regular expression.

Sure, you probably could, but you’d drive yourself crazy doing it.  It’s easiest to work in steps.  The first thing I knew was going to cause trouble was the single quotation marks in the descriptions: those need to be escaped to be used in a T-SQL string.  T-SQL escapes single quotation marks by doubling them, so the first replacement was easy (and didn’t even need regular expressions): replace ' with ''.
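If you’d rather script this step than do it in an editor, the same plain-text replacement might look like the following in Python.  This is only a sketch: the function name `escape_sql_quotes` is my own invention for illustration, and the sample string is taken from the data above.

```python
def escape_sql_quotes(text: str) -> str:
    """Double every single quote so the text can sit inside a T-SQL string literal."""
    # A plain string replacement is enough here; no regular expression needed.
    return text.replace("'", "''")


name = "Widget Alpha -- 50', Blue"
escaped = escape_sql_quotes(name)
print(escaped)
```

Doing this as its own step keeps the later regular expression free of any quote handling.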

 

2. Don’t be greedy.

One of the subtleties of regular expressions that people often neglect is the difference between greedy and lazy matches.  Greedy matching (the default) means that any quantifier (like * or +) will match as much as it can, whether that’s what you intended or not; lazy matching stops as soon as the rest of the pattern can match.  In general, you can get the job done with either, but it’s often more work with greedy matching.
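To see the difference in isolation, here is a small sketch using Python’s `re` module, which uses the Perl-style `*?` for lazy matching rather than the Visual Studio syntax.  The sample string is made up for illustration.

```python
import re

# Three tab-separated fields (hypothetical sample data).
text = "alpha\tbeta\tgamma"

# Greedy: .* grabs as much as it can, so it runs up to the LAST tab.
greedy = re.match(r"(.*)\t", text).group(1)

# Lazy: .*? grabs as little as it can, so it stops at the FIRST tab.
lazy = re.match(r"(.*?)\t", text).group(1)

print(repr(greedy))
print(repr(lazy))
```

The greedy version captures the first two fields together (tab included), while the lazy version captures only the first field.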

 

3. Tagged expressions in Visual Studio.

Note that VS uses curly brackets to denote tagged subexpressions, whereas basic regular expressions use \( and \), and extended regular expressions use plain parentheses.
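For comparison, here is roughly what tagged subexpressions look like outside Visual Studio.  This sketch uses Python’s `re` module, where groups are written with plain parentheses and referenced in the replacement as `\1`, `\2`, and so on; the input string is a made-up example.

```python
import re

# Each pair of parentheses tags a subexpression; \1 and \2 refer back to
# them in the replacement text. Here the two words are swapped.
swapped = re.sub(r"(\w+),(\w+)", r"\2,\1", "left,right")

print(swapped)
```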

In our example, the “obvious” regular expression won’t do what I want: replacing “{.*}\t{.*}” with “ALTER Product SET Name='\1' WHERE Model='\2';” yields:

ALTER Product SET Name='Widget Alpha -- 50', Blue\tWG-A-50B\tSome Field' WHERE Model='Last Field';
ALTER Product SET Name='Widget Beta -- 10', Red\tWG-B-10R\t ' WHERE Model='LastField';

Whoops!  What’s going on here is that the first “.*” is greedily matching everything up to the last tab character, effectively lumping the product name together with the next two fields.

Of course I could just use “{.*}\t{.*}\t.*\t.*”, but I like my regular expressions to be lazy, just like I am.  Perl-style regular expression syntax uses *? and +? to specify the corresponding lazy matches, but Microsoft has chosen to go its own way and use @ and #.

So now I can finally get what I want by replacing “{.@}\t{.@}\t.*” with “ALTER Product SET Name='\1' WHERE Model='\2';”
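If you don’t have Visual Studio handy, the whole conversion can be sketched in Python.  This is an approximation under a few assumptions: Python writes lazy matches as `.*?` instead of `.@`, the names `PATTERN`, `TEMPLATE`, and `to_statement` are mine, and the two input lines are the sample rows from the top of the post with real tab characters.

```python
import re

lines = [
    "Widget Alpha -- 50', Blue\tWG-A-50B\tSome Field\tLast Field",
    "Widget Beta -- 10', Red\tWG-B-10R\t \tLastField",
]

# Lazy groups: each (.*?) stops at the first tab it can, so group 1 is the
# name and group 2 is the product code. The trailing .* swallows the rest.
PATTERN = re.compile(r"^(.*?)\t(.*?)\t.*$")
TEMPLATE = r"ALTER Product SET Name='\1' WHERE Model='\2';"


def to_statement(line: str) -> str:
    """Double the single quotes first (tip 1), then apply the regex (tip 3)."""
    escaped = line.replace("'", "''")
    return PATTERN.sub(TEMPLATE, escaped)


statements = [to_statement(line) for line in lines]
for statement in statements:
    print(statement)

# For contrast, the greedy version puts the LAST field in group 2.
greedy_model = re.match(r"(.*)\t(.*)", lines[0]).group(2)
```

Note that the quote doubling happens on the whole line before the regex runs, which mirrors the two-step approach from tip 1.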

 

4. Ctrl-Z is your friend.

Unless you’re using regular expressions many times a day, every day, chances are you’re not going to remember every modifier, so expect a little trial and error.  Fortunately, our friend Ctrl-Z is there to help us out.  Don’t be afraid to shrug your shoulders and try something out…if it mangles your text, just undo it and try again.

 

5. If it’s not matching, make sure you’re not accidentally using an un-escaped metacharacter.

If you keep getting “no matches found”, you might want to check and make sure you’re not trying to match a literal with an unescaped metacharacter. VS in particular is tricky this way…almost every punctuation mark is a metacharacter in VS, and needs to be escaped (with a backslash).
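The same trap exists in every regex flavor.  Here is a small Python sketch of it; the product string is hypothetical, and `re.escape` is the standard-library helper that backslash-escapes metacharacters for you.

```python
import re

code = "WG-A-50B (v1.0)"          # hypothetical literal we want to find
text = "Model: WG-A-50B (v1.0)"

# Unescaped: the parentheses are read as a group and the dot as
# "any character", so the literal "(" in the text is never matched.
unescaped_match = re.search(code, text)

# Escaped: every metacharacter is treated as a literal character.
escaped_match = re.search(re.escape(code), text)

print(unescaped_match)
print(escaped_match.group(0))
```

The first search finds nothing even though the text plainly contains the string, which is exactly the “no matches found” symptom described above.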

 

6. Learning regular expressions is so worth it.

If you’re new to regular expressions, or inexperienced with them, this may all seem completely impenetrable and arcane, and I get it, man, I do.  There’s an undeniable learning curve to regular expressions, but once you reach a certain proficiency, you will save an unbelievable amount of time.  There are a million good tutorials and books out there if you’re just getting started.  I’m especially partial to Jim Hollenhorst’s tutorial, particularly if you’re a VS user.

 
