Regular (Expression) Magic

This post was first published on the Idealware Blog in June of 2009.

Let’s get a bit geeky. Many Idealware visitors come here for advice on purchasing and deploying data management systems, such as donor databases, constituent relation management systems and content management systems. And, more often than not, are replacing older systems with new ones, meaning that one of the trickiest tasks is data migration. If any of this work has ever fallen to you, then you might have found yourself doing tedious editing and corrections in Excel, pouring over data screens or rows in Access trying to formalize non-formalized data entry, and generally settling for some lost or incorrect data moving from old system to new.

Wouldn’t it be great to have a magic wand that can instantly reformat the data to the proper format? Well, I have one for you. But, just as Harry Potter had to go to school before he could effectively wave his wand, mine comes with a lesson or two as well.

The wand in question is a search/replace language called regular expressions. Regular expressions are a set of terms that can be used, in supported software, to perform advanced search and replace functions. They were originally popularized in the Unix Stream Editor (SED), but are now standardly found in text editors, word processors, scripting languages (such as PHP) and other software, usually as an advanced option.

The reason to use them instead of a regular search and replace function is simple: they can search for things that regular search tools can’t. For example:

  • the first three characters at the beginning of each line
  • the three at the end of each line
  • one or more spaces

Regular expressions can also do multiple replacements in one phrase, allowing you to either remove the first comma encountered in a sentence, or all commas. Here are the basics:

A regular expression takes the form of /Search Phrase/Replacement/. A simple search to replace all instances of the word “fish” with the word “bird” would look like:

/fish/bird/

But regular expressions only prove their worth when you learn their special characters:

. (any character)

* (one or more characters)

^ (the beginning of a line

$ (the end of a line)

() (parentheses surrounding characters in the search phrase can be recalled in the replacement)

$1, $2 (substitute in the replacement for characters saved by parentheses in the search phrase)

\ (backslashes treat the next character literally, even if it’s a Regular expression special character)

[a-z], [0-9], [A-Za-z] (groupings search for all of the characters specified between the brackets, using dashes to identify ranges

Examples:

If you have a text printout of a document that you want to whittle into something more useful, like a CSV file, step one might be to remove any dead space.

/ */ /

will search for one or more spaces (the asterisk means “any number of the preceding character) and replace them with one space.

/^$/d

will remove all blank lines (lines with nothing between the beginning and the end of the line)

If you are moving data from one system to another, you might have to reformat dates for the new system. Say the old system exports dates as MM/DD/YYYY and the SQL database you’re importing them to expects YYYY-MM-DD. This Regular Expression will convert all dates to the new format:

/([01][0-9])\/([0-3][0-9])\/([12][0-9][0-9][0-9])/$3-$1-$2/

Let’s break this down:

/ – a slash starts the search phrase section.

( – parentheses surround things that we want to remember, so this starts a section we’ll remember.

[01][0-9] – a month (MM) will be a number between 1 and 12, so, if your system is exporting dates with leading zeros (if not, you can do this with a series of regular expressions to get around that), then the [01] set will match either a leading zero or a one. The [0-9] set will match any digit following that one or zero.

) – this will be remembered in the replacement as $1, because it’s the first thing we remembered.

\/ – since the slash is a regular expression special character (the delimiter), we precede it with a backslash, telling the parser to treat it a a slash, not a delimiter.

([0-3][0-9]) – this will find any pair of numbers between 01 and 39, which we know as the day, and remember it as $2, because it’s enclosed in parentheses.

\/ – next slash

([12][0-9][0-9][0-9]) – this catches the year. You see how, right? It is specifying that the year will be in this millennia or the last by limiting the first character to one or two. We use parentheses to remember this as well.

/ – this slash signifies that the search phrase is done, and the replacement will follow.

$3-$1-$2 – this takes our three remembered phrases and reorders them from month, day, year to year ($3), month ($1), day ($2), placing dashes in-between them.

/ – finally, we close the command with a slash.

One of my standard uses is to take a list – which could be an Excel spreadsheet, or a database dump, or a Word table — clean it up, and then format it into SQL statements that can then be pulled into a database. Most databases can import in CSV files, but Excel, while good at doing some reformatting, can’t do the fancy cleanup tasks that my regular expression-enabled editor can. Once my specific clean-up chores are done, if I’m left with a tab-delimited file, I can do the following three simple searches to turn it into a SQL input file that can just be run in my SQL interpreter.

/\t/’,’/ — searches for all tabs (\t is a symbol that means “tab”) and replaces them with ‘,’

/(.)$/$1′);/ – searches for the last character in a line and replaces it with that character followed by a close quote, close parens and semi-colon.

/^(.)/insert into players (name, title, company) values (‘$1/ – searches for the first character in any line and prepends the front end of the SQL statement.

If we had an input file with lines like this:

Joe NamathQuarterbackForty-niners

It would become

insert into players (name, title, company) values (‘Joe Namath’,’Quarterback’,’Forty-niners’);

There are plenty of excellent resources for learning about regular expressions on the web, but many of them are targeted at programmers, making them a bit thick to read through. For more friendly introductions, I recommend The regular-expressions.info quickstart. While many text-processing tools, including Microsoft Word, support regular expression search and replace, I recommend using a good text editor over a word processor, because it will likely include supporting functionality, such as block copying/pasting, and they’ll handle very large files with far more speed and grace. I’ve been happy using TextPad and EditPlus on Windows, and TextMate and TextWrangler on the Mac. Wikipedia publishes an incomplete list of applications that include regular expression functionality.

Comments are closed.