Tag Archives: regular expressions

Adventures In Web Site Migration

This post was first published on the Idealware Blog in April of 2010.

I recently took on the project of migrating the Idealware articles and blog from their old homes on Idealware’s prior web site and Google’s Blogger service to our shiny, new, Drupal-based home. This was an interesting data-migration challenge. The Idealware articles were static HTML web pages that needed to be put in Drupal’s content database. And there is no utility that imports Blogger blogs to Drupal. Both projects required research and creativity.

The first step in any data migration project is to determine if automating the task will be more work than just doing it by hand. Idealware has about 220 articles published; cutting and pasting the text into Drupal, and then cleaning up the formatting, would be a grueling project for someone. On the other hand, automating the process was not a slam dunk. Database data is easier to write conversion processes for than free form text. HTML is somewhere in the middle, with HTML codes that identify sections, but lots of free form data as well.

Converting HTML Articles with Regular Expressions

My toolkit (of choice) for this project was Sed, the Unix Stream Editor, and a generic installation of Drupal. Sed does regular expression searching and replacing. So I wrote a script that:

  1. Deleted lines with HTML tags that we didn’t need;
  2. stored data between title and body tags;
  3. and converted those items to SQL code that would insert the title and article text into my Drupal database.

This was the best I could do: other standardized information, such as author and publishing date, was not standardized in the text, so I left calling those out for a clean-up phase that the Idealware staff took on. The project was a success, in it that it took less than two days to complete the conversion. It was never going to be an easy one.

Without going too far, the sed command to delete, say, a “META” tag is:

/\<meta/d

That says to search for a literal “less than” bracket (the forward slash implies literal) and the text meta and delete any line that contains it. A tricky part of the cleanup was to make sure that my search phrases weren’t ones that might also match article text.

Once I’d stripped the file down to just the data between the “title” and “body” tags, I issued this command:

s/\<title\>(.*)\<\/title\>.*\<body\>(.*)\<\/body\>/insert into articles (title, body) values (‘\1’, ‘\2’);/

This searches for the text between HTML “title” tags, storing it in variable 1, then the text between “body” tags, storing it in variable 2, then substitutes the variable data into a simple SQL insert statement in the replacement string. Iterating a script with all of the clean-up commands, culminating in that last command, gave me a text file that could be imported into the Drupal database. The remaining cleanup was done in Drupal’s WYSIWYG interface.

Blog Conversion

As I said, there is no such thing as a program or module that converts a Blogger Blog into Drupal format. And our circumstance was further complicated by the fact that the Idealware Blog was in Blogger’s legacy “FTP” format, so the conversion options available were further limited.

There is an excellent module for converting WordPress blogs to Drupal, and there were options for converting a legacy Blogger blog to WordPress. So, then the question was, how well will the blog survive a double conversion? The answer was: very well! I challenge any of you to identify the one post that didn’t come through with every word and picture intact.

I had a good start for this, Matthew Saunders at the Nonprofits and Web 2.0 Blog posted this excellent guide. If you have a current Blogger blog to migrate, every step here will work. My problem was that the Idealware blog was in the old “FTP” format. Google has announced that blogs in their original publishing format must be converted by May 1st. While this fact had little or no relationship to the web site move to Drupal, it’s convenient that we made the move well in advance of that.

To prep, I installed current, vanilla copies of WordPress and Drupal at techcafeteria.com. I tracked down Google’s free blog converters. While there is no WP to Drupal converter, most other formats are covered, and I just used their web-based Blogger to WordPress tool to convert the exported Idealware blog to WP format. The conversion process prompted me to create accounts for each author.

To get from WordPress to Drupal, I installed above-mentioned WordPress-import module. As with the first import, this one also prompted me to create the authors’ Drupal accounts. It also had an option to store all images locally (which required rights to create a public-writeable folder on the Drupal server). Again, this worked very well.

With my test completed, I set about doing it all over again on the new Idealware blog. Here I had a little less flexibility. I had administrative rights in Drupal, but I didn’t have access to the server. Two challenges: The server’s file upload limit (set in both Drupal and PHP’s initialization file) was set to a smaller size than my WordPress import file. I got around this by importing it in by individual blogger, making sure to include all current and former Idealware bloggers. The second issue was in creating a folder for the images, which I asked our host and designer at Digital Loom.com to do for me.

Cleanup!

The final challenge was even stickier — the posts came across, but the URLs were in a different format than the old Blogger URLs This was a problem for the articles as well. How many sites do you think link to Idealware content out there? For this, I begged for enough server access to write and run a PHP script that renamed the current URLs to their former names — a half-successful effort, as Drupal had dramatically renamed a bunch of them. The remainder we manually altered.

All told, about two hours research time, three or four hours conversion (over a number of days) and more for the clean-up, as I wasted a lot of time trying to come up with a pure SQL command to do the URL renaming, only to eventually determine that it couldn’t be done without some scripting. A fun project, though, but I’d call it a success.

I hope this helps you out if you ever find yourself faced with a similar challenge.

Regular (Expression) Magic

This post was first published on the Idealware Blog in June of 2009.

Let’s get a bit geeky. Many Idealware visitors come here for advice on purchasing and deploying data management systems, such as donor databases, constituent relation management systems and content management systems. And, more often than not, are replacing older systems with new ones, meaning that one of the trickiest tasks is data migration. If any of this work has ever fallen to you, then you might have found yourself doing tedious editing and corrections in Excel, pouring over data screens or rows in Access trying to formalize non-formalized data entry, and generally settling for some lost or incorrect data moving from old system to new.

Wouldn’t it be great to have a magic wand that can instantly reformat the data to the proper format? Well, I have one for you. But, just as Harry Potter had to go to school before he could effectively wave his wand, mine comes with a lesson or two as well.

The wand in question is a search/replace language called regular expressions. Regular expressions are a set of terms that can be used, in supported software, to perform advanced search and replace functions. They were originally popularized in the Unix Stream Editor (SED), but are now standardly found in text editors, word processors, scripting languages (such as PHP) and other software, usually as an advanced option.

The reason to use them instead of a regular search and replace function is simple: they can search for things that regular search tools can’t. For example:

  • the first three characters at the beginning of each line
  • the three at the end of each line
  • one or more spaces

Regular expressions can also do multiple replacements in one phrase, allowing you to either remove the first comma encountered in a sentence, or all commas. Here are the basics:

A regular expression takes the form of /Search Phrase/Replacement/. A simple search to replace all instances of the word “fish” with the word “bird” would look like:

/fish/bird/

But regular expressions only prove their worth when you learn their special characters:

. (any character)

* (one or more characters)

^ (the beginning of a line

$ (the end of a line)

() (parentheses surrounding characters in the search phrase can be recalled in the replacement)

$1, $2 (substitute in the replacement for characters saved by parentheses in the search phrase)

\ (backslashes treat the next character literally, even if it’s a Regular expression special character)

[a-z], [0-9], [A-Za-z] (groupings search for all of the characters specified between the brackets, using dashes to identify ranges

Examples:

If you have a text printout of a document that you want to whittle into something more useful, like a CSV file, step one might be to remove any dead space.

/ */ /

will search for one or more spaces (the asterisk means “any number of the preceding character) and replace them with one space.

/^$/d

will remove all blank lines (lines with nothing between the beginning and the end of the line)

If you are moving data from one system to another, you might have to reformat dates for the new system. Say the old system exports dates as MM/DD/YYYY and the SQL database you’re importing them to expects YYYY-MM-DD. This Regular Expression will convert all dates to the new format:

/([01][0-9])\/([0-3][0-9])\/([12][0-9][0-9][0-9])/$3-$1-$2/

Let’s break this down:

/ – a slash starts the search phrase section.

( – parentheses surround things that we want to remember, so this starts a section we’ll remember.

[01][0-9] – a month (MM) will be a number between 1 and 12, so, if your system is exporting dates with leading zeros (if not, you can do this with a series of regular expressions to get around that), then the [01] set will match either a leading zero or a one. The [0-9] set will match any digit following that one or zero.

) – this will be remembered in the replacement as $1, because it’s the first thing we remembered.

\/ – since the slash is a regular expression special character (the delimiter), we precede it with a backslash, telling the parser to treat it a a slash, not a delimiter.

([0-3][0-9]) – this will find any pair of numbers between 01 and 39, which we know as the day, and remember it as $2, because it’s enclosed in parentheses.

\/ – next slash

([12][0-9][0-9][0-9]) – this catches the year. You see how, right? It is specifying that the year will be in this millennia or the last by limiting the first character to one or two. We use parentheses to remember this as well.

/ – this slash signifies that the search phrase is done, and the replacement will follow.

$3-$1-$2 – this takes our three remembered phrases and reorders them from month, day, year to year ($3), month ($1), day ($2), placing dashes in-between them.

/ – finally, we close the command with a slash.

One of my standard uses is to take a list – which could be an Excel spreadsheet, or a database dump, or a Word table — clean it up, and then format it into SQL statements that can then be pulled into a database. Most databases can import in CSV files, but Excel, while good at doing some reformatting, can’t do the fancy cleanup tasks that my regular expression-enabled editor can. Once my specific clean-up chores are done, if I’m left with a tab-delimited file, I can do the following three simple searches to turn it into a SQL input file that can just be run in my SQL interpreter.

/\t/’,’/ — searches for all tabs (\t is a symbol that means “tab”) and replaces them with ‘,’

/(.)$/$1′);/ – searches for the last character in a line and replaces it with that character followed by a close quote, close parens and semi-colon.

/^(.)/insert into players (name, title, company) values (‘$1/ – searches for the first character in any line and prepends the front end of the SQL statement.

If we had an input file with lines like this:

Joe NamathQuarterbackForty-niners

It would become

insert into players (name, title, company) values (‘Joe Namath’,’Quarterback’,’Forty-niners’);

There are plenty of excellent resources for learning about regular expressions on the web, but many of them are targeted at programmers, making them a bit thick to read through. For more friendly introductions, I recommend The regular-expressions.info quickstart. While many text-processing tools, including Microsoft Word, support regular expression search and replace, I recommend using a good text editor over a word processor, because it will likely include supporting functionality, such as block copying/pasting, and they’ll handle very large files with far more speed and grace. I’ve been happy using TextPad and EditPlus on Windows, and TextMate and TextWrangler on the Mac. Wikipedia publishes an incomplete list of applications that include regular expression functionality.