This post was first published on the Idealware Blog in April of 2010.
I recently took on the project of migrating the Idealware articles and blog from their old homes on Idealware’s prior web site and Google’s Blogger service to our shiny, new, Drupal-based home. This was an interesting data-migration challenge. The Idealware articles were static HTML web pages that needed to be put in Drupal’s content database. And there is no utility that imports Blogger blogs to Drupal. Both projects required research and creativity.
The first step in any data migration project is to determine if automating the task will be more work than just doing it by hand. Idealware has about 220 articles published; cutting and pasting the text into Drupal, and then cleaning up the formatting, would be a grueling project for someone. On the other hand, automating the process was not a slam dunk. Database data is easier to write conversion processes for than free form text. HTML is somewhere in the middle, with HTML codes that identify sections, but lots of free form data as well.
Converting HTML Articles with Regular Expressions
My toolkit (of choice) for this project was Sed, the Unix Stream Editor, and a generic installation of Drupal. Sed does regular expression searching and replacing. So I wrote a script that:
- Deleted lines with HTML tags that we didn’t need;
- stored data between title and body tags;
- and converted those items to SQL code that would insert the title and article text into my Drupal database.
This was the best I could do: other standardized information, such as author and publishing date, was not standardized in the text, so I left calling those out for a clean-up phase that the Idealware staff took on. The project was a success, in it that it took less than two days to complete the conversion. It was never going to be an easy one.
Without going too far, the sed command to delete, say, a “META” tag is:
That says to search for a literal “less than” bracket (the forward slash implies literal) and the text meta and delete any line that contains it. A tricky part of the cleanup was to make sure that my search phrases weren’t ones that might also match article text.
Once I’d stripped the file down to just the data between the “title” and “body” tags, I issued this command:
s/\<title\>(.*)\<\/title\>.*\<body\>(.*)\<\/body\>/insert into articles (title, body) values (‘\1′, ‘\2′);/
This searches for the text between HTML “title” tags, storing it in variable 1, then the text between “body” tags, storing it in variable 2, then substitutes the variable data into a simple SQL insert statement in the replacement string. Iterating a script with all of the clean-up commands, culminating in that last command, gave me a text file that could be imported into the Drupal database. The remaining cleanup was done in Drupal’s WYSIWYG interface.
As I said, there is no such thing as a program or module that converts a Blogger Blog into Drupal format. And our circumstance was further complicated by the fact that the Idealware Blog was in Blogger’s legacy “FTP” format, so the conversion options available were further limited.
There is an excellent module for converting WordPress blogs to Drupal, and there were options for converting a legacy Blogger blog to WordPress. So, then the question was, how well will the blog survive a double conversion? The answer was: very well! I challenge any of you to identify the one post that didn’t come through with every word and picture intact.
I had a good start for this, Matthew Saunders at the Nonprofits and Web 2.0 Blog posted this excellent guide. If you have a current Blogger blog to migrate, every step here will work. My problem was that the Idealware blog was in the old “FTP” format. Google has announced that blogs in their original publishing format must be converted by May 1st. While this fact had little or no relationship to the web site move to Drupal, it’s convenient that we made the move well in advance of that.
To prep, I installed current, vanilla copies of WordPress and Drupal at techcafeteria.com. I tracked down Google’s free blog converters. While there is no WP to Drupal converter, most other formats are covered, and I just used their web-based Blogger to WordPress tool to convert the exported Idealware blog to WP format. The conversion process prompted me to create accounts for each author.
To get from WordPress to Drupal, I installed above-mentioned WordPress-import module. As with the first import, this one also prompted me to create the authors’ Drupal accounts. It also had an option to store all images locally (which required rights to create a public-writeable folder on the Drupal server). Again, this worked very well.
With my test completed, I set about doing it all over again on the new Idealware blog. Here I had a little less flexibility. I had administrative rights in Drupal, but I didn’t have access to the server. Two challenges: The server’s file upload limit (set in both Drupal and PHP’s initialization file) was set to a smaller size than my WordPress import file. I got around this by importing it in by individual blogger, making sure to include all current and former Idealware bloggers. The second issue was in creating a folder for the images, which I asked our host and designer at Digital Loom.com to do for me.
The final challenge was even stickier — the posts came across, but the URLs were in a different format than the old Blogger URLs This was a problem for the articles as well. How many sites do you think link to Idealware content out there? For this, I begged for enough server access to write and run a PHP script that renamed the current URLs to their former names — a half-successful effort, as Drupal had dramatically renamed a bunch of them. The remainder we manually altered.
All told, about two hours research time, three or four hours conversion (over a number of days) and more for the clean-up, as I wasted a lot of time trying to come up with a pure SQL command to do the URL renaming, only to eventually determine that it couldn’t be done without some scripting. A fun project, though, but I’d call it a success.
I hope this helps you out if you ever find yourself faced with a similar challenge.