A Possible Approach to Importing Static Content Into Drupal

The Motivation

The bread and butter of my freelance development work often involves converting old, static Web sites of small organizations–small businesses and non-profits–to a more sustainable content management system.

These sites were usually started by amateur, would-be graphic designers who work for free or for very little compensation. They tend to be students or more traditional, print-oriented graphic designers looking for their first real-world “HTML coding” experience. They were retained by companies that couldn’t make a significant financial commitment to their Web site, either because they didn’t believe the site would be important to their organization, or because they didn’t believe they had the money to hire a professional to build and maintain it.

In these situations, my work involves not only migrating the content and hosting, but also updating the site design, reorganizing and cleaning up the content, and empowering staff members by educating them both in the use of the content management system and in good Web design and information architecture practices. I usually also try to educate the organization’s management on ways to use their new Web site as an effective marketing tool, a strategic part of their business.

You can imagine the state these sites are in when I get them. I’m talking circa 1998 Microsoft FrontPage (or present-day Dreamweaver) garbage, or worse. The graphic design hurts your eyes, the HTML code is dirty, and the organization of the content is dizzying. I don’t think this is shocking news to most readers.

Some clients are smart enough to recognize that they have a problem with their site. They trust me to do the job they hired me to do, and they are willing to make a commitment of time and energy to learn to manage their site effectively. In these cases, they will see a CMS as an opportunity to start fresh, re-envision their Web site and rebuild it from the ground up. This frees me to do what I do best and allows the organization to make a much-needed investment in their Web site content.

On the other end of the spectrum is the customer who refuses to be educated in best practices, either because they ultimately don’t trust my judgment, or because they think they know what’s best for their site. They want everything preserved–their ugly layout, confusing structure and horrible content–they just want to use a CMS to feel a little hipper. In my first year, when I was just trying to establish myself, I swallowed my pride and worked for these clients. Now, I have a better qualification process, and I don’t waste my time with clients like this. They want something for nothing, they don’t trust me, and ultimately, they don’t deserve me. They are simply much more trouble than they are worth.

Of course, most clients fall in the middle. They are willing to listen to most of my advice, but might insist on certain things like an illogical structure to the navigation, or a hideous header graphic. I’ve never believed that my role is to “save the client from themselves”: it’s their site, and after stating my case, I need to implement what they want whether I agree with it or not. (I’ve heard of companies that will reverse changes clients make to sites if they don’t agree with what the client has done, without even consulting the client first!) Or, the client may be willing to remake their site quite dramatically, but they need to make changes more slowly and incrementally, and want to wait until the site is driven from a CMS to begin transforming it. That approach can make sense too.

In either of these cases, the first problem I often need to address is just migrating existing site content. This may be because the clients don’t feel confident in their ability to be trained to use the CMS right away, or it may be that they just assume that I will initially do this work. I guess it’s like when I go to a shoe store expecting to buy shoe laces; after condescending laughter, I’m told that shoe stores don’t do that sort of thing: “you don’t expect to buy gas from a car dealer, do you?”

More and more, my “go to” CMS is Drupal. I appreciate its conceptual design, and it is easy to deploy. It’s a mature product with equally mature modules that allow me to address a wide range of customer needs. Obviously, like any good CMS, Drupal is great at building new sites and maintaining existing ones, but when it comes to the largely mind-numbing task of shoving in lots of existing static content, Drupal can seem agonizingly slow, as does any other mature CMS I have used in that situation.

So, what are my options for trying to automate this task?

The Approach

My first thought was that what I’ve described above must sound very familiar to others working in this space, so there must be some great ideas out there for making this kind of content migration easier. And Drupal has many great modules; perhaps there is one designed for just this occasion.

And of course, there is: Import_HTML. My first reaction was that this appeared to be a very well thought out, clever approach to the problem, and that opinion hasn’t changed, despite the fact that I recently tried it on a particular site I mirrored offline using wget, and I didn’t get the results I was hoping for.

I suspect that no matter how well Import_HTML is implemented, its effectiveness is likely to be limited because the problem it’s trying to solve is just too big and arbitrarily complex. Because the sites I want to import are not managed by a CMS, or even constructed by someone skilled, each page is a unique mess all its own. I’m skeptical that any automated tool can deal with these situations effectively enough to make its use worthwhile.

(That said, I would encourage anyone using Drupal and facing this situation to give Import_HTML a chance, as it is a very nice module that you may find more helpful in your situation than I did in mine. If your experiences and your assessment of Import_HTML’s broader usefulness differ from mine, I would love to hear more about it. I didn’t invest a lot of time in getting Import_HTML to work, so it would be extremely unfair to conclude that what I am saying about it is anything like an informed “review” of the module.)

Even if Import_HTML worked well in some cases, I’m not sure it would be helpful to have to fall back on a manual process in the cases in which it didn’t work. I started thinking that I would prefer to have a single, consistent approach that worked in all cases, even if it only automated a part of the process. After some thinking, I concluded that a great deal of the typing and clicking involved in importing static content manually into Drupal centers on creating a new page, typing its title and its URL path, and configuring it in the menu structure, including its parent and weight. It also happens that this is a much smaller set of specifically defined tasks to attempt to automate, and I believe I have a promising start on doing just that.

Imagine a typical tree-style navigation on a Web site. The information it provides is all the information that I described in the previous paragraph. So what if we envision this structure beforehand, quickly type it into a plain text file, and then use that file to automate the tasks described? Let’s take a small example of what that might look like:

home
products
- content management
- customer relationship management
services
- web hosting
- custom programming
about us
contact us

It looks like this file could be parsed and if we know how to programmatically work with Drupal, we could easily get a leg up on migrating the static content. After some research in the Drupal forums, I created my first pass on making this happen:

#!/usr/bin/php
<?php
error_reporting(E_ERROR);

// check the command line args, provide help
if ($argc != 1 || in_array($argv[1], array('--help', '-help', '-h', '-?'))) {
?>

This is a command line PHP script. Use it as indicated to create the
structure of a site, complete with placeholder pages, custom paths and
proper placement in the menu structure without the tedium of the Drupal
GUI. Then the content of each page can be customized.

  Usage:
  <?php echo $argv[0]; ?>

When run from the root of the Drupal install, it looks for a file in that
directory called structure.import with the following format:

  <url path> | <menu label> | <page title>
  - <url path> | <menu label> | <page title>
  -- <url path> | <menu label> | <page title>

The lines in this file are in order, optionally with an '-' character at
the beginning to indicate placement in the hierarchy. Only one or two
elements may be specified on each line, the remaining fields will be
deduced. Capitalization is normalized and whitespace trimmed.

With --help, -help, -h, or -? options, you can get this help.

<?php
  exit(0); // stop here so that asking for help never touches the site
}

// load in necessary Drupal classes, database connection information
require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// file format configuration
$levelDelim = '-';
$elementDelim = '|';

// track levels
// for each level, map parent ids to current weight for an item on that parent
// 1=navigation, 2=primary links
$levels[] = array(1, 0);

$import = "structure.import";
if (file_exists($import)) {
  $lines = file($import);
  foreach ($lines as $line_num => $line) {
    if (trim($line) != '') {
      $elements = explode($elementDelim, $line);

      // depth in the hierarchy = number of '-' characters in the first field
      $level = substr_count($elements[0], $levelDelim) + 1;

      // normalize the path; label and title default to a capitalized path
      $path = $elements[0];
      $path = str_replace($levelDelim, '', $path);
      $path = trim($path);
      $path = strtolower($path);
      $label = ucwords($path);
      $title = $label;
      if (isset($elements[1])) {
        $label = trim($elements[1]);
      }
      if (substr_count($path, ' ')) {
        $path = str_replace(' ', '_', $path);
      }
      if (isset($elements[2])) {
        $title = trim($elements[2]);
      }

      // create the page
      $node = new StdClass();
      $node->uid = 1;
      $node->type = 'page';
      $node->status = 1;   // published
      $node->promote = 0;  // don't promote to front page
      $node->path = $path; // ?q=path
      $node->format = 3;   // full HTML
      $node->title = $title;
      $node->body = '';    // add later
      node_save($node);

      $parentLevel = $level - 1;
      $parentLevelInfo =& $levels[$parentLevel];

      // create the menu item
      $menuItem = array();
      $menuItem['pid'] = $parentLevelInfo[0];
      $parentLevelInfo[1]++;
      $menuItem['weight'] = $parentLevelInfo[1];
      $menuItem['path'] = 'node/' . $node->nid;
      $menuItem['title'] = $label;
      $menuItem['type'] = 118; // see includes/menu.inc
      menu_save_item($menuItem);

      // this item becomes the parent for anything one level deeper
      $levels[$level] = array($menuItem['mid'], 0);
    }
  }
}
else {
  echo "\n\nNo import file: $import found.\n";
}
?>

This is obviously pretty rough around the edges, but I think it’s a promising start, and it definitely automates a lot of tedious clicking in the Drupal content management interface. It will parse the structured text file I presented above and create a basic site structure with placeholder page nodes.

It’s designed to run on the command line in the root of a Drupal site, and looks for a structured text file following the above conventions in the same directory, called “structure.import”. It fires up the Drupal machinery, just as it would if Drupal were receiving a request through the Web, and programmatically creates page nodes and configures the menuing system. Clearly, the Drupal folks expected that users would want to interact with the system programmatically.
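The parent and weight bookkeeping is the least obvious part of the script, so here is a minimal sketch of the same idea that runs without Drupal. The function name assign_parents and the fake, sequential ids are my own assumptions for illustration; in the real script the ids come from Drupal's menu system.

```php
<?php
// Hypothetical stand-in for the script's menu bookkeeping: given
// array(level, label) pairs in file order, work out each item's parent id
// and weight. Ids are faked with a counter instead of coming from Drupal.
function assign_parents(array $items, $rootId = 1) {
    // $levels[n] holds array(id of the latest item at depth n,
    // last weight handed out beneath it); depth 0 is the menu root.
    $levels = array(0 => array($rootId, 0));
    $nextId = $rootId + 1;
    $out = array();
    foreach ($items as $item) {
        list($level, $label) = $item;
        $parent =& $levels[$level - 1];
        $parent[1]++; // next sibling under this parent gets the next weight
        $id = $nextId++;
        $out[] = array(
            'id'     => $id,
            'label'  => $label,
            'pid'    => $parent[0],
            'weight' => $parent[1],
        );
        unset($parent); // drop the reference before the next iteration
        // this item is now the parent for anything one level deeper
        $levels[$level] = array($id, 0);
    }
    return $out;
}
```

Each new item resets the counter for its own depth, which is why children always attach to the most recent item one level up. The sketch assumes, as the script does, that the file never skips a level (a line with two hyphens directly under a line with none).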

In most cases, the text file I presented is all you would need to create. But as I said, some customers want their site content to be migrated faithfully, at least at first, and one thing I see time and again is that links don’t match the page titles they link to, which for me is a cardinal usability sin. So, I let the script handle situations like this by allowing you to specify different menu labels and page node titles. And, if only a few pages do this, you only have to specify the ones that are different. Also, the program attempts to be very tolerant of things like spacing issues and capitalization. So, you can have a pretty sloppy file that should still work as expected, even with a mess like this:

home
Products
-content management
-customer relationship management
services
- web hosting
- custom programming
about us | About Us Label
contact us | Contact Us Label | Contact Us Title
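To see how the defaults fall out for a sloppy file like this, the line-parsing rules from the script can be pulled into a small standalone function. This is only a sketch: the name parse_structure_line and the returned array keys are hypothetical, but the normalization steps mirror the script's.

```php
<?php
// Hypothetical helper mirroring the script's parsing rules: turn one line
// of structure.import into its depth, URL path, menu label and page title.
function parse_structure_line($line, $levelDelim = '-', $elementDelim = '|') {
    if (trim($line) === '') {
        return null; // blank lines are skipped
    }
    $elements = explode($elementDelim, $line);
    // depth = number of hyphens in the first field, plus one
    $level = substr_count($elements[0], $levelDelim) + 1;
    // strip hyphens, trim whitespace and lowercase to get the base path
    $path = strtolower(trim(str_replace($levelDelim, '', $elements[0])));
    // label and title both default to the capitalized path
    $label = isset($elements[1]) ? trim($elements[1]) : ucwords($path);
    $title = isset($elements[2]) ? trim($elements[2]) : ucwords($path);
    return array(
        'level' => $level,
        'path'  => str_replace(' ', '_', $path),
        'label' => $label,
        'title' => $title,
    );
}
```

Note that a line giving only a menu label, such as the “about us” line, still gets its page title from the path rather than from the label. Because every hyphen in the first field counts toward the depth, a path that itself contains a hyphen would be misread, so underscores or spaces are the safer choice there.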

The only oddity I have seen so far is that when I initially run it, it appears not to create the menu entries until I actually go to the menu administration area, then they suddenly appear. Obviously, this is likely some caching issue that I could also probably control programmatically, if I knew better what I was doing.

Other enhancements are probably screaming out at you. Obviously, we can also programmatically specify the body of the page node, so we could come up with a semi-automated way of doing that too, perhaps even running Tidy on the page body before assigning it to the node. Other suggestions are welcome.
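As a rough sketch of that last idea, the function below pulls the body out of a legacy page and cleans it with PHP's tidy extension when that extension is installed. The function name extract_clean_body and the regex fallback are my own assumptions, not something the script does today, and the Tidy options chosen are just one reasonable starting point.

```php
<?php
// Hypothetical helper: return the contents of <body> from a legacy page,
// cleaned by PHP's tidy extension if it is available.
function extract_clean_body($html) {
    if (function_exists('tidy_repair_string')) {
        // show-body-only asks Tidy to return just the body fragment
        $config = array('show-body-only' => true, 'output-xhtml' => true);
        $clean = tidy_repair_string($html, $config, 'utf8');
        if ($clean !== false) {
            return trim($clean);
        }
    }
    // Fallback without Tidy: a crude regex grab of the body contents
    if (preg_match('/<body[^>]*>(.*?)<\/body>/is', $html, $m)) {
        return trim($m[1]);
    }
    return trim($html); // no <body> tag found; use the input as-is
}
```

In the import script, the result could be assigned to $node->body in place of the empty string, for example from file_get_contents() on the mirrored page that corresponds to each path. How to map a path to its source file is left open here.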


As I said, the situation I have described must sound familiar to many, and there are probably many developers with a lot more experience with Drupal than I have, who have given this a lot of thought and perhaps reached very different conclusions. I’d be interested to hear people’s opinions on the potential of this approach and any descriptions of alternate solutions to the challenge of importing static Web pages into Drupal.

Even if this approach turns out to be a bad idea, if nothing else, I think it’s a solid example of how to manipulate Drupal content programmatically. Maybe that will be the greatest value of presenting this code.
