Leon on URLs, part 2

AUTHOR

LEON BAKER

DATE

10/12/2010

CATEGORY

BRAINDUMP, SEO, URLs

Leon intones...

tl;dr: Leon needs a different hobby.


Ok, so let's see if I can come up with an article in less than 7 hours. I'm told that this content generation stuff gets easier over time; possibly due to progressive neuron die-off.

The previous article was sufficiently confused and disorganized that I can probably get away with reprising the same material at least once. So let's get to it!

These are the different groups of URLs I tend to review when building up an in-depth analysis of a site, and some of the sources I use.

Firstly, we go to what I think of as the autobiographic URLs. I source these from the site's sitemaps; both the user-navigable (HTML) sitemap and the XML sitemap (assuming that both exist).

Autobiographic URLs

I call them autobiographic because these documents are the site explicitly talking about itself - documents that are about the site. (We'll see the site implicitly talking about itself shortly.)
These are usually the URLs that directly match the content; www.domain.com/widgets/ and www.domain.com/dongles/. The two sources don't necessarily match; the XML sitemap can be a lot bigger, of course, but there might also be things missing from it - that's why I haven't depicted it as a superset of the HTML sitemap.
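For the curious, here's roughly how the two sitemap sources could be pulled into sets and compared. This is a minimal sketch, not the tooling I actually use: it assumes the XML sitemap follows the sitemaps.org schema, that the HTML sitemap is just a page full of ordinary links, and that you already have both documents as text. The function names (`urls_from_xml_sitemap`, `urls_from_html_sitemap`, `compare_sources`) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urljoin

# Namespace used by sitemaps.org-style XML sitemaps (assumed here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_xml_sitemap(xml_text):
    """Return the set of <loc> values found in an XML sitemap."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


class _LinkCollector(HTMLParser):
    """Collects every <a href> on a page, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.add(urljoin(self.base_url, href))


def urls_from_html_sitemap(html_text, base_url):
    """Return the set of absolute URLs linked from an HTML sitemap page."""
    collector = _LinkCollector(base_url)
    collector.feed(html_text)
    return collector.urls


def compare_sources(xml_urls, html_urls):
    """Split the URLs into those in both sources and those in only one."""
    return {
        "both": xml_urls & html_urls,
        "xml_only": xml_urls - html_urls,
        "html_only": html_urls - xml_urls,
    }
```

If `html_only` comes back non-empty, that's the "things missing from the XML sitemap" case. Note that an HTML sitemap page will also link to navigation and footer junk, so expect to filter that set by hand.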

So do these constitute at least the formal, canonical landing pages of the site, and can we take them as an indicator of importance? Not quite, because these sources have often been constructed manually, or using third-party tools, and neither approach can be relied on to have enough information to do the job properly.
For that reason, I request a list of URLs pulled from the client's CMS, if there is such a feature on the system.
(This often involves beating the logins out of them and having a look myself, as clients often don't know what their CMS is able to do, and CMS vendors tend to lie a lot if faced with actual work.)

Reflexive URLs

I call this set the reflexive URLs, because they're the site implicitly talking about itself - the knowledge embodied in the CMS. In a sane world, of course, we'd already have all these URLs, because this feature of the system would have been used to produce the XML sitemap. As it is, there's usually new stuff.

Once I've regexp'd all these URLs into a working format, the first step is to deduplicate (though retaining the data on sources) and check consistency - the sources will usually be internally consistent, but I'll check if one shows /widgets/ and another shows /widgets, for example. I also run the list through a header checker looking for 404s and redirects to remove for composting; 404s and redirects in a sitemap are basically a cry for help.

Don't be this guy.
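The deduplicate-but-keep-the-sources step, plus the /widgets/ versus /widgets check, might look something like this. It's a sketch under assumptions: each source has already been reduced to a plain list of absolute URLs, and the names `merge_sources` and `trailing_slash_conflicts` are mine, invented for the example. The slash check is deliberately crude and ignores query strings.

```python
from collections import defaultdict


def merge_sources(url_lists):
    """Deduplicate URLs across sources, remembering where each came from.

    url_lists maps a source name (e.g. "xml", "html", "cms") to an
    iterable of URLs. Returns a dict mapping each URL to the set of
    source names that listed it.
    """
    seen = defaultdict(set)
    for source, urls in url_lists.items():
        for url in urls:
            seen[url.strip()].add(source)
    return dict(seen)


def trailing_slash_conflicts(urls):
    """Find URLs that differ only by a trailing slash.

    Returns a dict mapping the slash-stripped form to the set of
    variants seen, for every form that has more than one variant.
    """
    groups = defaultdict(set)
    for url in urls:
        groups[url.rstrip("/")].add(url)
    return {key: variants for key, variants in groups.items() if len(variants) > 1}
```

For example, if the XML sitemap lists `http://www.domain.com/widgets/` and the CMS export lists `http://www.domain.com/widgets`, `merge_sources` keeps both as separate entries (each tagged with its source) and `trailing_slash_conflicts` flags the pair for a decision about which form is the real one.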

Now we have a nice, neat set of URLs that can be called the formal site, the site as it should be; a closed graph of well-formed, unique URLs that all return HTTP 200, covering the entirety of the client's messaging. It's unlikely to be precisely the same as the formal model that would be in the head of a theoretical hypercompetent client (who wouldn't need us!) but it makes the best proxy for it we have.
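Getting to "all return HTTP 200" means actually asking the server. Here's a minimal sketch of a header check using only the Python standard library; it sends a single HEAD request per URL and does not follow redirects, so a 301 shows up as a 301. The helper names (`head_status`, `split_by_status`) are hypothetical, and a real run would want error handling, politeness delays, and a fallback for servers that mishandle HEAD.

```python
import http.client
from urllib.parse import urlsplit


def head_status(url, timeout=10):
    """Send one HEAD request, without following redirects; return the status code."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def split_by_status(urls, fetch_status=head_status):
    """Separate URLs that return 200 from everything else.

    Returns (keep, compost): keep is a list of URLs that returned 200,
    compost maps each remaining URL to the status code it returned.
    fetch_status is injectable so the logic can be tried without a network.
    """
    keep, compost = [], {}
    for url in urls:
        status = fetch_status(url)
        if status == 200:
            keep.append(url)
        else:
            compost[url] = status
    return keep, compost
```

The `keep` list is the candidate formal site; the `compost` dict is the pile of 404s and redirects, with the status retained so you can tell the client exactly what their sitemap is crying for help about.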

So in an ideal world, that would be the end of the story; all incoming links would point to one of those URLs, and Google's index would consist of up-to-date copies of all and only those URLs; and the forest would have plenty of owls. Of course, it's never like that. [You've even managed to screw this site up more than once and it's got, like, three pages - Ed.]
This is because web sites are leaky abstractions; formally they're fine, but something in the loop (humans, CMS design, server setup) is always going to break correspondence with the model.

So next time, we'll go into leaky abstraction diagnosis.