Leon on URLs
tl;dr: Leon thinks about URLs entirely too much.
So my favourite part of SEO has always been putting site structures in order.
Yes, I know. It's probably the least important part of the SEO trinity. However, unlike the infinite gluttons Sir Content and Lord Linkage, website structure is something that - in theory at least - can be done perfectly.
I like things that can be done perfectly.
So, in an effort to say something about SEO that I haven't read elsewhere a million times, I'm going to braindump for a bit regarding the way I think about website structure, and specifically the many things that the word can mean. "Website" is entirely too vague a term when one is trying to fully map and understand a client's presence on the internet.
(Note that I am not talking about something as nebulous as a "brand", or the susurrus of tweets and buzzes that is seen as an extension of any more static property these days. I'm speaking purely of the cloud of accessible URLs that the client's server makes available to the greater network.)
Ideally, I'll then be able to pick bits out of this and expand on them in a more formal manner; but no sense wasting the intermediate stages.
What can "website" mean?
LET US ASK THE CLIENT
- The content my organization is making publicly available.
- The page-sized topical chunks that the content is divided up into.
- Set C1, the nice clean URLs that each of those chunks mapped to when we were laying the site out on paper three years ago.
- Set C2, which is Set C1 plus all the various sitemaps, auxiliary category pages, bother-a-friend forms, braille translations and goofy Flash microsites that have accumulated since three years ago.
- Set C3, which is an XML sitemap containing most of Set C1 and half of Set C2, which no-one has updated since Bob left in February.
- Set C4, which is a list that Manager Dan uses for his reporting. It contains some of the above plus some entirely new ones, and none know its secret origins.
LET US ASK THE SERVER
- The set of HTML pages stored on my hard disk.
- Set S1, the set of URLs that will be returned by the "list all pages" feature of my CMS. (If we're lucky, Bob used this to make his XML sitemap.)
- Set S2, which is any permutation of S1 that has caused the server to return content (including random capitalisation, HTTPS, session IDs, etc - basically the log files. Let's hope Bob didn't use this.)
- Set S3, the set of all servable URLs - all permutations that would cause the server to return content with a 200 code. (This gets large, especially if you haven't set up your 404 pages correctly.)
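To make the gap between S1 and S2 concrete, here is a rough sketch of folding logged URL variants back down to one canonical form each. Everything specific in it is an assumption for illustration: the session parameter names, the example.com URLs, and above all the idea that your server treats paths case-insensitively and serves the same thing over HTTP and HTTPS. Check those before trusting the output.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session parameter names; substitute whatever your platform emits.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def normalise(url: str) -> str:
    """Collapse one logged URL variant to a canonical form (a sketch)."""
    parts = urlsplit(url)
    # Assumption: HTTP and HTTPS serve the same content, so pick one scheme.
    scheme = "https"
    host = parts.netloc.lower()
    # Assumption: the server ignores case in paths. Many do not.
    path = parts.path.lower() or "/"
    # Drop session parameters, and sort the rest so ordering does not matter.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k.lower() not in SESSION_PARAMS
    )
    return urlunsplit((scheme, host, path, urlencode(query), ""))

# Hypothetical log entries: four requests, but how many pages?
logged = [
    "http://Example.com/Widgets?sid=abc123",
    "https://example.com/widgets",
    "https://example.com/widgets?colour=red&sid=zzz",
    "https://example.com/widgets?sid=q&colour=red",
]
canonical = {normalise(u) for u in logged}
print(len(logged), "logged variants,", len(canonical), "canonical URLs")
```

The size of the difference between the two counts is a fair first measure of how far S2 has drifted from S1.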
LET US ASK THE USER-AGENT
- Set U1, any URLs I can reach from the main navigation.
- Set U2, the set of URLs I can reach via any walk through the site (i.e. the formally crawlable URLs. Yes, including those links to last year's expired offers that you forgot to take down.)
- Set U3, U2 plus any URLs that are only reachable from outside the site; offer landing pages, white-labelled interfaces, any page that uses URL parameter input to generate titles or visible content, etc.
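The difference between U1, U2 and U3 is really a statement about reachability in a link graph, so it can be sketched as a breadth-first walk. The graph below is entirely made up, and a real crawler would be fetching and parsing pages rather than reading a dictionary; this only shows how the three sets relate.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
links = {
    "/": ["/products", "/about"],
    "/products": ["/products/widget", "/offers/expired-sale"],
    "/about": ["/"],
    "/products/widget": [],
    "/offers/expired-sale": [],
    "/landing/partner-offer": ["/products"],  # nothing on the site links here
}

def crawlable(start: str, graph: dict) -> set:
    """Breadth-first walk from start; returns every URL reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

u1 = {"/"} | set(links["/"])   # home page plus its main navigation
u2 = crawlable("/", links)     # everything reachable by walking links
outside_only = set(links) - u2 # known URLs with no internal route in

print(sorted(u2 - u1))         # reachable, but not from the main navigation
print(sorted(outside_only))    # the part of U3 that is not in U2
```

Note that the walk can only tell you about U2. The outside-only URLs have to come from somewhere else (logs, campaign records, Manager Dan), which is rather the point.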
So we have a resource: content. The web server allows that resource to be mapped onto URLs; but due to the complexity of the systems used, it's usually a messy mapping. This is the first discontinuity, allowing great opportunities for you and your client to talk past each other in reports, tracking and troubleshooting while still using the same word "website".
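Once each of these "websites" exists as an actual list, the talking-past-each-other becomes something you can measure with plain set arithmetic. A minimal sketch, with invented URLs standing in for the real lists:

```python
# Hypothetical URL sets, named after the lists above.
c1 = {"/", "/products", "/about"}                                # planned on paper
c3 = {"/", "/products", "/tell-a-friend"}                        # Bob's XML sitemap
s1 = {"/", "/products", "/about", "/tell-a-friend", "/braille"}  # CMS page listing
u2 = {"/", "/products", "/about", "/offers/expired"}             # crawlable

report = {
    "in sitemap, not in CMS": c3 - s1,
    "in CMS, not in sitemap": s1 - c3,
    "in CMS, not crawlable": s1 - u2,
    "crawlable, not in CMS": u2 - s1,
    "everyone agrees": c1 & c3 & s1 & u2,
}
for label, urls in report.items():
    print(f"{label}: {sorted(urls)}")
```

The "everyone agrees" row is usually the shortest one, and every other row is a conversation to have with the client.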
But we've forgotten something.
YEA, LET US EVEN ASK THE CUSTOMERS
- The set of pages found by the sum of all our searches.
...yeah. Which brings us to Google. (Mentally insert "and other search engines" as often as you feel required).
The second discontinuity is between the server and Google. Google is in the position of building a model of a model; reconstructing your website based on what it can see down the little holes of your URLs, so that it can then try and get back the original sense of the content.
Therefore, it tries to use as many angles as it can to find stuff; walking the links within the site, following links from outside, removing parameters to see if it can explain its observations by postulating fewer pages - even reading Bob's XML sitemap, god help it.
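I have no idea how Google actually does the parameter trick, so the following is purely my own toy version of "postulating fewer pages": given a sample of URLs and a fingerprint of what each returned, find the parameters whose removal never merges two URLs with different content. The URLs, fingerprints and function names are all invented for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def without_param(url: str, name: str) -> str:
    """Return url with every occurrence of query parameter `name` removed."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != name
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

def ignorable_params(observed: dict) -> set:
    """Parameters whose removal never merges URLs with different content."""
    names = {
        k
        for url in observed
        for k, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)
    }
    result = set()
    for name in names:
        groups = defaultdict(set)
        for url, content_id in observed.items():
            groups[without_param(url, name)].add(content_id)
        # If every stripped URL maps to one fingerprint, the parameter
        # explains nothing and the pages can be merged.
        if all(len(ids) == 1 for ids in groups.values()):
            result.add(name)
    return result

# Hypothetical sample: URL -> fingerprint of the content it returned.
observed = {
    "/shoes?page=1&ref=email": "aaa",
    "/shoes?page=1&ref=twitter": "aaa",
    "/shoes?page=2&ref=email": "bbb",
    "/shoes?page=2": "bbb",
}
print(ignorable_params(observed))
```

Here `ref` comes out as ignorable and `page` does not. The conclusion is only as good as the sample: a parameter seen on a single URL has nothing to be compared against, so it will look ignorable by default.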
However, it can only do this by working backwards from its understanding of what a neat and tidy website ought to be; which means that it has to assume - to some degree - that your URLs are a good model for the underlying site. If your URL model does something sufficiently boneheaded (such as linking to the same content under three different URLs in the same navigation menu) you're going to throw it.
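The three-URLs-one-page case is easy enough to catch yourself before a search engine has to. A crude sketch: fingerprint the body behind each navigation link and report any fingerprint that turns up under more than one URL. The menu and page bodies below are made up, and hashing whitespace-collapsed HTML is an assumption that only catches near-identical copies; anything with a timestamp or session token in the markup will slip through.

```python
import hashlib
from collections import defaultdict

def fingerprint(body: str) -> str:
    """Hash of the page body with whitespace collapsed (a crude equality test)."""
    collapsed = " ".join(body.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()

# Hypothetical navigation menu: URL -> fetched body.
nav_pages = {
    "/products": "<h1>Products</h1>",
    "/Products/": "<h1>Products</h1>\n",
    "/index.php?section=products": "<h1>Products</h1>",
    "/about": "<h1>About us</h1>",
}

by_content = defaultdict(list)
for url, body in nav_pages.items():
    by_content[fingerprint(body)].append(url)

# Any group with more than one URL is the same page wearing different hats.
duplicates = [sorted(urls) for urls in by_content.values() if len(urls) > 1]
for group in duplicates:
    print("same content under:", group)
```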
When dealing with simple projects, these subtleties don't seem worth explicit attention. However, if you get involved in projects to rationalise huge dynamic sites, I would strongly advise developing conventions to talk about different kinds of "website".
On reviewing this article, I find that it badly needs diagrams. I'll see if I can get on that soon.
