I become a Webmaster

On Tuesday I became a fully fledged webmaster. Jocke showed me how to start and restart Apache, and I edited my first Apache configuration file. That was when my problems started...

Fixing up the Bluetail website was my first priority. We'd had a couple of meetings to decide on a new structure for our website, and now was the time to implement the proposals. Interestingly, the structure we had so carefully decided on (on paper) turned out not to work. We had planned a particular logical structure for the site: what links should appear on the "home page" and where you would get to when you clicked on them. But once implemented the result just didn't look nice. The logical structure was ok, but aesthetically it was a mess. I guess the moral is that you can't design the logical structure of a web site without simultaneously thinking about the layout.

As well as the content and the layout there were a number of technical issues that we wanted to sort out. In particular, we needed a search engine to index our site, a convenient way of uploading the site to where it is hosted, and a check that our pages are fully HTML 4.0 (or whatever) compliant. Fixing this stuff proved more difficult than I expected. I have some solutions (otherwise you wouldn't be reading this), but I'm open to suggestions on how to improve things.

Indexing the site

It's a couple of years since I last set up a site search engine, and back then it was easy; since then things have changed a lot. My first thought was to use Glimpse - I remember using it to index my account a couple of years ago and it seemed very good. Surprise number one: Glimpse had gone commercial. To my amazement, virtually every search engine that had been around two years ago and was in any way useful had turned into a commercial product. All I wanted was to index a few hundred pages, not buy a large complex product.

After a while of clicking and searching I found my way to the list of search engines at http://dmoz.org/Computers/Internet/WWW/Searching_the_Web/Search_Engines/ and dutifully started downloading and compiling the search engines that were still free (there are a few).

I finally fell for

Checking the links

Link checking was a pain and I'm still not happy with the results. You'd think this would be simple, and indeed there are ten quadzillion programs on the web (all written in Perl) that claim to do this mysterious thing. There are even web spiders that you can point at your site and say "Hey, you spider thingy - go check my links", and they will do just that (or not, as the case may be). Unfortunately most of the programs that I tried seemed to miss many of my pages.

What 99.9999% of these programs seem to have forgotten is that my pages are built with embedded Javascript, cascading style sheets and Apache-style server-side includes - and 99.9999% of the link checking programs just can't figure this stuff out.
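
To make the problem concrete, here is the kind of markup that trips most checkers up (the file names are made up, not from our real site): the actual links live in an included file, or only exist after a script has run, so a naive scan of the raw HTML never sees them.

  <!--#include virtual="/fragments/navbar.html" -->  <!-- the links live in the included file -->
  <script type="text/javascript">
    // this link only exists once the browser has run the script
    document.write('<a href="/products/index.html">Products</a>');
  </script>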

At this point I decided to give up and write my own in Erlang. I thought this would only take half an hour or so, but I was wrong; it took several hours. This program was a lot better: it analysed 104 pages, where the best Perl version I could lay my hands on gave up after 83 pages. I also included a "mark sweep" type garbage collector which told me about all the pages that couldn't be reached from the home page. A true link analyser needs a lot of extra configuration information, and needs to be able to "see" a given page both through an HTTP port and by analysing the local file system. It also needs extra information like "what is the search order used to find a file name if only a directory name is given?" and "what does ~ (tilde) expand to?" - the general problem is more difficult than you might imagine.
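
For the curious, here is a minimal sketch of the "mark sweep" idea in Erlang - not the program described above, just the shape of it. It assumes a modern OTP installation (inets/httpc and the re module), absolute URLs and plain <a href="..."> links only, so it deliberately ignores the Javascript and server-side include problems mentioned earlier. The module name, function names and the crude regular expression are all mine.

%% linkcheck.erl - mark every page reachable from the home page,
%% then sweep: report every known page that was never marked.
-module(linkcheck).
-export([unreachable/2]).

%% Root is the home page URL; AllPages is every URL we know about
%% (for example, generated by walking the local file system).
unreachable(Root, AllPages) ->
    _ = inets:start(),
    Marked = mark([Root], sets:new()),
    [P || P <- AllPages, not sets:is_element(P, Marked)].

%% The "mark" phase: a depth-first walk over the link graph.
mark([], Seen) ->
    Seen;
mark([Url | Rest], Seen) ->
    case sets:is_element(Url, Seen) of
        true  -> mark(Rest, Seen);
        false -> mark(links(Url) ++ Rest, sets:add_element(Url, Seen))
    end.

%% Fetch a page and pull out href targets with a crude regular
%% expression. Relative URLs are not resolved; a real checker must
%% do that, and must also be able to see the files behind the server.
links(Url) ->
    case httpc:request(Url) of
        {ok, {{_, 200, _}, _Headers, Body}} ->
            case re:run(Body, "href=\"([^\"]+)\"",
                        [global, {capture, [1], list}]) of
                {match, Hrefs} -> [H || [H] <- Hrefs];
                nomatch        -> []
            end;
        _Error ->
            []
    end.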

HTML conformance

Again, I thought this would be easy. Our site would be 100% HTML 4.0 compliant - at least that was, and still is, the idea.

Setting up a validation suite is not easy - in fact I haven't done it yet. There are validation services on the WWW: send them a URL and they will check that the code in the page is ok. But all of these work on the assumption that the page has been published, and when you are writing a new web site you don't want to make the URLs publicly available until the whole site is ready - in fact all of our site sits happily behind a firewall until the day comes when we decide to publish the results. All this means that I had to install a validation program inside the firewall. The best (and, it appears, only) serious candidate is the W3C validator, but this is not an easy program to install. I have now temporarily given up on this; I had almost succeeded but there seemed to be some problems with Japanese character sets (I think). This is strange, since our site is almost exclusively English with a tiny sprinkling of Swedish and not an ounce of Japanese.

What's worse, and this is truly horrible, even if my HTML were pure, true-to-the-spec HTML 4.0 as verified by W3C, there would still be absolutely no guarantee that the page would look nice in both Microsoft's and Netscape's browsers - roll on strict XHTML.

Making a mirror

This worked beautifully - first time. I just used rsync http://samba.anu.edu.au/rsync/. Rsync is a truly great program - the rsync algorithm is monstrously beautiful - hats off guys - thanks a lot!
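
For anyone who hasn't tried it, mirroring a site with rsync boils down to a one-liner along these lines (the directory and host names here are invented, not our real setup):

  rsync -avz --delete -e ssh local-site/ webhost:/var/www/site/

The --delete flag is what makes it a true mirror: files you remove locally disappear from the remote copy too, and the rsync algorithm only ships the parts of files that have actually changed.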