Posted on Mon 07 October 2013

Fighting Link Rot

Gwern's post on link rot [cache] got me thinking. Periodically checking links to see whether they are broken is nice, but not really an adequate solution: the content might already have disappeared by the time I notice. Besides, websites change a lot; within a year a site might look very different from what I originally linked to. That might be acceptable if I'm linking to the site in general, but not if I'm linking to a very specific post.

Since I had a free Sunday morning anyway, I set out to create a Pelican plugin [cache] that would automate this process for me: find all links in my posts, spider and locally cache the pages I link to, and rewrite each page so it displays properly. You can see it at work on this very post: the cache link after a normal link leads to the cached version hosted on my own server. This is useful for seeing what the link looked like when I first saw it, or simply for when the linked site goes down or disappears.
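To make the idea more concrete, here is a minimal sketch of two of those steps: finding the external links in a rendered post and adding a cache link after each one. It is not the plugin's actual code. The names `find_links`, `cache_path` and `add_cache_links`, the hash-based directory layout and the `<sup>` markup are all assumptions for illustration, and the spidering step (downloading each page and its assets) is left out entirely.

```python
import hashlib
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every external <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Only absolute http(s) links point at other sites worth caching.
        if href and href.startswith(("http://", "https://")):
            self.links.append(href)


def find_links(html):
    """Return all external link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def cache_path(url, cache_dir="cache"):
    """Map a URL to a stable local path (hypothetical naming scheme)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    return f"{cache_dir}/{digest}/index.html"


# Matches a whole <a ...>...</a> element whose href is an absolute http(s) URL.
EXTERNAL_LINK = re.compile(
    r'(<a\s[^>]*href="(https?://[^"]+)"[^>]*>.*?</a>)', re.S
)


def add_cache_links(html, cache_dir="cache"):
    """Append a superscript 'cache' link after each external link."""

    def repl(match):
        original, url = match.group(1), match.group(2)
        local = cache_path(url, cache_dir)
        return f'{original}<sup><a href="/{local}">cache</a></sup>'

    return EXTERNAL_LINK.sub(repl, html)
```

Hashing the URL keeps the local path stable across rebuilds, so a page only needs to be downloaded once. A regular expression is good enough for a sketch like this, but a real HTML parser would be more robust for the rewriting step.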

The code is still far from perfect, quite ugly to be honest, but so far it seems to work. I'm not quite sure yet how best to present the link to the cached version. Putting it in a superscript is workable, but not quite as pretty as I'd like. Also, I should probably put a frame or something similar around the cached page to make it clear that it's just a cached copy.

The elephant in the room is copyright. I think simply caching a page I directly link to should be covered by fair use, and there's precedent: Google keeps a cached version for its search results, there's the Internet Archive, etc. If anybody knows more on this subject, or would like to have their site removed from caching, please don't hesitate to contact me!

To use it, simply put the plugin [cache] into your Pelican plugin directory and enable it in the config, and you are done. To disable caching for specific links, put an ! in front of them. This is not quite perfect, since those links would break if you disabled the plugin, but I'm still looking for a better solution.
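As a rough illustration of the ! convention, the sketch below shows how a link target could be split into the real URL and a flag saying whether to cache it. The function names `parse_link` and `strip_opt_out_markers` are made up for this example, and it assumes the ! sits at the very start of the link target; the plugin itself may handle this differently.

```python
import re

OPT_OUT_MARKER = "!"


def parse_link(href):
    """Return (clean_url, should_cache) for a link target.

    A leading '!' means "do not cache this link".
    """
    if href.startswith(OPT_OUT_MARKER):
        return href[len(OPT_OUT_MARKER):], False
    return href, True


def strip_opt_out_markers(html):
    """Remove the leading '!' from every href so the published link works."""
    return re.sub(r'href="!([^"]*)"', r'href="\1"', html)
```

This also shows where the weakness comes from: the ! lives inside the link target itself, so if nothing strips it out, the published link points at a URL starting with ! and breaks.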

Category: misc

Tags: programming, web, blogging

© Julian Schrittwieser. Built using Pelican. Theme by Giulio Fidente on GitHub.