It used to be that there was a fairly good correlation between “unique visitors” in your web server logs and, well, unique visitors – real people who had actually looked at your web site.
These days I seem to get nearly as much junk traffic as real page views, if not more, and most of it is automated.
Here’s a list roughly ordered by volume.
- Referrer spam. Thanks to blogging software that automatically displays a list of the top linking pages on the front page, it is worthwhile for spammers to make hundreds of bogus page requests with spoofed HTTP referer headers (usually a gambling website URL).
- Comment spam. Even many months after removing Movable Type, spam bots are still hitting my non-existent mt-comments.cgi page. See my article on Combatting Comment Spam.
- Deep linking of images. When an image gets an order of magnitude more hits than the page it is displayed on, you know that someone has deep linked it. This is easy enough to prevent, but most guides on preventing deep linking neglect to mention a side effect: if a visitor uses a proxy that strips the “referer” header from HTTP requests, or a user agent that never sends it, the image will not load even when they view it on your own page.
- RSS feed requests. Some of this is from RSS aggregator applications, and some seems to be from websites that aggregate RSS feeds or provide RSS search facilities.
- Search engine spiders. Since the search engine wars hotted up, I seem to be getting hits from Googlebot, MSNbot and Yahoo’s Slurp multiple times per day.
- Hacking attempts. I always find it amusing when someone runs IIS exploits against a Linux webserver running Apache.
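The deep linking point above is easier to see in code. Here is a rough sketch in Python of a referer check that still serves images when the header is missing. The host names and the function name are made up for illustration, and treating an unparseable referer as foreign is my own assumption, not something any particular web server does.

```python
from urllib.parse import urlsplit

# Hypothetical host names; substitute the site's own.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def allow_image_request(referer, allowed_hosts=ALLOWED_HOSTS):
    """Return True if an image request should be served."""
    # No header at all (stripped by a proxy, or never sent by the
    # user agent): serve the image rather than break the page.
    if not referer or not referer.strip():
        return True
    try:
        host = urlsplit(referer.strip()).hostname
    except ValueError:
        # Malformed values such as a broken IPv6 literal.
        return False
    # Assumption: a referer with no recognisable host is treated as foreign.
    if host is None:
        return False
    return host.lower() in allowed_hosts
```

Only requests that carry a referer pointing at some other site are refused, so visitors behind header-stripping proxies still see the images.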
I need to find a decent web log analysing program that will filter all this junk out for me.
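In the meantime, a crude filter is not hard to sketch. The Python below assumes the log is in Apache's combined format and sorts each line into one of the categories listed above. The marker lists (exploit file names, spam words, feed file names) are illustrative guesses of mine and would need tuning against a real log.

```python
import re

# Apache "combined" log format: host, identity, user, time, request,
# status, size, referer, user agent.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# All four lists are illustrative assumptions, not a complete rule set.
EXPLOIT_MARKERS = ("cmd.exe", "default.ida")
SPIDER_AGENTS = ("googlebot", "msnbot", "slurp")
SPAM_REFERER_WORDS = ("casino", "poker")
FEED_SUFFIXES = ("/index.xml", "/index.rdf", "/atom.xml")


def classify(line):
    """Return a rough traffic category for one log line."""
    match = LOG_LINE.match(line)
    if match is None:
        return "unparsed"
    parts = match.group("request").split()
    target = parts[1].lower() if len(parts) >= 2 else ""
    path = target.split("?", 1)[0]
    agent = match.group("agent").lower()
    referer = match.group("referer").lower()

    if any(marker in target for marker in EXPLOIT_MARKERS):
        return "hacking attempt"
    if "mt-comments.cgi" in path:
        return "comment spam"
    if any(bot in agent for bot in SPIDER_AGENTS):
        return "spider"
    if any(word in referer for word in SPAM_REFERER_WORDS):
        return "referrer spam"
    if path.endswith(FEED_SUFFIXES):
        return "feed"
    return "page view"
```

Feeding every line of the log through `classify` and tallying the results with `collections.Counter` gives a quick picture of how much of the traffic is real. Deep linked images are not covered here, since spotting them means comparing hit counts between an image and its page rather than looking at one line at a time.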
One Comment
Trying to prevent deep linking of images by denying all requests that do not have your website URL in the HTTP referer header also prevents images from being loaded when someone views the cached version of your page in Google.