No bots allowed?

Written by Andy Davies - May 16, 2024

What does a website that doesn’t allow bot traffic look like?

I don’t know, Google couldn’t find one…

The level of bot traffic flying around the internet was brought into sharp focus for me recently when a client’s site started to consistently use more than its permitted bandwidth per month. Investigations showed that half the daily bandwidth usage came from a handful of IP addresses. The visits were to obscure parts of the site and looked programmatic in nature (i.e. not from human users). Blocking a dozen IP addresses halved the bandwidth used and all was well. The client didn’t need to upgrade their hosting package and saved money.
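For anyone wanting to run a similar check, here is a minimal sketch of how bandwidth per IP address might be totalled from a server access log. It assumes the log uses the Common or Combined Log Format; the function names (`bytes_by_ip`, `top_talkers`) and the sample lines are made up for illustration and are not the tooling we used on the client's site.

```python
import re
from collections import Counter

# Assumption: Common/Combined Log Format, for example:
# 203.0.113.5 - - [16/May/2024:10:00:00 +0000] "GET /page HTTP/1.1" 200 5120
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} (\d+|-)')


def bytes_by_ip(lines):
    """Total the response bytes sent to each client IP address."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        ip, size = match.groups()
        totals[ip] += 0 if size == "-" else int(size)  # "-" means no body sent
    return totals


def top_talkers(lines, n=10):
    """Return (ip, bytes, share of total) for the n heaviest IP addresses."""
    totals = bytes_by_ip(lines)
    grand_total = sum(totals.values()) or 1  # avoid dividing by zero
    return [(ip, size, size / grand_total) for ip, size in totals.most_common(n)]


# Made-up sample lines using documentation-range IP addresses
SAMPLE = [
    '203.0.113.5 - - [16/May/2024:10:00:00 +0000] "GET /archive/1 HTTP/1.1" 200 3000',
    '203.0.113.5 - - [16/May/2024:10:00:01 +0000] "GET /archive/2 HTTP/1.1" 200 3000',
    '198.51.100.7 - - [16/May/2024:10:00:02 +0000] "GET / HTTP/1.1" 200 2000',
    '198.51.100.8 - - [16/May/2024:10:00:03 +0000] "GET /about HTTP/1.1" 304 -',
]

if __name__ == "__main__":
    for ip, size, share in top_talkers(SAMPLE):
        print(f"{ip}\t{size} bytes\t{share:.0%}")
```

A handful of addresses taking a large share of the total, especially on obscure URLs, is the kind of pattern worth a closer look before deciding to block anything.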

In spite of solving the problem, the issue nagged at me. So much of the work Wholegrain does is human centric. We focus on an organisation’s audience, their needs, their journey through a site, their behaviours and requirements. At its essence, a website is a way of solving a human problem, and/or meeting a human need.

Ideally, from a digital sustainability point of view, websites should provide these solutions while also minimising carbon emissions. Efficiency is key. At Wholegrain we spend much of our working lives minimising the environmental impact of a high number of humans visiting our sites. However, the level of emissions coming from non-human traffic hasn’t really been on the radar. Until now.

After raising the issue with peers and colleagues, a few key facts emerged: 

  1. Around half of all traffic on the internet is non-human.
  2. The level of bot traffic is increasing.
  3. A significant portion of the increase comes from a rise in the number of ‘bad’ bots. 

So what is a sustainably-minded web design professional to do?

Are bots hot or not?

As a science fiction fan, I’ve seen my share of good and bad bots. For every Skynet causing nuclear Armageddon, there’s a Culture mind benevolently nudging alien races towards more ethically sound ways of life. The same goes for bots on the internet: there are good and bad (although admittedly, the bad are less intent on shaping or destroying life).

In internet terms, a bot is a piece of software that carries out automated tasks.

‘Good’ bots carry out useful functions that help the web work better. Some, like SEO crawlers, help users find you by indexing your site’s content for web searches. Others might monitor your site for errors or downtime.

‘Bad’ bots are only useful for their creators. They carry out malicious actions in an automated manner. This could be scraping data from your site without permission, spamming your contact forms or even taking your whole site down by overwhelming it with a flood of bot-driven traffic.

Without the good bots, the internet wouldn’t function in the way it does now. With them, a search for “The greatest bots in popular culture”* yields relevant results (and an afternoon’s worth of geeky nostalgia).

While blocking all bots might seem like an ideal solution, doing this may mean that your site doesn’t appear in search results at all. In researching this article, I read that Liz Truss’s website is one such site that has blocked all bot traffic. If you search for Liz Truss on a search engine, you’ll likely be served news reports on the last days of her premiership and not see any evidence of her personal website. Not ideal given how all that ended…

At Wholegrain we also use a service to monitor the ‘up-time’ of the sites we manage. If any of them are suffering an outage, the bots monitoring the sites will give us the heads up. This means problems can be resolved before too many people have trouble accessing the site.

So we do need some bots. They make the web more robust and efficient, and they improve the discoverability of websites. But according to Imperva’s 2024 Bad Bot Report, bad bots accounted for around a third of all web traffic in 2023, almost double that of good bots. Overall, around 50% of traffic was human and 50% some sort of bot. There were also four months of the year when the level of bot traffic was actually higher than human traffic. I feel like I’m in The Matrix just thinking about it.

The role of AI and LLMs in bot traffic levels

Interestingly, the Imperva report above suggests that the rise of AI is driving bot traffic. It’s doing this in two ways. Firstly, Large Language Models (LLMs) are scraping the content and data they need to function using crawler type bots. I’ll return to this topic shortly as I think it’s a troubling element in all this. But not only are LLMs contributing to the level of bot traffic, they are also facilitating it.

Platforms like ChatGPT have lowered the technical expertise required to create automated programs. A couple of prompts will result in a functioning Python script that facilitates site scraping. Anyone with an account can go away and take as much of your content and IP as they wish. Imperva and others suggest that this ease of access has resulted in an increase in bot traffic.

Perhaps more importantly, LLM bots are in the process of visiting your site, or have already visited it, to scrape it themselves. This has huge implications for your IP and copyrighted materials. AI services like ChatGPT can mimic the content on your site and reproduce it with alarming speed. While tech CEOs would tell you this is a huge boon and time saver, I personally think it is hugely problematic.

Try this. Open ChatGPT and use the following prompts (replacing the text in square brackets with relevant information):

  1. “What can you tell me about [insert name of your organisation + URL]?”
  2. Assuming the information provided is accurate, continue with this prompt: “I would like to write an article on [insert USP of your organisation] in the style of [your organisation].”
  3. Next prompt: “In the above, please can you change any mention of [my organisation] for [new organisation]?”
  4. Final prompt: “Can you turn this article into a service offering for a new organisation?”

In less than five minutes, anyone can recreate a tone of voice and service offering that mimics your own. It might need a little work to make it palatable but nowhere near as much as you would need if you started from scratch. When I tried this for a few agencies and organisations I was horrified and furious with how well it worked.

What action can you take?

I think there’s a strong case for blocking bad or unwanted bot traffic from your site. It makes sense from a security point of view. It protects your content from unauthorised reproduction. It reduces the energy and emissions associated with your site without affecting the real, live humans that you actually want to reach.

The problem is that automated blocking is pretty tricky. Discussions with our developers and the wider community raised as many questions as answers. For the unsophisticated ‘bad bots’ it’s a bit like whack-a-mole: easy but continuous. Services like Cloudflare can throw up a fairly robust firewall for blocking bots, but it’s not 100% effective. Setting rules in your robots.txt file should work, but only if the author of the bot script is happy to honour your request. Sadly, it appears that many users of ‘bad’ bots are less than honourable!
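To make the robots.txt point concrete, here is a small sketch using Python's standard `urllib.robotparser` module. The rules shown are an example only, not a recommendation: they ask OpenAI's `GPTBot` crawler to stay away entirely and ask everyone else to avoid a hypothetical `/private/` folder. The helper names (`build_parser`, `is_allowed`), the `example.org` URLs and the `ExampleScraper` user agent are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example rules only: block one named crawler, keep others out of /private/
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Load robots.txt rules from a string rather than fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url):
    """What a well-behaved bot asks itself before requesting a URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    checks = [
        ("GPTBot", "https://example.org/blog/"),
        ("Googlebot", "https://example.org/blog/"),
        ("Googlebot", "https://example.org/private/report"),
        ("ExampleScraper", "https://example.org/blog/"),  # hypothetical bot
    ]
    for agent, url in checks:
        verdict = "allowed" if is_allowed(parser, agent, url) else "disallowed"
        print(f"{agent}: {url} is {verdict}")
```

The catch is that this check happens on the bot's side. Nothing in robots.txt stops a script that never runs it, which is why the file works as a polite request rather than a lock.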

The truth is, this is something that we need to grapple with a bit longer to find a suitable, sustainable, balanced and repeatable solution. Transparency and openness when it comes to these matters is key. Hopefully this opens up discourse to help find solutions.

Is your organisation grappling with this issue? We’d love to hear your thoughts and discuss solutions.

*The correct answer (in my opinion) is Marvin the Paranoid Android.