
Googlebot Testing: 404 vs 410?


By Jon Waraas - First Published: April 23rd, 2024

How the heck do I get Google to stop crawling 404's??


The Googlebot tracking tool has only been operational for a month now, so there isn't a lot of data. Yet I still wanna test something.

If you check out the Googlebot logs, you will see that the Googlebots keep trying to crawl the "/wp-includes/images/smilies/icon_cool.gif" URL on this blog (Waraas.Com), even though it's a 404.

^ Above: You can see that in the last 30 days, the Googlebots have crawled this URL 4 different times, and it returned a 404 to the Googlebot each time.
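If you want to pull the same kind of numbers out of a raw access log, here is a rough Python sketch. It assumes the standard Apache "combined" log format, which may not be how my tracking tool stores things, and the sample lines are made up for illustration. It also only matches on the user agent string, so it can't tell a real Googlebot from a fake one.

```python
import re
from collections import Counter

# Assumption: Apache "combined" log format. Adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count (path, status) pairs for requests whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[(match.group("path"), int(match.group("status")))] += 1
    return hits


# Made-up sample lines (documentation IP addresses, not real log data).
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE = [
    '192.0.2.1 - - [02/Apr/2024:10:15:00 +0000] '
    '"GET /wp-includes/images/smilies/icon_cool.gif HTTP/1.1" 404 512 "-" "' + GOOGLEBOT_UA + '"',
    '192.0.2.1 - - [09/Apr/2024:11:00:00 +0000] '
    '"GET /wp-includes/images/smilies/icon_cool.gif HTTP/1.1" 404 512 "-" "' + GOOGLEBOT_UA + '"',
    '203.0.113.7 - - [09/Apr/2024:11:05:00 +0000] '
    '"GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for (path, status), count in googlebot_hits(SAMPLE).items():
        print(status, count, path)
```

Sorting the result by count should make the worst repeat offenders (like that smiley) easy to spot.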

Now, before I move on to the "Why is Google doing this?" part, let me tell you the backstory.

From 2006 until roughly 2016 I ran a blog called "JonWaraas.Com" that did alright and got some traffic. In 2016 I gave up on the blog thing, and let the blog go offline (big mistake!).

But this year I decided to start the blog up again because I enjoy it (I love talking about web development, and cabin buildin'). This time, however, I created my own blog CMS from scratch. I figured that a "web development" blog should be running custom software that is fun and entertaining for fellow developers, and also not WordPress.

Last year I also picked up the domain name "Waraas.Com", which was dropped (expired) for the first time since 2005. I was super stoked to get the "Waraas.Com" domain. I wanted to use it for my blog!

I then forwarded the "JonWaraas.Com" domain to the "Waraas.Com" domain, so apparently Google thinks I'm still using WordPress. Which leads to my problem:

Google keeps trying to check for an old WordPress core file, even when there is no WordPress at all.

So, why does Google keep checking bad URLs?

I'm one of those who believe in the Google "crawl budget", which is basically the idea that Google has a bot "crawl" limit per website. So your website will only get so many of those cute little Googlebots per month, depending on how much traffic you get.

And I don't want to waste my "crawl budget" on 404's!!

I have no clue why Google keeps trying that specific URL, but let's test something...

I was reading a new Search Engine Roundtable article on Googlebot crawl activity when I decided to leave a comment about how the "Googlebot is dumb" blah blah. Well, Edward Lewis commented that I should try 410's instead of 404's.

^ Above: You can see our discussion on author Barry Schwartz's Facebook page.

So I decided to do what any rational person would do, and spend a few hours of my life reading, coding, and writing about something I learned from a complete stranger on Facebook.

Today (4/23/24) I changed the .htaccess file to return a 410 for the old WordPress core paths, instead of a 404. The code is super simple, and you can see it below.

^ Above: You can see my code (highlighted), which should show the Googlebots a 410 error instead of a 404.

Note that in my screenshot above the "RewriteEngine On" line is missing, since it's at the very top of the file. Below is the whole code needed for it to work correctly:

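RewriteEngine On
# The [G] flag means "Gone": Apache answers with a 410 for anything under these old WordPress paths
RewriteRule ^wp-admin/ - [G,L]
RewriteRule ^wp-content/ - [G,L]
RewriteRule ^wp-includes/ - [G,L]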
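To sanity-check which paths those three rules would actually catch, here is a small Python sketch that mimics the pattern matching. It is only an approximation of how mod_rewrite behaves in an .htaccess file at the site root (Apache matches against the path without the leading slash), and the function name is just something I made up for the example.

```python
import re

# Same three patterns as the RewriteRule lines above.
GONE_PATTERNS = [
    re.compile(r"^wp-admin/"),
    re.compile(r"^wp-content/"),
    re.compile(r"^wp-includes/"),
]


def status_from_rules(url_path):
    """Return 410 if one of the rules would fire, else None (the request falls through)."""
    # In a root .htaccess the leading slash is stripped before matching.
    relative = url_path[1:] if url_path.startswith("/") else url_path
    for pattern in GONE_PATTERNS:
        if pattern.search(relative):
            return 410
    return None


if __name__ == "__main__":
    for path in (
        "/wp-includes/images/smilies/icon_cool.gif",
        "/wp-admin",             # no trailing slash, so the pattern does not match
        "/blog/wp-includes/x",   # not anchored at the site root, so no match either
    ):
        print(path, status_from_rules(path))
```

One thing this makes obvious: the patterns are anchored and need the trailing slash, so a request for "/wp-admin" with no slash would not match any of them and would fall through to whatever the site normally returns.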

So let's see over the next 30 days if Google can catch on to this, and maybe it can crawl some of the actual important pages instead.
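One simple way to score the test is to count Googlebot hits on the old WordPress URLs in the 30 days before the switch and the 30 days after. The Python sketch below does just that. The change date comes from this post, but the function name and the hit dates are hypothetical placeholders, not real results.

```python
from datetime import date, timedelta

CHANGE_DATE = date(2024, 4, 23)  # the day the 410 rules went live


def crawls_before_and_after(hit_dates, change_date=CHANGE_DATE, days=30):
    """Return (hits in the `days` before the change, hits in the `days` from the change on)."""
    window = timedelta(days=days)
    before = sum(1 for d in hit_dates if change_date - window <= d < change_date)
    after = sum(1 for d in hit_dates if change_date <= d < change_date + window)
    return before, after


if __name__ == "__main__":
    # Hypothetical crawl dates for one URL, purely for illustration.
    example_hits = [date(2024, 4, 2), date(2024, 4, 9), date(2024, 4, 25), date(2024, 6, 30)]
    print(crawls_before_and_after(example_hits))
```

With only a handful of hits per month, a drop could easily be noise, so the comparison is more of a rough signal than proof either way.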

Also note that it takes Google weeks to crawl the new blog posts here (Waraas.Com). For example, my last blog post "The Art of Bot War: What Are Fake Googlebots?" was published on April 12th. Today is the 23rd and it still has not been crawled by Google. Meanwhile, that old WordPress smiley face 404 was crawled twice during the same timeframe (that's why I call Googlebot dumb!).

What does Google say about 410's?

Well, it looks like John Mueller was asked that question back in 2018, and this is what he said:

"From our point of view, in the mid term/long term, a 404 is the same as a 410 for us. So in both of these cases, we drop those URLs from our index.

We generally reduce crawling a little bit of those URLs so that we don't spend too much time crawling things that we know don't exist.

The subtle difference here is that a 410 will sometimes fall out a little bit faster than a 404. But usually, we're talking on the order of a couple days or so.

So if you're just removing content naturally, then that's perfectly fine to use either one. If you've already removed this content long ago, then it's already not indexed so it doesn't matter for us if you use a 404 or 410." - John Mueller

Source

Now that was in 2018, so they might have changed things since then?

I'm not exactly sure what will happen. Will the 410's stop Google from crawling old WordPress core files, which is wasting my "crawl budget" (if you believe in that)? Or did I just waste two hours of my time coding and writing about something that won't work?

So, what do you think will happen? I'll report back with my lab results in a month or so :)


