By Jon Waraas - First Published: April 23rd, 2024
The Googlebot tracking tool has only been operational for a month now, so there isn't a lot of data. Yet I still wanna test something.
If you check out the Googlebot logs, you will see that the Googlebots keep trying to crawl the "/wp-includes/images/smilies/icon_cool.gif" URL on this blog (Waraas.Com), even though it's a 404.
Now before I move on to the "Why is Google doing this?", let me tell you the back story first.
From 2006 until roughly 2016 I ran a blog called "JonWaraas.Com" that did alright and got some traffic. In 2016 I gave up on the blog thing, and let the blog go offline (big mistake!).
But this year I decided to start the blog up again because I enjoy it (I love talking about web development, and cabin buildin'). This time however, I created my own blog CMS from scratch. I figured that a "web development" blog should be using custom software that is fun and entertaining for fellow developers, and also not WordPress.
Last year I also picked up the domain name "Waraas.Com", which was dropped (expired) for the first time since 2005. I was super stoked to get the "Waraas.Com" domain. I wanted to use it for my blog!
I then forwarded the "JonWaraas.Com" domain to the "Waraas.Com" domain, so apparently Google thinks I'm still using WordPress. Which leads to my problem:
Google keeps trying to check for an old WordPress core file, even when there is no WordPress at all.
I'm one of those that believe in the Google "crawl budget", which is basically the idea that Google has a bot "crawl" limit per website. So your website will only get so many of those cute little Googlebots per month, depending on how much traffic you get.
And I don't want to waste my "crawl budget" on 404's!!
I have no clue why Google keeps trying that specific URL, but let's test something..
I was reading a new SE Round Table article on the Googlebot crawl activity when I decided to leave a comment about how the "Googlebot is dumb" blah blah. Well Edward Lewis commented that I should try 410's instead of 404's.
So I decided to do what any rational person would do, and spend a few hours of my life reading, coding, and writing about something I learned from a complete stranger on Facebook.
Today (4/23/24) I changed the .htaccess file to return 410 (Gone) errors for the core WordPress files, instead of 404's. The code is super simple; you can see it below.
Note that on my screenshot above the "RewriteEngine On" line is missing since it's at the very top of the file. Below is the whole code for it to work correctly:
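# Turn on mod_rewrite (only needed once, near the top of the .htaccess file)
RewriteEngine On
# Answer 410 Gone for anything under the old WordPress directories
RewriteRule ^wp-admin/ - [G,L]
RewriteRule ^wp-content/ - [G,L]
RewriteRule ^wp-includes/ - [G,L]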
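If you want to preview which URLs those three rules would catch before touching a live site, here is a rough Python sketch that mirrors the patterns. It is only an approximation under a couple of assumptions: the .htaccess file sits at the site root (so mod_rewrite sees the path without its leading slash), and the function name `would_return_gone` is made up for this example. Apache does the real matching.

```python
import re

# Hypothetical Python mirror of the three .htaccess patterns above.
# Apache's mod_rewrite is what actually decides; this is only a preview.
GONE_PATTERNS = [
    re.compile(r"^wp-admin/"),
    re.compile(r"^wp-content/"),
    re.compile(r"^wp-includes/"),
]


def would_return_gone(url_path: str) -> bool:
    """Return True if the path would hit one of the [G] (410 Gone) rules."""
    # Assumption: .htaccess lives at the site root, so the pattern is
    # matched against the path with the leading slash removed.
    relative = url_path[1:] if url_path.startswith("/") else url_path
    return any(pattern.match(relative) for pattern in GONE_PATTERNS)


if __name__ == "__main__":
    for path in [
        "/wp-includes/images/smilies/icon_cool.gif",
        "/wp-login.php",
        "/some-blog-post",
    ]:
        label = "410 Gone" if would_return_gone(path) else "not matched"
        print(path, "->", label)
```

One thing this makes obvious: the rules only cover the three directories. Old WordPress files that lived at the site root (like wp-login.php) are not matched and would still get the normal 404.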
So let's see over the next 30 days if Google can catch on to this, and maybe it can crawl some of the actually important pages.
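To actually measure whether Googlebot backs off, one option is to count its hits per URL and status code before and after the change. Below is a minimal sketch, assuming the hits are available as lines in the standard Apache "combined" access log format (my own tracking tool stores things differently, so treat the format, the function name `tally_googlebot_hits`, and the sample lines as made up for illustration).

```python
import re
from collections import Counter

# Assumed format: Apache "combined" access log. Adjust the pattern if your
# server or tracking tool writes hits differently.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def tally_googlebot_hits(lines):
    """Count hits per (path, status) for requests whose user agent says Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        if "Googlebot" not in match.group("agent"):
            continue  # not claiming to be Googlebot
        counts[(match.group("path"), int(match.group("status")))] += 1
    return counts


# Made-up sample lines (documentation IP range), not real log data.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SMILEY = "/wp-includes/images/smilies/icon_cool.gif"
SAMPLE_LINES = [
    f'192.0.2.10 - - [22/Apr/2024:08:00:00 +0000] "GET {SMILEY} HTTP/1.1" 404 196 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.10 - - [24/Apr/2024:08:00:00 +0000] "GET {SMILEY} HTTP/1.1" 410 210 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.10 - - [30/Apr/2024:08:00:00 +0000] "GET {SMILEY} HTTP/1.1" 410 210 "-" "{GOOGLEBOT_UA}"',
    '192.0.2.55 - - [30/Apr/2024:09:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for (path, status), hits in sorted(tally_googlebot_hits(SAMPLE_LINES).items()):
        print(status, hits, path)
```

Keep in mind that this only matches on the user agent string, which anyone can spoof. It counts everything that claims to be Googlebot, fake ones included.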
Also note that it takes Google weeks to crawl the new blog posts here (Waraas.Com). For example, my last blog post "The Art of Bot War: What Are Fake Googlebots?" was published on April 12th. Yet today is the 23rd and it still has not been crawled by Google. Meanwhile, that old WordPress smiley face 404 was crawled twice during the same timeframe (that's why I call Googlebot dumb!).
Well, it looks like John Mueller was asked about 404's vs 410's back in 2018, and this is what he said:
"From our point of view, in the mid term/long term, a 404 is the same as a 410 for us. So in both of these cases, we drop those URLs from our index.
We generally reduce crawling a little bit of those URLs so that we don't spend too much time crawling things that we know don't exist.
The subtle difference here is that a 410 will sometimes fall out a little bit faster than a 404. But usually, we're talking on the order of a couple days or so.
So if you're just removing content naturally, then that's perfectly fine to use either one. If you've already removed this content long ago, then it's already not indexed so it doesn't matter for us if you use a 404 or 410." - John Mueller
Now that was in 2018, so they might have changed things since then?
I'm not exactly sure what will happen. Will the 410's stop Google from crawling old WordPress core files, which is wasting my "crawl budget" (if you believe in that)? Or did I just waste two hours of my time coding and writing about something that won't work?
So, what do you think will happen? I'll repost back with my lab results in a month or so :)
Ever since building my first website in 2002, I've been hooked on web development. I now manage my own network of eCommerce/content websites full-time. I'm also building a cabin inside an old ghost town. This is my personal blog, where I discuss web development, SEO, eCommerce, cabin building, and other personal musings.