By Jon Waraas - First Published: October 1st, 2025
I built the Googlebot Tracker 4200 tool back in March 2024 to get a better sense of how Google's crawlers actually work. Since then, it's collected roughly 55,000 rows of crawl data from my own site - but interestingly, about 50 of those rows somehow come from other (external) websites.
- VIEW THE EXTERNAL CRAWL DATA HERE -
While I'm not entirely sure how I got Googlebot data from other websites, I wanted to write a blog post sharing some of my hypotheses on why it might be happening.
The first thing to note is that this website (Waraas.com) is on a dedicated IP address (198.54.114.122), which means only this site is tied to that IP. That's important because it rules out the possibility that the data is coming from a shared hosting setup.
The second thing to note is that the data is verified as coming from Googlebot. I can confirm this by having the software run a reverse DNS check on the crawler's IP address to see which hostname it resolves to.
If you look at the data, you'll notice that most of the external websites are pretty spammy. Many don't even work, and most of them are just subdomains.
A lot of the external URLs point to the /robots.txt file.
Some of the external websites do still respond, but in most cases the only thing that works is their /robots.txt file.
A snippet of the external data from the Googlebot Tracker 4200
When a crawler like Googlebot visits this website (or any website), it leaves its IP address behind in the server logs. A reverse DNS (rDNS) lookup is a way to take that IP address and figure out the hostname associated with it. This is useful because it helps confirm that the request really came from Googlebot and not some random bot pretending to be it.
For example, an IP from Googlebot might resolve to something like crawl-66-249-66-1.googlebot.com. The googlebot.com part of the hostname is what tells us that this request is legit.
To check the hostname yourself, you can run a reverse DNS lookup. On Linux or macOS, use either:

host [IP]

nslookup [IP]

On Windows, nslookup [IP] works as well.

For example, if you run a reverse DNS lookup on the IP 66.249.66.1, it might return crawl-66-249-66-1.googlebot.com. That's how I can verify that my tracker is actually logging real Googlebot visits and not some fake bot trying to sneak in.
Doing these checks regularly gives confidence that the data in my Googlebot Tracker is accurate, even for those unusual external URLs that show up in the logs.
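The manual check above can be automated. Here's a minimal sketch of that verification in Python: reverse DNS on the IP, check the hostname suffix, then a forward DNS lookup to confirm the hostname really maps back to the same IP (the function name and suffix list are my own; this is not the actual Tracker 4200 code).

```python
import socket

# Hostnames Google's crawlers resolve to end in one of these domains.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip: str) -> bool:
    """Verify a crawler IP in three steps:
    1. Reverse DNS: IP -> hostname.
    2. Check the hostname ends in googlebot.com or google.com.
    3. Forward DNS: hostname -> IPs, and confirm the original IP is among them.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # step 1: rDNS lookup
    except socket.herror:
        return False  # no PTR record at all -> not Googlebot

    if not hostname.endswith(GOOGLEBOT_SUFFIXES):  # step 2: hostname check
        return False

    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # step 3: forward confirm
    except socket.gaierror:
        return False
    return ip in forward_ips
```

The forward-confirmation step matters because anyone can set a PTR record claiming to be googlebot.com; only Google controls the forward records for that domain.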
With that in mind, let's go over some of my hypotheses on why Waraas.com is getting external Google crawl data.
Every website has a domain name (like example.com) and an IP address (like 192.0.2.1). The DNS system basically tells browsers and bots where to send their requests. Normally, Waraas.com sits on its own dedicated IP, so only my site is tied to that address.
Anyone who owns a domain can technically point it to your IP by changing the A record. For example, someone could register external-domain.com and set its DNS A record to my server's IP.
Now, when Googlebot (or anyone) crawls that domain, the request ends up at my server. My server doesn't care that it "belongs" to someone else; it just sees a request for a URL.
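You can check for this yourself: resolving a suspicious external domain and comparing the result to your server's IP tells you whether its A record currently points at you. A small sketch (the helper name is hypothetical; the IP is this site's dedicated address from above):

```python
import socket

MY_IP = "198.54.114.122"  # Waraas.com's dedicated IP

def points_at_my_server(domain: str) -> bool:
    """Return True if the domain's A record currently resolves to my IP."""
    try:
        # gethostbyname_ex returns (hostname, aliases, [ip, ip, ...])
        return MY_IP in socket.gethostbyname_ex(domain)[2]
    except socket.gaierror:
        return False  # domain doesn't resolve at all
```

Note that a domain that pointed at my IP when Googlebot crawled it may have since changed or dropped its A record, which matches how many of the external domains in my data no longer work.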
Googlebot always crawls by domain, not by IP. So when it hits external-domain.com:
1. It looks up the domain via DNS -> finds my IP.
2. It sends an HTTP request with the Host header set to external-domain.com.
3. My tracker logs it as a Googlebot visit.
Since the host isn't Waraas.com, my tracker labels it as external crawl data. That's why these rows show up in the logs even though the domain isn't mine.
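The labeling step can be sketched in a few lines: classify each logged hit as internal or external based purely on the Host header the crawler sent. This is my assumption about how such a tracker would work, not the actual Tracker 4200 code, and the host list is illustrative:

```python
# Hosts that legitimately belong to this site.
MY_HOSTS = {"waraas.com", "www.waraas.com"}

def label_request(host_header: str) -> str:
    """Classify a logged request as internal or external crawl data,
    based on the HTTP Host header the crawler sent."""
    host = host_header.lower().split(":")[0]  # normalize case, strip any :port
    if host in MY_HOSTS:
        return "internal"
    return "external"  # e.g. external-domain.com pointed at my IP
```

So a request for external-domain.com/robots.txt that lands on my server gets labeled "external", even though it physically arrived at the same IP as every Waraas.com hit.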
If other websites embed images, scripts, or other content hosted on my server, my Googlebot Tracker will pick up those requests. Even though the original link is on another domain, my tracker logs the crawl data just like it would for Waraas.com.
If some external domains use reverse proxies or redirect traffic to my server (either on purpose or by accident), my Googlebot Tracker can pick up those requests. That's why Googlebot activity from other domains might show up in my logs.
Even though Waraas.com is on a dedicated IP, Googlebot crawls from large shared IP ranges. Since my Googlebot Tracker logs every request that reaches my server, requests for external URLs that arrive at my IP can show up in the logs even though they're not part of my site.
Since my Googlebot Tracker scans logs and traffic broadly, it can sometimes pick up URLs from other domains that pass through my server, especially during tests, mirrors, or through CDN caching setups.
Sometimes Googlebot crawls URLs that were once linked to my IP or domain because of cached data or old DNS records. These "phantom" URLs could end up showing up as external data in my Googlebot Tracker.
Ever since building my first website in 2002, I've been hooked on web development. I now manage my own network of eCommerce/content websites full-time. I'm also building a cabin inside an old ghost town. This is my personal blog, where I discuss web development, SEO, cabin building, and other personal musings.