Google Says They Deploy Hundreds Of Undocumented Crawlers

Google’s Gary Illyes and Martin Splitt published a podcast episode about Googlebot, explaining that Google runs not one standalone crawler but hundreds of crawlers across different products and services, most of which are not publicly documented.

What Googlebot Is

Gary clarifies that “Googlebot” is a historical name dating from the early days, when Google had just a single crawler. Google now operates many crawlers across different products, but the name stuck.

He further explains that Googlebot is neither the crawling infrastructure itself nor a single system. It is one client of a larger internal crawling service, and that service is the actual infrastructure.

Martin Splitt asked:

“How can I imagine Googlebot? How does our crawling infrastructure roughly look like?”

Gary answered:

“I mean, calling it Googlebot, that’s a misnomer. And it’s something that back in the days, perhaps early 2000s, it worked well because back then we probably had one crawler because we had one product. But then soon after another product came out, I think that was AdWords. And then we started having more crawlers and then more products came out and then more crawlers and then more crawlers.

But the Googlebot name that somehow stuck. Generally when we were talking about our crawling infrastructure in general, then we tended to call it Googlebot, but that was wildly inaccurate because Googlebot was just one thing that was communicating with our crawler infrastructure.”

Crawling Infrastructure Has A Name

Gary then explains that the crawling infrastructure has an internal name at Google, but he declines to say what it is.

He continued:

“Googlebot is not our crawler infrastructure. Our crawler infrastructure doesn’t have an external name. It has an internal name. Doesn’t matter what it is. Let’s call it Jack. And it is, I don’t know how to put it. It’s software as a service, if you like. SaaS. Right? Then, so Jack has API endpoints, so to say. And then you can call those API endpoints to do a fetch from the internet.

And then when you do those API calls, then you also need to specify some parameters like how long are you willing to wait for, for the bytes to come back or what is your user agent that you want to send? What is the robots.txt product token that you want to obey and all these parameters.

And we do set a default parameter for most of these things, not all of them, but most of these things. So you can generally omit them, which makes these calls simpler, I guess, because you don’t have to specify all the stuff. But otherwise, it’s really just an API call to something in the cloud or on some random data center. And then that will perform a fetch for you as a software developer or a product.

So this product, because we can call it a product at this point, even if it’s internal, this has been around for a very, very, very, very…
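The kind of call Gary describes can be pictured with a short Python sketch. Google has not published this internal API, so every name and default value below (FetchRequest, timeout_seconds, ExampleBot, is_allowed) is a hypothetical stand-in. Only the general idea comes from the podcast: a fetch request carries parameters such as a timeout, a user agent, and a robots.txt product token, and most of them can be omitted because they have defaults. The robots.txt check uses Python’s standard urllib.robotparser module, not anything Google-specific.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser


# Hypothetical request object: field names and defaults are illustrative
# assumptions, not Google's real parameters.
@dataclass(frozen=True)
class FetchRequest:
    url: str
    timeout_seconds: float = 30.0        # how long the caller will wait for bytes
    user_agent: str = "ExampleBot/1.0"   # User-Agent header to send
    robots_token: str = "ExampleBot"     # robots.txt product token to obey


def is_allowed(request: FetchRequest, robots_txt: str) -> bool:
    """Check the request's URL against robots.txt rules for its product token."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(request.robots_token, request.url)


# A robots.txt that only restricts the hypothetical ExampleBot token.
ROBOTS_TXT = "User-agent: ExampleBot\nDisallow: /private/\n"

# Caller omits everything except the URL and relies on the defaults.
default_request = FetchRequest("https://example.com/private/page")

# Another caller overrides the token and the timeout for its own product.
custom_request = FetchRequest(
    "https://example.com/private/page",
    timeout_seconds=5.0,
    robots_token="OtherBot",
)
```

With this robots.txt, is_allowed(default_request, ROBOTS_TXT) is False because the ExampleBot token is barred from /private/, while is_allowed(custom_request, ROBOTS_TXT) is True because no rule names OtherBot. It shows why the product token is a per-call parameter: the same fetch service can obey different robots.txt rules depending on which product is asking.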
