Web Crawlers - Top 10 Most Popular - KeyCDN (2024)

By Ben Eaton

Updated on December 16, 2022

When it comes to the World Wide Web, there are both bad bots and good bots. You definitely want to avoid bad bots, as these consume your CDN bandwidth, take up server resources, and steal your content. On the other hand, good bots (also known as web crawlers) should be handled with care, as they are a vital part of getting your content indexed by search engines such as Google, Bing, and Yahoo. In this blog post, we will take a look at the top ten most popular web crawlers.

What are web crawlers?

Web crawlers are computer programs that browse the Internet methodically and automatically. They are also known as robots, ants, or spiders.

Crawlers visit websites and read their pages and other information to create entries for a search engine's index. The primary purpose of a web crawler is to provide users with a comprehensive and up-to-date index of all available online content.

In addition, web crawlers can also gather specific types of information from websites, such as contact information or pricing data. By using web crawlers, businesses can keep their online presence (i.e. SEO, frontend optimization, and web marketing) up-to-date and effective.

Search engines like Google, Bing, and Yahoo use crawlers to properly index downloaded pages so that users can find them faster and more efficiently when searching. Without web crawlers, there would be nothing to tell them that your website has new and fresh content. Sitemaps also can play a part in that process. So web crawlers, for the most part, are a good thing.

However, crawlers can sometimes cause scheduling and load issues, as a crawler might constantly be polling your site. This is where a robots.txt file comes into play. This file can help control the crawling traffic and ensure that it doesn't overwhelm your server.

Web crawlers identify themselves to a web server using the User-Agent request header in an HTTP request, and each crawler has its own unique identifier. Most of the time, you will need to examine your web server access logs to view web crawler traffic.
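As a rough illustration of inspecting logs for crawler traffic, the following Python sketch counts requests per crawler. It assumes log lines in the common "combined" format, where the User-Agent is the last quoted field; the sample lines, the token list, and the function name are made up for this example.

```python
import re
from collections import Counter

# Tokens taken from the User-Agent strings discussed in this post.
CRAWLER_TOKENS = ["Googlebot", "Bingbot", "Slurp", "DuckDuckBot",
                  "Baiduspider", "YandexBot"]

# Assumption: combined log format, where the User-Agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_crawler_hits(log_lines):
    """Return a Counter mapping crawler token -> number of requests."""
    hits = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break  # count each request once
    return hits


# Hypothetical sample log lines (documentation IP ranges).
sample = [
    '203.0.113.5 - - [16/Dec/2022:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [16/Dec/2022:10:00:05 +0000] "GET /about HTTP/1.1" 200 301 "-" '
    '"Mozilla/5.0 (compatible; Bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '198.51.100.7 - - [16/Dec/2022:10:00:09 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/107.0"',
]

print(count_crawler_hits(sample))
```

Keep in mind that the User-Agent header is supplied by the client, so a match on its own does not prove that a request really came from that crawler.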

Robots.txt

By placing a robots.txt file at the root of your web server, you can define rules for web crawlers, such as allowing or disallowing certain assets from being crawled. Well-behaved web crawlers follow the rules defined in this file, although compliance is voluntary. You can apply general rules to all bots or get more granular and target a specific User-Agent string.

Example 1

This example instructs all search engine robots not to crawl any of the website's content. This is done by disallowing the root / of your website.

User-agent: *
Disallow: /

Example 2

This example achieves the opposite of the previous one. In this case, the instructions are still applied to all user agents. However, there is nothing defined within the Disallow instruction, meaning that everything can be indexed.

User-agent: *
Disallow:

To see more examples make sure to check out our in-depth post on how to use a robots.txt file.
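To check how a set of robots.txt rules would be interpreted, one option is Python's standard urllib.robotparser module. The sketch below feeds it the two examples above; the helper name and the example.com URL are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt, user_agent, url):
    """Parse robots.txt content and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example 1: disallow everything for all user agents.
block_all = "User-agent: *\nDisallow: /"
# Example 2: an empty Disallow allows everything.
allow_all = "User-agent: *\nDisallow:"

url = "https://example.com/page.html"  # placeholder URL
print(can_fetch(block_all, "Googlebot", url))  # False
print(can_fetch(allow_all, "Googlebot", url))  # True
```

This only shows how one parser reads the rules; individual crawlers may interpret edge cases differently.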

Top 10 good web crawlers and bots

There are hundreds of web crawlers and bots scouring the Internet, but below is a list of 10 popular web crawlers and bots that we have collected based on ones that we see on a regular basis within our web server logs.

1. Googlebot

As the world's largest search engine, Google relies on web crawlers to index the billions of pages on the Internet. Googlebot is the web crawler Google uses to do just that.

Googlebot comes in two types: a desktop crawler that simulates a person browsing on a computer and a mobile crawler that simulates a person browsing on an iPhone or Android phone.

The User-Agent string of a request can help you determine the subtype of Googlebot. Both Googlebot Desktop and Googlebot Smartphone will most likely crawl your website. However, both crawler types obey the same product token (user agent token) in robots.txt, so you cannot use robots.txt to selectively target either Googlebot Smartphone or Googlebot Desktop.
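A minimal Python sketch of telling the two subtypes apart from the User-Agent string might look like the following. It rests on an assumption drawn from the User-Agent strings listed in this post: the smartphone crawler includes a mobile browser signature ("Mobile") while the desktop crawler does not. The function name is hypothetical.

```python
def googlebot_subtype(user_agent):
    """Return 'smartphone', 'desktop', or None for a User-Agent string."""
    # Only the main web search crawler carries the "Googlebot/" token;
    # variants such as Googlebot-Image are deliberately not matched here.
    if "Googlebot/" not in user_agent:
        return None
    # Assumption: the smartphone crawler advertises a mobile browser.
    if "Mobile" in user_agent:
        return "smartphone"
    return "desktop"


desktop_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
smartphone_ua = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 "
    "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)

print(googlebot_subtype(desktop_ua))
print(googlebot_subtype(smartphone_ua))
```

Because the header can be set by any client, treat the result as a hint rather than as verification.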

Googlebot is a very effective web crawler that can index pages quickly and accurately. However, it does have some drawbacks. For example, Googlebot does not always crawl all the pages on a website (especially if the website is large and complex).

In addition, Googlebot does not always crawl pages in real-time, which means that some pages may not be indexed until days or weeks after they are published.

User-Agent

Googlebot

Full User-Agent string

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Googlebot example in robots.txt

This example is a little more granular. Here, the instructions are only relevant to Googlebot. More specifically, it tells Google not to crawl a specific page (/no-index/your-page.html).

User-agent: Googlebot
Disallow: /no-index/your-page.html

Besides its web search crawler, Google has 9 additional web crawlers:

Web crawler | User-Agent string
Googlebot News | Googlebot-News
Googlebot Images | Googlebot-Image/1.0
Googlebot Video | Googlebot-Video/1.0
Google Mobile (featured phone) | SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
Google Smartphone | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Google Mobile Adsense | (compatible; Mediapartners-Google/2.1; +http://www.google.com/bot.html)
Google Adsense | Mediapartners-Google
Google AdsBot (PPC landing page quality) | AdsBot-Google (+http://www.google.com/adsbot.html)
Google app crawler (fetch resources for mobile) | AdsBot-Google-Mobile-Apps

You can use the URL Inspection tool in Google Search Console (the successor to the older Fetch as Google tool) to test how Google crawls or renders a URL on your site. It shows whether Googlebot can access a page on your site, how it renders the page, and whether any page resources (such as images or scripts) are blocked to Googlebot.

You can also see the Googlebot crawl stats per day, the number of kilobytes downloaded, and the time spent downloading a page.

See Googlebot robots.txt documentation.

2. Bingbot

Bingbot is a web crawler deployed by Microsoft in 2010 to supply information to its Bing search engine. It replaced what used to be the MSN bot.

User-Agent

Bingbot

Full User-Agent string

Mozilla/5.0 (compatible; Bingbot/2.0; +http://www.bing.com/bingbot.htm)

Bing offers a tool very similar to Google's, called Fetch as Bingbot, within Bing Webmaster Tools. Fetch as Bingbot lets you request that a page be crawled and shown to you as Bing's crawler would see it. You will see the page code as Bingbot sees it, which helps you understand whether Bing sees your page as you intended.

See Bingbot robots.txt documentation.

3. Slurp Bot

Yahoo Search results come from the Yahoo web crawler Slurp and Bing's web crawler, as a lot of Yahoo is powered by Bing. Sites should allow Yahoo Slurp access in order to appear in Yahoo Mobile Search results.

Additionally, Slurp does the following:

  • Collects content from partner sites for inclusion within sites like Yahoo News, Yahoo Finance, and Yahoo Sports.
  • Accesses pages from sites across the Web to confirm accuracy and improve Yahoo's personalized content for its users.

User-Agent

Slurp

Full User-Agent string

Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

See Slurp robots.txt documentation.

4. DuckDuckBot

DuckDuckBot is the web crawler for DuckDuckGo, a search engine that has become quite popular because it is known for privacy and not tracking you. It now handles over 93 million queries per day. DuckDuckGo gets its results from a variety of sources. These include hundreds of vertical sources delivering niche Instant Answers, DuckDuckBot (its crawler), and crowd-sourced sites such as Wikipedia. It also has more traditional links in the search results, which it sources from Yahoo! and Bing.

User-Agent

DuckDuckBot

Full User-Agent string

DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)

It respects WWW::RobotRules and originates from these IP addresses:

  • 72.94.249.34
  • 72.94.249.35
  • 72.94.249.36
  • 72.94.249.37
  • 72.94.249.38
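Since the User-Agent header alone can be forged, one way to use the list above is to also check the source address of a request that claims to be DuckDuckBot. The Python sketch below does this with the standard ipaddress module; the function name is hypothetical, and the addresses are copied from this post, so they may be out of date.

```python
import ipaddress

# Addresses as listed in this post; they may change over time.
DUCKDUCKBOT_IPS = {
    ipaddress.ip_address(ip)
    for ip in (
        "72.94.249.34",
        "72.94.249.35",
        "72.94.249.36",
        "72.94.249.37",
        "72.94.249.38",
    )
}


def is_listed_duckduckbot_ip(remote_addr):
    """Return True if remote_addr is one of the listed DuckDuckBot addresses."""
    try:
        return ipaddress.ip_address(remote_addr) in DUCKDUCKBOT_IPS
    except ValueError:
        # Not a valid IP address at all.
        return False


print(is_listed_duckduckbot_ip("72.94.249.36"))
print(is_listed_duckduckbot_ip("203.0.113.10"))
```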

5. Baiduspider

Baiduspider is the official name of the Chinese Baidu search engine's web crawling spider. It crawls web pages and returns updates to the Baidu index. Baidu is the leading Chinese search engine that takes an 80% share of China Mainland's overall search engine market.

User-Agent

Baiduspider

Full User-Agent string

Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)

Besides its web search crawler, Baidu has 6 additional web crawlers:

Web crawler | User-Agent string
Image Search | Baiduspider-image
Video Search | Baiduspider-video
News Search | Baiduspider-news
Baidu wishlists | Baiduspider-favo
Baidu Union | Baiduspider-cpro
Business Search | Baiduspider-ads
Other search pages | Baiduspider

See Baidu robots.txt documentation.

6. Yandex Bot

YandexBot is the web crawler for Yandex, one of the largest Russian search engines.

User-Agent

YandexBot

Full User-Agent string

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)

There are many different User-Agent strings that the YandexBot can show up as in your server logs.

7. Sogou Spider

Sogou Spider is the web crawler for Sogou.com, a leading Chinese search engine that was launched in 2004.

Note: The Sogou web spider does not respect the robots exclusion standard, and is therefore banned from many websites because of excessive crawling.

User-Agent

Sogou Pic Spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou head spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou Orion spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou-Test-Spider/4.0 (compatible; MSIE 5.5; Windows 98)

8. Exabot

Exabot is a web crawler for Exalead, which is a search engine based out of France. It was founded in 2000 and has more than 16 billion pages indexed.

User-Agent

Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Exabot-Thumbnails)
Mozilla/5.0 (compatible; Exabot/3.0; +http://www.exabot.com/go/robot)

See Exabot robots.txt documentation.

9. Facebook external hit

Facebook allows its users to send links to interesting web content to other Facebook users. Part of how this works on the Facebook system involves the temporary display of certain images or details related to the web content, such as the title of the webpage or the embed tag of a video. The Facebook system retrieves this information only after a user provides a link.

One of their main crawling bots is Facebot, which is designed to help improve advertising performance.

User-Agent

facebot
facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

See Facebot robots.txt documentation.

10. Applebot

Apple uses the web crawler Applebot. Products such as Siri and Spotlight Suggestions, in particular, use it to provide personalized services to their users.

User-Agent

Applebot

Full User-Agent string

Mozilla/5.0 (Device; OS_version) AppleWebKit/WebKit_version (KHTML, like Gecko) Version/Safari_version Safari/WebKit_version (Applebot/Applebot_version)

Other popular web crawlers

Apache Nutch

Apache Nutch is an open-source web crawler written in Java. It is released under the Apache License and is managed by the Apache Software Foundation.

Nutch can run on a single machine, but it is more commonly used in a distributed environment. In fact, Nutch was designed from the ground up to be scalable and easily extensible.

Nutch is very flexible and can be used for various purposes. For example, Nutch can be used to crawl the entire Internet or only specific websites. In addition, Nutch can be configured to index pages in real-time or on a schedule.

One of the main benefits of Apache Nutch is its scalability. Nutch can be easily scaled to accommodate large volumes of data and traffic. For example, a large ecommerce website may use Apache Nutch to crawl and index its product catalog. This would allow customers to search for products on their website using the company's internal search engine.

In addition, Apache Nutch can be used to gather data about websites. Companies could use Apache Nutch to crawl competitor websites and collect information about their products, prices, and contact information. This information could then be used to improve their online presence.

However, Apache Nutch does have some drawbacks. For example, it can be challenging to configure and use. In addition, Apache Nutch is not as widely used as other web crawlers, which means less support is available for it.

Screaming Frog

Screaming Frog SEO Spider is a desktop program (PC or Mac) that crawls websites' links, images, CSS, scripts, and apps from an SEO perspective.

It fetches key onsite elements for SEO, presents them in tabs by types, and allows you to filter for common SEO issues or slice and dice the data how you like by exporting it into Excel.

You can view, analyze and filter the crawl data as it's gathered and extracted in real-time from the simple interface.

The program is free for small sites (up to 500 URLs). Larger sites require a license.

Screaming Frog uses the Chromium WRS to crawl dynamic websites that are rich in JavaScript, such as Angular, React, and Vue.js. WordPress sitemap creation, XPath extraction, and site architecture visualization are other top features.

The platform serves corporations like Apple, Amazon, Disney, and even Google. Screaming Frog is also a popular tool among agency owners and SEOs who manage SEO for multiple clients.

Deepcrawl

Deepcrawl is a cloud-based web crawler that allows users to crawl websites and collect data about their structure, content, and performance.

Deepcrawl provides users with several features and options, including the ability to crawl JavaScript-based websites, customize the crawling process, and generate detailed reports.

One of Deepcrawl's most distinctive features is its ability to crawl websites built with JavaScript. This is possible because Deepcrawl uses a headless browser (e.g. Chrome) to render the website's content before crawling it.

This means that Deepcrawl can crawl and collect data about websites that other crawlers would not always be able to reach.

Beyond flexible APIs, Deepcrawl's data integrates with Google Analytics, Google Search Console, and other popular tools. This allows users to easily compare their website's data with that of their competitors. It also allows them to connect business data (e.g. sales data) with their website's data to get a complete picture of how their website is performing.

Deepcrawl works best for companies with large websites with a lot of content and pages. The platform is less well-suited for small websites or those that do not change very often.

There are three different products that Deepcrawl offers:

  • Automation Hub: This product integrates with your CI/CD pipeline and automatically crawls your website with 200+ SEO QA testing rules.
  • Analytics Hub: This product allows you to surface actionable insights from your website data and improve your website's SEO.
  • Monitoring Hub: This product monitors your website for changes and alerts you when new issues arise.

Businesses use these three products to improve their website's SEO, monitor it for changes, and collaborate with dev teams.

Octoparse

Octoparse is a user-friendly client-based web crawling software that lets you extract data from all over the Internet. The program is particularly developed for people who are not programmers and has a simple point-and-click interface.

With Octoparse, you can run scheduled cloud extractions to extract dynamic data, create workflows to extract data from websites automatically, and use its web scraping API to access data.

Its IP proxy servers let you crawl websites without being blocked, and its built-in Regex feature cleans data automatically.

And with its pre-built scraper templates, you can start extracting data from popular websites like Yelp, Google Maps, Facebook, and Amazon within minutes. You can also build your own scraper if there isn't one readily available for your target websites.

HTTrack

You can use HTTrack, which is free software, to download entire sites to your PC. This open-source tool supports Windows, Linux, and other Unix systems, so it is available to a wide range of users.

HTTrack's website copier lets you download a website to your computer so that you can browse it offline. The program can also be used to mirror websites, meaning that you can create an exact copy of a website on your server.

The program is easy to use and has many features, including the ability to resume interrupted downloads, update existing websites, and create static copies of dynamic websites.

You can get the files, photos, and HTML code from the mirrored website.

While HTTrack can be used to download any type of website, it's particularly useful for keeping a local copy of websites that may later go offline.

HTTrack is a great tool for anyone who wants to download an entire website or mirror a website. However, it should be noted that the program can be used to download illegal copies of websites.

As such, you should only use HTTrack if you have permission from the website owner.

SiteSucker

SiteSucker is a macOS application that downloads websites. It asynchronously copies the site's webpages, images, PDFs, style sheets, and other files to your local hard drive, duplicating the site's directory structure.

You can also use SiteSucker to download specific files from websites, such as MP3 files.

The program can be used to create local copies of websites, making it ideal for offline browsing.

It's also useful for downloading entire sites so you can view them on your computer without an Internet connection.

One drawback to SiteSucker is that it cannot handle JavaScript (though it can handle Flash). Nevertheless, it's still useful for downloading websites to your Mac.

Webz.io

Users can use the Webz.io web application to get real-time data by crawling online sources worldwide into various tidy formats. This web crawler allows you to crawl data and extract keywords in multiple languages based on numerous criteria from a diverse range of sources.

The Archive allows users to access historical data. Users can easily index and search the structured data crawled by Webz.io (formerly Webhose.io) using its intuitive interface/API. You can save the scraped data in JSON, XML, and RSS formats. Plus, Webz.io supports up to 80 languages in its crawling data results.

Webz.io's freemium business model should suffice for businesses with basic crawling requirements. For businesses that need a more robust solution, Webz.io also offers support for media monitoring, cybersecurity threats, risk intelligence, financial analysis, web intelligence, and identity theft protection.

They even support dark web API solutions for business intelligence.

UiPath

UiPath is a Windows application that can be used to automate repetitive tasks. It's beneficial for web scraping, as it can extract data from websites automatically.

The program is easy to use and doesn't require any programming knowledge. It features a visual drag-and-drop interface that makes it easy to create automation scripts.

With UiPath, you can extract tabular and pattern-based data from websites, PDFs, and other sources. The program can also be used to automate tasks such as filling out online forms and downloading files.

The commercial version of the tool provides additional crawling capabilities. When dealing with complicated UIs, this approach is very successful. The Screen Scraping Tool can extract data from tables in both individual words and groups of text, as well as blocks of text such as RSS feeds.

Also, you don't need any programming skills to create intelligent web agents, but if you're a .NET hacker, you'll be able to completely control their data.

Bad bots

While most web crawlers are benign, some can be used for malicious purposes. These malicious web crawlers, or "bots," can be used to steal information, launch attacks, and commit fraud. It has also been increasingly found that these bots ignore robots.txt directives and proceed directly to scan websites.

Some prominent bad bots are listed below:

  • PetalBot
  • SEMrushBot
  • Majestic
  • DotBot
  • AhrefsBot

Protecting your site from malicious web crawlers

To protect your website from bad bots and other threats, you can use a web application firewall (WAF). A WAF is a piece of software that sits between your website and the Internet, filtering traffic before it reaches your site.

A CDN can also help to protect your website from bots. A CDN is a network of servers that deliver content to users based on their geographic location.

When a user requests a page from your website, the CDN will route the request to the server closest to the user's location. This can help to reduce the risk of bots attacking your website, as they will have to target each CDN server individually.

KeyCDN has a great feature that you can enable in your dashboard called Block Bad Bots. KeyCDN uses a comprehensive list of known bad bots and blocks them based on their User-Agent string.

When a new Zone is added, the Block Bad Bots feature is disabled by default. You can enable this setting if you want bad bots to be blocked automatically.
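The general idea of blocking by User-Agent string can be sketched in a few lines of Python. This is not KeyCDN's implementation or its actual list: the tokens below are taken from the short list of bad bots in this post, and the function name and status codes are illustrative choices.

```python
# Lowercase tokens based on the bots named in this post (illustrative only).
BLOCKED_TOKENS = ("petalbot", "semrushbot", "dotbot", "ahrefsbot")


def response_status(user_agent):
    """Return 403 for a blocked User-Agent string, otherwise 200."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in BLOCKED_TOKENS):
        return 403  # Forbidden: the request is rejected
    return 200  # OK: the request is served normally


print(response_status("Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"))
print(response_status("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
```

Matching on substrings keeps the check simple, but a bot that sends a fake User-Agent will slip past it, which is one reason such filtering is usually combined with other measures such as a WAF.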

Bot resources

Perhaps you are seeing some user-agent strings in your logs that have you concerned. Caio Almeida has a pretty good list in his crawler-user-agents GitHub project.

Summary

There are hundreds of different web crawlers out there, but hopefully, you are now familiar with some of the more popular ones. Again, be careful when blocking any of these, as doing so could cause indexing issues. It is always good to check your web server logs to see how often they are crawling your site.

What's your favorite web crawler? Let us know in the comments below.
