The Botify crawler's primary objective is to analyze your website as search engines see it - or as they would, if they could devote unlimited time to exploring your site. It's also a powerful ally in a number of other situations, whenever you need some ‘real world' testing: automated redirects, special treatment based on the user's characteristics, or simply a site under construction. You can now customize the Botify crawler's user-agent: it will introduce itself to the web server wearing the hat you have chosen for it, instead of that of the Botify robot. This makes virtually any testing scenario possible. A custom user-agent will allow you to:
1) Crawl the mobile version of a website which redirects users based on their user-agent
2) Analyze a website in a pre-production environment, before it goes live
3) Be treated as Google, when Googlebot receives special treatment
4) Transmit parameters to conduct specific tests: performance testing, user-language testing, or any other test
1) Crawl the mobile version of a website
If your website redirects to its mobile version based on user-agent information, you can crawl your site using a mobile user's user-agent (that of an iPhone, for instance) to check that redirects are triggered as planned. You will also be able to check the proportion of page-to-page redirects versus bulk redirects - the former are necessary for mobile pages to perform well in search engines; the latter should be avoided. This shows in the Botify report, which provides the number of incoming redirects per crawled URL. You can also crawl a second time using a Googlebot Mobile user-agent, to check that the mobile bot is redirected the same way and hence has the same vision of things as mobile internet users. This is a requirement for mobile pages to rank in search engines.
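Outside of Botify, you can spot-check this redirect behavior with a few lines of Python. The sketch below is illustrative only: the URL is hypothetical and the user-agent strings are examples, not the ones Botify uses. It requests the same page with an iPhone user-agent and a Googlebot Mobile user-agent and compares where each one is redirected.

```python
import requests

# Illustrative user-agent strings - check your logs for the ones your audience
# and Google actually send.
IPHONE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1"
)
GOOGLEBOT_MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

def redirect_target(url, user_agent):
    """Return the status code and redirect target seen by a given user-agent."""
    response = requests.get(url, headers={"User-Agent": user_agent}, allow_redirects=False)
    return response.status_code, response.headers.get("Location")

url = "https://www.example.com/some-page"  # hypothetical desktop URL
print("iPhone user:     ", redirect_target(url, IPHONE_UA))
print("Googlebot Mobile:", redirect_target(url, GOOGLEBOT_MOBILE_UA))
# A page-to-page redirect points both to the mobile equivalent of this page;
# a bulk redirect (to be avoided) sends everything to the mobile homepage.
```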
2) Analyze a website in a pre-production environment
Being able to analyze a version of your website in a pre-production environment is of great value, not only for search engine optimization, but also for change management:
- For validation purposes: to check a number of criteria before going live,
- For decision support purposes: to measure the impact of changes on the website's structure - or to compare two changes. For instance: the impact of navigation (such as a new menu or a new block of links to related content) on website depth and internal linking; or the proportion of a certain type of content (such as user comments) within the website's overall volume.
How:
- The website's robots.txt file disallows everything to all user-agents except Botify, as the Botify crawler will continue to follow the rules defined for it, even when it presents itself as another user-agent.
- The Botify crawl is started with a custom user-agent that is known only to people participating in the project - this could be an existing user-agent used internally by the website's technical team and systematically white-listed.
- The website returns an HTTP 403 status code (Forbidden) to all user-agents except the one used for the test crawl (a quick way to verify this setup is sketched below).
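Before launching the analysis, it is worth confirming that this lockdown behaves as intended. Here is a minimal sketch, assuming a hypothetical pre-production host and a made-up custom user-agent; adapt both to your own setup.

```python
import requests

STAGING_URL = "https://preprod.example.com/"     # hypothetical pre-production host
SECRET_UA = "InternalAudit/1.0 (project-test)"   # hypothetical custom user-agent shared with the team

def status_for(user_agent):
    """Return the HTTP status code the staging server gives this user-agent."""
    response = requests.get(STAGING_URL, headers={"User-Agent": user_agent}, allow_redirects=False)
    return response.status_code

# A generic browser user-agent should be locked out ...
assert status_for("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") == 403
# ... while the agreed custom user-agent gets through.
assert status_for(SECRET_UA) == 200
print("Pre-production lockdown behaves as expected.")
```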
3) Be treated as Google, when Googlebot receives special treatment
When Google's bots are not treated the same way as other user-agents, we may want to crawl 'as Googlebot' to get a result that is in line with what the search engine sees. This might be the case for different reasons, without implying the site is using cloaking (which would mean that search engines are shown different content than that shown to users - something which, depending on the nature of the differences, might be considered deceptive and might be sanctioned by Google if considered abusive). For instance, performance may have been optimized by eliminating tasks that are not applicable to search engines (such as creating a user session). In this case, we'll want performance analysis results to match Google's actual experience of the website.
How: by using a user-agent built from Googlebot's user-agent, with an additional character string that is specific to the project's website. As a result:
- The ‘Googlebot' part of the user-agent will be detected and will trigger the special treatment, as expected
- The additional element will make it easy to distinguish the real Googlebot crawl (from Google) from the fake Googlebot crawl (from the Botify analysis) in the server logs.
Avoid ‘polluting' log files with fake data! The second point is key: without this additional element, log file analysis could be skewed, as some Googlebot crawls could be taken into account even though they weren't actually from Google. That's not all: people who manage and analyze log files need to know there are lines with a ‘Google-like' user-agent that must be removed before performing any analysis. That's precisely why Botify needs to validate any custom user-agent that includes one of the top search engine bot names (Googlebot, Bingbot, Yandex, Baidu, Yahoo's Slurp).
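As an illustration, the sketch below builds such a user-agent and shows how a log analyst could separate genuine Googlebot hits from the audit crawl; the token, user-agent and log lines are made up for the example.

```python
# Googlebot's published desktop user-agent, with a project-specific token appended.
# The token ("site-audit-2024") is purely illustrative - agree on one with whoever
# analyzes the server logs.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
AUDIT_TOKEN = "site-audit-2024"
CUSTOM_UA = f"{GOOGLEBOT_UA} {AUDIT_TOKEN}"

def is_real_googlebot_hit(log_line):
    """Keep log lines from Googlebot, but drop those produced by the audit crawl."""
    return "Googlebot" in log_line and AUDIT_TOKEN not in log_line

log_lines = [
    '66.249.66.1 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    f'203.0.113.7 - - "GET / HTTP/1.1" 200 "{CUSTOM_UA}"',
]
real_hits = [line for line in log_lines if is_real_googlebot_hit(line)]
print(len(real_hits), "genuine Googlebot hit(s)")  # -> 1
```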
4) Transmit parameters to conduct specific tests
Using a custom user-agent, any test is possible. You can add to the user-agent any element that the web server can detect and use to trigger a specific treatment. Parameters could apply to technical or functional elements, such as:
- Automated authorizations for restricted-access content (as if a user were logged in, for instance)
- Responses served with or without cache functionality
- Responses served with or without specific performance optimizations
- Language information included in the user-agent, to test redirections of users to the correct local version of the website (which, for users, would normally be done according to the ‘Accept-Language' field of the HTTP header) - a brief sketch of this case follows the list
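As an example of the last item, the snippet below sends a user-agent carrying a made-up language marker and checks where the server redirects it. The marker convention, user-agent and URLs are hypothetical: they only work if the web server has been configured to recognize them.

```python
import requests

BASE_UA = "SiteTestCrawler/1.0 (+https://www.example.com/crawler-info)"  # hypothetical

def localized_redirect(url, language):
    """Request a page with a language marker in the user-agent and return the redirect target."""
    user_agent = f"{BASE_UA} lang={language}"  # made-up convention the server would have to parse
    response = requests.get(url, headers={"User-Agent": user_agent}, allow_redirects=False)
    return response.headers.get("Location")

# Each language marker should land on the matching local version of the site.
print(localized_redirect("https://www.example.com/", "de"))  # e.g. https://www.example.com/de/
print(localized_redirect("https://www.example.com/", "fr"))  # e.g. https://www.example.com/fr/
```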
Internet politeness
We're talking about manipulating user-agents, which are a sort of business card on the Internet. But a crawler's behavior and speed have nothing to do with what can be expected from an Internet user. That's why Internet politeness rules suggest including a link in the user-agent, so that the owner or manager of a website can quickly contact someone who has control over the crawler. We strongly advise following this politeness rule with custom user-agents. As for crawl speed, the Botify crawler does everything in its power to avoid straining the website it is crawling: it adjusts its crawl rate not only according to the configured speed, but also according to the website's response time, which can indicate strain.
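Coming back to the user-agent itself: a polite custom user-agent keeps a contact link even when it wears another ‘hat'. The crawler name and URL below are illustrative placeholders.

```python
# A custom user-agent that still tells webmasters who is crawling and how to reach you.
POLITE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "                  # the 'hat' chosen for the test
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 "
    "(compatible; AcmeSiteAudit/1.0; +https://www.example.com/crawler-info)"   # who to contact
)
print(POLITE_UA)
```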