YouBot: You.com's Web Crawler

YouBot is the web crawler that powers the You.com search engine. It automatically discovers and indexes web pages to provide real-time, accurate search results for You.com users.

Overview

YouBot is designed to crawl the web efficiently and respectfully, following industry standards and best practices. It respects robots.txt directives and crawl rate preferences to ensure minimal impact on web servers while maintaining comprehensive coverage of the web.

User Agent

YouBot identifies itself with the following user agent string:

Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; YouBot/1.0; +https://docs.you.com/youbot; env:prod) Chrome/X.X.X.X Safari/537.36

Note: X.X.X.X represents the Google Chrome version number.

The user agent includes:

Compatible identifier: YouBot/1.0
Documentation URL: https://docs.you.com/youbot
Environment: Production (env:prod)
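
As a first-pass filter, a server can check whether a request claims to be YouBot by looking for the product token in the User-Agent header. The following is a minimal sketch, not an official parser: the function name parse_youbot_user_agent is hypothetical, and the Chrome version in the usage line is a placeholder for the X.X.X.X part of the documented string. Because the header can be spoofed, a match only means the request claims to be YouBot.

```python
import re

# Product token and environment field as they appear in the documented string.
VERSION_PATTERN = re.compile(r"\bYouBot/(\d+(?:\.\d+)*)")
ENV_PATTERN = re.compile(r"\benv:(\w+)")


def parse_youbot_user_agent(user_agent: str):
    """Return the claimed version and environment, or None if the header
    does not contain the YouBot product token (hypothetical helper)."""
    version = VERSION_PATTERN.search(user_agent)
    if version is None:
        return None
    env = ENV_PATTERN.search(user_agent)
    return {
        "version": version.group(1),
        "env": env.group(1) if env else None,
    }


# Usage: Chrome/120.0.0.0 is a placeholder, not a documented value.
sample = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
    "YouBot/1.0; +https://docs.you.com/youbot; env:prod) "
    "Chrome/120.0.0.0 Safari/537.36"
)
print(parse_youbot_user_agent(sample))
```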

Verifying YouBot

Since user agent strings can be spoofed, You.com supports three ways to verify that requests are genuinely from YouBot: cryptographic signatures (recommended), reverse DNS checks, and IP range validation.

Cloudflare Web Bot Auth

YouBot uses Cloudflare’s Web Bot Auth standard for authentication. This cryptographic verification ensures that requests claiming to be from YouBot are legitimate.

To verify YouBot requests:

Check the HTTP Message Signatures in the request headers

Retrieve the public keys from You.com’s well-known directory:

https://you.com/.well-known/http-message-signatures-directory

Validate the signature using the public keys provided

The public keys are in JSON Web Key (JWK) format and use the Ed25519 cryptographic algorithm.
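
The key-handling part of these steps can be sketched with the Python standard library alone. The sketch below assumes the directory is served as a JSON Web Key Set (an object with a "keys" array) and that each Ed25519 key uses the usual OKP fields (kty, crv, x); the function name ed25519_public_keys is hypothetical. It only extracts the raw public keys. Checking the signature itself requires an Ed25519 implementation, which the standard library does not provide, so that step is left out.

```python
import base64
import json


def ed25519_public_keys(directory_json: str) -> dict:
    """Map each key id (or list index when no kid is present) to the raw
    32-byte Ed25519 public key found in a JWKS document (assumed layout)."""
    keys = {}
    document = json.loads(directory_json)
    for index, jwk in enumerate(document.get("keys", [])):
        if jwk.get("kty") != "OKP" or jwk.get("crv") != "Ed25519":
            continue  # skip key types this sketch does not handle
        x = jwk["x"]
        # JWK values are base64url without padding; restore it before decoding.
        raw = base64.urlsafe_b64decode(x + "=" * (-len(x) % 4))
        if len(raw) != 32:
            raise ValueError("an Ed25519 public key must be 32 bytes")
        keys[jwk.get("kid", str(index))] = raw
    return keys
```

In practice the JSON text would be downloaded from the well-known URL above and cached, and the resulting key bytes would be passed to whichever signature library the server already uses.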

Reverse DNS Lookup

You can confirm crawler identity by resolving the client IP to a hostname and checking that it matches the expected pattern. For example, a reverse DNS lookup (dig -x) can return a crawler-specific hostname, and a forward lookup on that hostname should resolve back to the same IP.

$ dig -x 68.67.112.106 +short
youbot-68-67-112-106.search.you.com.
$ host youbot-68-67-112-106.search.you.com
youbot-68-67-112-106.search.you.com has address 68.67.112.106

For YouBot, hostnames use the form youbot-{octets-with-hyphens}.search.you.com, where {octets-with-hyphens} matches the connecting IP (for example, 68.67.112.106 becomes youbot-68-67-112-106).
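
The reverse-then-forward check described above can be automated. The sketch below is one possible approach, not an official client: the function names are hypothetical, it handles IPv4 only, and the two lookup functions are passed in as parameters so the logic can be exercised without live DNS.

```python
import ipaddress
import socket


def expected_hostname(ip: str) -> str:
    """Build the documented hostname for an IPv4 address."""
    address = ipaddress.IPv4Address(ip)  # raises ValueError if not IPv4
    return "youbot-" + str(address).replace(".", "-") + ".search.you.com"


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> list:
    """A lookup: hostname to its list of IPv4 addresses."""
    return socket.gethostbyname_ex(hostname)[2]


def is_youbot_by_dns(ip: str, reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """Return True only if the PTR record matches the documented pattern
    and the forward lookup resolves back to the same IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
        if hostname != expected_hostname(ip):
            return False
        return ip in forward(hostname)
    except (OSError, ValueError):
        # DNS failures and malformed addresses are treated as "not verified".
        return False
```

Calling is_youbot_by_dns("68.67.112.106") with the default arguments performs real DNS queries, so results should be cached rather than recomputed on every request.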

IP Range Check

Legitimate YouBot requests originate from 68.67.112.0/24.
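
A range check like this takes only a few lines with Python's ipaddress module. This is a minimal sketch; the constant and function names are hypothetical, and the network value is copied from the range stated above, so it would need updating if the published range changes.

```python
import ipaddress

# Range taken from this page; adjust if the published range changes.
YOUBOT_NETWORK = ipaddress.ip_network("68.67.112.0/24")


def is_youbot_ip(ip: str) -> bool:
    """Return True if the address falls inside the documented YouBot range."""
    try:
        return ipaddress.ip_address(ip) in YOUBOT_NETWORK
    except ValueError:
        return False  # not a valid IP address at all


print(is_youbot_ip("68.67.112.106"))
print(is_youbot_ip("68.67.113.1"))
```

When the server sits behind a proxy or CDN, the address to test is the original client IP, not the address of the proxy.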

Crawl Rate and Server Load

YouBot is designed to crawl efficiently without overwhelming web servers. The crawler:

Respects robots.txt directives
Implements adaptive crawl rate limiting
Distributes requests across multiple IP addresses
Honors crawl-delay directives

If you notice excessive crawl activity from YouBot, please contact us at spider@you.com.

Controlling YouBot Access

Using robots.txt

You can control YouBot’s access to your site with a standard robots.txt file. The three groups below are alternatives, so use only the one that matches your needs:

# Block YouBot from entire site
User-agent: YouBot
Disallow: /

# Block YouBot from specific directories
User-agent: YouBot
Disallow: /private/
Disallow: /admin/

# Allow YouBot with crawl delay
User-agent: YouBot
Crawl-delay: 10
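
Before deploying rules like these, it can help to test them locally. The sketch below uses Python's urllib.robotparser on a combined version of the second and third examples; the load_rules helper and the example.com URLs are hypothetical. Python's parser is only an approximation of how a crawler reads the file, so treat the result as a sanity check, not a guarantee of YouBot's behavior.

```python
from urllib.robotparser import RobotFileParser

# Combined version of the directory-blocking and crawl-delay examples.
ROBOTS_TXT = """\
User-agent: YouBot
Disallow: /private/
Disallow: /admin/
Crawl-delay: 10
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
print(rules.can_fetch("YouBot", "https://example.com/private/report.html"))  # False
print(rules.can_fetch("YouBot", "https://example.com/blog/"))  # True
print(rules.crawl_delay("YouBot"))  # 10
```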

HTTP Status Codes

YouBot respects standard HTTP status codes:

200 OK: Page is crawled and indexed
301/302: Redirects are followed
404 Not Found: Page is removed from index
429 Too Many Requests: Crawl rate is reduced
503 Service Unavailable: Crawling is temporarily paused
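
Site owners can use these responses to steer the crawler. The following sketch shows one way a server might choose a status code for a crawler request. The function name, its parameters, and the idea of a per-window request counter are all assumptions for illustration; how the counter is maintained is left to the server.

```python
from http import HTTPStatus


def status_for_crawler(in_maintenance: bool, recent_requests: int, max_requests: int) -> HTTPStatus:
    """Pick a response status for a crawler request (hypothetical policy).

    in_maintenance:  the site is temporarily unable to serve pages
    recent_requests: crawler requests seen in the current time window
    max_requests:    the most requests the site wants to serve per window
    """
    if in_maintenance:
        return HTTPStatus.SERVICE_UNAVAILABLE  # 503: ask the crawler to pause
    if recent_requests > max_requests:
        return HTTPStatus.TOO_MANY_REQUESTS  # 429: ask the crawler to slow down
    return HTTPStatus.OK  # 200: serve the page normally


print(int(status_for_crawler(False, 120, 60)))
```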

Technical Properties

Supported Protocols

HTTP/1.1
HTTP/2
HTTPS (TLS 1.2 and above)

Supported Content Types

YouBot crawls and indexes various content types including:

HTML pages
PDF documents
Plain text files
XML and RSS feeds
Structured data (JSON-LD, microdata, RDFa)

Content Encodings

YouBot supports standard content encodings:

gzip
deflate
Brotli (br)

Contact and Support

For questions, concerns, or issues related to YouBot’s crawling activity:

Email: spider@you.com

Common reasons to contact us:

Reporting excessive crawl rates
Requesting crawl adjustments
Reporting technical issues
Discussing custom crawl requirements for large sites

Frequently Asked Questions

Why is YouBot crawling my site?

YouBot crawls publicly accessible web pages to provide comprehensive, real-time search results for You.com users. If your content is public and not blocked by robots.txt, it may be crawled and indexed.

How often does YouBot crawl my site?

Crawl frequency depends on factors like:

How often your content changes
Your site’s popularity and authority
Your server’s response time
Any crawl-delay directives in robots.txt

Can I request a recrawl of my content?

Yes. For immediate indexing needs or custom crawl requests, email spider@you.com.

Does YouBot respect robots.txt?

Yes, YouBot fully respects robots.txt directives, including user-agent specific rules and crawl-delay settings.

How do I report a problem with YouBot?

Email spider@you.com and include the following:

Your domain name
Timestamps of problematic requests
Description of the issue
Server logs (if applicable)