Robots.txt Platform Documentation

Overview

The Robots.txt platform provides an intelligent layer for managing robots.txt files, sitemaps, and crawler directives across your entire web infrastructure. Instead of manually editing static files, our platform dynamically generates, validates, and deploys optimized crawl rules based on your content strategy and security requirements.

ℹ️ Note This documentation covers the v2.x API and configuration schema. Legacy v1 endpoints are deprecated and will be sunset on Dec 31, 2025.

Directives Reference

The platform supports the directives standardized in RFC 9309 (`User-agent`, `Allow`, `Disallow`), widely used extensions such as `Sitemap` and `Crawl-delay`, and platform-specific extensions for advanced control.

Directive	Type	Description
`User-agent`	Required	Specifies the crawler or bot name.
`Allow`	Optional	Permits crawling of specified path/pattern.
`Disallow`	Optional	Blocks crawling of specified path/pattern.
`Crawl-delay`	Platform	Minimum seconds between requests (non-standard; honored by some crawlers such as Bingbot, ignored by Googlebot).
`Sitemap`	Optional	Full URL to XML sitemap.
`Max-image-preview`	Google	Controls image preview size in search results. Google defines this as a robots meta tag / `X-Robots-Tag` rule rather than a robots.txt line.

Syntax & Rules

Configuration follows a strict schema, written in YAML or JSON, that compiles to standard robots.txt output. Wildcards, regex patterns, and conditional directives are supported.

config.yaml

defaults:
  crawl-delay: 2
  sitemap: https://api.robots.io/v2/sitemap.xml

rules:
  # Quote values that start with "*": a bare * begins a YAML alias.
  - agent: "*"
    allow:
      - /
      - /blog/
      - /products/
    disallow:
      - /api/
      - /admin/
      - /private/*
      - "*.pdf"

  - agent: Googlebot
    allow: /
    max-image-preview: large
    crawl-delay: 0
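
How such a configuration compiles to robots.txt output can be illustrated with a minimal Python sketch. It is not the platform's actual compiler: the function name `compile_robots`, the choice to treat a `crawl-delay` of 0 as "emit no line", and the choice to skip `max-image-preview` are all assumptions made for illustration.

```python
def compile_robots(config: dict) -> str:
    """Render a parsed config (same shape as config.yaml) as robots.txt text."""
    defaults = config.get("defaults", {})
    lines = []
    for rule in config.get("rules", []):
        lines.append(f"User-agent: {rule['agent']}")
        for key, directive in (("allow", "Allow"), ("disallow", "Disallow")):
            paths = rule.get(key, [])
            if isinstance(paths, str):  # accept a single path or a list
                paths = [paths]
            lines.extend(f"{directive}: {p}" for p in paths)
        # Assumption: a per-rule crawl-delay overrides the default, and 0
        # (or a missing value) means no Crawl-delay line is emitted.
        delay = rule.get("crawl-delay", defaults.get("crawl-delay"))
        if delay:
            lines.append(f"Crawl-delay: {delay}")
        lines.append("")  # blank line separates user-agent groups
    if "sitemap" in defaults:
        lines.append(f"Sitemap: {defaults['sitemap']}")
    return "\n".join(lines).rstrip("\n") + "\n"


config = {
    "defaults": {"crawl-delay": 2,
                 "sitemap": "https://api.robots.io/v2/sitemap.xml"},
    "rules": [
        {"agent": "*", "allow": ["/"], "disallow": ["/api/", "/admin/"]},
        {"agent": "Googlebot", "allow": "/", "crawl-delay": 0},
    ],
}
print(compile_robots(config))
```

Keeping the renderer a pure function of the parsed config makes it easy to diff the generated output between configuration versions.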

⚠️ Rule Precedence The most specific matching rule, measured by path length, overrides broader ones. When an Allow and a Disallow rule match with equal specificity, Allow takes precedence, as in Google's implementation and RFC 9309.
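
The longest-match rule can be sketched in a few lines of Python. This is an illustrative model, not the platform's evaluator: `is_allowed` is a hypothetical helper, and it handles plain path prefixes only (no `*` or `$`).

```python
def is_allowed(path: str, allow: list[str], disallow: list[str]) -> bool:
    """Longest matching prefix wins; Allow wins ties; no match means allowed."""
    best_len, allowed = -1, True
    for rules, verdict in ((disallow, False), (allow, True)):
        for prefix in rules:
            if not path.startswith(prefix):
                continue
            # A strictly longer prefix beats a shorter one; on equal
            # length an Allow rule overrides a Disallow rule.
            if len(prefix) > best_len or (len(prefix) == best_len and verdict):
                best_len, allowed = len(prefix), verdict
    return allowed


allow = ["/", "/private/public/"]
disallow = ["/private/"]
print(is_allowed("/blog/post", allow, disallow))
print(is_allowed("/private/x", allow, disallow))
print(is_allowed("/private/public/page", allow, disallow))
```

With these rules, `/private/x` is blocked because `/private/` is longer than `/`, while `/private/public/page` is crawlable because the longer Allow prefix wins.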

REST API

Manage configurations programmatically via our REST API. All requests require authentication via API Key or OAuth 2.0.

Generate Configuration

POST /v2/config/generate

# Request
curl -X POST https://api.robots.io/v2/config/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "domain": "example.com",
  "strategy": "aggressive",
  "exclude_paths": ["/api/v1/*", "/staging/"]
}'

Response (200 OK)

{
  "status": "success",
  "config_id": "cfg_8x9k2m4p",
  "output": "User-agent: *\nDisallow: /api/v1/\nDisallow: /staging/\nAllow: /\nCrawl-delay: 1",
  "validation": {
    "errors": [],
    "warnings": ["Crawl-delay ignored by Googlebot"]
  }
}
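
The same call can be made from Python with only the standard library. The sketch below mirrors the request and response shown above; the helper names `build_generate_request` and `extract_output` are hypothetical, and treating any non-"success" status or non-empty `errors` list as a failure is an assumption, since only the success response is documented here.

```python
import json
import urllib.request

API_URL = "https://api.robots.io/v2/config/generate"


def build_generate_request(api_key: str, domain: str, strategy: str,
                           exclude_paths: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST request for /v2/config/generate."""
    body = json.dumps({"domain": domain, "strategy": strategy,
                       "exclude_paths": exclude_paths}).encode("utf-8")
    return urllib.request.Request(
        API_URL, data=body, method="POST",
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"})


def extract_output(response_body: str) -> str:
    """Return the generated robots.txt; raise if validation reported errors."""
    payload = json.loads(response_body)
    errors = payload["validation"]["errors"]
    # Assumption: anything other than status "success" with no errors fails.
    if payload["status"] != "success" or errors:
        raise ValueError(f"generation failed: {errors}")
    for warning in payload["validation"]["warnings"]:
        print(f"warning: {warning}")
    return payload["output"]


req = build_generate_request("YOUR_API_KEY", "example.com", "aggressive",
                             ["/api/v1/*", "/staging/"])
# To send it:
#   with urllib.request.urlopen(req) as resp:
#       robots_txt = extract_output(resp.read().decode("utf-8"))
```

Surfacing `validation.warnings` rather than discarding them keeps notices such as "Crawl-delay ignored by Googlebot" visible in deployment logs.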

Best Practices

Avoid over-disallowing: Blocking essential assets (CSS/JS) can impact rendering and indexing quality.
Test before deploying: Use the Validation API or the robots.txt report in Google Search Console before pushing to production.
Version control: Treat configurations as code. Enable Git sync for audit trails and rollback capabilities.
Monitor crawler behavior: Leverage the dashboard's real-time bot analytics to adjust Crawl-delay thresholds dynamically.

✅ Pro Tip Enable "Adaptive Throttling" in the dashboard to automatically adjust crawl delays based on server load and bot reputation scores.

FAQ

Does this replace my existing robots.txt file?

Yes. Once deployed, our platform serves the dynamically generated robots.txt at the root path. We automatically handle caching, CDN propagation, and bot-specific variants.

How are wildcard patterns evaluated?

Patterns follow RFC 9309 (the Robots Exclusion Protocol, formerly an IETF draft) with platform-enhanced regex support. `*` matches any sequence of characters, including none, and `$` anchors the pattern to the end of the URL path.
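
The two standard metacharacters can be modeled by translating a pattern into a regular expression. The Python sketch below covers only `*` and `$`; it does not attempt the platform-enhanced regex support, and the helper names `pattern_to_regex` and `matches` are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex matched from the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without '$' the pattern is a prefix match; with it, the path must end here.
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/private/*", "/private/a/b"))
print(matches("/*.pdf$", "/docs/file.pdf"))
print(matches("/*.pdf$", "/docs/file.pdf?x=1"))
```

The last call shows why `$` matters: without it, `/*.pdf` would also match `/docs/file.pdf?x=1`, because unanchored patterns are prefix matches.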

Can I roll back a configuration?

All deployments are versioned. Use the UI or POST /v2/config/rollback to revert to any previous snapshot within the 90-day retention window.
