Configuration & API Reference

Comprehensive documentation for configuring, deploying, and managing crawler directives using the Robots.txt platform.

Overview

The Robots.txt platform provides an intelligent layer for managing robots.txt files, sitemaps, and crawler directives across your entire web infrastructure. Instead of manually editing static files, our platform dynamically generates, validates, and deploys optimized crawl rules based on your content strategy and security requirements.

â„šī¸ Note This documentation covers the v2.x API and configuration schema. Legacy v1 endpoints are deprecated and will be sunset on Dec 31, 2025.

Directives Reference

The platform supports all standard IETF crawler directives alongside platform-specific extensions for advanced control.

Directive          Type      Description
User-agent         Required  Specifies the crawler or bot name.
Allow              Optional  Permits crawling of specified path/pattern.
Disallow           Optional  Blocks crawling of specified path/pattern.
Crawl-delay        Platform  Minimum seconds between requests (honored by compliant bots).
Sitemap            Optional  Full URL to XML sitemap.
Max-image-preview  Google    Controls image preview size in search results.

Syntax & Rules

Configuration follows a strict YAML/JSON schema that compiles to standard robots.txt output. Wildcards, regex patterns, and conditional directives are fully supported.

config.yaml
defaults:
  crawl-delay: 2
  sitemap: https://api.robots.io/v2/sitemap.xml

rules:
  # Values that start with * must be quoted; a bare * begins a YAML alias.
  - agent: "*"
    allow:
      - /
      - /blog/
      - /products/
    disallow:
      - /api/
      - /admin/
      - /private/*
      - "*.pdf"

  - agent: Googlebot
    allow: /
    max-image-preview: large
    crawl-delay: 0
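
To illustrate how a configuration like the one above could compile to robots.txt output, here is a minimal Python sketch. It is not the platform's compiler: the names compile_robots, as_list, and FIELD_NAMES are hypothetical, the config is assumed to be already parsed into a dict, and letting a per-rule crawl-delay override the default is an assumption. The max-image-preview key is left out because its output form is not described here.

```python
from typing import Any

# Hypothetical mapping from config keys to robots.txt field names.
FIELD_NAMES = {"allow": "Allow", "disallow": "Disallow"}


def as_list(value: Any) -> list:
    """Accept either a single scalar (allow: /) or a list of values."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def compile_robots(config: dict) -> str:
    """Render a parsed config dict as robots.txt text (illustrative only)."""
    defaults = config.get("defaults", {})
    groups = []
    for rule in config.get("rules", []):
        lines = [f"User-agent: {rule['agent']}"]
        for key in ("allow", "disallow"):
            lines += [f"{FIELD_NAMES[key]}: {path}" for path in as_list(rule.get(key))]
        # Assumption: a per-rule crawl-delay overrides the default value.
        delay = rule.get("crawl-delay", defaults.get("crawl-delay"))
        if delay is not None:
            lines.append(f"Crawl-delay: {delay}")
        groups.append("\n".join(lines))
    text = "\n\n".join(groups)
    if "sitemap" in defaults:
        text += f"\n\nSitemap: {defaults['sitemap']}"
    return text + "\n"
```

Calling compile_robots on the parsed form of the example would yield one group per agent, separated by blank lines, followed by a single Sitemap line.
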
âš ī¸ Rule Precedence More specific rules override broader ones. Allow takes precedence over Disallow for overlapping paths when following Google's implementation guidelines.

REST API

Manage configurations programmatically via our REST API. All requests require authentication via API Key or OAuth 2.0.

Generate Configuration

POST /v2/config/generate
# Request
curl -X POST https://api.robots.io/v2/config/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
  "domain": "example.com",
  "strategy": "aggressive",
  "exclude_paths": ["/api/v1/*", "/staging/"]
}'
Response (200 OK)
{
  "status": "success",
  "config_id": "cfg_8x9k2m4p",
  "output": "User-agent: *\nDisallow: /api/v1/\nDisallow: /staging/\nAllow: /\nCrawl-delay: 1",
  "validation": {
    "errors": [],
    "warnings": ["Crawl-delay ignored by Googlebot"]
  }
}
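
The same call can be made from Python using only the standard library. The sketch below builds the request and checks a response shaped like the sample above; build_generate_request and check_response are hypothetical helper names, and treating a non-empty errors list as a failure is an assumption about how the validation block should be read.

```python
import json
import urllib.request

API_URL = "https://api.robots.io/v2/config/generate"


def build_generate_request(api_key, domain, strategy, exclude_paths):
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"domain": domain, "strategy": strategy, "exclude_paths": exclude_paths}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def check_response(body):
    """Return the generated robots.txt text, raising if generation failed."""
    result = json.loads(body)
    errors = result["validation"]["errors"]
    if result["status"] != "success" or errors:
        raise ValueError(f"generation failed: {errors}")
    for warning in result["validation"]["warnings"]:
        print(f"warning: {warning}")
    return result["output"]
```

To send the request, pass the object to urllib.request.urlopen and hand the decoded response body to check_response.
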

Best Practices

  • Avoid over-disallowing: Blocking essential assets (CSS/JS) can impact rendering and indexing quality.
  • Test before deploying: Use the Validation API or the robots.txt report in Google Search Console before pushing to production.
  • Version control: Treat configurations as code. Enable Git sync for audit trails and rollback capabilities.
  • Monitor crawler behavior: Leverage the dashboard's real-time bot analytics to adjust Crawl-delay thresholds dynamically.
✅ Pro Tip: Enable "Adaptive Throttling" in the dashboard to automatically adjust crawl delays based on server load and bot reputation scores.
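
As a local complement to the first two practices, generated output can be checked for accidentally blocked assets with Python's standard urllib.robotparser. This is a rough pre-flight sketch, not a substitute for the Validation API: blocked_assets is a hypothetical helper, and the standard-library parser applies rules in file order and does not expand * or $, so its verdicts can differ from a longest-match evaluator.

```python
import urllib.robotparser


def blocked_assets(robots_txt, asset_urls, agent="*"):
    """Return the asset URLs that the given robots.txt text blocks for agent."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(agent, url)]


robots = "User-agent: *\nDisallow: /api/v1/\nDisallow: /staging/\nAllow: /\nCrawl-delay: 1"
assets = [
    "https://example.com/assets/app.css",
    "https://example.com/staging/app.js",
]
print(blocked_assets(robots, assets))
```

With the sample output from the Generate Configuration response, only the asset under /staging/ is reported as blocked.
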

FAQ

Does this replace my existing robots.txt file?

Yes. Once deployed, our platform serves the dynamically generated robots.txt at the root path. We automatically handle caching, CDN propagation, and bot-specific variants.

How are wildcard patterns evaluated?

Patterns follow the Robots Exclusion Protocol (RFC 9309, formerly an IETF draft) with platform-enhanced regex support. * matches any sequence of characters, including none, and $ anchors the pattern to the end of the URL path.
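
The two standard metacharacters can be modeled by translating a pattern into a regular expression. The Python sketch below covers only * and a trailing $; it does not attempt the platform-enhanced regex support, and pattern_to_regex and matches are hypothetical names.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces and join them with ".*" for each "*".
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """True if the pattern matches the path, anchored at the start of the path."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/private/*", "/private/reports/q1"))  # True
print(matches("/*.pdf$", "/docs/guide.pdf"))         # True
print(matches("/*.pdf$", "/docs/guide.pdf?page=2"))  # False
```

Without the trailing $, a pattern behaves as a prefix match, so /api/ matches /api/v1 but not /blog/api/.
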

Can I roll back a configuration?

All deployments are versioned. Use the UI or POST /v2/config/rollback to revert to any previous snapshot within the 90-day retention window.