Configuration & API Reference
Comprehensive documentation for configuring, deploying, and managing crawler directives using the Robots.txt platform.
Overview
The Robots.txt platform provides an intelligent layer for managing robots.txt files, sitemaps, and crawler directives across your entire web infrastructure. Instead of manually editing static files, our platform dynamically generates, validates, and deploys optimized crawl rules based on your content strategy and security requirements.
Directives Reference
The platform supports the directives defined in RFC 9309 (User-agent, Allow, Disallow), the widely recognized Sitemap directive, and platform-specific extensions for advanced control.
| Directive | Type | Description |
|---|---|---|
| User-agent | Required | Specifies the crawler or bot name. |
| Allow | Optional | Permits crawling of the specified path or pattern. |
| Disallow | Optional | Blocks crawling of the specified path or pattern. |
| Crawl-delay | Platform | Minimum seconds between requests. Honored by some crawlers; ignored by Googlebot. |
| Sitemap | Optional | Full URL to an XML sitemap. |
| Max-image-preview | Platform | Controls image preview size in search results. |
Syntax & Rules
Configuration follows a strict YAML/JSON schema that compiles to standard robots.txt output. Wildcards, regex patterns, and conditional directives are fully supported.
defaults:
  crawl-delay: 2
  sitemap: https://api.robots.io/v2/sitemap.xml
rules:
  # Values beginning with * are quoted; a bare * starts a YAML alias.
  - agent: "*"
    allow:
      - /
      - /blog/
      - /products/
    disallow:
      - /api/
      - /admin/
      - /private/*
      - "*.pdf"
  - agent: Googlebot
    allow: /
    max-image-preview: large
    crawl-delay: 0
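When Allow and Disallow rules overlap, Google's implementation applies the most specific rule, meaning the one with the longest matching path. If the matching Allow and Disallow paths are the same length, Allow takes precedence.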
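The precedence behavior documented by Google and specified in RFC 9309 is that the longest matching path wins and Allow wins a tie. The Python sketch below illustrates that rule for plain path prefixes only. The function `is_allowed` and the `(directive, prefix)` tuple format are hypothetical, and wildcards are deliberately left out.

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled.

    rules: list of (directive, prefix) pairs, e.g. ("allow", "/blog/").
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # Longer prefix wins; on equal length, Allow beats Disallow.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [
    ("allow", "/"),
    ("disallow", "/private/"),
    ("allow", "/private/press/"),
]
print(is_allowed("/blog/post", rules))          # True
print(is_allowed("/private/notes", rules))      # False
print(is_allowed("/private/press/kit", rules))  # True
```

Rule order does not matter in this scheme, so a broad `Allow: /` cannot accidentally override a more specific Disallow.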
REST API
Manage configurations programmatically via our REST API. All requests require authentication via API Key or OAuth 2.0.
Generate Configuration
# Request
curl -X POST https://api.robots.io/v2/config/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "example.com",
    "strategy": "aggressive",
    "exclude_paths": ["/api/v1/*", "/staging/"]
  }'

# Response
{
  "status": "success",
  "config_id": "cfg_8x9k2m4p",
  "output": "User-agent: *\nDisallow: /api/v1/\nDisallow: /staging/\nAllow: /\nCrawl-delay: 1",
  "validation": {
    "errors": [],
    "warnings": ["Crawl-delay ignored by Googlebot"]
  }
}
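The same call can be sketched in Python using only the standard library. The endpoint, headers, and payload fields are taken from the curl example above. The helper names `build_generate_request` and `summarize_response` are hypothetical, and treating any status other than "success" as a failure is an assumption, since only the success response is documented here.

```python
import json
import urllib.request

API_URL = "https://api.robots.io/v2/config/generate"  # from the curl example


def build_generate_request(api_key, domain, strategy, exclude_paths):
    """Build (but do not send) the POST request for config generation."""
    payload = {
        "domain": domain,
        "strategy": strategy,
        "exclude_paths": exclude_paths,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def summarize_response(body):
    """Return (config_id, warnings); raise if the response reports errors."""
    result = json.loads(body)
    validation = result.get("validation", {})
    # Assumption: anything other than status "success" is a failure.
    if result.get("status") != "success" or validation.get("errors"):
        raise RuntimeError(f"generation failed: {validation.get('errors')}")
    return result["config_id"], validation.get("warnings", [])


request = build_generate_request(
    "YOUR_API_KEY", "example.com", "aggressive", ["/api/v1/*", "/staging/"]
)
# To send it:
#   with urllib.request.urlopen(request, timeout=10) as resp:
#       body = resp.read()
sample = (
    '{"status": "success", "config_id": "cfg_8x9k2m4p", '
    '"validation": {"errors": [], "warnings": ["Crawl-delay ignored by Googlebot"]}}'
)
print(summarize_response(sample))
```

Separating request construction from sending keeps the example testable offline and makes it easy to log or inspect the payload before it leaves your machine.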
Best Practices
- Avoid over-disallowing: Blocking essential assets (CSS/JS) can impact rendering and indexing quality.
- Test before deploying: Use the Validation API or the robots.txt report in Google Search Console before pushing to production.
- Version control: Treat configurations as code. Enable Git sync for audit trails and rollback capabilities.
- Monitor crawler behavior: Leverage the dashboard's real-time bot analytics to adjust Crawl-delay thresholds dynamically.
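As a local complement to the first two practices, generated output can be smoke-tested with Python's standard `urllib.robotparser` before it is deployed. This is a rough sketch: the helper `blocked_urls` and the list of asset URLs are made up for illustration. Note that `robotparser` applies rules in file order and does not interpret `*` or `$` wildcards, so it approximates crawler behavior instead of reproducing Google's matching.

```python
from urllib.robotparser import RobotFileParser

# The "output" field from the Generate Configuration response.
generated = (
    "User-agent: *\n"
    "Disallow: /api/v1/\n"
    "Disallow: /staging/\n"
    "Allow: /\n"
    "Crawl-delay: 1"
)


def blocked_urls(robots_txt, urls, agent="*"):
    """Return the subset of `urls` that `agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]


# Hypothetical pages and assets that must remain crawlable for rendering.
must_stay_crawlable = [
    "https://example.com/",
    "https://example.com/assets/site.css",
    "https://example.com/assets/app.js",
]
print(blocked_urls(generated, must_stay_crawlable))                    # []
print(blocked_urls(generated, ["https://example.com/staging/draft"]))  # the staging URL
```

A non-empty result for the must-stay-crawlable list is a signal to revisit the Disallow rules before pushing to production.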
FAQ
Does this replace my existing robots.txt file?
Yes. Once deployed, our platform serves the dynamically generated robots.txt at the root path. We automatically handle caching, CDN propagation, and bot-specific variants.
How are wildcard patterns evaluated?
Patterns follow RFC 9309 (the Robots Exclusion Protocol, previously an IETF draft), extended with platform-specific regex support. * matches any sequence of characters, including an empty one, and $ anchors the pattern to the end of the URL path.
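The two standard metacharacters can be illustrated with a small Python sketch that translates a pattern into a regular expression. The helpers `pattern_to_regex` and `matches` are hypothetical and cover only `*` and a trailing `$`. The platform's extended regex support is not modeled here because its syntax is not described on this page.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal segment, then join the segments with '.*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start, mirroring robots.txt prefix matching.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/private/*", "/private/notes.txt"))         # True
print(matches("*.pdf", "/docs/guide.pdf"))                 # True
print(matches("/*.pdf$", "/docs/guide.pdf?download=1"))    # False
print(matches("/*.pdf", "/docs/guide.pdf?download=1"))     # True
```

The last two lines show the effect of `$`: without it, a pattern matches any URL that begins with a matching prefix, so query strings after `.pdf` are still covered.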
Can I roll back a configuration?
All deployments are versioned. Use the UI or POST /v2/config/rollback to revert to any previous snapshot within the 90-day retention window.