Read Engine Documentation
The Read Engine is Sitemap.xml's high-performance, distributed content crawler designed to discover, parse, and structure web assets for rapid indexing.
Overview
The @sitemap/read-engine package provides a programmatic interface to crawl target domains, extract structured metadata, respect robots.txt directives, and return normalized payloads ready for pipeline ingestion. It handles dynamic content hydration, pagination, and error retries out of the box.
Installation
    npm install @sitemap/read-engine
    # or
    yarn add @sitemap/read-engine
    # or (Python)
    pip install sitemap-read-engine
Basic Usage
    import { ReadEngine } from '@sitemap/read-engine';

    const engine = new ReadEngine({
      concurrency: 5,
      respectRobots: true,
      outputFormat: 'sitemap-xml'
    });

    // Discover & parse URLs
    const results = await engine.crawl('https://example.com');
    console.log(results.urls.length, 'pages indexed');
Configuration Reference
Initialize the engine with a configuration object. Every option listed below is optional; apiKey is required only for authenticated endpoints.
| Option | Type | Default | Description |
|---|---|---|---|
| concurrency | number | 3 | Maximum parallel fetch operations per domain. |
| timeout | number | 8000 | Request timeout in milliseconds. |
| respectRobots | boolean | true | Parses and obeys robots.txt directives. |
| userAgent | string | SitemapBot/2.4 | Custom User-Agent header for requests. |
| outputFormat | enum | 'json' | Return format: 'json', 'sitemap-xml', or 'csv'. |
| retryAttempts | number | 2 | Automatic retries on 429/5xx responses. |
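
As a rough illustration of how the table maps to code, the sketch below types the documented options, copies their defaults, and shows one way the retryAttempts rule could be applied. ReadEngineOptions, DEFAULT_OPTIONS, resolveOptions, and shouldRetry are hypothetical names, not exports of @sitemap/read-engine; only the option names, types, and defaults come from the table.

```typescript
// Hypothetical option shape, transcribed from the table above.
// The package's real exported types may differ.
type OutputFormat = 'json' | 'sitemap-xml' | 'csv';

interface ReadEngineOptions {
  concurrency: number;
  timeout: number; // milliseconds
  respectRobots: boolean;
  userAgent: string;
  outputFormat: OutputFormat;
  retryAttempts: number;
}

// Defaults copied from the table.
const DEFAULT_OPTIONS: ReadEngineOptions = {
  concurrency: 3,
  timeout: 8000,
  respectRobots: true,
  userAgent: 'SitemapBot/2.4',
  outputFormat: 'json',
  retryAttempts: 2,
};

// Hypothetical helper: overlay caller-supplied values on the defaults.
function resolveOptions(overrides: Partial<ReadEngineOptions> = {}): ReadEngineOptions {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

// Hypothetical helper: retry only on 429 or 5xx responses,
// and only while fewer than retryAttempts retries have been made.
function shouldRetry(status: number, retriesMade: number, options: ReadEngineOptions): boolean {
  const retryableStatus = status === 429 || (status >= 500 && status <= 599);
  return retryableStatus && retriesMade < options.retryAttempts;
}
```

Under these assumptions, resolveOptions({ concurrency: 5 }) keeps every other default, and a 404 response is never retried regardless of how many attempts remain.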
API Reference
engine.crawl(url, options?)
Initiates a recursive crawl starting from the provided seed URL. Returns a promise resolving to a structured payload containing discovered URLs, metadata, and crawl statistics.
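
The crawl described above can be pictured as a breadth-first traversal with a visited set. The sketch below is not the engine's implementation: crawlSketch, LinkFetcher, the result fields, and the exact meaning of maxDepth are assumptions made for illustration. Link discovery is injected as a function so the example runs without network access, and pages are fetched sequentially rather than with the configured concurrency.

```typescript
// Hypothetical result shape; the real payload's fields may differ.
interface CrawlSketchResult {
  urls: string[];
  stats: { fetched: number };
}

// Injected link lookup: given a URL, resolve to the URLs it links to.
type LinkFetcher = (url: string) => Promise<string[]>;

async function crawlSketch(
  seed: string,
  fetchLinks: LinkFetcher,
  maxDepth: number = 2,
): Promise<CrawlSketchResult> {
  const seen = new Set<string>([seed]); // every URL discovered so far
  let frontier: string[] = [seed];      // URLs at the current depth
  let fetched = 0;

  // The seed is depth 0; pages deeper than maxDepth are never recorded.
  for (let depth = 0; frontier.length > 0 && depth <= maxDepth; depth += 1) {
    const next: string[] = [];
    for (const url of frontier) {
      const links = await fetchLinks(url);
      fetched += 1;
      if (depth === maxDepth) continue; // fetch the page but do not descend
      for (const link of links) {
        if (!seen.has(link)) {
          seen.add(link);
          next.push(link);
        }
      }
    }
    frontier = next;
  }

  return { urls: Array.from(seen), stats: { fetched } };
}
```

The visited set is what keeps cyclic links (a page linking back to the seed, for instance) from causing repeated fetches.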
engine.parse(html, url?)
Parses a raw HTML string and extracts links, meta tags, headings, and canonical URLs. Useful for pre-processed or cached content.
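
To make the kind of extraction involved concrete, here is a minimal sketch that pulls anchor links and the canonical URL out of an HTML string. parseSketch, resolveUrl, and ParsedPageSketch are hypothetical names, and the regular expressions are a deliberate simplification: they will miss unquoted attributes and can be fooled by markup inside comments or scripts, so a real parser would be needed for production use. Meta tags and headings are left out for brevity.

```typescript
// Hypothetical output shape for this sketch only.
interface ParsedPageSketch {
  links: string[];
  canonical: string | null;
}

// Resolve href against the page URL when one is given;
// fall back to the raw value if it cannot be resolved.
function resolveUrl(href: string, baseUrl?: string): string {
  try {
    return new URL(href, baseUrl).href;
  } catch {
    return href;
  }
}

function parseSketch(html: string, baseUrl?: string): ParsedPageSketch {
  const hrefPattern = /\bhref\s*=\s*["']([^"']+)["']/i;

  // Collect the quoted href of every <a> tag.
  const links: string[] = [];
  const anchorPattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = anchorPattern.exec(html)) !== null) {
    const href = match[1];
    if (href !== undefined) links.push(resolveUrl(href, baseUrl));
  }

  // Find the <link rel="canonical"> tag, then read its href,
  // so attribute order inside the tag does not matter.
  let canonical: string | null = null;
  const tag = /<link\b[^>]*\brel\s*=\s*["']canonical["'][^>]*>/i.exec(html);
  if (tag !== null) {
    const hrefMatch = hrefPattern.exec(tag[0]);
    const href = hrefMatch !== null ? hrefMatch[1] : undefined;
    if (href !== undefined) canonical = resolveUrl(href, baseUrl);
  }

  return { links, canonical };
}
```

Passing the page URL as the second argument is what allows relative links such as /b to be returned in absolute form.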
engine.submitToIndex(payload)
Pushes parsed results directly to the Sitemap.xml indexing pipeline over a secure WebSocket or REST endpoint.
Best Practices
- Use shallow crawls for large sites: Set maxDepth: 2 to avoid resource exhaustion on e-commerce platforms.
- Enable cache: The engine supports cacheStrategy: 'lru' to skip recently parsed identical content.
- Handle pagination: Use the nextCursor field in responses to resume interrupted crawls seamlessly.
- Respect server load: Adjust delayBetweenRequests when crawling legacy or resource-constrained origins.