Read Engine Documentation

The Read Engine is Sitemap.xml's high-performance, distributed content crawler designed to discover, parse, and structure web assets for rapid indexing.

Overview

The @sitemap/read-engine package provides a programmatic interface to crawl target domains, extract structured metadata, respect robots.txt directives, and return normalized payloads ready for pipeline ingestion. It handles dynamic content hydration, pagination, and error retries out of the box.

ℹ️ Note

Read Engine v2.4+ requires Node.js 18+ or Python 3.10+. Legacy v1.x branches are available in the Archive.

Installation

npm install @sitemap/read-engine
# or
yarn add @sitemap/read-engine
# or (Python)
pip install sitemap-read-engine

Basic Usage

import { ReadEngine } from '@sitemap/read-engine';

const engine = new ReadEngine({
  concurrency: 5,
  respectRobots: true,
  outputFormat: 'sitemap-xml'
});

// Crawl from the seed URL, discovering and parsing pages along the way
const results = await engine.crawl('https://example.com');
console.log(results.urls.length, 'URLs discovered');

Configuration Reference

Initialize the engine with a configuration object. All options are optional; `apiKey` is required only for authenticated endpoints.

Option	Type	Default	Description
concurrency	number	3	Maximum parallel fetch operations per domain.
timeout	number	8000	Request timeout in milliseconds.
respectRobots	boolean	true	Parses and obeys `robots.txt` directives.
userAgent	string	SitemapBot/2.4	Custom User-Agent header for requests.
outputFormat	enum	'json'	Return format: `'json'`, `'sitemap-xml'`, or `'csv'`.
retryAttempts	number	2	Automatic retries on 429/5xx responses.
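
To make the defaults concrete, here is a minimal sketch of how user options could be merged over the values in the table. The helper name `resolveConfig` and its validation rules are assumptions for illustration, not part of the documented API; only the default values come from the table above.

```javascript
// Defaults copied from the option table; everything else is illustrative.
const DEFAULTS = Object.freeze({
  concurrency: 3,
  timeout: 8000,
  respectRobots: true,
  userAgent: 'SitemapBot/2.4',
  outputFormat: 'json',
  retryAttempts: 2,
});

const OUTPUT_FORMATS = ['json', 'sitemap-xml', 'csv'];

// Hypothetical helper: merge user options over the defaults and apply
// basic sanity checks (the checks themselves are an assumption).
function resolveConfig(options = {}) {
  const config = { ...DEFAULTS, ...options };
  if (!Number.isInteger(config.concurrency) || config.concurrency < 1) {
    throw new RangeError('concurrency must be a positive integer');
  }
  if (!OUTPUT_FORMATS.includes(config.outputFormat)) {
    throw new RangeError(`outputFormat must be one of: ${OUTPUT_FORMATS.join(', ')}`);
  }
  return config;
}

const config = resolveConfig({ concurrency: 5, outputFormat: 'sitemap-xml' });
console.log(config.concurrency, config.timeout, config.outputFormat); // 5 8000 sitemap-xml
```

Options that are not passed keep their table defaults, so the Basic Usage configuration above would still run with `timeout: 8000` and `retryAttempts: 2`.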

API Reference

`engine.crawl(url, options?)`

Initiates a recursive crawl starting from the provided seed URL. Returns a promise resolving to a structured payload containing discovered URLs, metadata, and crawl statistics.
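
The idea behind a recursive crawl can be sketched as a breadth-first traversal with a visited set and a depth cap. This is not the engine's implementation: `crawlGraph` and `getLinks` are hypothetical names, the traversal is synchronous for brevity, and the link graph is an in-memory map so nothing is fetched. Treating the seed as depth 0 is also an assumption.

```javascript
// Breadth-first traversal with a visited set and a depth cap.
// `getLinks` stands in for "fetch the page and extract its links".
function crawlGraph(seedUrl, getLinks, maxDepth = 2) {
  const visited = new Set([seedUrl]);
  let frontier = [seedUrl];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      for (const link of getLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link); // mark before queueing so cycles are skipped
          next.push(link);
        }
      }
    }
    frontier = next;
  }
  return [...visited];
}

// Toy site: "/" links to "/a" and "/b"; "/a" links back to "/" and deeper.
const site = new Map([
  ['https://example.com/', ['https://example.com/a', 'https://example.com/b']],
  ['https://example.com/a', ['https://example.com/', 'https://example.com/a/deep']],
  ['https://example.com/b', []],
  ['https://example.com/a/deep', ['https://example.com/a/deeper']],
]);
const getLinks = (url) => site.get(url) ?? [];

console.log(crawlGraph('https://example.com/', getLinks, 1).length); // 3
console.log(crawlGraph('https://example.com/', getLinks, 2).length); // 4
```

The visited set is what keeps the back-link from `/a` to `/` from causing an infinite loop, and the depth cap bounds the work on large sites.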

`engine.parse(html, url?)`

Static HTML parser. Extracts links, meta tags, headings, and canonical URLs from raw HTML strings. Useful for pre-processed or cached content.
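
As a rough illustration of what link and canonical extraction involves, the sketch below pulls `href` values out of an HTML string and resolves them against a base URL. It is not how `engine.parse` is implemented: `extractLinks` and `resolveHref` are hypothetical names, and regular expressions are used only to keep the example dependency-free. A production parser would use a real HTML parser.

```javascript
// Resolve a possibly relative href; returns null if it cannot be parsed
// (for example, a relative href with no base URL).
function resolveHref(href, baseUrl) {
  try {
    return new URL(href, baseUrl).href;
  } catch {
    return null;
  }
}

// Hypothetical helper: collect unique anchor targets and the canonical URL.
function extractLinks(html, baseUrl) {
  const links = new Set();

  // Anchor tags: capture the quoted href value.
  for (const match of html.matchAll(/<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi)) {
    const resolved = resolveHref(match[1], baseUrl);
    if (resolved) links.add(resolved);
  }

  // <link rel="canonical" href="..."> (assumes rel appears before href).
  const canon = html.match(
    /<link\s[^>]*?rel\s*=\s*["']canonical["'][^>]*?href\s*=\s*["']([^"']+)["']/i
  );
  const canonical = canon ? resolveHref(canon[1], baseUrl) : null;

  return { links: [...links], canonical };
}

const html = `
  <link rel="canonical" href="https://example.com/home">
  <a href="/about">About</a>
  <a class="ext" href='https://example.org/'>Elsewhere</a>
  <a href="/about">About again</a>
`;
console.log(extractLinks(html, 'https://example.com/index.html'));
```

In this sketch the base URL is what turns `/about` into an absolute URL. When it is omitted, relative links are dropped because they cannot be resolved.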

`engine.submitToIndex(payload)`

Pushes parsed results directly to the Sitemap.xml indexing pipeline via secure WebSocket or REST endpoint.

⚠️ Rate Limiting

Unauthenticated requests are capped at 100 requests/minute. Professional tiers are allowed 10,000 requests/minute. Exponential backoff handling is built into the engine.
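
For readers unfamiliar with exponential backoff, the sketch below shows the general technique: wait longer after each failed attempt, up to a cap. The function names, the 500 ms base, and the 30 s cap are assumptions chosen for illustration; the engine's actual schedule is not documented here. The retryable status codes (429 and 5xx) follow the `retryAttempts` description in the Configuration Reference.

```javascript
// Status codes the option table says are retried: 429 and any 5xx.
function isRetryable(status) {
  return status === 429 || (status >= 500 && status <= 599);
}

// Exponential backoff: baseMs * 2^attempt, capped at maxMs.
// baseMs and maxMs are illustrative values, not documented defaults.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical retry wrapper: `request` is any async function returning an
// object with a numeric `status`. `sleep` is injectable so callers can skip waiting.
async function withRetries(request, retryAttempts = 2, sleep = defaultSleep) {
  let response = await request();
  for (let attempt = 0; attempt < retryAttempts && isRetryable(response.status); attempt++) {
    await sleep(backoffDelay(attempt));
    response = await request();
  }
  return response;
}

console.log([0, 1, 2, 3].map((n) => backoffDelay(n))); // 500, 1000, 2000, 4000
```

With `retryAttempts: 2`, this wrapper makes at most three requests in total: the initial one plus two retries.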

Best Practices

Use shallow crawls for large sites: Set `maxDepth: 2` to avoid resource exhaustion on e-commerce platforms.
Enable cache: The engine supports `cacheStrategy: 'lru'` to skip recently parsed identical content.
Handle pagination: Use the `nextCursor` field in responses to resume interrupted crawls seamlessly.
Respect server load: Adjust `delayBetweenRequests` when crawling legacy or resource-constrained origins.
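
For context on the `'lru'` cache strategy mentioned above, here is a minimal sketch of a least-recently-used cache. It illustrates the general technique only: the `LruCache` class is hypothetical, and keying by URL is an assumption (the engine may key on a content hash instead).

```javascript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
// Re-inserting a key on every read moves it to the "most recent" end.
class LruCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}

const cache = new LruCache(2);
cache.set('https://example.com/a', { title: 'A' });
cache.set('https://example.com/b', { title: 'B' });
cache.get('https://example.com/a');                 // touch A, so B is now oldest
cache.set('https://example.com/c', { title: 'C' }); // over capacity: evicts B
console.log([...cache.entries.keys()]);
// [ 'https://example.com/a', 'https://example.com/c' ]
```

A crawler would typically check such a cache before parsing a page and store the parsed result afterwards, so recently seen content is skipped.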