Sitemap.xml
v2.4.1

Read Engine Documentation

The Read Engine is Sitemap.xml's high-performance, distributed content crawler designed to discover, parse, and structure web assets for rapid indexing.

Overview

The @sitemap/read-engine package provides a programmatic interface to crawl target domains, extract structured metadata, respect robots.txt directives, and return normalized payloads ready for pipeline ingestion. It handles dynamic content hydration, pagination, and error retries out of the box.

ℹ️ Note
Read Engine v2.4+ requires Node.js 18+ or Python 3.10+. Legacy v1.x branches are available in the Archive.

Installation

npm install @sitemap/read-engine
# or
yarn add @sitemap/read-engine
# or (Python)
pip install sitemap-read-engine

Basic Usage

import { ReadEngine } from '@sitemap/read-engine';

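const engine = new ReadEngine({
  concurrency: 5,              // parallel fetches per domain (default: 3)
  respectRobots: true,         // obey robots.txt directives (default: true)
  outputFormat: 'sitemap-xml'  // alternatives: 'json' (default) or 'csv'
});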

// Discover & parse URLs
const results = await engine.crawl('https://example.com');
console.log(results.urls.length, 'pages indexed');

Configuration Reference

Initialize the engine with a configuration object. Every option is optional; apiKey is required only for authenticated endpoints.

Option         Type     Default         Description
concurrency    number   3               Maximum parallel fetch operations per domain.
timeout        number   8000            Request timeout in milliseconds.
respectRobots  boolean  true            Parses and obeys robots.txt directives.
userAgent      string   SitemapBot/2.4  Custom User-Agent header for requests.
outputFormat   enum     'json'          Return format: 'json', 'sitemap-xml', or 'csv'.
retryAttempts  number   2               Automatic retries on 429/5xx responses.
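
To make the defaults concrete, here is a minimal sketch of how a partial configuration could be merged with the values listed in the table. DEFAULTS, OUTPUT_FORMATS, and resolveConfig are hypothetical names used for illustration only; they are not part of the @sitemap/read-engine API, and the validation rules are assumptions.

```javascript
// Defaults copied from the Configuration Reference table.
// DEFAULTS and resolveConfig are illustrative names, not package exports.
const DEFAULTS = Object.freeze({
  concurrency: 3,
  timeout: 8000, // milliseconds
  respectRobots: true,
  userAgent: 'SitemapBot/2.4',
  outputFormat: 'json',
  retryAttempts: 2,
});

const OUTPUT_FORMATS = ['json', 'sitemap-xml', 'csv'];

function resolveConfig(overrides = {}) {
  // Caller-supplied values win; anything omitted falls back to its default.
  const config = { ...DEFAULTS, ...overrides };

  // Assumed validation: reject formats the table does not list.
  if (!OUTPUT_FORMATS.includes(config.outputFormat)) {
    throw new RangeError(`Unsupported outputFormat: ${config.outputFormat}`);
  }
  if (!Number.isInteger(config.concurrency) || config.concurrency < 1) {
    throw new RangeError('concurrency must be a positive integer');
  }
  return config;
}

const config = resolveConfig({ concurrency: 5, outputFormat: 'sitemap-xml' });
console.log(config.concurrency, config.timeout, config.outputFormat);
// prints: 5 8000 sitemap-xml
```

Spreading the overrides after the defaults means an omitted key never ends up undefined, which matches the "all options are optional" behavior described above.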

API Reference

engine.crawl(url, options?)

Initiates a recursive crawl starting from the provided seed URL. Returns a promise resolving to a structured payload containing discovered URLs, metadata, and crawl statistics.
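
The engine's internals are not described here, so the following is only a sketch of one common way to structure such a crawl: a breadth-first traversal with a visited set, a same-origin filter, and batches bounded by a concurrency limit. crawlSketch, fetchLinks, and maxDepth are hypothetical names, and the shape of the returned object is an assumption modeled on the description above. The link fetcher is injected so the example runs without network access.

```javascript
// Breadth-first crawl sketch. crawlSketch, fetchLinks, and maxDepth are
// illustrative names; they are not part of the @sitemap/read-engine API.
async function crawlSketch(seedUrl, fetchLinks, { concurrency = 3, maxDepth = 2 } = {}) {
  const seed = new URL(seedUrl);
  const visited = new Set([seed.href]);
  let frontier = [seed.href];
  let failed = 0;

  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = [];
    // Fetch the current level in batches of at most `concurrency` URLs.
    for (let i = 0; i < frontier.length; i += concurrency) {
      const batch = frontier.slice(i, i + concurrency);
      const settled = await Promise.allSettled(batch.map((url) => fetchLinks(url)));
      settled.forEach((result, j) => {
        if (result.status === 'rejected') {
          failed += 1;
          return;
        }
        for (const href of result.value) {
          let link;
          try {
            // Resolve relative links against the page they were found on.
            link = new URL(href, batch[j]);
          } catch {
            continue; // skip hrefs that are not valid URLs
          }
          link.hash = ''; // fragments point at the same page
          // Stay on the seed's origin and skip anything already recorded.
          if (link.origin === seed.origin && !visited.has(link.href)) {
            visited.add(link.href);
            next.push(link.href);
          }
        }
      });
    }
    // Links found on pages at maxDepth are recorded but not fetched.
    frontier = next;
  }
  return { urls: [...visited], stats: { discovered: visited.size, failed } };
}

// Usage with an in-memory stub instead of real HTTP requests.
const pages = { 'https://example.com/': ['/a'], 'https://example.com/a': ['/'] };
crawlSketch('https://example.com', async (url) => pages[url] ?? []).then((result) => {
  console.log(result.stats.discovered, 'pages discovered');
});
```

The visited set is what keeps the traversal from looping when pages link back to each other, as the two stub pages do.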

engine.parse(html, url?)

Parses static HTML. Extracts links, meta tags, headings, and canonical URLs from a raw HTML string. Useful for pre-processed or cached content.
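
As a rough illustration of the kind of extraction described, the sketch below pulls links, meta tags, headings, and the canonical URL out of an HTML string. parseSketch and its return shape are hypothetical and do not reflect the package's actual parser or output format. It uses regular expressions to stay dependency-free, which is an assumption made for brevity; regexes miss many real-world cases (unquoted attributes, comments, scripts), so a real HTML parser is the safer choice in practice.

```javascript
// Regex-based extraction sketch. parseSketch is an illustrative name, not the
// package's parser, and only handles quoted attribute values.
function parseSketch(html, baseUrl = 'https://example.com/') {
  // Read one quoted attribute value from an opening tag, or null if absent.
  const attr = (tag, name) => {
    const m = tag.match(new RegExp(`\\b${name}\\s*=\\s*["']([^"']*)["']`, 'i'));
    return m ? m[1] : null;
  };
  // Resolve a possibly relative href against baseUrl; null if it is invalid.
  const resolve = (href) => {
    try {
      return new URL(href, baseUrl).href;
    } catch {
      return null;
    }
  };

  const links = [];
  for (const [tag] of html.matchAll(/<a\b[^>]*>/gi)) {
    const href = attr(tag, 'href');
    const url = href ? resolve(href) : null;
    if (url) links.push(url);
  }

  const meta = {};
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    const name = attr(tag, 'name');
    if (name) meta[name.toLowerCase()] = attr(tag, 'content') ?? '';
  }

  let canonical = null;
  for (const [tag] of html.matchAll(/<link\b[^>]*>/gi)) {
    if ((attr(tag, 'rel') || '').toLowerCase() === 'canonical') {
      const href = attr(tag, 'href');
      if (href) canonical = resolve(href);
    }
  }

  // Capture h1-h6 and strip any inline markup from the heading text.
  const headings = [...html.matchAll(/<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi)].map(
    ([, level, inner]) => ({ level: Number(level), text: inner.replace(/<[^>]+>/g, '').trim() })
  );

  return { links, meta, headings, canonical };
}
```

Passing the page's own URL as the second argument lets relative links such as /about resolve to absolute URLs, which is presumably why engine.parse accepts an optional url.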

engine.submitToIndex(payload)

Pushes parsed results directly to the Sitemap.xml indexing pipeline via secure WebSocket or REST endpoint.

⚠️ Rate Limiting
Unauthenticated requests are capped at 100 requests/minute. Professional tiers are allowed 10,000 requests/minute, and exponential backoff handling is built into the engine.
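
For readers unfamiliar with the technique, the sketch below shows a generic retry loop with exponential backoff on 429 and 5xx responses. It is not the engine's implementation: backoffDelay, isRetryable, withBackoff, the 500 ms base delay, and the 30-second cap are all assumptions chosen for illustration. The default of 2 retries mirrors the retryAttempts option.

```javascript
// Retry-with-backoff sketch. All names and delay values here are assumptions;
// the engine's actual retry schedule is not documented in this section.
function backoffDelay(attempt, baseDelayMs = 500, maxDelayMs = 30000) {
  // attempt 0 -> base, attempt 1 -> 2x base, attempt 2 -> 4x base, capped at maxDelayMs.
  return Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
}

function isRetryable(status) {
  // Rate-limited (429) or server error (5xx).
  return status === 429 || (status >= 500 && status <= 599);
}

async function withBackoff(request, { retryAttempts = 2, baseDelayMs = 500, sleep } = {}) {
  // `sleep` can be injected for testing; by default it waits with setTimeout.
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    const response = await request();
    // Return on success, on a non-retryable status, or once retries are exhausted.
    if (!isRetryable(response.status) || attempt >= retryAttempts) return response;
    await wait(backoffDelay(attempt, baseDelayMs));
  }
}
```

With the default settings a request is attempted at most three times, waiting 500 ms and then 1000 ms between attempts. Production implementations usually add random jitter to the delay so that many clients do not retry in lockstep; that is omitted here to keep the sketch short.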

Best Practices