ForgeCrawl — Web content forged into Markdown for LLMs


Self-hosted, authenticated web scraper that converts any webpage into clean, structured Markdown — optimized for RAG pipelines, fine-tuning, and LLM prompts.

100% free & self-hosted — no monthly fees, no usage limits, no API charges. You only pay for your own VPS.
$ git clone https://github.com/cschweda/forgecrawl && docker compose up -d

Features

Everything you need. Nothing you don't.

Self-Hosted

Your data never leaves your server. No third-party APIs, no usage tracking, no data sharing. You own everything.

Free Forever

No monthly fees. No per-page pricing. No usage caps. No API keys to buy. Self-hosted on your own VPS — the only cost is your server.

Authenticated

Built-in user auth with bcrypt + JWT. API key support for CLI scripting. Rate limiting. SSRF protection.

LLM-Ready Markdown

Clean Markdown with YAML frontmatter. Metadata, word counts, timestamps. Ready to paste into any LLM.

Zero Dependencies

SQLite database. No Redis, no Postgres, no message queue. One Docker command or PM2 start. That's it.

Smart Caching

Configurable result cache with TTL. Bypass on demand. Never scrape the same page twice unless you want to.

How It Works

Three steps to clean Markdown

01

Deploy

Clone the repo. Run docker compose up or pm2 start. Register your admin account.

02

Create an API Key

Log in to the web UI. Generate a Bearer token. Use it in scripts, curl, or any HTTP client.

03

Scrape & Build

POST a URL, get clean Markdown back. Feed it to your RAG pipeline, fine-tuning dataset, or LLM prompt.

Why Markdown?

HTML is for browsers. Markdown is for LLMs.

You could paste raw HTML into an LLM. But you'd be wasting tokens on noise, confusing the model with layout markup, and getting worse results. Here's why Markdown matters.

Raw HTML — 41,823 bytes
<!DOCTYPE html>
<html lang="en" dir="ltr" class="dark-mode js">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width">
  <title>Getting Started | Acme Docs</title>
  <meta name="description" content="Quick start guide">
  <meta property="og:title" content="Getting Started">
  <meta property="og:image" content="/img/og.png">
  <link rel="stylesheet" href="/css/main.a8f2c4.css">
  <link rel="stylesheet" href="/css/vendor.3e91.css">
  <link rel="stylesheet" href="/css/prism.f4a2.css">
  <link rel="preconnect" href="https://fonts.gstatic.com">
  <link rel="stylesheet" href="https://fonts.googleapis...">
  <script defer src="/js/analytics.min.js"></script>
  <script>window.__CONFIG__={"theme":"dark"}</script>
  <script>!function(e,t){e.dataLayer=e.dataLayer||[]
    ;function n(){dataLayer.push(arguments)}n("js",
    new Date);n("config","G-XXXXXXXX")}()</script>
</head>
<body class="antialiased min-h-screen bg-white">
  <div id="__app">
    <header class="sticky top-0 z-50 w-full border-b
      border-gray-200 bg-white/80 backdrop-blur">
      <div class="mx-auto max-w-7xl px-4 sm:px-6">
        <div class="flex h-16 items-center justify-between">
          <a href="/" class="flex items-center gap-2">
            <img src="/logo.svg" alt="Acme" width="32">
            <span class="font-bold text-lg">Acme Docs</span>
          </a>
          <nav class="hidden md:flex items-center gap-6">
            <a href="/docs" class="text-sm font-medium
              text-gray-600 hover:text-gray-900">Docs</a>
            <a href="/api" class="text-sm font-medium
              text-gray-600 hover:text-gray-900">API</a>
            <a href="/blog" class="text-sm font-medium
              text-gray-600 hover:text-gray-900">Blog</a>
            <a href="/pricing" class="text-sm font-medium
              text-gray-600 hover:text-gray-900">Pricing</a>
          </nav>
          <div class="flex items-center gap-3">
            <button class="rounded-lg p-2 hover:bg-gray-100"
              aria-label="Toggle theme">
              <svg class="h-5 w-5">...</svg>
            </button>
            <a href="/login" class="rounded-lg bg-blue-600
              px-4 py-2 text-sm text-white">Sign In</a>
          </div>
        </div>
      </div>
    </header>
    <aside class="fixed left-0 top-16 w-64 border-r
      border-gray-200 overflow-y-auto h-[calc(100vh-4rem)]">
      <nav class="p-4 space-y-1">
        <a href="/docs/intro" class="block rounded-md px-3
          py-2 text-sm text-gray-600">Introduction</a>
        <a href="/docs/start" class="block rounded-md px-3
          py-2 text-sm font-medium bg-blue-50
          text-blue-700">Getting Started</a>
        <a href="/docs/config" class="block rounded-md px-3
          py-2 text-sm text-gray-600">Configuration</a>
        <a href="/docs/deploy" class="block rounded-md px-3
          py-2 text-sm text-gray-600">Deployment</a>
      </nav>
    </aside>
    <main class="ml-64 pt-16">
      <article class="prose prose-blue max-w-none px-8 py-12">
        <h1 id="getting-started">Getting Started</h1>
        <p>Install the SDK to get started with Acme.</p>
        <h2 id="installation">Installation</h2>
        <p>Run the following command:</p>
        <div class="code-block relative group">
          <button class="absolute right-2 top-2 opacity-0
            group-hover:opacity-100 rounded bg-gray-700 px-2
            py-1 text-xs text-white">Copy</button>
          <pre class="language-bash"><code>npm install
            @acme/sdk</code></pre>
        </div>
        <h2 id="configuration">Configuration</h2>
        <div class="overflow-x-auto">
          <table class="min-w-full divide-y divide-gray-200">
            <thead class="bg-gray-50">
              <tr>
                <th class="px-4 py-3 text-left text-xs
                  font-medium uppercase text-gray-500">
                  Option</th>
                <th class="px-4 py-3 text-left text-xs
                  font-medium uppercase text-gray-500">
                  Default</th>
                <th class="px-4 py-3 text-left text-xs
                  font-medium uppercase text-gray-500">
                  Description</th>
              </tr>
            </thead>
            <tbody class="divide-y divide-gray-200">
              <tr>
                <td class="px-4 py-3 text-sm">timeout</td>
                <td class="px-4 py-3 text-sm">30s</td>
                <td class="px-4 py-3 text-sm">Request
                  timeout</td>
              </tr>
              <tr>
                <td class="px-4 py-3 text-sm">retries</td>
                <td class="px-4 py-3 text-sm">3</td>
                <td class="px-4 py-3 text-sm">Max retry
                  count</td>
              </tr>
            </tbody>
          </table>
        </div>
      </article>
    </main>
    <footer class="ml-64 border-t border-gray-200 px-8 py-6">
      <div class="flex justify-between text-sm text-gray-500">
        <span>&copy; 2026 Acme Inc.</span>
        <div class="flex gap-4">
          <a href="/privacy">Privacy</a>
          <a href="/terms">Terms</a>
        </div>
      </div>
    </footer>
  </div>
  <script src="/js/vendor.8f3a.js"></script>
  <script src="/js/app.2c7e.js"></script>
  <script src="/js/prism.min.js"></script>
</body>
</html>
ForgeCrawl Markdown — 547 bytes
---
title: Getting Started
url: https://docs.acme.com/docs/start
description: Quick start guide
scraped_at: 2026-03-05T10:30:00Z
scraper: ForgeCrawl/1.0
word_count: 42
---

# Getting Started

Install the SDK to get started with Acme.

## Installation

Run the following command:

```bash
npm install @acme/sdk
```

## Configuration

| Option  | Default | Description     |
|---------|---------|-----------------|
| timeout | 30s     | Request timeout |
| retries | 3       | Max retry count |

Same content. 98.7% smaller. Zero noise.

~80% fewer tokens

A typical webpage is 50-200KB of HTML. After stripping nav, ads, scripts, and boilerplate, the Markdown is 2-10KB. That's 80-95% fewer tokens consumed by your LLM — which means lower cost and more room for actual context.

Signal, not noise

Raw HTML is full of <div>, <span>, class names, inline styles, tracking pixels, and aria attributes. None of that is content. Markdown strips it down to headings, paragraphs, lists, links, and code — exactly what the LLM needs to reason about.

Better LLM comprehension

LLMs are trained heavily on Markdown (GitHub, docs, README files). They parse Markdown structure natively — headings map to topics, lists map to enumeration, code blocks map to examples. HTML structure is ambiguous and model-dependent.

Structured metadata

ForgeCrawl adds YAML frontmatter with the source URL, title, description, scrape timestamp, and word count. This metadata is critical for RAG pipelines — you know where every chunk came from and when it was captured.
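The frontmatter block is trivially machine-readable. Here is a minimal sketch with plain awk (no YAML parser needed) that isolates it from a scraped file; the sample page and field values are illustrative:

```shell
# A sample scraped page with ForgeCrawl-style frontmatter
cat > page.md <<'EOF'
---
title: Getting Started
url: https://docs.acme.com/docs/start
scraped_at: 2026-03-05T10:30:00Z
word_count: 42
---

# Getting Started

Install the SDK to get started with Acme.
EOF

# Print only the lines between the first two '---' delimiters
awk '/^---$/ { n++; next } n == 1' page.md

# Pull a single field, e.g. the source URL for chunk attribution
awk -F': ' '/^url:/ { print $2 }' page.md
```

For anything beyond quick scripts, a real YAML parser is the safer choice; the point is that the attribution data is right there at the top of every file.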

Consistent format

Every website has different HTML structure. ForgeCrawl normalizes all of them into the same clean Markdown format — headings, paragraphs, lists, tables, code blocks. Your downstream pipeline handles one format, not thousands.

Diffable and versionable

Markdown diffs cleanly in git. HTML doesn't. If you're tracking content changes over time (regulatory pages, docs, policies), Markdown lets you see exactly what changed in a human-readable diff.
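That tracking workflow can be sketched end to end with a throwaway repo (git assumed installed; the page content is invented):

```shell
# Throwaway repo to simulate tracking a scraped page over time
mkdir -p /tmp/fc-diff-demo && cd /tmp/fc-diff-demo
git init -q
git config user.email demo@example.com
git config user.name demo

# Week 1: commit the first scraped snapshot
printf '# Refund Policy\n\nRefunds within 30 days.\n' > policy.md
git add policy.md && git commit -qm "week 1 snapshot"

# Week 2: re-scrape; the policy changed
printf '# Refund Policy\n\nRefunds within 14 days.\n' > policy.md

# A human-readable diff of exactly what changed
git diff policy.md
```

The diff shows the old line removed and the new line added, one sentence at a time, which is exactly what you want when reviewing policy or docs changes.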

API

Built for automation

Every action in ForgeCrawl is available through the REST API. Authenticate with a Bearer token and integrate with any workflow.

Your server, your domain


ForgeCrawl is self-hosted — it runs on your infrastructure, not ours. No account needed.

Scrape a page

$ curl -X POST https://your-server.example.com/api/scrape \
  -H "Authorization: Bearer $FC_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response

{
  "job_id": "c7f3a1b2-9e4d-4c8a-b5f6-2d1e0a3b4c5d",
  "title": "Example Domain",
  "markdown": "---\ntitle: Example Domain\nurl: https://example.com\nscraped_at: 2026-03-05T14:22:08Z\nscraper: ForgeCrawl/1.0\nword_count: 28\n---\n\n# Example Domain\n\nThis domain is for use in illustrative examples in\ndocuments. You may use this domain in literature\nwithout prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)",
  "rawHtml": "<!doctype html>\n<html>\n<head>\n  <title>Example Domain</title>...",
  "wordCount": 28,
  "metadata": {
    "url": "https://example.com",
    "canonical": null,
    "excerpt": "This domain is for use in illustrative examples in documents.",
    "byline": null,
    "siteName": null,
    "language": "en",
    "ogImage": null,
    "publishedTime": null,
    "scrapedAt": "2026-03-05T14:22:08.431Z"
  },
  "cached": false
}
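When scripting, the Markdown is just a field in that JSON. Here is a minimal sketch with jq (assumed installed) working against a saved, simplified copy of the response above:

```shell
# Save a simplified scrape response, as returned by POST /api/scrape
cat > response.json <<'EOF'
{
  "job_id": "c7f3a1b2-9e4d-4c8a-b5f6-2d1e0a3b4c5d",
  "title": "Example Domain",
  "markdown": "---\ntitle: Example Domain\n---\n\n# Example Domain\n",
  "wordCount": 28,
  "cached": false
}
EOF

# Extract just the Markdown body into a file ready for your pipeline
jq -r '.markdown' response.json > page.md

# One-line summary, e.g. for logging
jq -r '"\(.title) (\(.wordCount) words, cached: \(.cached))"' response.json
```

In practice you would pipe the live call straight through: `curl ... /api/scrape | jq -r '.markdown' > page.md`.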

Node.js example

const res = await fetch('https://your-server.example.com/api/scrape', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.FC_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ url: 'https://example.com' }),
})

if (!res.ok) throw new Error(`Scrape failed: HTTP ${res.status}`)

const { markdown, title, wordCount } = await res.json()
console.log(`Scraped "${title}" (${wordCount} words)`)

All endpoints

GET https://your-server.example.com/api/health (public)

Returns server health status and version info.

Request

$ curl https://your-server.example.com/api/health

Response

{
  "status": "ok",
  "version": "1.0.0",
  "uptime": 84623
}
POST https://your-server.example.com/api/scrape (Bearer)

Scrape a URL and return clean Markdown with metadata. Send {"url": "https://..."} in the request body.

Request

$ curl -X POST https://your-server.example.com/api/scrape \
  -H "Authorization: Bearer $FC_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response

{
  "job_id": "c7f3a1b2-9e4d-4c8a-...",
  "title": "Example Domain",
  "markdown": "---\ntitle: Example Domain\n...\n---\n\n# Example Domain\n...",
  "rawHtml": "<!doctype html>...",
  "wordCount": 28,
  "metadata": {
    "url": "https://example.com",
    "canonical": null,
    "excerpt": "This domain is for use in...",
    "language": "en",
    "scrapedAt": "2026-03-05T14:22:08.431Z"
  },
  "cached": false
}
GET https://your-server.example.com/api/scrapes (Bearer)

List all scrape jobs for the authenticated user, ordered by most recent.

Request

$ curl https://your-server.example.com/api/scrapes \
  -H "Authorization: Bearer $FC_KEY"

Response

[
  {
    "id": "c7f3a1b2-9e4d-4c8a-...",
    "url": "https://example.com",
    "status": "completed",
    "createdAt": "2026-03-05T14:22:07Z",
    "completedAt": "2026-03-05T14:22:08Z"
  },
  {
    "id": "a8d2e5f1-3b7c-4a9e-...",
    "url": "https://docs.acme.com/start",
    "status": "completed",
    "createdAt": "2026-03-05T14:20:01Z",
    "completedAt": "2026-03-05T14:20:03Z"
  }
]
GET https://your-server.example.com/api/scrapes/:id (Bearer)

Retrieve full results for a specific scrape job by ID.

Request

$ curl https://your-server.example.com/api/scrapes/c7f3a1b2-9e4d-4c8a-... \
  -H "Authorization: Bearer $FC_KEY"

Response

{
  "id": "c7f3a1b2-9e4d-4c8a-...",
  "url": "https://example.com",
  "status": "completed",
  "title": "Example Domain",
  "markdown": "---\ntitle: Example Domain\n...---\n\n# Example Domain\n...",
  "wordCount": 28,
  "metadata": { ... },
  "createdAt": "2026-03-05T14:22:07Z",
  "completedAt": "2026-03-05T14:22:08Z"
}
DELETE https://your-server.example.com/api/scrapes/:id (Bearer)

Delete a scrape job and its results permanently.

Request

$ curl -X DELETE https://your-server.example.com/api/scrapes/c7f3a1b2-9e4d-4c8a-... \
  -H "Authorization: Bearer $FC_KEY"

Response

{
  "message": "Scrape job deleted"
}
POST https://your-server.example.com/api/auth/api-keys (Bearer)

Create a new API key for Bearer token authentication. The key is only shown once.

Request

$ curl -X POST https://your-server.example.com/api/auth/api-keys \
  -H "Authorization: Bearer $FC_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-script"}'

Response

{
  "id": "e4f5a6b7-8c9d-4e0f-...",
  "name": "my-script",
  "key": "fc_a1b2c3d4e5f6...",
  "createdAt": "2026-03-05T14:25:00Z"
}
GET https://your-server.example.com/api/auth/api-keys (Bearer)

List all API keys for the authenticated user. Keys are masked for security.

Request

$ curl https://your-server.example.com/api/auth/api-keys \
  -H "Authorization: Bearer $FC_KEY"

Response

[
  {
    "id": "e4f5a6b7-8c9d-4e0f-...",
    "name": "my-script",
    "lastFour": "f6a1",
    "createdAt": "2026-03-05T14:25:00Z",
    "lastUsedAt": "2026-03-05T15:10:33Z"
  }
]
DELETE https://your-server.example.com/api/auth/api-keys/:id (Bearer)

Revoke an API key permanently. It cannot be recovered.

Request

$ curl -X DELETE https://your-server.example.com/api/auth/api-keys/e4f5a6b7-8c9d-4e0f-... \
  -H "Authorization: Bearer $FC_KEY"

Response

{
  "message": "API key revoked"
}

Security

Security is not an afterthought

ForgeCrawl scrapes arbitrary URLs from the internet. Every layer is hardened against abuse.

SSRF Protection

Blocks private IPs, localhost, and cloud metadata endpoints; re-validates DNS on redirects

bcrypt + JWT

12-round password hashing, HTTP-only secure cookies, configurable session expiry

API Key Auth

SHA-256 hashed Bearer tokens (fc_...) for CLI and scripting
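Conceptually, hashing at rest means the server stores and compares digests, never the plaintext key. A sketch with a made-up key; the actual storage schema is an implementation detail:

```shell
# Hypothetical API key; only its digest would be stored server-side
KEY="fc_a1b2c3d4e5f6"

# What gets persisted: the SHA-256 digest of the token
STORED=$(printf '%s' "$KEY" | sha256sum | cut -d' ' -f1)

# On each request, the presented token is hashed and the digests compared
PRESENTED=$(printf '%s' "$KEY" | sha256sum | cut -d' ' -f1)
[ "$STORED" = "$PRESENTED" ] && echo "key accepted"
```

A database leak therefore exposes only digests, and since keys are long random strings, they cannot practically be recovered from the hash.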

Rate Limiting

5 failed logins per email per 15-minute window

Error Sanitization

No server paths, no stack traces, no user enumeration

Data Isolation

All queries scoped to authenticated user ID

81 integration tests cover auth, SSRF, rate limiting, error sanitization, and data isolation.

Get Started

Deploy in under a minute

No sign-ups. No subscriptions. No usage limits. Clone, deploy, and scrape — forever free on your own server.

Docker Compose

$ git clone https://github.com/cschweda/forgecrawl
$ cd forgecrawl
$ docker compose up -d
# Visit http://localhost:5150

Bare Metal (PM2)

$ git clone https://github.com/cschweda/forgecrawl
$ cd forgecrawl && pnpm install
$ cp .env.example .env
$ pnpm build && pm2 start ecosystem.config.cjs
# Visit http://localhost:5150

Tech Stack

Nuxt 4 · Vue 3 · SQLite · Drizzle ORM · bcrypt · jose (JWT) · Readability · Turndown · Docker · PM2

Use Cases

Built for real workflows

RAG Pipelines

Scrape hundreds of pages from government, academic, or institutional sites. Get clean Markdown with metadata ready for vector embedding and retrieval.

LLM Context

Grab a long technical doc or policy page and paste the Markdown directly into Claude, GPT, or any LLM prompt. No more broken formatting.

Content Archiving

Archive blog posts, support docs, or product pages as Markdown with YAML frontmatter preserving dates, authors, and URLs.

Training Data

Generate consistently formatted Markdown suitable for fine-tuning or embedding generation from any public website.
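One possible shape for that output: wrap each scraped Markdown file into one line of a JSONL file (jq assumed installed; the `source`/`text` field names are illustrative, not a ForgeCrawl export format):

```shell
# A couple of scraped pages standing in for a real corpus
mkdir -p corpus
printf '# Page One\n\nFirst document body.\n' > corpus/one.md
printf '# Page Two\n\nSecond document body.\n' > corpus/two.md

# Wrap each file's full text in a compact JSON object, one per line
for f in corpus/*.md; do
  jq -Rsc --arg src "$f" '{source: $src, text: .}' "$f"
done > train.jsonl

wc -l < train.jsonl   # one line per source file
```

Most fine-tuning and embedding tooling accepts JSONL directly, so this is often the last step before upload.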

Why Do I Need This?

Real people. Real problems. One tool.

Whether you're building AI products, managing content pipelines, or doing research — if you need clean web data, ForgeCrawl replaces fragile scripts and expensive SaaS.

Engineering Manager
"We needed web data for our AI features but couldn't send customer URLs to third-party APIs. ForgeCrawl runs on our own infra — compliance approved it in a day."
  • Data stays on-premise
  • No per-page costs to budget
  • Simple Docker deploy for the team
Developer
"I was copy-pasting docs into ChatGPT and losing formatting every time. Now I hit one endpoint and get perfect Markdown with frontmatter. It's in my shell aliases."
  • Clean API with Bearer auth
  • Markdown output ready for LLMs
  • curl-friendly — no SDK needed
Project Manager
"Our content team archives 200+ policy pages a quarter. ForgeCrawl replaced a brittle Python script and a $400/mo SaaS subscription."
  • Bulk scraping with caching
  • YAML metadata for organization
  • Free — no vendor lock-in
Researcher
"I'm building a RAG pipeline over government datasets. ForgeCrawl gives me structured Markdown with source URLs and timestamps — exactly what my embeddings need."
  • Consistent output format
  • Source attribution in frontmatter
  • Self-hosted for IRB compliance
Journalist
"I monitor 30+ agency newsrooms for breaking policy changes. ForgeCrawl lets me pull clean text on demand and diff it against last week's version — no more reading raw HTML in dev tools."
  • On-demand scraping of public pages
  • Git-friendly Markdown for diffing
  • No third-party data sharing
Technical Writer
"We migrated 500 pages from a legacy CMS to a docs-as-code workflow. ForgeCrawl scraped every page into Markdown with frontmatter — saved us weeks of manual conversion."
  • Bulk conversion to clean Markdown
  • Preserved headings, tables, and code blocks
  • YAML metadata for CMS import

Roadmap

What's coming next

ForgeCrawl is actively developed. Here's a look at what's on the horizon — including full sitemap crawling, JS rendering, and RAG-ready chunking.

Phase 2

JS Rendering & Document Support

  • Puppeteer engine for SPAs and JS-heavy pages
  • PDF and DOCX to Markdown extraction
  • Configurable storage: database, filesystem, or both
  • Wait-for-selector support for dynamic content

Phase 3

Site Crawling & Job Queue

  • Crawl entire sitemaps or subsections by URL pattern
  • Async job queue with real-time progress tracking
  • Depth control, max pages, include/exclude filters
  • robots.txt compliance and per-domain rate limiting
  • Pause, resume, and cancel active crawls

Phase 4

Multi-User & Usage Tracking

  • Admin user management with role enforcement
  • Per-user usage stats: scrapes, pages, storage
  • Per-user rate limits configurable by admin
  • Auto-generated API documentation

Phase 5

RAG Chunking & Advanced Features

  • Token-aware chunking with semantic boundaries
  • Chunk metadata: heading context, position, token count
  • Login-gated scraping via cookie injection
  • Export to JSON, JSONL, or zipped Markdown
  • Production monitoring and alerting

Want to follow progress or contribute? Star the repo on GitHub