ForgeCrawl — Web content forged into Markdown for LLMs

Self-hosted, authenticated web scraper that converts any webpage into clean, structured Markdown — optimized for RAG pipelines, fine-tuning, and LLM prompts.

100% free & self-hosted — no monthly fees, no usage limits, no API charges. You only pay for your own VPS.
$ git clone https://github.com/ICJIA/forgecrawl && docker compose up -d

Features

Everything you need. Nothing you don't.

Self-Hosted

Your data never leaves your server. No third-party APIs, no usage tracking, no data sharing. You own everything.

Free Forever

No monthly fees. No per-page pricing. No usage caps. No API keys to buy. Self-hosted on your own VPS — the only cost is your server.

Authenticated

Built-in user auth with bcrypt + JWT. API key support for CLI scripting. Rate limiting. SSRF protection.

LLM-Ready Markdown

Clean Markdown with YAML frontmatter. Metadata, word counts, timestamps. Ready to paste into any LLM.

Zero Dependencies

SQLite database. No Redis, no Postgres, no message queue. One Docker command or PM2 start. That's it.

Smart Caching

Configurable result cache with TTL. Bypass on demand. Never scrape the same page twice unless you want to.
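As a sketch of what a cache bypass could look like from the client side, assuming a `bypassCache` flag in the request body (the flag name is our own assumption; check your instance's API docs for the real parameter):

```javascript
// Build a scrape request body, with an optional hypothetical
// cache-bypass flag. Only `url` is confirmed by the docs above;
// `bypassCache` is an assumed parameter name.
function buildScrapeRequest(url, { bypassCache = false } = {}) {
  const body = { url };
  if (bypassCache) body.bypassCache = true; // force a fresh scrape
  return body;
}

// A cached result is fine here:
console.log(JSON.stringify(buildScrapeRequest('https://example.com')));
// Re-scrape on demand, ignoring any cached copy:
console.log(JSON.stringify(buildScrapeRequest('https://example.com', { bypassCache: true })));
```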

How It Works

Three steps to clean Markdown

01

Deploy

Clone the repo. Run docker compose up or pm2 start. Register your admin account.

02

Create an API Key

Log in to the web UI. Generate a Bearer token. Use it in scripts, curl, or any HTTP client.

03

Scrape & Build

POST a URL, get clean Markdown back. Feed it to your RAG pipeline, fine-tuning dataset, or LLM prompt.

Why Markdown?

HTML is for browsers. Markdown is for LLMs.

You could paste raw HTML into an LLM. But you'd be wasting tokens on noise, confusing the model with layout markup, and getting worse results. Here's why Markdown matters.

Raw HTML (~42KB)
<!DOCTYPE html>
<html lang="en" class="dark-mode">
<head>
  <script>window.__NEXT_DATA__={...}</script>
  <link rel="stylesheet" href="/css/main.a8f2.css">
  <!-- 47 more link/script tags -->
</head>
<body>
  <nav class="flex items-center px-4...">
    <!-- 200 lines of navigation -->
  </nav>
  <div id="content" class="prose max-w-none">
    <h1>Getting Started</h1>
    <p>The actual content is buried here...</p>
  </div>
  <footer>...</footer>
  <script src="/js/bundle.f3e1.js"></script>
</body></html>
ForgeCrawl Markdown (~2KB)
---
title: Getting Started
url: https://example.com/docs/start
description: Quick start guide
scraped_at: 2026-03-05T10:30:00Z
scraper: ForgeCrawl/1.0
word_count: 847
---

# Getting Started

The actual content starts immediately.

## Installation

Run the following command:

```bash
npm install example-sdk
```

## Configuration

| Option  | Default | Description      |
|---------|---------|------------------|
| timeout | 30s     | Request timeout  |
| retries | 3       | Max retry count  |

~80% fewer tokens

A typical webpage is 50-200KB of HTML. After stripping nav, ads, scripts, and boilerplate, the Markdown is 2-10KB. That's 80-95% fewer tokens consumed by your LLM — which means lower cost and more room for actual context.

Signal, not noise

Raw HTML is full of <div>, <span>, class names, inline styles, tracking pixels, and aria attributes. None of that is content. Markdown strips it down to headings, paragraphs, lists, links, and code — exactly what the LLM needs to reason about.

Better LLM comprehension

LLMs are trained heavily on Markdown (GitHub, docs, README files). They parse Markdown structure natively — headings map to topics, lists map to enumeration, code blocks map to examples. HTML structure is ambiguous and model-dependent.

Structured metadata

ForgeCrawl adds YAML frontmatter with the source URL, title, description, scrape timestamp, and word count. This metadata is critical for RAG pipelines — you know where every chunk came from and when it was captured.
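For a RAG ingestion step, that frontmatter can be split off with a few lines of code. A minimal sketch that handles only the flat `key: value` pairs shown above, not full YAML:

```javascript
// Split ForgeCrawl-style YAML frontmatter from the Markdown body.
// Handles flat `key: value` lines only; nested YAML would need a
// real parser.
function parseFrontmatter(markdown) {
  const match = markdown.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return { meta: {}, body: markdown };
  const meta = {};
  for (const line of match[1].split('\n')) {
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    meta[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
  }
  return { meta, body: markdown.slice(match[0].length) };
}

const sample = [
  '---',
  'title: Getting Started',
  'url: https://example.com/docs/start',
  'word_count: 847',
  '---',
  '',
  '# Getting Started',
].join('\n');

const { meta } = parseFrontmatter(sample);
console.log(meta.url); // source attribution for each chunk
```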

Consistent format

Every website has different HTML structure. ForgeCrawl normalizes all of them into the same clean Markdown format — headings, paragraphs, lists, tables, code blocks. Your downstream pipeline handles one format, not thousands.

Diffable and versionable

Markdown diffs cleanly in git. HTML doesn't. If you're tracking content changes over time (regulatory pages, docs, policies), Markdown lets you see exactly what changed in a human-readable diff.

API

Built for automation

Every action in ForgeCrawl is available through the REST API. Authenticate with a Bearer token and integrate with any workflow.

Your server, your domain

ForgeCrawl is self-hosted — it runs on your infrastructure, not ours. No account needed.

Scrape a page

$ curl -X POST https://your-server.example.com/api/scrape \
  -H "Authorization: Bearer $FC_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Response

{
  "job_id": "a1b2c3...",
  "title": "Example Domain",
  "markdown": "---\ntitle: Example Domain\n...",
  "wordCount": 42,
  "cached": false
}
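A common next step is writing the returned Markdown to disk. A sketch, using a slugified-title filename that is our own convention, not anything the API prescribes:

```javascript
// Turn a title into a safe filename slug.
function slugify(title) {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // collapse non-alphanumerics to hyphens
    .replace(/^-+|-+$/g, '');    // trim leading/trailing hyphens
}

// Persist a scrape response; returns the filename written.
async function saveScrape(response) {
  const { writeFile } = await import('node:fs/promises');
  const filename = `${slugify(response.title)}.md`;
  await writeFile(filename, response.markdown, 'utf8');
  return filename; // e.g. example-domain.md for "Example Domain"
}
```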

Node.js example

const res = await fetch('https://your-server.example.com/api/scrape', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.FC_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ url: 'https://example.com' }),
})

const { markdown, title, wordCount } = await res.json()
console.log(`Scraped "${title}" (${wordCount} words)`)

All endpoints

GET     https://your-server.example.com/api/health (public)
POST    https://your-server.example.com/api/scrape (Bearer)
GET     https://your-server.example.com/api/scrapes (Bearer)
GET     https://your-server.example.com/api/scrapes/:id (Bearer)
DELETE  https://your-server.example.com/api/scrapes/:id (Bearer)
POST    https://your-server.example.com/api/auth/api-keys (Bearer)
GET     https://your-server.example.com/api/auth/api-keys (Bearer)
DELETE  https://your-server.example.com/api/auth/api-keys/:id (Bearer)
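These endpoints can be wrapped in a small client. A sketch: only the /api/scrape request shape is documented above, so the other calls assume the same Bearer-token JSON pattern, and the method names are our own:

```javascript
// Minimal client for the endpoint list above. URL construction is
// the only part confirmed by the docs; request/response shapes for
// the list/get/delete calls are assumed.
class ForgeCrawlClient {
  constructor(baseUrl, apiKey) {
    this.baseUrl = baseUrl.replace(/\/+$/, ''); // drop trailing slash
    this.apiKey = apiKey;
  }
  url(path) {
    return `${this.baseUrl}/api/${path}`;
  }
  async request(method, path, body) {
    const res = await fetch(this.url(path), {
      method,
      headers: {
        'Authorization': `Bearer ${this.apiKey}`,
        'Content-Type': 'application/json',
      },
      body: body ? JSON.stringify(body) : undefined,
    });
    if (!res.ok) throw new Error(`${method} ${path}: HTTP ${res.status}`);
    return res.json();
  }
  scrape(url) { return this.request('POST', 'scrape', { url }); }
  listScrapes() { return this.request('GET', 'scrapes'); }
  getScrape(id) { return this.request('GET', `scrapes/${id}`); }
  deleteScrape(id) { return this.request('DELETE', `scrapes/${id}`); }
}
```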

Security

Security is not an afterthought

ForgeCrawl scrapes arbitrary URLs from the internet. Every layer is hardened against abuse.

SSRF Protection

Blocks private IP ranges, localhost, and cloud metadata endpoints, with DNS re-validation on redirects

bcrypt + JWT

12-round password hashing, HTTP-only secure cookies, configurable session expiry

API Key Auth

SHA-256 hashed Bearer tokens (fc_...) for CLI and scripting

Rate Limiting

5 failed logins per email per 15-minute window

Error Sanitization

No server paths, no stack traces, no user enumeration

Data Isolation

All queries scoped to authenticated user ID

81 integration tests covering auth, SSRF, rate limiting, error sanitization, and data isolation.

Get Started

Deploy in under a minute

No sign-ups. No subscriptions. No usage limits. Clone, deploy, and scrape — forever free on your own server.

Docker Compose

$ git clone https://github.com/ICJIA/forgecrawl
$ cd forgecrawl
$ docker compose up -d
# Visit http://localhost:5150

Bare Metal (PM2)

$ git clone https://github.com/ICJIA/forgecrawl
$ cd forgecrawl && pnpm install
$ cp .env.example .env
$ pnpm build && pm2 start ecosystem.config.cjs
# Visit http://localhost:5150

Tech Stack

Nuxt 4 · Vue 3 · SQLite · Drizzle ORM · bcrypt · jose JWT · Readability · Turndown · Docker · PM2

Use Cases

Built for real workflows

RAG Pipelines

Scrape hundreds of pages from government, academic, or institutional sites. Get clean Markdown with metadata ready for vector embedding and retrieval.
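A chunking step for such a pipeline might look like this sketch, which splits on ## headings and attaches the source URL to every chunk; the strategy is illustrative, not something ForgeCrawl prescribes:

```javascript
// Split ForgeCrawl Markdown into heading-delimited chunks for
// embedding, carrying the source URL on each chunk for attribution.
function chunkByHeading(markdown, sourceUrl) {
  const chunks = [];
  let current = [];
  for (const line of markdown.split('\n')) {
    if (/^##\s/.test(line) && current.length) {
      chunks.push({ source: sourceUrl, text: current.join('\n').trim() });
      current = [];
    }
    current.push(line);
  }
  if (current.length) {
    chunks.push({ source: sourceUrl, text: current.join('\n').trim() });
  }
  return chunks.filter(c => c.text); // drop empty chunks
}
```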

LLM Context

Grab a long technical doc or policy page and paste the Markdown directly into Claude, GPT, or any LLM prompt. No more broken formatting.

Content Archiving

Archive blog posts, support docs, or product pages as Markdown with YAML frontmatter preserving dates, authors, and URLs.

Training Data

Generate consistently formatted Markdown suitable for fine-tuning or embedding generation from any public website.

Why Do I Need This?

Real people. Real problems. One tool.

Whether you're building AI products, managing content pipelines, or doing research — if you need clean web data, ForgeCrawl replaces fragile scripts and expensive SaaS.

Engineering Manager
"We needed web data for our AI features but couldn't send customer URLs to third-party APIs. ForgeCrawl runs on our own infra — compliance approved it in a day."
  • Data stays on-premise
  • No per-page costs to budget
  • Simple Docker deploy for the team
Developer
"I was copy-pasting docs into ChatGPT and losing formatting every time. Now I hit one endpoint and get perfect Markdown with frontmatter. It's in my shell aliases."
  • Clean API with Bearer auth
  • Markdown output ready for LLMs
  • curl-friendly — no SDK needed
Project Manager
"Our content team archives 200+ policy pages a quarter. ForgeCrawl replaced a brittle Python script and a $400/mo SaaS subscription."
  • Bulk scraping with caching
  • YAML metadata for organization
  • Free — no vendor lock-in
Researcher
"I'm building a RAG pipeline over government datasets. ForgeCrawl gives me structured Markdown with source URLs and timestamps — exactly what my embeddings need."
  • Consistent output format
  • Source attribution in frontmatter
  • Self-hosted for IRB compliance