
Web content forged into Markdown for LLMs
Self-hosted, authenticated web scraper that converts any webpage into clean, structured Markdown — optimized for RAG pipelines, fine-tuning, and LLM prompts.
Features
Everything you need. Nothing you don't.
Self-Hosted
Your data never leaves your server. No third-party APIs, no usage tracking, no data sharing. You own everything.
Free Forever
No monthly fees. No per-page pricing. No usage caps. No API keys to buy. Self-hosted on your own VPS — the only cost is your server.
Authenticated
Built-in user auth with bcrypt + JWT. API key support for CLI scripting. Rate limiting. SSRF protection.
LLM-Ready Markdown
Clean Markdown with YAML frontmatter. Metadata, word counts, timestamps. Ready to paste into any LLM.
Zero Dependencies
SQLite database. No Redis, no Postgres, no message queue. One Docker command or PM2 start. That's it.
Smart Caching
Configurable result cache with TTL. Bypass on demand. Never scrape the same page twice unless you want to.
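The cache behavior described here can be sketched as a small TTL map. This is an illustrative in-memory model only; ForgeCrawl's actual cache is configurable and backed by SQLite, and the `ttlMs` and `bypass` names are assumptions, not its real API.

```javascript
// Minimal sketch of a TTL result cache keyed by URL.
// Illustrative only: names and shape are assumptions, not ForgeCrawl's API.
class ScrapeCache {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;
    this.entries = new Map(); // url -> { markdown, storedAt }
  }
  get(url, { bypass = false } = {}) {
    if (bypass) return null; // caller forces a fresh scrape
    const hit = this.entries.get(url);
    if (!hit) return null;
    if (Date.now() - hit.storedAt > this.ttlMs) {
      this.entries.delete(url); // entry expired, evict it
      return null;
    }
    return hit.markdown;
  }
  set(url, markdown) {
    this.entries.set(url, { markdown, storedAt: Date.now() });
  }
}
```

A `get` miss (expired, bypassed, or never scraped) signals the caller to scrape fresh and `set` the result.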
How It Works
Three steps to clean Markdown
Deploy
Clone the repo. Run docker compose up or pm2 start. Register your admin account.
Create an API Key
Log in to the web UI. Generate a Bearer token. Use it in scripts, curl, or any HTTP client.
Scrape & Build
POST a URL, get clean Markdown back. Feed it to your RAG pipeline, fine-tuning dataset, or LLM prompt.
Why Markdown?
HTML is for browsers. Markdown is for LLMs.
You could paste raw HTML into an LLM. But you'd be wasting tokens on noise, confusing the model with layout markup, and getting worse results. Here's why Markdown matters.
<!DOCTYPE html>
<html lang="en" dir="ltr" class="dark-mode js">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width">
<title>Getting Started | Acme Docs</title>
<meta name="description" content="Quick start guide">
<meta property="og:title" content="Getting Started">
<meta property="og:image" content="/img/og.png">
<link rel="stylesheet" href="/css/main.a8f2c4.css">
<link rel="stylesheet" href="/css/vendor.3e91.css">
<link rel="stylesheet" href="/css/prism.f4a2.css">
<link rel="preconnect" href="https://fonts.gstatic.com">
<link rel="stylesheet" href="https://fonts.googleapis...">
<script defer src="/js/analytics.min.js"></script>
<script>window.__CONFIG__={"theme":"dark"}</script>
<script>!function(e,t){e.dataLayer=e.dataLayer||[]
;function n(){dataLayer.push(arguments)}n("js",
new Date);n("config","G-XXXXXXXX")}()</script>
</head>
<body class="antialiased min-h-screen bg-white">
<div id="__app">
<header class="sticky top-0 z-50 w-full border-b
border-gray-200 bg-white/80 backdrop-blur">
<div class="mx-auto max-w-7xl px-4 sm:px-6">
<div class="flex h-16 items-center justify-between">
<a href="/" class="flex items-center gap-2">
<img src="/logo.svg" alt="Acme" width="32">
<span class="font-bold text-lg">Acme Docs</span>
</a>
<nav class="hidden md:flex items-center gap-6">
<a href="/docs" class="text-sm font-medium
text-gray-600 hover:text-gray-900">Docs</a>
<a href="/api" class="text-sm font-medium
text-gray-600 hover:text-gray-900">API</a>
<a href="/blog" class="text-sm font-medium
text-gray-600 hover:text-gray-900">Blog</a>
<a href="/pricing" class="text-sm font-medium
text-gray-600 hover:text-gray-900">Pricing</a>
</nav>
<div class="flex items-center gap-3">
<button class="rounded-lg p-2 hover:bg-gray-100"
aria-label="Toggle theme">
<svg class="h-5 w-5">...</svg>
</button>
<a href="/login" class="rounded-lg bg-blue-600
px-4 py-2 text-sm text-white">Sign In</a>
</div>
</div>
</div>
</header>
<aside class="fixed left-0 top-16 w-64 border-r
border-gray-200 overflow-y-auto h-[calc(100vh-4rem)]">
<nav class="p-4 space-y-1">
<a href="/docs/intro" class="block rounded-md px-3
py-2 text-sm text-gray-600">Introduction</a>
<a href="/docs/start" class="block rounded-md px-3
py-2 text-sm font-medium bg-blue-50
text-blue-700">Getting Started</a>
<a href="/docs/config" class="block rounded-md px-3
py-2 text-sm text-gray-600">Configuration</a>
<a href="/docs/deploy" class="block rounded-md px-3
py-2 text-sm text-gray-600">Deployment</a>
</nav>
</aside>
<main class="ml-64 pt-16">
<article class="prose prose-blue max-w-none px-8 py-12">
<h1 id="getting-started">Getting Started</h1>
<p>Install the SDK to get started with Acme.</p>
<h2 id="installation">Installation</h2>
<p>Run the following command:</p>
<div class="code-block relative group">
<button class="absolute right-2 top-2 opacity-0
group-hover:opacity-100 rounded bg-gray-700 px-2
py-1 text-xs text-white">Copy</button>
<pre class="language-bash"><code>npm install
@acme/sdk</code></pre>
</div>
<h2 id="configuration">Configuration</h2>
<div class="overflow-x-auto">
<table class="min-w-full divide-y divide-gray-200">
<thead class="bg-gray-50">
<tr>
<th class="px-4 py-3 text-left text-xs
font-medium uppercase text-gray-500">
Option</th>
<th class="px-4 py-3 text-left text-xs
font-medium uppercase text-gray-500">
Default</th>
<th class="px-4 py-3 text-left text-xs
font-medium uppercase text-gray-500">
Description</th>
</tr>
</thead>
<tbody class="divide-y divide-gray-200">
<tr>
<td class="px-4 py-3 text-sm">timeout</td>
<td class="px-4 py-3 text-sm">30s</td>
<td class="px-4 py-3 text-sm">Request
timeout</td>
</tr>
<tr>
<td class="px-4 py-3 text-sm">retries</td>
<td class="px-4 py-3 text-sm">3</td>
<td class="px-4 py-3 text-sm">Max retry
count</td>
</tr>
</tbody>
</table>
</div>
</article>
</main>
<footer class="ml-64 border-t border-gray-200 px-8 py-6">
<div class="flex justify-between text-sm text-gray-500">
<span>© 2026 Acme Inc.</span>
<div class="flex gap-4">
<a href="/privacy">Privacy</a>
<a href="/terms">Terms</a>
</div>
</div>
</footer>
</div>
<script src="/js/vendor.8f3a.js"></script>
<script src="/js/app.2c7e.js"></script>
<script src="/js/prism.min.js"></script>
</body>
</html>

---
title: Getting Started
url: https://docs.acme.com/docs/start
description: Quick start guide
scraped_at: 2026-03-05T10:30:00Z
scraper: ForgeCrawl/1.0
word_count: 42
---

# Getting Started

Install the SDK to get started with Acme.

## Installation

Run the following command:

```bash
npm install @acme/sdk
```

## Configuration

| Option  | Default | Description     |
|---------|---------|-----------------|
| timeout | 30s     | Request timeout |
| retries | 3       | Max retry count |
Same content. 98.7% smaller. Zero noise.
~80% fewer tokens
A typical webpage is 50-200KB of HTML. After stripping nav, ads, scripts, and boilerplate, the Markdown is 2-10KB. That's 80-95% fewer tokens consumed by your LLM — which means lower cost and more room for actual context.
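The arithmetic can be sanity-checked with a rough rule of thumb of about 4 characters per token for English text (actual tokenizers vary by model); the page sizes below are hypothetical examples, not measurements.

```javascript
// Rough token-savings estimate, assuming ~4 characters per token
// (a common rule of thumb for English; real tokenizers vary).
const approxTokens = (bytes) => Math.round(bytes / 4);

const htmlBytes = 120_000; // a typical 120 KB page
const mdBytes = 5_000;     // the extracted Markdown

const saved = 1 - approxTokens(mdBytes) / approxTokens(htmlBytes);
console.log(`${(saved * 100).toFixed(1)}% fewer tokens`); // prints "95.8% fewer tokens"
```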
Signal, not noise
Raw HTML is full of <div>, <span>, class names, inline styles, tracking pixels, and aria attributes. None of that is content. Markdown strips it down to headings, paragraphs, lists, links, and code — exactly what the LLM needs to reason about.
Better LLM comprehension
LLMs are trained heavily on Markdown (GitHub, docs, README files). They parse Markdown structure natively — headings map to topics, lists map to enumeration, code blocks map to examples. HTML structure is ambiguous and model-dependent.
Structured metadata
ForgeCrawl adds YAML frontmatter with the source URL, title, description, scrape timestamp, and word count. This metadata is critical for RAG pipelines — you know where every chunk came from and when it was captured.
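To show how that metadata travels downstream, here is a minimal sketch that splits the frontmatter from a scraped document so a RAG pipeline can attach the source URL to each chunk. It handles only the flat `key: value` frontmatter shown in the example above, not general YAML.

```javascript
// Split ForgeCrawl-style YAML frontmatter from the Markdown body.
// Handles only flat "key: value" lines, not nested YAML.
function splitFrontmatter(doc) {
  const m = doc.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  if (!m) return { meta: {}, body: doc }; // no frontmatter present
  const meta = {};
  for (const line of m[1].split('\n')) {
    const i = line.indexOf(':');
    if (i > 0) meta[line.slice(0, i).trim()] = line.slice(i + 1).trim();
  }
  return { meta, body: m[2].trim() };
}

const scraped = `---
title: Getting Started
url: https://docs.acme.com/docs/start
word_count: 42
---

# Getting Started`;

const { meta, body } = splitFrontmatter(scraped);
// meta.url and meta.title can now travel with every chunk you embed
```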
Consistent format
Every website has different HTML structure. ForgeCrawl normalizes all of them into the same clean Markdown format — headings, paragraphs, lists, tables, code blocks. Your downstream pipeline handles one format, not thousands.
Diffable and versionable
Markdown diffs cleanly in git. HTML doesn't. If you're tracking content changes over time (regulatory pages, docs, policies), Markdown lets you see exactly what changed in a human-readable diff.
API
Built for automation
Every action in ForgeCrawl is available through the REST API. Authenticate with a Bearer token and integrate with any workflow.
Your server, your domain
Enter your VPS domain below — every API example on this page updates in real time.
ForgeCrawl is self-hosted — it runs on your infrastructure, not ours. No account needed.
Scrape a page
$ curl -X POST https://your-server.example.com/api/scrape \
  -H "Authorization: Bearer $FC_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
Response
{
"job_id": "c7f3a1b2-9e4d-4c8a-b5f6-2d1e0a3b4c5d",
"title": "Example Domain",
"markdown": "---\ntitle: Example Domain\nurl: https://example.com\nscraped_at: 2026-03-05T14:22:08Z\nscraper: ForgeCrawl/1.0\nword_count: 28\n---\n\n# Example Domain\n\nThis domain is for use in illustrative examples in\ndocuments. You may use this domain in literature\nwithout prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)",
"rawHtml": "<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>...",
"wordCount": 28,
"metadata": {
"url": "https://example.com",
"canonical": null,
"excerpt": "This domain is for use in illustrative examples in documents.",
"byline": null,
"siteName": null,
"language": "en",
"ogImage": null,
"publishedTime": null,
"scrapedAt": "2026-03-05T14:22:08.431Z"
},
"cached": false
}
Node.js example
const res = await fetch('https://your-server.example.com/api/scrape', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.FC_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ url: 'https://example.com' }),
})

const { markdown, title, wordCount } = await res.json()
console.log(`Scraped "${title}" (${wordCount} words)`)
All endpoints
GET https://your-server.example.com/api/health (public)
Returns server health status and version info.
Request
$ curl https://your-server.example.com/api/health
Response
{
"status": "ok",
"version": "1.0.0",
"uptime": 84623
}
POST https://your-server.example.com/api/scrape (Bearer)
Scrape a URL and return clean Markdown with metadata. Send {"url": "https://..."} in the request body.
Request
$ curl -X POST https://your-server.example.com/api/scrape \
-H "Authorization: Bearer $FC_KEY" \
-H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
Response
{
"job_id": "c7f3a1b2-9e4d-4c8a-...",
"title": "Example Domain",
"markdown": "---\ntitle: Example Domain\n...\n---\n\n# Example Domain\n...",
"rawHtml": "<!doctype html>...",
"wordCount": 28,
"metadata": {
"url": "https://example.com",
"canonical": null,
"excerpt": "This domain is for use in...",
"language": "en",
"scrapedAt": "2026-03-05T14:22:08.431Z"
},
"cached": false
}
GET https://your-server.example.com/api/scrapes (Bearer)
List all scrape jobs for the authenticated user, ordered by most recent.
Request
$ curl https://your-server.example.com/api/scrapes \
  -H "Authorization: Bearer $FC_KEY"
Response
[
{
"id": "c7f3a1b2-9e4d-4c8a-...",
"url": "https://example.com",
"status": "completed",
"createdAt": "2026-03-05T14:22:07Z",
"completedAt": "2026-03-05T14:22:08Z"
},
{
"id": "a8d2e5f1-3b7c-4a9e-...",
"url": "https://docs.acme.com/start",
"status": "completed",
"createdAt": "2026-03-05T14:20:01Z",
"completedAt": "2026-03-05T14:20:03Z"
}
]
GET https://your-server.example.com/api/scrapes/:id (Bearer)
Retrieve full results for a specific scrape job by ID.
Request
$ curl https://your-server.example.com/api/scrapes/c7f3a1b2-9e4d-4c8a-... \
  -H "Authorization: Bearer $FC_KEY"
Response
{
"id": "c7f3a1b2-9e4d-4c8a-...",
"url": "https://example.com",
"status": "completed",
"title": "Example Domain",
"markdown": "---\ntitle: Example Domain\n...---\n\n# Example Domain\n...",
"wordCount": 28,
"metadata": { ... },
"createdAt": "2026-03-05T14:22:07Z",
"completedAt": "2026-03-05T14:22:08Z"
}
DELETE https://your-server.example.com/api/scrapes/:id (Bearer)
Delete a scrape job and its results permanently.
Request
$ curl -X DELETE https://your-server.example.com/api/scrapes/c7f3a1b2-9e4d-4c8a-... \
  -H "Authorization: Bearer $FC_KEY"
Response
{
"message": "Scrape job deleted"
}
POST https://your-server.example.com/api/auth/api-keys (Bearer)
Create a new API key for Bearer token authentication. The key is only shown once.
Request
$ curl -X POST https://your-server.example.com/api/auth/api-keys \
-H "Authorization: Bearer $FC_KEY" \
-H "Content-Type: application/json" \
  -d '{"name": "my-script"}'
Response
{
"id": "e4f5a6b7-8c9d-4e0f-...",
"name": "my-script",
"key": "fc_a1b2c3d4e5f6...",
"createdAt": "2026-03-05T14:25:00Z"
}
GET https://your-server.example.com/api/auth/api-keys (Bearer)
List all API keys for the authenticated user. Keys are masked for security.
Request
$ curl https://your-server.example.com/api/auth/api-keys \
  -H "Authorization: Bearer $FC_KEY"
Response
[
{
"id": "e4f5a6b7-8c9d-4e0f-...",
"name": "my-script",
"lastFour": "f6a1",
"createdAt": "2026-03-05T14:25:00Z",
"lastUsedAt": "2026-03-05T15:10:33Z"
}
]
DELETE https://your-server.example.com/api/auth/api-keys/:id (Bearer)
Revoke an API key permanently. It cannot be recovered.
Request
$ curl -X DELETE https://your-server.example.com/api/auth/api-keys/e4f5a6b7-8c9d-4e0f-... \
  -H "Authorization: Bearer $FC_KEY"
Response
{
"message": "API key revoked"
}
Security
Security is not an afterthought
ForgeCrawl scrapes arbitrary URLs from the internet. Every layer is hardened against abuse.
SSRF Protection
Private IPs, localhost, cloud metadata, DNS re-validation on redirects
bcrypt + JWT
12-round password hashing, HTTP-only secure cookies, configurable session expiry
API Key Auth
SHA-256 hashed Bearer tokens (fc_...) for CLI and scripting
Rate Limiting
5 failed logins per email per 15-minute window
Error Sanitization
No server paths, no stack traces, no user enumeration
Data Isolation
All queries scoped to authenticated user ID
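The SSRF guard above can be sketched as a private-address check: refuse any URL whose host resolves into a private or link-local range. The ranges below cover common IPv4 cases only; a production guard must also handle IPv6, DNS rebinding, and re-validation after every redirect, as the card above describes.

```javascript
// Sketch of an SSRF guard: flag private / link-local IPv4 addresses.
// IPv4 only; illustrative, not ForgeCrawl's actual implementation.
function isPrivateIPv4(ip) {
  const parts = ip.split('.').map(Number);
  if (parts.length !== 4 || parts.some((n) => Number.isNaN(n) || n < 0 || n > 255)) {
    return true; // not a clean IPv4 literal: refuse rather than guess
  }
  const [a, b] = parts;
  return (
    a === 10 ||                          // 10.0.0.0/8
    a === 127 ||                         // loopback
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) ||          // 192.168.0.0/16
    (a === 169 && b === 254)             // link-local, incl. 169.254.169.254 cloud metadata
  );
}
```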
81 integration tests covering auth, SSRF, rate limiting, error sanitization, and data isolation. Explore the API →
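The rate-limiting rule above (5 failed logins per email per 15-minute window) can be sketched as a sliding window over failure timestamps. In-memory and illustrative only, not ForgeCrawl's actual limiter:

```javascript
// Sliding-window login rate limit: at most MAX_FAILURES failed attempts
// per email within WINDOW_MS. Illustrative sketch only.
const WINDOW_MS = 15 * 60 * 1000;
const MAX_FAILURES = 5;
const failures = new Map(); // email -> timestamps of failed logins

function recordFailure(email, now = Date.now()) {
  const recent = (failures.get(email) || []).filter((t) => now - t < WINDOW_MS);
  recent.push(now);
  failures.set(email, recent);
}

function isLocked(email, now = Date.now()) {
  const recent = (failures.get(email) || []).filter((t) => now - t < WINDOW_MS);
  return recent.length >= MAX_FAILURES;
}
```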
Get Started
Deploy in under a minute
No sign-ups. No subscriptions. No usage limits. Clone, deploy, and scrape — forever free on your own server.
Docker Compose
$ git clone https://github.com/cschweda/forgecrawl
$ cd forgecrawl
$ docker compose up -d
# Visit http://localhost:5150
Bare Metal (PM2)
$ git clone https://github.com/cschweda/forgecrawl
$ cd forgecrawl && pnpm install
$ cp .env.example .env
$ pnpm build && pm2 start ecosystem.config.cjs
# Visit http://localhost:5150
Tech Stack
Use Cases
Built for real workflows
RAG Pipelines
Scrape hundreds of pages from government, academic, or institutional sites. Get clean Markdown with metadata ready for vector embedding and retrieval.
LLM Context
Grab a long technical doc or policy page and paste the Markdown directly into Claude, GPT, or any LLM prompt. No more broken formatting.
Content Archiving
Archive blog posts, support docs, or product pages as Markdown with YAML frontmatter preserving dates, authors, and URLs.
Training Data
Generate consistently formatted Markdown suitable for fine-tuning or embedding generation from any public website.
Why Do I Need This?
Real people. Real problems. One tool.
Whether you're building AI products, managing content pipelines, or doing research — if you need clean web data, ForgeCrawl replaces fragile scripts and expensive SaaS.
"We needed web data for our AI features but couldn't send customer URLs to third-party APIs. ForgeCrawl runs on our own infra — compliance approved it in a day."
- Data stays on-premise
- No per-page costs to budget
- Simple Docker deploy for the team
"I was copy-pasting docs into ChatGPT and losing formatting every time. Now I hit one endpoint and get perfect Markdown with frontmatter. It's in my shell aliases."
- Clean API with Bearer auth
- Markdown output ready for LLMs
- curl-friendly — no SDK needed
"Our content team archives 200+ policy pages a quarter. ForgeCrawl replaced a brittle Python script and a $400/mo SaaS subscription."
- Bulk scraping with caching
- YAML metadata for organization
- Free — no vendor lock-in
"I'm building a RAG pipeline over government datasets. ForgeCrawl gives me structured Markdown with source URLs and timestamps — exactly what my embeddings need."
- Consistent output format
- Source attribution in frontmatter
- Self-hosted for IRB compliance
"I monitor 30+ agency newsrooms for breaking policy changes. ForgeCrawl lets me pull clean text on demand and diff it against last week's version — no more reading raw HTML in dev tools."
- On-demand scraping of public pages
- Git-friendly Markdown for diffing
- No third-party data sharing
"We migrated 500 pages from a legacy CMS to a docs-as-code workflow. ForgeCrawl scraped every page into Markdown with frontmatter — saved us weeks of manual conversion."
- Bulk conversion to clean Markdown
- Preserved headings, tables, and code blocks
- YAML metadata for CMS import