
Web content forged into Markdown for LLMs
Self-hosted, authenticated web scraper that converts any webpage into clean, structured Markdown — optimized for RAG pipelines, fine-tuning, and LLM prompts.
Features
Everything you need. Nothing you don't.
Self-Hosted
Your data never leaves your server. No third-party APIs, no usage tracking, no data sharing. You own everything.
Free Forever
No monthly fees. No per-page pricing. No usage caps. No API keys to buy. Self-hosted on your own VPS — the only cost is your server.
Authenticated
Built-in user auth with bcrypt + JWT. API key support for CLI scripting. Rate limiting. SSRF protection.
LLM-Ready Markdown
Clean Markdown with YAML frontmatter. Metadata, word counts, timestamps. Ready to paste into any LLM.
Zero Dependencies
Just an embedded SQLite database. No Redis, no Postgres, no message queue. One Docker command or PM2 start. That's it.
Smart Caching
Configurable result cache with TTL. Bypass on demand. Never scrape the same page twice unless you want to.
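The TTL-plus-bypass idea above can be sketched in a few lines of JavaScript. This is an illustrative sketch of the caching concept, not ForgeCrawl's actual implementation; the class name and `bypass` option are invented for the example.

```javascript
// Minimal TTL cache sketch: store a scrape result with a timestamp,
// serve it until the TTL expires, and allow an explicit bypass.
class TtlCache {
  constructor(ttlMs) {
    this.ttlMs = ttlMs
    this.entries = new Map()
  }
  get(url, { bypass = false } = {}) {
    if (bypass) return undefined // caller forced a fresh scrape
    const entry = this.entries.get(url)
    if (!entry) return undefined
    if (Date.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(url) // expired: evict and miss
      return undefined
    }
    return entry.markdown
  }
  set(url, markdown) {
    this.entries.set(url, { markdown, storedAt: Date.now() })
  }
}
```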
How It Works
Three steps to clean Markdown
Deploy
Clone the repo. Run docker compose up or pm2 start. Register your admin account.
Create an API Key
Log in to the web UI. Generate a Bearer token. Use it in scripts, curl, or any HTTP client.
Scrape & Build
POST a URL, get clean Markdown back. Feed it to your RAG pipeline, fine-tuning dataset, or LLM prompt.
Why Markdown?
HTML is for browsers. Markdown is for LLMs.
You could paste raw HTML into an LLM. But you'd be wasting tokens on noise, confusing the model with layout markup, and getting worse results. Here's why Markdown matters.
<!DOCTYPE html>
<html lang="en" class="dark-mode">
<head>
<script>window.__NEXT_DATA__={...}</script>
<link rel="stylesheet" href="/css/main.a8f2.css">
<!-- 47 more link/script tags -->
</head>
<body>
<nav class="flex items-center px-4...">
<!-- 200 lines of navigation -->
</nav>
<div id="content" class="prose max-w-none">
<h1>Getting Started</h1>
<p>The actual content is buried here...</p>
</div>
<footer>...</footer>
<script src="/js/bundle.f3e1.js"></script>
</body></html>

---
title: Getting Started
url: https://example.com/docs/start
description: Quick start guide
scraped_at: 2026-03-05T10:30:00Z
scraper: ForgeCrawl/1.0
word_count: 847
---

# Getting Started

The actual content starts immediately.

## Installation

Run the following command:

```bash
npm install example-sdk
```

## Configuration

| Option  | Default | Description     |
|---------|---------|-----------------|
| timeout | 30s     | Request timeout |
| retries | 3       | Max retry count |
~80% fewer tokens
A typical webpage is 50-200KB of HTML. After stripping nav, ads, scripts, and boilerplate, the Markdown is 2-10KB. That's 80-95% fewer tokens consumed by your LLM — which means lower cost and more room for actual context.
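The arithmetic behind that claim can be made concrete. The sketch below uses the common rule of thumb of roughly 4 bytes per token for English text; it is an estimate, not a real tokenizer.

```javascript
// Rough token-savings estimate: tokens ~ bytes / 4 for English text.
// 'bytesPerToken' is a rule of thumb, not an exact tokenizer count.
function tokenSavings(htmlBytes, markdownBytes, bytesPerToken = 4) {
  const htmlTokens = Math.round(htmlBytes / bytesPerToken)
  const mdTokens = Math.round(markdownBytes / bytesPerToken)
  const savedPct = Math.round((1 - mdTokens / htmlTokens) * 100)
  return { htmlTokens, mdTokens, savedPct }
}

// A 120 KB page that reduces to 6 KB of Markdown:
console.log(tokenSavings(120_000, 6_000)) // { htmlTokens: 30000, mdTokens: 1500, savedPct: 95 }
```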
Signal, not noise
Raw HTML is full of <div>, <span>, class names, inline styles, tracking pixels, and aria attributes. None of that is content. Markdown strips it down to headings, paragraphs, lists, links, and code — exactly what the LLM needs to reason about.
Better LLM comprehension
LLMs are trained heavily on Markdown (GitHub, docs, README files). They parse Markdown structure natively — headings map to topics, lists map to enumeration, code blocks map to examples. HTML structure is ambiguous and model-dependent.
Structured metadata
ForgeCrawl adds YAML frontmatter with the source URL, title, description, scrape timestamp, and word count. This metadata is critical for RAG pipelines — you know where every chunk came from and when it was captured.
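A RAG pipeline consuming this output needs to split the frontmatter from the body. The helper below handles the simple `key: value` lines shown in the example output; it is an illustrative consumer-side snippet, not part of ForgeCrawl.

```javascript
// Split a scraped document into its YAML frontmatter and Markdown body.
// Handles flat 'key: value' lines, which is all the frontmatter uses.
function parseFrontmatter(doc) {
  const match = doc.match(/^---\n([\s\S]*?)\n---\n?/)
  if (!match) return { meta: {}, body: doc }
  const meta = {}
  for (const line of match[1].split('\n')) {
    const i = line.indexOf(':')
    // Split on the first colon only, so URL values keep their colons.
    if (i > 0) meta[line.slice(0, i).trim()] = line.slice(i + 1).trim()
  }
  return { meta, body: doc.slice(match[0].length) }
}
```

Each chunk you embed can then carry `meta.url` and `meta.scraped_at` alongside the text, which is what makes retrieval results attributable.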
Consistent format
Every website has different HTML structure. ForgeCrawl normalizes all of them into the same clean Markdown format — headings, paragraphs, lists, tables, code blocks. Your downstream pipeline handles one format, not thousands.
Diffable and versionable
Markdown diffs cleanly in git. HTML doesn't. If you're tracking content changes over time (regulatory pages, docs, policies), Markdown lets you see exactly what changed in a human-readable diff.
API
Built for automation
Every action in ForgeCrawl is available through the REST API. Authenticate with a Bearer token and integrate with any workflow.
Your server, your domain
ForgeCrawl is self-hosted — it runs on your infrastructure, not ours. No account needed.
Scrape a page
$ curl -X POST https://your-server.example.com/api/scrape \
    -H "Authorization: Bearer $FC_KEY" \
    -H "Content-Type: application/json" \
    -d '{"url": "https://example.com"}'
Response
{
"job_id": "a1b2c3...",
"title": "Example Domain",
"markdown": "---\ntitle: Example Domain\n...",
"wordCount": 42,
"cached": false
}

Node.js example
const res = await fetch('https://your-server.example.com/api/scrape', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${process.env.FC_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ url: 'https://example.com' }),
})
const { markdown, title, wordCount } = await res.json()
console.log(`Scraped "${title}" (${wordCount} words)`)
All endpoints
https://your-server.example.com/api/health (public)
https://your-server.example.com/api/scrape (Bearer)
https://your-server.example.com/api/scrapes (Bearer)
https://your-server.example.com/api/scrapes/:id (Bearer)
https://your-server.example.com/api/auth/api-keys (Bearer)
https://your-server.example.com/api/auth/api-keys/:id (Bearer)
Security
Security is not an afterthought
ForgeCrawl scrapes arbitrary URLs from the internet. Every layer is hardened against abuse.
SSRF Protection
Private IPs, localhost, cloud metadata, DNS re-validation on redirects
bcrypt + JWT
12-round password hashing, HTTP-only secure cookies, configurable session expiry
API Key Auth
SHA-256 hashed Bearer tokens (fc_...) for CLI and scripting
Rate Limiting
5 failed logins per email per 15-minute window
Error Sanitization
No server paths, no stack traces, no user enumeration
Data Isolation
All queries scoped to authenticated user ID
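The SSRF guard described above can be sketched as a private-range check. This is an illustration of the idea only, not ForgeCrawl's actual code: a real guard must also resolve DNS and re-validate every redirect, as the feature list notes.

```javascript
// Illustrative SSRF guard: block IPv4 targets in private, loopback, and
// link-local ranges (169.254.169.254 is the cloud metadata endpoint).
function isBlockedIpv4(ip) {
  const parts = ip.split('.').map(Number)
  // Anything that is not a well-formed IPv4 address is rejected outright.
  if (parts.length !== 4 || parts.some(n => Number.isNaN(n) || n < 0 || n > 255)) return true
  const [a, b] = parts
  if (a === 10) return true                        // 10.0.0.0/8
  if (a === 172 && b >= 16 && b <= 31) return true // 172.16.0.0/12
  if (a === 192 && b === 168) return true          // 192.168.0.0/16
  if (a === 127) return true                       // loopback
  if (a === 169 && b === 254) return true          // link-local / cloud metadata
  return false
}
```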
81 integration tests covering auth, SSRF, rate limiting, error sanitization, and data isolation.
Get Started
Deploy in under a minute
No sign-ups. No subscriptions. No usage limits. Clone, deploy, and scrape — forever free on your own server.
Docker Compose
$ git clone https://github.com/ICJIA/forgecrawl
$ cd forgecrawl
$ docker compose up -d
# Visit http://localhost:5150
Bare Metal (PM2)
$ git clone https://github.com/ICJIA/forgecrawl
$ cd forgecrawl && pnpm install
$ cp .env.example .env
$ pnpm build && pm2 start ecosystem.config.cjs
# Visit http://localhost:5150
Tech Stack
Use Cases
Built for real workflows
RAG Pipelines
Scrape hundreds of pages from government, academic, or institutional sites. Get clean Markdown with metadata ready for vector embedding and retrieval.
LLM Context
Grab a long technical doc or policy page and paste the Markdown directly into Claude, GPT, or any LLM prompt. No more broken formatting.
Content Archiving
Archive blog posts, support docs, or product pages as Markdown with YAML frontmatter preserving dates, authors, and URLs.
Training Data
Generate consistently formatted Markdown suitable for fine-tuning or embedding generation from any public website.
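For the RAG and training-data use cases above, a typical pre-embedding step is splitting the scraped Markdown into heading-delimited chunks while keeping source attribution. The function below is an illustrative downstream step, not part of ForgeCrawl itself.

```javascript
// Split scraped Markdown into heading-delimited chunks, tagging each
// chunk with the source URL so retrieval results keep their attribution.
function chunkByHeadings(markdown, sourceUrl) {
  const chunks = []
  let current = []
  for (const line of markdown.split('\n')) {
    // An ATX heading (# .. ######) starts a new chunk, unless we are
    // still at the very beginning of the document.
    if (/^#{1,6}\s/.test(line) && current.length) {
      chunks.push(current.join('\n'))
      current = []
    }
    current.push(line)
  }
  if (current.length) chunks.push(current.join('\n'))
  return chunks.map(text => ({ text, source: sourceUrl }))
}
```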
Why Do I Need This?
Real people. Real problems. One tool.
Whether you're building AI products, managing content pipelines, or doing research — if you need clean web data, ForgeCrawl replaces fragile scripts and expensive SaaS.
"We needed web data for our AI features but couldn't send customer URLs to third-party APIs. ForgeCrawl runs on our own infra — compliance approved it in a day."
- Data stays on-premise
- No per-page costs to budget
- Simple Docker deploy for the team
"I was copy-pasting docs into ChatGPT and losing formatting every time. Now I hit one endpoint and get perfect Markdown with frontmatter. It's in my shell aliases."
- Clean API with Bearer auth
- Markdown output ready for LLMs
- curl-friendly — no SDK needed
"Our content team archives 200+ policy pages a quarter. ForgeCrawl replaced a brittle Python script and a $400/mo SaaS subscription."
- Bulk scraping with caching
- YAML metadata for organization
- Free — no vendor lock-in
"I'm building a RAG pipeline over government datasets. ForgeCrawl gives me structured Markdown with source URLs and timestamps — exactly what my embeddings need."
- Consistent output format
- Source attribution in frontmatter
- Self-hosted for IRB compliance