Web data ingestion for LlamaIndex with Piloterr
Load live web content into LlamaIndex pipelines via Piloterr REST APIs. Structured JSON and Markdown from protected sites, ready for chunking, embedding, and retrieval.
- Custom readers and tools over Piloterr endpoints
- Clean JSON/Markdown, no HTML cleanup step
- Anti-bot bypass for production RAG
- Works with any vector store LlamaIndex supports
At a glance
- Readers: custom loaders
- JSON: structured input
- 400+ web sources
- REST: HTTP API
Why connect LlamaIndex
Custom data loaders
Build LlamaIndex readers that fetch pages via Piloterr and return Document objects with clean text and metadata.
Query engines
Combine scraped data with LlamaIndex query engines for grounded Q&A over live web content.
Skip HTML parsing
Piloterr returns structured fields (title, body, price, metadata), so no BeautifulSoup preprocessing is needed.
Production reliability
Anti-bot bypass and managed proxies mean your ingestion pipeline doesn't break when targets add Cloudflare protection.
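As a rough illustration of working with structured fields instead of HTML, the sketch below splits one record into embeddable text and filterable metadata. The field names (title, body, price, metadata) come from the example list above; the exact response shape is an assumption and varies by endpoint, and `to_text_and_metadata` is a hypothetical helper, not part of Piloterr or LlamaIndex.

```python
from typing import Any

# Assumed field names, taken from the examples listed above;
# real responses vary by endpoint.
TEXT_FIELDS = ("title", "body")


def to_text_and_metadata(item: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Split one structured record into embeddable text and filterable metadata."""
    # Join the non-empty text fields; missing or blank fields are skipped.
    parts = [str(item[f]).strip() for f in TEXT_FIELDS if item.get(f)]
    text = "\n\n".join(p for p in parts if p)
    # Everything that is not body text becomes metadata, flattened one level.
    metadata = {
        k: v for k, v in item.items() if k not in TEXT_FIELDS and k != "metadata"
    }
    nested = item.get("metadata")
    if isinstance(nested, dict):
        metadata.update(nested)
    return text, metadata


record = {
    "title": "Trail Shoe X",
    "body": "Lightweight shoe for rocky terrain.",
    "price": 129.0,
    "metadata": {"url": "https://example.com/trail-shoe-x"},
}
text, metadata = to_text_and_metadata(record)
```

The resulting pair can be passed straight to `Document(text=text, metadata=metadata)`, which keeps values such as price available for metadata filtering rather than burying them in the embedded text.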
LlamaIndex + Piloterr patterns
From one-off research to scheduled index refresh.
Document ingestion
Fetch JSON, map fields to Document text and metadata, and index into a vector store.
Scheduled refresh
Cron or workflow triggers re-scrape and upsert changed documents.
Multi-source indexes
Combine SERP, news, and product data in a single LlamaIndex index.
Tool-augmented query
Query engines call Piloterr on-the-fly for questions needing fresh data.
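For the multi-source pattern above, one possible approach is to tag every record with where it came from and drop duplicate URLs before building Documents. This is a minimal sketch: `merge_sources`, the `source_type` key, and the `url` field are illustrative assumptions, not names from the Piloterr API.

```python
from typing import Any, Iterable


def merge_sources(sources: dict[str, Iterable[dict[str, Any]]]) -> list[dict[str, Any]]:
    """Merge records from several sources, tagging each and dropping duplicate URLs."""
    merged: list[dict[str, Any]] = []
    seen_urls: set[str] = set()
    for source_type, items in sources.items():
        for item in items:
            url = item.get("url")
            # Keep the first record seen for a URL; records without a URL are always kept.
            if url is not None:
                if url in seen_urls:
                    continue
                seen_urls.add(url)
            merged.append({**item, "source_type": source_type})
    return merged


records = merge_sources({
    "serp": [{"url": "https://example.com/a", "title": "A"}],
    "news": [
        {"url": "https://example.com/a", "title": "A (news copy)"},
        {"url": "https://example.com/b", "title": "B"},
    ],
    "product": [{"title": "No URL item"}],
})
```

Storing `source_type` in Document metadata makes it possible to restrict a query to one source later, for example news only.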
Why not use SimpleWebPageReader alone?
| Concern | DIY | Piloterr |
|---|---|---|
| SimpleWebPageReader | Blocked on protected sites | Managed bypass |
| Raw HTML | Noisy chunks, poor retrieval | Structured text fields |
| JS-heavy SPAs | Empty content | Headless rendering |
| Maintenance | Per-site scraper logic | 400+ managed endpoints |
Connect LlamaIndex in four steps
Step 1
Install LlamaIndex
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai requests
Step 2
Get your API key
Set PILOTERR_API_KEY in your environment.
Step 3
Create a custom reader
Subclass BaseReader or use a function that calls Piloterr and returns Documents.
Step 4
Build index and query
Call VectorStoreIndex.from_documents(), create a query engine with index.as_query_engine(), then call query_engine.query().
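Steps 2 to 4 can be sketched roughly as follows. `load_api_key` and `build_query_engine` are hypothetical helper names. The index is built with LlamaIndex default settings, which are assumed here to use the OpenAI packages installed in step 1 and therefore to need `OPENAI_API_KEY` as well.

```python
import os


def load_api_key(env_var: str = "PILOTERR_API_KEY") -> str:
    """Read the API key from the environment and fail early with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running ingestion.")
    return key


def build_query_engine(documents):
    """Index the given Documents and return a query engine (step 4)."""
    # Imported lazily so the key check above works without LlamaIndex installed.
    from llama_index.core import VectorStoreIndex

    # Assumption: default settings, which rely on OpenAI models and OPENAI_API_KEY.
    index = VectorStoreIndex.from_documents(documents)
    return index.as_query_engine()
```

Given a list of Documents from a custom reader, usage would look like `engine = build_query_engine(docs)` followed by `engine.query("What changed in competitor pricing?")`.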
Workflow recipes
Competitive intel index
Daily scrape of competitor pages → chunk → embed → Q&A over pricing and features.
News monitoring RAG
Google News reader refreshes index hourly for industry keyword tracking.
Product catalog search
E-commerce API data indexed for semantic product discovery.
Help center Q&A index
Ingest help docs via Piloterr reader, refresh nightly, and power semantic search for support.
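The recipes above all depend on a periodic refresh. One way to avoid re-embedding unchanged pages is to keep a hash of each document's text between runs and upsert only what differs. This is a sketch under stated assumptions: the `url -> hash` state store and the `plan_refresh` helper are hypothetical, and applying the upserts and deletions to a vector store is left out.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(
    previous: dict[str, str], scraped: dict[str, str]
) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare stored hashes with a fresh scrape.

    previous: url -> hash stored after the last run
    scraped:  url -> text from the current run
    Returns (urls to upsert, urls to delete, new url -> hash state).
    """
    current = {url: content_hash(text) for url, text in scraped.items()}
    # New or changed pages need re-embedding; unchanged pages are skipped.
    to_upsert = [url for url, h in current.items() if previous.get(url) != h]
    # Pages that were indexed before but are absent from this scrape.
    to_delete = [url for url in previous if url not in current]
    return to_upsert, to_delete, current


first = plan_refresh({}, {
    "https://example.com/pricing": "Pro plan: $20",
    "https://example.com/features": "SSO",
})
second = plan_refresh(first[2], {
    "https://example.com/pricing": "Pro plan: $25",
    "https://example.com/features": "SSO",
})
```

On the second run only the pricing page is scheduled for upsert. Treat the deletion list with care: a page missing because a request failed is not the same as a page that was removed.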
LlamaIndex vs LangChain vs CrewAI
| Scenario | Recommendation |
|---|---|
| RAG and document indexing | LlamaIndex |
| Tool-calling agents | LangChain |
| Multi-agent teams | CrewAI |
| Simple HTTP ETL | Python SDK |
LlamaIndex reader example
Load Google News articles into a vector index via Piloterr.
import os

import requests
from llama_index.core import Document
from llama_index.core.readers.base import BaseReader


class PiloterrNewsReader(BaseReader):
    """Load Google News results from Piloterr as LlamaIndex Documents."""

    def __init__(self, api_key: str | None = None):
        # Fall back to the PILOTERR_API_KEY environment variable.
        self.api_key = api_key or os.environ["PILOTERR_API_KEY"]
        self.base = "https://api.piloterr.com/v2"

    def load_data(self, query: str, location: str = "Paris, FR") -> list[Document]:
        response = requests.post(
            f"{self.base}/google/news",
            headers={"x-api-key": self.api_key, "Content-Type": "application/json"},
            json={"query": query, "location": location, "page": 1},
            timeout=60,
        )
        response.raise_for_status()
        data = response.json()

        # One Document per result: title and snippet as text, URL and source as metadata.
        docs = []
        for item in data.get("organic_results", []):
            docs.append(Document(
                text=f"{item.get('title', '')}\n\n{item.get('snippet', '')}",
                metadata={"url": item.get("link"), "source": item.get("source")},
            ))
        return docs

Transparent credit pricing
Pay only for successful requests. Start with 500 credits, then scale with plans from $49/mo.
| Plan | Price | Credits |
|---|---|---|
| Premium | $49/mo | 18,000 |
| Premium+ | $99/mo | 40,000 |
| Startup | $249/mo | 110,000 |