Can LlamaIndex ingest Markdown from Piloterr?

Yes. Use endpoints that return Markdown-friendly fields or map JSON text fields directly to Document objects.
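As a rough sketch of the field-mapping route, the helper below splits one JSON record into the text and metadata that a Document takes. The field names (`title`, `body`, `url`, `price`) and the function name `record_to_document_fields` are illustrative assumptions, not a documented Piloterr schema.

```python
from typing import Any


def record_to_document_fields(record: dict[str, Any]) -> dict[str, Any]:
    """Split one JSON record into Document-style text and metadata.

    Field names are hypothetical; adjust them to the endpoint you call.
    """
    # Text fields are joined into the body that will be chunked and embedded.
    parts = [record.get("title", ""), record.get("body", "")]
    text = "\n\n".join(p.strip() for p in parts if p and p.strip())

    # Remaining fields become metadata; None values are dropped.
    metadata = {
        key: value
        for key, value in record.items()
        if key not in ("title", "body") and value is not None
    }
    return {"text": text, "metadata": metadata}


fields = record_to_document_fields(
    {"title": "Widget", "body": "A small widget.", "url": "https://example.com/w", "price": None}
)
# With LlamaIndex installed: Document(text=fields["text"], metadata=fields["metadata"])
```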

How often should I refresh the index?

Match refresh frequency to how fresh the data needs to be: hourly for news, daily for pricing. Use scheduled jobs that call Piloterr and then upsert only the documents that changed.
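One way to encode the cadence guidance above is a small interval table that a scheduler consults before each run. The table contents and the name `is_refresh_due` are assumptions for illustration; any cron or workflow tool could play the same role.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cadence table mirroring the guidance above.
REFRESH_INTERVALS = {
    "news": timedelta(hours=1),
    "pricing": timedelta(days=1),
}


def is_refresh_due(source: str, last_run: datetime, now: datetime) -> bool:
    """Return True when the source's interval has elapsed since last_run."""
    return now - last_run >= REFRESH_INTERVALS[source]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_run = now - timedelta(hours=2)
news_due = is_refresh_due("news", last_run, now)        # two hours exceeds one hour
pricing_due = is_refresh_due("pricing", last_run, now)  # two hours is under one day
```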

Does this work with LlamaIndex Cloud?

Yes. Custom readers run in your pipeline regardless of where the index is hosted.

Web data ingestion for LlamaIndex with Piloterr

Load live web content into LlamaIndex pipelines via Piloterr REST APIs. Structured JSON and Markdown from protected sites, ready for chunking, embedding, and retrieval.

Custom readers and tools over Piloterr endpoints
Clean JSON/Markdown, no HTML cleanup step
Anti-bot bypass for production RAG
Works with any vector store LlamaIndex supports

At a glance

Readers: custom loaders
JSON: structured input
400+ web sources
REST: HTTP API

Why connect LlamaIndex

Custom data loaders
Build LlamaIndex readers that fetch pages via Piloterr and return Document objects with clean text and metadata.
Query engines
Combine scraped data with LlamaIndex query engines for grounded Q&A over live web content.
Skip HTML parsing
Piloterr returns structured fields (title, body, price, metadata), so no BeautifulSoup preprocessing is needed.
Production reliability
Anti-bot bypass and managed proxies mean your ingestion pipeline doesn't break when targets add Cloudflare.

LlamaIndex + Piloterr patterns

From one-off research to scheduled index refresh.

Document ingestion
Fetch JSON, map fields to Document text and metadata, and index into a vector store.
Scheduled refresh
Cron or workflow triggers re-scrape and upsert changed documents.
Multi-source indexes
Combine SERP, news, and product data in a single LlamaIndex index.
Tool-augmented query
Query engines call Piloterr on-the-fly for questions needing fresh data.
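The scheduled-refresh pattern depends on knowing which documents actually changed. A minimal sketch, assuming each scraped record carries a `url` and a `text` field (both hypothetical names), is to fingerprint the text and compare it with the hash stored on the previous run:

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(records: list[dict], seen: dict[str, str]) -> list[dict]:
    """Return records whose text is new or differs from the last run.

    `seen` maps a document id (here the URL) to the hash stored last time
    and is updated in place.
    """
    changed = []
    for record in records:
        digest = content_hash(record["text"])
        if seen.get(record["url"]) != digest:
            changed.append(record)
            seen[record["url"]] = digest
    return changed


seen_hashes: dict[str, str] = {}
first_run = select_changed([{"url": "https://example.com/a", "text": "v1"}], seen_hashes)
second_run = select_changed([{"url": "https://example.com/a", "text": "v1"}], seen_hashes)
third_run = select_changed([{"url": "https://example.com/a", "text": "v2"}], seen_hashes)
```

Only the records returned by `select_changed` need to be re-embedded and upserted. In practice `seen` would be persisted between runs, for example in a small database or a JSON file.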

Why not use SimpleWebPageReader alone?

Challenge	DIY	Piloterr
SimpleWebPageReader	Blocked on protected sites	Managed bypass
Raw HTML	Noisy chunks, poor retrieval	Structured text fields
JS-heavy SPAs	Empty content	Headless rendering
Maintenance	Per-site scraper logic	400+ managed endpoints

Connect LlamaIndex in four steps

Step 1: Install LlamaIndex
pip install llama-index llama-index-llms-openai llama-index-embeddings-openai requests
Step 2: Get your API key
Set PILOTERR_API_KEY in your environment.
Step 3: Create a custom reader
Subclass BaseReader or use a function that calls Piloterr and returns Documents.
Step 4: Build index and query
Call VectorStoreIndex.from_documents(), then index.as_query_engine() and query_engine.query().

Workflow recipes

Competitive intel index
Daily scrape of competitor pages → chunk → embed → Q&A over pricing and features.
News monitoring RAG
A Google News reader refreshes the index hourly to track industry keywords.
Product catalog search
E-commerce API data indexed for semantic product discovery.
Help center Q&A index
Ingest help docs via Piloterr reader, refresh nightly, and power semantic search for support.
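The "chunk" step in these recipes is normally handled by a LlamaIndex node parser. To show what it does, here is a deliberately naive stand-in that splits text into overlapping word windows; the function name `chunk_words` and its defaults are assumptions for illustration only.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_words("one two three four five six seven", chunk_size=4, overlap=1)
```

The overlap repeats a few words across neighbouring chunks so that a sentence cut at a boundary still appears whole in at least one of them.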

LlamaIndex vs LangChain vs CrewAI

Scenario	Recommendation
RAG and document indexing	LlamaIndex
Tool-calling agents	LangChain
Multi-agent teams	CrewAI
Simple HTTP ETL	Python SDK

LlamaIndex reader example

Load Google News articles into a vector index via Piloterr.

import os
import requests
from llama_index.core import Document
from llama_index.core.readers.base import BaseReader

class PiloterrNewsReader(BaseReader):
    """Load Google News results from Piloterr as LlamaIndex Documents."""

    def __init__(self, api_key: str | None = None):
        # Fall back to the PILOTERR_API_KEY environment variable.
        self.api_key = api_key or os.environ["PILOTERR_API_KEY"]
        self.base = "https://api.piloterr.com/v2"

    def load_data(self, query: str, location: str = "Paris, FR") -> list[Document]:
        response = requests.post(
            f"{self.base}/google/news",
            headers={"x-api-key": self.api_key, "Content-Type": "application/json"},
            json={"query": query, "location": location, "page": 1},
            timeout=60,
        )
        # Raise on 4xx/5xx so a failed request is not indexed as an empty result.
        response.raise_for_status()
        data = response.json()
        # One Document per result: title and snippet form the text,
        # link and source are kept as metadata.
        docs = []
        for item in data.get("organic_results", []):
            docs.append(Document(
                text=f"{item.get('title', '')}\n\n{item.get('snippet', '')}",
                metadata={"url": item.get("link"), "source": item.get("source")},
            ))
        return docs
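
The reader above stops at returning Documents. A minimal sketch of the remaining step, building an in-memory vector index and querying it, could look like the following. It assumes llama-index is installed and OPENAI_API_KEY is set, because the default embedding model and LLM are OpenAI-backed; the function name `answer_from_news` is hypothetical.

```python
def answer_from_news(reader, query: str, question: str) -> str:
    """Load documents with the reader, index them, and answer one question.

    `reader` is any object with a load_data(query) method that returns
    LlamaIndex Documents, such as the PiloterrNewsReader shown above.
    """
    # Imported lazily so this module loads even where llama-index is absent.
    from llama_index.core import VectorStoreIndex

    documents = reader.load_data(query)
    index = VectorStoreIndex.from_documents(documents)
    query_engine = index.as_query_engine()
    return str(query_engine.query(question))
```

A call might look like `answer_from_news(PiloterrNewsReader(), "electric vehicles", "What are the main themes?")`. For anything beyond a quick experiment, the index would usually be persisted or backed by an external vector store rather than rebuilt on every call.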

Transparent credit pricing

Pay only for successful requests. Start with +500 credits, then scale with plans from $49/mo.

Plan	Price	Credits
Premium	$49/mo	18,000
Premium+	$99/mo	40,000
Startup	$249/mo	110,000

Ready to get started?

Your web scraping API is one click away. Start with +500 credits, no infrastructure to set up, no proxies to manage, and no credit card required.
