Web data ingestion for LlamaIndex with Piloterr

Load live web content into LlamaIndex pipelines via Piloterr REST APIs. Structured JSON and Markdown from protected sites, ready for chunking, embedding, and retrieval.

  • Custom readers and tools over Piloterr endpoints
  • Clean JSON/Markdown, no HTML cleanup step
  • Anti-bot bypass for production RAG
  • Works with any vector store LlamaIndex supports

At a glance

  • Readers: custom loaders
  • JSON: structured input
  • 400+ web sources
  • REST: HTTP API

Why connect LlamaIndex

  • Custom data loaders

    Build LlamaIndex readers that fetch pages via Piloterr and return Document objects with clean text and metadata.

  • Query engines

    Combine scraped data with LlamaIndex query engines for grounded Q&A over live web content.

  • Skip HTML parsing

    Piloterr returns structured fields (title, body, price, metadata), so no BeautifulSoup preprocessing is needed.

  • Production reliability

    Anti-bot bypass and managed proxies keep your ingestion pipeline running when target sites add Cloudflare protection.

LlamaIndex + Piloterr patterns

From one-off research to scheduled index refresh.

  • Document ingestion

    Fetch JSON, map fields to Document text and metadata, then index into a vector store.

  • Scheduled refresh

    Cron or workflow triggers re-scrape and upsert changed documents.

  • Multi-source indexes

    Combine SERP, news, and product data in a single LlamaIndex index.

  • Tool-augmented query

    Query engines call Piloterr on-the-fly for questions needing fresh data.
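
The document-ingestion pattern above can be sketched as a small mapping step. The field names used here (`title`, `body`, `url`, `price`) are assumptions for illustration, not a documented Piloterr schema; response shapes vary by endpoint, so adjust the keys to the payload you actually receive.

```python
from typing import Any


def item_to_text_and_metadata(item: dict[str, Any]) -> tuple[str, dict[str, Any]]:
    """Split one JSON record into embeddable text and filterable metadata.

    The keys used here (title, body, url, price) are hypothetical examples,
    not a documented Piloterr schema.
    """
    parts = [item.get("title"), item.get("body")]
    # Join the non-empty text fields; this is what gets chunked and embedded.
    text = "\n\n".join(p.strip() for p in parts if p and p.strip())
    # Keep only metadata fields that are present and not null.
    metadata = {k: item[k] for k in ("url", "price") if item.get(k) is not None}
    return text, metadata


def to_documents(items: list[dict[str, Any]]) -> list:
    """Wrap mapped records in LlamaIndex Document objects, skipping empty ones."""
    # Imported lazily so the mapping helper above stays usable on its own.
    from llama_index.core import Document

    docs = []
    for item in items:
        text, metadata = item_to_text_and_metadata(item)
        if text:
            docs.append(Document(text=text, metadata=metadata))
    return docs
```

Keeping the mapping separate from the Document construction makes it easy to unit-test field handling without network access or an index.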

Why not use SimpleWebPageReader alone?

Approach            | DIY                          | Piloterr
SimpleWebPageReader | Blocked on protected sites   | Managed bypass
Raw HTML            | Noisy chunks, poor retrieval | Structured text fields
JS-heavy SPAs       | Empty content                | Headless rendering
Maintenance         | Per-site scraper logic       | 400+ managed endpoints

Connect LlamaIndex in four steps

  1. Install LlamaIndex

    pip install llama-index llama-index-llms-openai llama-index-embeddings-openai requests

  2. Get your API key

    Set PILOTERR_API_KEY in your environment.

  3. Create a custom reader

    Subclass BaseReader, or write a function that calls Piloterr and returns Document objects.

  4. Build index and query

    Build the index with VectorStoreIndex.from_documents(), create a query engine with index.as_query_engine(), then call query_engine.query().
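
Steps 2 and 4 can be sketched as follows. This assumes the default OpenAI LLM and embedding models from the packages installed in step 1, which read `OPENAI_API_KEY`. The helper names `missing_env` and `build_query_engine` are illustrative and not part of either library.

```python
import os
from collections.abc import Mapping

# Environment variables this sketch expects. OPENAI_API_KEY is an assumption
# that holds only when the default OpenAI models are used.
REQUIRED_ENV = ("PILOTERR_API_KEY", "OPENAI_API_KEY")


def missing_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_ENV if not env.get(name)]


def build_query_engine(documents, top_k: int = 3):
    """Embed documents into an in-memory vector index and return a query engine."""
    # Imported lazily so missing_env() can run before LlamaIndex is configured.
    from llama_index.core import VectorStoreIndex

    index = VectorStoreIndex.from_documents(documents)
    # similarity_top_k controls how many chunks are retrieved per question.
    return index.as_query_engine(similarity_top_k=top_k)
```

Given a list of Document objects `docs` from your reader, check `missing_env()` first, then call `build_query_engine(docs).query("Which competitors changed pricing this week?")` and convert the response to a string to get the answer text.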

Workflow recipes

  • Competitive intel index

    Daily scrape of competitor pages → chunk → embed → Q&A over pricing and features.

  • News monitoring RAG

    A Google News reader refreshes the index hourly to track industry keywords.

  • Product catalog search

    E-commerce API data indexed for semantic product discovery.

  • Help center Q&A index

    Ingest help docs via Piloterr reader, refresh nightly, and power semantic search for support.
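
The refresh recipes above all need a way to tell which records changed between runs. One possible approach, sketched here with the standard library only, is to fingerprint each record and compare it with the hash stored on the previous run. Using `url` as the record key is an assumption; use whatever stable identifier your endpoint returns.

```python
import hashlib
import json
from typing import Any


def content_hash(item: dict[str, Any]) -> str:
    """Stable fingerprint of a scraped record (key order does not matter)."""
    canonical = json.dumps(item, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_items(
    items: list[dict[str, Any]], seen: dict[str, str], key: str = "url"
) -> list[dict[str, Any]]:
    """Return records that are new or modified since the last run.

    `seen` maps a record's key (here its URL) to the hash stored on the
    previous run. It is updated in place so it can be persisted afterwards.
    """
    changed = []
    for item in items:
        item_id = item.get(key)
        if item_id is None:
            continue  # cannot track records without a stable identifier
        digest = content_hash(item)
        if seen.get(item_id) != digest:
            seen[item_id] = digest
            changed.append(item)
    return changed
```

Persist `seen` between scheduled runs (for example as a JSON file) and pass only the changed records on for indexing. If your documents carry stable IDs, LlamaIndex's `index.refresh_ref_docs(documents)` can perform the upsert; verify its behavior against the version you have installed.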

LlamaIndex vs LangChain vs CrewAI

  • RAG and document indexing

    Recommendation: LlamaIndex

  • Tool-calling agents

    Recommendation: LangChain

  • Multi-agent teams

    Recommendation: CrewAI

  • Simple HTTP ETL

    Recommendation: Python SDK

LlamaIndex reader example

Load Google News articles into a vector index via Piloterr.

Python
import os

import requests
from llama_index.core import Document
from llama_index.core.readers.base import BaseReader


class PiloterrNewsReader(BaseReader):
    """Load Google News results from Piloterr as LlamaIndex Documents."""

    def __init__(self, api_key: str | None = None):
        # Fall back to the PILOTERR_API_KEY environment variable.
        self.api_key = api_key or os.environ["PILOTERR_API_KEY"]
        self.base = "https://api.piloterr.com/v2"

    def load_data(self, query: str, location: str = "Paris, FR") -> list[Document]:
        # Request the first page of news results for the query.
        response = requests.post(
            f"{self.base}/google/news",
            headers={"x-api-key": self.api_key, "Content-Type": "application/json"},
            json={"query": query, "location": location, "page": 1},
            timeout=60,
        )
        # Raise on HTTP errors rather than indexing an error payload.
        response.raise_for_status()
        data = response.json()
        # Title and snippet become the Document text; URL and publisher
        # are kept as metadata for filtering and citation.
        docs = []
        for item in data.get("organic_results", []):
            docs.append(Document(
                text=f"{item.get('title', '')}\n\n{item.get('snippet', '')}",
                metadata={"url": item.get("link"), "source": item.get("source")},
            ))
        return docs
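
For the tool-augmented query pattern, a reader like the one above can also be exposed as a tool that is called on demand. The following is a minimal sketch, assuming any reader object with a `load_data(query)` method that returns Documents shaped like those from PiloterrNewsReader (title on the first line of text, `url` in metadata). `format_results` and `make_news_tool` are hypothetical helper names.

```python
from typing import Any


def format_results(items: list[dict[str, Any]], limit: int = 5) -> str:
    """Render search results as a compact numbered list for the LLM."""
    lines = []
    for i, item in enumerate(items[:limit], start=1):
        title = item.get("title") or "(untitled)"
        link = item.get("link") or ""
        lines.append(f"{i}. {title} {link}".rstrip())
    return "\n".join(lines) or "No results."


def make_news_tool(reader):
    """Expose a reader's load_data method as a LlamaIndex function tool."""
    # Imported lazily so format_results stays usable without LlamaIndex.
    from llama_index.core.tools import FunctionTool

    def search_news(query: str) -> str:
        """Search recent news articles for a query and return the top hits."""
        docs = reader.load_data(query)
        # Assumes the title is the first line of each Document's text.
        items = [
            {"title": d.text.split("\n", 1)[0], "link": d.metadata.get("url")}
            for d in docs
        ]
        return format_results(items)

    return FunctionTool.from_defaults(fn=search_news, name="search_news")
```

Returning a short, numbered text summary instead of raw JSON keeps the tool output small enough to fit comfortably in the model's context.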

Transparent credit pricing

Pay only for successful requests. Start with 500 free credits, then scale with plans from $49/mo.

  • Premium: $49/mo, 18,000 credits
  • Premium+: $99/mo, 40,000 credits
  • Startup: $249/mo, 110,000 credits

Ready to get started?

Your web scraping API is one click away. Start with +500 credits, no infrastructure to set up, no proxies to manage, and no credit card required.

Start free (+500 credits)