XPath Agent: Automating Web Scraping with LLM‑Built XPaths

extraction
web-extraction
llm
Author

Ryan Lee

Published

June 19, 2025

Motivation

Writing durable XPath queries for web scraping is time-consuming and brittle: a good XPath must work across multiple page variants, not just one. XPath Agent (Yu Li, Bryce Wang, Xinyu Luan) introduces an LLM-based two-stage system that generates XPath queries from a few sample web pages and a simple LLM prompt.


Method

Stage 1: Information Extraction

  1. Sanitize HTML: Strip irrelevant tags, attributes, and whitespace, reducing input size by ~10–20%.
  2. Prompt for target info and cue text: Prompt the LLM to extract the target values along with key cue text.
    • Cue text covers cases where the target information sits in locations that are hard to query directly with XPath.
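The sanitization step can be sketched with regular expressions. The paper does not publish its exact stripping rules, so the choice of what to drop here (scripts, styles, comments, and every attribute except class and id, which XPaths commonly key on) is an assumption:

```python
import re

DROP_BLOCKS = re.compile(r"<(script|style|noscript)\b.*?</\1>", re.S | re.I)
COMMENTS = re.compile(r"<!--.*?-->", re.S)
TAG = re.compile(r"<(/?)(\w+)([^>]*)>")
ATTR = re.compile(r'(\w[\w-]*)\s*=\s*("[^"]*"|\'[^\']*\')')
WS = re.compile(r"\s+")
KEEP_ATTRS = {"class", "id"}  # assumption: these survive, all others are dropped

def _clean_tag(m: re.Match) -> str:
    close, name, attrs = m.groups()
    kept = [f"{k}={v}" for k, v in ATTR.findall(attrs) if k.lower() in KEEP_ATTRS]
    return f"<{close}{name}{(' ' + ' '.join(kept)) if kept else ''}>"

def sanitize(html: str) -> str:
    """Strip non-content blocks, comments, most attributes, and extra whitespace."""
    html = DROP_BLOCKS.sub("", html)
    html = COMMENTS.sub("", html)
    html = TAG.sub(_clean_tag, html)
    return WS.sub(" ", html).strip()
```

On real-world markup a proper HTML parser would be safer than regexes; this is only meant to show the kind of reduction the paper describes.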

Stage 2: XPath Programming

  1. Condense the HTML by preserving only relevant nodes and replacing all other text with “…”.
    • Only text within a certain “distance” of the target text is kept.
  2. Prompt a stronger LLM (GPT‑4o, Claude 3.5, etc.) with the condensed HTML + prompt to generate an XPath.
  3. Conversational Evaluator: Iteratively refine the XPath via LLM feedback.
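The condensing step above can be sketched as follows. The preprint does not define its distance metric, so counting tree edges between a node and the cue node (via their lowest common ancestor) is an assumption, and well-formed markup is assumed so that stdlib `xml.etree` can parse it:

```python
import xml.etree.ElementTree as ET

def condense(root: ET.Element, cue: str, max_dist: int = 2) -> ET.Element:
    """Replace text farther than max_dist tree edges from the cue node with '…'."""
    parent = {c: p for p in root.iter() for c in p}  # child -> parent map
    # Locate the node whose text contains the cue string.
    cue_node = next(n for n in root.iter() if n.text and cue in n.text)

    def ancestors(x: ET.Element) -> list:
        chain = [x]
        while x in parent:
            x = parent[x]
            chain.append(x)
        return chain

    def dist(n: ET.Element) -> int:
        # Edges up to the lowest common ancestor from each side, summed.
        a, b = ancestors(n), ancestors(cue_node)
        common = next(x for x in a if x in b)
        return a.index(common) + b.index(common)

    for n in root.iter():
        if n.text and n.text.strip() and dist(n) > max_dist:
            n.text = "…"
    return root
```

With `max_dist=2`, a sibling value next to the cue (e.g. a price `<span>` beside a “Price:” label) survives, while text in distant subtrees such as a page footer collapses to “…”.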

📈 Results

Model comparison (SWDE benchmark):

Model F1 Score
DeepSeek V2.5 0.7337
GPT‑4o mini 0.7961
GPT‑4o 0.8514
Claude 3.5 0.7647

🧠 Contributions

  1. Two‑stage pipeline: weak LLM for extraction, strong LLM for XPath.
  2. HTML pruning: minimizes noise and tokens.
  3. Cue‑based prompting: boosts XPath generalizability.
  4. Iterative refinement loop: enhances robustness in real-world deployment.

👀 Why It Matters

  • Token‑efficient: Lower LLM usage lowers cost and improves speed.
  • Generalizable XPaths: One expression works across page variants, which is needed for scaling web scraping.

❓ Remaining Questions

The arXiv preprint is missing its last two sections, which may answer some of these questions.

  1. Distance Function in Stage 2
    The pruning step keeps nodes within a certain “distance” from the cue text.
    • What exact DOM-based distance metric is used?
    • How sensitive is the XPath generation to this threshold?
  2. Handling of Non-Extractive or Multi-Hop Scenarios
    The system assumes that the answer exists as visible text near some cue.
    • Can it handle cases where values must be inferred, combined, or extracted from multiple DOM elements?
    • What about implied or computed values?
  3. Model Assignment: Weaker for Extraction, Stronger for XPath
    The pipeline uses a smaller LLM for cue extraction and a larger one for XPath synthesis.
    • This seems counterintuitive, since XPath generation is arguably the simpler task.
    • Would performance or efficiency improve if model roles were swapped?
  4. Need for the Iterative Refinement Loop
    A conversational evaluator is used to refine XPath generalization.
    • How essential is this loop in practice?
    • Can an LLM generate general XPath expressions directly with a good prompt?
  5. Comparison with Static XPath Algorithms
    A rule-based XPath generator is mentioned but not compared in detail.
    • How does it perform relative to the LLM-based method?
    • Could a hybrid approach (e.g., static rules augmented by LLM edits) be more robust?
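For reference, the conversational evaluator asked about in question 4 could plausibly work like the loop below. `propose` is a stand-in for the LLM call, the failure-feedback format is invented, and stdlib `xml.etree` (which supports only a subset of XPath) substitutes for a full XPath engine:

```python
import xml.etree.ElementTree as ET

def refine_xpath(pages, targets, propose, max_rounds=3):
    """Iteratively ask `propose(feedback)` for a candidate XPath until it
    extracts the expected target on every sample page, or rounds run out."""
    feedback = "initial attempt"
    for _ in range(max_rounds):
        xpath = propose(feedback)
        got = [[e.text for e in ET.fromstring(p).findall(xpath)] for p in pages]
        misses = [i for i, (g, t) in enumerate(zip(got, targets)) if g != [t]]
        if not misses:
            return xpath  # candidate generalizes across all sample pages
        feedback = f"xpath {xpath!r} failed on pages {misses}"
    return None
```

For example, a candidate like `.//div/span` that only fits one page variant would be rejected, and the failure feedback would steer the next proposal toward something attribute-anchored like `.//span[@class='price']` that holds across variants.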