XPath Agent: Automating Web Scraping with LLM‑Built XPaths

extraction
web-extraction
llm
Author

Ryan Lee

Published

June 19, 2025

Motivation

Writing durable XPath queries for web scraping is time-consuming and brittle: a good XPath must work across multiple page variants, not just one. XPath Agent (Yu Li, Bryce Wang, Xinyu Luan) introduces an LLM-based two-stage system that generates XPath queries from a few sample web pages and a simple LLM prompt.


Method

Stage 1: Information Extraction

  1. Sanitize HTML: Strip irrelevant tags, attributes, and whitespace, reducing input size by ~10–20%.
  2. Prompt for target info and cue text: Prompt the LLM to extract the target values along with key cue text.
    • Cue text covers cases where the target information sits in locations that are hard to query directly with XPath.
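The sanitization step can be sketched with regular expressions. The paper does not publish its exact stripping rules, so the choice of what to drop here (scripts, styles, comments, and every attribute except class and id, which XPaths commonly key on) is an assumption:

```python
import re

DROP_BLOCKS = re.compile(r"<(script|style|noscript)\b.*?</\1>", re.S | re.I)
COMMENTS = re.compile(r"<!--.*?-->", re.S)
TAG = re.compile(r"<(/?)(\w+)([^>]*)>")
ATTR = re.compile(r'(\w[\w-]*)\s*=\s*("[^"]*"|\'[^\']*\')')
WS = re.compile(r"\s+")
KEEP_ATTRS = {"class", "id"}  # assumption: these survive, all others are dropped

def _clean_tag(m: re.Match) -> str:
    close, name, attrs = m.groups()
    kept = [f"{k}={v}" for k, v in ATTR.findall(attrs) if k.lower() in KEEP_ATTRS]
    return f"<{close}{name}{(' ' + ' '.join(kept)) if kept else ''}>"

def sanitize(html: str) -> str:
    """Strip non-content blocks, comments, most attributes, and extra whitespace."""
    html = DROP_BLOCKS.sub("", html)
    html = COMMENTS.sub("", html)
    html = TAG.sub(_clean_tag, html)
    return WS.sub(" ", html).strip()
```

On real-world markup a proper HTML parser would be safer than regexes; this is only meant to show the kind of reduction the paper describes.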

Stage 2: XPath Programming

  1. Condense the HTML by preserving only relevant nodes and replacing all other text with “…”.
    • Only text within a certain “distance” of the target text is kept.
  2. Prompt a stronger LLM (GPT‑4o, Claude 3.5, etc.) with the condensed HTML + prompt to generate an XPath.
  3. Conversational Evaluator: Iteratively refine the XPath via LLM feedback.
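The condensing step above can be sketched as follows. The preprint does not define its distance metric, so counting tree edges between a node and the cue node (via their lowest common ancestor) is an assumption, and well-formed markup is assumed so that stdlib `xml.etree` can parse it:

```python
import xml.etree.ElementTree as ET

def condense(root: ET.Element, cue: str, max_dist: int = 2) -> ET.Element:
    """Replace text farther than max_dist tree edges from the cue node with '…'."""
    parent = {c: p for p in root.iter() for c in p}  # child -> parent map
    # Locate the node whose text contains the cue string.
    cue_node = next(n for n in root.iter() if n.text and cue in n.text)

    def ancestors(x: ET.Element) -> list:
        chain = [x]
        while x in parent:
            x = parent[x]
            chain.append(x)
        return chain

    def dist(n: ET.Element) -> int:
        # Edges up to the lowest common ancestor from each side, summed.
        a, b = ancestors(n), ancestors(cue_node)
        common = next(x for x in a if x in b)
        return a.index(common) + b.index(common)

    for n in root.iter():
        if n.text and n.text.strip() and dist(n) > max_dist:
            n.text = "…"
    return root
```

With `max_dist=2`, a sibling value next to the cue (e.g. a price `<span>` beside a “Price:” label) survives, while text in distant subtrees such as a page footer collapses to “…”.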

📈 Results

Model comparison (SWDE benchmark):

Model F1 Score
DeepSeek V2.5 0.7337
GPT‑4o mini 0.7961
GPT‑4o 0.8514
Claude 3.5 0.7647

🧠 Contributions

  1. Two‑stage pipeline: weak LLM for extraction, strong LLM for XPath.
  2. HTML pruning: minimizes noise and tokens.
  3. Cue‑based prompting: boosts XPath generalizability.
  4. Iterative refinement loop: enhances robustness in real-world deployment.

👀 Why It Matters

  • Token‑efficient: Lower LLM usage lowers cost and improves speed.
  • Generalizable XPaths: One expression works across page variants, which is needed for scaling web scraping.

❓ Remaining Questions

The arXiv preprint is missing its last two sections, which may answer some of these questions.

  1. Distance Function in Stage 2
    The pruning step keeps nodes within a certain “distance” from the cue text.
    • What exact DOM-based distance metric is used?
    • How sensitive is the XPath generation to this threshold?
  2. Handling of Non-Extractive or Multi-Hop Scenarios
    The system assumes that the answer exists as visible text near some cue.
    • Can it handle cases where values must be inferred, combined, or extracted from multiple DOM elements?
    • What about implied or computed values?
  3. Model Assignment: Weaker for Extraction, Stronger for XPath
    The pipeline uses a smaller LLM for cue extraction and a larger one for XPath synthesis.
    • This seems counterintuitive, since XPath generation is arguably the simpler task.
    • Would performance or efficiency improve if model roles were swapped?
  4. Need for the Iterative Refinement Loop
    A conversational evaluator is used to refine XPath generalization.
    • How essential is this loop in practice?
    • Can an LLM generate general XPath expressions directly with a good prompt?
  5. Comparison with Static XPath Algorithms
    A rule-based XPath generator is mentioned but not compared in detail.
    • How does it perform relative to the LLM-based method?
    • Could a hybrid approach (e.g., static rules augmented by LLM edits) be more robust?
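For reference, the conversational evaluator asked about in question 4 could plausibly work like the loop below. `propose` is a stand-in for the LLM call, the failure-feedback format is invented, and stdlib `xml.etree` (which supports only a subset of XPath) substitutes for a full XPath engine:

```python
import xml.etree.ElementTree as ET

def refine_xpath(pages, targets, propose, max_rounds=3):
    """Iteratively ask `propose(feedback)` for a candidate XPath until it
    extracts the expected target on every sample page, or rounds run out."""
    feedback = "initial attempt"
    for _ in range(max_rounds):
        xpath = propose(feedback)
        got = [[e.text for e in ET.fromstring(p).findall(xpath)] for p in pages]
        misses = [i for i, (g, t) in enumerate(zip(got, targets)) if g != [t]]
        if not misses:
            return xpath  # candidate generalizes across all sample pages
        feedback = f"xpath {xpath!r} failed on pages {misses}"
    return None
```

For example, a candidate like `.//div/span` that only fits one page variant would be rejected, and the failure feedback would steer the next proposal toward something attribute-anchored like `.//span[@class='price']` that holds across variants.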