Motivation
Writing durable XPath for web scraping is time-consuming and brittle. A good XPath must work across multiple page variants, not just one. XPath Agent (Yu Li, Bryce Wang, Xinyu Luan) introduces an LLM-based two-stage system that generates XPath queries from a few sample web pages and a simple natural-language prompt.
Method
Stage 1: Information Extraction
- Sanitize HTML: Strip irrelevant tags, attributes, and whitespace to reduce the input size by ~10–20%.
- Prompt for target info and cue text: Ask the LLM to extract the target values along with key cue text.
- Cue text handles cases where the target information sits in locations that are hard to query with XPath directly.
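The sanitization step can be sketched with Python's stdlib HTML parser. The `Sanitizer` class, the tag/attribute allow-lists, and the kept-attribute choices below are illustrative assumptions, not the paper's implementation:

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript", "svg", "iframe"}  # assumed irrelevant tags
KEEP_ATTRS = {"id", "class"}  # attributes kept because XPath queries often anchor on them

class Sanitizer(HTMLParser):
    """Re-emit HTML with scripts/styles, non-essential attributes,
    and redundant whitespace removed (a rough token-count reduction)."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # >0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = " ".join(f'{k}="{v}"' for k, v in attrs if k in KEEP_ATTRS)
        self.out.append(f"<{tag} {kept}>" if kept else f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.out.append(text)

def sanitize(html: str) -> str:
    s = Sanitizer()
    s.feed(html)
    return "".join(s.out)
```

For example, `sanitize('<div class="x" style="c:r"><script>var a=1;</script><p> Hello  world </p></div>')` drops the script, the `style` attribute, and the extra spaces while keeping the `class` hook.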
Stage 2: XPath Programming
- Condense HTML by only preserving relevant nodes and replacing all other text with “…”.
- Only text within a certain “distance” of the cue text is kept.
- Prompt a stronger LLM (GPT‑4o, Claude 3.5, etc.) with the condensed HTML plus the prompt to generate an XPath.
- Conversational Evaluator: Iteratively refine the XPath via LLM feedback.
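The condensing step might look like the following sketch. The hop-count distance metric and the `condense` helper are assumptions for illustration; the paper's exact distance definition is not specified:

```python
import xml.etree.ElementTree as ET

def condense(html: str, cue: str, max_dist: int = 2) -> str:
    """Replace text farther than `max_dist` tree hops from the cue node with '…'.
    'Distance' here is the number of edges between two elements in the DOM tree
    (an assumed metric, not necessarily the paper's). Expects well-formed markup;
    element tail text is ignored for simplicity."""
    root = ET.fromstring(html)
    parent = {c: p for p in root.iter() for c in p}

    def dist(a, b):
        # record distances from `a` to each of its ancestors,
        # then climb from `b` until hitting a common ancestor
        depth_a, node, d = {}, a, 0
        while node is not None:
            depth_a[node] = d
            node = parent.get(node)
            d += 1
        node, d = b, 0
        while node not in depth_a:
            node = parent.get(node)
            d += 1
        return d + depth_a[node]

    # first element whose text contains the cue (raises StopIteration if absent)
    cue_node = next(e for e in root.iter() if e.text and cue in e.text)
    for e in root.iter():
        if e.text and e.text.strip() and dist(e, cue_node) > max_dist:
            e.text = "…"
    return ET.tostring(root, encoding="unicode")
```

With `max_dist=2`, sibling text of the cue element survives while text in distant subtrees (e.g. a footer) is collapsed to “…”, shrinking the prompt for the stronger model.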
📈 Results
Model comparison (SWDE benchmark):
| Model | F1 Score |
|---|---|
| DeepSeek V2.5 | 0.7337 |
| GPT‑4o mini | 0.7961 |
| GPT‑4o | 0.8514 |
| Claude 3.5 | 0.7647 |
🧠 Contributions
- Two‑stage pipeline: weak LLM for extraction, strong LLM for XPath.
- HTML pruning: minimizes noise and tokens.
- Cue‑based prompting: boosts XPath generalizability.
- Iterative refinement loop: enhances robustness in real-world deployment.
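The iterative refinement loop can be sketched as below. The `ask_llm` callback and the `refine_xpath` helper are hypothetical stand-ins for a real chat-model API (they are not the paper's interface), and ElementTree's limited XPath subset is used as the evaluator:

```python
import xml.etree.ElementTree as ET

def refine_xpath(ask_llm, pages, expected, max_rounds=3):
    """Conversational evaluator sketch: keep asking the model to revise its
    XPath until it extracts the expected value on every sample page.
    `ask_llm(feedback: str) -> str` is an assumed stand-in for a chat call;
    `pages` are sample HTML strings, `expected` the target value per page."""
    feedback = "Write an XPath that selects the target value."
    for _ in range(max_rounds):
        xpath = ask_llm(feedback)
        failures = 0
        for page, want in zip(pages, expected):
            found = [e.text for e in ET.fromstring(page).findall(xpath)]
            if want not in found:
                failures += 1
        if failures == 0:
            return xpath  # generalizes across all sample pages
        feedback = f"XPath {xpath!r} failed on {failures} page(s); revise it."
    return None  # no candidate generalized within the round budget
```

A stub model illustrates the loop: the first candidate fails on a sample page, the feedback triggers a revision, and the second candidate is accepted because it works on both variants.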
👀 Why It Matters
- Token‑efficient: Lower LLM usage lowers cost and improves speed.
- Generalizable XPaths: One expression works across page variants, which is needed for scaling web scraping.
❓ Remaining Questions
The arXiv preprint is missing the last two sections, which may have answered these questions.
- Distance Function in Stage 2: The pruning step keeps nodes within a certain “distance” from the cue text.
  - What exact DOM-based distance metric is used?
  - How sensitive is the XPath generation to this threshold?
- Handling of Non-Extractive or Multi-Hop Scenarios: The system assumes that the answer exists as visible text near some cue.
  - Can it handle cases where values must be inferred, combined, or extracted from multiple DOM elements?
  - What about implied or computed values?
- Model Assignment: Weaker for Extraction, Stronger for XPath: The pipeline uses a smaller LLM for cue extraction and a larger one for XPath synthesis.
  - This seems counterintuitive, since XPath generation is arguably the simpler step.
  - Would performance or efficiency improve if the model roles were swapped?
- Need for the Iterative Refinement Loop: A conversational evaluator is used to refine XPath generalization.
  - How essential is this loop in practice?
  - Can an LLM generate general XPath expressions directly with a good prompt?
- Comparison with Static XPath Algorithms: A rule-based XPath generator is mentioned but not compared in detail.
  - How does it perform relative to the LLM-based method?
  - Could a hybrid approach (e.g., static rules augmented by LLM edits) be more robust?