Summary
The engineer attempted to build a lightweight web-to-PDF archiving pipeline using requests, BeautifulSoup, and WeasyPrint. The implementation suffered from a fundamental architectural mismatch: it treated highly dynamic, modern websites as static HTML documents. This resulted in three critical failure modes: selector fragility, JavaScript blindness, and document layout degradation. The attempt to solve these issues with manual selector lists led to a “whack-a-mole” maintenance cycle that does not scale, because every new or redesigned site needs its own selectors.
Root Cause
The failure stems from three distinct layers of technical debt:
- Lack of a Headless Browser Engine: `requests.get()` only retrieves the initial HTML response. Many modern sites are Single Page Applications (SPAs) or rely on hydration, so the content only exists after JavaScript has run.
- Heuristic Fragility: The approach relied on hardcoded CSS selectors. Web markup is not standardized; relying on specific class names like `.article-body` ignores the reality of utility-first CSS (Tailwind) and the obfuscated class names generated by modern frameworks.
- Parsing vs. Rendering: The engineer performed content extraction and PDF rendering as two separate, uncoordinated steps. As a result, the “clean” HTML lacks the CSS context WeasyPrint needs to render images and layouts correctly.
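The first root cause can be checked before any extraction is attempted. The following is a minimal sketch, using only the standard library, of a check for whether a fetched response is an empty JavaScript shell. The helper names, the sample pages, and the 200-character threshold are assumptions made for illustration, not part of the original pipeline.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts text characters that sit outside <script>, <style>, <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Whitespace-only runs between tags contribute nothing.
        if self.skip_depth == 0:
            self.visible_chars += len(data.strip())


def visible_text_length(html: str) -> int:
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()
    return parser.visible_chars


def looks_like_js_shell(html: str, min_chars: int = 200) -> bool:
    """True when the static HTML carries almost no readable text.

    The threshold is an arbitrary illustrative value, not a tuned constant.
    """
    return visible_text_length(html) < min_chars


# Hypothetical responses: an SPA shell and a server-rendered article.
SPA_SHELL = (
    "<html><head><title>News</title></head>"
    '<body><div id="root"></div><script src="/bundle.js"></script></body></html>'
)
STATIC_PAGE = (
    "<html><body><article>"
    + "<p>Real paragraph text. </p>" * 20
    + "</article></body></html>"
)

print(looks_like_js_shell(SPA_SHELL))    # shell: only the <title> text is present
print(looks_like_js_shell(STATIC_PAGE))  # article text is already in the HTML
```

A check like this could decide whether a plain HTTP fetch is enough or whether the URL needs to go through a headless browser.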
Why This Happens in Real Systems
In production, web environments are adversarial and highly variable. This problem occurs because:
- DOM Mutation: Modern websites are not static files; they are state machines. Content is injected dynamically based on scroll position (lazy loading) or user interaction.
- Semantic Inconsistency: There is no global standard for what constitutes an “article” in HTML5 that is strictly followed by developers.
- The “Junk” Problem: Advertisements, trackers, and modals are often injected into the DOM after the initial load, making them invisible to simple scrapers but visible to the user (and to any renderer that executes JavaScript).
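Because no markup reliably identifies the article, extractors tend to score blocks by what their text looks like instead of by what they are named. The sketch below is a toy version of one such signal, link density, written with the standard library. It is not the algorithm any particular Readability implementation uses; the scoring formula and the sample fragments are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Tallies total text length and the share of it that sits inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth > 0:
            self.link_chars += n


def score_block(fragment: str) -> float:
    """Toy score: text length, discounted by the fraction that is link text."""
    parser = LinkDensityParser()
    parser.feed(fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    link_density = parser.link_chars / parser.total_chars
    return parser.total_chars * (1.0 - link_density)


# Hypothetical fragments with meaningless class names, as a build tool might emit.
NAV = (
    '<div class="x1"><a href="/">Home</a> <a href="/world">World</a> '
    '<a href="/tech">Tech</a></div>'
)
BODY = (
    '<div class="x2"><p>'
    + "A long sentence of article prose. " * 10
    + '<a href="/src">source</a></p></div>'
)

# The navigation block is all link text, so the prose block wins.
best = max([NAV, BODY], key=score_block)
print(best is BODY)
```

Neither fragment's class name carries any meaning, yet the prose block is still picked, which is the property that selector lists lack.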
Real-World Impact
- High Maintenance Overhead: Every website update or new site added requires manual code changes, breaking the automation promise.
- Data Loss: Critical research information (images, charts, data tables) is lost because it is loaded by lazy-loading scripts that `requests` cannot trigger.
- Unreliable Archiving: The output is functionally useless for professional research if it contains broken layouts, orphaned headings, or missing context, which erodes trust in the automated system.
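The data-loss case can be made visible in the static HTML itself. The following minimal sketch lists images whose real URL is parked in a lazy-loading attribute while `src` is missing or a placeholder. The attribute names in `LAZY_ATTRS` are assumed conventions seen in some lazy-loading scripts, not a complete or standard list, and the sample markup is hypothetical.

```python
from html.parser import HTMLParser

# Assumed attribute names; individual sites may use others.
LAZY_ATTRS = ("data-src", "data-lazy-src", "data-original")


class LazyImageFinder(HTMLParser):
    """Collects deferred image URLs that a non-JS fetch would never load."""

    def __init__(self):
        super().__init__()
        self.unloaded = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src") or ""
        deferred = next(
            (attr_map[name] for name in LAZY_ATTRS if attr_map.get(name)), None
        )
        # A missing src or an inline data: placeholder means the real image
        # is only swapped in by JavaScript.
        if deferred and (not src or src.startswith("data:")):
            self.unloaded.append(deferred)


def find_unloaded_images(html: str) -> list:
    finder = LazyImageFinder()
    finder.feed(html)
    finder.close()
    return finder.unloaded


# Hypothetical markup: one eager image, two lazy ones.
SAMPLE_HTML = (
    '<img src="/logo.png">'
    '<img data-src="/chart.png" src="data:image/gif;base64,R0lGOD">'
    '<img data-src="/table.png" />'
)

missing = find_unloaded_images(SAMPLE_HTML)
print(missing)
```

A non-empty result suggests that a PDF rendered from the raw response would be missing those images.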
Example or Code
Instead of manual selectors, a senior approach uses Readability algorithms and Headless Browser orchestration.
    import asyncio
    import re

    from playwright.async_api import async_playwright
    from readability import Document  # provided by the readability-lxml package
    import weasyprint


    async def archive_article(url):
        async with async_playwright() as p:
            # Launch a real browser to handle JS and lazy loading
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            # Navigate and wait for network idle so the initial JS has run
            await page.goto(url, wait_until="networkidle")
            # Scroll to the bottom to trigger lazy-loaded images
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(2000)
            # Get the full rendered HTML
            content = await page.content()
            await browser.close()

        # Use Readability to extract the core content automatically;
        # its heuristics drop most ads, navbars, and sidebars
        doc = Document(content)
        clean_html = doc.summary()
        title = doc.title()

        # Replace characters that are unsafe in filenames before writing
        safe_title = re.sub(r"[^\w\- ]", "_", title).strip() or "article"
        # base_url lets WeasyPrint resolve relative image URLs
        weasyprint.HTML(string=clean_html, base_url=url).write_pdf(f"{safe_title}.pdf")


    if __name__ == "__main__":
        asyncio.run(archive_article("https://example-news-site.com/article-1"))
How Senior Engineers Fix It
Senior engineers move away from imperative scraping (telling the script how to find data) toward declarative extraction (telling the script what the data looks like).
- Orchestration over Requests: Use Playwright or Puppeteer. These tools control a real Chromium instance, allowing you to handle cookies, JavaScript, and lazy loading naturally.
- Heuristic-Based Extraction: Instead of writing 30 selectors, use libraries like `python-readability`. These use content-scoring heuristics (such as text length and link density) to identify the “meat” of a page without site-specific selectors.
- CSS Print Media Queries: To fix layout issues, inject a `<style>` block during the PDF generation phase that uses `@media print` rules to control page breaks, hide unwanted elements, and ensure images scale correctly.
- Sanitization Pipeline: Implement a pipeline: Render (browser) $\rightarrow$ Extract (Readability) $\rightarrow$ Sanitize (HTML cleaner) $\rightarrow$ Render (PDF).
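The print-stylesheet step can be sketched as a small wrapper that turns the extracted fragment into a full document with `@media print` rules attached. The function name `wrap_for_print`, the specific rules, and the `.ad` and `.cookie-banner` class names are illustrative assumptions; a real pipeline would tune the rules to its own output.

```python
from html import escape

# Illustrative print rules; the selectors and class names are assumptions.
PRINT_CSS = """
@media print {
  nav, aside, footer, .ad, .cookie-banner { display: none; }
  img, table, figure { max-width: 100%; break-inside: avoid; }
  h1, h2, h3 { break-after: avoid; }
}
"""


def wrap_for_print(fragment: str, title: str) -> str:
    """Wrap an extracted HTML fragment in a document that carries print CSS."""
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title>"
        f"<style>{PRINT_CSS}</style></head>"
        f"<body>{fragment}</body></html>"
    )


printable = wrap_for_print("<h1>Report</h1><p>Body text.</p>", "Q3 & Q4 Report")
print(printable[:60])
```

The returned string could then be passed to `weasyprint.HTML(string=printable).write_pdf(...)`. `break-after: avoid` on headings is aimed at the orphaned-heading problem, and `break-inside: avoid` asks the renderer not to split figures and tables across pages.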
Why Juniors Miss It
- The “Selector Trap”: Juniors view web scraping as a pattern-matching problem. They believe if they just find the “right” selector, the problem is solved. They don’t realize the “right” selector is a moving target.
- Ignoring the Lifecycle: They often assume the HTML they see in “View Source” is the same HTML the browser actually renders. They miss the execution layer of the web.
- Underestimating Layout Complexity: They treat PDF generation as a simple “Save as” function, failing to account for the complex interaction between CSS layout engines (Flexbox/Grid) and the fixed-page constraints of the PDF format.