Feature: Add Chrome support, rate-limit retry logic, and update docs #40

Open
theFuribundi wants to merge 1 commit into timf34:main from theFuribundi:feature/chrome-and-stability

@theFuribundi

This PR refactors substack_scraper.py to add cross-browser support and improve stability when scraping large archives.

Key Changes:

  • Dual Browser Support: Added support for Google Chrome via webdriver_manager. The script now accepts a --browser argument (chrome, edge, or auto) and attempts to auto-detect an available browser if one fails.
  • Rate Limit Handling: Implemented exponential backoff and retry logic. If a 429 Too Many Requests response or an empty template is detected, the script now waits and retries instead of crashing or skipping the post.
  • Smart Skipping: Moved the sleep timer inside the existence check. The script now checks whether a local file exists before pausing, so it can "fast-forward" through already downloaded posts without unnecessary delays.
  • Human-like Behavior: Replaced fixed sleep timers with randomized "jitter" (10-20s) and added wait times for JS rendering, to reduce the chance of detection and ensure content loads fully.
  • Improved Logging: Switched print statements to tqdm.write to prevent interference with the progress bar.
  • Documentation: Updated README.md with the correct clone URL and ensured requirements.txt includes webdriver_manager.
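
For reviewers, here is a minimal sketch of how the retry, jitter, and skip logic fit together. It is an illustration under stated assumptions, not the code in substack_scraper.py: `RateLimited`, `backoff_delays`, `fetch_with_retry`, `scrape_all`, the base/cap values, the retry count, and the file-naming scheme are all hypothetical. Only the 10-20s jitter range and the "check for the file before pausing" ordering come from the description above.

```python
import os
import random
import time


class RateLimited(Exception):
    """Hypothetical signal: the fetch saw HTTP 429 or an empty template."""


def backoff_delays(retries, base=2.0, cap=60.0):
    """Return exponentially growing delays: base, 2*base, 4*base, ... capped at cap."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def fetch_with_retry(fetch, url, retries=4, base=2.0, cap=60.0, sleep=time.sleep):
    """Call fetch(url); on RateLimited, wait with exponential backoff and retry."""
    for delay in backoff_delays(retries, base, cap):
        try:
            return fetch(url)
        except RateLimited:
            sleep(delay)
    # Final attempt: if this also fails, let RateLimited propagate to the caller.
    return fetch(url)


def scrape_all(urls, out_dir, fetch, sleep=time.sleep, jitter=(10.0, 20.0)):
    """Download each post, skipping files that already exist without pausing."""
    saved = []
    for url in urls:
        # Assumed naming scheme: last path segment of the URL plus ".html".
        name = url.rstrip("/").split("/")[-1] + ".html"
        path = os.path.join(out_dir, name)
        if os.path.exists(path):
            continue  # already downloaded: no sleep, move straight on
        # Randomized pause only for posts that actually need a request.
        sleep(random.uniform(*jitter))
        html = fetch_with_retry(fetch, url, sleep=sleep)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(html)
        saved.append(path)
    return saved
```

Passing `sleep` as a parameter is a choice made for this sketch so the timing behaviour can be exercised without real delays; `fetch` stands in for whatever function loads a post through the browser driver.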

Tested on Ubuntu 24.04.
