This Python script scrapes the Django documentation website and generates a single PDF containing the content from all pages.
- Extracts content including titles, paragraphs, images, and code blocks
- Removes sidebar elements from the scraped pages
- Generates a single PDF file containing all the extracted content
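A minimal sketch of how the extraction and sidebar removal listed above might look with BeautifulSoup. The function name `extract_content`, the `SIDEBAR_SELECTORS` list, and the selectors themselves are assumptions for illustration and are not taken from `main.py`. The sketch uses Python's built-in `html.parser` rather than `lxml` so that it runs with only `beautifulsoup4` installed.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors: the real ones depend on the target site's markup.
SIDEBAR_SELECTORS = ["nav", "aside", ".sidebar"]


def extract_content(html):
    """Strip sidebar elements, then collect the title and main content tags."""
    # "html.parser" ships with Python; the actual script may use "lxml".
    soup = BeautifulSoup(html, "html.parser")

    # Remove sidebars first so their text never reaches the output.
    # Re-selecting after each removal avoids touching already-removed nodes.
    for selector in SIDEBAR_SELECTORS:
        while (element := soup.select_one(selector)) is not None:
            element.decompose()

    title = soup.title.get_text(strip=True) if soup.title else ""
    # Paragraphs, images, and code blocks, in document order.
    blocks = soup.find_all(["p", "img", "pre"])
    return {"title": title, "blocks": [str(block) for block in blocks]}
```

Removing the sidebar nodes before collecting content keeps navigation text out of the result without having to filter it afterwards.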
- Python 3.x
- Required Python packages:
  - requests
  - beautifulsoup4
  - lxml
  - pdfkit
  - tqdm
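To confirm that the packages listed above are installed, a quick check along these lines may help. It is only a sketch: the `REQUIRED` mapping and the `missing_packages` helper are illustrative names, not part of the script. The one detail worth noting is that `beautifulsoup4` is imported as `bs4`.

```python
from importlib.util import find_spec

# Maps each pip package name to the module name it is imported under.
# beautifulsoup4 is the only one whose two names differ.
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "lxml": "lxml",
    "pdfkit": "pdfkit",
    "tqdm": "tqdm",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```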
- Clone this repository or download the `main.py` file.
- Install the required packages: `pip install -r requirements.txt`
- Install wkhtmltopdf (required by pdfkit):
  - For Windows: Download and install from wkhtmltopdf.org
  - For macOS: `brew install wkhtmltopdf`
  - For Linux (Debian/Ubuntu): `sudo apt-get install wkhtmltopdf`
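Since pdfkit relies on the wkhtmltopdf binary, it can be worth checking that the binary is reachable before running the script. The following is a small sketch using only the standard library; `find_wkhtmltopdf` is a hypothetical helper name.

```python
import shutil


def find_wkhtmltopdf(binary_name="wkhtmltopdf"):
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_wkhtmltopdf()
    if path is None:
        print("wkhtmltopdf was not found on PATH; install it before running main.py.")
    else:
        print(f"Found wkhtmltopdf at {path}")
```

If the binary is installed somewhere that is not on PATH, which can happen on Windows, pdfkit accepts an explicit location through `pdfkit.configuration(wkhtmltopdf=...)`, and the resulting object can be passed as the `configuration` argument of its conversion functions.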
- Open `main.py` and modify the `url` variable if you want to start from a different page of the Django documentation.
- Run the script: `python main.py`
- The script will display progress as it scrapes the website and generates the PDF.
- Once complete, you'll find an `output.pdf` file in the same directory as the script.
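The final step, combining every scraped page into one PDF, could be sketched roughly as follows. The helper names `build_document` and `write_pdf` and the `(title, body_html)` page format are assumptions for illustration; the only library call relied on is `pdfkit.from_string`, which converts an HTML string to a PDF file.

```python
from html import escape


def build_document(pages):
    """Join (title, body_html) pairs into one HTML document string."""
    sections = [
        f"<h1>{escape(title)}</h1>\n{body_html}" for title, body_html in pages
    ]
    return (
        '<html><head><meta charset="utf-8"></head><body>\n'
        + "\n".join(sections)
        + "\n</body></html>"
    )


def write_pdf(pages, output_path="output.pdf"):
    """Render all pages into a single PDF via pdfkit and wkhtmltopdf."""
    # Imported here so build_document can be used without pdfkit installed.
    import pdfkit

    pdfkit.from_string(build_document(pages), output_path)
```

Calling `write_pdf([("Intro", "<p>Hello</p>")])` would write `output.pdf` to the current directory, provided pdfkit and wkhtmltopdf are both installed.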
- The script may take a while to run, depending on the number of pages in the documentation.
- Some complex page layouts or dynamic content may not be captured perfectly in the PDF.
- The script doesn't handle pagination or create a table of contents for the PDF.
Feel free to fork this repository and submit pull requests to improve the functionality or add new features.
This project is open-source and available under the MIT License.