Unlocking the Power of Web Page Archiving: A Guide for Developers and Marketers

Date

December 12, 2024

Author

Jakub Tatulinski


Introduction

Imagine you’re working on a marketing campaign, and the landing page that inspired you suddenly goes offline. Or perhaps you’re a developer tasked with analyzing a live website without risking unintentional changes. These scenarios highlight the importance of archiving web pages—a process often overlooked but incredibly valuable for developers, marketers, and researchers alike.

In this article, we’ll dive into the art of web page archiving using automation, explore its benefits, and walk through a step-by-step tutorial to demonstrate how a simple Python script can streamline the process.


The Role of Automation in Web Archiving

Archiving web pages manually can be time-consuming and error-prone, especially when handling multiple pages and their associated assets. Automation simplifies this process by:

  1. Saving Time: Downloading web pages and their resources in bulk takes a fraction of the time compared to manual methods.

  2. Ensuring Completeness: Automated scripts identify and download all linked assets, including images, stylesheets, and JavaScript files, ensuring an accurate replica of the web page.

  3. Reducing Errors: Scripts handle repetitive tasks consistently, minimizing the risk of missing or incorrectly saving resources.

A Python script like the one we’ll discuss below is a powerful tool to accomplish this seamlessly.


Step-by-Step Tutorial: Automating Web Page Archiving with Python

Let’s walk through the process of setting up and running a Python script to archive web pages.

1. Prerequisites

To follow this tutorial, you’ll need:

  • Python Installed: Make sure Python 3.x is installed on your system.

  • Required Libraries: Install the following libraries using pip:
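The script in this tutorial imports requests, BeautifulSoup (from the beautifulsoup4 package), and pandas, so the install command is:

```shell
pip install requests beautifulsoup4 pandas
```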

2. Prepare Your CSV File

Create a CSV file named landing_pages.csv with two columns:

  • name: A unique name for each page.

  • link: The URL of the page to archive.

For example:
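Using hypothetical page names and URLs as placeholders, the file might look like this:

```
name,link
homepage,https://example.com/
pricing,https://example.com/pricing
```

Each name becomes a folder under the output directory, so keep names unique and filesystem-safe (no slashes or special characters).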


3. The Python Script

Here’s the Python script to automate the archiving process:

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

# Input CSV with "name" and "link"
input_csv = "landing_pages.csv"
output_dir = "saved_pages"

# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

# Load CSV
data = pd.read_csv(input_csv)

# List of file extensions to download
valid_extensions = [".jpg", ".jpeg", ".png", ".gif", ".svg",
                    ".css", ".js", ".mp4", ".webm", ".ogg",
                    ".mp3", ".pdf", ".doc", ".docx", ".xlsx", 
                    ".ppt", ".pptx", ".zip"]

def is_valid_url(url):
    """Check if the URL points to a downloadable asset."""
    return any(url.lower().endswith(ext) for ext in valid_extensions)

def save_asset(url, save_dir):
    """Download and save a single asset."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, stream=True, headers=headers, timeout=10)
        response.raise_for_status()

        # Extract the file name from URL
        file_name = os.path.basename(url.split("?")[0])
        file_path = os.path.join(save_dir, file_name)

        # Write the file
        with open(file_path, "wb") as file:
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)
        
        print(f"Downloaded: {url} -> {file_name}")
        return file_name
    except Exception as e:
        print(f"Failed to download: {url} -> {e}")
        return None

def save_page(name, url):
    try:
        # Request the main page
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        # Parse HTML
        soup = BeautifulSoup(response.text, 'html.parser')

        # Create a directory for the page
        page_dir = os.path.join(output_dir, name)
        os.makedirs(page_dir, exist_ok=True)

        # Path for the saved HTML file (written after assets are localized)
        html_file_path = os.path.join(page_dir, f"{name}.html")

        # Download assets
        for tag, attr in [("img", "src"), ("link", "href"), ("script", "src"), 
                          ("video", "src"), ("audio", "src"), ("source", "src"), 
                          ("a", "href")]:
            for resource in soup.find_all(tag):
                resource_url = resource.get(attr)
                if resource_url:
                    # Make URL absolute
                    resource_url = urljoin(url, resource_url)

                    # Check if it's a valid asset
                    if is_valid_url(resource_url):
                        # Download the asset
                        downloaded_file = save_asset(resource_url, page_dir)
                        if downloaded_file:
                            # Update the HTML tag to point to the local file
                            resource[attr] = downloaded_file

        # Save the updated HTML file
        with open(html_file_path, "w", encoding="utf-8") as html_file:
            html_file.write(soup.prettify())

        print(f"Page '{name}' saved successfully.")
    except requests.exceptions.RequestException as e:
        print(f"Failed to fetch page '{name}' at '{url}': {e}")
    except Exception as e:
        print(f"Error processing page '{name}': {e}")

# Process each row in the CSV
for _, row in data.iterrows():
    name = row["name"]
    url = row["link"]
    save_page(name, url)

4. Run the Script
  1. Save the script as web_archiver.py.

  2. Place it in the same directory as your landing_pages.csv file.

  3. Run the script in the terminal:
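From that directory, the command is:

```shell
python web_archiver.py
```

The script creates a saved_pages folder containing one subfolder per page, each with the rewritten HTML file and its downloaded assets.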


Use Cases for Web Archiving

1. Preserving Marketing Campaigns
  • Scenario: A marketer wants to preserve landing pages for reference or inspiration.

  • Benefit: By archiving these pages, they can be revisited offline, providing a reliable way to document campaign strategies and creative elements.

2. Web Development and Testing
  • Scenario: A developer needs to test design changes on a live site without affecting the production environment.

  • Benefit: Archiving the site allows developers to work locally, experimenting with updates and debugging without impacting the live version.

3. Data Collection for Research
  • Scenario: A researcher studying user interface design needs to analyze various websites for patterns.

  • Benefit: Archiving pages ensures they can be reviewed and compared later, even if the live sites change or go offline.


Actionable Takeaways

  • For Developers: Automate web archiving to create a reliable offline copy of live sites for testing and debugging.

  • For Marketers: Preserve landing pages and campaigns for future reference and inspiration.

  • For Researchers: Build a local repository of web pages for comprehensive analysis and study.

By leveraging automation, web page archiving becomes a quick and efficient process, enabling you to focus on what matters most: insights, development, and creativity.


Conclusion

Web page archiving may not sound glamorous, but it’s a practical skill that pays dividends in various fields. Whether you’re a marketer safeguarding your campaign assets, a developer testing changes offline, or a researcher analyzing web design trends, the ability to automate this process can save time and ensure accuracy.

Ready to take your web archiving to the next level? Follow the steps outlined in this tutorial and see how it transforms your workflow.

Happy archiving!

Related posts

December 13, 2024

How to Generate an Updated List of Mailable Prospects in Pardot: A Workaround

December 13, 2024

The Magic of Clipboard Automation: A Game-Changer for Developers and Marketers

Got questions?

I’m always excited to collaborate on innovative and exciting projects!

Phone

+49 172 5407 309


©2023 tatulinski.com
