New Archiver WordPress Plugin Auto-Generates Wayback Machine Snapshots

During a recent NerdWallet hackathon, WordPress plugin developer Mickey Kay and his colleague John Lee came up with an idea for creating a visual archive for the site’s content that would allow them to look back at previous versions and associate SEO and performance shifts with content changes. WordPress powers a large portion of NerdWallet, alongside a number of Node/React apps and various Python micro-services.

As WordPress’ revision system doesn’t create a visual archive, Kay and Lee looked outside of the platform for a solution. They landed on the Wayback Machine, a service of the non-profit Internet Archive, which is dedicated to building a digital library of Internet sites and other cultural artifacts in digital form. The tool provides an interface that makes it easy to browse previous versions of a site. Unfortunately, the Wayback Machine is sporadic at best when it comes to crawling websites. Its calendar view shows how many times a site was crawled, not how many times it was updated.

Kay decided to build a solution that would work with the Wayback Machine to create a steadier, more reliable archive that can be easily accessed from WordPress. His new Archiver plugin auto-generates Wayback Machine snapshots of the site whenever content changes.

Archiver does the following things:

  • Automatically creates a Wayback Machine snapshot when you update your content
  • Allows you to manually trigger a snapshot of any page on your site using the admin
  • Allows you to easily view your site’s Wayback Machine archives (all snapshots) for any page on your site
  • Adds an “Archives” metabox to the admin edit screen of specific content types that can be used to easily view existing snapshots

The plugin works by posting to the Wayback Machine’s publicly available save endpoint (https://web.archive.org/save/) and reading existing snapshots from its CDX endpoint (https://web.archive.org/cdx).
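As a rough illustration of the two calls described above, here is a minimal Python sketch. It is not the plugin's actual code. The `/save/` base URL comes from the article; the `/cdx/search/cdx` path, the `url`, `output`, and `limit` query parameters, the JSON row layout (a header row followed by one row per capture), the shape of the save request, and all function names are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base URL quoted in the article.
SAVE_ENDPOINT = "https://web.archive.org/save/"
# Assumption: the CDX search path and its query parameters.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"
# Assumption: the URL pattern under which captures are replayed.
REPLAY_PREFIX = "https://web.archive.org/web/"


def build_save_url(page_url):
    """Return the URL that asks the Wayback Machine to capture page_url."""
    return SAVE_ENDPOINT + page_url


def build_cdx_url(page_url, limit=10):
    """Return a CDX query URL asking for captures of page_url as JSON."""
    query = urlencode({"url": page_url, "output": "json", "limit": limit})
    return f"{CDX_ENDPOINT}?{query}"


def parse_cdx_rows(rows):
    """Turn CDX JSON rows (header row first) into replay URLs."""
    if not rows:
        return []
    header, *records = rows
    snapshots = []
    for record in records:
        entry = dict(zip(header, record))
        snapshots.append(f"{REPLAY_PREFIX}{entry['timestamp']}/{entry['original']}")
    return snapshots


def trigger_snapshot(page_url, timeout=30):
    """Request a new capture (network call); returns the HTTP status code."""
    request = Request(
        build_save_url(page_url),
        headers={"User-Agent": "archiver-sketch/0.1"},  # hypothetical agent name
    )
    with urlopen(request, timeout=timeout) as response:
        return response.status


def list_snapshots(page_url, limit=10, timeout=30):
    """Fetch existing captures of page_url (network call) as replay URLs."""
    with urlopen(build_cdx_url(page_url, limit), timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return parse_cdx_rows(json.loads(body) if body.strip() else [])
```

Under these assumptions, calling `trigger_snapshot("https://example.com/my-post/")` after a content update would request a fresh capture, and `list_snapshots("https://example.com/my-post/")` would return links comparable to those shown in the plugin's archives metabox. The URL building and parsing are kept separate from the network calls so they can be checked offline.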

Archiver works on posts, pages, custom post types, categories, tags, custom taxonomies, and users. Existing snapshots for each content type are available in the editing screen in an archives metabox.

I tested the Archiver plugin and found that it works as expected. When content is updated, a new snapshot is automatically generated. Manually triggering a snapshot works instantly.

Kay said that the NerdWallet team is working to incorporate the WP REST API to integrate across systems and surface WordPress content in their React-powered apps. The Archiver plugin is not yet used in production, but they have it slated for an upcoming code sprint.

Archiver can be useful for understanding the impact of content changes on marketing, SEO, and e-commerce sales, but it also helps preserve the history of web pages as they evolve over time. The best part is that it sends the snapshots automatically and doesn’t use up space on your server. The only drawback is that if someday the Wayback Machine were to disappear, the snapshots would no longer be available.

Archiver is available on WordPress.org, and contributions and suggestions are welcome on GitHub. Usage of the Wayback Machine is free, but its maintainers estimate that permanent storage costs them approximately $2.00 USD per gigabyte. If you’re depending heavily on the Wayback Machine’s snapshots, you might consider a donation to help keep the digital library up and running.

4 responses to “New Archiver WordPress Plugin Auto-Generates Wayback Machine Snapshots”

  1. You both bring up good points, and I had the same questions/concerns when I first considered building this plugin. I reached out to info@archive.org numerous times with this exact question, however I never received a response. I also searched as much of their documentation as I could find for any mention of throttling and/or blacklisting based on save frequency, however I found no mention of anything along these lines. Like all services, The Wayback Machine is not meant to be abused, and I highly discourage users from setting up any functionality that would cause overly frequent pinging of TWM. That said, if you look at the sheer volume of data being cached by TWM continuously, around the clock, in a highly automated fashion, the normal frequency of saving posts on even a high traffic site is significantly less (e.g. reddit.com was cached 387 times yesterday). This isn’t to say that the questions you raise aren’t valid – they completely are, and I wish I’d received a response from TWM clarifying this exact issue. That said, TWM’s visual form interface is in reality their public API. Whether you submit your site via their actual form front-end, or by posting to their form’s action endpoint, the service on their end knows no difference.

    With all this said, if you know any more about these questions than I was able to uncover, by all means I would love to know. Thanks!
