Photocopy the Web
How to create a digital stockpile in a time of disappearing data.
Another day, another successful de-platforming campaign, but when the Internet Archive purges its eponymous cache of the offending texts, a different kind of slight on one's natural rights is taking place.
In the first instance, those who own or otherwise control the machines that serve up the apparition unmistakably have the moral and legal right to kick out a guest, though naturally subject to rights pursuant to an orderly eviction. In the second case, ostensibly professional organizations of archivists are choosing what portions of online history are worth preserving, and which should be banished to the flames, according to some opaque and invariably ideologically driven principle.
A happy fact of technology, however, is that we can use it to free ourselves from total preoccupation with political or philosophical questions. At a certain point, one chooses to exercise prudence, and construct a first attempt. They see what happens, make tweaks, and see what works. (And more broadly I may suggest, discover what aids in the cultivation of true virtues within and across online communities.)
In this stage of technological development, little is achieved by a mob insistent on burning books at the local library. Campaigns run by ACLU attorneys to suppress particular books on Amazon can have only a minimal and transitory effect. These are stochastic acts, and producing a kind of fear is their effect if not their intended purpose. If demoralization, or acceptance of defeat, follows then further strictures on access or production of texts can be formalized, and the process repeats.
For all this smoke and mirrors – and to be clear, these are quite dangerous, one is the chief cause of mortality in house fires, and the other may inflict life-threatening grievous wounds – any person reasonably competent with technology can assemble and share their own library today. We knew this as early as listservs and Usenet, Kazaa, and BitTorrent, and all the rest. We know it with eBooks and home media servers.
But naturally it's true also of webpages or texts. For many reading this article, you will spend more time reading blogs, forum threads (group blogs), and Substack articles (blogs), than all the rest, perhaps combined. And in the end, which is more likely: That you find some anon's article purged from existence, or that Clint Eastwood films will be forever lost?
So let's depart history and philosophy and pick up the technē. As I mentioned, it's a happy accident that – despite widespread over-engineering – the typical webpage is a few megabytes in size. And if this is cleaned up and reduced to text and essential images, as many "readability" apps do, the file size may be one-tenth of that. Which is to say that one can comfortably store a few thousand articles – providing a few months of continuous reading time – in the space it takes to store one movie. It's clear we won't face serious physical or digital constraints here.
Is your library personal, private, or public?
We are seeing a resurgence of decentralized or semi-autonomous communities on the web that resemble the early semi-commercial blogosphere. There are analogies between the web-ring and something like Substack, or more abstractly, the emergent associations of the podcasting and social media world. Whatever can be said, it is far more interesting than the Big Three, so we should count our blessings.
Whether you're looking to curate links and articles with a broader community of mutuals, collect webpages to share with a couple friends, or are just stockpiling articles for yourself, you have a few different options.
Save a page to disk
When it comes to saving web pages – that's right, just File and Save – you obtain something substantially incomplete. Many browsers won't preserve media, layout, or style from the original. Others may generate a bunch of files and attempt to represent locally some small slice of the web. Safari produces exports of the webarchive format, embedding all of a webpage's key assets into one file. Another option is to save webpages as PDFs, or to use specialized programs to extract text.
You'll end up with a number of files, which you can organize sensibly into folders, and your operating system's search functionality is probably good enough for you to locate what you need. Maybe this is all you need.
Host text files
Plain text has its benefits, if you're not afraid of copy and paste. A savvy user can host them quite easily with any web server, or use static hosting like GitHub Pages to share them across the web.
Large archives house text-only manifestos, essays, and other delusions from creative networks predating the blogosphere: email chains, BBS, newsgroups, listservs, and others. And some newer primitives have entered the text file enthusiast space; today you can join a decentralized microblogging network driven by text, or even run your own. Text files are a kind of primitive in creative computing, and you can develop your own projects around hosting them on your own server, or use them to get into artificial intelligence and natural language processing. Don't discount them.
Use an indie bookmarking site
If you're looking for something that goes beyond a lonely folder on your drive, involves less work, and you're comfortable delegating control to a small outfit, Pinboard is a simple paid bookmarking tool that will allow you to assemble a public or private collection of links. For a few extra bucks, they will also save copies of the sites you bookmark.
Other options include Pocket and Instapaper, among many many others. Many of these services blur the line with the quaint RSS feed reader, worthy of its own tools and article.
Launch your own web archive
Thanks to open source software, with just a bit more work, you can launch your own bookmarking site. Here, Shiori is an excellent option and you can run it on your own infrastructure in a matter of minutes. If you're comfortable with Docker, this guide will help you get off and running and web accessible.
You can also run it locally, and try it out in a few steps. Download the Shiori binary for your platform from Github, and move it to your applications folder or update your system path.
Check out what you can do with Shiori on the command line, which includes importing URLs and exporting bookmark data, by running:
Then, run the web application locally with the command:
Once properly started, you can visit it locally at
http://localhost:8080 , and log in with the default credentials:
username: shiori password: gopher
This will give you a sense of the interface, and it's fully functional. However, since you're running this web application locally, it won't be accessible to others, and it won't be accessible from your browser when the program isn't running.
In practice, you will want to follow a similar procedure on your own server to enable yourself and others to access this application, and to keep it running day and night.
When you log in, you'll be met with a simple UI displaying your collection of bookmarks:
You can add new URLs with a click, and archive these pages by default, which will result in two copies of the webpage being saved to your server: A text copy processed to improve readability, and an archive of HTML and assets replicating the original site.
Settings are kept fairly minimal. Most parameters for bookmarks will be pulled in automatically, including its title, excerpt, and tags, though they can be manually adjusted. The application has a list and tile view, and you can add other users to your site, granting read-only access if you like.
Developers can leverage Shiori for their own creations using its open APIs, and an open source browser extension for Firefox and Chrome is also available for convenient use.
From there, you can create and share accounts with friends, or extend Shiori through its API to authenticate your mutuals and give them access. The web app is now yours to make your own.
Seek even more control?
ArchiveBox is also open-source software, but enables greater configuration and is more suitable to larger collections shared across very large numbers of users. Think of it as powering less a library of archived bookmarks, but your own Internet Archive or WayBack Machine. You can set it to continuously archive pages and feeds as HTML files, PDFs, image screenshots, and more, including embedded media. This is option is an excellent choice for power users and developers, and is well-documented.