Wpull is a Python-based alternative to Wget, designed specifically for site archival and preservation. It crawls the internet and produces Internet Archive-compatible WARC files. It was created primarily by Christopher Foo, of the ArchiveTeam, and is now used as the crawling engine inside ArchiveBot (a super-cool IRC-controlled web crawler).
Using Wpull, you can crawl the internet and create your own WARC files. These files can then be stored locally, manipulated, or (better yet) contributed to the Internet Archive itself so everyone can benefit.
The internet is very transient. Web pages are removed just as quickly as they are created, it seems. Rarely does a day go by without a major website closing down. This is why it's important to archive your work (and the work of others, just for good measure).
Here’s how to get started running Wpull to archive a website into a WARC.
1. Install Wpull
Before you can get started, you need to install Wpull on your machine. I typically do this on a fresh Ubuntu Server VM, so if I make a mess it doesn’t matter.
Wpull is written in Python and distributed via PIP. You can run these commands to easily install it:
sudo apt-get install python3-pip
pip3 install wpull
You should now be able to run the command “wpull” and see some output (it’ll show you the usage options).
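If you'd rather keep Wpull and its dependencies separate from your system Python packages, installing it into a virtual environment also works. This is optional, and the paths below are just an example:
sudo apt-get install python3-venv
python3 -m venv ~/wpull-env
~/wpull-env/bin/pip install wpull
~/wpull-env/bin/wpull --help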
2. Commands for Archiving a Website with Wpull
To appropriately archive a site, there are quite a few commands you need to use. Here’s an example set of commands you can use to archive a full domain:
wpull http://example.com/ \
--no-check-certificate \
--no-robots \
--page-requisites \
--no-parent \
--sitemaps \
--inet4-only \
--timeout 20 \
--tries 3 \
--waitretry 5 \
--recursive \
--level inf \
--span-hosts \
--retry-connrefused \
--retry-dns-error \
--delete-after \
--warc-append \
-U "MyAmazingUserAgent (Change This)" \
-o sitearchive-example.com.log \
--database sitearchive-example.com.db \
--warc-file "sitearchive-example.com" \
--warc-header "operator: Put your name here" \
--warc-header "downloaded-by: MyAmazingUserAgent (Change This)" \
--domains example.com \
--exclude-domains example.net \
--reject-regex "/(ads.example.com)/" \
--concurrent 4
Option | Information
--- | ---
--no-check-certificate | Tells the crawler to ignore certificate errors.
--no-robots | Ignores the robots.txt file.
--page-requisites | Ensures page prerequisites are downloaded, such as JavaScript and stylesheets.
--no-parent | Doesn't crawl "up" to parent directories. Not strictly necessary when crawling an entire domain.
--sitemaps | Downloads the sitemap to find more links.
--inet4-only | Limits connections to IPv4 only. Remove this if you want to crawl via IPv6.
--timeout 20 | The maximum time (in seconds) to spend on DNS lookups, connections and data read operations.
--tries 3 | The number of retries on transient errors.
--waitretry 5 | The number of seconds to wait before retrying a transient error.
--recursive | Follows all links and downloads them.
--level inf | Limits recursion depth to either a specific number or "inf" for infinite.
--span-hosts | Allows downloading resources from off-site (different domains).
--retry-connrefused | Keeps retrying even if the server refuses your connection.
--retry-dns-error | Keeps retrying even if the DNS lookup fails.
--delete-after | Deletes the files from disk once they have been downloaded. Only use this option in conjunction with the WARC file (otherwise you don't end up saving the site anywhere).
--warc-append | Keeps using an existing WARC file if one exists with the same filename (otherwise it overwrites your existing file).
-U "MyAmazingUserAgent (Change This)" | The HTTP User Agent to identify yourself as. Change this to something unique.
-o sitearchive-example.com.log | Where the log file gets written.
--database sitearchive-example.com.db | A SQLite file is written here, tracking the progress of your crawl. This allows you to stop and resume later.
--warc-file "sitearchive-example.com" | The name of the compressed WARC file (".warc.gz" will be appended).
--warc-header "operator: Put your name here" | Put your name in here so people know who created this WARC file.
--warc-header "downloaded-by: MyAmazingUserAgent (Change This)" | Put your HTTP User Agent here so people know which User Agent fetched the website.
--domains example.com,example.com.au | Only download from these specified domains (comma-separated).
--exclude-domains media.example.com | Don't download from these specified domains (comma-separated).
--reject-regex "/(ads.example.com)/" | Regex rules to exclude certain URLs.
--concurrent 4 | How many downloads should run at once. Keep an eye on your bandwidth, CPU, memory and disk I/O.
A full explanation of the available options can be found in the Wpull online documentation. The options shown above are based on the Wpull examples, as well as the parameters ArchiveBot uses.
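Because that's a lot to retype, one approach is to wrap the whole thing in a small shell script so each crawl is repeatable. This is just a sketch of the same command shown above; the script name and variables are placeholders you'd adapt:
#!/bin/bash
# archive-site.sh - wraps the wpull command above so a crawl is easy to repeat
DOMAIN="example.com"
AGENT="MyAmazingUserAgent (Change This)"

wpull "http://$DOMAIN/" \
  --no-check-certificate --no-robots --page-requisites --no-parent \
  --sitemaps --inet4-only --timeout 20 --tries 3 --waitretry 5 \
  --recursive --level inf --span-hosts --retry-connrefused --retry-dns-error \
  --delete-after --warc-append \
  -U "$AGENT" \
  -o "sitearchive-$DOMAIN.log" \
  --database "sitearchive-$DOMAIN.db" \
  --warc-file "sitearchive-$DOMAIN" \
  --warc-header "operator: Put your name here" \
  --warc-header "downloaded-by: $AGENT" \
  --domains "$DOMAIN" \
  --concurrent 4
  # add --exclude-domains / --reject-regex lines here as needed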
3. Running Wpull in the background
If you prefix your "wpull" command with "screen", it will run inside a screen session that you can detach from (press Ctrl+A, then the letter 'd') and leave running in the background.
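For example, assuming GNU Screen is installed (sudo apt-get install screen), a session might look like this:
screen -S sitecrawl        # start a named session (the name is just an example)
wpull http://example.com/ ...all the options from above...
# press Ctrl+A then 'd' to detach; the crawl keeps running
screen -r sitecrawl        # reattach later to check on progress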
4. What Now?
After running Wpull, it’s worth considering what you can do with the resultant file. Firstly, you can open it up and read/grep/search it. Once the GZ compression has been removed it’s essentially a text file.
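For example, you can browse the archive or list the URLs that were captured without even decompressing the file (the filename assumes the --warc-file name used earlier):
zless sitearchive-example.com.warc.gz
zgrep -a "WARC-Target-URI" sitearchive-example.com.warc.gz | sort -u | head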
Secondly, you can run your own Wayback Machine. Everything you need is open source. How cool!
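If you want to try this, pywb is one open-source replay tool that can serve WARC files locally. Very roughly, and assuming its commands haven't changed (check the pywb documentation):
pip3 install pywb
wb-manager init my-archive
wb-manager add my-archive sitearchive-example.com.warc.gz
wayback
# then browse to http://localhost:8080/my-archive/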
Thirdly, you can contribute it back to Archive.org so everyone can benefit (this is what ArchiveBot does). If you sign up for an account on their website, you can upload your very own WARC files (and basically any other content you have the rights to). But when you upload it yourself, it won't get ingested into the Wayback Machine. You can ping someone in #ArchiveTeam on EFNet (IRC) to chat about this and what your options are.
When to archive
Unless your site already has very good coverage in the Wayback Machine, it's worth doing a site archive before you do a major upgrade or take down a website. Most average websites, even if they are in the Wayback Machine, don't have very deep coverage by default. I also personally like to crawl a site if I think it's "under threat" (and not just the site itself: government programs and community groups have their funding threatened all the time).
Other Wpull Tips
- If you press Ctrl+C, Wpull will quit gracefully. Press Ctrl+C a second time and it will force quit.
- If you use the "--warc-append" option, you can restart where you left off without losing any work.
- Keep an eye on it, especially initially, to ensure you don't start sucking up huge quantities of unnecessary content. Many websites are poorly structured, so you can end up with massive amounts of duplicates. I also like to take a look at the SQLite file to see what has been queued (see the example after this list). If it's getting out of control, quit gracefully and modify the ignore rules.
- There are Python scripting hooks, so you can modify the behaviour without hacking the core (if you do hack at the core, consider doing a pull request to contribute it back to the community)
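To peek inside the SQLite crawl database mentioned above, the standard sqlite3 command-line tool is enough. The table layout depends on your Wpull version, so list the tables first rather than assuming any particular schema:
sqlite3 sitearchive-example.com.db ".tables"
sqlite3 sitearchive-example.com.db ".schema"
Once you can see the table and column names, an ordinary SELECT will show you how many URLs are queued versus completed.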
What cool things have you done with Wpull? How have you put it to use? Leave a comment below – I’m keen to find out.