From time to time there is a need to prepare a complete copy of a website, either to share it with someone or to archive it for offline viewing. Such an archive should contain everything that is visible on the site, including CSS style sheets, images, and attached documents such as PDF files. One useful tool for this job is wget – an application that retrieves files over the HTTP, HTTPS, FTP, and FTPS protocols. It is a very powerful tool, yet it requires some knowledge to take full advantage of its capabilities.
“Basic” wget usage
The most common command we use to mirror a whole website is:
wget --mirror --convert-links --adjust-extension --page-requisites http://www.mywebsite.com/
What do the options above mean? Let’s take a look at them one by one:
--mirror – turns on recursive retrieval with infinite recursion depth. This is the crucial option if you want to obtain the whole website.
--convert-links – forces wget to rewrite links within the downloaded pages so that they point at the downloaded resources. Instead of domain names or absolute paths, they are rewritten to relative equivalents.
--adjust-extension – as you have probably noticed, modern websites usually do not use the .html extension in their URLs. This causes a problem when viewing the offline version of the site: the files are not opened in the browser. This option forces wget to add the proper extension to downloaded files.
--page-requisites – causes wget to download all the files required to properly display the page. This includes images referenced from style sheets, related files, and even sounds.
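Worth knowing: according to the wget manual, --mirror is itself shorthand for a handful of lower-level options, so the command above can be written equivalently in expanded form:

```shell
# --mirror expands to -r (recursive), -N (timestamping),
# -l inf (infinite recursion depth) and --no-remove-listing
wget -r -N -l inf --no-remove-listing \
     --convert-links --adjust-extension --page-requisites \
     http://www.mywebsite.com/
```

The expanded form is handy when you want most of the mirroring behavior but, for example, a limited recursion depth (-l 3 instead of -l inf).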
Subdomain related issues
Some websites use subdomains to serve media files, and images are often served from a content delivery network. This means that the basic wget configuration will not retrieve them along with the page: by default, wget refuses to visit hosts with a domain different from the one specified. Two options are useful in such a case:
--span-hosts – enables host spanning, which means that wget will follow links to other domains as well. The problem is that this may cause wget to download half the Internet to your disk drive, so we want to limit it somehow…
-D – lets us select the domains we want to retrieve from, given as a comma-separated list of domains to include.
--exclude-domains – the opposite of the above: lets us provide a list of domains to exclude.
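The two filters can be combined. As a sketch (the ads.mywebsite.com subdomain here is hypothetical): -D matches domain suffixes, so the whole mywebsite.com family can be allowed while a single unwanted subdomain is cut out:

```shell
# Allow everything under mywebsite.com, except the (hypothetical) ads subdomain
wget --mirror --convert-links --adjust-extension --page-requisites \
     --span-hosts -D mywebsite.com --exclude-domains ads.mywebsite.com \
     http://www.mywebsite.com/
```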
Given the above, our retrieval command is now a little more complex:
wget --mirror --convert-links --adjust-extension --page-requisites --span-hosts -D www.mywebsite.com,files.mywebsite.com,images.somewhere.com http://www.mywebsite.com/
Additional useful options
Some of the websites we archive display a different (visually simplified) version of the site to screen readers and other assistive technologies. If we want to retrieve the full version of the website, we pass -U followed by a user agent string. In its simplest form it looks like this:
wget -U Mozilla http://www.mywebsite.com/
This option alone is enough in most cases. Sometimes we also want to prevent tracking cookies from being stored – we want each page to act as if it were visited directly, without prior interaction with other pages. In such a case we use:
wget --no-cookies http://www.mywebsite.com/
The last option we use frequently turns off obeying robots.txt. By default wget respects the entries in the robots.txt file, which means it sometimes refuses to download files we need. To disable this behavior we use:
wget -e robots=off http://www.mywebsite.com/
This is not a complete guide to wget – I have only mentioned the options we find most useful in our daily work. The complete command we use most often is:
wget --mirror --convert-links --adjust-extension --page-requisites --span-hosts -U Mozilla -e robots=off --no-cookies -D www.mywebsite.com,files.mywebsite.com,images.somewhere.com http://www.mywebsite.com/
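If you mirror sites regularly, most of these switches can also be set once in the ~/.wgetrc startup file instead of being repeated on the command line. A sketch using the wgetrc equivalents of the options above (the domain names are the same placeholders as before):

```
# ~/.wgetrc – rough equivalents of the long command above
mirror = on
convert_links = on
adjust_extension = on
page_requisites = on
span_hosts = on
user_agent = Mozilla
robots = off
cookies = off
domains = www.mywebsite.com,files.mywebsite.com,images.somewhere.com
```

With this in place, a plain `wget http://www.mywebsite.com/` behaves like the full command – but remember these settings then apply to every wget invocation, so keep them in a separate file passed via `--config` if that is not what you want.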
If you are using other useful options for wget, feel free to share them in the comments.