Wednesday, April 21, 2010

LINUX: Download / Copy Entire Websites

Linux has a command, wget, for fetching web pages. Sometimes wget in recursive mode will retrieve only a single page, because the site's robots.txt tells automated crawlers to stay out and wget honors it by default. To download a whole site, first edit the /etc/wgetrc file and set robots = off, then use the following command:
wget --no-parent --wait=20 --limit-rate=20K -r -p -U Mozilla http://mxr.mozilla.org/mozilla/source/webtools/bonsai/index.html
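If you would rather not change the system-wide configuration, the same setting can go in a per-user ~/.wgetrc, or be passed for a single run with wget's -e option, which accepts a wgetrc directive on the command line. A minimal sketch, using the same example URL as above:

# ~/.wgetrc (or /etc/wgetrc): stop wget from honoring robots.txt
robots = off

# Equivalent one-off form, no config file edit needed:
wget -e robots=off --no-parent --wait=20 --limit-rate=20K -r -p -U Mozilla http://mxr.mozilla.org/mozilla/source/webtools/bonsai/index.html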

The -U Mozilla option tells the receiving server that the request comes from a Mozilla browser rather than a script, while --wait and --limit-rate pause between fetches and throttle the download speed to look more like a human visitor. The --no-parent switch keeps the recursion within the starting directory, so wget does not follow links up and out across the rest of the site.
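For reference, here is the same command spread over two lines with a comment on each option; the 20-second wait and 20K rate limit are just the values from the example and can be tuned:

# -r                recurse into links found on each page
# -p                also fetch images, CSS, and other files needed to display each page
# --no-parent       never ascend above the starting directory
# --wait=20         pause 20 seconds between requests
# --limit-rate=20K  cap the download speed at 20 KB per second
# -U Mozilla        send "Mozilla" as the User-Agent header
wget -r -p --no-parent --wait=20 --limit-rate=20K -U Mozilla \
    http://mxr.mozilla.org/mozilla/source/webtools/bonsai/index.html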

NOTE: this should only be used to download content you are permitted to copy.
