
Making a Mirror of a Website

How to create a mirror of a website using the wget command line in the shell. This method downloads every file (including images, CSS and so on) and rewrites the links in the pages as relative links, so the links in the mirror no longer point back at the original site and fail to work.

This method takes just one command line:
Code:
$ wget -mk -w 20 http://www.example.com/
The 20 in the command means wget waits 20 seconds between file downloads, which keeps the site from being hit too frequently. You can make it smaller, but if you are backing up someone else's site, do spare a thought for their server.
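If you also want to cap how much bandwidth the mirror uses (an extra wget option, not part of the original tip), --limit-rate can be combined with the delay; a minimal sketch:
Code:
$ wget -mk -w 20 --limit-rate=200k http://www.example.com/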

http://fosswire.com/post/2008/04 ... -website-with-wget/

GNU's wget command line program for downloading is very popular, and not without reason. While you can use it simply to retrieve a single file from a server, it is much more powerful than that and offers many more features.

One of the more advanced features in wget is the mirror feature. This allows you to create a complete local copy of a website, including any stylesheets, supporting images and other support files. All the (internal) links will be followed and downloaded as well (and their resources), until you have a complete copy of the site on your local machine.

In its most basic form, you use the mirror functionality like so:
Code:
$ wget -m http://www.example.com/
There are several issues you might have with this approach, however.

First of all, it's not very useful for local browsing, as the links in the pages themselves still point to the real URLs and not your local downloads. What that means is that, if, say, you downloaded http://www.example.com/, the link on that page to http://www.example.com/page2.html would still point to example.com's server and so would be a right pain if you're trying to browse your local copy of the site while being offline for some reason.

To fix this, you can use the -k option in conjunction with the mirror option:
Code:
$ wget -mk http://www.example.com/
Now, that link I talked about earlier will point to the relative page2.html. The same happens with all images, stylesheets and resources, so you should be able to now get an authentic offline browsing experience.
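A quick aside not in the original article: wget saves the mirror under a directory named after the host, so once the download finishes you can open the local copy directly, for example:
Code:
$ xdg-open www.example.com/index.html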

There's one other major issue I haven't covered here yet - bandwidth. Disregarding the bandwidth you'll be using on your connection to pull down a whole site, you're going to be putting some strain on the remote server. You should think about being kind and reduce the load on them (and you) especially if the site is small and bandwidth comes at a premium. Play nice.

One of the ways in which you can do this is to deliberately slow down the download by placing a delay between requests to the server.
Code:
$ wget -mk -w 20 http://www.example.com/
This places a delay of 20 seconds between requests. Replace that number, and optionally you can add a suffix of m for minutes, h for hours, and d for ... yes, days, if you want to slow down the mirror even further.
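For instance, to wait two minutes between requests (my own illustration of the suffix syntax described above):
Code:
$ wget -mk -w 2m http://www.example.com/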

Now if you want to make a backup of something, or download your favourite website for viewing when you're offline, you can do so with wget's mirror feature. To delve even further into this, check out wget's man page (man wget) where there are further options, such as random delays, setting a custom user agent, sending cookies to the site and lots more.
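A hedged sketch combining a few of those man-page options; the user-agent string below is just a placeholder of my own:
Code:
$ wget -mk -w 20 --random-wait --user-agent="MyMirror/1.0" http://www.example.com/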

Using wget to batch-download the files under an HTTP directory

The idea: first download the index.html of the directory page you want to mirror (the file may not actually be called that!), then use wget to download every link contained in that file.

For example:
Code:
wget -v -E -r -L -np -nH --tries=20 --timeout=40 --wait=5 http://mirrors.163.com/gentoo/distfiles/
Or, more simply:
Code:
wget -m http://mirrors.163.com/gentoo/distfiles/
This gives you the index.html file of the distfiles page. Its content goes without saying: it holds links to every source package under the distfiles directory. Getting this file is easy (a single -m parameter will also do), but it is fairly large, so to avoid timeouts and long waits you can add the --tries, --timeout and --wait parameters.
Code:
wget -nc -B http://mirrors.163.com/gentoo/distfiles/ -F -nH --cut-dirs=3 -i index.html
Ok!!!

Later I decided to sync the downloads from tom's mirror instead, but it turned out that tom does not allow browsing their gentoo mirror pages, so there was no way to get distfiles' index.html from them. I then tried reusing the index.html obtained from 163: since it only contains relative paths, all that was needed was to substitute tom's distfiles directory for 163's path, and the mirror files listed in 163's index.html could be downloaded from tom just the same.
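A sketch of that substitution: keep the index.html downloaded from 163, but point the -B base URL at tom's directory. The tom path below is only a placeholder, since the post does not give the real one:
Code:
# the base URL here is a placeholder; substitute tom's actual distfiles path
wget -nc -B http://mirrors.tom.com/gentoo/distfiles/ -F -nH --cut-dirs=3 -i index.html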

Parameter explanations:
-B: use the given URL as the base for the relative links in the file read with -i
-nc: skip files that already exist when downloading
-nH: do not create a directory named after the host
-i: download all the URLs listed in the file given after -i
-v: verbose output
-E: force pages to be saved with an .html extension
-r: recursive, i.e. also fetch the subdirectories of subdirectories
-L: follow relative links only
-np: do not ascend to the parent directory
-e robots=off: temporarily bypass robots.txt
There are plenty of other assorted parameters, such as specifying directories, filtering and so on; look into them yourself.

For files published over FTP it is even simpler: you can use the -r parameter together with the * wildcard, which gives you a fully recursive download.
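A small illustration of FTP globbing; the host and path are made up, only to show the form:
Code:
# quote the URL so the shell does not expand the * itself
wget -r "ftp://ftp.example.com/gentoo/distfiles/*.tar.bz2"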

PS: 163 later added Robot Exclusion, and since wget follows the Robot Exclusion standard, a recursive run will only end up downloading 163's robots.txt unless you change things. You can get around this either by flipping the robots switch in the wgetrc file, or by passing the "-e robots=off" parameter to temporarily ignore robots.txt for the recursive download.
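Both ways of switching off the robots check, sketched out:
Code:
# one-off: ignore robots.txt for this recursive run only
wget -e robots=off -m http://mirrors.163.com/gentoo/distfiles/

# permanent: add this line to your ~/.wgetrc
# robots = off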

