When you keep crawling a website from the same IP, that IP is very likely to get banned, so you need proxies to spread the requests out.
There are two cases:
- The proxy has DNS configured: call crawl_page2().
- The proxy has no DNS configured: call crawl_page(). This case is a bit more involved: first resolve the IP address of the URL's domain, then add a dest_ip field to the HTTP request headers.
The code is as follows:
```python
#!/usr/bin/env python
# encoding:utf-8
import sys
import time
import urllib2
import socket
import struct


def get_dest_ip(domain):
    ip_addr = socket.gethostbyname(domain)  # just for ipv4
    # ip_addr = socket.getaddrinfo(domain, 'http')[0][4][0]
    # [(2, 1, 6, '', ('14.215.177.38', 80)), (2, 2, 17, '', ('14.215.177.38', 80)), (2, 1, 6, '', ('14.215.177.37', 80)), (2, 2, 17, '', ('14.215.177.37', 80))]
    uint32_binary_str = socket.inet_aton(str(ip_addr))
    unpack_result = struct.unpack("!I", uint32_binary_str)
    ip_int = socket.htonl(unpack_result[0])
    return ip_int


def crawl_page(url, dest_ip, cur_proxy):
    '''Crawl through a proxy without DNS: pass the resolved IP in the dest_ip header.'''
    content = ''
    myheaders = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)",
                 "Proxy-Connection": "Keep-Alive",
                 "dest_ip": dest_ip}
    p = "http://%s" % cur_proxy
    h = urllib2.ProxyHandler({"http": p})
    o = urllib2.build_opener(h, urllib2.HTTPHandler)
    o.addheaders = myheaders.items()
    try:
        r = o.open(url, timeout=5)
        content = r.read()
    except urllib2.HTTPError, e:
        print "Error Code:", e.code
        if e.code == 404:
            print "No page: %s" % url
    except urllib2.URLError, e:
        print "Error Reason:", e.reason
    except Exception as e:
        print "Error", str(e)
    if len(content) > 10:
        print "Good\t%s" % p
    else:
        print "Bad\t%s" % p
    return content


def crawl_page2(url, cur_proxy=''):
    '''Crawl through a proxy with DNS configured (or directly, if no proxy is given).'''
    # print "-->crawl comment: %s" % url
    # print "-->cur_proxy: %s" % cur_proxy
    myheaders = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)"}
    content = ''
    try:
        if cur_proxy:
            proxy_handler = urllib2.ProxyHandler({'http': cur_proxy})
            opener = urllib2.build_opener(proxy_handler)
            f = opener.open(url, timeout=5)
            content = f.read()
        else:
            req = urllib2.Request(url, headers=myheaders)
            f = urllib2.urlopen(req, timeout=5)
            content = f.read()
    except urllib2.HTTPError, e:
        print "Error Code:", e.code
        if e.code == 404:
            print "No page: %s" % url
    except urllib2.URLError, e:
        print "Error Reason:", e.reason
    except Exception as e:
        print "Error", str(e)
    time.sleep(1)
    return content


if __name__ == '__main__':
    # open the proxy list file given on the command line
    proxy_list = open(sys.argv[1]).readlines()
    url = "http://www.baidu.com"
    host = "www.baidu.com"
    dest_ip = str(get_dest_ip(host))
    for p in proxy_list:
        p = p.strip()
        n1 = len(crawl_page(url, dest_ip, p))
        print "crawl_page len:", n1
        n2 = len(crawl_page2(url, p))
        print "crawl_page2 len:", n2
```
Test output
The input file contains two proxy IPs: one with DNS configured and one without.
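As a sketch, the test is driven by running `python crawl_with_proxy.py proxy_list.txt` (both file names are assumed here), where `proxy_list.txt` holds one proxy per line, e.g. the two proxies from the run below:

```
10.183.27.147:32810
10.184.16.44:32810
```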
```
Good	http://10.183.27.147:32810
crawl_page len: 10811
crawl_page2 len: 10811
Good	http://10.184.16.44:32810
crawl_page len: 10811
Error Reason: timed out
crawl_page2 len: 0
```
Additional notes
Obtaining dest_ip queries the DNS service, not the target website itself, and DNS lookups are served from a local cache, so resolving the name frequently should not be a problem.
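As a quick sanity check of that caching behaviour (a minimal sketch in the same Python 2 environment as above; actual timings depend on whether your system runs a local DNS cache):

```python
import socket
import time

# Resolve the same host twice and compare elapsed time; if a local
# DNS cache is present, the second lookup is usually answered from it.
for i in range(2):
    t0 = time.time()
    socket.gethostbyname("www.baidu.com")
    print "lookup %d took %.4f s" % (i + 1, time.time() - t0)
```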
The function get_dest_ip() uses the following functions from the socket module (a small demo follows the list):
- socket.gethostbyname(): resolves a domain name to an IP address; IPv4 only. For IPv6 support, use socket.getaddrinfo() instead.
- socket.inet_aton(): converts an IP address string (e.g. 192.168.1.10) into a 32-bit packed binary string; IPv4 only. For IPv6 support, use socket.inet_pton() instead.
- socket.htonl(): converts a 32-bit integer from host byte order to network byte order.
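A minimal sketch of these calls side by side (IPv4 calls vs. their IPv6-capable variants; the example addresses are only illustrative):

```python
import socket
import struct

host = "www.baidu.com"

# IPv4-only resolution vs. the address-family-agnostic variant
print socket.gethostbyname(host)                 # e.g. '14.215.177.38'
print socket.getaddrinfo(host, 'http')[0][4][0]  # works for IPv4 and IPv6

# Pack an IPv4 dotted-quad into a 4-byte binary string, then
# unpack it as a big-endian (network-order) 32-bit integer.
packed = socket.inet_aton("192.168.1.10")
print struct.unpack("!I", packed)[0]

# inet_pton() is the address-family-aware equivalent of inet_aton()
print repr(socket.inet_pton(socket.AF_INET6, "::1"))

# htonl() converts a 32-bit integer from host to network byte order
print socket.htonl(1)
```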