
Downloading Through Proxies in Python

When you keep crawling a site from the same IP, that IP is very likely to get banned. Proxies let you spread the requests across multiple addresses.

There are two cases:

  1. The proxy performs DNS resolution itself: call crawl_page2().
  2. The proxy does not perform DNS resolution: call crawl_page(). This case is slightly more involved: first resolve the URL's domain to an IP address, then add a dest_ip field to the HTTP request headers.

The code is as follows:

#!/usr/bin/env python
# encoding:utf-8
import sys
import time
import urllib2
import socket
import struct

def get_dest_ip(domain):
    '''Resolve a domain name and return its IPv4 address as an integer.'''
    ip_addr = socket.gethostbyname(domain)  # IPv4 only
    # For IPv6 support, use getaddrinfo() instead; it returns entries like
    # [(2, 1, 6, '', ('14.215.177.38', 80)), (2, 2, 17, '', ('14.215.177.38', 80)), ...]
    #ip_addr = socket.getaddrinfo(domain, 'http')[0][4][0]
    uint32_binary_str = socket.inet_aton(ip_addr)           # 4-byte packed string, network order
    unpack_result = struct.unpack("!I", uint32_binary_str)  # interpret the bytes as big-endian
    ip_int = socket.htonl(unpack_result[0])                 # host byte order -> network byte order
    return ip_int

def crawl_page(url, dest_ip, cur_proxy):
    '''Fetch url through a proxy that does no DNS resolution, passing the
    pre-resolved destination IP in a custom dest_ip request header.'''
    content = ''
    myheaders = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)",
        "Proxy-Connection": "Keep-Alive",
        "dest_ip": dest_ip}
    p = "http://%s" % cur_proxy
    h = urllib2.ProxyHandler({"http": p})
    o = urllib2.build_opener(h, urllib2.HTTPHandler)
    o.addheaders = myheaders.items()
    try:
        r = o.open(url, timeout=5)
        content = r.read()
    except urllib2.HTTPError as e:
        print "Error Code:", e.code
        if e.code == 404:
            print "No page: %s" % url
    except urllib2.URLError as e:
        print "Error Reason:", e.reason
    except Exception as e:
        print "Error", str(e)
    if len(content) > 10:
        print "Good\t%s" % p
    else:
        print "Bad\t%s" % p
    return content

def crawl_page2(url, cur_proxy=''):
    '''Fetch url, optionally through a proxy that handles DNS itself.'''
    myheaders = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)"}
    content = ''
    try:
        req = urllib2.Request(url, headers=myheaders)
        if cur_proxy:
            proxy_handler = urllib2.ProxyHandler({'http': cur_proxy})
            opener = urllib2.build_opener(proxy_handler)
            f = opener.open(req, timeout=5)
        else:
            f = urllib2.urlopen(req, timeout=5)
        content = f.read()
    except urllib2.HTTPError as e:
        print "Error Code:", e.code
        if e.code == 404:
            print "No page: %s" % url
    except urllib2.URLError as e:
        print "Error Reason:", e.reason
    except Exception as e:
        print "Error", str(e)
    time.sleep(1)
    return content

if __name__ == '__main__':
    # sys.argv[1]: file with one proxy (host:port) per line
    proxy_list = open(sys.argv[1]).readlines()
    url = "http://www.baidu.com"
    host = "www.baidu.com"
    dest_ip = str(get_dest_ip(host))
    for p in proxy_list:
        p = p.strip()
        if not p:
            continue
        n1 = len(crawl_page(url, dest_ip, p))
        print "crawl_page len:", n1
        n2 = len(crawl_page2(url, p))
        print "crawl_page2 len:", n2

Test Output

The input file contains two proxy IPs: one proxy has DNS configured, the other does not.
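The exact contents of the input file are not shown in the post; judging from the proxy addresses printed below, a plausible format is one host:port pair per line:

10.183.27.147:32810
10.184.16.44:32810

The script would then be invoked as python crawl.py proxy_list.txt (both file names here are hypothetical).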

Good    http://10.183.27.147:32810
crawl_page len: 10811
crawl_page2 len: 10811
Good    http://10.184.16.44:32810
crawl_page len: 10811
Error Reason: timed out
crawl_page2 len: 0

Additional Notes

Obtaining dest_ip only queries the DNS service, not the target site itself, and DNS results are cached locally, so resolving the domain frequently should not be a problem.
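If you would rather not rely on the resolver's cache, here is a minimal sketch of an application-level cache on top of get_dest_ip(); the _dns_cache dict and the wrapper name are assumptions, not part of the original script:

_dns_cache = {}  # domain -> integer IP; assumption, not in the original script

def get_dest_ip_cached(domain):
    # resolve each domain at most once per process
    if domain not in _dns_cache:
        _dns_cache[domain] = get_dest_ip(domain)
    return _dns_cache[domain]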

The function get_dest_ip() uses the following functions from the socket module (a short walkthrough of the intermediate values follows the list):

  • socket.gethostbyname() resolves a domain name to its IP address; IPv4 only. For IPv6 support, use socket.getaddrinfo() instead.
  • socket.inet_aton() converts an IP address string (e.g. 192.168.1.10) into a 32-bit packed binary string; IPv4 only. For IPv6 support, use socket.inet_pton() instead.
  • socket.htonl() converts a 32-bit integer from host byte order to network byte order.
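As a quick illustration of the intermediate values (a sketch only; the address is taken from the getaddrinfo() comment in the code above, and the final value assumes a little-endian host):

import socket
import struct

packed = socket.inet_aton("14.215.177.38")   # 4-byte packed string in network order
print repr(packed)                           # '\x0e\xd7\xb1&'
host_int = struct.unpack("!I", packed)[0]    # interpret the bytes as big-endian
print host_int                               # 249016614, i.e. 0x0ED7B126
print socket.htonl(host_int)                 # 649189134 on a little-endian host

Note that struct.unpack("!I", ...) already yields the conventional integer form of the address; the extra htonl() byte-swaps it on little-endian machines, which is evidently the form this particular proxy expects in the dest_ip header.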

