Python crawler for original Baidu images: crawl high-resolution images by resolution and decode the simple encryption.

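A few days ago my company handed me an ad hoc task: crawl images from Baidu Images. At first I thought it would be easy. Then I found that for any image with a resolution above roughly 500*438, the URL you get is a thumbnail, and that led me into a lot of pitfalls. Below I'll walk you through them, from falling in to filling them.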


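Because I need to crawl by resolution, the first step is to set a custom resolution in the search filter.
One pitfall to get out of the way first: this results page exposes no useful data, so we have to look at another page. Click an image to go to the next page!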

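Pitfall one
To save effort, I wanted to use the download URL on this page directly, since Python's requests can fetch a download link and save the file as-is.
But!!
This page is rendered by JS, so the data and parameters behind the download button are not in the response that requests gets. Browser automation with selenium can do it, but that is not practical for large-scale crawling. So this approach is out.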

Pitfall two
Anyone with some crawling experience will find the JSON data behind this page.

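There are two ways to request this page in a crawler. 1. Build the URL by string substitution (format arguments):

page_url = url.format(urllib.parse.quote(word), num * page_num, width, height)  # substitute the parameters into the URL

2. Or pass a params dict: response = requests.get(detail_url, params=params)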

params = {
    "word": word,
    "di": item['di'],
    "tn": "baiduimagedetail",
    "cs": item['cs'],
    "os": item['os'],
}
detail_url = "http://image.baidu.com/search/detail"
response = requests.get(detail_url, params=params)

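Which one you use is a matter of taste, though I recommend the second.
The parameters are the query-string parameters of the JSON request.
A few of them need to be changed:
pn = 0: the offset of the first image to return
rn = 30: the number of results per request; Baidu's JSON returns at most 30 at a time
word = '': the search keyword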
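To make the paging arithmetic concrete, here is a minimal sketch of how pn and rn combine, using only the standard library. The helper name build_page_url is made up for illustration, and the trimmed parameter set is an assumption: the names are taken from the full URL used in the script in this post, but I have not verified that the endpoint accepts a request with the other parameters left out.

```python
from urllib.parse import urlencode

BASE_URL = "https://image.baidu.com/search/acjson"


def build_page_url(word, page, width, height, rn=30):
    """Build the JSON URL for a zero-based page number (hypothetical helper)."""
    params = {
        "tn": "resultjson_com",
        "ipn": "rj",
        "queryWord": word,
        "word": word,
        "width": width,
        "height": height,
        "pn": page * rn,  # offset of the first result on this page
        "rn": rn,         # results per request, at most 30
    }
    # urlencode percent-encodes the keyword, so no manual quote() is needed
    return BASE_URL + "?" + urlencode(params)


for page in range(3):
    print(build_page_url("如意", page, 1920, 1080))
```

Page 0 starts at pn=0, page 1 at pn=30, page 2 at pn=60, which is the same num * page_num step the script uses.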

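Next comes finding the original image.
Pitfall three
The images behind these URLs are the originals when the resolution is below about 500*438, but at anything larger they are thumbnails. This pitfall is the critical one.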

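objURL behaves differently, though: sometimes it opens the original image and sometimes it leads to a page that cannot be accessed, which suggests that objURL may be the field we are looking for.
So we go and search the JS. There may be a lot to wade through, but if you look carefully you will find something.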

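Eventually I found that one piece of the JS is the encryption applied to objURL: a fixed character-substitution table.

The search may return quite a few hits, but that is no obstacle; just go through them one by one.
Then port it into your code and you are done.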
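To show what the decoding does, here is a minimal sketch that uses the same substitution table as the script in this post, written with str.translate. The name decode_objurl and the sample string are made up for illustration: the sample was encoded by hand from http://a.com/1.jpg and is not taken from a real Baidu response.

```python
# Multi-character markers stand for punctuation and must be replaced first,
# because they contain characters (z, 2, q, e, 3, d) that the table also maps.
MARKERS = {"_z2C$q": ":", "_z&e3B": ".", "AzdH3F": "/"}

# Single-character substitution: CIPHER[i] decodes to PLAIN[i].
CIPHER = "wkv1ju2it3hs4g5rq6fp7eo8dn9cm0bla"
PLAIN = "abcdefghijklmnopqrstuvw1234567890"
TABLE = str.maketrans(CIPHER, PLAIN)


def decode_objurl(encoded):
    """Decode an obfuscated objURL; plain or missing URLs pass through."""
    if encoded is None or encoded.startswith("http"):
        return encoded
    for marker, char in MARKERS.items():
        encoded = encoded.replace(marker, char)
    # Characters outside the table (x, y, z, uppercase, punctuation) are kept
    return encoded.translate(TABLE)


sample = "ippr_z2C$qAzdH3FAzdH3Fw_z&e3Bv54AzdH3F8_z&e3B3r2"
print(decode_objurl(sample))  # http://a.com/1.jpg
```

The order matters: translating single characters before replacing the markers would corrupt the markers themselves.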

import json
import os
import re
import time
import urllib.parse

import requests

# Number of results returned by each JSON request
page_num = 30
photo_dir = 'c/ph'  # output directory

# Matches a backslash that does not start \/, \u or \"
regex = re.compile(r'\\(?![/u"])')


def getDetailImage(word, width, height, save_dir):
    num = 0
    url = ("https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord={0}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=&hd=&latest=&copyright=&word={0}&s=&se=&tab=&width={2}&height={3}&face=0&istype=2&qc=&nc=1&fr=&expermode=&force=&pn={1}&rn="
           + str(page_num) + "&gsm=1e&1552975216767=")
    # Number of pages to fetch
    while num < 3:
        page_url = url.format(urllib.parse.quote(word), num * page_num, width, height)
        print(page_url)
        try:
            response = requests.get(page_url, timeout=10)
            # Double the stray backslashes so the text parses as JSON
            json_data = json.loads(regex.sub(r"\\\\", response.text))
        except (requests.RequestException, ValueError) as e:
            print('page failed:', e)
            json_data = {}
        for item in json_data.get('data', []):
            obj_url = item.get('objURL')
            ext = item.get('type')
            if not obj_url or not ext:
                continue  # skip entries without an image URL or file type
            pic_url = baidtu_uncomplie(obj_url)
            print(pic_url)
            try:
                html = requests.get(pic_url, timeout=5)
                filename = str(time.time()).replace('.', '1') + '.' + ext
                with open(os.path.join(save_dir, filename), 'wb') as f:
                    f.write(html.content)
            except (requests.RequestException, OSError):
                pass  # skip images that fail to download or save

        num = num + 1
        time.sleep(10)


# Decode an encrypted objURL
def baidtu_uncomplie(url):
    res = ''
    c = ['_z2C$q', '_z&e3B', 'AzdH3F']
    d = {'w': 'a', 'k': 'b', 'v': 'c', '1': 'd', 'j': 'e', 'u': 'f', '2': 'g', 'i': 'h', 't': 'i', '3': 'j', 'h': 'k', 's': 'l', '4': 'm', 'g': 'n', '5': 'o', 'r': 'p', 'q': 'q', '6': 'r', 'f': 's', 'p': 't', '7': 'u', 'e': 'v', 'o': 'w', '8': '1', 'd': '2', 'n': '3', '9': '4', 'c': '5', 'm': '6', '0': '7', 'b': '8', 'l': '9', 'a': '0', '_z2C$q': ':', '_z&e3B': '.', 'AzdH3F': '/'}
    if url is None or 'http' in url:
        return url
    j = url
    # Replace the multi-character markers first, then map single characters
    for m in c:
        j = j.replace(m, d[m])
    for char in j:
        res = res + d.get(char, char)  # characters not in the table are kept
    return res


if __name__ == "__main__":
    words = ['如意']
    for word in words:
        word_dir = os.path.join(photo_dir, word)
        os.makedirs(word_dir, exist_ok=True)
        widths = ['1920']
        heights = ['1080']
        for width, height in zip(widths, heights):
            word_dir2 = os.path.join(word_dir, width + 'x' + height)
            os.makedirs(word_dir2, exist_ok=True)
            getDetailImage(word, width, height, word_dir2)
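The backslash repair in the script is easy to get wrong, so here is a small offline sketch of what it does. The sample text is made up; it imitates a response containing \' (not a legal JSON escape) next to the legal escapes \/ and \u. The name loads_lenient is hypothetical.

```python
import json
import re

# A backslash that is NOT followed by /, u or " gets doubled, so the JSON
# parser reads it as a literal backslash instead of an invalid escape.
INVALID_ESCAPE = re.compile(r'\\(?![/u"])')


def loads_lenient(text):
    # In a re.sub replacement, r"\\\\" produces two literal backslashes
    return json.loads(INVALID_ESCAPE.sub(r"\\\\", text))


raw = r'''{"title": "it\'s", "url": "http:\/\/a.com", "name": "\u4e2d"}'''

try:
    json.loads(raw)
except json.JSONDecodeError as e:
    print("strict parse fails:", e)

data = loads_lenient(raw)
print(data["url"])   # http://a.com
print(data["name"])  # 中
```

This is a heuristic, not a general JSON fixer: it also doubles legal escapes such as \n or \\, so a string that contains those will decode differently or fail to parse.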

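There is plenty you can add to this, such as IP proxies and multithreading.

With a good network connection, enough IPs, and multithreading, several million images a day is achievable.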
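As one possible shape for the multithreading part, here is a minimal sketch built on the standard library's thread pool. The names download_all and fake_fetch are made up for illustration, and the fetch function is passed in so the sketch runs without touching the network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a thread pool.

    Returns a dict mapping each URL to what fetch returned,
    or to None if fetch raised an exception.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                results[url] = None  # one failed download must not stop the rest
    return results


def fake_fetch(url):
    # Stand-in for a real download so the example needs no network access
    if url.endswith("bad"):
        raise IOError("simulated failure")
    return url.encode("utf-8")


demo = download_all(["u1", "u2", "bad"], fake_fetch, max_workers=2)
print(demo)
```

In real use, fetch could be something like lambda u: requests.get(u, timeout=5, proxies=proxies).content, where proxies is a dict you supply yourself; the pool size is a guess to tune against your own bandwidth and whatever rate the server tolerates.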

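If this helped you, please give it a like and leave a comment. Thanks.

The crawler's road is not an easy one. Keep at it, and let's encourage each other!!!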
