Scraping JD.com data with Python (fetching JD product info with Python, simplified version)
Writing a crawler to grab the data you want is about the most basic thing you can do with Python, and it is also how I started learning the language. With some spare time I put together a simplified crawler for JD.com product information. It is "simplified" because it does not crawl multiple pages or every field: it only grabs the book title, price, and publisher.

# coding: gbk
import requests
import re
from lxml import etree
class getInfo():
    """
    Fetch JD.com product info (partial fields only)
    """
    def __init__(self, keyword):
        # Holds an error message, if any
        self.msg = ""
        if len(keyword) == 0:
            self.msg = "keyword is missing"
            return
        # Search keyword
        self.keyword = keyword
        # Holds the scraped results
        self.infos = []
        # Request headers passed to requests
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.'
        }
        self.run()
    # Fetch the search page and parse each result item
    def run(self):
        url = "https://search.jd.com/Search?keyword={}".format(self.keyword)
        html = requests.get(url, headers=self.headers, verify=False)
        res = etree.HTML(html.text)
        div_lists = res.xpath("//div[@class='gl-i-wrap']")
        for div in div_lists:
            try:
                # Book title
                name = self.parseContent(div.xpath("div[@class='p-name']/a/em")[0]).strip()
                # Publisher
                shopnum = re.sub(r"\s", "", self.parseContent(div.xpath("div[@class='p-shopnum']/a")[0]))
                # Price
                price = re.sub(r"\s", "", self.parseContent(div.xpath("div[@class='p-price']")[0]))
                self.infos.append({
                    "name": name,
                    "shopnum": shopnum,
                    "price": price
                })
            except Exception:
                # Skip items that do not match the expected layout
                continue
        for info in self.infos:
            print(info)
    def __str__(self):
        # __str__ must return a string, not a list
        return str(self.infos)
    # Serialize an element to markup, then strip tags, stray newlines and HTML entities
    def parseContent(self, element):
        content = etree.tostring(element, encoding="gb2312").decode("gb2312")
        content = re.sub("<.*?>", "", content)
        content = re.sub("(^\\n|\\n$|&.{4};)", "", content)
        return content
if __name__ == '__main__':
    keyword = "python"
    getInfo(keyword)
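As mentioned above, this version only fetches the first results page. If you wanted multiple pages, one rough way to extend the script is a subclass that loops over page numbers, as sketched below; note that the page query parameter and its numbering scheme are assumptions about JD's search URL, not something verified here.

# Hypothetical paging extension of the class above (a sketch, not tested):
# the "page" parameter and its numbering are assumptions about JD's search URL.
class getInfoPaged(getInfo):
    pages = 3  # how many result pages to fetch

    def run(self):
        for page in range(1, self.pages + 1):
            url = "https://search.jd.com/Search?keyword={}&page={}".format(self.keyword, page)
            html = requests.get(url, headers=self.headers, verify=False)
            res = etree.HTML(html.text)
            for div in res.xpath("//div[@class='gl-i-wrap']"):
                try:
                    self.infos.append({
                        "name": self.parseContent(div.xpath("div[@class='p-name']/a/em")[0]).strip(),
                        "shopnum": re.sub(r"\s", "", self.parseContent(div.xpath("div[@class='p-shopnum']/a")[0])),
                        "price": re.sub(r"\s", "", self.parseContent(div.xpath("div[@class='p-price']")[0])),
                    })
                except Exception:
                    continue
        for info in self.infos:
            print(info)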
A sample of the retrieved information: