Popular Python Crawler Project Design, PythonSpider: Scraping the Project Outsourcing Site TaskCity

威哥 2023-01-22 21:14:46 724


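For anyone looking for software outsourcing work, logging in to outsourcing sites every day and searching for suitable projects is tiring and tedious. A Python crawler can automatically search for, extract, and save the latest software outsourcing projects listed on such a site, which takes most of that routine work off your hands.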

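This project uses a Python crawler to scrape project listings from the outsourcing site TaskCity and save them to an Excel file. It has two main parts: web scraping and information processing.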

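Web scraping: urllib fetches the pages, and re regular expressions extract the useful fields. BeautifulSoup is used only to locate the link to the next page.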
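The extraction step can be tried offline before touching the network. The following is a minimal sketch that applies the same regular expression used in Spider.py to a small HTML fragment. The fragment, the project titles, and the `extract_projects` helper are hypothetical; the real TaskCity markup may differ.

```python
import re

# Hypothetical fragment shaped like the markup the pattern expects;
# the real TaskCity page may look different.
SAMPLE_HTML = """
<div class="project">
  <a title="Inventory app for a Shanghai retailer" href="/projects/101">Inventory app</a>
  <span>项目发布时间：2023-01-20</span><br/>
  项目预算：5000-10000<br/>
</div>
<div class="project">
  <a title="WeChat mini program" href="/projects/102">WeChat mini program</a>
  <span>项目发布时间：2023-01-18</span><br/>
  项目预算：3000<br/>
</div>
"""

# Three lazy groups: title, publish time, budget.
# re.S lets "." match newlines, so one match can span several lines.
PATTERN = re.compile(
    r'<a title="(.*?)".*?项目发布时间：(.*?)</span><br/>.*?项目预算：(.*?)<br/>',
    re.S,
)


def extract_projects(html):
    """Return a list of (title, published, budget) tuples found in html."""
    return [tuple(part.strip() for part in match) for match in PATTERN.findall(html)]


projects = extract_projects(SAMPLE_HTML)
for title, published, budget in projects:
    print(title, published, budget)
```

Because the groups are lazy (`.*?`), each one stops at the first delimiter that follows it, which keeps one project's fields from bleeding into the next.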

Information processing: openpyxl and pandas save the information to an Excel file, with URLs stored as hyperlinks.
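Storing a URL "as a hyperlink" here means writing an Excel `HYPERLINK` formula into the cell instead of the bare address. A minimal sketch of that idea follows, modeled on the `make_hyperlink` helper in Disposer.py. The quote-doubling step and the sample row are additions for illustration, not part of the original code.

```python
def make_hyperlink(url, label="网址"):
    """Build an Excel HYPERLINK formula string for url with a display label."""
    # Inside an Excel string literal, a double quote is escaped by doubling it.
    safe_url = url.replace('"', '""')
    safe_label = label.replace('"', '""')
    return '=HYPERLINK("%s", "%s")' % (safe_url, safe_label)


# Hypothetical row; the project URL is made up for the example.
rows = [
    {"名称": "Inventory app", "链接": "http://www.taskcity.com/projects/101"},
]
for row in rows:
    row["链接"] = make_hyperlink(row["链接"])

print(rows[0]["链接"])
```

When such a string is written to a cell, Excel displays the label (网址, "URL") and opens the address on click.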


Spider.py

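from urllib import request
from urllib.parse import quote
import re
import time
import random
from bs4 import BeautifulSoup
import Disposer as Dp

# Define the variables: URLs and headers
base_url = 'http://www.taskcity.com'
url = 'http://www.taskcity.com/projects?utf8=✓&keywords=上海&enter=项目&commit=搜索'
# Rebuild the request headers to pose as Firefox on a Mac;
# any browser's UA string can be used here
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:65.0) Gecko/20100101 Firefox/65.0'}
# Column names: 名称 = name, 发布时间 = publish time, 预算 = budget, 链接 = link
data = {'名称': [], '发布时间': [], '预算': [], '链接': []}

while url is not None:
    # 1. Create the request object and attach the UA information.
    #    Non-ASCII characters are percent-encoded first, because urlopen
    #    expects an ASCII-only URL.
    req = request.Request(url=quote(url, safe=':/?&=%'), headers=headers)
    # 2. Send the request and get the response object
    res = request.urlopen(req)
    # 3. Extract the response body
    html = res.read().decode('utf-8')
    # Look for the regularities in the HTML, write a regular expression,
    # and extract the fields with regex groups
    pattern = re.compile(r'<a title="(.*?)".*?项目发布时间：(.*?)</span><br/>.*?项目预算：(.*?)<br/>', re.S)
    r_list = pattern.findall(html)
    is_find = 0
    # Name of the newest project already saved in the sheet
    title_name = Dp.get_last_project_name("TaskCity")
    print(title_name)
    for info in r_list:
        if info[0] != title_name:
            data['名称'].append(info[0])
            data['发布时间'].append(info[1])
            data['预算'].append(info[2])
            data['链接'].append(url)
            print(info)
        else:
            # Reached a project that was saved on a previous run: stop here
            print("Get the last information name: " + info[0])
            is_find = 1
            break
    if is_find == 1:
        break
    # Sleep for a random 1-2 seconds after each page
    time.sleep(random.randint(1, 2))
    # Follow the rel="next" link; the loop ends when there is none
    soup = BeautifulSoup(html, "html.parser")
    url = soup.find('a', attrs={'rel': 'next'})
    if url is not None:
        url = url.get('href')
        url = base_url + str(url)

Dp.insert_data_to_sheet_head(data, "TaskCity")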
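The search URL contains non-ASCII characters (✓, 上海, 项目), and `urllib.request` generally expects an ASCII-only URL, so the address should be percent-encoded before it is requested. A minimal sketch of how the next-page address could be built with the standard library is shown below. `next_page_url` and the sample `href` values are hypothetical; `urljoin` is swapped in for the script's plain string concatenation.

```python
from urllib.parse import quote, urljoin

BASE_URL = 'http://www.taskcity.com'


def next_page_url(href, base_url=BASE_URL):
    """Turn the href of a rel="next" link into an absolute, ASCII-only URL.

    Returns None when there is no next link, which ends the crawl loop.
    """
    if href is None:
        return None
    # urljoin handles both relative and already-absolute hrefs.
    absolute = urljoin(base_url, href)
    # Percent-encode non-ASCII characters; URL delimiters and existing
    # %XX escapes are listed as safe so they are left untouched.
    return quote(absolute, safe=':/?&=%#')


print(next_page_url('/projects?keywords=上海&page=2'))
```

Keeping `%` in the `safe` set avoids double-encoding an `href` that the site has already escaped.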

Disposer.py

import openpyxl
import pandas as pd

# Workbook name: "Software outsourcing project summary"
filename = '软件外包项目汇总.xlsx'


def make_hyperlink(value):
    # Excel formula that shows the label 网址 ("URL") and links to value
    return '=HYPERLINK("%s", "%s")' % (value, "网址")


def get_last_project_name(sheet_name):
    wb = openpyxl.load_workbook(filename)
    ws = wb[sheet_name]  # replaces the deprecated get_sheet_by_name()
    # Row 1 is the header, so row 2 holds the newest saved project
    return ws.cell(row=2, column=1).value


def insert_data_to_sheet_head(dict_data, sheet_name):
    # Returns a DataFrame
    df = pd.read_excel(filename, sheet_name=sheet_name)
    df_dict = pd.DataFrame.from_dict(dict_data)
    df_dict['链接'] = df_dict['链接'].apply(lambda x: make_hyperlink(x))
    # Re-read the link column (column 4) with openpyxl so the existing
    # HYPERLINK formulas are kept as formulas
    wb = openpyxl.load_workbook(filename)
    ws = wb[sheet_name]
    links = []
    for i in range(2, ws.max_row + 1):  # 2nd arg in range() is not inclusive, so add 1
        links.append(ws.cell(i, 4).value)
    df['链接'] = links
    # New rows go above the existing ones
    df = pd.concat([df_dict, df])
    # Reassign the index labels
    df.index = [*range(df.shape[0])]
    # Note: writing to a filename recreates the workbook with this sheet only
    df.to_excel(filename, sheet_name=sheet_name, index=False)


def insert_data_to_sheet_tail(dict_data, sheet_name):
    # Returns a DataFrame
    df = pd.read_excel(filename, sheet_name=sheet_name)
    df_dict = pd.DataFrame.from_dict(dict_data)
    df_dict['链接'] = df_dict['链接'].apply(lambda x: make_hyperlink(x))
    wb = openpyxl.load_workbook(filename)
    ws = wb[sheet_name]
    links = []
    for i in range(2, ws.max_row + 1):  # 2nd arg in range() is not inclusive, so add 1
        links.append(ws.cell(i, 4).value)
    df['链接'] = links
    # New rows go below the existing ones
    df_dict = pd.concat([df, df_dict])
    # Reassign the index labels
    df_dict.index = [*range(df_dict.shape[0])]
    df_dict.to_excel(filename, sheet_name=sheet_name, index=False)
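Spider.py calls the head-insert variant, and that choice is what makes the incremental crawl work: with new rows placed on top, the first data row of the sheet is always the newest saved project, and the crawler can stop as soon as it meets that name again. Below is a minimal sketch of that interplay using plain Python lists in place of the Excel sheet. `collect_new`, `insert_head`, and the project names are hypothetical, and the sketch assumes the site lists the newest projects first.

```python
def collect_new(scraped, last_saved_name):
    """Return the scraped items that come before last_saved_name."""
    fresh = []
    for item in scraped:
        if item["名称"] == last_saved_name:
            break  # everything from here on is already in the sheet
        fresh.append(item)
    return fresh


def insert_head(existing, fresh):
    """Put new rows above the old ones so the first data row stays the newest."""
    return fresh + existing


# Rows already saved by an earlier run (newest first).
existing = [{"名称": "Project B"}, {"名称": "Project A"}]
# What the site shows today (assumed newest first).
scraped = [
    {"名称": "Project D"},
    {"名称": "Project C"},
    {"名称": "Project B"},
    {"名称": "Project A"},
]

last_saved = existing[0]["名称"]  # plays the role of cell (row 2, column 1)
merged = insert_head(existing, collect_new(scraped, last_saved))
print([row["名称"] for row in merged])
```

With tail insertion the newest saved project would sit in the last row instead, so the stop marker would have to be read from there.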

