How do you scrape the answers and comments of a Zhihu question?
We'll walk through scraping the question "Why hasn't traditional Chinese medicine been recognized by the outside world?" (为什么中医没有得到外界认可?) to discuss how to scrape the answers and comments of a Zhihu question.
Scraping web data usually involves three steps.
Step 1: Page analysis. Identify where the data you want is actually stored, and work out the pattern behind those URLs.
Step 2: Scrape the data, then clean and organize it into structured form.
Step 3: Store the data for later analysis.
Let's go through each step in detail.
1. Page Analysis
Open the page we want to scrape in the Chrome browser:
https://www.zhihu.com/question/370697253
Press F12 to open the developer tools, click the "Network" tab, then select the "XHR" filter.
First click the small clear button (the circle) on the left to empty the request list, which makes the target request easier to find, then press F5 to refresh the page, as shown in the screenshot below:
In the list, find the URL that returns the answer data; clicking it shows the JSON-formatted data in the "Preview" panel.
Now observe the URL for each page of data.
Page 1: https://www.zhihu.com/api/v4/questions/370697253/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=0&platform=desktop&sort_by=default
Page 2: https://www.zhihu.com/api/v4/questions/370697253/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=5&platform=desktop&sort_by=default
Page 3: https://www.zhihu.com/api/v4/questions/370697253/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=10&platform=desktop&sort_by=default
We find that the URLs are identical except for the value of the offset parameter: offset starts at 0 and increases by 5 per page. On the last page, the paging -> is_end property in the JSON is true (it is false on every earlier page). This makes it easy to generate the URL for any page programmatically, as in the sketch below.
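A minimal sketch of such a helper, assuming only offset varies between pages. INCLUDE_FIELDS is a placeholder name of our own, not part of the API; in practice it should hold the full URL-encoded include value copied from the Network panel:

# Placeholder for the long URL-encoded "include" parameter captured
# from the Network panel; the '...' must be replaced with that value.
INCLUDE_FIELDS = '...'

ANSWER_API = ('https://www.zhihu.com/api/v4/questions/370697253/answers'
              '?include={fields}&limit={limit}&offset={offset}'
              '&platform=desktop&sort_by=default')

def answer_page_url(page, limit=5):
    # Page numbers are 0-based; each page advances offset by `limit`.
    return ANSWER_API.format(fields=INCLUDE_FIELDS, limit=limit, offset=page * limit)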
That covers the page analysis for the answers. Next, let's analyze the comments attached to each answer.
Following the same steps as above, find the actual URLs where the comments are stored.
The URLs for each page of comment data look like this:
Page 1: https://www.zhihu.com/api/v4/answers/1014424784/root_comments?order=normal&limit=20&offset=0&status=open
Page 2: https://www.zhihu.com/api/v4/answers/1014424784/root_comments?order=normal&limit=20&offset=20&status=open
Page 3: https://www.zhihu.com/api/v4/answers/1014424784/root_comments?order=normal&limit=20&offset=40&status=open
"1014424784" is the id of this answer; each answer has a different id. The URLs above all target the comments of this one answer, and apart from the value of the offset parameter they are identical: offset starts at 0 and increases by 20 per page. As before, the paging -> is_end property in the JSON is true on the last page. The sketch below shows the resulting pagination loop.
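Putting the two observations together, the general pagination pattern is: request a page, process its data array, bump offset by the page size, and stop once paging -> is_end is true. A minimal sketch for the comment endpoint, assuming it is reachable with a plain GET plus a browser-like User-Agent (the full code below makes the same assumption):

import requests

def iter_root_comments(answer_id, limit=20):
    # Yield every root comment of an answer, page by page.
    headers = {'User-Agent': 'Mozilla/5.0'}  # browser-like UA to avoid being blocked
    offset = 0
    while True:
        url = ('https://www.zhihu.com/api/v4/answers/{0}/root_comments'
               '?order=normal&limit={1}&offset={2}&status=open').format(answer_id, limit, offset)
        page = requests.get(url, headers=headers, timeout=(3, 7)).json()
        for item in page['data']:
            yield item
        if page['paging']['is_end']:  # true only on the last page
            break
        offset += limit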
2. Commonly Used Libraries
(1) requests
requests sends HTTP requests and returns the response data.
Official documentation:
https://docs.python-requests.org/zh_CN/latest/user/quickstart.html
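A minimal example of how it is used here, fetching the first page of comments from the URL above (the short User-Agent value is just a browser-like placeholder):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get('https://www.zhihu.com/api/v4/answers/1014424784/root_comments'
                   '?order=normal&limit=20&offset=0&status=open',
                   headers=headers, timeout=(3, 7))
print(res.status_code)  # 200 on success
print(res.text[:200])   # the start of the raw JSON response body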
(2) json
JSON is a lightweight data-interchange format, a text format completely independent of any programming language. Back-end applications typically wrap response data in JSON before returning it.
Official documentation:
https://docs.python.org/zh-cn/3.7/library/json.html
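The two directions used in practice are json.loads (JSON text to Python objects) and json.dumps (Python objects to JSON text). A small example:

import json

raw = '{"paging": {"is_end": false}, "data": []}'
obj = json.loads(raw)            # JSON text -> Python dict
print(obj['paging']['is_end'])   # False
print(json.dumps(obj, ensure_ascii=False))  # Python dict -> JSON text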
(3) lxml
lxml is an HTML/XML parser; its main job is parsing HTML/XML and extracting data from it.
Official documentation:
https://lxml.de/index.html
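In this article lxml is used to strip the HTML tags from answer and comment bodies, which Zhihu returns as HTML fragments. For example:

from lxml import etree

fragment = '<p>first paragraph</p><p>second <b>paragraph</b></p>'
tree = etree.HTML(fragment)                # parse the HTML fragment
text = ''.join(tree.xpath('//p//text()'))  # collect all text nested under <p>
print(text)  # 'first paragraphsecond paragraph'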
Since space is limited here, these crawler-related libraries will each get a more detailed write-up in later posts.
3. Complete Code
The GetAnswers function scrapes the answers to the question.
After structuring, each answer record has these fields: the answer's ID (answer_id), the author's name (author), the publish time (created_time), and the answer body (content).
The GetComments function scrapes the comments of each answer.
After structuring, each comment record has these fields: the comment's ID (answer_id_comment_id, i.e. the answer id and the comment id joined with an underscore; a child comment additionally appends its own id), the author's name (author), the publish time (created_time), and the comment body (content).
All of this data is written to the file "知乎评论.csv". Note that when this file is opened in Excel, the Chinese text appears garbled; for a fix, see the earlier post "How to fix the 'gbk' codec can't encode error when writing CSV in Python 3?" (a simpler alternative is also shown after the code below).

import requests
import json
import time
import csv
from lxml import etree
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36',
}
# The output CSV; each answer and each comment becomes one row.
csvfile = open('知乎评论.csv', 'w', newline='', encoding='utf-8')
writer = csv.writer(csvfile)
writer.writerow(['id', 'created_time', 'author', 'content'])
def GetAnswers():
    i = 0  # offset: starts at 0 and advances by 5 per page
    while True:
        url = 'https://www.zhihu.com/api/v4/questions/370697253/answers' \
              '?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%' \
              '2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%' \
              '2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%' \
              '2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%' \
              '2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%' \
              '2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%' \
              '2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={0}&platform=desktop&' \
              'sort_by=default'.format(i)
        # Retry the request until it succeeds (timeouts and connection errors are retried).
        state = 1
        while state:
            try:
                res = requests.get(url, headers=headers, timeout=(3, 7))
                state = 0
            except requests.RequestException:
                continue
        res.encoding = 'utf-8'
        jsonAnswer = json.loads(res.text)
        is_end = jsonAnswer['paging']['is_end']  # true on the last page
        for data in jsonAnswer['data']:
            l = list()
            answer_id = str(data['id'])
            l.append(answer_id)
            l.append(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(data['created_time'])))
            l.append(data['author']['name'])
            # Strip HTML tags: keep only the text nested under <p> elements.
            l.append(''.join(etree.HTML(data['content']).xpath('//p//text()')))
            writer.writerow(l)
            print(l)
            # Fetch comments only when commenting is open and comments exist.
            if not data['admin_closed_comment'] and data['can_comment']['status'] and data['comment_count'] > 0:
                GetComments(answer_id)
        i += 5
        print('Finished page {0}'.format(int(i / 5)))
        if is_end:
            break
        time.sleep(1)  # be polite: pause between pages
def GetComments(answer_id):
    j = 0  # offset: starts at 0 and advances by 20 per page
    while True:
        url = 'https://www.zhihu.com/api/v4/answers/{0}/root_comments?order=normal&limit=20&offset={1}&status=open'.format(
            answer_id, j)
        # Retry the request until it succeeds.
        state = 1
        while state:
            try:
                res = requests.get(url, headers=headers, timeout=(3, 7))
                state = 0
            except requests.RequestException:
                continue
        res.encoding = 'utf-8'
        jsonComment = json.loads(res.text)
        is_end = jsonComment['paging']['is_end']  # true on the last page
        for data in jsonComment['data']:
            l = list()
            # Comment IDs are namespaced by the answer: answer_id_comment_id.
            comment_id = str(answer_id) + "_" + str(data['id'])
            l.append(comment_id)
            l.append(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(data['created_time'])))
            l.append(data['author']['member']['name'])
            l.append(''.join(etree.HTML(data['content']).xpath('//p//text()')))
            writer.writerow(l)
            print(l)
            # Replies to a root comment arrive nested under child_comments.
            for child_comments in data['child_comments']:
                l.clear()
                l.append(str(comment_id) + "_" + str(child_comments['id']))
                l.append(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(child_comments['created_time'])))
                l.append(child_comments['author']['member']['name'])
                l.append(''.join(etree.HTML(child_comments['content']).xpath('//p//text()')))
                writer.writerow(l)
                print(l)
        j += 20
        if is_end:
            break
        time.sleep(1)  # be polite: pause between pages
GetAnswers()
csvfile.close()
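One more note on the Excel garbling mentioned above: besides the approach in the earlier post, a simple alternative is to write the file with the 'utf-8-sig' encoding, which prepends a byte-order mark that Excel uses to detect UTF-8. Only the open call changes:

# Write a BOM so Excel recognizes the file as UTF-8.
csvfile = open('知乎评论.csv', 'w', newline='', encoding='utf-8-sig')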
4. Summary