Scraping Product Reviews


1 Problem description

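On most online shopping sites, product reviews are likewise displayed as a list, usually spread over multiple pages. In addition, review data is mostly loaded dynamically: it is only fetched and displayed after user interaction, such as clicking the reviews button. Parsing the static HTML of the product page therefore does not yield the reviews (in R, the rvest package hits the same limitation and RCurl is used instead). In Python, the request that actually returns the review data has to be sent directly, which is done here with urllib.request.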

We first extract the reviews from a single page, and then use a loop to extract multiple pages. Only text is extracted, not image data.

2 Extracting the reviews from a single page

2.1 Import the required modules

# -*- coding: utf-8 -*-
from urllib.request import Request, urlopen
import json

import pandas as pd  # not used in the code below
from bs4 import BeautifulSoup

# The original importlib.reload(sys) call is a Python 2 leftover and is not needed in Python 3.

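2.2 Define the URL of the page

Note: the address of the reviews is not the address of the product page. It has to be found as follows.

Step 1: Open the browser's "Inspect Element" tool and switch to the Network tab.

Step 2: Refresh the whole product page.

Step 3: Click the product reviews button on the page.

Step 4: In the "Inspect Element" panel, click the entry whose name begins with "productPageComments..." (this step differs between shopping sites and takes some trial and error). The Request URL shown on the right is the network address of the product reviews.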

# Define the URL
url='https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100031406046&score=0&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1'
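
To see what the Request URL is made of, its query string can be taken apart with the standard library. This is only a sketch: what each parameter means (for example that `page` is the page index and `pageSize` the number of reviews per page) is inferred from the names, not from any documentation.

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://club.jd.com/comment/productPageComments.action'
       '?callback=fetchJSON_comment98&productId=100031406046&score=0'
       '&sortType=5&page=0&pageSize=10&isShadowSku=0&fold=1')

parts = urlsplit(url)
# parse_qs returns a list of values per key; each key occurs once here
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

endpoint = f"{parts.scheme}://{parts.netloc}{parts.path}"
print(endpoint)
for key, value in params.items():
    print(f"{key} = {value}")
```

Only the `page` value needs to change when moving from one page of reviews to the next.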

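Note: the URL contains "fetchJSON" (in its callback parameter), so the request does not return an HTML page. It returns JSON data wrapped in a call to that callback, a format known as JSONP (JSON is a text format for storing data).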

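2.3 Fetch and parse the page content

# header holds request headers that mimic a browser, to get past anti-scraping checks; if the target site has no such checks, the header can be omitted.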
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36 Edg/92.0.902.55'}
req = Request(url, headers=header)
html = urlopen(req).read().decode('gbk')

# Parse the fetched content
soup = BeautifulSoup(html, 'lxml')
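
The call above assumes that the response is GBK-encoded. If that assumption ever fails, `decode('gbk')` may raise an error or silently produce garbled text. A minimal sketch of a more defensive decoder follows; `decode_body` is a hypothetical helper, not part of any library. It tries strict UTF-8 first, because GBK bytes are rarely valid UTF-8, while UTF-8 bytes often decode as GBK without any error.

```python
def decode_body(raw, encodings=('utf-8', 'gbk')):
    """Decode raw bytes with the first encoding that accepts them."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD
    return raw.decode(encodings[-1], errors='replace')


gbk_bytes = '好评'.encode('gbk')
utf8_bytes = '好评'.encode('utf-8')
print(decode_body(gbk_bytes) == decode_body(utf8_bytes))
```

With such a helper, the download line would become `html = decode_body(urlopen(req).read())`.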

2.4 Get the review data

comment_str = soup.get_text()

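# After extraction, the surrounding "fetchJSON_comment98(" and ");" must be removed so that standard JSON remains.
# Slicing between the first "(" and the last ")" is used here because str.lstrip/rstrip remove sets of characters, not a literal prefix or suffix.
start = comment_str.index('(') + 1
end = comment_str.rindex(')')
comment_dict = json.loads(comment_str[start:end])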
comments=comment_dict["comments"]
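
The unwrapping step can also be packaged as a small helper that works for any callback name and fails loudly when the response is not JSONP at all (for instance an error page). This is a sketch under assumptions: `unwrap_jsonp` is a hypothetical name, and the sample string only imitates the shape of the real response.

```python
import json
import re

# callbackName( ...payload... ) with an optional trailing semicolon
JSONP_PATTERN = re.compile(r'\s*[\w.$]+\s*\((.*)\)\s*;?\s*', re.DOTALL)


def unwrap_jsonp(text):
    """Return the parsed JSON payload of a JSONP string."""
    match = JSONP_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError('response does not look like JSONP')
    return json.loads(match.group(1))


sample = 'fetchJSON_comment98({"comments": [{"content": "ok (fine)", "score": 5}]});'
payload = unwrap_jsonp(sample)
print(payload['comments'][0]['content'])
```

Because the payload group is greedy, parentheses inside the review text do not cut the JSON short.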

2.5 Extract the individual review contents

Here only the review text and the rating are output.

contents=[]
scores=[]
for comment in comments:
    # Fields used below: the review text (line breaks removed) and the rating
    content = comment['content'].replace('\n', '').replace('\r', '')
    score = comment['score']

    # Other fields available on each record (read here for reference, not used below)
    nickname = comment['nickname']
    g_uid = comment['guid']
    creationTime = comment['creationTime']
    is_Top = comment['isTop']
    plus = comment['plusAvailable']
    referenceTime = comment['referenceTime']
    days = comment['days']
    userClient = comment['userClient']

    contents.append(content)
    scores.append(score)

print(contents)
print(scores)
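
Instead of two parallel lists, the text and rating can be kept together as rows and written out as CSV. The sketch below uses made-up sample records in place of the real `comments` list, and `to_rows` is a hypothetical helper; it uses `dict.get` so that a record missing a key does not stop the run.

```python
import csv
import io

# Made-up records that only imitate the shape of the real data
sample_comments = [
    {"content": "Fast delivery,\nworks well", "score": 5},
    {"content": "Average", "score": 3, "nickname": "j***x"},
]


def to_rows(comments):
    """Keep only the review text (without line breaks) and the rating."""
    rows = []
    for comment in comments:
        text = comment.get("content", "").replace("\n", "").replace("\r", "")
        rows.append({"content": text, "score": comment.get("score")})
    return rows


rows = to_rows(sample_comments)

# Written to an in-memory buffer here; open(..., newline='') works the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["content", "score"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```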

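3 Extracting multiple pages of reviews with a loop

The product page shows more than 50,000 reviews in total; here we only demonstrate extracting the first 21 pages (page=0 through page=20).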

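In the review URL obtained earlier, "page=0" requests the first page of reviews. Starting from that URL, we can construct "page=1", "page=2", and so on to get the second page, the third page, etc.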
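
One way to construct these per-page URLs without string concatenation is `urllib.parse.urlencode`. This is a sketch: `build_url` is a hypothetical helper, and the parameter values are simply copied from the Request URL found earlier.

```python
from urllib.parse import urlencode

BASE = 'https://club.jd.com/comment/productPageComments.action'


def build_url(page, product_id='100031406046', page_size=10):
    """Return the review URL for a zero-based page index."""
    query = urlencode({
        'callback': 'fetchJSON_comment98',
        'productId': product_id,
        'score': 0,
        'sortType': 5,
        'page': page,
        'pageSize': page_size,
        'isShadowSku': 0,
        'fold': 1,
    })
    return f'{BASE}?{query}'


print(build_url(0))
print(build_url(1))
```

`build_url(0)` reproduces the original URL exactly, and the other pages differ only in the `page` value.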

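# Use a loop to extract reviews from multiple pages
goodscomment = []
goodsscore = []
# The URL is split around the page number; both halves stay the same for every page
url_half1 = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=100031406046&score=0&sortType=5&page='
url_half2 = '&pageSize=10&isShadowSku=0&fold=1'
for i in range(0, 21):  # page=0 through page=20, i.e. 21 pages
    url = url_half1 + str(i) + url_half2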

    req = Request(url, headers=header)
    html = urlopen(req).read().decode('gbk')
    soup = BeautifulSoup(html, 'lxml')
    comment_str = soup.get_text()

    # Keep only the JSON between the first "(" and the last ")" of the JSONP wrapper
    comment_dict = json.loads(comment_str[comment_str.index('(') + 1:comment_str.rindex(')')])
    comments = comment_dict["comments"]

    for comment in comments:
        nickname = comment['nickname']
        content = comment['content'].replace('\n', '').replace('\r', '')
        g_uid = comment['guid']
        creationTime = comment['creationTime']
        is_Top = comment['isTop']
        plus = comment['plusAvailable']
        referenceTime = comment['referenceTime']
        score = comment['score']
        days = comment['days']
        userClient = comment['userClient']

        goodscomment.append(content)
        goodsscore.append(score)

print(len(goodscomment))
print(goodsscore)
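
The loop above can be sketched as a reusable function that stops early when a page comes back empty and pauses between requests. Everything here is an assumption for illustration: `collect_comments` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real download-and-parse step so the example runs without network access.

```python
import time


def collect_comments(fetch_page, max_pages, delay=0.0):
    """Gather review texts and ratings from up to max_pages pages.

    fetch_page(page) must return the list of comment dicts for that page.
    """
    contents, scores = [], []
    for page in range(max_pages):
        comments = fetch_page(page)
        if not comments:
            break  # no more reviews, stop requesting further pages
        for comment in comments:
            contents.append(comment['content'].replace('\n', '').replace('\r', ''))
            scores.append(comment['score'])
        if delay:
            time.sleep(delay)  # be polite to the server between requests
    return contents, scores


def fake_fetch(page):
    # Stand-in for the real request: three pages of ten reviews, then nothing
    if page >= 3:
        return []
    return [{'content': f'comment {page}-{i}', 'score': 5} for i in range(10)]


contents, scores = collect_comments(fake_fetch, max_pages=21)
print(len(contents))
```

In real use, `fetch_page` would wrap the request, decoding, and JSON parsing steps shown earlier, and a non-zero `delay` would space out the requests.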

As the output shows, 210 reviews were extracted in total (21 pages with 10 reviews each), and nearly all of the ratings are 5.
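
Rather than reading the ratings off a long printed list, they can be tallied with `collections.Counter`. The sketch below uses made-up sample ratings, not the scraped results, and `summarize_scores` is a hypothetical helper.

```python
from collections import Counter


def summarize_scores(scores):
    """Return the count and the share of each rating, highest rating first."""
    counts = Counter(scores)
    total = len(scores)
    if total == 0:
        return counts, {}
    share = {score: counts[score] / total for score in sorted(counts, reverse=True)}
    return counts, share


# Made-up ratings for illustration only
sample_scores = [5, 5, 5, 4, 5, 3, 5, 5, 5, 5]
counts, share = summarize_scores(sample_scores)
for score, fraction in share.items():
    print(f"{score} stars: {counts[score]} reviews ({fraction:.0%})")
```

Passing `goodsscore` in place of `sample_scores` would give the same breakdown for the scraped reviews.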

To get the sample data and source code, follow the WeChat official account and reply with `Python_dt31`.