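From JD.com Products to Review Analysis: An End-to-End Python Data Pipeline

In e-commerce, product reviews are one of the key inputs to a shopper's purchase decision. By analyzing product information and user reviews from platforms such as JD.com, we can track market trends and supply data to support product improvement, brand marketing, and customer service. This article walks through how to scrape product and review data from JD.com, clean and analyze it, and extract useful insights.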

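I. Environment Setup

Before scraping any data, make sure you have a working Python environment with the libraries the scraper needs. The basic setup is as follows.

1.1 Installing the Required Libraries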


pip install requests beautifulsoup4 pandas matplotlib seaborn jieba

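requests: sends HTTP requests and fetches page data.
beautifulsoup4: parses HTML.
pandas: handles data processing.
matplotlib and seaborn: handle data visualization.
jieba: performs Chinese word segmentation.

II. Scraping Product Data

First, we need to scrape product information from JD.com, such as the product name, price, and sales.

2.1 Fetching JD.com Product Information

JD.com product pages load their data dynamically via AJAX, so we need to imitate a browser request to obtain the product information. In practice, this usually means sending browser-like headers so that anti-scraping mechanisms do not block the request.

Example code: scraping a product's name, price, and sales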


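import requests
from bs4 import BeautifulSoup
import pandas as pd

# Product page URL (for example, the detail page of a single product)
url = 'https://item.jd.com/100012043978.html'

# Request headers that mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Send the request; fail early on a timeout or an HTTP error status
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Return the stripped text of a tag, or None when the tag was not found
def text_or_none(tag):
    return tag.get_text(strip=True) if tag is not None else None

# Extract the product name, price, and sales.
# These selectors depend on the page layout; a field that is loaded
# dynamically may be absent from the static HTML and come back as None.
product_name = text_or_none(soup.find('div', class_='sku-name'))
price = text_or_none(soup.find('span', class_='price'))
parameter_box = soup.find('div', class_='product-parameter')
parameter_items = parameter_box.find_all('li') if parameter_box is not None else []
sales = text_or_none(parameter_items[2]) if len(parameter_items) > 2 else None

# Print the results
print(f'Product name: {product_name}')
print(f'Price: {price}')
print(f'Sales: {sales}')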

In the code above, BeautifulSoup parses the HTML, and we extract the product's name, price, and sales from it.
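
The price extracted from a page is text and may carry a currency symbol or thousands separators, which gets in the way of any later numeric analysis. The following is a minimal sketch of one way to normalize it, assuming prices look like '￥1,299.00'. The helper name parse_price is hypothetical and not part of the original code.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[float]:
    """Pull the first decimal number out of a price string.

    Hypothetical helper: assumes ',' is only used as a thousands separator.
    Returns None when the text contains no number at all.
    """
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None


print(parse_price('￥1,299.00'))
print(parse_price('暂无报价'))
```

Returning None rather than raising keeps a single unparseable price from stopping a longer scraping run; the missing value can be handled later in pandas.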

2.2 Storing Product Data

The scraped data can be saved as a CSV file for later processing.


# Save the product information to a CSV file
data = {'product_name': [product_name], 'price': [price], 'sales': [sales]}
df = pd.DataFrame(data)
df.to_csv('jd_product_info.csv', index=False)

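III. Scraping Review Data

Beyond product information, review data matters just as much. On a JD.com product page, reviews are usually grouped under tabs such as positive, neutral, and negative. We need to scrape these reviews and analyze them.

3.1 Getting the Review Page URL

JD.com typically loads reviews through a separate API endpoint. We can build the review API URL from the product page URL, since both share the same product ID. The following is an example of such an endpoint: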


# Review API URL (can be found by inspecting the product page's network requests)
comment_url = 'https://sclub.jd.com/comment/productPageComments.action?productId=100012043978&score=0&sortType=5&page=0&pageSize=10'
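
To fetch reviews for other products or other pages, the URL can be assembled from its parts instead of being hard-coded. The sketch below rebuilds the example URL with urllib.parse.urlencode. The function name build_comment_url is hypothetical, and the meaning of each query parameter is inferred only from the example URL, not from any official documentation.

```python
from urllib.parse import urlencode

# Base endpoint taken from the example URL in this article
COMMENT_API = 'https://sclub.jd.com/comment/productPageComments.action'


def build_comment_url(product_id: str, page: int = 0, page_size: int = 10,
                      score: int = 0, sort_type: int = 5) -> str:
    """Assemble a review API URL with the same parameter order as the example."""
    query = urlencode({
        'productId': product_id,
        'score': score,
        'sortType': sort_type,
        'page': page,
        'pageSize': page_size,
    })
    return f'{COMMENT_API}?{query}'


print(build_comment_url('100012043978'))
print(build_comment_url('100012043978', page=1))
```

With the default arguments, the function reproduces the example URL exactly, so it can replace the hard-coded string without changing behavior.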

3.2 Scraping the Review Data

We use requests to send the request and fetch the review data. The response is usually JSON, so it needs to be parsed.


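import requests

# Review API endpoint
comment_url = 'https://sclub.jd.com/comment/productPageComments.action?productId=100012043978&score=0&sortType=5&page=0&pageSize=10'

# Send the request (reusing the headers defined earlier) and fetch the review data
response = requests.get(comment_url, headers=headers, timeout=10)
response.raise_for_status()
comments_data = response.json()

# Extract the review text; fall back to an empty list if the 'comments' key is absent
comments = []
for comment in comments_data.get('comments', []):
    content = comment.get('content')
    if content:
        comments.append(content)

# Show the first 10 reviews
print(comments[:10])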

In the code above, response.json() converts the returned review data into a dictionary, from which we extract the review text.
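
The example URL requests only page 0 with 10 reviews, so collecting more means looping over the page parameter. The following is a rough sketch of that loop. To keep it runnable offline, the HTTP call is passed in as a function; fake_fetch is a stand-in with made-up payloads, and collect_comments is a hypothetical helper. It assumes each page has the same 'comments' / 'content' structure used above.

```python
from typing import Callable, Dict, List


def collect_comments(fetch_page: Callable[[int], Dict], max_pages: int = 5) -> List[str]:
    """Gather review text page by page, stopping at the first empty page."""
    collected: List[str] = []
    for page in range(max_pages):
        payload = fetch_page(page)
        page_comments = payload.get('comments') or []
        if not page_comments:
            break  # no more reviews to read
        collected.extend(c['content'] for c in page_comments if c.get('content'))
    return collected


def fake_fetch(page: int) -> Dict:
    """Stand-in for a real request; returns made-up payloads for two pages."""
    pages = [
        {'comments': [{'content': 'A'}, {'content': 'B'}]},
        {'comments': [{'content': 'C'}]},
    ]
    return pages[page] if page < len(pages) else {'comments': []}


print(collect_comments(fake_fetch))
```

In real use, fetch_page would wrap a requests.get call for the given page number, ideally with a short pause between requests.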

3.3 Storing the Review Data

We save the scraped reviews to a CSV file for later analysis.


# Save the review data to a CSV file
comment_df = pd.DataFrame({'comment': comments})
comment_df.to_csv('jd_product_comments.csv', index=False)

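IV. Analyzing the Review Data

Once the review data is collected, we can clean and analyze it. For example, a word cloud highlights the keywords in user reviews, and sentiment analysis classifies each review as positive, negative, or neutral.

4.1 Data Cleaning

Review text usually contains irrelevant symbols and stop words that need to be removed. We use jieba for Chinese word segmentation and then filter out the stop words.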


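import jieba

# Stop words (a small hard-coded set for illustration)
stopwords = {'的', '了', '在', '是', '我', '他', '她', '它'}

# Segment a review into words, dropping stop words and whitespace-only tokens
def clean_comment(comment):
    words = jieba.cut(comment)
    return [word for word in words if word.strip() and word not in stopwords]

# Clean the review data
cleaned_comments = [clean_comment(comment) for comment in comments]

# Print the first 5 cleaned reviews
print(cleaned_comments[:5])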
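
The stop-word filter removes common words but leaves punctuation and other symbols in the token list. One possible approach, sketched below, is to strip those symbols with a regular expression before segmentation. The helper name strip_symbols is hypothetical, and the character ranges (common CJK characters, ASCII letters, and digits) are an assumption about what is worth keeping.

```python
import re

# Anything that is not a common CJK character, an ASCII letter, or a digit
_SYMBOLS = re.compile(r'[^\u4e00-\u9fffA-Za-z0-9]+')


def strip_symbols(text: str) -> str:
    """Replace each run of punctuation, emoji, or whitespace with one space."""
    return _SYMBOLS.sub(' ', text).strip()


print(strip_symbols('质量很好！！！物流快~~ 5星'))
```

The result can then be passed to the segmentation step in place of the raw review text.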

4.2 Word Cloud Analysis

A word cloud generated with wordcloud helps show what users focus on in a product.


pip install wordcloud


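from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join all cleaned reviews into one space-separated string
all_comments = ' '.join([' '.join(comment) for comment in cleaned_comments])

# Generate the word cloud; font_path must point to a font that contains
# Chinese glyphs, otherwise the characters render as empty boxes
wordcloud = WordCloud(font_path='/path/to/your/font.ttf', width=800, height=400).generate(all_comments)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()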
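
A word cloud is easy to read but hard to quote numbers from. As a complement, the same segmented reviews can be counted directly. The sketch below uses collections.Counter; the function name top_keywords is hypothetical, and the sample tokens are made up purely to show the output shape.

```python
from collections import Counter
from typing import List, Tuple


def top_keywords(tokenized_comments: List[List[str]], n: int = 10) -> List[Tuple[str, int]]:
    """Count every token across all reviews and return the n most frequent."""
    counts = Counter(token for tokens in tokenized_comments for token in tokens)
    return counts.most_common(n)


# Made-up tokens standing in for the segmented reviews
sample = [['质量', '好'], ['物流', '快', '质量'], ['质量', '物流']]
print(top_keywords(sample, n=2))
```

Passing the list of segmented reviews produced in section 4.1 instead of the sample gives the keyword counts for the scraped data.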

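4.3 Sentiment Analysis

A sentiment analysis library such as SnowNLP can estimate the sentiment of each review, indicating whether it is positive or negative.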


pip install snownlp


from snownlp import SnowNLP

# Sentiment analysis function
def sentiment_analysis(comment):
    s = SnowNLP(comment)
    return s.sentiments  # Score between 0 and 1; closer to 1 is more positive

# Run sentiment analysis on the first 5 reviews
sentiments = [sentiment_analysis(comment) for comment in comments[:5]]
print(sentiments)

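The output of the sentiment analysis is a value between 0 and 1: values close to 1 indicate positive sentiment, and values close to 0 indicate negative sentiment.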
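
To turn the scores into the positive, negative, and neutral classes mentioned earlier, the 0 to 1 range has to be cut at some thresholds. The sketch below uses 0.4 and 0.6 purely as illustrative cut-offs. These values are an assumption rather than anything prescribed by SnowNLP, and label_sentiment is a hypothetical helper.

```python
def label_sentiment(score: float, low: float = 0.4, high: float = 0.6) -> str:
    """Map a 0-1 sentiment score to a coarse label using two cut-offs."""
    if score >= high:
        return 'positive'
    if score <= low:
        return 'negative'
    return 'neutral'


# Made-up scores, used only to show the mapping
for score in (0.9, 0.5, 0.1):
    print(score, label_sentiment(score))
```

The cut-offs are worth tuning against a handful of manually labeled reviews, since a general-purpose model may be skewed for a given product category.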

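V. Relating Products to Reviews

With product information and review data analyzed, we can relate the two. For example, we can examine how price and sales relate to user reviews and check whether higher-priced products receive more positive reviews.

5.1 Merging Product Information and Review Data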


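# Read the product information
product_df = pd.read_csv('jd_product_info.csv')

# Merge the product information with the review data.
# This assumes the file holds a single product, so all reviews are
# simply joined into one string and attached to that row.
product_df['comments'] = ' '.join(comments)

# Save the merged data
product_df.to_csv('jd_product_with_comments.csv', index=False)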

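5.2 Price Versus Sentiment

We can draw a scatter plot to see how price relates to review sentiment scores. Each point represents one product, so a trend only becomes visible once several products have been collected.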


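# Relationship between price and sentiment score.
# product_df holds a single product, so assigning one score per review would
# raise a length-mismatch error; use the mean score across reviews instead.
scores = [sentiment_analysis(comment) for comment in comments]
product_df['sentiment_score'] = sum(scores) / len(scores) if scores else float('nan')

# The scraped price is text and may carry a currency symbol, so coerce it to a number
product_df['price'] = pd.to_numeric(
    product_df['price'].astype(str).str.replace(r'[^\d.]', '', regex=True), errors='coerce')

plt.figure(figsize=(10, 6))
plt.scatter(product_df['price'], product_df['sentiment_score'])
plt.title('Price vs Sentiment Score')
plt.xlabel('Price')
plt.ylabel('Sentiment Score')
plt.show()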
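
Once prices and mean sentiment scores are available for several products, the visual impression from the scatter plot can be backed by a single number. The sketch below computes the Pearson correlation coefficient by hand. The function name pearson is hypothetical, and the figures in the usage lines are made-up placeholders rather than scraped results.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError('need two sequences of equal length with at least 2 values')
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError('correlation is undefined when one variable is constant')
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only
prices = [99.0, 199.0, 299.0, 399.0]
mean_scores = [0.62, 0.70, 0.68, 0.81]
print(round(pearson(prices, mean_scores), 3))
```

A coefficient near 1 or -1 suggests a strong linear relationship, but with only a few products it should be read as a rough indication, not as evidence.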