Finish
pengshiqi committed May 16, 2018
1 parent a011564 commit 9185fd6
Showing 15 changed files with 125 additions and 76 deletions.
124 changes: 97 additions & 27 deletions README.md
@@ -2,6 +2,25 @@

This project crawls and analyses data from the Dongqiudi (懂球帝) app.



**Tech stack:**

First, use the **Charles** proxy tool to capture the Dongqiudi app's traffic and identify its APIs:

1. Team info APIs: http://api.dongqiudi.com/catalogs and http://api.dongqiudi.com/catalog/channels/{id}
2. API for fetching article IDs: http://api.dongqiudi.com/app/tabs/iphone/1.json
3. API for fetching the users who commented: http://api.dongqiudi.com/v2/article/{article_id}/comment?sort=down&version=600
4. API for fetching user info: https://api.dongqiudi.com/users/profile/{}
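
As a small illustration of how the templated endpoints above can be filled in, here is a sketch. The URL templates are copied from the list; the helper names (`comment_url`, `profile_url`, `channel_url`) are hypothetical and do not come from the repository.

```python
# URL templates copied from the API list; the helper names are illustrative only.
COMMENT_URL = "http://api.dongqiudi.com/v2/article/{article_id}/comment?sort=down&version=600"
PROFILE_URL = "https://api.dongqiudi.com/users/profile/{}"
CHANNEL_URL = "http://api.dongqiudi.com/catalog/channels/{id}"


def comment_url(article_id):
    """Return the comment-list URL for one article."""
    return COMMENT_URL.format(article_id=article_id)


def profile_url(user_id):
    """Return the profile URL for one user."""
    return PROFILE_URL.format(user_id)


def channel_url(channel_id):
    """Return the channel URL for one catalog entry."""
    return CHANNEL_URL.format(id=channel_id)
```

Each resulting URL could then be passed to `requests.get`, presumably with the headers captured in Charles.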

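First fetch the roughly 100,000 articles on the most recent 5000 pages, then collect the users who commented on those articles, and finally crawl each user's profile data.

The **Requests** library does the crawling. Collecting commenter IDs takes a long time and can only run serially, not in parallel, so it needs a checkpoint-and-resume mechanism. Once the roughly 600,000 user IDs have been collected, the users' profile data can be fetched in parallel.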
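
A checkpoint-and-resume loop for the serial commenter crawl might look like the sketch below. This is not the repository's implementation: `crawl_with_checkpoint`, the JSON checkpoint file, and the `fetch` callable (standing in for the real request function) are all assumptions made for illustration.

```python
import json
import os
from typing import Callable, Dict, Iterable, List


def crawl_with_checkpoint(
    article_ids: Iterable[int],
    fetch: Callable[[int], List[str]],
    checkpoint_path: str,
) -> Dict[str, List[str]]:
    """Fetch commenter IDs for each article serially, skipping finished ones."""
    # Load earlier progress if a checkpoint file already exists.
    done: Dict[str, List[str]] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "r", encoding="utf-8") as f:
            done = json.load(f)

    for article_id in article_ids:
        key = str(article_id)  # JSON object keys are always strings
        if key in done:
            continue  # already crawled in a previous run
        done[key] = fetch(article_id)
        # Write to a temp file and rename it, so an interrupted run
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
    return done
```

Rewriting the whole checkpoint after every article keeps the sketch short; for around 100,000 articles it would likely be better to save every N articles or to commit progress to the sqlite3 database instead.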

All data is stored in a local sqlite3 database.

Data visualization uses PyEcharts; word segmentation uses **[jieba](https://github.com/fxsjy/jieba)**.

----

### 1. Data preparation
@@ -16,26 +35,16 @@

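2. Recent article list.

- ~~Extracted **2000** articles from the most recent 20 pages.~~

- Extracted **99889** articles from the most recent 5000 pages.

3. User ID list.

- ~~**249685** commenting users (not deduplicated) under the 2000 articles, stored in the article_comment_user table.~~

- (Approximately) **5891037** commenting users (not deduplicated) under the 99889 articles, stored in the article_comment_user table.

4. User info.

- ~~**103300** users in total after deduplication, stored in the user table.~~

- **610803** users in total after deduplication, stored in the user table. (Deduplication removed about 90%...)

5. Comment content.

- todo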
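
The deduplication step in item 4 can be sketched as a set union over the per-article commenter lists. The row layout below (third column holding a Python-literal collection of user IDs) is an assumption based on how crawl.py indexes `x[2]`; `unique_user_ids` is a hypothetical helper, and `ast.literal_eval` is swapped in here as a safer stand-in for `eval`.

```python
import ast
from typing import Iterable, Set, Tuple


def unique_user_ids(rows: Iterable[Tuple]) -> Set[str]:
    """Collect the distinct user IDs across all article_comment_user rows."""
    user_ids: Set[str] = set()
    for row in rows:
        # Assumed layout: row[2] is the text form of a list or set of IDs.
        ids = ast.literal_eval(row[2])
        user_ids.update(str(i) for i in ids)
    return user_ids
```

Called on the result of `cursor.fetchall()` for `article_comment_user`, this would return the set whose size corresponds to the deduplicated user count.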

----

### 2. Data analysis results
@@ -46,7 +55,7 @@

| Region | Count |
| :----: | :---: |
| Guangdong Guangzhou | 10287 |
| Sichuan Chengdu | 8446 |
| Beijing Dongcheng District | 6960 |
| Overseas Other | 6866 |
@@ -56,38 +65,99 @@
| Beijing Haidian District | 5163 |
| Shandong Jinan | 4884 |
| Guangdong Shenzhen | 4827 |
| Hubei Wuhan | 4566 |

3. Top 10 club team:

| Team | Count |
| :----: | :---: |
| Barcelona | 87400 |
| Real Madrid | 86600 |
| Manchester United | 41716 |
| Bayern Munich | 29428 |
| Arsenal | 19728 |
| AC Milan | 19373 |
| Guangzhou Evergrande Taobao | 18762 |
| Liverpool | 15589 |
| Chelsea | 14601 |
| Inter Milan | 12332 |


4. Top 5 national team:

| Team | Count |
| :--: | :---: |
| China | 24134 |
| Germany | 3655 |
| Argentina | 2354 |
| Italy | 1723 |
| Brazil | 1617 |
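
One way to produce top-N tables like these is a `GROUP BY` query against the `user` table. The sketch below is not the repository's code (analysis.py tallies counts in Python with a dict); `top_regions` is a hypothetical helper, and it assumes only that `user` has a `region_phrase` column.

```python
import sqlite3
from typing import List, Tuple


def top_regions(conn: sqlite3.Connection, limit: int = 10) -> List[Tuple[str, int]]:
    """Return the `limit` most common region_phrase values with their counts."""
    cursor = conn.execute(
        "SELECT region_phrase, COUNT(*) AS n FROM user "
        "GROUP BY region_phrase "
        "ORDER BY n DESC, region_phrase "  # tie-break by name for stable output
        "LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

With the real database this would be called as `top_regions(sqlite3.connect('data.db'))`; the same query shape should work for `team_id` once it is joined against the team table.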

----

### 3. Data visualization

~~Uses d3.js / E-Charts.~~
Uses [**PyEcharts**](https://github.com/pyecharts/pyecharts), highly recommended.



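3.1 Gender

![Gender](./img/性别.png)



3.2 Distribution within China

![Distribution of Dongqiudi users within China](./img/国内懂球帝分布.png)



3.3 Distribution overseas

![Distribution of Dongqiudi users overseas](./img/海外懂球帝分布.png)



3.4 National teams

![National teams](./img/国家队.png)



3.5 Premier League

![Premier League](./img/英超.png)



3.6 Serie A

![Serie A](./img/意甲.png)



3.7 Chinese Super League

![Chinese Super League](./img/中超.png)



3.8 Clubs

![Clubs](./img/俱乐部.png)



3.9 Name word cloud

![Name word cloud](./img/echarts.png)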
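
The word cloud is built from the most frequent words after segmentation. As a rough sketch of the selection step (drop single-character words, keep the top 100), assuming the tokens have already been produced by jieba; `top_words` is a hypothetical helper, and it uses `collections.Counter` rather than the manual dict and sort in analysis.py:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(tokens: Iterable[str], limit: int = 100) -> List[Tuple[str, int]]:
    """Count tokens, ignoring single-character ones, and keep the most frequent."""
    counts = Counter(t for t in tokens if len(t) > 1)
    return counts.most_common(limit)
```

The returned pairs can be unzipped into the word list and the count list that the word cloud chart takes as input.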



3.10 Join date

![Join date](./img/加入时间.png)



Uses **PyEcharts**, highly recommended.
32 changes: 22 additions & 10 deletions analysis.py
@@ -1,8 +1,6 @@
# -*- coding:utf-8 -*-

import json
import sqlite3
import time
import datetime

import jieba
@@ -12,6 +10,17 @@


def process_league_data(league_name, cursor, data, env, figure_type='pie', limit=None):
"""
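Draw the pie chart for a given league.
:param league_name: league name
:param cursor:
:param data:
:param env:
:param figure_type: 'pie' for a pie chart, 'bar' for a bar chart
:param limit: number of clubs to include when drawing a bar chart
:return: None
"""
# First build the dict that maps team id to team name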
if league_name == '俱乐部':
cursor.execute("SELECT * FROM team where league != '国家队'")
@@ -32,6 +41,7 @@ def process_league_data(league_name, cursor, data, env, figure_type='pie', limit
else:
team_dict[x[7]] = 1

# Sort by count, descending
team_list = [(k, v) for k, v in team_dict.items()]
team_list = sorted(team_list, key=lambda x: x[1], reverse=True)

@@ -57,7 +67,13 @@ def process_league_data(league_name, cursor, data, env, figure_type='pie', limit
print('-----------------------------------------------------------------------------------')


def count(calculate_word_cloud=True):
def analyse(calculate_word_cloud=True):
"""
Analyse the user data in data.db -> user.
:param calculate_word_cloud: whether to compute the word cloud, which is relatively time-consuming.
:return:
"""
conn = sqlite3.connect('data.db')

cursor = conn.cursor()
@@ -214,12 +230,11 @@ def normalize(l):
value_list = [(k, v) for k, v in word_dict.items()]
value_list = sorted(value_list, key=lambda x: x[1], reverse=True)

# Drop single-character words
attr = [x[0] for x in value_list if len(x[0]) > 1]
value = [x[1] for x in value_list if len(x[0]) > 1]

# attr = [x[0] for x in value_list]
# value = [x[1] for x in value_list]

# Keep the top 100
attr = attr[:100]
value = value[:100]

@@ -249,9 +264,6 @@ def normalize(l):
attr = [t - datetime.timedelta(days=x) for x in attr]
attr = [f'{x.year}-{x.month}-{x.day}' for x in attr]

print(attr[:10])
print(v[:10])

bar = Bar(f"加入时间")
bar.add("", attr, v, is_stack=True)

@@ -311,4 +323,4 @@ def normalize(l):


if __name__ == '__main__':
count(calculate_word_cloud=False)
analyse(calculate_word_cloud=False)
37 changes: 2 additions & 35 deletions crawl.py
@@ -6,8 +6,6 @@
import time
import multiprocessing
import argparse
import asyncio
import aiohttp

from util import get_user_info, get_comment_user, get_articles_id

@@ -18,6 +16,8 @@ def write_team_info():
:return:
"""

# Replace the Cookie with your own before use.
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
'Cookie': 'laravel_session=eyJpdiI6ImtucXJlaTdDdnlCWHJOaDl6Q3pnNlZkcUgxU0FpVE5IZDBuWGt1a3pha2c9IiwidmFsdWUiOiJNQ0ZzNXZla2hsaWtENEEraERQQW1adXFRdUdCUlBGV25MQ09SMW5jek1EV2xPaG5sV05VSGFHMUkxSDVEM1pBVWJsWFBZMUQ1SnRCQnREZlBrRUJ5dz09IiwibWFjIjoiOGE3NzY2YWE3NTlmYjIyODg5M2U4ZjBlMDc4NzU5NzgzYmM2NDIwOTY2MTU0NmI4Zjc5OTFjMWM5YmQ1YzZmMSJ9; expires=Sat, 12-May-2018 13:55:57 GMT; Max-Age=7200; path=/; domain=dongqiudi.com; httponly'
@@ -182,10 +182,6 @@ def write_article_comment_user(page_num, obtain_article=True, multi_process=Fals
pool.close()
pool.join()

# loop = asyncio.get_event_loop()
# loop.run_until_complete(async_get_comment_user(loop, article_id_list))
# loop.close()

print(f'User id set obtained, there are total {total_users} users.')

toc2 = time.time()
@@ -240,15 +236,6 @@ def write_user_info(begin, end):
conn = sqlite3.connect('data.db')

cursor = conn.cursor()
# cursor.execute('select * from article_comment_user')
# value = cursor.fetchall()
#
# user_id_set = set()
# for x in value:
# d = eval(x[2])
# user_id_set.update(d)
#
# print(f'There are {len(user_id_set)} users in total.')

with open('user_id_set.txt', 'rb') as F:
d = F.readlines()
@@ -259,16 +246,6 @@ def write_user_info(begin, end):
print(f'Part 1 finish, cost time {toc1 - tic} second.')

# 2. Fetch user info and write it to the user table
# cursor.execute("select * from sqlite_master where type = 'table' and name = 'user'")
# value = cursor.fetchall()
# if value:
# cursor.execute('DROP TABLE user')
#
# cursor.execute('CREATE TABLE user (id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, user_id VARCHAR(20), '
# 'user_name VARCHAR(50), gender VARCHAR(10), created_at VARCHAR(30), region_id INTEGER, '
# 'region_phrase VARCHAR(15), team_id VARCHAR(10), introduction VARCHAR(20), timeline_total INTEGER, '
# 'post_total INTEGER, reply_total INTEGER, up_total VARCHAR(10), following_total VARCHAR(10), '
# 'followers_total VARCHAR(10))')

insert_data = list()
count = 0
@@ -296,15 +273,6 @@ def write_user_info(begin, end):
print(f'Part 2 costs time: {toc2 - toc1} second.')


def async_write_user_info():
"""
Crawl user info asynchronously.
:return:
"""
pass


if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Parse arguments.')
parser.add_argument('--begin', type=int, default=0,
@@ -323,5 +291,4 @@ def async_write_user_info():

# write_user_info(args.begin, args.end)

# async_write_user_info()

Binary file added img/echarts.png
Binary file added img/中超.png
Binary file added img/俱乐部.png
Binary file added img/加入时间.png
Binary file added img/国内懂球帝分布.png
Binary file added img/国家队.png
Binary file added img/性别.png
Binary file added img/意甲.png
Binary file added img/海外懂球帝分布.png
Binary file added img/英超.png
4 changes: 4 additions & 0 deletions requirements.txt
@@ -0,0 +1,4 @@
pyecharts
requests
jieba
echarts-countries-pypkg
4 changes: 0 additions & 4 deletions util.py
@@ -2,11 +2,7 @@

import requests
import json
import sqlite3
import time
import asyncio

from pprint import pprint


def get_articles_id(page_num):