By AkagawaTsurunaki.
A spider script that crawls the comments under Bilibili videos. It supports automatic checkpoint recovery from historical records, basic data cleaning, and basic data statistics.
2023/10/19
Now we support collecting comments from Xiaohongshu and saving the data as a tree structure.
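The tree structure mentioned above could, for example, be serialized as nested dictionaries — a minimal sketch (the field names here are illustrative, not the script's actual schema):

```python
import json

def make_comment(comment_id, content, replies=None):
    """Build one node of a comment tree; replies are child comment nodes."""
    return {"id": comment_id, "content": content, "replies": replies or []}

# A root comment with one nested reply.
tree = make_comment(1, "Top-level comment", [
    make_comment(2, "A reply to the top-level comment"),
])

print(json.dumps(tree, ensure_ascii=False, indent=2))
```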
You can download and install the Firefox browser from Mozilla's official website.
You can download the Firefox geckodriver through the following link:
https://github.com/mozilla/geckodriver/releases
Remember that the versions of the browser and the driver must correspond.
We assume you have Python installed on your computer. Install all required dependencies with the following pip command:
pip install -r requirements.txt
You can find the configuration file at .\config\chat_spider_config.py.

firefox_profile_dir: path to the Firefox profile folder, which contains your personal information, for instance your login cookies. Without logging in, you can see only 3 comments under each specified Bilibili video, so log in first and then start BiliChatSpider.

firefox_driver_dir: path to the Firefox driver. To launch Selenium successfully, you must download the driver matching your Firefox version.

save_path: path where the comment data will be stored. Defaults to .\dataset.

sleep_time_before_job_launching: the number of seconds a process sleeps before a job launches.

sleep_time_after_job_launching: the number of seconds a process sleeps after a job launches.

max_parallel_job_num: BiliChatSpider uses multiprocessing; this argument specifies how many processes are started in parallel.
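Under these option names, chat_spider_config.py would plausibly look like the sketch below. The values are placeholders, not the project's defaults — adjust them for your machine:

```python
# Hypothetical contents of .\config\chat_spider_config.py.
# The option names come from this README; the values are only examples.
firefox_profile_dir = r"C:\Users\you\AppData\Roaming\Mozilla\Firefox\Profiles\example.default-release"
firefox_driver_dir = r"C:\tools\geckodriver.exe"
save_path = r".\dataset"
sleep_time_before_job_launching = 5   # seconds to sleep before a job launches
sleep_time_after_job_launching = 10   # seconds to sleep after a job launches
max_parallel_job_num = 4              # number of worker processes in parallel
```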
For instance, suppose you want to crawl 4 UPs whose names are 嘉然今天吃什么, 向晚大魔王, 乃琳Queen, and 珈乐Carol.
In the terminal, use a command like the following:
python ./bili.py `
-l 672328094 672346917 672342685 351609538
BiliChatSpider will then start automatically and save the comment data.
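The parallel job launching controlled by max_parallel_job_num and the two sleep_time_* options could be sketched as follows. Here crawl_one is a stand-in for the real per-uid job, and the sleep times are shortened for the demo:

```python
import time
from multiprocessing import Pool

SLEEP_BEFORE = 0.1  # stands in for sleep_time_before_job_launching
SLEEP_AFTER = 0.1   # stands in for sleep_time_after_job_launching

def crawl_one(uid):
    """Stand-in for the real crawling job for one UP's uid."""
    time.sleep(SLEEP_BEFORE)   # pause before the job launches
    result = f"crawled {uid}"  # real code would drive the browser here
    time.sleep(SLEEP_AFTER)    # pause after the job launched
    return result

if __name__ == "__main__":
    uids = ["672328094", "672346917", "672342685", "351609538"]
    # max_parallel_job_num workers run the jobs concurrently.
    with Pool(processes=4) as pool:
        print(pool.map(crawl_one, uids))
```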
-l or --list: a list of the uids you want to crawl.
-f or --force: force the script to update the history cache; defaults to N, the opposite of Y.
BiliChatSpider will first open a page to get the up_name, then traverse the whole space to get the video list and store it at the default path .\dataset\history.json.
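Checkpoint recovery via such a history file could work roughly like the sketch below. The file layout shown (a flat JSON list of BV numbers) is an assumption, not necessarily the script's exact format:

```python
import json
import os

HISTORY_PATH = "history.json"  # the README's default is .\dataset\history.json

def load_history(path=HISTORY_PATH):
    """Return the set of BV numbers already crawled, or an empty set."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return set(json.load(f))
    return set()

def save_history(done, path=HISTORY_PATH):
    """Persist the crawled BV numbers as a sorted JSON list."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)

done = load_history()
for bv in ["BV14m4y1V7oj", "BV1Jw41117Zk"]:
    if bv in done:
        continue  # checkpoint recovery: skip videos already crawled
    # ... crawl comments for bv here ...
    done.add(bv)
save_history(done)
```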
-t or --time: set the time at which BiliChatSpider will start automatically.
If you want to crawl the videos related to a specified uid, use a command like the following in the terminal:
python ./bili.py `
-u 672328094 `
-b BV14m4y1V7oj BV1Mm4y1V7R BV1Jw41117Zk
-u or --uid: the uid of the UP.
-b or --bv: a list of the BV numbers of the videos you want to crawl.
Clean the data using the following command:
python clean.py
Remember that ./dataset is the default save path! The cleaned data will be written to the default path ./data_clean/train.json.
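Typical basic cleaning steps for comment data might look like the sketch below. The rules shown here — trimming whitespace, dropping empty strings, de-duplicating — are illustrative; clean.py may apply different ones:

```python
def clean_comments(comments):
    """Trim whitespace, drop empty strings, remove duplicates (order kept)."""
    seen = set()
    cleaned = []
    for c in comments:
        c = c.strip()
        if c and c not in seen:
            seen.add(c)
            cleaned.append(c)
    return cleaned

print(clean_comments(["  hello ", "", "hello", "world"]))  # → ['hello', 'world']
```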
Use the following command to print statistics on what you have collected from Bilibili.
python ./bili_statistic.py
The output format will be as follows.
{index} ({uid}): {number_of_comments}
...
Total: {total_number_of_all_comments}
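Producing per-uid counts in the format above is a simple aggregation — a sketch with a few made-up records:

```python
from collections import Counter

# Each record maps a comment to the uid of the UP it was collected from.
records = [{"uid": "672328094"}, {"uid": "672328094"}, {"uid": "351609538"}]

counts = Counter(r["uid"] for r in records)
for index, (uid, n) in enumerate(counts.items()):
    print(f"{index} ({uid}): {n}")
print(f"Total: {sum(counts.values())}")
```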
In a previous version of BiliChatSpider, I found that the Chrome browser caused a memory leak while Firefox did not. Therefore, I suggest using Firefox to run this script.
E-mail: [email protected]
Github: https://github.com/AkagawaTsurunaki
If you have any questions, please raise them in an issue or contact me via e-mail. Thank you for your support of and contributions to this project.
This program may only be used for study purposes.
The author is not liable for any legal consequences arising from your use of this program!
APACHE LICENSE, VERSION 2.0
For more details, please refer to https://www.apache.org/licenses/LICENSE-2.0.html