Data and code for "Natural language processing reveals vulnerable mental health groups and heightened health anxiety on Reddit during COVID-19"

1. Data

Available at Open Science Framework: https://osf.io/7peyq/

Also available through Zenodo: https://zenodo.org/record/3941387#.YFfi3EhJHL8

Please cite if you use the data:

Low, D. M., Rumker, L., Talkar, T., Torous, J., Cecchi, G., & Ghosh, S. S. (2020). Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study. Journal of Medical Internet Research, 22(10), e22635. doi: 10.2196/22635

@article{low2020natural,
  title={Natural Language Processing Reveals Vulnerable Mental Health Support Groups and Heightened Health Anxiety on Reddit During COVID-19: Observational Study},
  author={Low, Daniel M and Rumker, Laurie and Talkar, Tanya and Torous, John and Cecchi, Guillermo and Ghosh, Satrajit S},
  journal={Journal of medical Internet research},
  volume={22},
  number={10},
  pages={e22635},
  year={2020},
  publisher={JMIR Publications Inc., Toronto, Canada}
}

License: This dataset is made available under the Public Domain Dedication and License v1.0, whose full text can be found at http://www.opendatacommons.org/licenses/pddl/1.0/. The data were downloaded using the Pushshift API; re-use of this data is subject to the Reddit API terms.

1.1. Reddit mental health dataset

Located in data/input/reddit_mental_health_dataset/

Posts and text features from 28 mental health and non-mental health subreddits (time frames listed further below):

  • 15 specific mental health support groups (r/EDAnonymous, r/addiction, r/alcoholism, r/adhd, r/anxiety, r/autism, r/bipolarreddit, r/bpd, r/depression, r/healthanxiety, r/lonely, r/ptsd, r/schizophrenia, r/socialanxiety, and r/suicidewatch)
  • 2 broad mental health subreddits (r/mentalhealth, r/COVID19_support)
  • 11 non-mental health subreddits (r/conspiracy, r/divorce, r/fitness, r/guns, r/jokes, r/legaladvice, r/meditation, r/parenting, r/personalfinance, r/relationships, r/teaching).

Downloaded using the Pushshift API. Re-use of this data is subject to the Reddit API terms. Please cite the reference above if using this dataset.

Filenames and corresponding time frames:

  • post: Jan 1 to April 20, 2020 (called "mid-pandemic" in manuscript; r/COVID19_support appears)
  • pre: Dec 2018 to Dec 2019. A full year, which provides more data for a baseline of Reddit posts.
  • 2019: Jan 1 to April 20, 2019 (r/EDAnonymous appears). A control for seasonal fluctuations to match post data.
  • 2018: Jan 1 to April 20, 2018. A control for seasonal fluctuations to match post data.

See Supplementary Materials for more information.

Note: if subsampling (e.g., to balance subreddits), we recommend bootstrapping analyses for unbiased results.
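
As a rough illustration of this subsampling note, the following Python sketch loads one of the dataset files and draws repeated balanced subsamples. The filename and the 'subreddit' column name are assumptions; check the OSF/Zenodo file listing for the exact names.

  import pandas as pd

  # Hypothetical filename inside data/input/reddit_mental_health_dataset/
  df = pd.read_csv('data/input/reddit_mental_health_dataset/post.csv')

  # Balance subreddits by subsampling each to the size of the smallest group,
  # repeating the draw (bootstrapping) so results do not depend on a single sample.
  n_min = df['subreddit'].value_counts().min()
  for seed in range(100):
      balanced = (df.groupby('subreddit', group_keys=False)
                    .apply(lambda g: g.sample(n=n_min, random_state=seed)))
      # ...run the analysis on `balanced` and aggregate results across seeds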

1.2. COVID-19 mention dataset (Figure 1)

Located in data/input/covid19_counts/

The same posts as in the post file above, for the 15 mental health support subreddits.

The following tokens were counted (a counting sketch follows the list): 'corona', 'virus', 'viral', 'covid', 'sars', 'influenza', 'pandemic', 'epidemic', 'quarantine', 'lockdown', 'distancing', 'national emergency', 'flatten', 'infect', 'ventilator', 'mask', 'symptomatic', 'epidemiolog', 'immun', 'incubation', 'transmission', 'vaccine'

  • covid19_boolean: whether at least one of these tokens appears in the post (Figure 1)
  • covid19_total: total count of matching tokens
  • covid19_weighed_words: total count of matching tokens normalized by the number of words (n_words) in the post (Figure S3)
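
A minimal sketch of how these columns could be derived from the raw post text; the input filename, the 'post' text column, and plain substring matching are assumptions rather than the exact pipeline.

  import pandas as pd

  tokens = ['corona', 'virus', 'viral', 'covid', 'sars', 'influenza', 'pandemic',
            'epidemic', 'quarantine', 'lockdown', 'distancing', 'national emergency',
            'flatten', 'infect', 'ventilator', 'mask', 'symptomatic', 'epidemiolog',
            'immun', 'incubation', 'transmission', 'vaccine']

  def count_covid_tokens(text):
      # Substring counts, so stems like 'immun' also match 'immune' or 'immunity'
      text = text.lower()
      return sum(text.count(token) for token in tokens)

  df = pd.read_csv('data/input/covid19_counts/posts.csv')  # hypothetical filename
  texts = df['post'].astype(str)                           # 'post' column assumed
  df['covid19_total'] = texts.apply(count_covid_tokens)
  df['covid19_boolean'] = df['covid19_total'] > 0
  df['n_words'] = texts.str.split().str.len()
  df['covid19_weighed_words'] = df['covid19_total'] / df['n_words']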

1.3. COVID-19 cases

Confirmed COVID-19 cases obtained from ourworldindata.org/covid-cases (source: European CDC).

2. Reproduce

All .ipynb notebooks can be run on Google Colab (data should be on Google Drive; code to load data from Google Drive is available in the scripts) or in Jupyter Notebook.

To run the .py scripts or .ipynb notebooks in Jupyter Notebook, create a virtual environment and install the requirements from requirements.txt:

  • conda create --name reddit --file requirements.txt
  • conda activate reddit

2.1. Preprocessing

  • reddit_data_extraction.ipynb: download the data
  • reddit_feature_extraction.ipynb: feature extraction for classification (TF-IDF was re-fit separately on the training set; see the sketch after this list), trend analysis, and supervised dimensionality reduction
  • See below for the preprocessing used for topic modeling and unsupervised clustering
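
Because TF-IDF must not leak information from the test set, the vectorizer is fit on the training split only and then applied to the held-out split. A minimal scikit-learn sketch, with an illustrative filename, column names, and parameters:

  import pandas as pd
  from sklearn.model_selection import train_test_split
  from sklearn.feature_extraction.text import TfidfVectorizer

  df = pd.read_csv('data/input/reddit_mental_health_dataset/post.csv')  # hypothetical filename
  X_train, X_test, y_train, y_test = train_test_split(
      df['post'].astype(str), df['subreddit'], test_size=0.2,
      stratify=df['subreddit'], random_state=0)

  vectorizer = TfidfVectorizer(max_features=256, stop_words='english')
  X_train_tfidf = vectorizer.fit_transform(X_train)  # fit on the training set only
  X_test_tfidf = vectorizer.transform(X_test)        # reuse the fitted vocabulary and IDF weights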

2.2. Analyses

Classification
  • Clone catpro from https://github.com/danielmlow/catpro/ and change the path in run.py (sys.path.append('./../../catpro')) accordingly
  • In config.py, set the paths, the subreddits to run, and the sample size
  • N selects the model (0=SGD L1, 1=SGD EN, 2=SVM, 3=ET, 4=XGB); a sketch of this mapping appears after this list
  • Run remotely: run_v8_<N>.sh runs run.py on a cluster, running each binary classifier on a different node by setting --job_array_task_id to a value in range(0,15)
  • Run locally (set --job_array_task_id and --run_modelN accordingly):
python3 -i run.py --job_array_task_id=1 --run_modelN=0 --run_version_number=8 
  • classification_results.py: Figure 5a; summarizes results, extracts important features, and visualizes testing on r/COVID19_support (psychological profiler). Change paths accordingly before running.
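
A sketch of how the --run_modelN index could map to estimators; the hyperparameters here are placeholders, and run.py defines the actual configurations.

  from sklearn.linear_model import SGDClassifier
  from sklearn.svm import LinearSVC
  from sklearn.ensemble import ExtraTreesClassifier
  from xgboost import XGBClassifier

  models = {
      0: SGDClassifier(penalty='l1'),          # SGD L1
      1: SGDClassifier(penalty='elasticnet'),  # SGD EN (elastic net)
      2: LinearSVC(),                          # SVM
      3: ExtraTreesClassifier(),               # ET (extra trees)
      4: XGBClassifier(),                      # XGB (gradient boosting)
  }
  clf = models[0]  # e.g., --run_modelN=0 selects the L1-penalized SGD classifier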
Trend Analysis
  • reddit_descriptive.ipynb: Figures 1 and 2
Unsupervised clustering
  • Unsupervised_Clustering_Pipeline.ipynb: Figures 3 and 5c
Topic Modeling
  • reddit_lda_pipeline.ipynb: Figures 4 and 5b
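
A minimal gensim-style LDA sketch of the kind of pipeline the notebook implements; the filename, tokenization, vocabulary filtering, and number of topics are assumptions.

  import pandas as pd
  from gensim import corpora
  from gensim.models import LdaModel
  from gensim.utils import simple_preprocess

  df = pd.read_csv('data/input/reddit_mental_health_dataset/post.csv')  # hypothetical filename
  docs = [simple_preprocess(t, deacc=True) for t in df['post'].astype(str)]

  dictionary = corpora.Dictionary(docs)
  dictionary.filter_extremes(no_below=5, no_above=0.5)  # drop rare and very common tokens
  corpus = [dictionary.doc2bow(doc) for doc in docs]

  lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10,
                 passes=5, random_state=0)
  for topic_id, topic in lda.print_topics(num_topics=10, num_words=8):
      print(topic_id, topic)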
Supervised dimensionality reduction
  • reddit_cluster.ipynb: Figure 6
  • reddit_cluster.py: UMAP on 50 random subsamples of the 2019 (pre) data to assess the precision of the method (a minimal sketch follows the command below)
    • Run remotely: run_umap.sh
    • Run locally (--job_array_task_id runs a single subsample):
    python3 reddit_cluster.py --job_array_task_id=0 --plot=True --pre_or_post='pre'
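
A minimal sketch of the UMAP step on a single random subsample; the filename, feature selection, and UMAP parameters are placeholders, and reddit_cluster.py holds the actual settings.

  import pandas as pd
  import umap  # umap-learn

  df = pd.read_csv('data/input/reddit_mental_health_dataset/pre.csv')  # hypothetical filename
  subsample = df.sample(frac=0.5, random_state=0)                      # one random subsample

  # Illustrative: use the precomputed numeric text features as input,
  # supervised by the subreddit labels (supervised dimensionality reduction)
  features = subsample.select_dtypes('number')
  labels = subsample['subreddit'].astype('category').cat.codes
  embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
                        random_state=0).fit_transform(features, y=labels)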
    
