A project combining network analysis and text-based clustering of scientific literature to identify waxing and waning scientific disciplines and to explore other interesting research questions.
For more details on how and why the contents of this repo were built, please see our paper!
Installation should be as easy as `pip install -r requirements.txt`. You may then also need to run `pip install .` from the repo root to install the `vespid` package.
Note that a lot of this work was done in a Linux-based Docker container environment, so if any details look particularly Linux-y to you, that's why.
This project directory structure is a variant on that provided by cookiecutter data science.
├── LICENSE
├── README.md          <- The top-level README for developers using this project.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Makes the project pip-installable (`pip install -e .`) so vespid can be imported
│
├── vespid             <- Source code for use in this project.
│   ├── __init__.py    <- Makes vespid a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │                     predictions
│   │
│   └── visualization  <- Scripts to create exploratory and results-oriented visualizations
If actively developing code for the package, we recommend installing it via `pip install -e .` from the project root directory. This installs it in editable mode, so any changes you make are picked up automatically on your machine. Make sure any notebook you may be using also has

%load_ext autoreload
%autoreload 2

in its imports, so the edits are auto-loaded into the kernel too.
A huge portion of this code assumes you have an instance of Neo4j (we used Community Edition) running on a server that you can access. We used Neo4j for the network-based analyses presented herein and as a single source of "data truth" for all collaborators and developers.
Unfortunately, our dataset was proprietary as it utilized the Web of Science, but much of what we did here could be replicated via an open dataset like Semantic Scholar (which we also used).
We used AWS Batch to run GPU- and CPU-based workloads for much of this project, in particular the data engineering work of converting raw bibliometric data into a Neo4j graph format and the training and optimization of UMAP+HDBSCAN-based clustering pipelines. We recommend you do the same.
In case you're interested in setting up a Neo4j server on AWS like we did, here's some info for you.
- Create an instance (e.g. t2.large) with at least 2 GB of memory, ideally 16 GB, using the Amazon Linux 2 AMI
- SSH in
- `sudo rpm --import https://debian.neo4j.com/neotechnology.gpg.key`
- Open up a text editor like `nano` and create the file `/etc/yum.repos.d/neo4j.repo` with the contents:

  [neo4j]
  name=Neo4j RPM Repository
  baseurl=https://yum.neo4j.com/stable
  enabled=1
  gpgcheck=1

- `sudo amazon-linux-extras enable java-openjdk11` to ensure you have Java 11 enabled
- `sudo yum install neo4j` for the latest version or `sudo yum install neo4j-<version_num>` for a specific version
- `nano /etc/neo4j/neo4j.conf` and:
  - Uncomment `dbms.connectors.default_listen_address=0.0.0.0` so it will accept IPs from outside of localhost
  - Make sure `dbms.connector.bolt.enabled=true` (should be by default)
  - Consider setting `dbms.connector.https.enabled=true` if you have an SSL policy/certificate you can also provide, and then set the corresponding HTTP setting to false (so you only send authentication info over the secure wire using port 7473, instead of the unsecured 7474)
- Go to `public_IP:7474` in your browser and use username=neo4j, password=neo4j for the first-time login (you'll be prompted to create a new password going forward)
- Follow Neo4j's helpful guides on ingesting data into your new database!
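Once the server is up, a quick way to sanity-check that Bolt is reachable from outside the instance is a minimal query with the official `neo4j` Python driver. This is just a sketch; the IP and password below are placeholders for your own values:

# Minimal Bolt connectivity check with the official neo4j Python driver
# (placeholder IP and password; 7687 is Neo4j's default Bolt port)
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "bolt://<public_IP>:7687",
    auth=("neo4j", "<your_new_password>")
)
with driver.session() as session:
    print(session.run("RETURN 1 AS ok").single()["ok"])  # prints 1 if all is well
driver.close()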
Backing up the database is often useful when you want to switch out datasets (if using Community Edition, wherein only a single database is allowed, or if you are storage-constrained).
- (as the neo4j user) `neo4j stop`
- `neo4j-admin dump --database=neo4j --to=/dumps/neo4j/neo4j-<db_nickname>-<datetimestamp>.dump`
  - If the directory `/dumps/neo4j/` doesn't exist yet, switch to ec2-user and run `sudo mkdir /dumps/ && sudo mkdir /dumps/neo4j/ && sudo chown neo4j:neo4j /dumps/neo4j/`
  - Note that 150M nodes and 523M edges results in a compressed dump file of around 61 GB
- Jump over to ec2-user
- Copy your dump file somewhere helpful, like an S3 bucket (see the sketch after this list)
- `sudo su - neo4j`
- Import/ingest data as needed
- `neo4j start && tail -f /var/log/neo4j/neo4j.log`
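For the "copy your dump file somewhere helpful" step, here's a minimal sketch of pushing the dump to S3 with `boto3` (bucket and key names are placeholders; the AWS CLI works just as well):

# Upload the database dump to S3 with boto3 (placeholder bucket/key names)
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="/dumps/neo4j/neo4j-<db_nickname>-<datetimestamp>.dump",
    Bucket="<your-backup-bucket>",
    Key="neo4j-dumps/neo4j-<db_nickname>-<datetimestamp>.dump"
)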
Note that the upgrade steps below include some explicit version numbers for the software/packages used. Make sure you update to the latest versions of these before running any of the commands!
- Follow steps 1-4 in the instructions above for backing up the database
- Copy the old config file: `sudo cp /etc/neo4j/neo4j.conf /etc/neo4j/neo4j_<current_date_in_form_MM-DD-YYYY>.conf`
  - This ensures we can throw any custom config settings into the upgraded DB config, in case it doesn't do so automatically during the upgrade
- `sudo yum update` to update the core DBMS as ec2-user (and anything else needing it!)
- `sudo su - neo4j` to switch to the neo4j user
- `nano /etc/neo4j/neo4j.conf` and set `dbms.allow_upgrade=true` as well as `dbms.mode=SINGLE` (you'll likely have to add the second one as a new line)
  - Full form of the dbms.mode entry should be:

    # Makes sure the system database is upgraded too
    dbms.mode=SINGLE

- `cd /var/lib/neo4j/plugins/` and install updated versions of the plugins that will work with the Neo4j version you're installing:
  - `rm <apoc_jar_file> && rm <gds_jar_file>` to remove the old plugins so the system isn't confused between old and new versions at startup
  - `wget https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/4.4.0.1/apoc-4.4.0.1-all.jar`
  - `wget https://s3-eu-west-1.amazonaws.com/com.neo4j.graphalgorithms.dist/graph-data-science/neo4j-graph-data-science-1.8.1-standalone.zip && unzip neo4j-graph-data-science-1.8.1-standalone.zip`
  - `rm neo4j-graph-data-science-1.8.1-standalone.zip`
- `cd ..` to get back to `/var/lib/neo4j/` and then `neo4j start && tail -f /var/log/neo4j/neo4j.log`: the process of starting the database should cause it to actually perform the upgrade, which you'll see in the logs when it spins up
- Kick the tires on the newly spun-up DBMS:
  - `CALL dbms.components()` to check that you're at the version of the core DB you're expecting
  - `RETURN gds.version()`
  - `RETURN apoc.version()`
  - `MATCH (n) RETURN COUNT(n)` just to be sure!
- `nano /etc/neo4j/neo4j.conf`, set `dbms.allow_upgrade=false`, and comment out `dbms.mode=SINGLE`
- `neo4j restart && tail -f /var/log/neo4j/neo4j.log`
You can call export and import procedures from APOC (see below), moving data from the remote instance to your local machine and back, etc. We've created the `vespid.data.neo4j.export_scp` script to help you with this (exporting and scp'ing to your local machine). Note that you'll need to add a line at the end of `/etc/neo4j/neo4j.conf` that says `apoc.export.file.enabled=true` in order to make this work.
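If you'd rather trigger the export yourself rather than using that script, here's a minimal sketch of calling APOC's `apoc.export.graphml.all` procedure with the official `neo4j` Python driver (connection details and the output filename are placeholders):

# Export the whole graph to a graphML file via APOC
# (requires apoc.export.file.enabled=true; placeholder connection details)
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://<db_ip>:7687", auth=("neo4j", "<db_password>"))
with driver.session() as session:
    session.run(
        "CALL apoc.export.graphml.all('full_graph.graphml', {useTypes: true})"
    )
driver.close()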
To import the data into a new database via a file on disk, make sure `apoc.import.file.enabled=true` is set in `neo4j.conf`. A simple approach to importing at that point (after moving the relevant graphML file into the Neo4j `import/` directory) is to run `CALL apoc.import.graphml("file:///<filename.graphml>", {readLabels: 'true'})`.
Note that you can also load data programmatically in Python, if that makes more sense, via `Neo4jConnectionHandler.insert_data()`.
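As an example of that programmatic route, here's a minimal sketch that kicks off the graphML import through the same `Neo4jConnectionHandler` shown in the usage example later in this README (the IP, password, and filename are placeholders):

# Run the APOC graphML import through vespid's connection handler
# (requires apoc.import.file.enabled=true; placeholder IP, password, and filename)
from vespid.data.neo4j_tools import Neo4jConnectionHandler

graph = Neo4jConnectionHandler(db_ip="<db_ip>", db_password="<db_password>")
query = """
CALL apoc.import.graphml("file:///<filename.graphml>", {readLabels: true})
"""
# apoc.import.graphml returns a summary row (node/relationship counts, etc.)
print(graph.cypher_query_to_dataframe(query))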
The assumption in this section is that we're trying to overwrite an old version of an existing database (e.g. the default `neo4j` database), likely because we're using Community Edition and can't access more than one database at a time.
- (as the neo4j user in the EC2 instance, in `/var/lib/neo4j/`) `neo4j stop`
  - Note that `neo4j status` will report that Neo4j isn't running if you run that command without being the neo4j user first.
- `rm -rf data/databases/neo4j/ && rm -rf data/transactions/neo4j/`
  - This deletes the default database contents so we can import into a fresh database instance
- `neo4j-admin import <details here>` or `neo4j-admin load <details here>` if loading from a Neo4j database dump file
- `neo4j start && tail -f /var/log/neo4j/neo4j.log`
To install the Graph Data Science (GDS) plugin:
- `cd /var/lib/plugins/` if you're not already there
- `wget https://s3-eu-west-1.amazonaws.com/com.neo4j.graphalgorithms.dist/graph-data-science/neo4j-graph-data-science-1.7.2-standalone.zip`
  - This is the latest version as of 2/15/2021, but check the Neo4j Download Center for the latest version before downloading - you can get the URL by right-clicking and copying the link from within your browser.
- `unzip <file_you_just_got>`
- `sudo mv <unzipped_jar_file> /var/lib/neo4j/plugins/`
- `nano /etc/neo4j/neo4j.conf`
  - Uncomment and modify the relevant line to be `dbms.security.procedures.unrestricted=gds.*`
    - It may also have other procedure groups listed for other plugins, in comma-separated fashion. If so, leave those as they are and just add the `gds.*` bit.
- `neo4j restart`
- In Cypher: `RETURN gds.version()` to verify you got what you were looking for
- In Cypher: `CALL gds.list()` to see all procedures/algorithms available to you
To install the APOC plugin:
- `cd /var/lib/plugins/` if you're not already there
- `wget https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/4.3.0.4/apoc-4.3.0.4-all.jar`
- `mv <file_you_just_got> plugins/`
- `neo4j restart && tail -f /var/log/neo4j/neo4j.log`
- In Cypher: `RETURN apoc.version()` to check that it's installed as expected
To install the neo4j-arrow plugin:
- Run the below uncommented commands in your current shell and, for later re-runs, create a `.profile` file owned by the `neo4j` user inside of `$NEO4J_HOME` on the server you're installing to, with the following contents (ensuring you've installed certificates to these locations, setting up SSL):
# neo4j arrow config, per https://github.com/neo4j-field/neo4j-arrow#configuration-%EF%B8%8F
export HOST="0.0.0.0"
export ARROW_TLS_PRIVATE_KEY="$NEO4J_HOME/certificates/private.key"
export ARROW_TLS_CERTIFICATE="$NEO4J_HOME/certificates/public.crt"
# # these are options that might help if you're running into issues:
# first remove what you might have set already for a clean slate...
unset MAX_MEM_GLOBAL MAX_MEM_STREAM ARROW_MAX_PARTITIONS ARROW_BATCH_SIZE
# # now set options!
# export MAX_MEM_GLOBAL=24
# export MAX_MEM_STREAM=8
# export ARROW_MAX_PARTITIONS=6
# export ARROW_BATCH_SIZE=500
- Install the jar plugin from the latest release page
  - Download the jar
  - As the `neo4j` user, move the file into `$NEO4J_HOME/plugins/` and change ownership to match the other files (e.g., `chown neo4j:neo4j`)
- Restart the server with the new configuration
  - If you didn't before, run the commands in your new `.profile` with `source .profile`
    - The `.profile` above does this for you, but if you're manually removing variables, don't forget you have to `unset` them, not just comment out the `export` line in `.profile`!
  - `neo4j restart && tail -f /var/log/neo4j/neo4j.log`
  - Verify that the log contains something like `INFO org.neo4j.arrow.App - server listening @ grpc+tcp://0.0.0.0:9999`
- Test with a dummy query call to `Neo4jConnectionHandler(...).cypher_query_to_dataframe(..., drivers='neo4j-arrow')` (see the sketch after this list)!
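Here's a minimal sketch of that test call, reusing the `Neo4jConnectionHandler` interface shown in the usage example below (the IP and password are placeholders and the query is just a throwaway check):

# Dummy query to confirm the neo4j-arrow driver path works end to end
# (placeholder IP and password)
from vespid.data.neo4j_tools import Neo4jConnectionHandler

graph = Neo4jConnectionHandler(db_ip="<db_ip>", db_password="<db_password>")
df = graph.cypher_query_to_dataframe(
    "MATCH (n) RETURN COUNT(n) AS node_count",
    drivers='neo4j-arrow'
)
print(df)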
The most basic elements of using the database come from `vespid.data.neo4j_tools`. Here's some basic usage:
from vespid.data.neo4j_tools import Neo4jConnectionHandler
graph = Neo4jConnectionHandler(db_ip=db_ip, db_password=db_password)
query = """
MATCH (p:Publication)
WHERE p.title IS NOT NULL
RETURN p.title AS PublicationTitle LIMIT 5
"""
df = graph.cypher_query_to_dataframe(query)
This code will return the graph database query results as a pandas DataFrame. If you don't recognize the query language used above, plug "Neo4j Cypher" into your search engine of choice and you'll find info on it.
The Amazon Linux 2 AMI comes with Python 2 by default, so we need to get some basic functionality going so we can do fun things like use `glances` to monitor system operation!
- [As ec2-user] `wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh`
- `bash Miniconda3-latest-Linux-x86_64.sh`
  - Agree to all terms and allow the default install directory (which should be in the ec2-user home or some such)
  - Allow it to initialize conda
- `exit` to drop the SSH connection and then SSH back in (to refresh the shell with `conda` commands)
- `pip install glances`
This project involves training and using models in multiple places, with different and/or connected pipelines for training and serving. As such, for the purposes of maximal collaboration, we need to maintain metadata, experimental records, and artifacts from the models developed so we can use them more effectively.
The remote MLFlow tracking server is designed to track experiments such as entirely new training runs, hyperparameter tuning experiments, etc. It also serves as an authoritative source for all trained models that we found to be optimal after tuning and evaluation, so they can be reused consistently across different contexts. To this end, we set up our own tracking server backed by an S3 artifact store and recommend that you do the same!
Note that we've taken elements from these articles to set all of this up.
- Spin up a tracking server with a database (PostgreSQL ideally) you can use (e.g. AWS EC2 t2.medium with Amazon Linux 2)
  - Add an IAM role that will grant access to S3 for artifact storage (e.g. `ecsInstanceRole`)
  - Make sure the security group has port 5432 open for the PostgreSQL database
  - Tag with `{project: <org_name>, type: database, database_type: postgresql}`
  - Associate an Elastic IP with it that can then be added to our domain via a DNS type-A record
  - Note that the default port for MLFlow is 5000
- SSH into your server
- Set up the PostgreSQL server:
  - `sudo yum update -y`
  - `sudo amazon-linux-extras install postgresql13`
  - `sudo yum install postgresql postgresql-server`
  - `sudo /usr/bin/postgresql-setup --initdb`
  - `sudo systemctl enable postgresql.service`
  - `sudo systemctl start postgresql.service`
  - `sudo -u postgres psql` and then run:

    CREATE ROLE mlflow_user;
    CREATE DATABASE mlflow_db;
    CREATE USER mlflow_user WITH ENCRYPTED PASSWORD 'mlflow';
    GRANT ALL PRIVILEGES ON DATABASE mlflow_db TO mlflow_user;
    \q

  - `sudo nano /var/lib/pgsql/data/postgresql.conf`
    - Look for "listen_addresses" and replace the entry there with `listen_addresses = '*'`
  - `sudo nano /var/lib/pgsql/data/pg_hba.conf`
    - Add this line at the top of the host/ident/etc. config stuff so it's read first (one tab between entries, except two tabs between the final two entries): `host all all 0.0.0.0/0 md5`
  - `sudo systemctl restart postgresql`
- `wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh`
  - Agree to all terms and allow the default install directory (which should be in the ec2-user home or some such)
  - Allow it to initialize conda
- `exit` to drop the SSH connection and then SSH back in (to refresh the shell with `conda` commands)
- `sudo yum install gcc`
- `pip install glances mlflow psycopg2-binary boto3`
- Set up the MLFlow tracking server as a failure-protected service:
  - `sudo nano /etc/systemd/system/mlflow-tracking.service`, write in the contents of `mlflow/mlflow-tracking.service`, and save
  - `sudo systemctl daemon-reload`
  - `sudo systemctl enable mlflow-tracking`
  - `sudo systemctl start mlflow-tracking && sudo systemctl status mlflow-tracking`
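Once the tracking service is up, client code just needs to point at it before logging anything. Here's a minimal sketch using the standard `mlflow` API (the tracking URI, experiment name, and logged values are placeholders/toy examples):

# Log a toy run against the remote MLFlow tracking server
# (placeholder tracking URI and experiment name)
import mlflow

mlflow.set_tracking_uri("http://<your_tracking_server>:5000")
mlflow.set_experiment("<your_experiment_name>")

with mlflow.start_run():
    mlflow.log_param("min_cluster_size", 50)   # toy hyperparameter
    mlflow.log_metric("dbcv_score", 0.42)      # toy metric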
Many, many open source libraries were used to build this project, and we thank every contributor involved in those for their contributions! Additionally, we developed a lot of new tools along the way in this project, but have not yet had the opportunity to build them out as standalone libraries of their own. So this section will serve as our "to do" list for those. Help in spinning these out into their own repos or cleaning them up to submit to existing relevant repos would be most welcome!
- `vespid.models.optuna` contains new classes of `Criterion`, `Hyperparameter`, and `Objectives` that enable a much more flexible approach to `optuna`-based Bayesian hyperparameter optimization, and thus would likely be a fantastic addition to the `optuna` library.
- `vespid.models.neo4j_tools.Neo4jConnectionHandler` includes a bunch of handy methods for efficiently exploring new graphs as well as multi-driver support that allows users to pick the driver that best suits their needs. Thus far, we've found:
  - The native/official Neo4j driver (published by Neo4j themselves) is great for data inserts
  - `py2neo` is great at read-only queries at small scale
  - `neo4j-arrow` is the fastest by far for read-only queries (and likely for inserts as well, but we haven't tested this). However, it can fail for very large (e.g. millions of records) queries. That said, it is an extremely new library and bound to improve in leaps and bounds in the near future.
- `vespid.models.visualization` has some helpful code on visualizing very large graphs via edge-bundling techniques and zoom-context-aware dynamic resolutions, enabled largely by the datashader library.
- `vespid.models.mlflow_tools` has some nice helper functions that make quickly setting up new MLFlow experiments a breeze. These enhancements would likely make a nice PR to the `mlflow` library, once they've been cleaned up a bit.
- We created `vespid.pipeline` to make it quick and easy (and hopefully intuitive) to set up data pipelines. There may be some useful concepts in here for projects like Apache Airflow, but it's also possible we simply could have used that, had we had the time to learn it :P
- Along the same lines as the earlier item about `vespid.models.optuna`, `vespid.models.batch_cluster_tuning.py` has some interesting ideas we explore more in the paper referenced at the top of the README. Specifically, we take a multi-objective Bayesian hyperparameter optimization approach to finding robust and reproducible HDBSCAN-based clustering solutions (see the sketch below). It's possible, with more experimentation on novel datasets, that this could become a clustering library unto itself or an addition to a larger clustering library.
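To give a flavor of that multi-objective tuning idea, here's a generic sketch built directly on `optuna` and `hdbscan` (not on the vespid `Criterion`/`Hyperparameter`/`Objectives` classes); the objectives, search ranges, and toy data are illustrative only:

# Generic sketch of multi-objective Bayesian tuning of HDBSCAN with optuna:
# jointly maximize HDBSCAN's relative validity (a DBCV approximation) and
# the fraction of points assigned to a cluster (i.e., not labeled as noise).
import hdbscan
import numpy as np
import optuna
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)  # toy data

def objective(trial):
    clusterer = hdbscan.HDBSCAN(
        min_cluster_size=trial.suggest_int("min_cluster_size", 5, 100),
        min_samples=trial.suggest_int("min_samples", 1, 50),
        gen_min_span_tree=True  # required for relative_validity_
    )
    labels = clusterer.fit_predict(X)
    frac_clustered = float(np.mean(labels >= 0))
    return clusterer.relative_validity_, frac_clustered

study = optuna.create_study(directions=["maximize", "maximize"])
study.optimize(objective, n_trials=25)
print(study.best_trials)  # the Pareto-optimal trials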
Project based on the cookiecutter data science project template. #cookiecutterdatascience