Download SciSciNet_NSF and link to mag authors #44

Draft · wants to merge 14 commits into base: main (changes from 3 commits)
93 changes: 93 additions & 0 deletions src/dataprep/main/prep_nsf/link_scinetnsf_to_mag.R
@@ -0,0 +1,93 @@
# Link SciSciNet_Links_NSF table with Paper_Author_Affiliations, Authors, and NSF_Investigator
Owner:

It's not clear to me what the purpose of this function is. It does not seem to change anything in the database? Is it a function to load relevant links that we will use in the future? If so, then maybe we can think to only keep the relevant links directly when writing to the database? Or will we use the links with low similarity later as well?

Collaborator (author):

I haven't written anything to the db here, as I noticed that the command to compare the names does not yet do what I want, so I need to adjust it first. What I want to do is just compare the names and only keep the names that match; it just needs to be discussed how strict this should be.
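To make the "how strict" question concrete, here is a minimal, hypothetical sketch of similarity-based name matching using Python's stdlib `difflib` (standing in for the `stringdist` package used in the R script). The names, the scoring function, and the 0.8 cutoff are illustrative only, not the project's actual matching rule:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two names, case-insensitively."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # illustrative cutoff, to be discussed

pairs = [
    ("john smith", "john smith"),      # exact match
    ("john smith", "jon smith"),       # minor spelling variant
    ("john smith", "maria gonzalez"),  # unrelated names
]

# Only the first two pairs survive the threshold; the unrelated pair is dropped.
kept = [(a, b) for a, b in pairs if name_similarity(a, b) >= THRESHOLD]
```

A stricter threshold (say 0.95) would drop spelling variants as well; a looser one risks linking unrelated investigators, so the choice is a precision/recall trade-off.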

Collaborator (author):

Sorry for the confusion!

# Keeps only those with link between NSF grant and author ID.
# Data downloaded and uploaded into db in: scinet_data_to_db.py in same folder


# Note: Not sure if calculating string distance now works correctly



packages <- c("tidyverse", "broom", "dbplyr", "RSQLite", "stringdist")
lapply(packages, library, character.only = TRUE)

datapath <- "/mnt/ssd/"
db_file <- paste0(datapath, "AcademicGraph/AcademicGraph.sqlite")
#sciscinet_path <- paste0(datapath,"sciscinet_data/")


#filepath_nsf=paste0(sciscinet_path,"SciSciNet_Link_NSF.tsv")

con <- DBI::dbConnect(RSQLite::SQLite(), db_file)
cat("The database connection is: \n")
src_dbi(con)

# Create table with all links between NSF-grant and authors via papers

NSF_to_Authors <- tbl(con, sql("
select a.PaperID, a.Type, a.GrantID, b.AuthorId, b.OriginalAuthor
, c.NormalizedName, d.Position, d.FirstName, d.LastName
from scinet_links_nsf as a
inner join (
select PaperId AS PaperID, AuthorId, OriginalAuthor
from PaperAuthorAffiliations
)b
using (PaperID)
inner join (
select AuthorId, NormalizedName
from Authors
) c
using (AuthorId)
inner join (
select GrantID, Position, FirstName, LastName
from NSF_Investigator
) d
using (GrantID)
"))

nsf_to_authors <- collect(NSF_to_Authors)

# Create a variable with the full name from mag
nsf_to_authors$mag_name <- paste(nsf_to_authors$FirstName, nsf_to_authors$LastName, sep = " ")

## Still running, not sure if running correctly from here

### Compare name similarity
# Set a threshold for similarity
threshold <- 0.8

# Calculate string similarity for each row and add a new column.
# stringsim() is vectorized, so no explicit loop over rows is needed.
nsf_to_authors$name_similarity <- stringsim(
  nsf_to_authors$mag_name,
  nsf_to_authors$NormalizedName
)

# Filter observations where the names are above the threshold
similar_names <- nsf_to_authors %>%
filter(name_similarity >= threshold)

# drop unnecessary variables
df <- similar_names %>%
select(GrantID, AuthorId, Position) %>%
distinct()

# Write table to db:
dbWriteTable(con, name = "links_nsf_mag", value = df, overwrite = TRUE)

# close connection to db
DBI::dbDisconnect(con)
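The uncertainty flagged in the script ("Not sure if calculating string distance now works correctly") may partly come from casing: MAG's `NormalizedName` is lower-cased, while `FirstName`/`LastName` from `NSF_Investigator` can keep their original capitalization. A small Python sketch (stdlib `difflib` stands in for `stringdist::stringsim`; the names are hypothetical) shows how unnormalized casing depresses the score:

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

nsf_name = "John Smith"   # hypothetical investigator name with original casing
mag_name = "john smith"   # MAG NormalizedName is lower-cased

raw = sim(nsf_name, mag_name)                          # penalized by case mismatch
normalized = sim(nsf_name.lower(), mag_name.lower())   # identical after lowering
```

Here the same person scores 0.8 raw but 1.0 after lower-casing both sides, so it seems worth normalizing case before computing `name_similarity` in the R script, whatever threshold is chosen.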
97 changes: 97 additions & 0 deletions src/dataprep/main/prep_nsf/scinet_data_to_db.py
@@ -0,0 +1,97 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-

# %%
"""
Download SciSciNet table SciSciNet_Link_NSF with links between NSF-grants and papers
Upload into db
link to Paper_Author_Affiliations, Authors, NSF_Investigator in R file: link_scinetnsf_to_mag.R in same folder


Create table in database:
- scinet_links_nsf

SciSciNet_Link_NSF schema is:

GrantID TEXT, PaperID INTEGER, Type TEXT


unique index on GrantID and PaperID (multiple PaperIDs per GrantID)
"""

import subprocess
import sqlite3 as sqlite
import argparse
import os
from os import listdir
from os.path import isfile, join
import pandas as pd
import numpy as np
import re
import sys
import requests

sys.path.append('/home/mona/mag_sample/src/dataprep/')
Owner:

This should not be necessary (and not be in the script). If you have problem when not using this, we can solve it together.

Collaborator (author):

Unfortunately, I cannot make it run when leaving it out; I'd need your help here, please.

Owner:

Here are some instructions: https://github.com/f-hafner/mag_sample/blob/main/README.dev.md
Let me know if it works.

Collaborator (author):

It works, thank you!


from helpers.variables import db_file, datapath, databasepath
from helpers.functions import analyze_db

scinet_path = os.path.join(datapath, "sciscinet_data/")
filepath_nsf = os.path.join(scinet_path, "SciSciNet_Link_NSF.tsv")



# Download file

url_nsf = "https://ndownloader.figstatic.com/files/36139242"
response = requests.get(url_nsf)
response.raise_for_status()  # fail early if the download did not succeed
with open(filepath_nsf, "wb") as file:
    file.write(response.content)
print("Downloaded data")


# ## Read files in loop and dump to db

def load_scinet(filepath):
    df = pd.read_csv(filepath,
                     sep="\t",
                     names=["NSF_Award_Number", "PaperID", "Type", "Diff_ZScore"],
                     skiprows=1)
    df.drop_duplicates(inplace=True)

    # Create the GrantID column by keeping only the digits after the hyphen
    df['GrantID'] = df['NSF_Award_Number'].str.extract(r'-(\d+)')
Collaborator:

Why is a new variable created here? Why not just load the table as it is?

Also why drop the diff_zscore and type? These are important to figure out confidence of the link

Collaborator:

New variable: I mean GrantID; why not just keep the original NSF_Award_Number?

    df = df.drop(columns=['NSF_Award_Number', 'Diff_ZScore'])

    return df
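The `GrantID` extraction above depends on the award-number format. Assuming (hypothetically) that SciSciNet stores award numbers like `NSF-1234567`, the regex keeps only the digits after the hyphen, while rows without that pattern become missing; a quick pandas sketch of this assumed behavior:

```python
import pandas as pd

# Hypothetical award numbers; the real SciSciNet format should be verified.
s = pd.Series(["NSF-1234567", "NSF-0001234", "1234567"])
grant_ids = s.str.extract(r'-(\d+)')[0]
```

Note that because the result is a string column (and `GrantID` is declared `TEXT` in the db), leading zeros such as in `0001234` are preserved, which matters when joining against `NSF_Investigator.GrantID`.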

files = [f for f in listdir(scinet_path) if isfile(join(scinet_path, f))]


con = sqlite.connect(database=db_file, isolation_level=None)
with con:
    for i, f in enumerate(files):
        df = load_scinet(os.path.join(scinet_path, f))
        if i == 0:
            if_exists_opt = "replace"
        else:
            if_exists_opt = "append"

        df.to_sql("scinet_links_nsf",
                  con=con,
                  if_exists=if_exists_opt,
                  index=False,
                  # declare column types via `dtype`; pandas' `schema`
                  # argument takes a schema name, not a column definition
                  dtype={"PaperID": "INTEGER",
                         "Type": "TEXT",
                         "GrantID": "TEXT"})

# Make index and clean up
con.execute("CREATE UNIQUE INDEX idx_scinet_grantpaper ON scinet_links_nsf (GrantID ASC, PaperID ASC)")

analyze_db(con)

con.close()
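The unique index on `(GrantID, PaperID)` turns duplicate grant-paper links into a hard error rather than silent double-counting, even when other columns differ. A self-contained sketch with an in-memory SQLite database (the inserted values are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scinet_links_nsf (PaperID INTEGER, Type TEXT, GrantID TEXT)")
con.execute("CREATE UNIQUE INDEX idx_scinet_grantpaper "
            "ON scinet_links_nsf (GrantID ASC, PaperID ASC)")

con.execute("INSERT INTO scinet_links_nsf VALUES (1, 'Cited', '1234567')")

# A second row for the same (GrantID, PaperID) pair violates the index,
# even though Type differs, because Type is not part of the key.
duplicate_rejected = False
try:
    con.execute("INSERT INTO scinet_links_nsf VALUES (1, 'Produced', '1234567')")
except sqlite3.IntegrityError:
    duplicate_rejected = True

con.close()
```

This is why dropping duplicates inside `load_scinet` before writing matters: the index creation itself would fail if duplicate pairs had already been appended across files.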