Download SciSciNet_NSF and link to mag authors #44
@@ -0,0 +1,58 @@
# Link SciSciNet_Links_NSF table with Paper_Author_Affiliations, Authors, and NSF_Investigator
It's not clear to me what the purpose of this function is. It does not seem to change anything in the database. Is it a function to load relevant links that we will use in the future? If so, maybe we could keep only the relevant links when writing to the database. Or will we use the links with low similarity later as well?
I haven't written anything to the db here because I noticed that the command to compare the names does not yet do what I want, so I need to adjust it first. What I want to do is compare the names and keep only those that match. We still need to discuss how strict the match should be.
Sorry for the confusion!
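To make the strictness question concrete, here is a minimal sketch of a name comparison with a tunable cutoff. It assumes `difflib.SequenceMatcher` as the similarity measure, which is not necessarily what the script in this PR uses. The tuple layout and the default threshold of 0.8 are placeholders for the discussion, not a decision.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1], ignoring case and repeated spaces."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def keep_matching_links(links, threshold=0.8):
    """Keep only links whose two names are at least `threshold` similar.

    `links` holds (grant_id, author_id, nsf_name, mag_name) tuples. This
    layout is hypothetical and only serves the illustration.
    """
    return [
        link for link in links
        if name_similarity(link[2], link[3]) >= threshold
    ]
```

Raising `threshold` towards 1.0 keeps only near-exact matches, while lowering it admits more spelling variants along with more false links.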
else:
    if_exists_opt="append"

df.to_sql("scinet",
I would prefer a more descriptive, if longer, table name, e.g. "scinet_links" or something similar. @chrished ?
It looks like there's something wrong with how the data are stored in the database at the moment (maybe you just have to delete the db and run the script again):
NSF_Award_Number|PaperID|Type|Diff_ZScore|GrantID
NSF_Award_Number|PaperID|Type|Diff_ZScore| #this row should not be here
NSF-1907207|3101155693|First||1907207
NSF-1900929|2966372995|First||1900929
NSF-1900929|2973880421|First||1900929
NSF-1900929|3042450767|First||1900929
NSF-1900929|2987739614|First||1900929
Also, do we need both GrantID and NSF_Award_Number?
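A quick way to spot a header row that was read as data, as in the output above, is to look for rows where a column contains its own name. This is a minimal sketch using an in-memory SQLite database. The table name `scinet_links` is the suggestion from this comment, not the name currently in the script, and the three rows are copied from the output above for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE scinet_links "
    "(NSF_Award_Number TEXT, PaperID TEXT, Type TEXT, Diff_ZScore TEXT, GrantID TEXT)"
)
con.executemany(
    "INSERT INTO scinet_links VALUES (?, ?, ?, ?, ?)",
    [
        ("NSF_Award_Number", "PaperID", "Type", "Diff_ZScore", None),  # stray header
        ("NSF-1907207", "3101155693", "First", None, "1907207"),
        ("NSF-1900929", "2966372995", "First", None, "1900929"),
    ],
)

# A row whose PaperID is the literal string 'PaperID' is a header read as data.
n_header_rows = con.execute(
    "SELECT COUNT(*) FROM scinet_links WHERE PaperID = 'PaperID'"
).fetchone()[0]

# Remove the stray rows and count what is left.
con.execute("DELETE FROM scinet_links WHERE PaperID = 'PaperID'")
n_rows_left = con.execute("SELECT COUNT(*) FROM scinet_links").fetchone()[0]
```

Deleting the rows only treats the symptom. Skipping the header line when reading the file avoids the problem at the source.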
Okay, I'll adjust the name.
I wanted to keep both variables for now to be sure that there are no inconsistencies between the raw data set and the data set I use to link with MAG authors, but I can of course remove NSF_Award_Number.
Should I then also remove the Diff_ZScore variable before loading it to the db?
> Should I then also remove the Diff_ZScore variable before loading it to the db?
See Christoph's comment below.
> be sure that there are no inconsistencies between the raw data set and the data set I use
That's a good idea (what's the exact check you want to run?). Perhaps you can do this check separately (in a separate file or not, depending on how extensive it is) before writing the data to the database?
For now I just looked at the number of rows to be sure that no observations are lost. Creating a unique index could serve as a check for unique values at the same time.
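The two checks described here (no rows lost, no repeated key) could also run as a separate step before anything is written to the database. The sketch below is one possible shape for that step. The function name `check_links` and the list-of-dicts input are assumptions made for illustration.

```python
from collections import Counter


def check_links(rows, expected_n_rows):
    """Raise ValueError if rows were lost or a (GrantID, PaperID) pair repeats."""
    if len(rows) != expected_n_rows:
        raise ValueError(f"expected {expected_n_rows} rows, found {len(rows)}")
    counts = Counter((row["GrantID"], row["PaperID"]) for row in rows)
    duplicates = [pair for pair, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(f"{len(duplicates)} duplicated (GrantID, PaperID) pairs")
    return True
```

Failing early like this keeps a half-written table out of the database, whereas a failing `CREATE UNIQUE INDEX` only reports the problem after the data are already loaded.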
)

# Make index and clean up
con.execute("CREATE UNIQUE INDEX idx_scinet ON scinet (GrantID ASC, PaperID ASC)")
I've used the index naming convention idx_tablename_indexcolumns (sometimes we have multiple indexes on a table, and SQLite does not accept duplicate index names). So here it would be something like idx_scinet_grantpaper.
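A small sketch of why the convention matters, run against an in-memory SQLite database. The second index, `idx_scinet_paper`, is hypothetical and only there to show two indexes living on the same table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scinet (GrantID TEXT, PaperID TEXT)")

# Two indexes on the same table, named idx_<table>_<columns>.
con.execute(
    "CREATE UNIQUE INDEX idx_scinet_grantpaper ON scinet (GrantID ASC, PaperID ASC)"
)
con.execute("CREATE INDEX idx_scinet_paper ON scinet (PaperID ASC)")

# Reusing an existing index name fails, even when the columns differ.
try:
    con.execute("CREATE INDEX idx_scinet_paper ON scinet (GrantID ASC)")
    duplicate_rejected = False
except sqlite3.OperationalError:
    duplicate_rejected = True

index_names = sorted(
    row[0]
    for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'scinet'"
    )
)
```

Including the indexed columns in the name keeps names distinct without having to look up which ones are already taken.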
import sys
import requests

sys.path.append('/home/mona/mag_sample/src/dataprep/')
This should not be necessary (and should not be in the script). If you run into problems without it, we can solve them together.
Unfortunately, I cannot make it run when leaving it out. I'd need your help here, please.
Here are some instructions: https://github.com/f-hafner/mag_sample/blob/main/README.dev.md
Let me know if it works.
It works, thank you!
@@ -55,12 +55,14 @@
 def load_scinet(filepath):
     df = pd.read_csv(filepath,
                      sep="\t",
-                     names=["NSF_Award_Number", "PaperID", "Type", "Diff_ZScore"])
+                     names=["NSF_Award_Number", "PaperID", "Type", "Diff_ZScore"],
+                     skiprows=1)
     df.drop_duplicates(inplace=True)

     # Create the GrantID column by removing non-numeric characters and formatting

     df['GrantID'] = df['NSF_Award_Number'].str.extract(r'-(\d+)')
Why is a new variable created here? Why not just load the table as it is?
Also, why drop Diff_ZScore and Type? These are important for figuring out the confidence of the link.
By new variable I mean GrantID: why not just keep the original NSF_Award_Number?
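For reference, this is what the pattern `-(\d+)` from the diff extracts, sketched with the standard `re` module instead of pandas. The helper name `grant_id` is hypothetical.

```python
import re

AWARD_PATTERN = re.compile(r"-(\d+)")


def grant_id(award_number):
    """Return the digits after the hyphen in an award number, or None if absent."""
    match = AWARD_PATTERN.search(award_number)
    return match.group(1) if match else None
```

The result stays a string, so any leading zeros in the award number survive. A value without a hyphen followed by digits, such as a header string, yields no GrantID at all.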
Goal is a table that assigns NSF investigators (identified by GrantID and position) to a MAG author ID.
I had a look at the remaining links (authors that received a grant): only 200,000 observations are left, and we lose around 7,000 grants (~5.5%) that could not be linked to any author by name. Originally, there were almost 11 million observations (authors that had a link to a grant through PaperID).
links_nsf_mag <- data.frame()

# Loop through the file names and append the data
for (i in 1:1072) {
Hard-coded loop length. Please replace it with something that works generically.
Actually, the whole loop seems to be left over from earlier code. Line 176 to the end should be deleted, right?
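If the loop is kept after all, one generic alternative to a fixed count is to iterate over whatever files are actually present. The script under review is R, where `list.files()` plays this role. The sketch below shows the same idea in Python, and the directory and file pattern are hypothetical.

```python
from pathlib import Path


def list_link_files(directory, pattern="*.csv"):
    """Return all files in `directory` matching `pattern`, in sorted order."""
    return sorted(Path(directory).glob(pattern))
```

The loop then runs once per returned path, so it keeps working when the number of files changes.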
@@ -0,0 +1,195 @@
# Link SciSciNet_Links_NSF table with Paper_Author_Affiliations, Authors, and NSF_Investigator
# Keeps only those with link between NSF grant and author ID.
# only those links with a similar name (similarity >=0.8) are loaded into db
Not exactly true anymore. Drop this line.
#45 solved by last commit
Branch for NSF data from SciSciNet