Download SciSciNet_NSF and link to mag authors #44

Draft: wants to merge 14 commits into base: main

Conversation

monadap
Collaborator

@monadap monadap commented Sep 22, 2023

Branch for NSF data from SciSciNet

@@ -0,0 +1,58 @@
# Link SciSciNet_Links_NSF table with Paper_Author_Affiliations, Authors, and NSF_Investigator
Owner

It's not clear to me what the purpose of this function is. It does not seem to change anything in the database. Is it a function to load relevant links that we will use in the future? If so, maybe we could keep only the relevant links when writing to the database. Or will we use the links with low similarity later as well?

Collaborator Author

I haven't written anything to the db here because I noticed that the command to compare the names does not yet do what I want, so I need to adjust it first. What I want to do is compare the names and keep only those that match. We still need to discuss how strict the comparison should be.

Collaborator Author

Sorry for the confusion!
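
To make the strictness question above concrete, here is a minimal sketch of a thresholded name comparison. It is not the command used in this PR (that code is not shown here): `difflib.SequenceMatcher` from the standard library stands in for it, and the helper names, the sample pairs and the `0.8` threshold are all placeholders for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1], ignoring case and repeated whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def keep_matching(pairs, threshold):
    """Keep only (nsf_name, mag_name) pairs whose similarity reaches the threshold."""
    return [(n, m) for n, m in pairs if name_similarity(n, m) >= threshold]


# Made-up candidate pairs; the threshold is a placeholder still to be agreed on.
candidates = [
    ("Jane Doe", "jane  doe"),
    ("Jane Doe", "John Smith"),
]
matched = keep_matching(candidates, threshold=0.8)
```

Raising or lowering `threshold` is the single knob that controls how strict the match is, which may make it easier to discuss concrete values.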

else:
    if_exists_opt="append"

df.to_sql("scinet",
Owner

@f-hafner f-hafner Sep 23, 2023

I would prefer a more descriptive, if longer, table name, e.g. "scinet_links". @chrished?

It looks like there's something wrong with the way the data are stored in the database at the moment (maybe you just have to delete the db and run the script again):

NSF_Award_Number|PaperID|Type|Diff_ZScore|GrantID
NSF_Award_Number|PaperID|Type|Diff_ZScore| #this row should not be here
NSF-1907207|3101155693|First||1907207
NSF-1900929|2966372995|First||1900929
NSF-1900929|2973880421|First||1900929
NSF-1900929|3042450767|First||1900929
NSF-1900929|2987739614|First||1900929

Also, do we need both GrantID and NSF_Award_Number?
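
A stray header row like the one above is what `pd.read_csv` produces when `names=` is passed and the file's own header line is not skipped. The sketch below reproduces this on a one-row sample copied from the output above; the in-memory string stands in for the real file, and `expand=False` is an addition here so that `str.extract` returns a Series.

```python
import io

import pandas as pd

# Header line plus one data row, tab-separated, mimicking the raw file layout.
raw = (
    "NSF_Award_Number\tPaperID\tType\tDiff_ZScore\n"
    "NSF-1907207\t3101155693\tFirst\t\n"
)
cols = ["NSF_Award_Number", "PaperID", "Type", "Diff_ZScore"]

# With names= and no skiprows, the file's header line is read as a data row.
with_header = pd.read_csv(io.StringIO(raw), sep="\t", names=cols)

# Skipping the first line drops it.
clean = pd.read_csv(io.StringIO(raw), sep="\t", names=cols, skiprows=1)

# Numeric part of the award number; expand=False returns a Series.
clean["GrantID"] = clean["NSF_Award_Number"].str.extract(r"-(\d+)", expand=False)
```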

Collaborator Author

Okay, I'll adjust the name.
I wanted to keep both variables for now to be sure that there's no inconsistencies between the raw data set and the data set I use to link with MAG authors, but I can of course remove the NSF_Award_Number.

Collaborator Author

Should I then also remove the Diff_ZScore variable before loading it to the db?

Owner

@f-hafner f-hafner Sep 26, 2023

> Should I then also remove the Diff_ZScore variable before loading it to the db?

See Christoph's comment below.

> be sure that there's no inconsistencies between the raw data set and the data set I use

That's a good idea (what's the exact check you want to run?). Perhaps you can do this check separately (depending on how extensive it is, in a separate file or not), before writing the data to the database?

Collaborator Author

For now I just looked at the number of rows to make sure that no observations are lost. Creating a unique index could also serve as a check for unique values at the same time.
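
The two checks described above could also be run on the data frame before anything is written to the database. The following is a rough sketch only: `check_links` is a hypothetical helper, not a function from this PR, and the sample rows are copied from the output shown earlier in the thread.

```python
import pandas as pd


def check_links(before: pd.DataFrame, after: pd.DataFrame) -> None:
    """Raise ValueError if rows were lost or (GrantID, PaperID) is not unique."""
    if len(before) != len(after):
        raise ValueError(f"row count changed: {len(before)} -> {len(after)}")
    dupes = after.duplicated(subset=["GrantID", "PaperID"], keep=False)
    if dupes.any():
        raise ValueError(f"{int(dupes.sum())} rows share a (GrantID, PaperID) pair")


before = pd.DataFrame({
    "NSF_Award_Number": ["NSF-1900929", "NSF-1900929"],
    "PaperID": [2966372995, 2973880421],
})
after = before.assign(
    GrantID=before["NSF_Award_Number"].str.extract(r"-(\d+)", expand=False)
)
check_links(before, after)  # passes silently when both checks hold
```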

)

# Make index and clean up
con.execute("CREATE UNIQUE INDEX idx_scinet ON scinet (GrantID ASC, PaperID ASC)")
Owner

I've used the index name convention idx_tablename_indexcolumns (sometimes we have multiple indexes on a table, and SQLite does not accept duplicated index names). So here it would be something like idx_scinet_grantpaper.
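
As an illustration of the suggested naming, here is a small self-contained sketch against an in-memory SQLite database. The table layout is a simplified assumption, and the inserted row is taken from the sample output earlier in the thread. It also shows that the unique index rejects a repeated (GrantID, PaperID) pair.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified stand-in for the real table; only the indexed columns matter here.
con.execute("CREATE TABLE scinet (GrantID TEXT, PaperID INTEGER, Type TEXT)")

# Index name follows idx_<table>_<columns>.
con.execute(
    "CREATE UNIQUE INDEX idx_scinet_grantpaper "
    "ON scinet (GrantID ASC, PaperID ASC)"
)
con.execute("INSERT INTO scinet VALUES ('1900929', 2966372995, 'First')")

# A second row with the same (GrantID, PaperID) violates the unique index.
try:
    con.execute("INSERT INTO scinet VALUES ('1900929', 2966372995, 'First')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Column 1 of each PRAGMA index_list row is the index name.
index_names = [row[1] for row in con.execute("PRAGMA index_list('scinet')")]
```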

import sys
import requests

sys.path.append('/home/mona/mag_sample/src/dataprep/')
Owner

This should not be necessary (and should not be in the script). If you have problems without it, we can solve them together.

Collaborator Author

Unfortunately, I cannot make it run when leaving it out. I'd need your help here, please.

Owner

Here are some instructions: https://github.com/f-hafner/mag_sample/blob/main/README.dev.md
Let me know if it works.

Collaborator Author

It works, thank you!

@@ -55,12 +55,14 @@
 def load_scinet(filepath):
     df = pd.read_csv(filepath,
                      sep="\t",
-                     names=["NSF_Award_Number", "PaperID", "Type", "Diff_ZScore"])
+                     names=["NSF_Award_Number", "PaperID", "Type", "Diff_ZScore"],
+                     skiprows=1)
     df.drop_duplicates(inplace=True)

     # Create the GrantID column by removing non-numeric characters and formatting

     df['GrantID'] = df['NSF_Award_Number'].str.extract(r'-(\d+)')
Collaborator

Why is a new variable created here? Why not just load the table as it is?

Also, why drop Diff_ZScore and Type? These are important for judging the confidence of the link.

Collaborator

New variable: I mean GrantID. Why not just keep the original NSF_Award_Number?

@chrished chrished marked this pull request as draft September 25, 2023 13:21
@chrished
Collaborator

The goal is a table that assigns NSF investigators (identified by GrantID and position) to a MAG author ID.
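
One way to picture that target table is the pandas sketch below. It is illustrative only: the three frames are made-up stand-ins for SciSciNet_Links_NSF, Paper_Author_Affiliations/Authors and NSF_Investigator, every column name other than GrantID and PaperID is an assumption, and exact name equality stands in for whatever similarity rule is eventually chosen.

```python
import pandas as pd

# Made-up stand-ins for the real tables; column names are assumptions.
links = pd.DataFrame({"GrantID": ["1900929", "1900929"],
                      "PaperID": [10, 11]})
paper_authors = pd.DataFrame({"PaperID": [10, 10, 11],
                              "AuthorId": [501, 502, 501],
                              "AuthorName": ["jane doe", "john smith", "jane doe"]})
investigators = pd.DataFrame({"GrantID": ["1900929"],
                              "Position": ["Principal Investigator"],
                              "InvestigatorName": ["jane doe"]})

candidates = (
    links.merge(paper_authors, on="PaperID")   # grant -> authors of linked papers
         .merge(investigators, on="GrantID")   # grant -> its investigators
)

# Exact equality is a placeholder for the name-matching rule.
same_name = candidates["AuthorName"] == candidates["InvestigatorName"]
investigator_author = (
    candidates.loc[same_name, ["GrantID", "Position", "AuthorId"]]
              .drop_duplicates()
              .reset_index(drop=True)
)
```

Each remaining row pairs one (GrantID, Position) with one author ID; an author who appears on several papers linked to the same grant collapses to a single row.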

@monadap
Collaborator Author

monadap commented Oct 3, 2023

I had a look at the remaining links (authors that received a grant): only 200,000 observations are left, and we lose around 7,000 grants (~5.5%) that could not be linked to any author by name. Originally, there were almost 11 million observations (authors that had a link to a grant through PaperID).

links_nsf_mag <- data.frame()

# Loop through the file names and append the data
for (i in 1:1072) {
Collaborator

Hard-coded loop length. Please replace it with something that works generically.

Collaborator

Actually, the whole loop seems to be leftover code from before. Line 176 to the end should be deleted, right?

@@ -0,0 +1,195 @@
# Link SciSciNet_Links_NSF table with Paper_Author_Affiliations, Authors, and NSF_Investigator
# Keeps only those with link between NSF grant and author ID.
# only those links with a similar name (similarity >=0.8) are loaded into db
Collaborator

Not exactly true anymore. Drop this line.

@chrished
Collaborator

chrished commented Oct 9, 2023

#45 solved by last commit
