Download SciSciNet_NSF and link to mag authors #44

monadap · 2023-09-22T08:18:52Z

Branch for NSF data from SciSciNet

f-hafner · 2023-09-23T08:14:35Z

src/dataprep/main/prep_nsf/link_scinetnsf_to_mag.R

@@ -0,0 +1,58 @@
+# Link SciSciNet_Links_NSF table with Paper_Author_Affiliations, Authors, and NSF_Investigator 


It's not clear to me what the purpose of this function is. It does not seem to change anything in the database. Is it a function to load relevant links that we will use in the future? If so, maybe we could keep only the relevant links when writing to the database. Or will we use the links with low similarity later as well?

I haven't written anything to the db here because I noticed that the command to compare the names does not yet do what I want, so I need to adjust it first. What I want to do is compare the names and keep only those that match. We still need to discuss how strict the match should be.

Sorry for the confusion!
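For the discussion of how strict the name match should be, here is a minimal sketch of threshold-based filtering. It is written in Python with the standard library's `difflib` rather than the R code in this PR, so the similarity measure is not the one used in `link_scinetnsf_to_mag.R`. The function names, the example names, and the 0.8 threshold are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1], ignoring case and extra whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def keep_matching_links(links, threshold=0.8):
    """Keep (nsf_name, mag_name) pairs whose similarity reaches the threshold."""
    return [(nsf, mag) for nsf, mag in links if name_similarity(nsf, mag) >= threshold]


# Hypothetical candidate links: one plausible match, one clear mismatch.
candidate_links = [
    ("Jane A. Doe", "jane a doe"),
    ("Jane A. Doe", "John Smith"),
]
kept = keep_matching_links(candidate_links, threshold=0.8)
```

Raising or lowering `threshold` is the knob for strictness: a higher value drops more borderline pairs, a lower value keeps more of them.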

f-hafner · 2023-09-23T08:20:09Z

src/dataprep/main/prep_nsf/scinet_data_to_db.py

+        else:
+            if_exists_opt="append"
+
+        df.to_sql("scinet", 


I would prefer a more descriptive, if longer, table name, e.g. "scinet_links" or something similar. @chrished ?

It looks like there's something wrong with the way the data are stored in the database at the moment (maybe you just have to delete the db and run the script again):

NSF_Award_Number|PaperID|Type|Diff_ZScore|GrantID
NSF_Award_Number|PaperID|Type|Diff_ZScore| #this row should not be here
NSF-1907207|3101155693|First||1907207
NSF-1900929|2966372995|First||1900929
NSF-1900929|2973880421|First||1900929
NSF-1900929|3042450767|First||1900929
NSF-1900929|2987739614|First||1900929

Also, do we need both GrantID and NSF_Award_Number?
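One possible explanation for the stray header row, sketched under the assumption that the raw file starts with a header line: `pd.read_csv` treats the first line as data when `names=` is given and nothing is skipped. The in-memory sample below is a stand-in for the real file, built from the values quoted above.

```python
import io

import pandas as pd

# Tiny stand-in for the tab-separated SciSciNet file (assumed to have a header line).
raw = (
    "NSF_Award_Number\tPaperID\tType\tDiff_ZScore\n"
    "NSF-1907207\t3101155693\tFirst\t\n"
    "NSF-1900929\t2966372995\tFirst\t\n"
)
cols = ["NSF_Award_Number", "PaperID", "Type", "Diff_ZScore"]

# With names= and no skipped line, the header line ends up as the first data row.
with_header_row = pd.read_csv(io.StringIO(raw), sep="\t", names=cols)

# Skipping the first line keeps only the real observations.
clean = pd.read_csv(io.StringIO(raw), sep="\t", names=cols, skiprows=1)
```

If this is indeed the cause, a database written before such a fix would still contain the extra row until the table is rebuilt.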

Okay, I'll adjust the name.
I wanted to keep both variables for now to be sure that there are no inconsistencies between the raw data set and the data set I use to link with MAG authors, but I can of course remove NSF_Award_Number.

Should I then also remove the Diff_ZScore variable before loading it to the db?

> Should I then also remove the Diff_ZScore variable before loading it to the db?

See Christoph's comment below.

> be sure that there are no inconsistencies between the raw data set and the data set I use

That's a good idea (what's the exact check you want to run?). Perhaps you can do this check separately (depending on how extensive it is, in a separate file or not), before writing the data to the database?

For now I just looked at the number of rows to be sure that no observations are lost. Creating the index could serve as a check for unique values at the same time.
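The two checks mentioned here could look roughly like the following sketch. It uses an in-memory SQLite database; the table name `scinet_links`, the index name, and the sample rows are placeholders, not the PR's actual code.

```python
import sqlite3

# Placeholder rows standing in for the raw SciSciNet links.
raw_rows = [
    ("NSF-1907207", 3101155693),
    ("NSF-1900929", 2966372995),
    ("NSF-1900929", 2973880421),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scinet_links (NSF_Award_Number TEXT, PaperID INTEGER)")
con.executemany("INSERT INTO scinet_links VALUES (?, ?)", raw_rows)

# Check 1: no observations were lost between the raw data and the table.
n_in_db = con.execute("SELECT COUNT(*) FROM scinet_links").fetchone()[0]
rows_match = n_in_db == len(raw_rows)

# Check 2: a unique index can only be created if the key has no duplicates,
# so a failure here would signal repeated (award, paper) pairs.
try:
    con.execute(
        "CREATE UNIQUE INDEX idx_scinet_links_awardpaper "
        "ON scinet_links (NSF_Award_Number ASC, PaperID ASC)"
    )
    keys_unique = True
except sqlite3.IntegrityError:
    keys_unique = False
```

Running the checks before the final write keeps a failed check from leaving a half-populated table behind.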

f-hafner · 2023-09-23T08:23:16Z

src/dataprep/main/prep_nsf/scinet_data_to_db.py

+                    )
+
+    # Make index and clean up
+    con.execute("CREATE UNIQUE INDEX idx_scinet ON scinet (GrantID ASC, PaperID ASC)")


I've used the index name convention idx_tablename_indexcolumns (sometimes we have multiple indexes on a table, and sqlite does not accept duplicated index names). So here something like idx_scinet_grantpaper
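A small sketch of the naming convention, using an in-memory SQLite database. The second index, `idx_scinet_paper`, is hypothetical and only there to show two indexes on one table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scinet (GrantID TEXT, PaperID INTEGER, Type TEXT)")

# Names follow idx_<tablename>_<indexcolumns>, so each index on the table gets its own name.
con.execute("CREATE UNIQUE INDEX idx_scinet_grantpaper ON scinet (GrantID ASC, PaperID ASC)")
con.execute("CREATE INDEX idx_scinet_paper ON scinet (PaperID ASC)")

# Reusing an existing index name is rejected by SQLite.
try:
    con.execute("CREATE INDEX idx_scinet_paper ON scinet (Type ASC)")
    duplicate_rejected = False
except sqlite3.OperationalError:
    duplicate_rejected = True

# List the indexes that now exist on the table.
index_names = sorted(
    row[0]
    for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'scinet'"
    )
)
```

Because index names share one namespace per database, including the table name in the index name also avoids clashes between tables.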

f-hafner · 2023-09-23T09:57:07Z

src/dataprep/main/prep_nsf/scinet_data_to_db.py

+import sys
+import requests
+
+sys.path.append('/home/mona/mag_sample/src/dataprep/')  


This should not be necessary (and should not be in the script). If you run into problems without it, we can solve them together.

Unfortunately, I cannot get it to run when I leave it out. I'd need your help here, please.

Here are some instructions: https://github.com/f-hafner/mag_sample/blob/main/README.dev.md
Let me know if it works.

It works, thank you!

chrished · 2023-09-25T13:14:32Z

src/dataprep/main/prep_nsf/scinet_data_to_db.py

@@ -55,12 +55,14 @@
 def load_scinet(filepath):
    df = pd.read_csv(filepath, 
                        sep="\t",
-                        names=["NSF_Award_Number", "PaperID", "Type", "Diff_ZScore"])
+                        names=["NSF_Award_Number", "PaperID", "Type", "Diff_ZScore"], 
+                        skiprows=1)
    df.drop_duplicates(inplace=True)

    # Create the GrantID column by removing non-numeric characters and formatting

    df['GrantID'] = df['NSF_Award_Number'].str.extract(r'-(\d+)') 


Why is a new variable created here? Why not just load the table as it is?

Also, why drop Diff_ZScore and Type? These are important for assessing the confidence of the link.

New variable: I mean GrantID. Why not just keep the original NSF_Award_Number?
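To make the question concrete, here is a sketch of what the pattern in the diff extracts. It uses the standard library's `re` module with the same regular expression instead of pandas' `str.extract`, and the two award numbers are taken from the rows quoted earlier in this thread. Whether every award number has the form `NSF-<digits>` is an assumption.

```python
import re

# Same pattern as in the diff: the digits that follow the first hyphen.
GRANT_PATTERN = re.compile(r"-(\d+)")


def grant_id(award_number):
    """Return the numeric part of an award number, or None if there is no match."""
    match = GRANT_PATTERN.search(award_number)
    return match.group(1) if match else None


award_numbers = ["NSF-1907207", "NSF-1900929"]
grant_ids = [grant_id(a) for a in award_numbers]

# If every award number is just "NSF-" plus the digits, GrantID carries no extra
# information and could be rebuilt from NSF_Award_Number whenever it is needed.
redundant = all(f"NSF-{g}" == a for g, a in zip(grant_ids, award_numbers))
```

If `redundant` held for the full data set, storing only one of the two columns would be enough.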

chrished · 2023-09-26T14:14:58Z

The goal is a table that assigns NSF investigators (identified by GrantID and position) to a MAG author ID.

monadap · 2023-10-03T15:33:10Z

I had a look at the remaining links (authors that received a grant): only 200,000 observations are left, and we lose around 7,000 grants (~5.5%) that could not be linked to any author by name. Originally, there were almost 11 million observations (authors that had a link to a grant through PaperId).

chrished · 2023-10-06T18:25:39Z

src/dataprep/main/prep_nsf/link_scinetnsf_to_mag.R

+links_nsf_mag <- data.frame()
+
+# Loop through the file names and append the data
+for (i in 1:1072) {


Hard-coded loop length. Please replace it with something that works generically.

Actually, the whole loop seems to be left over from earlier code. Line 176 to the end should be deleted, right?
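One generic alternative to a fixed loop length is to iterate over whatever chunk files exist. The sketch below shows the idea in Python rather than R; the file pattern `chunk_*.csv`, the column names, and the helper `read_all_chunks` are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_all_chunks(folder, pattern="chunk_*.csv"):
    """Read every CSV file matching the pattern, however many there are."""
    rows = []
    for path in sorted(folder.glob(pattern)):
        with path.open(newline="") as handle:
            rows.extend(csv.DictReader(handle))
    return rows


# Demo with three throwaway chunk files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for i in range(1, 4):
        (folder / f"chunk_{i}.csv").write_text(f"GrantID,AuthorId\n{i},{i * 10}\n")
    combined = read_all_chunks(folder)
```

The loop length then follows from the files on disk, so the code keeps working if the number of chunks changes.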

chrished · 2023-10-06T18:29:27Z

src/dataprep/main/prep_nsf/link_scinetnsf_to_mag.R

@@ -0,0 +1,195 @@
+# Link SciSciNet_Links_NSF table with Paper_Author_Affiliations, Authors, and NSF_Investigator 
+# Keeps only those with link between NSF grant and author ID.
+# only those links with a similar name (similarity >=0.8) are loaded into db


not exactly true anymore. drop line


chrished · 2023-10-09T01:27:00Z

#45 solved by last commit

Download SciSciNet_NSF and link to mag authors

3c91d73

f-hafner reviewed Sep 23, 2023


monadap added 2 commits September 25, 2023 10:11

Some improvements in loading to db, linking

b2c08e0

Updated code for name similarity btw nsf and mag

5a17e7f

chrished reviewed Sep 25, 2023


chrished marked this pull request as draft September 25, 2023 13:21

Final code to download nsf links and upload to db

64e10d9

monadap added 7 commits September 26, 2023 15:10

Final code to link mag and nsf + upload to db

5ac9632

Comparison of methods for stringdistance

ae370fc

Added info on processing stage + updated pipeline

47aa945

just uncommented db-upload

85820d7

Optimized similarity check of names

ea8c1f4

Changed restriction for name similarity

1a93552

Loaded chunk-csv files into one file

63c8946

chrished reviewed Oct 6, 2023


monadap and others added 2 commits October 6, 2023 18:49

added upload to db

6ba244d

handle chunks correctly, change AuthorId to integer64 for correct writing to db

94bda09

droped a line which was not true anymore

ac571fc
