Rename code files and add tests for processed submission files.
1 parent
36a7a69
commit 393599d
Showing 16 changed files with 631 additions and 646 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,3 @@ | ||
|
||
import REF2021_processing.read_write as rw | ||
|
||
rule all: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
2024-02-06 15:34:18,160 [INFO] EnvironmentStatementsInstitutionLevel - read data from 'data/processed/environment_statements/extracted/institution/' | ||
2024-02-06 15:34:18,160 [INFO] EnvironmentStatementsInstitutionLevel - statements: 143, sections: 4 | ||
2024-02-06 15:34:18,607 [INFO] EnvironmentStatementsInstitutionLevel - prepared institution statements: 143 records, 5 columns | ||
2024-02-06 15:34:18,655 [INFO] EnvironmentStatementsInstitutionLevel - write dataset to 'data/processed/environment_statements/prepared/EnvironmentStatementsInstitutionLevel.parquet' | ||
2024-02-06 17:05:39,048 [INFO] EnvironmentStatementsInstitutionLevel - read data from 'data/processed/environment_statements/extracted/institution/' | ||
2024-02-06 17:05:39,048 [INFO] EnvironmentStatementsInstitutionLevel - statements: 143, sections: 4 | ||
2024-02-06 17:05:39,485 [INFO] EnvironmentStatementsInstitutionLevel - prepared institution statements: 143 records, 5 columns | ||
2024-02-06 17:05:39,508 [INFO] EnvironmentStatementsInstitutionLevel - write dataset to 'data/processed/environment_statements/prepared/EnvironmentStatementsInstitutionLevel.parquet' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,4 @@ | ||
2024-02-06 15:34:18,145 [INFO] EnvironmentStatementsUnitLevel - read data from 'data/processed/environment_statements/extracted/unit/' | ||
2024-02-06 15:34:18,145 [INFO] EnvironmentStatementsUnitLevel - statements: 1874, sections: 4 | ||
2024-02-06 15:34:37,064 [INFO] EnvironmentStatementsUnitLevel - prepared statements: 1874 records | ||
2024-02-06 15:34:37,510 [INFO] EnvironmentStatementsUnitLevel - write dataset to 'data/processed/environment_statements/prepared/EnvironmentStatementsUnitLevel.parquet' | ||
2024-02-06 17:05:39,030 [INFO] EnvironmentStatementsUnitLevel - read data from 'data/processed/environment_statements/extracted/unit/' | ||
2024-02-06 17:05:39,030 [INFO] EnvironmentStatementsUnitLevel - statements: 1874, sections: 4 | ||
2024-02-06 17:05:57,989 [INFO] EnvironmentStatementsUnitLevel - prepared statements: 1874 records | ||
2024-02-06 17:05:58,524 [INFO] EnvironmentStatementsUnitLevel - write dataset to 'data/processed/environment_statements/prepared/EnvironmentStatementsUnitLevel.parquet' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,10 +1,10 @@ | ||
2024-02-06 15:34:21,959 [INFO] ImpactCaseStudies - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 15:34:22,544 [INFO] ImpactCaseStudies - parsed sheet: 6361 records | ||
2024-02-06 15:34:22,545 [INFO] ImpactCaseStudies - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 15:34:22,546 [INFO] ImpactCaseStudies - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 15:34:22,548 [INFO] ImpactCaseStudies - add columns for panel names | ||
2024-02-06 15:34:22,549 [INFO] ImpactCaseStudies - shift columns from title to the left to fix raw data issue | ||
2024-02-06 15:34:25,546 [INFO] ImpactCaseStudies - replace styling characters in ['1. Summary of the impact', '2. Underpinning research', '3. References to the research', '4. Details of the impact', '5. Sources to corroborate the impact'] | ||
2024-02-06 15:34:25,549 [INFO] ImpactCaseStudies - drop columns '['Researcher ORCIDs', 'Is continued from 2014', '5. Sources to corroborate the impact', '2. Underpinning research', '3. References to the research', 'Countries', 'Institution UKPRN code', 'Global research identifiers', 'Formal partners', 'Unit of assessment number', 'Grant funding', 'Main panel code']' | ||
2024-02-06 15:34:25,564 [INFO] ImpactCaseStudies - make categorical ['Multiple submission letter', 'Institution name', 'Main panel name', 'Unit of assessment name', 'Multiple submission name', 'Joint submission'] | ||
2024-02-06 15:34:25,819 [INFO] ImpactCaseStudies - write dataset to 'data/processed/sheets/ImpactCaseStudies.parquet' | ||
2024-02-06 17:05:42,732 [INFO] ImpactCaseStudies - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 17:05:43,321 [INFO] ImpactCaseStudies - parsed sheet: 6361 records | ||
2024-02-06 17:05:43,322 [INFO] ImpactCaseStudies - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 17:05:43,324 [INFO] ImpactCaseStudies - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 17:05:43,325 [INFO] ImpactCaseStudies - add columns for panel names | ||
2024-02-06 17:05:43,326 [INFO] ImpactCaseStudies - shift columns from title to the left to fix raw data issue | ||
2024-02-06 17:05:46,539 [INFO] ImpactCaseStudies - replace styling characters in ['1. Summary of the impact', '2. Underpinning research', '3. References to the research', '4. Details of the impact', '5. Sources to corroborate the impact'] | ||
2024-02-06 17:05:46,543 [INFO] ImpactCaseStudies - drop columns '['Researcher ORCIDs', 'Institution UKPRN code', '5. Sources to corroborate the impact', 'Unit of assessment number', 'Global research identifiers', 'Main panel code', '3. References to the research', 'Formal partners', 'Is continued from 2014', 'Grant funding', '2. Underpinning research', 'Countries']' | ||
2024-02-06 17:05:46,566 [INFO] ImpactCaseStudies - make categorical ['Institution name', 'Main panel name', 'Joint submission', 'Unit of assessment name', 'Multiple submission letter', 'Multiple submission name'] | ||
2024-02-06 17:05:46,765 [INFO] ImpactCaseStudies - write dataset to 'data/processed/sheets/ImpactCaseStudies.parquet' |
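
In the two runs logged above, the `drop columns` and `make categorical` messages list the same columns in a different order (compare the 15:34:25,549 and 17:05:46,543 entries), which adds noise to the log diff. That pattern would be consistent with the columns being held in an unordered collection such as a `set`, but the processing code is not shown here, so this is an assumption. A minimal sketch of one way to log such lists in a stable order; `format_columns` is a hypothetical helper, not part of the repository:

```python
def format_columns(columns):
    """Return a stable string form of a column collection by sorting it first."""
    return str(sorted(columns))


# Hypothetical example: a set has no guaranteed iteration order across runs,
# so sorting before logging keeps the message identical from run to run.
to_drop = {"Main panel code", "Institution UKPRN code", "Unit of assessment number"}
message = f"drop columns '{format_columns(to_drop)}'"
print(message)
```

With sorted output, reruns that change nothing would differ only in their timestamps.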
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,12 @@ | ||
2024-02-06 15:34:21,959 [INFO] Outputs - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 15:34:38,633 [INFO] Outputs - parsed sheet: 185353 records | ||
2024-02-06 15:34:38,656 [INFO] Outputs - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 15:34:38,690 [INFO] Outputs - rename 'Output type' to 'Output type code' | ||
2024-02-06 15:34:38,737 [INFO] Outputs - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 15:34:38,777 [INFO] Outputs - add columns for panel names | ||
2024-02-06 15:34:39,528 [INFO] Outputs - replace styling characters in ['Title'] | ||
2024-02-06 15:34:39,563 [INFO] Outputs - add columns for output types names | ||
2024-02-06 15:34:39,567 [INFO] Outputs - make output year categorical | ||
2024-02-06 15:34:39,612 [INFO] Outputs - drop columns '['Output type code', 'Institution UKPRN code', 'Unit of assessment number', 'Main panel code']' | ||
2024-02-06 15:34:39,626 [INFO] Outputs - make categorical ['Propose double weighting', 'Forensic science', 'Open access status', 'Unit of assessment name', 'Delayed by COVID19', 'Interdisciplinary', 'Output type', 'Multiple submission letter', 'Joint submission', 'Is reserve output', 'Non-English', 'Criminology', 'Multiple submission name', 'Research group', 'Institution name', 'Citations applicable', 'Main panel name', 'Cross-referral requested'] | ||
2024-02-06 15:34:39,938 [INFO] Outputs - write dataset to 'data/processed/sheets/Outputs.parquet' | ||
2024-02-06 17:05:42,713 [INFO] Outputs - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 17:05:59,201 [INFO] Outputs - parsed sheet: 185353 records | ||
2024-02-06 17:05:59,221 [INFO] Outputs - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 17:05:59,252 [INFO] Outputs - rename 'Output type' to 'Output type code' | ||
2024-02-06 17:05:59,298 [INFO] Outputs - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 17:05:59,338 [INFO] Outputs - add columns for panel names | ||
2024-02-06 17:06:00,062 [INFO] Outputs - replace styling characters in ['Title'] | ||
2024-02-06 17:06:00,097 [INFO] Outputs - add columns for output types names | ||
2024-02-06 17:06:00,098 [INFO] Outputs - make output year categorical | ||
2024-02-06 17:06:00,141 [INFO] Outputs - drop columns '['Main panel code', 'Institution UKPRN code', 'Unit of assessment number', 'Output type code']' | ||
2024-02-06 17:06:00,155 [INFO] Outputs - make categorical ['Institution name', 'Multiple submission letter', 'Research group', 'Delayed by COVID19', 'Interdisciplinary', 'Output type', 'Joint submission', 'Is reserve output', 'Open access status', 'Cross-referral requested', 'Propose double weighting', 'Citations applicable', 'Main panel name', 'Unit of assessment name', 'Multiple submission name', 'Forensic science', 'Non-English', 'Criminology'] | ||
2024-02-06 17:06:00,469 [INFO] Outputs - write dataset to 'data/processed/sheets/Outputs.parquet' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,9 @@ | ||
2024-02-06 15:34:21,959 [INFO] ResearchDoctoralDegreesAwarded - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 15:34:22,043 [INFO] ResearchDoctoralDegreesAwarded - parsed sheet: 1888 records | ||
2024-02-06 15:34:22,043 [INFO] ResearchDoctoralDegreesAwarded - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 15:34:22,044 [INFO] ResearchDoctoralDegreesAwarded - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 15:34:22,045 [INFO] ResearchDoctoralDegreesAwarded - add columns for panel names | ||
2024-02-06 15:34:22,046 [INFO] ResearchDoctoralDegreesAwarded - calculate total number of degrees awarded | ||
2024-02-06 15:34:22,047 [INFO] ResearchDoctoralDegreesAwarded - drop columns '['Unit of assessment number', 'Institution UKPRN code', 'Main panel code']' | ||
2024-02-06 15:34:22,047 [INFO] ResearchDoctoralDegreesAwarded - make categorical ['Main panel name', 'Multiple submission name', 'Multiple submission letter', 'Institution name', 'Joint submission', 'Unit of assessment name'] | ||
2024-02-06 15:34:22,061 [INFO] ResearchDoctoralDegreesAwarded - write dataset to 'data/processed/sheets/ResearchDoctoralDegreesAwarded.parquet' | ||
2024-02-06 17:05:42,746 [INFO] ResearchDoctoralDegreesAwarded - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 17:05:42,829 [INFO] ResearchDoctoralDegreesAwarded - parsed sheet: 1888 records | ||
2024-02-06 17:05:42,830 [INFO] ResearchDoctoralDegreesAwarded - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 17:05:42,831 [INFO] ResearchDoctoralDegreesAwarded - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 17:05:42,832 [INFO] ResearchDoctoralDegreesAwarded - add columns for panel names | ||
2024-02-06 17:05:42,833 [INFO] ResearchDoctoralDegreesAwarded - calculate total number of degrees awarded | ||
2024-02-06 17:05:42,833 [INFO] ResearchDoctoralDegreesAwarded - drop columns '['Unit of assessment number', 'Main panel code', 'Institution UKPRN code']' | ||
2024-02-06 17:05:42,833 [INFO] ResearchDoctoralDegreesAwarded - make categorical ['Multiple submission name', 'Joint submission', 'Main panel name', 'Multiple submission letter', 'Unit of assessment name', 'Institution name'] | ||
2024-02-06 17:05:42,851 [INFO] ResearchDoctoralDegreesAwarded - write dataset to 'data/processed/sheets/ResearchDoctoralDegreesAwarded.parquet' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,9 @@ | ||
2024-02-06 15:34:21,961 [INFO] ResearchGroups - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 15:34:22,022 [INFO] ResearchGroups - parsed sheet: 2036 records | ||
2024-02-06 15:34:22,022 [INFO] ResearchGroups - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 15:34:22,023 [INFO] ResearchGroups - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 15:34:22,027 [INFO] ResearchGroups - add columns for panel names | ||
2024-02-06 15:34:22,028 [INFO] ResearchGroups - make group code categorical | ||
2024-02-06 15:34:22,028 [INFO] ResearchGroups - drop columns '['Institution UKPRN code', 'Unit of assessment number', 'Main panel code']' | ||
2024-02-06 15:34:22,028 [INFO] ResearchGroups - make categorical ['Main panel name', 'Unit of assessment name', 'Joint submission', 'Multiple submission letter', 'Multiple submission name', 'Institution name'] | ||
2024-02-06 15:34:22,042 [INFO] ResearchGroups - write dataset to 'data/processed/sheets/ResearchGroups.parquet' | ||
2024-02-06 17:05:42,734 [INFO] ResearchGroups - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 17:05:42,800 [INFO] ResearchGroups - parsed sheet: 2036 records | ||
2024-02-06 17:05:42,800 [INFO] ResearchGroups - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 17:05:42,801 [INFO] ResearchGroups - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 17:05:42,802 [INFO] ResearchGroups - add columns for panel names | ||
2024-02-06 17:05:42,803 [INFO] ResearchGroups - make group code categorical | ||
2024-02-06 17:05:42,803 [INFO] ResearchGroups - drop columns '['Main panel code', 'Unit of assessment number', 'Institution UKPRN code']' | ||
2024-02-06 17:05:42,803 [INFO] ResearchGroups - make categorical ['Institution name', 'Joint submission', 'Main panel name', 'Unit of assessment name', 'Multiple submission name', 'Multiple submission letter'] | ||
2024-02-06 17:05:42,817 [INFO] ResearchGroups - write dataset to 'data/processed/sheets/ResearchGroups.parquet' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,9 @@ | ||
2024-02-06 15:34:21,987 [INFO] ResearchIncome - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 15:34:23,245 [INFO] ResearchIncome - parsed sheet: 28637 records | ||
2024-02-06 15:34:23,245 [INFO] ResearchIncome - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 15:34:23,252 [INFO] ResearchIncome - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 15:34:23,255 [INFO] ResearchIncome - add columns for panel names | ||
2024-02-06 15:34:23,258 [INFO] ResearchIncome - make income source categorical | ||
2024-02-06 15:34:23,259 [INFO] ResearchIncome - drop columns '['Institution UKPRN code', 'Unit of assessment number', 'Main panel code']' | ||
2024-02-06 15:34:23,259 [INFO] ResearchIncome - make categorical ['Main panel name', 'Multiple submission name', 'Institution name', 'Multiple submission letter', 'Joint submission', 'Unit of assessment name'] | ||
2024-02-06 15:34:23,285 [INFO] ResearchIncome - write dataset to 'data/processed/sheets/ResearchIncome.parquet' | ||
2024-02-06 17:05:42,714 [INFO] ResearchIncome - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 17:05:43,917 [INFO] ResearchIncome - parsed sheet: 28637 records | ||
2024-02-06 17:05:43,918 [INFO] ResearchIncome - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 17:05:43,926 [INFO] ResearchIncome - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 17:05:43,931 [INFO] ResearchIncome - add columns for panel names | ||
2024-02-06 17:05:43,934 [INFO] ResearchIncome - make income source categorical | ||
2024-02-06 17:05:43,936 [INFO] ResearchIncome - drop columns '['Unit of assessment number', 'Institution UKPRN code', 'Main panel code']' | ||
2024-02-06 17:05:43,936 [INFO] ResearchIncome - make categorical ['Unit of assessment name', 'Multiple submission name', 'Joint submission', 'Multiple submission letter', 'Main panel name', 'Institution name'] | ||
2024-02-06 17:05:43,965 [INFO] ResearchIncome - write dataset to 'data/processed/sheets/ResearchIncome.parquet' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,9 @@ | ||
2024-02-06 15:34:21,960 [INFO] ResearchIncomeInKind - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 15:34:22,161 [INFO] ResearchIncomeInKind - parsed sheet: 4093 records | ||
2024-02-06 15:34:22,161 [INFO] ResearchIncomeInKind - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 15:34:22,163 [INFO] ResearchIncomeInKind - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 15:34:22,164 [INFO] ResearchIncomeInKind - add columns for panel names | ||
2024-02-06 15:34:22,164 [INFO] ResearchIncomeInKind - make income source categorical | ||
2024-02-06 15:34:22,165 [INFO] ResearchIncomeInKind - drop columns '['Institution UKPRN code', 'Unit of assessment number', 'Main panel code']' | ||
2024-02-06 15:34:22,165 [INFO] ResearchIncomeInKind - make categorical ['Unit of assessment name', 'Institution name', 'Multiple submission name', 'Multiple submission letter', 'Main panel name', 'Joint submission'] | ||
2024-02-06 15:34:22,178 [INFO] ResearchIncomeInKind - write dataset to 'data/processed/sheets/ResearchIncomeInKind.parquet' | ||
2024-02-06 17:05:42,756 [INFO] ResearchIncomeInKind - read sheet from 'data/raw/REF-2021-Submissions-All-2022-07-27.xlsx' | ||
2024-02-06 17:05:42,964 [INFO] ResearchIncomeInKind - parsed sheet: 4093 records | ||
2024-02-06 17:05:42,965 [INFO] ResearchIncomeInKind - rename 'Main panel' to 'Main panel code' | ||
2024-02-06 17:05:42,967 [INFO] ResearchIncomeInKind - replace '['/', ':']' with '_' in 'Institution name' | ||
2024-02-06 17:05:42,968 [INFO] ResearchIncomeInKind - add columns for panel names | ||
2024-02-06 17:05:42,969 [INFO] ResearchIncomeInKind - make income source categorical | ||
2024-02-06 17:05:42,969 [INFO] ResearchIncomeInKind - drop columns '['Unit of assessment number', 'Main panel code', 'Institution UKPRN code']' | ||
2024-02-06 17:05:42,969 [INFO] ResearchIncomeInKind - make categorical ['Multiple submission name', 'Main panel name', 'Joint submission', 'Unit of assessment name', 'Multiple submission letter', 'Institution name'] | ||
2024-02-06 17:05:42,983 [INFO] ResearchIncomeInKind - write dataset to 'data/processed/sheets/ResearchIncomeInKind.parquet' |
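
Because every log line changes between runs through its timestamp alone, it can help to compare runs on the logged content instead. The sketch below parses lines in the format shown in these logs (`timestamp [LEVEL] logger - message`) and extracts the record count from `parsed sheet` messages. The regular expressions and the `record_counts` helper are assumptions written for illustration and are not taken from the repository:

```python
import re

# Matches lines such as:
# 2024-02-06 17:05:42,964 [INFO] ResearchIncomeInKind - parsed sheet: 4093 records
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<level>\w+)\] (?P<logger>\S+) - (?P<message>.*)$"
)
RECORD_COUNT = re.compile(r"parsed sheet: (?P<count>\d+) records")


def record_counts(lines):
    """Map each logger name to the record count in its 'parsed sheet' message."""
    counts = {}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the expected log format
        found = RECORD_COUNT.search(match["message"])
        if found:
            counts[match["logger"]] = int(found["count"])
    return counts


earlier_run = [
    "2024-02-06 15:34:22,161 [INFO] ResearchIncomeInKind - parsed sheet: 4093 records",
]
later_run = [
    "2024-02-06 17:05:42,964 [INFO] ResearchIncomeInKind - parsed sheet: 4093 records",
]

# The timestamps differ, but the parsed record counts agree.
assert record_counts(earlier_run) == record_counts(later_run)
print(record_counts(later_run))
```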