Node.js script for archiving Visma InSchool documents, based on PDF text recognition
Uses @vtfk/pdf-text-reader for extracting pdf text
Uses @vtfk/pdf-splitter for splitting pdfs
Uses archiveApi/SyncElevmappe for getting student data as well as creation and updating of Elevmapper
- Only tested and used on Windows as far as I know
- Not very suitable for other use cases than archiving pdfs when you do not have any other data than the pdf
- Good luck understanding the code... I am sorry, I take full responsibility
$ git clone https://github.com/vtfk/vis-til-arkiv-v3
If you want to be able to split pdfs (handle a large pdf document consisting of several documents of the same type), make sure PDFtk is installed in the same environment you run the Node.js script from.
Use-case: Archiving several 'karakterprotokoller' in one go: send in 300 protocols in one document, and each protocol is handled as a separate document.
See somewhere
$ npm i
NODE_ENV="production"
ROOT_DIRECTORY="C:/PROD-VIStilArkiv" #This is the root folder for the jobs and documents, can be wherever
STAT_FILE="C:/PROD-VIStilArkiv/stat.json" #A local statistics file
P360_SYNCELEVMAPPE_URL="https://<base_url>/SyncElevmappe"
P360_SYNCELEVMAPPE_KEY="<key>"
P360_SYNCELEVMAPPE_HEADER_NAME="<authorization_header_name>"
PDFTK_EXT="C:/Program Files (x86)/PDFtk/bin/pdftk"
P360_ARCHIVE_DOC_URL="https://<sif_rpc_api_url>/DocumentService/CreateDocument"
P360_DISPATCH_DOC_URL="https://<sif_rpc_api_url>/DocumentService/DispatchDocuments"
P360_ARCHIVE_KEY="<auth_key>"
P360_ARCHIVE_QUERY_STRING="<query_string_property_name_before_the_<auth_key>>"
E18_URL="https://<api_url>/e18"
E18_KEY="<e18_key>"
E18_HEADER_NAME="<authorization_header_name>"
DELETE_FINISHED_JOBS=false # If true, archived pdfs are deleted, if false, they are put in <imported_folder>
DISPATCH_DIRECTORY_NAME="dispatchInput" #Optional, defaults to "input". Name of folder where script looks for pdfs
DOCUMENT_DIRECTORY_NAME="document" #Optional, defaults to "documents"
DELETE_DIRECTORY_NAME="delete" #Optional, defaults to "delete"
TYPE_SEARCH_WORD="VIS MAL TYPE" #Optional, defaults to "VIS MAL TYPE", used for recognizing documentTypes
TEAMSWEBHOOK_URL="<webhook_url>" #Optional, for alerts in Teams
SMTP_HOST="<smtp_host>" #Optional, for sending emails to users when documents are not recognized as a valid type
SMTP_PORT=<smtp_port> #Optional, for sending emails to users when documents are not recognized as a valid type
PAPERTRAIL_HOST="<papertrail_host>" #Optional, for logging
PAPERTRAIL_TOKEN="<papertrail_token>" #Optional, for logging
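The optional settings above fall back to defaults when they are left out. A minimal sketch of how such a config could be read is shown here. `loadConfig` is a hypothetical helper, not the script's actual code, and the list of required variables is only an illustrative subset.

```javascript
// Hypothetical helper: reads settings from an env-like object and applies the defaults documented above.
function loadConfig (env = process.env) {
  // Assumption: these are required. The real script may check more (or fewer) variables.
  const required = ['ROOT_DIRECTORY', 'P360_SYNCELEVMAPPE_URL', 'P360_ARCHIVE_DOC_URL']
  const missing = required.filter((name) => !env[name])
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`)
  }
  return {
    rootDirectory: env.ROOT_DIRECTORY,
    // Env values are strings, so only the literal "true" enables deletion
    deleteFinishedJobs: env.DELETE_FINISHED_JOBS === 'true',
    dispatchDirectoryName: env.DISPATCH_DIRECTORY_NAME || 'input',
    documentDirectoryName: env.DOCUMENT_DIRECTORY_NAME || 'documents',
    deleteDirectoryName: env.DELETE_DIRECTORY_NAME || 'delete',
    typeSearchWord: env.TYPE_SEARCH_WORD || 'VIS MAL TYPE'
  }
}
```

Failing early on missing variables makes a misconfigured setup show up on the first run instead of halfway through a job.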
Run the script once to see if it works. The first run will set up the necessary directories within the ROOT_DIRECTORY
$ node ./index.js
In ./archiveMethods.js you can add, disable, or delete methods. Create a method for each documentType you want to archive
Example method without typeSearchWord
VISVarsel: {
  active: true, // set to false to disable the method
  id: 'VISVarsel', // set to the same as the property for the whole method. Don't ask why...
  name: 'Varsel om fare for regn',
  findDataMethod: 'visVarselDoc', // Use or create methods defined in "./lib/getData.js"
  identifierStrings: ['Varsel om fare', 'yr'], // Sentences or words that uniquely distinguish this document
  archiveTemplate: 'varsel-fare', // the template used to create archive metadata
  internalNoteTemplate: 'internt-notat-varsel',
  internalNote: './data/blockedAddress.pdf', // If svarUt is used, and the document could not be sent, send note to school
  svarUt: false, // If document should be sent on svarut as well as archived
  manualSvarUt: false, // If you need to manually control the document in P360 before svarut
  schoolOrgnr: '994309153', // optional, overrides school found in document
  accessGroup: 'Elev Kompetansebyggeren' // optional, overrides accessgroup found in document
}
Example method with typeSearchWord
VIS001: {
  active: true,
  id: 'VIS001', // id is the value found in the document behind "<TYPE_SEARCH_WORD>:"
  name: 'Fritak for opplæring i vær',
  findDataMethod: 'soknad',
  archiveTemplate: 'fritak-oppl-kro',
  internalNoteTemplate: 'internt-notat-svarbrev',
  internalNote: './data/blockedAddress.pdf',
  svarUt: true,
  manualSvarUt: false
}
Example method with splitting enabled
VISVarsel: {
  active: true,
  id: 'VISVarsel',
  name: 'Varsel om fare for regn',
  findDataMethod: 'visVarselDoc', // note that the findDataMethod must check if documents need to be split
  identifierStrings: ['Varsel om fare', 'yr'],
  splitStrings: ['Varsel', 'om fare', 'for regn i dag'], // The split strings are words and sentences present on the page you want to split on
  archiveTemplate: 'varsel-fare',
  svarUt: false,
  manualSvarUt: false
}
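To make the splitStrings idea more concrete, here is a rough sketch of how split points could be found once you have the text of each page. `findSplitRanges` is a hypothetical helper, and it assumes that a page starts a new document only when every split string is present on it. The actual check in the findDataMethod may differ.

```javascript
// Hypothetical helper: returns 1-based page ranges, one per sub-document.
// pageTexts is an array with the extracted text of each page, in page order.
function findSplitRanges (pageTexts, splitStrings) {
  const startPages = []
  pageTexts.forEach((text, index) => {
    // Assumption: a page is a split point only if ALL split strings are found on it
    if (splitStrings.every((str) => text.includes(str))) startPages.push(index + 1)
  })
  // Each document runs from its start page to the page before the next start page
  return startPages.map((start, i) => ({
    start,
    end: i + 1 < startPages.length ? startPages[i + 1] - 1 : pageTexts.length
  }))
}
```

Pages before the first split point are ignored in this sketch. Each range could then be handed to PDFtk, for example `pdftk in.pdf cat 1-2 output out.pdf`.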
Create a json file inside the ./templates directory, reference the template in the corresponding archive method
// Use <<<token>>>, where you want to replace the token with documentData.token, when running createMetadata.js
{
  "Title": "Varsel om fare for regn på <<<day>>>",
  "UnofficialTitle": "Varsel om fare for regn på dag <<<day>>> - <<<month>>> - <<<year>>>",
  "DocumentDate": "<<<documentDate>>>",
  "Archive": "Elevdokument",
  "Category": "Dokument ut",
  "Paragraph": "Offl. § 13 jf. fvl. § 13 (1) nr.1",
  "AccessCodeDescription": "Offl §13 jf. fvl §13 første ledd pkt. 1 - taushetsplikt om værforhold",
  "Status": "J",
  "CaseNumber": "<<<elevmappeCaseNumber>>>",
  "AccessGroup": "<<<schoolAccessGroup>>>",
  "AccessCode": "13",
  "ResponsibleEnterpriseNumber": "<<<schoolOrgNr>>>",
  "Contacts": [
    {
      "ReferenceNumber": "<<<schoolOrgNr>>>",
      "Role": "Avsender",
      "IsUnofficial": false
    },
    {
      "ReferenceNumber": "<<<ssn>>>",
      "Role": "Mottaker",
      "IsUnofficial": true
    }
  ],
  "Files": [
    {
      "Base64Data": "<<<pdfFileBase64>>>",
      "Category": "1",
      "Format": "pdf",
      "Status": "F",
      "Title": "Varsel om fare for regn på <<<day>>>",
      "VersionFormat": "A"
    }
  ]
}
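The token replacement can be sketched roughly like this. This is not the code in createMetadata.js, just a minimal illustration under the assumption that every `<<<token>>>` maps to a property with the same name on documentData. `fillTemplate` is a hypothetical name.

```javascript
// Hypothetical helper: replaces every <<<token>>> in a template object with documentData[token].
function fillTemplate (template, documentData) {
  const raw = JSON.stringify(template)
  const filled = raw.replace(/<<<(\w+)>>>/g, (match, token) => {
    if (!(token in documentData)) throw new Error(`Missing value for token "${token}"`)
    // JSON.stringify escapes quotes and backslashes, slice(1, -1) strips the surrounding quotes
    return JSON.stringify(String(documentData[token])).slice(1, -1)
  })
  return JSON.parse(filled)
}

// Usage with a trimmed-down version of the template above (values are made up)
const metadata = fillTemplate(
  { Title: 'Varsel om fare for regn på <<<day>>>', CaseNumber: '<<<elevmappeCaseNumber>>>' },
  { day: 'mandag', elevmappeCaseNumber: '21/00001' }
)
console.log(metadata.Title)
```

Throwing on a missing token is a design choice in this sketch. It avoids archiving a document with a literal `<<<token>>>` left in the title.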
- Get all pdfs in dispatch folder
- Extract text from pdfs, run recognition-methods
- If found an active document type defined in archive methods
  - Move pdf to next job "Get data"
- Else
  - Move to delete, and send email to user that sent document
If you already know the document-type, you could just put it in the next job and skip this step
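The recognition step can be sketched along these lines. `recognizeDocumentType` is a hypothetical function, not the one in the repo. It assumes the id follows "<TYPE_SEARCH_WORD>:" in the text, and that a method without a type search word matches only when all of its identifierStrings are found.

```javascript
// Hypothetical sketch: returns the matching active archive method, or null if none is recognized.
function recognizeDocumentType (pdfText, archiveMethods, typeSearchWord = 'VIS MAL TYPE') {
  // 1. Look for "<TYPE_SEARCH_WORD>: <id>" in the text
  const escaped = typeSearchWord.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  const match = pdfText.match(new RegExp(`${escaped}\\s*:\\s*(\\w+)`))
  if (match) {
    const method = archiveMethods[match[1]]
    if (method && method.active) return method
  }
  // 2. Fall back to identifierStrings (assumption: all of them must be present)
  const found = Object.values(archiveMethods).find((method) =>
    method.active &&
    Array.isArray(method.identifierStrings) &&
    method.identifierStrings.every((str) => pdfText.includes(str))
  )
  return found || null
}
```

A null result corresponds to the "Else" branch above, where the pdf is moved to delete and the sender gets an email.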
- For each archive method
  - For each pdf in archive method get-data folder
    - Extract text and run findDataMethod for this document type
    - Save result and send to next job "sync student data"
- For each archive method
  - For each pdf in archive method sync-student-data folder
    - Send social security number or birthdate, firstname, lastname to archiveApi/SyncElevmappe, it handles elevmappe-stuff
    - Save result and send to next job "get archive metadata"
- For each archive method
  - For each pdf in archive method get-archive-metadata folder
    - Send document and studentdata into create-metadata function, along with which archive template to use
    - Save result and send to next job "archive document"
- For each archive method
  - For each pdf in archive method get-archive-metadata folder
    - Send archive metadata along with base64 of pdf to P360
    - If svarut
      - Save result and send to next job "svarut"
    - Else
      - Save result and send to next job "stats and cleanup"
- For each archive method
  - For each pdf in archive method get-archive-metadata folder
    - Send document on svarut to student
    - Save result and send to next job "stats and cleanup"
- Save statistics and either delete or move pdfs and results to imported folder
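Put together, the jobs form a simple chain where only the archive step branches on svarUt. A small sketch of that routing, using the job names from the steps above, could look like this. `JOBS` and `nextJob` are hypothetical names, and "dispatch" is just a label for the first step.

```javascript
// Hypothetical sketch of the job order described above
const JOBS = [
  'dispatch',
  'get data',
  'sync student data',
  'get archive metadata',
  'archive document',
  'svarut',
  'stats and cleanup'
]

// Returns the name of the job a pdf should be sent to next, or null after the last job
function nextJob (currentJob, method) {
  // Only documents with svarUt enabled go through the "svarut" job
  if (currentJob === 'archive document') {
    return method.svarUt ? 'svarut' : 'stats and cleanup'
  }
  const index = JOBS.indexOf(currentJob)
  if (index === -1) throw new Error(`Unknown job: ${currentJob}`)
  return index + 1 < JOBS.length ? JOBS[index + 1] : null
}
```

Since each job only looks at its own folder, dropping a pdf straight into a later job folder skips the earlier steps.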