vis-til-arkiv

Node.js script for archiving Visma InSchool documents, based on PDF text recognition

Uses @vtfk/pdf-text-reader for extracting pdf text
Uses @vtfk/pdf-splitter for splitting pdfs
Uses archiveApi/SyncElevmappe for getting student data as well as creation and updating of Elevmapper

Remarks

Only tested and used on Windows as far as I know
Not well suited to use cases other than archiving pdfs when the pdf itself is the only data you have
Good luck understanding the code... I am sorry, I take full responsibility

Setup

Clone repo

$ git clone https://github.com/vtfk/vis-til-arkiv-v3

PDFtk

If you want to be able to split pdfs (handle a large pdf document consisting of several documents of the same type), make sure PDFtk is installed in the same environment you run the Node.js script from.

Use case: Archiving several 'karakterprotokoller' (grade protocols) in one go - send in 300 protocols in one document, and each protocol is handled as a separate document.
See somewhere

Install dependencies

$ npm i

Set up .env

NODE_ENV="production"
ROOT_DIRECTORY="C:/PROD-VIStilArkiv" #This is the root folder for the jobs and documents, can be wherever
STAT_FILE="C:/PROD-VIStilArkiv/stat.json" #A local statistics file
P360_SYNCELEVMAPPE_URL="https://<base_url>/SyncElevmappe" 
P360_SYNCELEVMAPPE_KEY="<key>"
P360_SYNCELEVMAPPE_HEADER_NAME="<authorization_header_name>"
PDFTK_EXT="C:/Program Files (x86)/PDFtk/bin/pdftk"
P360_ARCHIVE_DOC_URL="https://<sif_rpc_api_url>/DocumentService/CreateDocument"
P360_DISPATCH_DOC_URL="https://<sif_rpc_api_url>/DocumentService/DispatchDocuments"
P360_ARCHIVE_KEY="<auth_key>"
P360_ARCHIVE_QUERY_STRING="<query_string_property_name_before_the_<auth_key>>"
E18_URL="https://<api_url>/e18"
E18_KEY="<e18_key>"
E18_HEADER_NAME="<authorization_header_name>" 
DELETE_FINISHED_JOBS=false # If true, archived pdfs are deleted, if false, they are put in <imported_folder>
DISPATCH_DIRECTORY_NAME="dispatchInput" #Optional, defaults to "input". Name of folder where script looks for pdfs
DOCUMENT_DIRECTORY_NAME="document" #Optional, defaults to "documents"
DELETE_DIRECTORY_NAME="delete" #Optional, defaults to "delete"
TYPE_SEARCH_WORD="VIS MAL TYPE" #Optional, defaults to "VIS MAL TYPE", used for recognizing documentTypes
TEAMSWEBHOOK_URL="<webhook_url>" #Optional, for alerts in Teams
SMTP_HOST="<smtp_host>" #Optional, for sending emails to users when documents are not recognized as a valid type
SMTP_PORT=<smtp_port> #Optional, for sending emails to users when documents are not recognized as a valid type
PAPERTRAIL_HOST="<papertrail_host>" #Optional, for logging
PAPERTRAIL_TOKEN="<papertrail_token>" #Optional, for logging
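
The optional settings above each state a default. As a minimal sketch of how those defaults could be resolved, the helper below takes an environment object and fills in the documented fallbacks. `resolveConfig` and the returned property names are hypothetical and not taken from this repo, and treating only ROOT_DIRECTORY as required is an assumption.

```javascript
// Sketch only: resolveConfig is a hypothetical helper, not part of this repo.
// It applies the defaults documented in the .env comments above.
function resolveConfig (env) {
  // Assumption: ROOT_DIRECTORY is the one setting this sketch insists on
  if (!env.ROOT_DIRECTORY) {
    throw new Error('Missing required setting: ROOT_DIRECTORY')
  }
  return {
    rootDirectory: env.ROOT_DIRECTORY,
    dispatchDirectoryName: env.DISPATCH_DIRECTORY_NAME || 'input',
    documentDirectoryName: env.DOCUMENT_DIRECTORY_NAME || 'documents',
    deleteDirectoryName: env.DELETE_DIRECTORY_NAME || 'delete',
    typeSearchWord: env.TYPE_SEARCH_WORD || 'VIS MAL TYPE',
    // Environment values are strings, so "false" must be compared explicitly
    deleteFinishedJobs: env.DELETE_FINISHED_JOBS === 'true'
  }
}
```

With the environment already loaded, this could be called as `resolveConfig(process.env)`. Note that a plain truthiness check on DELETE_FINISHED_JOBS would treat the string "false" as true, which is why the sketch compares against 'true'.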

Start the script

Run the script once to check that it works. The first run sets up the necessary directories within the ROOT_DIRECTORY

$ node ./index.js

Usage

Set up archive method

In ./archiveMethods.js you can add, disable, or delete methods. Create a method for each documentType you want to archive

Example method without typeSearchWord

VISVarsel: {
    active: true, // set to false to disable the method
    id: 'VISVarsel', // set to the same as the property for the whole method. Don't ask why...
    name: 'Varsel om fare for regn',
    findDataMethod: 'visVarselDoc', // Use or create methods defined in "./lib/getData.js"
    identifierStrings: ['Varsel om fare', 'yr'], // Sentences or words that uniquely distinguish this document
    archiveTemplate: 'varsel-fare', // the template used to create archive metadata
    internalNoteTemplate: 'internt-notat-varsel',
    internalNote: './data/blockedAddress.pdf', // If svarUt is used, and the document could not be sent, send note to school
    svarUt: false, // If document should be sent on svarut as well as archived
    manualSvarUt: false, // If you need to manually control the document in P360 before svarut
    schoolOrgnr: '994309153', // optional, overrides school found in document
    accessGroup: 'Elev Kompetansebyggeren' // optional, overrides accessgroup found in document
}

Example method with typeSearchWord

VIS001: {
    active: true,
    id: 'VIS001', // id is the value found in the document behind "<TYPE_SEARCH_WORD>:"
    name: 'Fritak for opplæring i vær',
    findDataMethod: 'soknad',
    archiveTemplate: 'fritak-oppl-kro',
    internalNoteTemplate: 'internt-notat-svarbrev',
    internalNote: './data/blockedAddress.pdf',
    svarUt: true,
    manualSvarUt: false
}

Example method with splitting enabled

VISVarsel: {
    active: true,
    id: 'VISVarsel', 
    name: 'Varsel om fare for regn',
    findDataMethod: 'visVarselDoc', // note that the findDataMethod must check if documents need to be split
    identifierStrings: ['Varsel om fare', 'yr'],
    splitStrings: ['Varsel', 'om fare', 'for regn i dag'], // The split strings are words and sentences present on the page you want to split on
    archiveTemplate: 'varsel-fare', 
    svarUt: false, 
    manualSvarUt: false, 
}
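
To illustrate how splitStrings might be used, here is a rough sketch that finds the pages where a new document starts and turns them into page ranges. It assumes the page texts are already extracted into an array of strings, and that a page counts as a split point only when every split string is present on it. Both function names are hypothetical, and the actual check in the repo may differ.

```javascript
// Sketch only: hypothetical helpers, not functions from this repo.
// pages: array of page texts, in page order
// Returns the 1-based page numbers where a new document starts.
function findSplitPages (pages, splitStrings) {
  const starts = []
  pages.forEach((pageText, index) => {
    // Assumption: all split strings must be present on the page
    if (splitStrings.every((s) => pageText.includes(s))) {
      starts.push(index + 1)
    }
  })
  return starts
}

// Turns start pages into inclusive { start, end } page ranges, one per document.
function toPageRanges (starts, pageCount) {
  return starts.map((start, i) => ({
    start,
    end: i + 1 < starts.length ? starts[i + 1] - 1 : pageCount
  }))
}
```

For a three-page pdf where pages 1 and 3 contain all split strings, `findSplitPages` gives `[1, 3]` and `toPageRanges([1, 3], 3)` gives the ranges 1-2 and 3-3, which could then be handed to a splitter such as PDFtk.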

Set up archive template

Create a JSON file inside the ./templates directory, and reference the template in the corresponding archive method

// Use <<<token>>> wherever you want the token replaced with documentData.token when running createMetadata.js
{
    "Title": "Varsel om fare for regn på <<<day>>>",
    "UnofficialTitle": "Varsel om fare for regn på dag <<<day>>> - <<<month>>> - <<<year>>>",
    "DocumentDate": "<<<documentDate>>>",
    "Archive": "Elevdokument",
    "Category": "Dokument ut",
    "Paragraph": "Offl. § 13 jf. fvl. § 13 (1) nr.1",
    "AccessCodeDescription": "Offl §13 jf. fvl §13 første ledd pkt. 1 - taushetsplikt om værforhold",
    "Status": "J",
    "CaseNumber": "<<<elevmappeCaseNumber>>>",
    "AccessGroup": "<<<schoolAccessGroup>>>",
    "AccessCode": "13",
    "ResponsibleEnterpriseNumber": "<<<schoolOrgNr>>>",
    "Contacts": [
        {
            "ReferenceNumber": "<<<schoolOrgNr>>>",
            "Role": "Avsender",
            "IsUnofficial": false
        },
        {
            "ReferenceNumber": "<<<ssn>>>",
            "Role": "Mottaker",
            "IsUnofficial": true
        }
    ],
    "Files": [
        {
            "Base64Data": "<<<pdfFileBase64>>>",
            "Category": "1",
            "Format": "pdf",
            "Status": "F",
            "Title": "Varsel om fare for regn på <<<day>>>",
            "VersionFormat": "A"
        }
    ]
}
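
The token replacement described for the template can be sketched as follows. This is an illustration of the idea and not the code in createMetadata.js: `fillTemplate` is a hypothetical name, and throwing on a missing token is a design choice made for the sketch.

```javascript
// Sketch only: fillTemplate is hypothetical, not the repo's createMetadata.js.
// Walks the parsed template and replaces every <<<token>>> inside string
// values with documentData.token. Non-string values are left untouched.
function fillTemplate (node, documentData) {
  if (typeof node === 'string') {
    return node.replace(/<<<(\w+)>>>/g, (_, token) => {
      if (documentData[token] === undefined) {
        throw new Error(`Missing documentData.${token}`)
      }
      return String(documentData[token])
    })
  }
  if (Array.isArray(node)) {
    return node.map((item) => fillTemplate(item, documentData))
  }
  if (node !== null && typeof node === 'object') {
    return Object.fromEntries(
      Object.entries(node).map(([key, value]) => [key, fillTemplate(value, documentData)])
    )
  }
  return node // booleans, numbers, null
}
```

Failing loudly on a missing token keeps a half-filled title such as "Varsel om fare for regn på <<<day>>>" from reaching the archive.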

Job-flow

1. Dispatch documents

Get all pdfs in dispatch folder
Extract text from pdfs, run recognition-methods
If an active document type defined in archive methods is found
- Move pdf to next job "Get data"
Else
- Move to delete, and send email to user that sent document

If you already know the document-type, you could just put it in the next job and skip this step
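
The recognition step can be sketched roughly like this, using the two styles of archive method shown under "Set up archive method". It assumes the type id follows "<TYPE_SEARCH_WORD>:" in the text, and that a method without a type id matches only when all its identifierStrings are present. `recognizeDocumentType` is a hypothetical name and the real recognition methods may work differently.

```javascript
// Sketch only: recognizeDocumentType is hypothetical, not a function from this repo.
// text: extracted pdf text, archiveMethods: object shaped like ./archiveMethods.js
// Returns the matching active archive method, or null if none matches.
function recognizeDocumentType (text, archiveMethods, typeSearchWord = 'VIS MAL TYPE') {
  const active = Object.values(archiveMethods).filter((method) => method.active)

  // 1. Look for "<TYPE_SEARCH_WORD>: <id>" in the text
  const escaped = typeSearchWord.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  const match = text.match(new RegExp(`${escaped}:\\s*(\\w+)`))
  if (match) {
    const byId = active.find((method) => method.id === match[1])
    if (byId) return byId
  }

  // 2. Fall back to identifierStrings (assumption: all of them must be present)
  const byStrings = active.find((method) =>
    Array.isArray(method.identifierStrings) &&
    method.identifierStrings.every((s) => text.includes(s))
  )
  return byStrings || null
}
```

A `null` result corresponds to the "Else" branch above, where the pdf is moved to delete and the sender is notified.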

2. Get data

For each archive method
- For each pdf in archive method get-data folder
  - Extract text and run findDataMethod for this document type
  - Save result and send to next job "sync student data"
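
As an illustration of what a findDataMethod might look like, the sketch below pulls a national identity number and a document date out of the extracted text. It is hypothetical throughout: the function name, the regular expressions, and the assumption that the date is written as dd.mm.yyyy are not taken from ./lib/getData.js. The returned property names are chosen to line up with the tokens used in the template example.

```javascript
// Sketch only: a hypothetical findDataMethod, not code from ./lib/getData.js.
// Assumes the text contains an 11-digit identity number and a dd.mm.yyyy date.
function findVarselData (text) {
  const ssnMatch = text.match(/\b\d{11}\b/)
  if (!ssnMatch) {
    throw new Error('Could not find an 11-digit identity number in the document')
  }
  const result = { ssn: ssnMatch[0] }

  const dateMatch = text.match(/\b(\d{2})\.(\d{2})\.(\d{4})\b/)
  if (dateMatch) {
    const [, day, month, year] = dateMatch
    result.day = day
    result.month = month
    result.year = year
    result.documentDate = `${year}-${month}-${day}`
  }
  return result
}
```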

3. Sync student data

For each archive method
- For each pdf in archive method sync-student-data folder
  - Send social security number or birthdate, firstname, lastname to archiveApi/SyncElevmappe, it handles elevmappe-stuff
  - Save result and send to next job "get archive metadata"
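
The "social security number or birthdate, firstname, lastname" choice can be sketched as a small helper that decides what to send. The field names below are hypothetical and are not the documented request format of archiveApi/SyncElevmappe, so check the API before relying on them.

```javascript
// Sketch only: hypothetical helper with hypothetical field names,
// not the actual request format of archiveApi/SyncElevmappe.
function buildSyncElevmappePayload (documentData) {
  const { ssn, birthdate, firstName, lastName } = documentData
  // Prefer the identity number when the document contains one
  if (ssn) return { ssn }
  // Otherwise all three of birthdate, first name and last name are needed
  if (birthdate && firstName && lastName) {
    return { birthdate, firstName, lastName }
  }
  throw new Error('Need either ssn, or birthdate together with firstName and lastName')
}
```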

4. Get archive metadata

For each archive method
- For each pdf in archive method get-archive-metadata folder
  - Send document and student data into create-metadata function, along with which archive template to use
  - Save result and send to next job "archive document"

5. Archive document

For each archive method
- For each pdf in archive method archive-document folder
  - Send archive metadata along with base64 of pdf to P360
  - If svarut
    - Save result and send to next job "svarut"
  - Else
    - Save result and send to next job "stats and cleanup"

6. Svarut

For each archive method
- For each pdf in archive method svarut folder
  - Send document on svarut to student
  - Save result and send to next job "stats and cleanup"

7. Stats and cleanup

Save statistics and either delete or move pdfs and results to imported folder
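
The order of the jobs in this section, including the svarut branch after archiving, can be summarized in a short sketch. The job names are taken from the text above, while `JOBS` and `nextJob` are hypothetical and not how the repo itself wires the jobs together.

```javascript
// Sketch only: JOBS and nextJob are hypothetical, not code from this repo.
const JOBS = [
  'dispatch documents',
  'get data',
  'sync student data',
  'get archive metadata',
  'archive document',
  'svarut',
  'stats and cleanup'
]

// Returns the name of the job a pdf should move to after currentJob,
// or null when the flow is finished.
function nextJob (currentJob, archiveMethod) {
  const index = JOBS.indexOf(currentJob)
  if (index === -1) throw new Error(`Unknown job: ${currentJob}`)
  // Only methods with svarUt enabled pass through the svarut job
  if (currentJob === 'archive document') {
    return archiveMethod.svarUt ? 'svarut' : 'stats and cleanup'
  }
  return index + 1 < JOBS.length ? JOBS[index + 1] : null
}
```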

License

MIT
