Explanation

Motivation

Long time user of beancount and the Beancount Categorizer, a need materialized to delegate categorization of entries to an unskilled other person.

Therefore, constraints:

Data entry in spreadsheet-like format.
Irrelevance of ordering of clauses, i.e. any sorting or ordering manipulation should not affect matching logic.
Partial matching.
Case insensitive.
Punctuation allowed.

Tutorial

Install

Place the file spreadsheet_categorizer.py in your PYTHONPATH. Alternatively, use this line to add the folder where you saved the module to PYTHONPATH, integrating <full path> with the full absolute path:
```
sys.path.append("<full path>/projects/beancount_spreadsheet_categorizer")
    
```
Create the categorizer spreadsheet starting from the template categorizer.ods. Add at least one entry.

Setup

Import

from spreadsheet_categorizer import SpreadsheetCategorizer as SC

Initialize

Initialize the categorizer object in the __init__ of your importer, so it can be called by the extraction methods.

class MyImporter(importer.ImporterProtocol):
    def __init__(self):
        spreadsheet_path = 'test/categorizer.ods'
        sheet_name = 'valid'
        self.SC = SC(spreadsheet_path, sheet_name)

Apply

Your imported should have some code that create transactions using information extracted from the source. Call the method match from the categorizer object initialized in init, and pass the two arguments payee and description (sometimes called ‘narration’ in beancount code). One or both arguments can be None and will be matched as per matching logic.

As example, see below an importer class method that calls match to get the two transaction’s posting accounts.

def create_posting(self, line, cedente):
    desc = line['Desc']
    acc_source, _ = self.SC.match(cedente, desc)
    if not acc_source:
        acc_source = 'Expenses:FIXME'
    price = float(line['Prezzo'])
    tax = float(line['Aliquota IVA'])
    price_withtax = round(price + price * tax / 100, 2)
    amount = Amount(Decimal('{:.2f}'.format(price_withtax)), 'EUR')
    meta = {'description': self.clean_string(desc)}
    p = data.Posting(acc_source, amount, None, None, None, meta)
    return p

In this case, the method is passed a dataframe line with the columns ['Desc', 'Prezzo', 'Aliquota IVA']; in addition, the payee is passed as cedente. The match method is called to get only the source account, and the destination account is ignored. If the source account is not found, the posting is created with a Expenses:FIXME account.

The destination account of the same transaction is extracted somewhere else in the code to create the main transaction:

_, acc_dest = self.SC.match(cedente, None)

In this case, the description is not passed as parameter since, as part of this importer matching logic, there is one and only one destination account by payee, and ‘catch-all’ clauses are present for every payee in the spreadsheet (behaviour of catch-all cases to be applied when missing payee or description can be controlled by setting them up in the spreadsheet, see matching logic.)

How-to

Hook the categorizer to an importer

Below a simple example to use the spreadsheet categorizer in an importer.

It uses the ‘hook’ facility built with smart_importer.

file ‘categorizer.py’ in your “importers” folder to be used by bean-extract

Built on the `ImporterHook’ class in hooks.py

import pandas as pd
from beancount.core import data
from smart_importer.hooks import ImporterHook
from spreadsheet_categorizer import SpreadsheetCategorizer as SC


class AppliedSpreadsheetCategorizer(ImporterHook):
    def __init__(self, spreadsheet, sheet_name):
        self.asc = SC(spreadsheet, sheet_name)

    def __call__(self, importer, file, imported_entries, existing_entries):
        return [self._process(entry) for entry in imported_entries]

    def _process(self, entry):
        """Process the entry by adding/replacing the posting(s).

        If entry is not a transaction, pass it unchanged."""
        # exclude all non-transaction entries
        if type(entry) != data.Transaction or len(entry.postings) != 1:
            return entry
        # use the spreadsheet importer: match the payee and narration
        acc_source, acc_dest = self.asc.match(entry.payee, entry.narration)
        # create the main posting
        source_posting = data.Posting(acc_source, None, None, None, None, None)
        if acc_dest and not pd.isnull(acc_dest):
            # if both accounts are specified in the spreadsheet, replace also
            # the existing posting with the new one
            units = entry.postings[0].units
            meta = entry.postings[0].meta
            dest_posting = data.Posting(acc_dest, units, None, None, None,
                                        meta)
            entry.postings.clear()
            entry.postings.append(source_posting)
            entry.postings.append(dest_posting)
        else:
            # otherwise, simply append or insert the new posting
            if entry.postings[0].units.number > 0:
                entry.postings.append(source_posting)
            else:
                entry.postings.insert(0, source_posting)
        return entry

file ‘config.py’ used by bean-extract

from smart_importer import apply_hooks
from smart_importer import PredictPostings as predict

# import my custom importer
from my_custom_importer_module import MyCustomImporter

# import the categorizer class built on smart_importer
from categorizer import AppliedSpreadsheetCategorizer as ASC

SPREADSHEET_PATH = "~/projects/spreadsheet-importer.ods"

spreadsheet_categorizer_instance = ASC(SPREADSHEET_PATH, 'categorizer-sheet')

# the below applies both the spreadsheet categorizer, and the suggested
# PredictPostings hook
CONFIG = [apply_hooks(MyCustomImporter(), [spreadsheet_categorizer_instance,
                                          predict()])]

Matching logic

Matching is done with the method search_key that does

a regex search of the payee, therefore allowing partial match
if more matches are found, a string match, allowing only exact match.

The regex is escaped before the matching, to allow for punctuation in the search payee or description. Case is always insensitive.

If only the payee, or only the description, or neither of them are passed, the method search for a ‘nan’ in place of the missing argument; during categorizer initialization, the catch-all cases are compiled using this ‘nan’ string to identify cases that should match everything.

Example cases are more eloquent. Below, the match and the corresponding relevant entries. All these cases are in unittest test/test.py.

Match payee and description, case insensitive

payee	description	account-source	account-destination
Gas Station	Tank car #1	Exp:Car:Volvo	Assets:Bank

match('gas station', 'tank car #1') == 'Exp:Car:Volvo', 'Assets:Bank'

Match payee only

payee	description	account-source	account-destination
Gas Station		Exp:Car	Assets:FIXME-NO-DESC’

match('gas station', '') == 'Exp:Car', 'Assets:FIXME-NO-DESC'

Match description only

payee	description	account-source	account-destination
	t-shirt	Exp:Clothes

match('', 't-shirt') == 'Exp:Clothes', None

Partial match

payee	description	account-source	account-destination
	t-shirt	Exp:Clothes

match('', 'shirt') == 'Exp:Clothes', None

Exact match

payee	description	account-source	account-destination
Gas Station	Tank car #1	Exp:Car:Volvo	Assets:Bank
Gas stat	Tank car #1	Exp:Car:Porsche	Assets:Bank

The search payee is partially matched with the first one, and exactly matched with the second. Only the second is returned.

match('gas stat', 'tank car #1') == 'Exp:Car:Porsche', 'Assets:Bank'

Ambiguous description

payee	description	account-source	account-destination
Gas Station	Tank car #1	Exp:Car:Volvo	Assets:Bank
Gas Station	Tank car #2	Exp:Car:Saab	Assets:Bank
Gas Station		Exp:Car	Assets:FIXME-NO-DESC

The search description could match two cases, therefore it is ignored and the payee catch-all clause is matched.

match('gas station', 'tank car') == 'Exp:Car', 'Assets:FIXME-NO-DESC'

Ambiguous payee and description

payee	description	account-source	account-destination
Gas Station	Tank car #1	Exp:Car:Volvo	Assets:Bank
Gas Station	Tank car #2	Exp:Car:Saab	Assets:Bank
Gas Station		Exp:Car	Assets:FIXME-NO-DESC
Gas stat	Tank car #1	Exp:Car:Porsche	Assets:Bank
		Exp:CATCH-ALL	Assets:CATCH-ALL

Both search payee and description could match multiple cases, therefore they are both ignored and the general catch-all clause is matched.

match('gas', 'tank') == 'Exp:CATCH-ALL', 'Assets:CATCH-ALL'

Case not found

payee	description	account-source	account-destination
a.b.c.	service	Exp:Software	Assets:Cash
		Exp:CATCH-ALL	Assets:CATCH-ALL

Case is not found; while payee exists in the table, there isn’t a catch-all clause for that payee. Therefore, general catch-all clause applies.

match('a.b.c.', 'software') == 'Exp:CATCH-ALL', 'Assets:CATCH-ALL'

Reference

API guide (DRAFT)

class `SpreadsheetCategorizer`

method `init`

Take log_level as argument.

method `search_key`

Categorizer spreadsheet specifications

Must be an ODS spreadsheet (see Open Document Format for Office Applications (ODF).
Must have four columns containing:
- the payee to be matched
- the description to be matched
- the account source to be returned
- the account destination to be returned
If these four columns are not named with standard names, a dict with the preferred names must be passed to the constructor.
It could contain additional columns (e.g. comments, sorting keys).
It should not contain empty rows.
It must contain a catch-all clause when both payee and description are missing.
It should not contain duplicate rows.
Warnings will be issued if the import detect situations like missing catch-all payees that, while generating undesired results, do not compromize the logic.
The import stops if, instead, illegal cases are present, like duplicated cases.

External references

Future

Improve documentation.
- Add docstrings.
Package the module.
Add full example, including an importer to show full workflow.
Support input of orgmode tables.
Support Microsoft Excel spreadsheet.

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
template		template
test		test
.gitignore		.gitignore
LICENSE		LICENSE
README.org		README.org
TODO.org		TODO.org
spreadsheet_categorizer.py		spreadsheet_categorizer.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Explanation

Motivation

Tutorial

Install

Setup

Import

Initialize

Apply

How-to

Hook the categorizer to an importer

file ‘categorizer.py’ in your “importers” folder to be used by bean-extract

file ‘config.py’ used by bean-extract

Matching logic

Match payee and description, case insensitive

Match payee only

Match description only

Partial match

Exact match

Ambiguous description

Ambiguous payee and description

Case not found

Reference

API guide (DRAFT)

class `SpreadsheetCategorizer`

method `init`

method `search_key`

Categorizer spreadsheet specifications

External references

Future

About

Releases

Packages

Languages

License

erpreciso/beancount_spreadsheet_categorizer

Folders and files

Latest commit

History

Repository files navigation

Explanation

Motivation

Tutorial

Install

Setup

Import

Initialize

Apply

How-to

Hook the categorizer to an importer

file ‘categorizer.py’ in your “importers” folder to be used by bean-extract

file ‘config.py’ used by bean-extract

Matching logic

Match payee and description, case insensitive

Match payee only

Match description only

Partial match

Exact match

Ambiguous description

Ambiguous payee and description

Case not found

Reference

API guide (DRAFT)

class SpreadsheetCategorizer

method __init__

method search_key

Categorizer spreadsheet specifications

External references

Future

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

class `SpreadsheetCategorizer`

method `init`

method `search_key`

Packages