Long time user of beancount and the Beancount Categorizer, a need materialized to delegate categorization of entries to an unskilled other person.
Therefore, constraints:
- Data entry in spreadsheet-like format.
- Irrelevance of ordering of clauses, i.e. any sorting or ordering manipulation should not affect matching logic.
- Partial matching.
- Case insensitive.
- Punctuation allowed.
- Place the file
spreadsheet_categorizer.py
in yourPYTHONPATH
. Alternatively, use this line to add the folder where you saved the module toPYTHONPATH
, integrating <full path> with the full absolute path:sys.path.append("<full path>/projects/beancount_spreadsheet_categorizer")
- Create the categorizer spreadsheet starting from the template
categorizer.ods
. Add at least one entry.
from spreadsheet_categorizer import SpreadsheetCategorizer as SC
Initialize the categorizer object in the __init__
of your
importer, so it can be called by the extraction methods.
class MyImporter(importer.ImporterProtocol):
def __init__(self):
spreadsheet_path = 'test/categorizer.ods'
sheet_name = 'valid'
self.SC = SC(spreadsheet_path, sheet_name)
Your imported should have some code that create transactions using
information extracted from the source. Call the method match
from
the categorizer object initialized in init
, and pass the two
arguments payee
and description
(sometimes called ‘narration’ in
beancount code). One or both arguments can be None
and will be
matched as per matching logic.
As example, see below an importer class method that calls match
to
get the two transaction’s posting accounts.
def create_posting(self, line, cedente):
desc = line['Desc']
acc_source, _ = self.SC.match(cedente, desc)
if not acc_source:
acc_source = 'Expenses:FIXME'
price = float(line['Prezzo'])
tax = float(line['Aliquota IVA'])
price_withtax = round(price + price * tax / 100, 2)
amount = Amount(Decimal('{:.2f}'.format(price_withtax)), 'EUR')
meta = {'description': self.clean_string(desc)}
p = data.Posting(acc_source, amount, None, None, None, meta)
return p
In this case, the method is passed a dataframe line
with the columns
['Desc', 'Prezzo', 'Aliquota IVA']
; in addition, the payee is passed
as cedente
. The match
method is called to get only the source
account, and the destination account is ignored. If the source account
is not found, the posting is created with a Expenses:FIXME
account.
The destination account of the same transaction is extracted somewhere else in the code to create the main transaction:
_, acc_dest = self.SC.match(cedente, None)
In this case, the description is not passed as parameter since, as part of this importer matching logic, there is one and only one destination account by payee, and ‘catch-all’ clauses are present for every payee in the spreadsheet (behaviour of catch-all cases to be applied when missing payee or description can be controlled by setting them up in the spreadsheet, see matching logic.)
Below a simple example to use the spreadsheet categorizer in an importer.
It uses the ‘hook’ facility built with smart_importer.
Built on the `ImporterHook’ class in hooks.py
import pandas as pd
from beancount.core import data
from smart_importer.hooks import ImporterHook
from spreadsheet_categorizer import SpreadsheetCategorizer as SC
class AppliedSpreadsheetCategorizer(ImporterHook):
def __init__(self, spreadsheet, sheet_name):
self.asc = SC(spreadsheet, sheet_name)
def __call__(self, importer, file, imported_entries, existing_entries):
return [self._process(entry) for entry in imported_entries]
def _process(self, entry):
"""Process the entry by adding/replacing the posting(s).
If entry is not a transaction, pass it unchanged."""
# exclude all non-transaction entries
if type(entry) != data.Transaction or len(entry.postings) != 1:
return entry
# use the spreadsheet importer: match the payee and narration
acc_source, acc_dest = self.asc.match(entry.payee, entry.narration)
# create the main posting
source_posting = data.Posting(acc_source, None, None, None, None, None)
if acc_dest and not pd.isnull(acc_dest):
# if both accounts are specified in the spreadsheet, replace also
# the existing posting with the new one
units = entry.postings[0].units
meta = entry.postings[0].meta
dest_posting = data.Posting(acc_dest, units, None, None, None,
meta)
entry.postings.clear()
entry.postings.append(source_posting)
entry.postings.append(dest_posting)
else:
# otherwise, simply append or insert the new posting
if entry.postings[0].units.number > 0:
entry.postings.append(source_posting)
else:
entry.postings.insert(0, source_posting)
return entry
from smart_importer import apply_hooks
from smart_importer import PredictPostings as predict
# import my custom importer
from my_custom_importer_module import MyCustomImporter
# import the categorizer class built on smart_importer
from categorizer import AppliedSpreadsheetCategorizer as ASC
SPREADSHEET_PATH = "~/projects/spreadsheet-importer.ods"
spreadsheet_categorizer_instance = ASC(SPREADSHEET_PATH, 'categorizer-sheet')
# the below applies both the spreadsheet categorizer, and the suggested
# PredictPostings hook
CONFIG = [apply_hooks(MyCustomImporter(), [spreadsheet_categorizer_instance,
predict()])]
Matching is done with the method search_key
that does
- a regex search of the payee, therefore allowing partial match
- if more matches are found, a string match, allowing only exact match.
The regex is escaped before the matching, to allow for punctuation in the search payee or description. Case is always insensitive.
If only the payee, or only the description, or neither of them are passed, the method search for a ‘nan’ in place of the missing argument; during categorizer initialization, the catch-all cases are compiled using this ‘nan’ string to identify cases that should match everything.
Example cases are more eloquent. Below, the match and the
corresponding relevant entries. All these cases are in unittest
test/test.py
.
payee | description | account-source | account-destination |
---|---|---|---|
Gas Station | Tank car #1 | Exp:Car:Volvo | Assets:Bank |
match('gas station', 'tank car #1') == 'Exp:Car:Volvo', 'Assets:Bank'
payee | description | account-source | account-destination |
---|---|---|---|
Gas Station | Exp:Car | Assets:FIXME-NO-DESC’ |
match('gas station', '') == 'Exp:Car', 'Assets:FIXME-NO-DESC'
payee | description | account-source | account-destination |
---|---|---|---|
t-shirt | Exp:Clothes |
match('', 't-shirt') == 'Exp:Clothes', None
payee | description | account-source | account-destination |
---|---|---|---|
t-shirt | Exp:Clothes |
match('', 'shirt') == 'Exp:Clothes', None
payee | description | account-source | account-destination |
---|---|---|---|
Gas Station | Tank car #1 | Exp:Car:Volvo | Assets:Bank |
Gas stat | Tank car #1 | Exp:Car:Porsche | Assets:Bank |
The search payee is partially matched with the first one, and exactly matched with the second. Only the second is returned.
match('gas stat', 'tank car #1') == 'Exp:Car:Porsche', 'Assets:Bank'
payee | description | account-source | account-destination |
---|---|---|---|
Gas Station | Tank car #1 | Exp:Car:Volvo | Assets:Bank |
Gas Station | Tank car #2 | Exp:Car:Saab | Assets:Bank |
Gas Station | Exp:Car | Assets:FIXME-NO-DESC |
The search description could match two cases, therefore it is ignored and the payee catch-all clause is matched.
match('gas station', 'tank car') == 'Exp:Car', 'Assets:FIXME-NO-DESC'
payee | description | account-source | account-destination |
---|---|---|---|
Gas Station | Tank car #1 | Exp:Car:Volvo | Assets:Bank |
Gas Station | Tank car #2 | Exp:Car:Saab | Assets:Bank |
Gas Station | Exp:Car | Assets:FIXME-NO-DESC | |
Gas stat | Tank car #1 | Exp:Car:Porsche | Assets:Bank |
Exp:CATCH-ALL | Assets:CATCH-ALL |
Both search payee and description could match multiple cases, therefore they are both ignored and the general catch-all clause is matched.
match('gas', 'tank') == 'Exp:CATCH-ALL', 'Assets:CATCH-ALL'
payee | description | account-source | account-destination |
---|---|---|---|
a.b.c. | service | Exp:Software | Assets:Cash |
Exp:CATCH-ALL | Assets:CATCH-ALL |
Case is not found; while payee exists in the table, there isn’t a catch-all clause for that payee. Therefore, general catch-all clause applies.
match('a.b.c.', 'software') == 'Exp:CATCH-ALL', 'Assets:CATCH-ALL'
Take log_level
as argument.
- Must be an ODS spreadsheet (see Open Document Format for Office Applications (ODF).
- Must have four columns containing:
- the payee to be matched
- the description to be matched
- the account source to be returned
- the account destination to be returned
- If these four columns are not named with standard names, a dict with the preferred names must be passed to the constructor.
- It could contain additional columns (e.g. comments, sorting keys).
- It should not contain empty rows.
- It must contain a catch-all clause when both payee and description are missing.
- It should not contain duplicate rows.
- Warnings will be issued if the import detect situations like missing catch-all payees that, while generating undesired results, do not compromize the logic.
- The import stops if, instead, illegal cases are present, like duplicated cases.
- Beancount on github
- beancount-categorizer
- Diátaxis - A systematic approach to technical documentation authoring.
- Improve documentation.
- Add docstrings.
- Package the module.
- Add full example, including an importer to show full workflow.
- Support input of orgmode tables.
- Support Microsoft Excel spreadsheet.