Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add generic importer source #62

Merged
merged 15 commits into from
Aug 7, 2020
Merged

Conversation

dumbPy
Copy link
Contributor

@dumbPy dumbPy commented Aug 2, 2020

that imports transactions from beancount.ingest.importer.ImporterProtocol subclass importers

performs deduplication w.r.t journal entries and across multiple importable files
tested on my own importer. will be testing more in the following week.

should work as is with #60
It's my first OSS contribution, so opinions are most welcome @Zburatorul @jbms and others

Closes #18, Closes #23

that imports transactions from
beancount.ingest.importer.ImporterProtocol
subclass importers
@coveralls
Copy link

coveralls commented Aug 2, 2020

Pull Request Test Coverage Report for Build 161

  • 85 of 92 (92.39%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.3%) to 67.041%

Changes Missing Coverage Covered Lines Changed/Added Lines %
beancount_import/source/generic_importer_source.py 75 82 91.46%
Totals Coverage Status
Change from base Build 135: 0.3%
Covered Lines: 4549
Relevant Lines: 6577

💛 - Coveralls

is_posting_cleared should check for
source_desc and use that to determine if a posting is cleared.
This avoids clearing manually entered postings and helps reconcile them
@dumbPy
Copy link
Contributor Author

dumbPy commented Aug 2, 2020

I am still facing an issue with manually entered transactions.

2020-08-02 * "Headphones"
    Assets:INR:SBI:Saving -2000 INR
    Expenses:Misc

and entry from statement imported with this source

2020-08-02 * "Some Long ass stupid description"
    Assets:INR:SBI:Saving -2000 INR
        transaction_ref: 34567422445
        source_desc: "Some Long ass stupid description"
    Expenses:FIXME

These are matched as expected, but 1st line of both the candidates is 2020-08-02 * "Some Long ass stupid description"
I would like to keep 2020-08-02 * "Headphones" as is from the manually entered entry while merging the postings

This seems to work fine for other sources in your manually_entered example. what am I missing? @jbms

@jbms
Copy link
Owner

jbms commented Aug 5, 2020

At the moment there is no indication in the transaction itself whether the narration was manually entered or automatically generated. I'm not sure how to represent that in a way that doesn't create other problems.

Instead, beancount-import uses a heuristic that if merging an existing transaction in the journal with a new transaction being imported, then the narration from the journal is retained. However, if both transactions are already in the journal (e.g. one manually entered and one imported with the Expenses:FIXME account left uncorrected), then currently the narration of one will be chosen arbitrarily.

(See matching.py combine_transactions_using_match_set)

Possibly a different heuristic could be used in that case, e.g. preferring the narration from the transaction with the fewest Expenses:FIXME accounts. Maybe a different heuristic could be used, like checking whether the narration is the same/similar as another metadata field.

@jbms
Copy link
Owner

jbms commented Aug 5, 2020

Thanks for this contribution! Overall this looks good. Are you able to provide some tests, possibly based on examples in the beancount repository under beancount/ingest/importers? That would be really helpful in validating that it works an preventing it from breaking in the future.

The tests are relatively easy to create based on the existing framework --- you can refer to e.g. source/mint_test.py

@dumbPy
Copy link
Contributor Author

dumbPy commented Aug 5, 2020

Possibly a different heuristic could be used in that case, e.g. preferring the narration from the transaction with the fewest Expenses:FIXME accounts

I will try this out. and will try to write tests too. Thank you!!!

@dumbPy
Copy link
Contributor Author

dumbPy commented Aug 6, 2020

I was using beancount.ingest.similarity.SimilarityComparator() for finding transactions in the journal and check if they are cleared.

Scenario 1:

I have 2 very similar transactions in my statement

2020-01-01 * "by debit card-OTHPG 063441 GOOGLE CLOUD INDIA PVTTHANE-"
    Assets:INR:SBI:Saving -1 INR
    Expense:FIXME

; attention on (063441/063444) difference that will lead to different hash.  
2020-01-01 * "by debit card-OTHPG 063444 GOOGLE CLOUD INDIA PVTTHANE-"
    Assets:INR:SBI:Saving -1 INR
    Expense:FIXME

but only 1 cleared transaction in journal (say I reconciled midday on 01/01/2020 and then this occurred the next time I tried reconciling).
The SimilarityComparator matches both these transactions to the only cleared transaction in the journal.
This is expected since SimilarityComparator will ignore narration that the user might change while reconciling, and everything else is same.
To help with this, now I am matching source_desc of the cleared transaction with the imported transactions.

Scenario 2:

There are 2 exactly same transactions imported from the statements (i.e., same hash)

2020-01-01 * "BULK POSTING- 00000008237 250120 GOOGLE CLOUD INDIA PVT-"
  Assets:INR:SBI:Saving   1.00 INR
  Expenses:FIXME

2020-01-01 * "BULK POSTING- 00000008237 250120 GOOGLE CLOUD INDIA PVT-"
  Assets:INR:SBI:Saving   1.00 INR
  Expenses:FIXME

and one cleared transaction in the journal and optionally 1 uncleared(manually entered) transaction.

2020-02-26 * "Money returned by Google"
  Assets:INR:SBI:Saving   1.00 INR
    source_desc: "BULK POSTING- 00000008237 250120 GOOGLE CLOUD INDIA PVT-"
  Expenses:Misc          -1.00 INR

I am checking the number of cleared transaction that are similar, and have same source_desc as imported entry's narration and import the remaining transactions (here 1 since 1 of 2 is already cleared).

Will add these as tests

source_desc is used for checking and clearing postings.
this is the way!!
and some  very minor changes in ImporterSource
@dumbPy dumbPy changed the title WIP: Add generic importer source Add generic importer source Aug 6, 2020
@dumbPy
Copy link
Contributor Author

dumbPy commented Aug 6, 2020

Tests added as requested.
Not implemented test_invalid since InvalidSourceReference is not used in ImporterSource.
I think this is ready to be merged along with #63
Let me know if you think it needs anything more @jbms

@jbms
Copy link
Owner

jbms commented Aug 6, 2020

Thanks for adding the tests!

This is getting very close --- potentially could be simplified quite a bit by re-using more functionality from description_based_source.py, unless you think there is a reason why something different needs to be done here.

3rd revamp.
hashing now replaced with _get_key_from_imported_entry
date added to posting.meta
tests updated
test_invalid added
@dumbPy
Copy link
Contributor Author

dumbPy commented Aug 7, 2020

I was trying to keep things more close to beancount by using hashing instead of _get_key_from_entry and such. But again, I have a habit of over-engineering stuff. All resolved as per your suggestions.

@jbms
Copy link
Owner

jbms commented Aug 7, 2020

Looks great, thanks! The only thing remaining is to fix the mypy type annotation errors:

https://travis-ci.com/github/jbms/beancount-import/jobs/369680675#L617

@dumbPy
Copy link
Contributor Author

dumbPy commented Aug 7, 2020

mypy errors fixed. But the basic_test is failing in python 3.5 but passing in 3.6 and 3.7
I wonder what went wrong 😢

@jbms
Copy link
Owner

jbms commented Aug 7, 2020

It looks like the issue is that the order of the entries happens to differ in Python 3.5 --- in Python 3.5, iterating over a dict returns elements in a random order. In python 3.6 and later, it is guaranteed to be in insertion order. To ensure a deterministic result, you could use OrderedDict instead of defaultdict. Unfortunately you lose the convenience of defaultdict then, but you can just use the setdefault method which is almost as convenience.

@dumbPy
Copy link
Contributor Author

dumbPy commented Aug 7, 2020

Fixed!!

@jbms jbms merged commit 5748edd into jbms:master Aug 7, 2020
@jbms
Copy link
Owner

jbms commented Aug 7, 2020

Thanks!

One other thing would be to add this source to the list of sources in the README.md file, and to expand the comment at the top of the file to include for detailed instructions on how to use it, like the other data sources have, but that can be a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Import from bean-extract / beancount.ingest? Generic CSV importer
3 participants