Duplicate behavior #6

hoxbro · 2023-06-30T20:33:00Z

I observed that when the database contains rows with identical content in _field_df, the duplicates are dropped. This is not desired behavior.

Example on main:

import pandas as pd
import numpy as np
from holonote.annotate import Annotator

annotator1 = Annotator(spec={"TIME": np.datetime64}, fields=["description"])

times = pd.date_range("2022-06-09", "2022-06-13")
for t1, t2 in zip(times[:-1], times[1:]):
    annotator1.set_range(t1, t2)
    annotator1.add_annotation(description='A programmatically defined annotation')
    
annotator1.commit()
annotator1.annotation_table._field_df

annotator2 = Annotator(spec={"TIME": np.datetime64}, fields=["description"])
annotator2.annotation_table._field_df

However, if I don't use drop_duplicates (as in this PR), the dataframe of a new annotator created with the same connector grows pretty fast. This is why the second test is marked xfail.

import pandas as pd
import numpy as np
from holonote.annotate import Annotator, SQLiteDB

conn = SQLiteDB(filename="test.db")
annotator1 = Annotator(spec={"TIME": np.datetime64}, fields=["description"], connector=conn)

times = pd.date_range("2022-06-09", "2022-06-13")
for t1, t2 in zip(times[:-1], times[1:]):
    annotator1.set_range(t1, t2)
    annotator1.add_annotation(description='A programmatically defined annotation')
    
annotator1.commit()
annotator1.annotation_table._field_df

for _ in range(10):
    annotator2 = Annotator(spec={"TIME": np.datetime64}, fields=["description"], connector=conn)
    print(annotator2.df.shape[0])

hoxbro · 2023-07-05T18:14:36Z

This is how I have tried to implement the different classes in this PR:

hoxbro · 2023-07-05T18:17:26Z

Update on results from the first post:

philippjfr · 2023-07-10T09:56:37Z

Just wanted to chime in here and say that your diagram and the way you structured things make sense to me. From a high level, and after reviewing the code, I do think the hierarchy should be inverted: the Connector class was owning too much logic related to the domain, and moving methods like load_annotation_table onto the AnnotationTable makes perfect sense to me. Overall, it seems to me that the Annotator is the user-level API, the Tables are created internally to manage the domain-specific data, and the Connector should be a minimal wrapper around whatever DB technology it wraps, without owning domain-specific details. A Connector owning the Table therefore inverts the logic and muddles the separation of concerns between the classes; this PR makes things more compositional, which is a good thing.

The changes you made to fix the drop_duplicates issues also make sense to me. The way I understand it, the initialization could pull data multiple times, which was inefficient and meant that drop_duplicates was needed as a workaround. Avoiding this altogether is the correct solution, although even a simple fix that deduplicated correctly by comparing UUIDs rather than the contents of each annotation would have been okay too.
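To illustrate the difference between the two kinds of deduplication mentioned above, here is a minimal pandas sketch. It does not use holonote itself: the `make_fields` helper, the `uuid` index name, and the simulated double load are assumptions made for the example. It only relies on the fact that `DataFrame.drop_duplicates` compares column values and ignores the index.

```python
import uuid

import pandas as pd


def make_fields(n):
    """Build n annotations with identical descriptions but distinct UUID keys (hypothetical helper)."""
    ids = [uuid.uuid4().hex for _ in range(n)]
    return pd.DataFrame(
        {"description": ["A programmatically defined annotation"] * n},
        index=pd.Index(ids, name="uuid"),
    )


field_df = make_fields(4)

# Simulate an initialization that pulls the same rows from the database twice.
loaded_twice = pd.concat([field_df, field_df])

# Content-based deduplication: distinct annotations with equal fields collapse into one row.
by_content = loaded_twice.drop_duplicates()

# Key-based deduplication: only rows whose UUID was already seen are removed.
by_uuid = loaded_twice[~loaded_twice.index.duplicated(keep="first")]

print(len(by_content), len(by_uuid))  # prints "1 4"
```

Deduplicating on the key keeps all four annotations, whereas deduplicating on content silently discards three of them, which matches the behavior reported in the first post.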

jlstevens · 2023-07-10T17:49:17Z

Architecturally, I've summarized how I think we can improve things in #9. As for this PR, I think the ideas are sensible enough, e.g. moving more out of the connector and into the annotation table (such as load_annotation_table), but perhaps this can be rethought if we introduce DataSource (or similar).

As for the issue with the dataframe growing in size, I think that can be addressed in either the current or the proposed architecture: the use of drop_duplicates() was meant to be a temporary hack (and there is a FIXME comment there already). What needs to happen is that when each Annotator calls load, the annotation table is loaded using the columns appropriate for that annotator's kdim_dtype specification. Subsequent tables would then load more data from more columns into the annotation table as appropriate.
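One possible reading of the column-wise loading described above is sketched below in plain pandas. Everything here is hypothetical and not holonote's API: the `TableSketch` class, its `load` method, and the `start_TIME` / `end_TIME` / `point_x` column names are stand-ins chosen for illustration. The idea is that each load call only fetches columns that are not yet present, so repeated loads add columns but never add rows.

```python
import pandas as pd

# Stand-in for the rows stored in the database, keyed by annotation UUID.
stored = pd.DataFrame(
    {
        "start_TIME": pd.to_datetime(["2022-06-09", "2022-06-10"]),
        "end_TIME": pd.to_datetime(["2022-06-10", "2022-06-11"]),
        "point_x": [1.0, 2.0],
    },
    index=pd.Index(["a1", "b2"], name="uuid"),
)


class TableSketch:
    """Hypothetical annotation table that loads columns on demand."""

    def __init__(self, source):
        self._source = source
        # Start with the keys only; no region columns are loaded yet.
        self.df = pd.DataFrame(index=source.index)

    def load(self, columns):
        # Fetch only the columns this annotator needs and that are still missing.
        missing = [c for c in columns if c not in self.df.columns]
        if missing:
            self.df = self.df.join(self._source[missing])
        return self.df[list(columns)]


table = TableSketch(stored)
table.load(["start_TIME", "end_TIME"])  # first annotator: a TIME range
table.load(["start_TIME", "end_TIME"])  # second annotator, same spec: nothing new is loaded
table.load(["point_x"])                 # third annotator: a point on x
print(table.df.shape)
```

Because rows are joined on the UUID index rather than appended, the table keeps one row per annotation no matter how many annotators call load.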

jlstevens · 2023-07-10T17:52:39Z

holonote/annotate/table.py

+                        self.define_points([kdim1, kdim2], df[f'point_{kdim1}'], df[f'point_{kdim2}'])
+        self.clear_edits()
+
+    def add_schema_to_conn(self, conn):


The problem here, I think, is that not all connectors should support generate_schema. This code was on the SQLiteDB connector because I think only SQLiteDB should auto-generate schemas in this way (for convenience). For 'real' databases, you probably can't (and shouldn't) just automatically create a bunch of tables that are likely supposed to be transient.

I don't think that's necessarily true; I have seen many applications in customer environments that autogenerate tables, whether in Postgres or Snowflake.

I suppose it can be optional, but I wouldn't expect every connector to support this, and SQLite would still be the default.
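A rough sketch of what "optional" schema generation could look like, using only the standard-library sqlite3 module. The class names, the `can_generate_schema` flag, and the `ensure_table` helper are all hypothetical and are not part of holonote; the point is only that the table-side code can ask the connector whether it supports auto-generation instead of assuming it.

```python
import sqlite3


class BaseConnectorSketch:
    """Hypothetical minimal connector: no automatic table creation by default."""

    can_generate_schema = False

    def generate_schema(self, table_name, columns):
        raise NotImplementedError(
            f"{type(self).__name__} does not auto-generate schemas"
        )


class SQLiteConnectorSketch(BaseConnectorSketch):
    """Hypothetical SQLite connector that opts in to schema generation."""

    can_generate_schema = True

    def __init__(self, filename=":memory:"):
        self.con = sqlite3.connect(filename)

    def generate_schema(self, table_name, columns):
        # columns maps a column name to its SQLite type declaration.
        cols = ", ".join(f'"{name}" {sqltype}' for name, sqltype in columns.items())
        self.con.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({cols})')
        self.con.commit()


def ensure_table(connector, table_name, columns):
    """Create the table only if the connector supports it; report whether it did."""
    if connector.can_generate_schema:
        connector.generate_schema(table_name, columns)
        return True
    return False


columns = {"uuid": "TEXT PRIMARY KEY", "description": "TEXT"}
sqlite_conn = SQLiteConnectorSketch()
created = ensure_table(sqlite_conn, "annotations", columns)
skipped = ensure_table(BaseConnectorSketch(), "annotations", columns)
print(created, skipped)  # prints "True False"
```

With this shape, a connector for a managed database can leave the flag off and expect the tables to exist already, while SQLite keeps the convenient default.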

hoxbro added 17 commits June 29, 2023 18:41

Rewrite TestBasicRange1DAnnotator to pytest

78730df

Rewrite TestBasicRange2DAnnotator to pytest

37e8ab2

Convert test_table_region_df to pytest

309b6e4

Run ruff --select=pt

0fc2624

Update TestBasicPoint1DAnnotator

8f5bbd9

Update TestBasicPoint2DAnnotator

4c40e2b

Split up into basic and advanced

c707e32

Rewrite test_connectors

ee08b18

Remove commented out code

266ce67

Update test_annotator_advanced

e2347b9

Update test to give all output

674b4ba

update database fields

7bd831c

Refactor pytest fixtures

1fa0813

DRY fixtures

23fef76

Clean up test_connectors

cfe5e01

Remove database in tests

42c25e1

Duplicate behavior

d2d8a66

hoxbro marked this pull request as draft June 30, 2023 20:41

hoxbro added 12 commits July 4, 2023 11:57

Support all database for test_add_three_rows_delete_one

bd4bf38

Also set key_list=None

Merge branch 'use_pytest' into fix_duplicate

1d49c19

Remove duplicate code in test and remove list in init

cf55cb4

Move loading of annotation_table into its class

6bd72c0

Add init

c4d48bb

Remove mutable input

37ae038

Add return_commits

7c2c6ca

Fix multiple_region_annotator test

236baa2

Fix test_multiplot_add_annotation and remove annotation_tables from c…

f714169

…onnector

Don't initialize connector

4a38c19

Combine test_reconnect

c02b412

Remove all drop_duplicates

79a8efe

Clean up

fddae93

hoxbro force-pushed the fix_duplicate branch from ba4c42f to fddae93 Compare July 10, 2023 10:21

jlstevens reviewed Jul 10, 2023

Base automatically changed from use_pytest to main August 9, 2023 09:35

Merge branch 'main' into fix_duplicate

c729118

hoxbro mentioned this pull request Aug 17, 2023

Proposal for new Annotator API #16

Closed

Fix empty mask bug

7c5227b

hoxbro marked this pull request as ready for review August 28, 2023 15:47

hoxbro merged commit 90169aa into main Aug 28, 2023
9 checks passed

hoxbro deleted the fix_duplicate branch August 28, 2023 15:47

hoxbro restored the fix_duplicate branch September 20, 2023 10:04

hoxbro deleted the fix_duplicate branch September 20, 2023 10:25
