Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate behavior #6

Merged
merged 32 commits into from
Aug 28, 2023
Merged

Duplicate behavior #6

merged 32 commits into from
Aug 28, 2023

Conversation

hoxbro
Copy link
Member

@hoxbro hoxbro commented Jun 30, 2023

Observed that when you have identical _field_df in your database, it will drop duplicates. This is not a desired behavior.

Example on main:
image

import pandas as pd
import numpy as np
from holonote.annotate import Annotator

annotator1 = Annotator(spec={"TIME": np.datetime64}, fields=["description"])

times = pd.date_range("2022-06-09", "2022-06-13")
for t1, t2 in zip(times[:-1], times[1:]):
    annotator1.set_range(t1, t2)
    annotator1.add_annotation(description='A programmatically defined annotation')
    
annotator1.commit()
annotator1.annotation_table._field_df

annotator2 = Annotator(spec={"TIME": np.datetime64}, fields=["description"])
annotator2.annotation_table._field_df

Though, if I do use drop duplicates (like in this PR), a new annotator with the same connector will pretty fast. This is why the second test is marked xfail.

image

import pandas as pd
import numpy as np
from holonote.annotate import Annotator, SQLiteDB

conn = SQLiteDB(filename="test.db")
annotator1 = Annotator(spec={"TIME": np.datetime64}, fields=["description"], connector=conn)

times = pd.date_range("2022-06-09", "2022-06-13")
for t1, t2 in zip(times[:-1], times[1:]):
    annotator1.set_range(t1, t2)
    annotator1.add_annotation(description='A programmatically defined annotation')
    
annotator1.commit()
annotator1.annotation_table._field_df

for _ in range(10):
    annotator2 = Annotator(spec={"TIME": np.datetime64}, fields=["description"], connector=conn)
    print(annotator2.df.shape[0])

@hoxbro hoxbro marked this pull request as draft June 30, 2023 20:41
@hoxbro
Copy link
Member Author

hoxbro commented Jul 5, 2023

This is how I have tried to implement the different classes in this PR:
1

@hoxbro
Copy link
Member Author

hoxbro commented Jul 5, 2023

Update on results from the first post:
image

image

@philippjfr
Copy link
Member

Just wanted to chime in here and say that your diagram and the way you structured things makes sense to me. From a high-level and after reviewing the code I do think the hierarchy should be inverted, i.e. the Connector class was owning too much logic related to the domain and moving methods like load_annotation_table onto the AnnotationTable makes perfect sense to me. Overall it seems to me like the Annotator is user-level API, the Tables are created internally to manage the domain specific data and the Connector should be a minimal wrapper around whatever DB technology it is wrapping without owning domain-specific details. A Connector owning the Table therefore seems to invert the logic and muddles the separation of concerns of the different classes, i.e. this PR makes things more compositional which is a good thing.

The changes you made to fix the drop_duplicates issues also make sense to me. The way I understand it the initialization could pull data multiple times which is both inefficient and meant that drop_duplicates was needed as a workaround. Avoiding this altogether is the correct solution although even a simple fix that correctly deduplicated by comparing UUIDs rather than the contents of each annotation would have been okay too.

@jlstevens
Copy link
Contributor

Architecturally, I've summarized how I think we can improve things in #9. As for this PR, I think the ideas are sensible enough, e.g. moving more out of connector to the annotation table (such as load_annotation_table) but perhaps this can be rethought if we introduce DataSource (or similar).

As for the issue with the dataframe growing in size, I think that can be addressed either in the current or existing architecture: the use of drop_duplicates() was meant to be a temporary hack (and there is a FIXME comment there already). What needs to happen is that when each Annotator calls load, the annotation table needs to be loaded using the columns appropriate for that annotator kdim_dtype specification. Subsequent tables would load in more data from more columns into the annotation table as appropriate.

self.define_points([kdim1, kdim2], df[f'point_{kdim1}'], df[f'point_{kdim2}'])
self.clear_edits()

def add_schema_to_conn(self, conn):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here I think is that I don't think all connectors should support generate_schema. This code was on the SQLiteDB connector as I think only SQLiteDB should auto-generate schemas in this way (for convenience). For 'real' databases, you probably can't (and shouldn't) just create a bunch of tables automatically that are likely supposed to be transient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think that's necessarily true, I have seen many applications in customer environments that autogenerate tables whether that's Postgres or Snowflake.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it can be optional..but I wouldn't expect every connector to support this and SQLite would still be the default.

Base automatically changed from use_pytest to main August 9, 2023 09:35
@hoxbro hoxbro marked this pull request as ready for review August 28, 2023 15:47
@hoxbro hoxbro merged commit 90169aa into main Aug 28, 2023
9 checks passed
@hoxbro hoxbro deleted the fix_duplicate branch August 28, 2023 15:47
@hoxbro hoxbro restored the fix_duplicate branch September 20, 2023 10:04
@hoxbro hoxbro deleted the fix_duplicate branch September 20, 2023 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants