Add include_empty keyword argument #250

Open · wants to merge 2 commits into base: master

Conversation

HDictus (Contributor) commented Nov 29, 2023

When using get-like functions (edges.pathway_edges, nodes.get) it is common to concatenate the outputs of the functions to mimic the old bluepysnap interface. However, if no nodes/edges satisfy the query, instead of resulting in an empty dataframe, this approach will result in an error, because no dataframes at all are yielded by the generator.

The include_empty keyword argument allows the user to specify that they still want to see a dataframe if the query turns up nothing.

One example where this is useful for me is when I use the length of a connection dataframe to determine connection probability.
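
For illustration, a minimal self-contained sketch of the failure mode described above; fake_get is a hypothetical stand-in for a get-like generator, not the real SNAP API:

import pandas as pd

def fake_get(matches):
    # hypothetical stand-in for a get-like generator:
    # yields one dataframe per matching population, nothing otherwise
    if matches:
        yield pd.DataFrame({"layer": [1, 2]})

print(pd.concat(fake_get(True)))   # fine: a 2-row dataframe
try:
    pd.concat(fake_get(False))     # nothing is yielded by the generator
except ValueError as err:
    print(err)                     # "No objects to concatenate"

With include_empty=True, the generator would also yield empty dataframes for non-matching populations, so the concatenation would produce an empty dataframe instead of raising.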

codecov-commenter commented:

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c094e4f) 100.00% compared to head (1d96b29) 100.00%.

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #250   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           32        32           
  Lines         2449      2449           
=========================================
  Hits          2449      2449           
Flag      Coverage Δ
pytest    100.00% <100.00%> (ø)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

mgeplf (Contributor) commented Nov 30, 2023

Thanks for the PR.

it is common to concatenate the outputs of the functions to mimic the old bluepysnap interface

I'm generally a bit leery of adding more arguments to functions, since it's a bit of a slippery slope (but I'm not ruling this out).
Out of curiosity, is it not possible to do something like this to filter out the empty ones:

dfs = (df for df in foo.get({"layer": 10}) if not df.empty)

HDictus (Contributor, Author) commented Dec 5, 2023

dfs = (df for df in foo.get({"layer": 10}) if not df.empty)

The current behavior does not yield anything if the query comes up empty, so foo.get({"layer": 10}) would give an empty iterator. If .get instead yielded empty dataframes, I would not need the keyword argument. However, that would be a potentially breaking change, which is why I've added a kwarg with a default of False. Having the iterator include empty dataframes without a kwarg would be preferable for me as well; if you think it would be OK to change the behavior that way, I would be happy to change the PR to that.

joni-herttuainen (Contributor) commented Dec 5, 2023

If I understood correctly, the use case is to pd.concat the dataframes. If so, and you have a wrapper that does the concatenation, you can do something like:

def _concat(iterator):
    dfs = [df for _, df in iterator]
    return pd.concat(dfs) if dfs else pd.DataFrame([])  # or if you just want to check the length, you might just have "else []"

df = _concat(c.nodes.get(query))

or, you can use dict(...).values():

def _concat(iterator):
    dfs = dict(iterator).values() 
    return pd.concat(dfs) if dfs else pd.DataFrame([])  # or "else []" if just for length

You can even one-line it, albeit readability takes a hit:

def _concat(iterator):
    return pd.concat(d.values()) if (d := dict(iterator)) else pd.DataFrame([])
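
A quick check of the empty case with any of the variants above (iter(()) is a stand-in for a get call that matches nothing):

empty = _concat(iter(()))
empty.empty  # True
len(empty)   # 0, no ValueError raised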

HDictus (Contributor, Author) commented Dec 5, 2023

The issue with this is that downstream code which expects the dataframe to have columns will break because it has none, so I would need to add an "if df.empty, do something else" clause to everything.

joni-herttuainen (Contributor) commented:

No worries, you can set columns on empty dataframes, too:

columns = list(circuit.nodes.property_names)
pd.DataFrame(columns=columns)
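
For instance, downstream code that only touches columns keeps working on such a frame (the column names below are placeholders for circuit.nodes.property_names):

import pandas as pd

df = pd.DataFrame(columns=["mtype", "layer"])  # placeholder property names
df["layer"]                 # empty Series, no KeyError
len(df)                     # 0
df.groupby("mtype").size()  # empty Series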

I'm afraid I don't understand why you need to process empty dataframes. If you can point me to your code, especially the parts where an empty dataframe would cause issues, I can take a look and see what we can do so they needn't be processed.

HDictus (Contributor, Author) commented Dec 6, 2023

I often only retrieve a subset of the columns, so I would need to pass this subset to the function as well. In this case I may be best off wrapping the .get method call as a whole:

def get_dataframe(networkobj, *args, properties=None, attr='get'):
    dfs = [df for _, df in getattr(networkobj, attr)(*args, properties=properties)]
    if len(dfs) == 0:
        return pd.DataFrame(columns=properties or networkobj.property_names)
    return pd.concat(dfs)
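
Hypothetical usage of this wrapper, assuming circuit is a bluepysnap Circuit and the query and property names exist in its node populations:

df = get_dataframe(circuit.nodes, {"layer": 2}, properties=["mtype", "layer"])
len(df)  # 0, with the expected columns, when the query matches nothing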

So strictly speaking it isn't necessary to have the empty dataframes, just simpler and more concise.

joni-herttuainen (Contributor) commented:

If wrapping the .get (or other functionality) solves your issue, nice.

strictly speaking it isn't necessary to have the empty dataframes, just simpler and more concise

Yeah, I get it. But, to be fair, it depends on the point of view: adding, passing, and handling extra keyword arguments to cover one particular use case is not simpler or more concise for us. If it were absolutely necessary, I'd rather always return empty dataframes than add extra behaviour controlled by an extra argument. If it isn't, I wouldn't break the current behavior.

Anyway, I still don't understand why you need to keep processing the empty dataframes and, again, if you'd like me to take a look at the code to see whether that can be avoided, let me know; I'm happy to help.
