Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise on out of bounds range access #1061

Merged
merged 8 commits into from
Feb 3, 2025
Merged

Conversation

billylanchantin
Copy link
Member

Fixes: #1005

In the example from that issue we now get:

iex> df = Explorer.DataFrame.new(a: [1, 2, 3])
iex> df[1..1]
# ** (ArgumentError) range 1..1 is out of bounds for a dataframe with 1 column(s)
#     (explorer 0.11.0-dev) lib/explorer/shared.ex:221: Explorer.Shared.to_existing_columns/3
#     (explorer 0.11.0-dev) lib/explorer/data_frame.ex:387: Explorer.DataFrame.fetch/2
#     (elixir 1.18.1) lib/access.ex:322: Access.get/3
#     iex:3: (file)

@billylanchantin
Copy link
Member Author

The bounds check on ranges was a little fiddly since Enum.slice/2 accepts negative bounds. I think I got it but I'm not 100% sure.

[slice_min, slice_pseudo_max] =
[first, last]
|> Enum.map(&if(&1 < 0, do: n_cols + &1, else: &1))
|> Enum.sort()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because ranges can be reverse ordered, I am not sure how much we want to rely on sort. My suggestion would be to validate first and last independently:

for pos <- [first, last] do
  cond do
    pos >= 0 and pos < n_cols -> :ok
    pos < 0 and pos >= -n_cols -> :ok
    true -> raise "..."
  end
end

WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enum.slice/2 says it doesn't support reverse ordered ranges yet. Do you want to support them here in anticipation of their future support in Elixir 2.0?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good point. If the code above works (big if), then it should be simpler and more future proof. Worth giving it a try?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! Unfortunately though, I think your algorithm doesn't work for the following situation:

df = DF.new(a: [1, 2, 3], b: ["a", "b", "c"], c: [4.0, 5.1, 6.2])

n_cols = DF.n_columns(df) #=> 3
slice = 1..10//10

assert slice.last > n_cols
assert DF.to_columns(df[slice]) == %{"b" => ["a", "b", "c"]}

Here slice is a valid subset of the columns because even though slice.last > n_cols, Enum.to_list(slice) == [1]. This only happens when step > 1 of course.

We can still try and work out a future proof approach though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can adjust the last for a range start..finish//step, the last element can be calculated using this formula:
start + (div(finish - start, step) * step)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it may be best to leave this to until we add a slice! to Elixir? 🤔

I think I'd prefer to merge this if possible. The semantics of out-of-bounds are currently different between Nx and Explorer. This PR attempts to bring them in line.

The other one is an empty range altogether. And then there are ranges that raise when out of index.

Sorry if I'm not understanding. To my eyes, 10..0//1 is both empty on its own and out of bounds for a 2-column dataframe and so should raise. Can you clarify what you think should happen with these two cases?

df = DF.new(a: [1], b: [2]) # 2 columns

df[1..0//1] # empty, but within bounds
df[9..0//1] # empty, but out of bounds

I think df[1..0//1] should return an empty dataframe and df[9..0//1] should raise. The other option is that all empty ranges (post normalizing) should return an empty dataframe regardless of bounds.

I admit this is a weird case and I don't feel strongly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that, if 1..10//10 should not raise for being outbounds because 10 is not actually included, so should not 10..1//1 for the same reason?

I agree that it doesn’t really matter which, but I think for those cases it should be consistent. I think always checking first and last, regardless of step, is the simplest way to go about it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that, if 1..10//10 should not raise for being outbounds because 10 is not actually included, so should not 10..1//1 for the same reason?

Ah great point. It's like whack-a-mole with these cases!

I think always checking first and last, regardless of step, is the simplest way to go about it?

I think you're right. I'll go ahead and do that.

The results will differ slightly from Enum.slice/2. But we need to do what makes sense for us.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably best to think of slice as an implementation detail. I think my proposed implementation should work for checking indexes too!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably best to think of slice as an implementation detail.

That makes sense!

I think my proposed implementation should work for checking indexes too!

It did! Hope it's ok that I played a little code golf with it. I realized yours was equivalent to:

if max(abs(first), abs(last)) >= n_cols do
  raise ...
end

This won't be supported until Elixir 2.0. But
we can make it future proof now.
Copy link
Member

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super clean!

@billylanchantin billylanchantin merged commit 0d59ddc into main Feb 3, 2025
3 checks passed
@billylanchantin billylanchantin deleted the bl-raise-on-oob-range branch February 3, 2025 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Accessing a DataFrame with an out of bounds range should raise
2 participants