Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: add a set data type #264

Open
dieortin opened this issue Sep 5, 2023 · 10 comments · May be fixed by #290
Open

proposal: add a set data type #264

dieortin opened this issue Sep 5, 2023 · 10 comments · May be fixed by #290

Comments

@dieortin
Copy link

dieortin commented Sep 5, 2023

A set data structure, similar to the one in Python, would be a useful addition. For example, it is sometimes required to ensure no duplicate values exist in a list, and a set is the most ergonomic solution.

names = ["Thomas", "David", "Michelle", "David"] # contains duplicate names
unique_names = set(names) # building a set from the list removes duplicates

Other common use cases are:

  • Efficiently checking for membership
  • Filtering
  • Performing mathematical operations (difference, intersection, symmetric difference, union...) among sets

Thoughts? Should I make an attempt at writing a proposal?

Use case example

As an example, I have run into this need while adding automatic dependency exploration in my fork of kklochkov/rules_qt, a ruleset for building Qt applications with Bazel. When generating a rule instantiation, the deps attribute cannot contain duplicate values. Because my feature analyzes Qt shared libraries in the system, and they can be duplicated with different names (for example, libQt6Quick.so and libQt6Quick.so.6.4.3) This introduces the need to enforce each dependency is only added once. Of course, there are ways to work around this problem, but it would be easily solved with a set.

@dieortin dieortin changed the title Add a set data type proposal: add a set data type Sep 5, 2023
@ndmitchell
Copy link
Contributor

The standard recommendation thus far has been to use a dict with None as the values. I've used that in practice and while it's not quite as nice, it's perfectly workable.

@suganoi
Copy link

suganoi commented Sep 8, 2023

@dieortin @ndmitchell It's already half-done in Go: google/starlark-go#492

@dieortin
Copy link
Author

@suganoi that's nice! I have also just seen #20, but it doesn't seem to include the existence of set.

IMO the set API selected for the Go version makes sense and should be added to the general spec. @laurentlb what do you think?

@laurentlb
Copy link
Contributor

laurentlb commented Jan 6, 2024

For the record, the example above can be written:

names = ["Thomas", "David", "Michelle", "David"] # contains duplicate names
unique_names = {n: None for n in names}.keys()

In many cases, people want sets just to remove duplicates; this workaround may be sufficient for that use-case.

In Bazel BUILD files, there's a tradition of using lists in most places. For example, deps and srcs are lists, even when the order is not relevant and duplicates are forbidden. In Bazel rules, sets should usually be avoided, and depsets should be used most of the time. So there were concerns that adding a native type set could cause more problems to Bazel. When a set is really needed in Bazel, a set implementation is provided in skylib (https://github.com/bazelbuild/bazel-skylib/blob/main/lib/new_sets.bzl).

As mentioned previously, Starlark-Go has a implementation for sets (https://github.com/google/starlark-go/blob/master/doc/spec.md#set - https://github.com/google/starlark-go/blob/90ade8b19d09805d1b91a9687198869add6dfaa1/starlark/library.go#L990). It can be used using a flag.

@tetromino
Copy link
Collaborator

I'm not opposed to a set data type, although if we do add it, I think we ought to also support the literal {foo, bar, ...} syntax. I regularly field questions from new macro and rule authors asking for where is Starlark's set data type :) In Bazel, it would entail a bit of work to chase down the native methods that currently require a starlark list as argument but which could also take a set as argument where it makes sense. And of course we'd need clear documentation for when to use sets vs. depsets.

@stepancheg
Copy link
Contributor

stepancheg commented Apr 10, 2024

I think we ought to also support the literal {foo, bar, ...} syntax

This is not as easy decision: people constantly make mistakes by writing {} assuming it is an empty set (and in many cases it behaves like empty set, e.g. in in operator or iterator, making it harder to debug). And writing set([foo, bar, ...]) is OK-ish, and has no perf difference if optimized properly.

I suggest to have separate discussions/decisions for set() builtin, and {...} syntax for sets.

@tomaspipes
Copy link

Given the discussions ... is there any status on the set data type feature?

@tetromino
Copy link
Collaborator

@tomaspipes - given the positive reaction from the community here and elsewhere, I want to take a few days this summer to add a set data type to the Java implementation, and have it enabled in Bazel 8. The Go implementation already has set behind a flag, we may want to flip the flag to enabled.

@tetromino
Copy link
Collaborator

tetromino commented Nov 8, 2024

Didn't have time to look at this in summer - but I finally wrote an experimental implementation for Java / Bazel: https://bazel.googlesource.com/bazel/+/0651a84f3b1a8d8ad06eb30d96ed298e14dde74f

In light of this, let's hammer out a language spec. Current feature matrix:

  • set literals, set comprehensions
    • Supported: python3
    • Not supported: starlark-go, starlark-rust, bazel
  • comparison operators implement a partial order via subset relation
    • Supported: python3, starlark-go
    • Not supported: starlark-rust
    • Undecided: bazel
  • update() method
    • Supported: python3, starlark-rust, bazel
    • Not supported: starlark-go
  • isdisjoint(),intersection_update(), difference_update(), symmetric_difference_update()
    • Supported: python3, bazel
    • Not supported: starlark-go, starlark-rust
  • Multiple-argument form of union(), intersection(), difference() (and corresponding _update methods, if implemented):
    • Supported: python3, bazel
    • Not supported: starlark-go, starlark-rust

For the spec, I propose the following:

  • Do not support set literals and set comprehensions, per @stepancheg
  • Do not support comparison operators - their behavior is surprising/confusing
  • Support isdisjoint(), update(), intersection_update(), difference_update(), symmetric_difference_update() - since these are generally useful, easy to implement, and help memory efficiency
  • Not sure about multiple-argument form of union(), intersection(), difference() etc. - these are generally useful, but while they are completely trivial to implement in java, it seems that it would be rather awkward to add them starlark-go (due to how starlark-go does its argument parsing for builtins), so perhaps we can live without them. (Did not look at starlark-rust.)

tetromino added a commit to tetromino/starlark that referenced this issue Nov 9, 2024
Modeled on the Python 3 set type and the existing implementations in Go,
Rust, and the proposed one for Java, with the following differences:

* set literals and set comprehensions *not* supported (unlike python3)
* `copy()` method *not* supported (unlike python3) because we do not
  have it on lists or dictionaries.
* comparison operators *not* supported (unlike starlark-go and python3)
* `update()` method supported (unlike starlark-go)
* `isdisjoint()`,`intersection_update()`, `difference_update()`,
  `symmetric_difference_update()` method supported (unlike starlark-go
  and starlark-rust)
* multiple-argument form of `union()`, `intersection()`, `difference()`
  and corresponding _update methods supported (unlike starlark-go and
  starlark-rust).

Fixes bazelbuild#264
@tetromino tetromino linked a pull request Nov 9, 2024 that will close this issue
@tetromino
Copy link
Collaborator

Proposed spec (with multiple-argument form of union() etc.): #290

copybara-service bot pushed a commit to bazelbuild/bazel that referenced this issue Nov 12, 2024
Experimental feature guarded by --experimental_enable_starlark_set.

This is the Bazel implementation of bazelbuild/starlark#264

Replicates the Python 3 set API and (in almost all respects) the starlark-go
implementation, with the notable exception of explicitly not supporting
(partial) ordering of sets.

Note that there are no set-valued attributes (nor plans to add any), and
set-valued select() expressions are not supported.

RELNOTES: Add a set data type to Starlark, guarded by the --experimental_enable_starlark_set flag.
PiperOrigin-RevId: 695886977
Change-Id: Id1e178bd3dd354619f188c4375d8a1256bd55f75
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants