-
Notifications
You must be signed in to change notification settings - Fork 27
geoarrow-schema crate, plus implementation of upstream ExtensionType trait #1015
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome!
The main thing that strikes me here is that there are very few tests compared to the volume of code. I know you have tests in other places and in Python, but the perception when browsing this code base is that the test coverage is non-existent or low and I wonder if this refactor is a reasonable juncture to ensure that there's at least one "is it plugged in" check for each function.
#[derive(Default, Debug, Clone, Copy, PartialEq, Eq, Hash)] | ||
pub enum CoordType { | ||
/// Interleaved coordinates. | ||
#[default] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applying this default somewhere else would probably better communicate the tradeoffs associated with this choice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's valid. Resolved in 8105925
(#1015)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where else would you suggest defining this default?
In #1016 I'm working on refactoring the existing codebase to use this PR and I'm finding myself copy-pasting CoordType::Interleaved
in >100 places where I previously used CoordType::default
. I'm not sure this is a net positive because all of those places may not agree with each other. Most of these places have a way to override the default (and all of them should), but I don't think all end users should always need to make a choice which to use.
But because of rust's orphan rules, if any implementation of Default
exists, it has to be done in this crate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to see it as an argument whenever creating an array or type, although I'm certainly guilty of embedding the opposite default in a few places in Python/C++ (although the argument is still surfaced everywhere, for example in geoarrow.pyarrow.as_geoarrow(..., coord_type = )
.
Mostly, I think you may have identified 50+ places where there are functions I can't use because you've assumed a default that I don't want. Obviously removing those is probably going to cause havoc so feel free not to do it today, but I think the answer is to parameterize those functions and push the default to a higher level (e.g., Python) where there's a chance to configure it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe:
impl CoordType {
fn default_interleaved() -> Self { Self::Interleaved }
}
...so it's easier to find later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I'm saying that Rust users (of those lower level functions) should choose and that there shouldn't be a default. For higher level functions (e.g., read_flatgeobuf()
) I get that forcing a choice is not useful.
Can you template the builders on <T>
and then do:
pub trait BuildersTrait <T> {
fn point() -> PointBuilder<T> { ... }
}
struct Builders {}
impl BuildersTrait<CoordType::Interleaved> for Builders {}
(maybe templates don't quite work like this in rust 😬 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I'm saying that Rust users (of those lower level functions) should choose and that there shouldn't be a default
I agree that Rust users should always be able to choose. But that some Rust users still won't want to always have to choose. This is the rationale behind having both a PointBuilder::new
and PointBuilder::new_with_options
. You probably would always use new_with_options
, while some other Rust user might not care what coordinate layout the underlying data use.
Can you template the builders on
<T>
and then do:
If you "template" it, you'd have to make it generic, which means you'd have to make CoordType
a trait instead of an enum. The difference is that making it generic allows you to/makes you choose at compile time, while having the enum is more flexible.
Unless the raw coordinate buffers are also able to be parametrized at compile time with a generic (so there's never an enum lookup at runtime for how the coordinates are laid out), I don't think it's worth the development overhead to make CoordType
generic.
I went through a lot of pain previously to take out the O: OffsetSizeType
generic on all our array types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! I'm with you that not everybody needs to make this choice...I'm just trying to workshop how we can make the act of applying a default choice more visible/documented/not copy/pasted into dozens of places. I think we can!
How about wrapping Dimensions
, CoordType
, and Metadata
into a struct:
/// Maybe a better name
struct TypeOptions {
coord_type: CoordType,
dimensions: Dimensions,
metadata: Metadata
}
impl TypeOptions {
/// Documentation of what defaults are applied and why
fn defaults() -> Self { ... }
}
I think we could also use it for the ExtensionType
implementations for all the places where these values are repeated. Then the builder definitions collapse to:
impl PointBuilder {
fn new(options: TypeOptions) -> Self { ... }
fn new_with_capacity(capacity: usize, options: TypeOptions) -> Self { ... }
}
This is also roughly how DataFusion handles its read_xxx()
functions (e.g., let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I can give that change a shot if you're open to it!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about wrapping Dimensions, CoordType, and Metadata into a struct:
#1021 implements this. We don't have to create a new struct since we can reuse the existing type structs from the geoarrow_schema
crate, which I think is clean.
Yeah, I mean that's a fair critique. I think for a lot of the existing codebase, I kept wondering if the design was wrong, and wanted to hold off putting too much effort into tests if they'd need to all be refactored. And similarly a part of the codebase could be considered "experimental". But I agree this I added doctests for conversions to I think at this point it's stable and well tested and documented enough to merge, and we can iterate in follow up PRs (#1016 started the big refactor in the rest of the crate to actually use this) |
pub struct WkbType { | ||
metadata: Arc<Metadata>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paleolimbot Should there be a 1:1 correspondence from geoarrow type to DataType
?
For all other types, the CoordType
and Dimension
uniquely identify the corresponding DataType
but this is not true for WkbType
and WktType
currently.
Should WkbType
and WktType
contain an attribute like large
that defines whether it uses i32 or i64 offsets? I assume not because we want supports_data_type
to allow either DataType
presumably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might need that to ensure that we can resolve a WkbType()
into a storage type? Maybe the parameterization is just data_type
here...I believe that DataFusion's Parquet reader reads binary as binary views by default (it at least reads strings as string views by default) and so it's at least plausible that we'd support that some day.
I will try to help here but I do think it's reasonable for every module to at least have a |
I think there are a few problems with
I think a way to break through this impasse is to select relatively small, well-defined subsets of GeoArrow functionality and break them into subcrates. For one, this forces more thought about public APIs because across crates you can't access any In a spectrum of more stable to less stable
Is there a well-defined subset of this project that you think you would use if it were more stable? Is there a piece that you're interested in that we could work on together to make stable? |
### Changes to `geoarrow_schema` - Add accessors `crs_type` and `crs_value` to `Crs` - Make `CrsType` and `Edges` `Copy` - Implement `TryFrom<&Field> for Metadata` - Add `PointType::to_field` and similar for all geometry types - Fix `Geometry` type not containing GeometryCollection ### Change list - Update `NativeType` enum variants to hold the new types from #1015. This means that the `NativeType` now stores GeoArrow metadata as well, not just coordinate type and dimension. - Update geometry array structs to hold the new extension type from #1015. So `PointArray::data_type` is a `PointType`, not a `NativeType::Point`. - Update imports to import from `geoarrow_schema`. We **don't** currently re-export types from `geoarrow_schema` through `geoarrow` so that it's clear what pieces of code do or do not actually need the full `geoarrow` crate. - Skip some Python tests temporarily to get CI to pass. Post #1015
Change list
geoarrow_schema
crateCoordType
,Dimension
,Crs
,Edges
,Metadata
, to new crate.ExtensionType
trait for each of these new structs.Closes #987
cc @paleolimbot
Will probably leave for future PRs integrating this into existing
geoarrow
code because it'll require changing our current use ofNativeType
(see #1016 draft)