Skip to content

geoarrow-schema crate, plus implementation of upstream ExtensionType trait #1015

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Apr 3, 2025

Conversation

kylebarron
Copy link
Member

@kylebarron kylebarron commented Apr 2, 2025

Change list

  • Add new geoarrow_schema crate
  • Move CoordType, Dimension, Crs, Edges, Metadata, to new crate.
  • Add tests for metadata serialization and deserialization
  • Add new geometry-type-specific structs: BoxType, GeometryCollectionType, GeometryType, LineStringType, MultiLineStringType, MultiPointType, MultiPolygonType, PointType, PolygonType, WkbType, WktType
  • Implement the upstream ExtensionType trait for each of these new structs.

Closes #987

cc @paleolimbot

Will probably leave for future PRs integrating this into existing geoarrow code because it'll require changing our current use of NativeType (see #1016 draft)

@kylebarron kylebarron marked this pull request as ready for review April 2, 2025 22:46
@kylebarron kylebarron changed the title geoarrow-schema crate geoarrow-schema crate, plus implementation of upstream ExtensionType trait Apr 2, 2025
Copy link
Contributor

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome!

The main thing that strikes me here is that there are very few tests compared to the volume of code. I know you have tests in other places and in Python, but the perception when browsing this code base is that the test coverage is non-existent or low and I wonder if this refactor is a reasonable juncture to ensure that there's at least one "is it plugged in" check for each function.

#[derive(Default, Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum CoordType {
/// Interleaved coordinates.
#[default]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applying this default somewhere else would probably better communicate the tradeoffs associated with this choice.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's valid. Resolved in 8105925 (#1015)

Copy link
Member Author

@kylebarron kylebarron Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where else would you suggest defining this default?

In #1016 I'm working on refactoring the existing codebase to use this PR and I'm finding myself copy-pasting CoordType::Interleaved in >100 places where I previously used CoordType::default. I'm not sure this is a net positive because all of those places may not agree with each other. Most of these places have a way to override the default (and all of them should), but I don't think all end users should always need to make a choice which to use.

But because of rust's orphan rules, if any implementation of Default exists, it has to be done in this crate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to see it as an argument whenever creating an array or type, although I'm certainly guilty of embedding the opposite default in a few places in Python/C++ (although the argument is still surfaced everywhere, for example in geoarrow.pyarrow.as_geoarrow(..., coord_type = ).

Mostly, I think you may have identified 50+ places where there are functions I can't use because you've assumed a default that I don't want. Obviously removing those is probably going to cause havoc so feel free not to do it today, but I think the answer is to parameterize those functions and push the default to a higher level (e.g., Python) where there's a chance to configure it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe:

impl CoordType {
  fn default_interleaved() -> Self { Self::Interleaved } 
}

...so it's easier to find later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I'm saying that Rust users (of those lower level functions) should choose and that there shouldn't be a default. For higher level functions (e.g., read_flatgeobuf()) I get that forcing a choice is not useful.

Can you template the builders on <T> and then do:

pub trait BuildersTrait <T> {
  fn point() -> PointBuilder<T> { ... }
}

struct Builders {}
impl BuildersTrait<CoordType::Interleaved> for Builders {}

(maybe templates don't quite work like this in rust 😬 )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I'm saying that Rust users (of those lower level functions) should choose and that there shouldn't be a default

I agree that Rust users should always be able to choose. But that some Rust users still won't want to always have to choose. This is the rationale behind having both a PointBuilder::new and PointBuilder::new_with_options. You probably would always use new_with_options, while some other Rust user might not care what coordinate layout the underlying data use.

Can you template the builders on <T> and then do:

If you "template" it, you'd have to make it generic, which means you'd have to make CoordType a trait instead of an enum. The difference is that making it generic allows you to/makes you choose at compile time, while having the enum is more flexible.

Unless the raw coordinate buffers are also able to be parametrized at compile time with a generic (so there's never an enum lookup at runtime for how the coordinates are laid out), I don't think it's worth the development overhead to make CoordType generic.

I went through a lot of pain previously to take out the O: OffsetSizeType generic on all our array types.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I'm with you that not everybody needs to make this choice...I'm just trying to workshop how we can make the act of applying a default choice more visible/documented/not copy/pasted into dozens of places. I think we can!

How about wrapping Dimensions, CoordType, and Metadata into a struct:

/// Maybe a better name
struct TypeOptions {
  coord_type: CoordType,
  dimensions: Dimensions,
  metadata: Metadata
}

impl TypeOptions {
  /// Documentation of what defaults are applied and why
  fn defaults() -> Self { ... }
}

I think we could also use it for the ExtensionType implementations for all the places where these values are repeated. Then the builder definitions collapse to:

impl PointBuilder {
  fn new(options: TypeOptions) -> Self { ... }
  fn new_with_capacity(capacity: usize, options: TypeOptions) -> Self { ... }
}

This is also roughly how DataFusion handles its read_xxx() functions (e.g., let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I can give that change a shot if you're open to it!)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about wrapping Dimensions, CoordType, and Metadata into a struct:

#1021 implements this. We don't have to create a new struct since we can reuse the existing type structs from the geoarrow_schema crate, which I think is clean.

@kylebarron
Copy link
Member Author

kylebarron commented Apr 3, 2025

The main thing that strikes me here is that there are very few tests compared to the volume of code.

Yeah, I mean that's a fair critique. I think for a lot of the existing codebase, I kept wondering if the design was wrong, and wanted to hold off putting too much effort into tests if they'd need to all be refactored. And similarly a part of the codebase could be considered "experimental".

But I agree this geoarrow-schema crate is a good candidate to have a stable API, and so I agree it makes more sense to put effort into docs.

I added doctests for conversions to DataType and tests for serializing to/from Metadata. I think that's good for now. It's kinda hard to test the GeometryType operations because the data type is so big that you end up copying all the code to create the data type.

I think at this point it's stable and well tested and documented enough to merge, and we can iterate in follow up PRs (#1016 started the big refactor in the rest of the crate to actually use this)

@kylebarron kylebarron enabled auto-merge (squash) April 3, 2025 06:20
Comment on lines +1288 to +1290
pub struct WkbType {
metadata: Arc<Metadata>,
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paleolimbot Should there be a 1:1 correspondence from geoarrow type to DataType?
For all other types, the CoordType and Dimension uniquely identify the corresponding DataType but this is not true for WkbType and WktType currently.

Should WkbType and WktType contain an attribute like large that defines whether it uses i32 or i64 offsets? I assume not because we want supports_data_type to allow either DataType presumably.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need that to ensure that we can resolve a WkbType() into a storage type? Maybe the parameterization is just data_type here...I believe that DataFusion's Parquet reader reads binary as binary views by default (it at least reads strings as string views by default) and so it's at least plausible that we'd support that some day.

@kylebarron kylebarron merged commit d7cfa0c into main Apr 3, 2025
21 checks passed
@kylebarron kylebarron deleted the kyle/arrow-schema branch April 3, 2025 06:26
@paleolimbot
Copy link
Contributor

I will try to help here but I do think it's reasonable for every module to at least have a tests for us and external contributors to build on + set expectations for code quality (we do expect contributors to write tests, right?). For low-level things like this it's often faster to rewrite it the part of it you need with tests than to use the dependency (i.e., because serious users need to depend on tested code, you're basically forcing your users to write the tests for you).

@kylebarron
Copy link
Member Author

I think there are a few problems with geoarrow-rs:

  • there's a wide variety of code at different states of production-readiness
  • Partly due to me trying to do too much, struggling a bit with how best to model GeoArrow in general, and also learning Rust through this whole process, there's a whole lot of code that is decidedly not production ready.
  • Because it's all one Rust crate geoarrow, there's no clear lines between what is (closer to) production-ready and what is not.

I think a way to break through this impasse is to select relatively small, well-defined subsets of GeoArrow functionality and break them into subcrates. For one, this forces more thought about public APIs because across crates you can't access any pub(crate) attributes. It lets us more clearly document which subsets we expect to be more stable and tested. And external users like yourself can start to build on only those pieces without even bringing in the dependencies for the full geoarrow crate.

In a spectrum of more stable to less stable

  • Core types conforming to the spec, like what is now in geoarrow-schema
  • "primitive" Array layouts like Point/LineString etc
  • "complex" Array layouts like Geometry and GeometryCollection
  • Array builders
  • Conversions between GeoArrow memory and geo, WKB, and WKT
  • Reading/writing Parquet
  • Reading/writing FlatGeobuf
  • Conversions between GeoArrow memory and geos
  • Geometry operations using geo
  • Casting
  • Geometry operations using geos
  • Reading/writing other geo formats
  • Reading/writing to PostGIS

Is there a well-defined subset of this project that you think you would use if it were more stable? Is there a piece that you're interested in that we could work on together to make stable?

kylebarron added a commit that referenced this pull request Apr 3, 2025
### Changes to `geoarrow_schema`

- Add accessors `crs_type` and `crs_value` to `Crs`
- Make `CrsType` and `Edges` `Copy`
- Implement `TryFrom<&Field> for Metadata`
- Add `PointType::to_field` and similar for all geometry types
- Fix `Geometry` type not containing GeometryCollection

### Change list

- Update `NativeType` enum variants to hold the new types from #1015.
This means that the `NativeType` now stores GeoArrow metadata as well,
not just coordinate type and dimension.
- Update geometry array structs to hold the new extension type from
#1015. So `PointArray::data_type` is a `PointType`, not a
`NativeType::Point`.
- Update imports to import from `geoarrow_schema`. We **don't**
currently re-export types from `geoarrow_schema` through `geoarrow` so
that it's clear what pieces of code do or do not actually need the full
`geoarrow` crate.
- Skip some Python tests temporarily to get CI to pass.

Post #1015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor NativeType to support upstream ExtensionType information
2 participants