kylebarron/arrow-js-ffi

Latest commit: d6d0101 · Sep 20, 2023

History

50 Commits
Sep 2, 2022
Sep 20, 2023
Sep 1, 2023
Jul 3, 2023
Jul 3, 2023
Aug 16, 2023
Aug 17, 2022
Aug 16, 2023
Sep 19, 2023
Jul 9, 2023
Jul 9, 2023
Sep 19, 2023

Repository files navigation

arrow-js-ffi

Interpret Arrow memory across the WebAssembly boundary without serialization.

Why?

Arrow is a high-performance memory layout for analytical programs. Since Arrow's memory layout is defined to be the same in every implementation, programs that use Arrow in WebAssembly use the exact same layout that Arrow JS implements! This means we can use plain ArrayBuffers to move highly structured data to and from WebAssembly memory, entirely avoiding serialization.

I wrote an interactive blog post that goes into more detail on why this is useful and how this library implements Arrow's C Data Interface in JavaScript.
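
As a rough illustration of the idea (not code from this library), the sketch below uses a standalone WebAssembly.Memory in place of a real module's memory. The byte offset and the values are made up; the point is only that a typed array over the same ArrayBuffer reads the data in place.

```typescript
// Stand-in for the memory exported by a Wasm module (1 page = 64 KiB).
const memory = new WebAssembly.Memory({ initial: 1 });

// Pretend a Wasm program wrote four Int32 values starting at byte offset 16.
// (The offset and the values are made up for this illustration.)
const dataPtr = 16;
new Int32Array(memory.buffer, dataPtr, 4).set([10, 20, 30, 40]);

// On the JS side, a typed array over the same ArrayBuffer sees those values
// without copying or deserializing anything.
const view = new Int32Array(memory.buffer, dataPtr, 4);
const sum = view.reduce((acc, v) => acc + v, 0);

// A write through one view is visible through the other: both alias the
// same bytes of Wasm memory.
new Int32Array(memory.buffer, dataPtr, 4)[0] = 11;
const firstAfterWrite = view[0];
```

Arrow buffers (validity bitmaps, offsets, values) are contiguous byte ranges of this kind, which is why they can be wrapped rather than re-encoded.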

Usage

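This package exports three functions: parseField for parsing the ArrowSchema struct into an arrow.Field, parseVector for parsing the ArrowArray struct into an arrow.Vector, and parseRecordBatch for parsing an ArrowArray struct plus an ArrowSchema struct into an arrow.RecordBatch.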

parseField

Parse an ArrowSchema C FFI struct into an arrow.Field instance. The resulting Field supplies the data type (field.type) that parseVector requires.

  • buffer (ArrayBuffer): The buffer of the WebAssembly.Memory instance to read from (that is, memory.buffer).
  • ptr (number): The numeric pointer in buffer where the C struct is located.
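import { parseField } from "arrow-js-ffi";
const WASM_MEMORY: WebAssembly.Memory = ...
// fieldPtr is the pointer to an ArrowSchema struct, obtained from your Wasm module.
const field = parseField(WASM_MEMORY.buffer, fieldPtr);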
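
To give a feel for what parsing the struct involves, here is a minimal sketch that reads a few members of an ArrowSchema struct straight out of an ArrayBuffer. It is not this library's implementation. It assumes a 32-bit Wasm target (4-byte little-endian pointers, int64 members aligned to 8 bytes), and peekSchema and readCString are hypothetical helper names. The struct itself is written by hand so the example runs without a Wasm module.

```typescript
// Byte offsets of ArrowSchema members on a 32-bit target. These offsets are
// an assumption of this sketch (4-byte pointers, 8-byte aligned int64).
const SCHEMA_FORMAT = 0; // const char* format
const SCHEMA_NAME = 4; // const char* name
const SCHEMA_FLAGS = 16; // int64_t flags (follows 4 bytes of padding)
const SCHEMA_N_CHILDREN = 24; // int64_t n_children

// Read a NUL-terminated UTF-8 string starting at byte offset `ptr`.
function readCString(buffer: ArrayBuffer, ptr: number): string {
  const bytes = new Uint8Array(buffer);
  let end = ptr;
  while (bytes[end] !== 0) end++;
  return new TextDecoder().decode(bytes.subarray(ptr, end));
}

// Hypothetical helper: pull a few raw members out of an ArrowSchema struct.
function peekSchema(buffer: ArrayBuffer, ptr: number) {
  const view = new DataView(buffer);
  return {
    format: readCString(buffer, view.getUint32(ptr + SCHEMA_FORMAT, true)),
    name: readCString(buffer, view.getUint32(ptr + SCHEMA_NAME, true)),
    flags: view.getBigInt64(ptr + SCHEMA_FLAGS, true),
    nChildren: Number(view.getBigInt64(ptr + SCHEMA_N_CHILDREN, true)),
  };
}

// Build a fake struct by hand; a real one would be written by the Wasm side.
const schemaMemory = new WebAssembly.Memory({ initial: 1 });
const schemaBytes = new Uint8Array(schemaMemory.buffer);
schemaBytes.set(new TextEncoder().encode("i\0"), 100); // format "i" = int32
schemaBytes.set(new TextEncoder().encode("col\0"), 104); // field name
const schemaView = new DataView(schemaMemory.buffer);
const schemaPtr = 8; // made-up struct location
schemaView.setUint32(schemaPtr + SCHEMA_FORMAT, 100, true);
schemaView.setUint32(schemaPtr + SCHEMA_NAME, 104, true);
schemaView.setBigInt64(schemaPtr + SCHEMA_FLAGS, BigInt(2), true); // 2 = nullable

const peeked = peekSchema(schemaMemory.buffer, schemaPtr);
```

Mapping the format string to an arrow.DataType and recursing into children is the part parseField takes care of.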

parseVector

Parse an ArrowArray C FFI struct into an arrow.Vector instance. Multiple Vector instances can be joined to make an arrow.Table.

  • buffer (ArrayBuffer): The buffer of the WebAssembly.Memory instance to read from (that is, memory.buffer).
  • ptr (number): The numeric pointer in buffer where the C struct is located.
  • dataType (arrow.DataType): The type of the vector to parse. This is retrieved from field.type on the result of parseField.
  • copy (boolean): If true, will copy data across the Wasm boundary, allowing you to delete the copy on the Wasm side. If false, the resulting arrow.Vector objects will be views on Wasm memory. This requires careful usage as the arrays will become invalid if the memory region in Wasm changes.
import { parseVector } from "arrow-js-ffi";
const WASM_MEMORY: WebAssembly.Memory = ...
// Create views on Wasm memory (no copy)
const wasmVector = parseVector(WASM_MEMORY.buffer, arrayPtr, field.type);
// Copy arrays into JS instead of creating views
const copiedVector = parseVector(WASM_MEMORY.buffer, arrayPtr, field.type, true);
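
The hazard behind the copy flag can be reproduced with plain typed arrays. The sketch below does not use this library; it only shows, under made-up offsets and values, how a view on Wasm memory becomes unusable once the memory grows, while a copy survives.

```typescript
const wasmMemory = new WebAssembly.Memory({ initial: 1 });
const valuesPtr = 64; // made-up location of an Int32 data buffer
new Int32Array(wasmMemory.buffer, valuesPtr, 3).set([1, 2, 3]);

// Analogous to copy = false: a view that aliases Wasm memory.
const aliased = new Int32Array(wasmMemory.buffer, valuesPtr, 3);
// Analogous to copy = true: slice() allocates a fresh JS-owned buffer.
const owned = aliased.slice();

// Growing the memory (which an allocation inside Wasm can trigger) detaches
// the ArrayBuffer that `aliased` was created on.
wasmMemory.grow(1);

const aliasedLength = aliased.length; // a view on a detached buffer is empty
const ownedValues = Array.from(owned); // the copy is unaffected

// The bytes still exist in Wasm memory, but they must be reached through the
// new `wasmMemory.buffer`, not through the old view.
const refreshed = Array.from(new Int32Array(wasmMemory.buffer, valuesPtr, 3));
```

Memory growth is only one way for views to go stale; freeing or overwriting the region on the Wasm side has the same effect without detaching anything.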

parseRecordBatch

Parse an ArrowArray C FFI struct plus an ArrowSchema C FFI struct into an arrow.RecordBatch instance. Note that the underlying array and field must be of Struct type. In essence, a single Struct array stands in for a RecordBatch: each child of the Struct becomes one column of the batch.

  • buffer (ArrayBuffer): The buffer of the WebAssembly.Memory instance to read from (that is, memory.buffer).
  • arrayPtr (number): The numeric pointer in buffer where the array C struct is located.
  • schemaPtr (number): The numeric pointer in buffer where the field C struct is located.
  • copy (boolean): If true, will copy data across the Wasm boundary, allowing you to delete the copy on the Wasm side. If false, the columns of the resulting arrow.RecordBatch will be views on Wasm memory. This requires careful usage as the arrays will become invalid if the memory region in Wasm changes.
import { parseRecordBatch } from "arrow-js-ffi";
const WASM_MEMORY: WebAssembly.Memory = ...
// Pass `true` to copy arrays across the boundary instead of creating views.
const recordBatch = parseRecordBatch(WASM_MEMORY.buffer, arrayPtr, fieldPtr, true);
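
The way a Struct schema lines up with the columns of a RecordBatch can be sketched by walking the children of an ArrowSchema struct. This is an illustration rather than this library's code: it assumes a 32-bit Wasm target, childNames and cString are hypothetical helpers, and the structs are laid out by hand at made-up addresses.

```typescript
// Offsets assume a 32-bit Wasm target (an assumption of this sketch).
const OFF_NAME = 4; // const char* name
const OFF_N_CHILDREN = 24; // int64_t n_children
const OFF_CHILDREN = 32; // struct ArrowSchema** children

// Read a NUL-terminated UTF-8 string starting at byte offset `ptr`.
function cString(buffer: ArrayBuffer, ptr: number): string {
  const bytes = new Uint8Array(buffer);
  let end = ptr;
  while (bytes[end] !== 0) end++;
  return new TextDecoder().decode(bytes.subarray(ptr, end));
}

// Hypothetical helper: names of the child fields of a Struct schema, which
// correspond to the column names of the record batch.
function childNames(buffer: ArrayBuffer, structSchemaPtr: number): string[] {
  const view = new DataView(buffer);
  const n = Number(view.getBigInt64(structSchemaPtr + OFF_N_CHILDREN, true));
  const children = view.getUint32(structSchemaPtr + OFF_CHILDREN, true);
  const names: string[] = [];
  for (let i = 0; i < n; i++) {
    // `children` points at an array of 4-byte pointers, one per child struct.
    const childPtr = view.getUint32(children + i * 4, true);
    names.push(cString(buffer, view.getUint32(childPtr + OFF_NAME, true)));
  }
  return names;
}

// Hand-built parent schema with two children named "x" and "y".
const batchMemory = new WebAssembly.Memory({ initial: 1 });
const batchBytes = new Uint8Array(batchMemory.buffer);
const batchView = new DataView(batchMemory.buffer);
batchBytes.set(new TextEncoder().encode("x\0y\0"), 512); // "x" at 512, "y" at 514
const parentPtr = 8;
const childX = 64;
const childY = 128;
const childList = 256;
batchView.setUint32(childX + OFF_NAME, 512, true);
batchView.setUint32(childY + OFF_NAME, 514, true);
batchView.setUint32(childList, childX, true);
batchView.setUint32(childList + 4, childY, true);
batchView.setBigInt64(parentPtr + OFF_N_CHILDREN, BigInt(2), true);
batchView.setUint32(parentPtr + OFF_CHILDREN, childList, true);

const columns = childNames(batchMemory.buffer, parentPtr);
```

The ArrowArray side mirrors this shape: the Struct array's children are the per-column arrays.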

Type Support

Most of the unsupported types should be pretty straightforward to implement; they just need some testing.

Primitive Types

  • Null
  • Boolean
  • Int8
  • Uint8
  • Int16
  • Uint16
  • Int32
  • Uint32
  • Int64
  • Uint64
  • Float16
  • Float32
  • Float64

Binary & String

  • Binary
  • Large Binary (Not implemented by Arrow JS but supported by downcasting to Binary.)
  • String
  • Large String (Not implemented by Arrow JS but supported by downcasting to String.)
  • Fixed-width Binary

Decimal

  • Decimal128 (failing a test)
  • Decimal256 (failing a test)

Temporal Types

  • Date32
  • Date64
  • Time32
  • Time64
  • Timestamp (with timezone)
  • Duration
  • Interval

Nested Types

  • List
  • Large List (Not implemented by Arrow JS but supported by downcasting to List.)
  • Fixed-size List
  • Struct
  • Map
  • Dense Union
  • Sparse Union
  • Dictionary-encoded arrays

Extension Types

  • Field metadata is preserved.

TODO:

  • Call the release callback on the C structs. This requires figuring out how to call C function pointers from JS.
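
One possible direction, sketched under several assumptions: on Wasm targets a C function pointer is typically an index into the module's function table, so if the module exports that table, JS can look the function up with WebAssembly.Table.get and call it. Whether the table is exported, and under what name, depends on the toolchain. The export names "memory" and "table", the helper callRelease, and the 32-bit offset of the release member are all assumptions here. The tiny hand-assembled module exists only so the sketch runs on its own.

```typescript
// Hand-assembled Wasm module. It exports "memory" and "table" and places one
// function at table index 1:
//   (func (param i32) (i32.store offset=40 (local.get 0) (i32.const 0)))
// This is a stand-in release callback that nulls the `release` member, assumed
// to sit at byte offset 40 of ArrowSchema on a 32-bit target.
const moduleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic, version
  0x01, 0x05, 0x01, 0x60, 0x01, 0x7f, 0x00, // type section: (i32) -> ()
  0x03, 0x02, 0x01, 0x00, // function section: one function of type 0
  0x04, 0x04, 0x01, 0x70, 0x00, 0x02, // table section: funcref, min 2
  0x05, 0x03, 0x01, 0x00, 0x01, // memory section: min 1 page
  0x07, 0x12, 0x02, // export section: two exports
  0x06, 0x6d, 0x65, 0x6d, 0x6f, 0x72, 0x79, 0x02, 0x00, // "memory"
  0x05, 0x74, 0x61, 0x62, 0x6c, 0x65, 0x01, 0x00, // "table"
  0x09, 0x07, 0x01, 0x00, 0x41, 0x01, 0x0b, 0x01, 0x00, // element: table[1] = func 0
  0x0a, 0x0b, 0x01, 0x09, 0x00, // code section: one body, no locals
  0x20, 0x00, 0x41, 0x00, 0x36, 0x02, 0x28, 0x0b, // local.get 0; i32.const 0; i32.store; end
]);

const releaseInstance = new WebAssembly.Instance(new WebAssembly.Module(moduleBytes), {});
const releaseMemory = releaseInstance.exports.memory as WebAssembly.Memory;
const funcTable = releaseInstance.exports.table as WebAssembly.Table;

const RELEASE_OFFSET = 40; // assumed offset of `release` in ArrowSchema (32-bit)

// Hypothetical helper: call the struct's release callback if it is still set.
function callRelease(
  memory: WebAssembly.Memory,
  table: WebAssembly.Table,
  structPtr: number,
): void {
  const releasePtr = new DataView(memory.buffer).getUint32(structPtr + RELEASE_OFFSET, true);
  if (releasePtr === 0) return; // a null callback means already released
  const release = table.get(releasePtr) as (ptr: number) => void;
  release(structPtr);
}

// Fake struct at a made-up address whose release member points at table index 1.
const fakeStructPtr = 128;
new DataView(releaseMemory.buffer).setUint32(fakeStructPtr + RELEASE_OFFSET, 1, true);
callRelease(releaseMemory, funcTable, fakeStructPtr);
const releaseAfter = new DataView(releaseMemory.buffer).getUint32(
  fakeStructPtr + RELEASE_OFFSET,
  true,
);
```

For ArrowArray the release member sits at a different offset, and any views created without copying must not be used after the callback runs.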