EVF Value Vectors
Drill's value vectors are at the core of the Drill execution engine. Vectors are of multiple types. We'll work through each type one-by-one.
Without the enhanced vector framework (EVF), when you write code that writes to, or reads from, vectors, you must take care to handle all the cases described below. With the EVF, you use a simple JSON-like interface to read or write data, and the framework handles the details of each kind of vector.
The EVF is designed to manage all of these cases automatically so that you can focus on your task: writing data into vectors, reading data from vectors, or both. For example, when writing a scan operator, you will write data into vectors. (This is an unfortunate aspect of Drill terminology: scan operators read data and write vectors.) If you write a client, you will just read data. If you write an internal operator, you will both read and write data.
The term "scalar" above refers to vectors that hold a single value per row.
Non-nullable fixed-width vectors: provide a simple array of values:
Non-nullable variable-width vectors: a combination of two vectors: a buffer that contains variable-size chunks of data, along with an offset vector that points to the start of each data value:
(Thanks to the Drill documentation team for the images!)
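To make the offset-vector idea concrete, here is a small standalone sketch in plain Java (ordinary arrays standing in for Drill's buffers, not the actual vector classes) showing how an offset array and a byte buffer together encode the values "fred" and "barney":
import java.nio.charset.StandardCharsets;
...
// Illustrative only: plain Java arrays standing in for Drill's buffers.
byte[] data = "fredbarney".getBytes(StandardCharsets.UTF_8);
int[] offsets = { 0, 4, 10 };  // start of each value, plus the end of the last
// Value i occupies bytes [offsets[i], offsets[i + 1]) of the data buffer.
for (int i = 0; i < offsets.length - 1; i++) {
  System.out.println(new String(data, offsets[i], offsets[i + 1] - offsets[i],
      StandardCharsets.UTF_8));  // prints "fred", then "barney"
}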
In both cases, the call to `set()` (for writing) or `get()` (for reading) converts the vector value to or from the corresponding Java data type. Typically the data type is obvious (`int`, `String`, `double`, etc.).
The accessors convert most integral types (TinyInt, SmallInt, Int, UInt1, UInt2) to a Java `int`. (There is no advantage to having, say, `setShort()` or `setByte()` methods.) Larger integral types (BigInt, UInt4) use `long`. Floating point values (Float4, Float8) use `double`.
Date/time types use the Joda classes. (Limitations of the Java 8 date/time classes prevent their use with Drill's vectors.)
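For example, assuming a row writer over columns of the corresponding types (the column names here are invented for illustration), the calls look the same regardless of the underlying vector width:
// Hypothetical columns chosen to illustrate the conversions above.
writer.scalar("tiny_col").setInt(12);              // TinyInt, SmallInt, Int, UInt1, UInt2
writer.scalar("big_col").setLong(123456789012L);   // BigInt, UInt4
writer.scalar("price").setDouble(9.99);            // Float4, Float8
writer.scalar("name").setString("fred");           // VarChar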
Thus far we've shown how to work with non-nullable columns and values. Our `name` column is nullable, however. How do we work with nulls? Drill defines two kinds of nullable vectors:
Nullable, fixed-width vectors: a combination of two fixed-width vectors: one for the data, another for the "null bits" (really, the is-set bits: the value is 1 if set, 0 if null).
Note that, in actual practice, the is-set flags are bytes, not bits as suggested by the diagram.
Nullable variable-width vectors: a combination of two vectors (one of which itself contains two vectors): an is-set vector and a variable-width vector.
When writing, we can either set a column to null explicitly:
nameWriter.setNull();
Or, we can simply omit writing any value to the column:
idWriter.setInt(1);
// No value set for the `name` column.
writer.save();
When reading, we must first ask if the column is NULL:
ScalarReader nameReader = reader.scalar("name");
if (nameReader.isNull()) {
print("null");
} else {
print(nameReader.getString());
}
Arrays in Drill use an offset vector, similar to variable-width vectors. To model this, array accessors introduce another level of structure, as in JSON: the array writer and reader. You use the array accessor to traverse the array, then a value-specific accessor (typically scalar) to work with each value. First define a schema:
@Test
public void arrayExample() {
final TupleMetadata schema = new SchemaBuilder()
.add("id", MinorType.INT)
.addArray("names", MinorType.VARCHAR)
.buildSchema();
}
We can use the `RowSetBuilder` with some convenience functions:
import static org.apache.drill.test.rowSet.RowSetUtilities.strArray;
...
final RowSet rowSet = new RowSetBuilder(fixture.allocator(), schema)
.addRow(1, strArray("apple", "manzana"))
.addRow(2, strArray("watermelon", "sandía"))
.build();
Or, we can use the column writers directly:
RowSetWriter writer = rs.writer();
ArrayWriter nameArray = writer.array("names");
ScalarWriter nameWriter = nameArray.scalar();
writer.scalar("id").setInt(1);
nameWriter.setString("apple");
nameArray.save();
nameWriter.setString("manzana");
nameArray.save();
writer.save();
Notes:
- Notice the two-level structure as described earlier: the `ArrayWriter` contains a `ScalarWriter`.
- The `ArrayWriter` iterates over the array. The writer starts out pointing to the first entry for the current row. Call `save()` to advance to the next position.
- As before, the writers live for the life of the `RowSetWriter` and can be cached if desired.
To read an array:
final RowSetReader reader = rowSet.reader();
ArrayReader arrayReader = reader.array("names");
ScalarReader nameReader = arrayReader.scalar();
while (reader.next()) {
print(reader.scalar("id").getInt());
while (arrayReader.next()) {
print(nameReader.getString());
}
}
The final type you should know about is the Drill map. As we've said multiple times, a Drill "Map" is not a true map: it is closer to a C or Hive "struct". Every row has the same set of columns. (In a true map, each row would have an independent set of name/value pairs.) The map is, in fact, little different from Drill's top-level row: both contain columns indexed by name (and, in the EVF, by position). As a result, both are built using the same mechanisms:
- `TupleSchema` to describe both a row and a struct.
- `TupleWriter` to write both a row and a struct.
- `TupleReader` to read both a row and a struct.
In fact, working with maps is nearly identical to working with rows (except that maps don't contain the row-specific methods.)
To define a schema:
@Test
public void mapExample() {
final TupleMetadata schema = new SchemaBuilder()
.add("id", MinorType.INT)
.addMap("names")
.addNullable("english", MinorType.VARCHAR)
.addNullable("spanish", MinorType.VARCHAR)
.resumeSchema()
.buildSchema();
}
Notes:
- The `addMap()` method creates the map and returns a builder for that map.
- Build the map just as you built the top-level row.
- Call the `resumeSchema()` method to mark the map as complete and to return to building the top-level row.
You can build a map using the `RowSetBuilder`:
import static org.apache.drill.test.rowSet.RowSetUtilities.mapValue;
...
final RowSet rowSet = new RowSetBuilder(fixture.allocator(), schema)
.addRow(1, mapValue("apple", "manzana"))
.addRow(2, mapValue("watermelon", "sandía"))
.build();
Using the column accessors:
RowSetWriter writer = rs.writer();
TupleWriter nameMap = writer.tuple("names");
ScalarWriter englishWriter = nameMap.scalar("english");
ScalarWriter spanishWriter = nameMap.scalar("spanish");
writer.scalar("id").setInt(1);
englishWriter.setString("apple");
spanishWriter.setString("manzana");
writer.save();
Notes:
- You access writers within the map exactly as you do for those in the row.
- There is no `save()` method to call for the map since there is exactly one map value per row.
To read a map:
final RowSetReader reader = rowSet.reader();
TupleReader mapReader = reader.tuple("names");
ScalarReader englishReader = mapReader.scalar("english");
ScalarReader spanishReader = mapReader.scalar("spanish");
while (reader.next()) {
print(reader.scalar("id").getInt());
print(englishReader.getString());
print(spanishReader.getString());
}
Drill provides a number of advanced data types that you can also use:
- Repeated Map: An array of maps. Represented as an array accessor that contains a map accessor.
- Repeated List: An array of arrays; a nested collection of multiple offset vectors on top of a data vector. Represented as an array accessor that contains an array accessor.
- Union: An unordered collection of vectors keyed by type, with "0-or-1 hot" semantics. Represented as a "Variant" accessor, which acts like a map accessor except that the members are indexed by type rather than name.
- List: Essentially a repeated union. Represented as an array accessor that contains a union accessor.
Of these, only Repeated Map is fully supported in Drill. Although the accessors work for all types (some required considerable bug fixes to make the underlying vectors work), most Drill operators do not support these vector types. Unions and Lists are listed as "experimental" in the documentation (and have been for many years.)
The List type is particularly complex since it can act like a Repeated vector if it has just one type, or a Repeated Union if it has multiple types.
The general advice is to stick to scalar types, repeated scalars, maps and repeated maps. Expect considerable work throughout Drill to get the other types to work.
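For completeness, here is a sketch of the one advanced type that is fully supported: the Repeated Map. It reuses the tools shown above; the `addMapArray()` builder method and the `mapArray()` helper are assumptions to verify against your Drill version (see `SchemaBuilder` and `RowSetUtilities`):
import static org.apache.drill.test.rowSet.RowSetUtilities.mapArray;
import static org.apache.drill.test.rowSet.RowSetUtilities.mapValue;
...
// Sketch only: "names" is an array of (english, spanish) structs per row.
final TupleMetadata schema = new SchemaBuilder()
    .add("id", MinorType.INT)
    .addMapArray("names")
      .addNullable("english", MinorType.VARCHAR)
      .addNullable("spanish", MinorType.VARCHAR)
      .resumeSchema()
    .buildSchema();
final RowSet rowSet = new RowSetBuilder(fixture.allocator(), schema)
    .addRow(1, mapArray(
        mapValue("apple", "manzana"),
        mapValue("pear", "pera")))
    .addRow(2, mapArray(mapValue("watermelon", "sandía")))
    .build();
When using the writers directly, the same two-level pattern applies: the array accessor for "names" contains a tuple accessor for each map entry, and you call the array writer's `save()` after filling each entry, just as with the scalar array above.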