EVF Tutorial Type Conversion
The log format plugin was created before Drill added "provided schemas" in Drill 1.16. Instead, the log format plugin created its own ad-hoc type system in which the user defines the type of each column in the storage plugin config. The plugin provides its own logic (via a set of "column" classes) to convert the strings for each regex group into the type specified in the plugin config.
When converting to the EVF, we can leverage the EVF's built-in conversion rules. We just have to tell the EVF:
- The output schema: the column types we want the scan operator to produce (and send to the user).
- The reader schema: the column types that the reader itself can provide, in our case, strings.
Normally, the output schema comes from a provided schema file, but we can also supply our own output schema, as we will do here.
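To make the distinction concrete, here is a hedged illustration (the column names and types are invented for this example; the real schemas are built from the plugin config below). The output schema carries the types the user asked for, while the reader schema describes what the regex reader can actually produce: a nullable VARCHAR per capturing group.

// Illustration only; uses the same SchemaBuilder, TupleMetadata and
// MinorType classes as the plugin code shown below.
TupleMetadata outputSchema = new SchemaBuilder()
    .addNullable("year", MinorType.INT)
    .addNullable("event_date", MinorType.DATE)
    .buildSchema();

TupleMetadata readerSchema = new SchemaBuilder()
    .addNullable("year", MinorType.VARCHAR)
    .addNullable("event_date", MinorType.VARCHAR)
    .buildSchema();

Given these two schemas, the EVF inserts a string-to-INT and a string-to-DATE conversion between what the reader writes and the vectors it actually fills.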
Any one scan operator can read multiple files, and the reader and output schemas apply to all of its readers. So, let's define them in the plugin at the point where we create the reader framework. We'll do so by moving and adapting the schema-building code that, up until now, lived in the batch reader:
@Override
protected FileScanBuilder frameworkBuilder(
    OptionManager options, EasySubScan scan) throws ExecutionSetupException {

  // Pattern and schema identical across readers; define
  // up front.
  Pattern pattern = setupPattern();
  Matcher m = pattern.matcher("test");
  int capturingGroups = m.groupCount();
  TupleMetadata outputSchema = defineOutputSchema(capturingGroups);
  TupleMetadata readerSchema = defineReaderSchema(outputSchema);
  ...
}
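The pattern.matcher("test").groupCount() call may look odd: we never use the match result. Matcher.groupCount() simply reports how many capturing groups the pattern defines, and that count drives how many columns both schemas will contain. A minimal illustration (the regex here is invented for the example, not taken from the plugin):

// Illustration only: a made-up log regex with three capturing groups.
Pattern pattern = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2}) .*");
int capturingGroups = pattern.matcher("test").groupCount();  // 3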
/**
 * Define the output schema: the schema after type conversions.
 * Does not include the special columns as those are added only when
 * requested, and are always VARCHAR.
 */
private TupleMetadata defineOutputSchema(int capturingGroups) {
  List<String> fields = formatConfig.getFieldNames();
  for (int i = fields.size(); i < capturingGroups; i++) {
    fields.add("field_" + i);
  }
  SchemaBuilder builder = new SchemaBuilder();
  for (int i = 0; i < capturingGroups; i++) {
    makeColumn(builder, fields.get(i), i);
  }
  TupleMetadata schema = builder.buildSchema();

  // Populate the date formats, if provided.
  if (formatConfig.getSchema() == null) {
    return schema;
  }
  for (int i = 0; i < formatConfig.getSchema().size(); i++) {
    ColumnMetadata col = schema.metadata(i);
    switch (col.type()) {
      case DATE:
      case TIMESTAMP:
      case TIME:
        break;
      default:
        continue;
    }
    String format = formatConfig.getDateFormat(i);
    if (format == null) {
      continue;
    }
    col.setProperty(ColumnMetadata.FORMAT_PROP, format);
  }
  return schema;
}
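To picture what the date-format loop produces, here is a hedged illustration (the field name and format pattern are invented): a TIMESTAMP column whose incoming strings look like "2019-07-01 14:30:00" ends up with the pattern attached as its FORMAT_PROP, which the string-to-timestamp conversion can then use when parsing.

// Illustration only: hypothetical column name and format string.
TupleMetadata schema = new SchemaBuilder()
    .addNullable("event_time", MinorType.TIMESTAMP)
    .buildSchema();
schema.metadata("event_time").setProperty(
    ColumnMetadata.FORMAT_PROP, "yyyy-MM-dd HH:mm:ss");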
/**
 * Define the simplified reader schema: this is the format that the reader
 * understands. All columns are VARCHAR, and the reader can offer the
 * two special columns.
 */
private TupleMetadata defineReaderSchema(TupleMetadata outputSchema) {
  SchemaBuilder builder = new SchemaBuilder();
  for (int i = 0; i < outputSchema.size(); i++) {
    builder.addNullable(outputSchema.metadata(i).name(), MinorType.VARCHAR);
  }
  builder.addNullable(LogBatchReader.RAW_LINE_COL_NAME, MinorType.VARCHAR);
  builder.addNullable(LogBatchReader.UNMATCHED_LINE_COL_NAME, MinorType.VARCHAR);
  TupleMetadata schema = builder.buildSchema();

  // Exclude special columns from wildcard expansion
  schema.metadata(LogBatchReader.RAW_LINE_COL_NAME).setBooleanProperty(
      ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
  schema.metadata(LogBatchReader.UNMATCHED_LINE_COL_NAME).setBooleanProperty(
      ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
  return schema;
}
This code is unique to the needs of the log reader: we make one pass over the schema provided in the plugin config to build the output schema (which includes types). We then make a second pass, over the output schema, to create the reader schema with every type set to VARCHAR. The reader schema also includes the two special columns, which are always VARCHAR. Some details are omitted; you can see the full code here.
Note: Links won't work until DRILL-7293 is committed to the master branch.
Now that we have the two schemas, we just pass them to the EVF:
@Override
protected FileScanBuilder frameworkBuilder(
    OptionManager options, EasySubScan scan) throws ExecutionSetupException {
  ...

  // Use the file framework to enable support for implicit and partition
  // columns.
  FileScanBuilder builder = new FileScanBuilder();

  // Pass along the class that will create a batch reader on demand for
  // each input file.
  builder.setReaderFactory(new LogReaderFactory(this, pattern, readerSchema));

  // The default type of regex columns is nullable VarChar,
  // so let's use that as the missing column type.
  builder.setNullType(Types.optional(MinorType.VARCHAR));

  // This plugin was created before the concept of "provided schema" was
  // available. Use the schema obtained from config as the provided schema.
  builder.typeConverterBuilder().providedSchema(outputSchema);
  return builder;
}
We pass the regex pattern and the reader schema to each batch reader via the LogReaderFactory. Again, see here for the trivial details.
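The factory itself is small. Here is a minimal sketch, assuming the EVF's FileReaderFactory base class and a LogBatchReader constructor that takes the config, pattern, and reader schema; those constructor arguments are assumptions, so check the linked source for the committed version:

// Sketch only: constructor arguments and class shape are assumptions.
private static class LogReaderFactory extends FileReaderFactory {
  private final LogFormatPlugin plugin;
  private final Pattern pattern;
  private final TupleMetadata readerSchema;

  public LogReaderFactory(LogFormatPlugin plugin, Pattern pattern,
      TupleMetadata readerSchema) {
    this.plugin = plugin;
    this.pattern = pattern;
    this.readerSchema = readerSchema;
  }

  @Override
  public ManagedReader<? extends FileSchemaNegotiator> newReader() {
    // One reader per file; all readers share the pattern and reader schema.
    return new LogBatchReader(plugin.getConfig(), pattern, readerSchema);
  }
}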
Finally, we can remove from the log batch reader the column classes that previously handled conversion. This change is entirely specific to the log reader; see here for details.
The important change is that we now write each regex group directly to a scalar column writer, and let the column writer convert to the target type:
private void loadVectors(Matcher m, RowSetLoader rowWriter) {
  for (int i = 0; i < capturingGroups; i++) {
    String value = m.group(i + 1);
    if (value != null) {
      rowWriter.scalar(i).setString(value);
    }
  }
}
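Because the provided (output) schema declares the target types, the writer returned by rowWriter.scalar(i) already performs the conversion whenever the reader and output types differ; the reader just keeps writing strings. An illustration (assuming column 0 was declared INT and column 1 DATE in the output schema):

// Illustration only: conversion happens inside the scalar writers.
rowWriter.scalar(0).setString("404");         // stored as the INT value 404
rowWriter.scalar(1).setString("2019-07-01");  // parsed and stored as a DATE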
Run the log reader unit tests to verify that everything still works, including type conversion.