The parsing stage of dbt produces:
- the 'manifest', a Python structure in the code
- the project's target/manifest.json file
- Load the macro manifest (MacroManifest):
- Inputs: SQL files in the 'macros' directory
- Outputs: the MacroManifest encapsulated in a BaseAdapter; a pared down Manifest-type file which contains only macros and files (to be copied into the full Manifest)
The macro manifest takes the place of the full-fledged Manifest and contains only macros, which are used for resolving tests during parsing. The 'adapter' code retains a MacroManifest, while the mainline parsing code will have a full manifest. The class MacroParser assists in this process.
- Load all internal 'projects' (i.e. anything that has a dbt_project.yml) and all projects in the project path.
- Create a ManifestLoader object from the project dependencies and the current config.
- Read and parse the project files. Results are loaded into the ManifestLoader. ManifestLoader loads the MacroManifest object into a BaseAdapter object.
- Write the partial parse results (the pickle file). This writes out the 'results' from the ManifestLoader, so the "create the manifest" step has not occurred yet. Things yet to happen include patching the nodes, patching the macros, and processing refs, sources, and docs.
- Sources are patched. First, source tests are parsed. Nodes, sources, macros, docs, exposures, metadata, files, and selectors are copied into the Manifest by the ManifestLoader. Finally, nodes are "patched" (from results.patches), and macros too (from results.macro_patches).
- Process the manifest (refs, sources, docs). This loops through all nodes; for each node it finds the node that matches each of its source refs and adds the unique id of that source to the node's 'depends_on'.
- Check the Manifest for resource uniqueness.
- Build the flat graph.
- Write out the target/manifest.json file.
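The "process the manifest" step above can be pictured with a small sketch. The dictionaries and field names used here (`sources`, `depends_on`, the project name `my_project`) are simplified stand-ins chosen for illustration, not dbt's actual classes or API.

```python
# Simplified stand-ins for manifest contents; not dbt's real classes.
sources = {
    "source.my_project.raw.orders": {"source_name": "raw", "name": "orders"},
}
nodes = {
    "model.my_project.stg_orders": {
        "sources": [("raw", "orders")],  # source references found while parsing
        "depends_on": [],
    },
}


def process_sources(nodes, sources):
    """Add the unique_id of each referenced source to the node's depends_on."""
    # Index sources by (source_name, table_name) for direct lookup.
    lookup = {(s["source_name"], s["name"]): uid for uid, s in sources.items()}
    for node in nodes.values():
        for ref in node["sources"]:
            uid = lookup.get(tuple(ref))
            if uid is not None and uid not in node["depends_on"]:
                node["depends_on"].append(uid)


process_sources(nodes, sources)
```

Unresolved references are silently skipped in this sketch; how the real code reports them is not covered here.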
There are several parser-type objects. Each "parser" gets a list of matching files specified by directory and file pattern ('dbt_project.yml', *.sql, *.yml, *.csv, or *.md).
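As a rough illustration of "matching files specified by directory and extension", the sketch below collects files with a given extension under a list of directories. The helper name `find_matching_files` and the project layout are hypothetical; dbt's own file search is implemented differently.

```python
from pathlib import Path
import tempfile


def find_matching_files(project_root, directories, extension):
    """Return files under the given directories whose name ends with extension."""
    matches = []
    for directory in directories:
        base = Path(project_root) / directory
        if base.is_dir():
            # rglob walks subdirectories too; sort for a stable order.
            matches.extend(sorted(p for p in base.rglob(f"*{extension}") if p.is_file()))
    return matches


# Build a throwaway project layout to exercise the helper.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "models" / "staging").mkdir(parents=True)
    (Path(root) / "models" / "staging" / "stg_orders.sql").write_text("select 1")
    (Path(root) / "models" / "schema.yml").write_text("version: 2")
    found = [p.name for p in find_matching_files(root, ["models"], ".sql")]
```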
code: core/dbt/parser/models.py. Most of the code is in SimpleSQLParser.
paths: source_paths + *.sql
Manifest: nodes
code: core/dbt/parser/snapshots.py
paths: snapshot_paths + *.sql
code: core/dbt/parser/analysis.py
paths: analysis_paths + *.sql
Manifest: nodes
code: core/dbt/parser/data_test.py
paths: test_paths + *.sql
Manifest: nodes
code: core/dbt/parser/hooks.py
paths: 'dbt_project.yml'
code: core/dbt/parser/seeds.py
paths: data_paths + *.csv
code: core/dbt/parser/docs.py
paths: docs_paths + *.md
Manifest: docs
- Input: yaml files
- Output: various nodes in the Manifest and 'patch' files in the ParseResult/Manifest
code: core/dbt/parser/schemas.py
paths: all_source_paths = source_paths, data_paths, snapshot_paths, analysis_paths, macro_paths, *.yml
This "parses" the .yml (property) files in the dbt project. It loops through each yaml file, pulls out the tests in the various config sections, and jinja-renders them.
A different sub-parser is called for each main dictionary key in the yaml.
- 'models' - TestablePatchParser
- plus 'patches' at create manifest time
- Manifest: nodes
- 'seeds' - TestablePatchParser
- 'snapshots' - TestablePatchParser
- 'sources' - SourceParser
- plus 'patch_sources' at create manifest time
- Manifest: sources
- 'macros' - MacroPatchParser
- plus 'macro_patches' at create manifest time
- Manifest: macros
- 'analyses' - AnalysisPatchParser
- 'exposures' - SchemaParser.parse_exposures
- no 'patches'
- Manifest: exposures
- 'groups' - SchemaParser.parse_groups
- no 'patches'
- Manifest: groups
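The key-to-sub-parser mapping listed above amounts to a dispatch table. The sketch below only records which sub-parser would handle each key of an already-loaded yaml file; the function `plan_sub_parsers` is hypothetical, and the sub-parsers are represented by name rather than by the real classes.

```python
# Top-level yaml key -> name of the sub-parser listed above.
SUB_PARSERS = {
    "models": "TestablePatchParser",
    "seeds": "TestablePatchParser",
    "snapshots": "TestablePatchParser",
    "sources": "SourceParser",
    "macros": "MacroPatchParser",
    "analyses": "AnalysisPatchParser",
    "exposures": "SchemaParser.parse_exposures",
    "groups": "SchemaParser.parse_groups",
}


def plan_sub_parsers(yaml_dict):
    """Return (key, sub-parser name) pairs for the keys present in one yaml file."""
    return [(key, SUB_PARSERS[key]) for key in yaml_dict if key in SUB_PARSERS]


# An already-loaded properties file, represented as a plain dict.
properties = {
    "version": 2,
    "models": [{"name": "stg_orders"}],
    "sources": [{"name": "raw"}],
}
plan = plan_sub_parsers(properties)
```

Keys without an entry in the table, such as `version`, are ignored by this sketch.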
These have executable SQL attached.
Models
- Are generated from SQL files in the 'models' directory
- have a unique_id starting with 'model.'
- Final object is a ModelNode
Singular Tests
- Are generated from SQL files in 'tests' directory
- have a unique_id starting with 'test.'
- Final object is a SingularTestNode
Generic Tests
- Are generated from 'tests' in schema yaml files, which ultimately derive from tests in the 'macros' directory
- Have a unique_id starting with 'test.'
- Final object is a GenericTestNode
- fqn is .schema_test.
Hooks
- come from 'on-run-start' and 'on-run-end' config attributes.
- have a unique_id starting with 'operation.'
- FQN is of the form: ["dbt_labs_internal_analytics","hooks","dbt_labs_internal_analytics-on-run-end-0"]
Analysis
- comes from SQL files in 'analysis' directory
- Final object is an AnalysisNode
RPC Node
- This is a "node" representing the bit of Jinja-SQL that gets passed into the run_sql or compile_sql methods. When you're using the Cloud IDE, and you're working in a scratch tab, and you just want to compile/run what you have there: it needs to be parsed and executed, but it's not actually a model/node in the project, so it's this special thing. This is a temporary addition to the running manifest.
- Object is an RPCNode
- comes from 'sources' sections in yaml files
- Final object is a SourceDefinition node
- have a unique_id starting with 'source.'
- comes from SQL files in 'macros' directory
- Final object is a Macro node
- have a unique_id starting with 'macro.'
- Test macros are used in schema tests
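The unique_id prefixes mentioned in the lists above can be used to tell resource kinds apart. A minimal sketch, assuming the prefix is everything before the first dot; the labels and example ids are made up for illustration.

```python
# Prefixes taken from the lists above; the labels are informal.
PREFIX_TO_KIND = {
    "model": "model",
    "test": "singular or generic test",
    "operation": "hook",
    "source": "source",
    "macro": "macro",
}


def kind_of(unique_id):
    """Classify a unique_id by the text before its first dot."""
    prefix = unique_id.split(".", 1)[0]
    return PREFIX_TO_KIND.get(prefix, "unknown")


# Hypothetical unique_ids for illustration.
kinds = [
    kind_of("model.my_project.stg_orders"),
    kind_of("macro.my_project.cents_to_dollars"),
    kind_of("source.my_project.raw.orders"),
]
```

Note that the prefix alone cannot distinguish singular tests from generic tests, since both start with 'test.'.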
- comes from .md files in 'docs' directory
- Final object is a Documentation
- comes from 'exposures' sections in yaml files
- Final object is an Exposure node
The information in these structures is stored here by the schema parser, but it should be resolved before the final manifest is written, so it does not show up in the written manifest. Ideally we'd like to skip this step and apply the changes directly to the nodes, macros, and sources instead. With the current staged parser we have to save this with the manifest information.
Selectors are set in config yaml files and can be used to determine which nodes should be 'compiled' and run. Selectors can also be given on the command line, in which case they will be in the cli args.
Models, sources, or other nodes that the user has disabled by setting enabled: false. They should be completely ignored by dbt, as if they don't exist. In its simplest/silliest form, it's a way to keep code around without deleting it. Some folks do cleverer things, like dynamically enabling/disabling certain models based on the database adapter type or the value of --vars.
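A minimal sketch of separating disabled nodes from enabled ones. The node dictionaries and the helper `split_enabled` are hypothetical, and treating a missing `enabled` setting as true is an assumption of this sketch.

```python
# Hypothetical parsed nodes; only the fields needed for the example.
parsed = [
    {"unique_id": "model.my_project.a", "config": {"enabled": True}},
    {"unique_id": "model.my_project.b", "config": {"enabled": False}},
    {"unique_id": "model.my_project.c", "config": {}},  # no explicit setting
]


def split_enabled(nodes):
    """Return (enabled_ids, disabled_ids); a missing 'enabled' counts as True."""
    enabled, disabled = [], []
    for node in nodes:
        target = enabled if node["config"].get("enabled", True) else disabled
        target.append(node["unique_id"])
    return enabled, disabled


enabled_ids, disabled_ids = split_enabled(parsed)
```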
This contains a list of all of the files that were processed with links to the nodes, docs, macros, sources, exposures, patches, macro_patches, source_patches, i.e. all of the other places that data is stored in the manifest. It also has a checksum of the contents. The 'files' structure is in the saved manifest, but not in the manifest.json file that is written out. It is used in partial parsing to determine whether to use previously generated nodes.
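To illustrate how a stored checksum can tell partial parsing which files need re-parsing, here is a small sketch. The use of SHA-256, the helper names, and the file paths are assumptions for the example, not a description of dbt's saved format.

```python
import hashlib


def checksum(contents):
    """Hash file contents; SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(contents.encode("utf-8")).hexdigest()


# Checksums recorded in a previously saved manifest (hypothetical paths).
saved_files = {
    "models/a.sql": checksum("select 1"),
    "models/b.sql": checksum("select 2"),
}
# Contents found on disk during the current run.
current_contents = {
    "models/a.sql": "select 1",   # unchanged
    "models/b.sql": "select 3",   # edited
    "models/c.sql": "select 4",   # new
}


def changed_files(saved, current):
    """Return (new or changed paths, deleted paths)."""
    changed = [p for p, text in current.items() if saved.get(p) != checksum(text)]
    deleted = [p for p in saved if p not in current]
    return sorted(changed), sorted(deleted)


changed, deleted = changed_files(saved_files, current_contents)
```

Files that appear in neither list can reuse their previously generated nodes.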
From the ManifestMetadata class. Contains dbt_schema_version, project_id, user_id, send_anonymous_usage_stats, adapter_type
Used during execution in context.common (?). Builds dictionaries of nodes and sources. Not sure why this is used instead of the original nodes and sources. Not in the written manifest.
This used to be in ParseResults (not committed yet). The saved version of this is compared against the current version to see if we can use the saved Manifest. Contains var_hash, profile_hash, and project_hashes, to compare to the saved Manifest to see if things have changed that would invalidate it.
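The comparison described above can be sketched as follows. The `StateCheck` dataclass and `saved_manifest_is_usable` are hypothetical names for this example; only the three field names come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class StateCheck:
    """Hypothetical container for the hashes named above."""
    var_hash: str
    profile_hash: str
    project_hashes: dict = field(default_factory=dict)


def saved_manifest_is_usable(saved, current):
    """The saved Manifest is reusable only if every hash still matches."""
    return (
        saved.var_hash == current.var_hash
        and saved.profile_hash == current.profile_hash
        and saved.project_hashes == current.project_hashes
    )


saved = StateCheck("v1", "p1", {"my_project": "h1"})
same = StateCheck("v1", "p1", {"my_project": "h1"})
vars_changed = StateCheck("v2", "p1", {"my_project": "h1"})

reuse_ok = saved_manifest_is_usable(saved, same)
reuse_after_var_change = saved_manifest_is_usable(saved, vars_changed)
```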