Rc/0.3.2 (#102)
* Update CHANGELOG and VERSION.

* Hotfix/optional parquet sources (#86)

* Update optional file check in FileSource to build an empty dataframe if an empty folder is passed.

* Remove explicit file check in compile.

* Re-add filesize check in FileSource.execute().

* Move FtpSource connect from compile to execute.

* Fix attribute naming bug.

* Fix bug.

* Allow filepaths to be passed in optional FileSources, and check the existence of the path before loading the dataframe.

* Update CHANGELOG.

* fix add_columns typo in readme

* update changelog

* Feature/union all columns (#94)

* Add 'fill_missing' optional field to UnionOperation that uses default Pandas concat logic without erroring out. Still log a debug message when applicable.

* Rename new field to 'fill_missing_columns' for clarity.

* Update dataframe.py

Rename fill_missing_columns to fill_missing.

* Update dataframe.py

* Update CHANGELOG.md

* Update CHANGELOG.md

* Rename UnionOperation's fill_missing field to fill_missing_columns; update README.

* Git clone timeout when running `earthmover deps` (#93)

* try using subprocess with timeout

* Update error message

* tweak timeouts

* switch to makedirs

* don't error if dir already exists

* remove package path on failure

* adjust deletes

* typo

* switch to rmtree

* remove gitpython dependency

* remove unused import

* remove unused var

* add optional git timeout config

* reverse accidentally removed kwargs

* add notes on git_auth_timeout config to readme

* code cleanup

* Update README.

---------

Co-authored-by: jayckaiser <[email protected]>
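
The approach described in these commits — replacing the GitPython dependency with a plain `git` subprocess call bounded by a timeout, and removing the partial package path on failure — might look roughly like this (an illustrative sketch only; the function and parameter names are assumptions, not earthmover's actual code):

```python
import shutil
import subprocess

def clone_package(repo_url: str, dest_dir: str, timeout: int = 60) -> None:
    """Clone a Git repo with a timeout, cleaning up the partial clone on failure.

    Hypothetical sketch: if authentication hangs (e.g. waiting for credentials),
    subprocess.run raises TimeoutExpired instead of blocking forever.
    """
    try:
        subprocess.run(
            ["git", "clone", "--depth", "1", repo_url, dest_dir],
            check=True,        # raise CalledProcessError on nonzero exit
            timeout=timeout,   # raise TimeoutExpired if the clone hangs
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # Remove the partially-cloned package path so a retry starts clean.
        shutil.rmtree(dest_dir, ignore_errors=True)
        raise
```

The `git_auth_timeout` config added later in this PR would feed the `timeout` value here.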

* Update changelog.

* Fix escape chars in output when `linearize: False` (#98)

* fixes a bug where escape characters were present in the output file when linearize is False

* remove unneeded Dask import

* update return value and comment based on notes from Jay

---------

Co-authored-by: Tom Reitz <[email protected]>

* fixing a bug introduced in the last version where nested JSON would be loaded as a stringified Python dictionary, which is difficult to use in downstream Jinja (#97)

Co-authored-by: Tom Reitz <[email protected]>
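
The problem being fixed can be illustrated with a small sketch (a hypothetical example, not the actual earthmover code):

```python
import json

# A nested JSON value loaded for one row:
nested = {"score": 95, "passed": True}

# str() yields a Python repr -- single quotes, True/False capitalization --
# which is awkward for downstream Jinja templates to consume:
print(str(nested))         # {'score': 95, 'passed': True}

# Keeping the value as a real dict (or re-serializing with json.dumps)
# gives templates proper JSON / attribute access instead:
print(json.dumps(nested))  # {"score": 95, "passed": true}
```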

* Only write `earthmover_compiled.yaml` on compile, not run (#91)

* only write to disk on compile, not run

* update readme with change to earthmover_compiled.yaml

* Add `earthmover clean` command and some CLI error handling (#87)

* add 'clean' command and clean up CLI messaging

* comment justifying dictionary

* update changelog

* remove skip_mkdir, make compiled_yaml_file a class attribute

* replace dict with list of constants

---------

Co-authored-by: Jay Kaiser <[email protected]>

* Update CHANGELOG with new features.

* Fix `__row_data__` in `add_columns` and `modify_columns` operations (#99)

* fix __row_data__ in Jinja expressions of add_columns and modify_columns operations

* update how __row_data__ is added to prevent an error about modifying the row

---------

Co-authored-by: Tom Reitz <[email protected]>

* Feature: Refactor Destination Execute (#95)

* Update config parsing to use ErrorHandler.assert_get_key() for all fields; move and unify Jinja template processing to execute.

* Update destination.py

* Update CHANGELOG.

* makes destination template optional (#88)

* makes destination template optional; when not specified, each row is turned into a JSON object where column names become object properties

* implement changes based on feedback from Jay

* bugfix

* Minor cleanup.

---------

Co-authored-by: Tom Reitz <[email protected]>
Co-authored-by: jayckaiser <[email protected]>
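
The template-less behavior described above — each row rendered as a JSON object whose properties are the column names — can be sketched as follows (an assumed illustration, not the actual implementation):

```python
import json

rows = [
    {"name": "Algebra I", "credits": 3},
    {"name": "Biology", "credits": 4},
]

# With no destination template specified, each row becomes one JSON line,
# with column names as object properties:
for row in rows:
    print(json.dumps(row))
# {"name": "Algebra I", "credits": 3}
# {"name": "Biology", "credits": 4}
```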

* Update CHANGELOG.

* adding the `debug` operation (#100)

* adding debug operation

* Update dataframe.py

Refactor code to improve readability and reference to existing Node attributes.

---------

Co-authored-by: Tom Reitz <[email protected]>
Co-authored-by: Jay Kaiser <[email protected]>

* Use Node.full_name in Node.check_expectations(), instead of redefining the string manually.

* Update CHANGELOG.

* Feature/flatten operation whitespace cleanup (#101)

* adding a flatten_operation

* README tweak

* implement changes based on feedback from Jay

* Clean up comments and whitespace in new FlattenOperation.

* Add print statements to debug tuple problem.

* Minor cleanup.

* Minor cleanup.

* Add single quotes to strip and trim variables in FlattenOperation.

* Fix single quote representation in trim_whitespace.

---------

Co-authored-by: Tom Reitz <[email protected]>

* Update CHANGELOG.

---------

Co-authored-by: johncmerfeld <[email protected]>
Co-authored-by: Samantha LeBlanc <[email protected]>
Co-authored-by: Tom Reitz <[email protected]>
Co-authored-by: Tom Reitz <[email protected]>
5 people authored Jun 14, 2024
1 parent 6094c1b commit 736d315
Showing 15 changed files with 404 additions and 121 deletions.
28 changes: 27 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,29 @@
### v0.3.2
<details>

<summary>Released 2024-06-14</summary>

* feature: Add `DebugOperation` for logging data head, tail, columns, or metadata midrun
* feature: Add `FlattenOperation` for splitting and exploding string columns into values
* feature: Add optional 'fill_missing_columns' field to `UnionOperation` to fill disjoint columns with nulls, instead of raising an error (default `False`)
* feature: Add `git_auth_timeout` config when entering Git credentials during package composition
* feature: [Add `earthmover clean` command that removes local project artifacts](https://github.com/edanalytics/earthmover/pull/87)
* feature: only output compiled template during `earthmover compile`
* feature: Render full row into JSON lines when `template` is undefined in `FileDestination`
* internal: Move `FileSource` size-checking and `FtpSource` FTP-connecting from compile to execute
* internal: Move template-file check from compile to execute in `FileDestination`
* internal: Allow filepaths to be passed to an optional `FileSource`, and check for file before creating empty dataframe
* internal: Build an empty dataframe if an empty folder is passed to an optional `FileSource`
* internal: fix some examples in README
* internal: remove GitPython dependency
* bugfix: fix bug in `FileDestination` where `linearize: False` resulted in escape characters in the output
* bugfix: fix bug where nested JSON would be loaded as a stringified Python dictionary
* bugfix: [Ensure command list in help menu and log output is always consistent](https://github.com/edanalytics/earthmover/pull/87)
* bugfix: fix bug in `ModifyColumnsOperation` where `__row_data__` was not exposed in Jinja templating

</details>


### v0.3.1
<details>

@@ -195,4 +221,4 @@
<summary>Released 2022-09-22</summary>

* initial release
</details>
</details>
92 changes: 83 additions & 9 deletions README.md
@@ -85,8 +85,8 @@
 transA:
   - operation: add_columns
-    source: $sources.A
     columns:
-      - A: "a"
-      - B: "b"
+      A: "a"
+      B: "b"
   - operation: union
     sources:
+      - $transformations.transA
@@ -133,6 +133,7 @@ config:
parameter_defaults:
SOURCE_DIR: ./sources/
show_progress: True
git_auth_timeout: 120
```
* (optional) `output_dir` determines where generated JSONL is stored. The default is `./`.
@@ -148,7 +149,7 @@ config:
* (optional) Specify Jinja `macros` which will be available within any Jinja template content throughout the project. (This can slow performance.)
* (optional) Specify `parameter_defaults` which will be used if the user fails to specify a particular [parameter](#command-line-parameters) or [environment variable](#environment-variable-references).
* (optional) Specify whether to `show_progress` while processing, via a Dask progress bar.

* (optional) Specify the `git_auth_timeout` (in seconds) to wait for the user to enter Git credentials if needed during package installation; default is 60. See [project composition](#project-composition) for more details on package installation.

### **`definitions`**
The `definitions` section of the [YAML configuration](#yaml-configuration) is an optional section you can use to define configurations which are reused throughout the rest of the configuration. `earthmover` does nothing special with this section, it's just interpreted by the YAML parser. However, this can be a very useful way to keep your YAML configuration [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) &ndash; rather than redefine the same values, Jinja phrases, etc. throughout your config, define them once in this section and refer to them later using [YAML anchors, aliases, and overrides](https://www.linode.com/docs/guides/yaml-anchors-aliases-overrides-extensions/).
@@ -318,7 +319,10 @@ Concatenates the transformation source with one or more sources of the same shape.
- $sources.courses_list_1
- $sources.courses_list_2
- $sources.courses_list_3
fill_missing_columns: False
```
By default, unioning sources with different columns raises an error.
Set `fill_missing_columns` to `True` to union all columns into the output dataframe.
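Under the hood this resembles pandas' default concat logic — a rough sketch of the `fill_missing_columns: True` semantics (an assumption for illustration, not earthmover's actual code):

```python
import pandas as pd

courses_1 = pd.DataFrame({"name": ["Algebra I"], "credits": [3]})
courses_2 = pd.DataFrame({"name": ["Biology"], "teacher": ["Smith"]})

# With fill_missing_columns: True, disjoint columns are unioned and
# missing values become nulls instead of raising an error:
unioned = pd.concat([courses_1, courses_2], ignore_index=True)
print(sorted(unioned.columns))             # ['credits', 'name', 'teacher']
print(unioned["teacher"].isna().tolist())  # [True, False]
```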
</details>


@@ -371,10 +375,10 @@ Adds columns with specified values.
```yaml
- operation: add_columns
columns:
- new_column_1: value_1
- new_column_2: "{%raw%}{% if True %}Jinja works here{% endif %}{%endraw%}"
- new_column_3: "{%raw%}Reference values from {{AnotherColumn}} in this new column{%endraw%}"
- new_column_4: "{%raw%}{% if col1>col2 %}{{col1|float + col2|float}}{% else %}{{col1|float - col2|float}}{% endif %}{%endraw%}"
new_column_1: value_1
new_column_2: "{%raw%}{% if True %}Jinja works here{% endif %}{%endraw%}"
new_column_3: "{%raw%}Reference values from {{AnotherColumn}} in this new column{%endraw%}"
new_column_4: "{%raw%}{% if col1>col2 %}{{col1|float + col2|float}}{% else %}{{col1|float - col2|float}}{% endif %}{%endraw%}"
```
Use Jinja: `{{value}}` refers to this column's value; `{{AnotherColumn}}` refers to another column's value. Any [Jinja filters](https://jinja.palletsprojects.com/en/3.1.x/templates/#builtin-filters) and [math operations](https://jinja.palletsprojects.com/en/3.0.x/templates/#math) should work. Reference the current row number with `{{__row_number__}}` or a dictionary containing the row data with `{{__row_data__['column_name']}}`. *You must wrap Jinja expressions* in `{%raw%}...{%endraw%}` to avoid them being parsed at YAML load time.
</details>
@@ -565,6 +569,59 @@ By default, rows are sorted in ascending order. Set `descending: True` to reverse this.
</details>


<details>
<summary><code>flatten</code></summary>

Split values in a column and create a copy of the row for each value.
```yaml
- operation: flatten
flatten_column: my_column
left_wrapper: '["' # characters to trim from the left of values in `flatten_column`
right_wrapper: '"]' # characters to trim from the right of values in `flatten_column`
separator: "," # the string by which to split values in `flatten_column`
value_column: my_value # name of the new column to create with flattened values
trim_whitespace: " \t\r\n\"" # characters to trim from `value_column` _after_ flattening
```
The defaults above are designed so that JSON arrays stored as strings can be flattened with simply
```yaml
- operation: flatten
flatten_column: my_column
value_column: my_value
```
Note that for empty string values or empty arrays, a row will still be preserved. These can be removed in a second step with a `filter_rows` operation. Example:
```yaml
# Given a dataframe like this:
# foo bar to_flatten
# --- --- ----------
# foo1 bar1 "[\"test1\",\"test2\",\"test3\"]"
# foo2 bar2 ""
# foo3 bar3 "[]"
# foo4 bar4 "[\"test4\",\"test5\",\"test6\"]"
#
# a flatten operation like this:
- operation: flatten
flatten_column: to_flatten
value_column: my_value
# will produce a dataframe like this:
# foo bar my_value
# --- --- --------
# foo1 bar1 test1
# foo1 bar1 test2
# foo1 bar1 test3
# foo2 bar2 ""
# foo3 bar3 ""
# foo4 bar4 test4
# foo4 bar4 test5
# foo4 bar4 test6
#
# and you can remove the blank rows if needed with a further operation:
- operation: filter_rows
query: my_value == ''
behavior: exclude
```
</details>


#### Group operations

<details>
@@ -608,6 +665,23 @@ Note the difference between `min()`/`max()` and `str_min()`/`str_max()`: given a
</details>


#### Debug operation

<details>
<summary><code>debug</code></summary>

Log the dataframe's head, tail, columns, or metadata mid-run.
```yaml
- operation: debug
function: head | tail | describe | columns # default=head
rows: 10 # (optional, default=5; ignored if function=describe|columns)
transpose: True # (default=False; ignored when function=columns)
skip_columns: [a, b, c] # to avoid logging PII
keep_columns: [x, y, z] # to look only at specific columns
```
`function=head|tail` displays the first or last `rows` rows of the dataframe, respectively. (Note that on large dataframes, these may not truly be the first/last rows, due to Dask's memory optimizations.) `function=describe` shows statistics about the values in the dataframe. `function=columns` shows the column names in the dataframe. `transpose` can be helpful with very wide dataframes. `keep_columns` defaults to all columns; `skip_columns` defaults to no columns.
</details>


### **`destinations`**
The `destinations` section of the [YAML configuration](#yaml-configuration) specifies how transformed data is materialized to files.
@@ -700,7 +774,7 @@ transformations:
operations:
- operation: add_columns
columns:
- source_file: {{i}}
source_file: {{i}}
{% endfor %}
stacked:
source: $transformations.source1
@@ -891,7 +965,7 @@ Some details of the design of this tool are discussed below.

Note that due to step (3) above, *runtime* Jinja expressions (such as column definitions for `add_columns` or `modify_columns` operations) should be wrapped with `{%raw%}...{%endraw%}` to avoid being parsed when the YAML is being loaded.
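
The two-pass behavior can be demonstrated with jinja2 directly (an illustrative sketch; the column expression is a made-up example):

```python
from jinja2 import Template

# As written in the YAML, the runtime expression is protected by raw tags:
yaml_value = "{%raw%}{{ __row_number__ }}: {{ name }}{%endraw%}"

# Pass 1 (YAML load / compile time): the raw block survives unevaluated.
compiled = Template(yaml_value).render()
print(compiled)  # {{ __row_number__ }}: {{ name }}

# Pass 2 (runtime, per row): the expression is finally evaluated.
print(Template(compiled).render(__row_number__=1, name="Algebra I"))
# 1: Algebra I
```

Without the `{%raw%}...{%endraw%}` wrapper, pass 1 would try to evaluate `__row_number__` before any row data exists.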

The parsed YAML is written to a file called `earthmover_compiled.yaml` in your working directory during a `compile` or `run` command. This file can be used to debug issues related to compile-time Jinja or [project composition](#project-composition).
The parsed YAML is written to a file called `earthmover_compiled.yaml` in your working directory during a `compile` command. This file can be used to debug issues related to compile-time Jinja or [project composition](#project-composition).


## DAG
2 changes: 1 addition & 1 deletion earthmover/VERSION.txt
@@ -1 +1 @@
0.3.1
0.3.2
55 changes: 41 additions & 14 deletions earthmover/__main__.py
@@ -5,6 +5,13 @@

from earthmover.earthmover import Earthmover

# Any new command should be added to this list
RUN = "run"
COMPILE = "compile"
DEPS = "deps"
CLEAN = "clean"
ALLOWED_COMMANDS = [RUN, COMPILE, DEPS, CLEAN]
command_list = ", ".join(f"`{c}`" for c in ALLOWED_COMMANDS)

class ExitOnExceptionHandler(logging.StreamHandler):
"""
@@ -50,7 +57,7 @@ def main(argv=None):
parser.add_argument('command',
nargs="?",
type=str,
help='the command to run: `run`, `compile`, `visualize`'
help=f'the command to run. One of: {command_list}'
)
parser.add_argument("-c", "--config-file",
nargs="?",
@@ -103,9 +110,17 @@
})

### Parse the user-inputs and run Earthmover, depending on the command and subcommand passed.
args, remaining_argv = parser.parse_known_args()
args, unknown_args = parser.parse_known_args()
if len(unknown_args) > 0:
unknown_args_str = ', '.join(f"`{c}`" for c in unknown_args)
print(f"unknown arguments {unknown_args_str} passed, use -h flag for help")
exit(1)

if args.command is not None and args.command not in ALLOWED_COMMANDS:
print(f"unknown command '{args.command}' passed, use -h flag for help")
exit(1)

# Command: Version
# -v / --version
if args.version:
em_dir = os.path.dirname(os.path.abspath(__file__))
version_file = os.path.join(em_dir, 'VERSION.txt')
@@ -114,7 +129,7 @@
print(f"earthmover, version {VERSION}")
exit(0)

# Command: Test
# -t / --test
if args.test:
tests_dir = os.path.join( os.path.realpath(os.path.dirname(__file__)), "tests" )

@@ -130,7 +145,7 @@
em.logger.info('tests passed successfully.')
exit(0)

### Otherwise, initialize Earthmover for a main run.
### Otherwise, initialize Earthmover to execute a command.
if not args.config_file:
for file in DEFAULT_CONFIG_FILES:
test_file = os.path.join(".", file)
@@ -162,15 +177,15 @@
force=args.force,
skip_hashing=args.skip_hashing,
cli_state_configs=cli_state_configs,
results_file=args.results_file
results_file=args.results_file,
)

except Exception as err:
logger.exception(err, exc_info=True)
raise # Avoids linting error

# Subcommand: deps (parse Earthmover YAML and compile listed packages)
if args.command == 'deps':
# Command: deps (parse Earthmover YAML and compile listed packages)
if args.command == DEPS:
em.logger.info(f"installing packages...")
if args.selector != '*':
em.logger.info("selector is ignored for package install.")
@@ -182,23 +197,35 @@
logger.exception(e, exc_info=em.state_configs['show_stacktrace'])
raise

# Subcommand: compile (parse Earthmover YAML and build graph)
elif args.command == 'compile':
# Command: compile (parse Earthmover YAML and build graph)
elif args.command == COMPILE:
em.logger.info(f"compiling project...")
if args.selector != '*':
em.logger.info("selector is ignored for compile-only run.")

try:
em.compile()
em.compile(to_disk=True)
em.logger.info("looks ok")

except Exception as e:
logger.exception(e, exc_info=em.state_configs['show_stacktrace'])
raise

# Subcommand: run (compile + execute)
elif args.command == CLEAN:
em.logger.info(f"removing local artifacts...")
if args.selector != '*':
em.logger.info("selector is ignored for project cleaning.")

try:
em.clean()
em.logger.info("done!")
except Exception as e:
logger.exception(e, exc_info=em.state_configs['show_stacktrace'])
raise

# Command: run (compile + execute)
# This is the default if none is specified.
elif args.command == 'run' or not args.command:
elif args.command == RUN or not args.command:
if not args.command:
em.logger.warning("[no command specified; proceeding with `run` but we recommend explicitly giving a command]")

@@ -212,7 +239,7 @@
raise

else:
logger.exception("unknown command, use -h flag for help")
logger.exception(f"unknown command '{args.command}', use -h flag for help")
raise


