Rc/0.3.2 (#102)
* Update CHANGELOG and VERSION.

* Hotfix/optional parquet sources (#86)

* Update optional file check in FileSource to build an empty dataframe if an empty folder is passed.

* Remove explicit file check in compile.

* Re-add filesize check in FileSource.execute().

* Move FtpSource connect from compile to execute.

* Fix attribute naming bug.

* Fix bug.

* Allow filepaths to be passed in optional FileSources, and check the existence of the path before loading the dataframe.

* Update CHANGELOG.

* fix add_columns typo in readme

* update changelog

* Feature/union all columns (#94)

* Add 'fill_missing' optional field to UnionOperation that uses default Pandas concat logic without erroring out. Still log a debug message when applicable.

* Rename new field to 'fill_missing_columns' for clarity.

* Update dataframe.py

Rename fill_missing_columns to fill_missing.

* Update dataframe.py

* Update CHANGELOG.md

* Update CHANGELOG.md

* Rename UnionOperation's fill_missing field to fill_missing_columns; update README.

* Git clone timeout when running `earthmover deps` (#93)

* try using subprocess with timeout

* Update error message

* tweak timeouts

* switch to makedirs

* don't error if dir already exists

* remove package path on failure

* adjust deletes

* typo

* switch to rmtree

* remove gitpython dependency

* remove unused import

* remove unused var

* add optional git timeout config

* reverse accidentally removed kwargs

* add notes on git_auth_timeout config to readme

* code cleanup

* Update README.

---------

Co-authored-by: jayckaiser <[email protected]>
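
The approach described in these commits — replacing the GitPython dependency with a plain `git` subprocess call bounded by a timeout, and removing the partial package path on failure — might look roughly like this (an illustrative sketch only; the function and parameter names are assumptions, not earthmover's actual code):

```python
import shutil
import subprocess

def clone_package(repo_url: str, dest_dir: str, timeout: int = 60) -> None:
    """Clone a Git repo with a timeout, cleaning up the partial clone on failure.

    Hypothetical sketch: if authentication hangs (e.g. waiting for credentials),
    subprocess.run raises TimeoutExpired instead of blocking forever.
    """
    try:
        subprocess.run(
            ["git", "clone", "--depth", "1", repo_url, dest_dir],
            check=True,        # raise CalledProcessError on nonzero exit
            timeout=timeout,   # raise TimeoutExpired if the clone hangs
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # Remove the partially-cloned package path so a retry starts clean.
        shutil.rmtree(dest_dir, ignore_errors=True)
        raise
```

The `git_auth_timeout` config added later in this PR would feed the `timeout` value here.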

* Update changelog.

* Fix escape chars in output when `linearize: False` (#98)

* fixes a bug where escape characters were present in the output file when linearize is False

* remove unneeded Dask import

* update return value and comment based on notes from Jay

---------

Co-authored-by: Tom Reitz <[email protected]>

* fixing a bug introduced in the last version where nested JSON would be loaded as a stringified Python dictionary, which is difficult to use in downstream Jinja (#97)

Co-authored-by: Tom Reitz <[email protected]>
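
The problem being fixed can be illustrated with a small sketch (a hypothetical example, not the actual earthmover code):

```python
import json

# A nested JSON value loaded for one row:
nested = {"score": 95, "passed": True}

# str() yields a Python repr -- single quotes, True/False capitalization --
# which is awkward for downstream Jinja templates to consume:
print(str(nested))         # {'score': 95, 'passed': True}

# Keeping the value as a real dict (or re-serializing with json.dumps)
# gives templates proper JSON / attribute access instead:
print(json.dumps(nested))  # {"score": 95, "passed": true}
```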

* Only write `earthmover_compiled.yaml` on compile, not run (#91)

* only write to disk on compile, not run

* update readme with change to earthmover_compiled.yaml

* Add `earthmover clean` command and some CLI error handling (#87)

* add 'clean' command and clean up CLI messaging

* comment justifying dictionary

* update changelog

* remove skip_mkdir, make compiled_yaml_file a class attribute

* replace dict with list of constants

---------

Co-authored-by: Jay Kaiser <[email protected]>

* Update CHANGELOG with new features.

* Fix `__row_data__` in `add_columns` and `modify_columns` operations (#99)

* fix __row_data__ in Jinja expressions of add_columns and modify_columns operations

* update how __row_data__ is added to prevent an error about modifying the row

---------

Co-authored-by: Tom Reitz <[email protected]>

* Feature: Refactor Destination Execute (#95)

* Update config parsing to use ErrorHandler.assert_get_key() for all fields; move and unify Jinja template processing to execute.

* Update destination.py

* Update CHANGELOG.

* makes destination template optional (#88)

* makes destination template optional; when not specified, each row is turned into a JSON object where column names become object properties

* implement changes based on feedback from Jay

* bugfix

* Minor cleanup.

---------

Co-authored-by: Tom Reitz <[email protected]>
Co-authored-by: jayckaiser <[email protected]>
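
The template-less behavior described above — each row rendered as a JSON object whose properties are the column names — can be sketched as follows (an assumed illustration, not the actual implementation):

```python
import json

rows = [
    {"name": "Algebra I", "credits": 3},
    {"name": "Biology", "credits": 4},
]

# With no destination template specified, each row becomes one JSON line,
# with column names as object properties:
for row in rows:
    print(json.dumps(row))
# {"name": "Algebra I", "credits": 3}
# {"name": "Biology", "credits": 4}
```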

* Update CHANGELOG.

* adding the `debug` operation (#100)

* adding debug operation

* Update dataframe.py

Refactor code to improve readability and reference to existing Node attributes.

---------

Co-authored-by: Tom Reitz <[email protected]>
Co-authored-by: Jay Kaiser <[email protected]>

* Use Node.full_name in Node.check_expectations(), instead of redefining the string manually.

* Update CHANGELOG.

* Feature/flatten operation whitespace cleanup (#101)

* adding a flatten_operation

* README tweak

* implement changes based on feedback from Jay

* Clean up comments and whitespace in new FlattenOperation.

* Add print statements to debug tuple problem.

* Minor cleanup.

* Minor cleanup.

* Add single quotes to strip and trim variables in FlattenOperation.

* Fix single quote representation in trim_whitespace.

---------

Co-authored-by: Tom Reitz <[email protected]>

* Update CHANGELOG.

---------

Co-authored-by: johncmerfeld <[email protected]>
Co-authored-by: Samantha LeBlanc <[email protected]>
Co-authored-by: Tom Reitz <[email protected]>
Co-authored-by: Tom Reitz <[email protected]>
5 people authored Jun 14, 2024
1 parent 6094c1b commit 736d315
Showing 15 changed files with 404 additions and 121 deletions.
28 changes: 27 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,29 @@
### v0.3.2
<details>

<summary>Released 2024-06-14</summary>

* feature: Add `DebugOperation` for logging data head, tail, columns, or metadata midrun
* feature: Add `FlattenOperation` for splitting and exploding string columns into values
* feature: Add optional 'fill_missing_columns' field to `UnionOperation` to fill disjoint columns with nulls, instead of raising an error (default `False`)
* feature: Add `git_auth_timeout` config when entering Git credentials during package composition
* feature: [Add `earthmover clean` command that removes local project artifacts](https://github.com/edanalytics/earthmover/pull/87)
* feature: only output compiled template during `earthmover compile`
* feature: Render full row into JSON lines when `template` is undefined in `FileDestination`
* internal: Move `FileSource` size-checking and `FtpSource` FTP-connecting from compile to execute
* internal: Move template-file check from compile to execute in `FileDestination`
* internal: Allow filepaths to be passed to an optional `FileSource`, and check for file before creating empty dataframe
* internal: Build an empty dataframe if an empty folder is passed to an optional `FileSource`
* internal: fix some examples in README
* internal: remove GitPython dependency
* bugfix: fix bug in `FileDestination` where `linearize: False` resulted in escape characters in the output
* bugfix: fix bug where nested JSON would be loaded as a stringified Python dictionary
* bugfix: [Ensure command list in help menu and log output is always consistent](https://github.com/edanalytics/earthmover/pull/87)
* bugfix: fix bug in `ModifyColumnsOperation` where `__row_data__` was not exposed in Jinja templating

</details>


### v0.3.1
<details>

@@ -195,4 +221,4 @@
<summary>Released 2022-09-22</summary>

* initial release
</details>
</details>
92 changes: 83 additions & 9 deletions README.md
@@ -85,8 +85,8 @@
 transA:
   - operation: add_columns
-    source: $sources.A
     columns:
-      - A: "a"
-      - B: "b"
+      A: "a"
+      B: "b"
   - operation: union
     sources:
+      - $transformations.transA
@@ -133,6 +133,7 @@ config:
parameter_defaults:
SOURCE_DIR: ./sources/
show_progress: True
git_auth_timeout: 120
```
* (optional) `output_dir` determines where generated JSONL is stored. The default is `./`.
@@ -148,7 +149,7 @@ config:
* (optional) Specify Jinja `macros` which will be available within any Jinja template content throughout the project. (This can slow performance.)
* (optional) Specify `parameter_defaults` which will be used if the user fails to specify a particular [parameter](#command-line-parameters) or [environment variable](#environment-variable-references).
* (optional) Specify whether to `show_progress` while processing, via a Dask progress bar.

* (optional) Specify the `git_auth_timeout` (in seconds) to wait for the user to enter Git credentials if needed during package installation; default is 60. See [project composition](#project-composition) for more details on package installation.

### **`definitions`**
The `definitions` section of the [YAML configuration](#yaml-configuration) is an optional section you can use to define configurations which are reused throughout the rest of the configuration. `earthmover` does nothing special with this section, it's just interpreted by the YAML parser. However, this can be a very useful way to keep your YAML configuration [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) &ndash; rather than redefine the same values, Jinja phrases, etc. throughout your config, define them once in this section and refer to them later using [YAML anchors, aliases, and overrides](https://www.linode.com/docs/guides/yaml-anchors-aliases-overrides-extensions/).
@@ -318,7 +319,10 @@ Concatenates the transformation source with one or more sources of the same shape.
- $sources.courses_list_1
- $sources.courses_list_2
- $sources.courses_list_3
fill_missing_columns: False
```
By default, unioning sources with different columns raises an error.
Set `fill_missing_columns` to `True` to union all columns into the output dataframe.
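Under the hood this resembles pandas' default concat logic — a rough sketch of the `fill_missing_columns: True` semantics (an assumption for illustration, not earthmover's actual code):

```python
import pandas as pd

courses_1 = pd.DataFrame({"name": ["Algebra I"], "credits": [3]})
courses_2 = pd.DataFrame({"name": ["Biology"], "teacher": ["Smith"]})

# With fill_missing_columns: True, disjoint columns are unioned and
# missing values become nulls instead of raising an error:
unioned = pd.concat([courses_1, courses_2], ignore_index=True)
print(sorted(unioned.columns))             # ['credits', 'name', 'teacher']
print(unioned["teacher"].isna().tolist())  # [True, False]
```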
</details>


@@ -371,10 +375,10 @@ Adds columns with specified values.
```yaml
- operation: add_columns
columns:
- new_column_1: value_1
- new_column_2: "{%raw%}{% if True %}Jinja works here{% endif %}{%endraw%}"
- new_column_3: "{%raw%}Reference values from {{AnotherColumn}} in this new column{%endraw%}"
- new_column_4: "{%raw%}{% if col1>col2 %}{{col1|float + col2|float}}{% else %}{{col1|float - col2|float}}{% endif %}{%endraw%}"
new_column_1: value_1
new_column_2: "{%raw%}{% if True %}Jinja works here{% endif %}{%endraw%}"
new_column_3: "{%raw%}Reference values from {{AnotherColumn}} in this new column{%endraw%}"
new_column_4: "{%raw%}{% if col1>col2 %}{{col1|float + col2|float}}{% else %}{{col1|float - col2|float}}{% endif %}{%endraw%}"
```
Use Jinja: `{{value}}` refers to this column's value; `{{AnotherColumn}}` refers to another column's value. Any [Jinja filters](https://jinja.palletsprojects.com/en/3.1.x/templates/#builtin-filters) and [math operations](https://jinja.palletsprojects.com/en/3.0.x/templates/#math) should work. Reference the current row number with `{{__row_number__}}` or a dictionary containing the row data with `{{__row_data__['column_name']}}`. *You must wrap Jinja expressions* in `{%raw%}...{%endraw%}` to avoid them being parsed at YAML load time.
</details>
@@ -565,6 +569,59 @@ By default, rows are sorted in ascending order. Set `descending: True` to reverse this.
</details>


<details>
<summary><code>flatten</code></summary>

Split values in a column and create a copy of the row for each value.
```yaml
- operation: flatten
flatten_column: my_column
left_wrapper: '["' # characters to trim from the left of values in `flatten_column`
right_wrapper: '"]' # characters to trim from the right of values in `flatten_column`
separator: "," # the string by which to split values in `flatten_column`
value_column: my_value # name of the new column to create with flattened values
trim_whitespace: " \t\r\n\"" # characters to trim from `value_column` _after_ flattening
```
The defaults above are designed so that JSON arrays stored as strings can be flattened with simply
```yaml
- operation: flatten
flatten_column: my_column
value_column: my_value
```
Note that for empty string values or empty arrays, a row will still be preserved. These can be removed in a second step with a `filter_rows` operation. Example:
```yaml
# Given a dataframe like this:
# foo bar to_flatten
# --- --- ----------
# foo1 bar1 "[\"test1\",\"test2\",\"test3\"]"
# foo2 bar2 ""
# foo3 bar3 "[]"
# foo4 bar4 "[\"test4\",\"test5\",\"test6\"]"
#
# a flatten operation like this:
- operation: flatten
flatten_column: to_flatten
value_column: my_value
# will produce a dataframe like this:
# foo bar my_value
# --- --- --------
# foo1 bar1 test1
# foo1 bar1 test2
# foo1 bar1 test3
# foo2 bar2 ""
# foo3 bar3 ""
# foo4 bar4 test4
# foo4 bar4 test5
# foo4 bar4 test6
#
# and you can remove the blank rows if needed with a further operation:
- operation: filter_rows
query: my_value == ''
behavior: exclude
```
</details>


#### Group operations

<details>
@@ -608,6 +665,23 @@ Note the difference between `min()`/`max()` and `str_min()`/`str_max()`: given a
</details>


#### Debug operation

<details>
<summary><code>debug</code></summary>

Log the dataframe's head, tail, columns, or metadata mid-run.
```yaml
- operation: debug
function: head | tail | describe | columns # default=head
rows: 10 # (optional, default=5; ignored if function=describe|columns)
transpose: True # (default=False; ignored when function=columns)
skip_columns: [a, b, c] # to avoid logging PII
keep_columns: [x, y, z] # to look only at specific columns
```
`function=head|tail` displays the first or last `rows` rows of the dataframe, respectively. (Note that on large dataframes, these may not truly be the first/last rows, due to Dask's memory optimizations.) `function=describe` shows statistics about the values in the dataframe. `function=columns` shows the column names in the dataframe. `transpose` can be helpful with very wide dataframes. `keep_columns` defaults to all columns; `skip_columns` defaults to no columns.
</details>


### **`destinations`**
The `destinations` section of the [YAML configuration](#yaml-configuration) specifies how transformed data is materialized to files.
@@ -700,7 +774,7 @@ transformations:
operations:
- operation: add_columns
columns:
- source_file: {{i}}
source_file: {{i}}
{% endfor %}
stacked:
source: $transformations.source1
@@ -891,7 +965,7 @@ Some details of the design of this tool are discussed below.

Note that due to step (3) above, *runtime* Jinja expressions (such as column definitions for `add_columns` or `modify_columns` operations) should be wrapped with `{%raw%}...{%endraw%}` to avoid being parsed when the YAML is being loaded.
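
The two-pass behavior can be demonstrated with jinja2 directly (an illustrative sketch; the column expression is a made-up example):

```python
from jinja2 import Template

# As written in the YAML, the runtime expression is protected by raw tags:
yaml_value = "{%raw%}{{ __row_number__ }}: {{ name }}{%endraw%}"

# Pass 1 (YAML load / compile time): the raw block survives unevaluated.
compiled = Template(yaml_value).render()
print(compiled)  # {{ __row_number__ }}: {{ name }}

# Pass 2 (runtime, per row): the expression is finally evaluated.
print(Template(compiled).render(__row_number__=1, name="Algebra I"))
# 1: Algebra I
```

Without the `{%raw%}...{%endraw%}` wrapper, pass 1 would try to evaluate `__row_number__` before any row data exists.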

The parsed YAML is written to a file called `earthmover_compiled.yaml` in your working directory during a `compile` or `run` command. This file can be used to debug issues related to compile-time Jinja or [project composition](#project-composition).
The parsed YAML is written to a file called `earthmover_compiled.yaml` in your working directory during a `compile` command. This file can be used to debug issues related to compile-time Jinja or [project composition](#project-composition).


## DAG
2 changes: 1 addition & 1 deletion earthmover/VERSION.txt
@@ -1 +1 @@
0.3.1
0.3.2
55 changes: 41 additions & 14 deletions earthmover/__main__.py
@@ -5,6 +5,13 @@

from earthmover.earthmover import Earthmover

# Any new command should be added to this list
RUN = "run"
COMPILE = "compile"
DEPS = "deps"
CLEAN = "clean"
ALLOWED_COMMANDS = [RUN, COMPILE, DEPS, CLEAN]
command_list = ", ".join(f"`{c}`" for c in ALLOWED_COMMANDS)

class ExitOnExceptionHandler(logging.StreamHandler):
"""
@@ -50,7 +57,7 @@ def main(argv=None):
parser.add_argument('command',
nargs="?",
type=str,
help='the command to run: `run`, `compile`, `visualize`'
help=f'the command to run. One of: {command_list}'
)
parser.add_argument("-c", "--config-file",
nargs="?",
@@ -103,9 +110,17 @@
})

### Parse the user-inputs and run Earthmover, depending on the command and subcommand passed.
args, remaining_argv = parser.parse_known_args()
args, unknown_args = parser.parse_known_args()
if len(unknown_args) > 0:
unknown_args_str = ', '.join(f"`{c}`" for c in unknown_args)
print(f"unknown arguments {unknown_args_str} passed, use -h flag for help")
exit(1)

if args.command is not None and args.command not in ALLOWED_COMMANDS:
print(f"unknown command '{args.command}' passed, use -h flag for help")
exit(1)

# Command: Version
# -v / --version
if args.version:
em_dir = os.path.dirname(os.path.abspath(__file__))
version_file = os.path.join(em_dir, 'VERSION.txt')
@@ -114,7 +129,7 @@
print(f"earthmover, version {VERSION}")
exit(0)

# Command: Test
# -t / --test
if args.test:
tests_dir = os.path.join( os.path.realpath(os.path.dirname(__file__)), "tests" )

@@ -130,7 +145,7 @@
em.logger.info('tests passed successfully.')
exit(0)

### Otherwise, initialize Earthmover for a main run.
### Otherwise, initialize Earthmover to execute a command.
if not args.config_file:
for file in DEFAULT_CONFIG_FILES:
test_file = os.path.join(".", file)
@@ -162,15 +177,15 @@
force=args.force,
skip_hashing=args.skip_hashing,
cli_state_configs=cli_state_configs,
results_file=args.results_file
results_file=args.results_file,
)

except Exception as err:
logger.exception(err, exc_info=True)
raise # Avoids linting error

# Subcommand: deps (parse Earthmover YAML and compile listed packages)
if args.command == 'deps':
# Command: deps (parse Earthmover YAML and compile listed packages)
if args.command == DEPS:
em.logger.info(f"installing packages...")
if args.selector != '*':
em.logger.info("selector is ignored for package install.")
@@ -182,23 +197,35 @@
logger.exception(e, exc_info=em.state_configs['show_stacktrace'])
raise

# Subcommand: compile (parse Earthmover YAML and build graph)
elif args.command == 'compile':
# Command: compile (parse Earthmover YAML and build graph)
elif args.command == COMPILE:
em.logger.info(f"compiling project...")
if args.selector != '*':
em.logger.info("selector is ignored for compile-only run.")

try:
em.compile()
em.compile(to_disk=True)
em.logger.info("looks ok")

except Exception as e:
logger.exception(e, exc_info=em.state_configs['show_stacktrace'])
raise

# Subcommand: run (compile + execute)
elif args.command == CLEAN:
em.logger.info(f"removing local artifacts...")
if args.selector != '*':
em.logger.info("selector is ignored for project cleaning.")

try:
em.clean()
em.logger.info("done!")
except Exception as e:
logger.exception(e, exc_info=em.state_configs['show_stacktrace'])
raise

# Command: run (compile + execute)
# This is the default if none is specified.
elif args.command == 'run' or not args.command:
elif args.command == RUN or not args.command:
if not args.command:
em.logger.warning("[no command specified; proceeding with `run` but we recommend explicitly giving a command]")

@@ -212,7 +239,7 @@
raise

else:
logger.exception("unknown command, use -h flag for help")
logger.exception(f"unknown command '{args.command}', use -h flag for help")
raise


