Releases: allenai/tango

v0.13.0

08 Sep 00:39

What's new

Added 🎉

  • You can now reference into a particular index of the result of another step in a config. For example: {type: "ref", ref: "some_previous_step", key: 0}.
    The key field can be an integer if the result of the referenced step is a list or tuple, or a string if the result of the referenced step is a dictionary.
  • Added priority parameter to Beaker executor for setting the default task priority for Beaker jobs.
  • Added Workspace.step_result() method for getting a step's result from the latest
    run.
  • tango run will now display a URL to the logs for failed steps when you use the BeakerExecutor.
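
The indexing behaviour of the new key field can be sketched in plain Python. This is only an illustration of the rule described above, not Tango's implementation: resolve_ref and the results dictionary are hypothetical names, and the type checks are an assumption about how a mismatch might be reported.

```python
from typing import Any, Dict


def resolve_ref(ref_config: Dict[str, Any], results: Dict[str, Any]) -> Any:
    """Look up a referenced step result, optionally indexing into it with `key`.

    Hypothetical helper: `results` maps step names to already-computed results.
    """
    result = results[ref_config["ref"]]
    if "key" not in ref_config:
        return result
    key = ref_config["key"]
    # Lists and tuples are indexed by integer, dictionaries by string.
    if isinstance(result, (list, tuple)) and not isinstance(key, int):
        raise TypeError("key must be an integer for list or tuple results")
    if isinstance(result, dict) and not isinstance(key, str):
        raise TypeError("key must be a string for dictionary results")
    return result[key]


results = {"some_previous_step": ["train", "dev"], "stats": {"mean": 0.5}}
first = resolve_ref({"type": "ref", "ref": "some_previous_step", "key": 0}, results)
mean = resolve_ref({"type": "ref", "ref": "stats", "key": "mean"}, results)
```

Here first is the element at index 0 of the list result, and mean is the value stored under the "mean" key of the dictionary result.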

Changed ⚠️

  • The TorchTrainStep now enables monitoring arbitrary model outputs during training. TorchTrainEngine.forward_train now returns a tuple (loss, model_outputs) for each micro batch. The list of model outputs for all micro batches in a batch is passed to TrainCallback.log_batch and TrainCallback.post_batch.
  • Tango will now automatically search Python modules in the current working directory
    for registered classes so that you don't always need to use the --include-package setting.
  • The minimum supported Python version is now 3.8.
  • Added support for PyTorch Lightning 1.7.x.
  • The Beaker Executor will no longer live-stream logs from Beaker jobs, but logs will be viewable on Beaker and more readable.
  • Only the Beaker executor requires a clean working directory.
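
The new data flow for model outputs (one tuple per micro batch, one list per batch) can be sketched without PyTorch. Everything here is a stand-in: forward_train, run_batch, and RecordingCallback are hypothetical, the "model" just doubles its inputs, and the callback signature is simplified compared with the real TrainCallback.

```python
from typing import Any, Dict, List, Tuple


def forward_train(micro_batch: List[float]) -> Tuple[float, Dict[str, Any]]:
    """Hypothetical stand-in: return (loss, model_outputs) for one micro batch."""
    predictions = [2.0 * x for x in micro_batch]
    loss = sum(predictions) / len(predictions)
    return loss, {"predictions": predictions}


def run_batch(micro_batches: List[List[float]]) -> Tuple[float, List[Dict[str, Any]]]:
    """Average the loss over micro batches and collect every micro batch's outputs."""
    batch_loss = 0.0
    batch_outputs: List[Dict[str, Any]] = []
    for micro_batch in micro_batches:
        loss, outputs = forward_train(micro_batch)
        batch_loss += loss / len(micro_batches)
        batch_outputs.append(outputs)
    return batch_loss, batch_outputs


class RecordingCallback:
    """Simplified callback that keeps what it was given (assumed signature)."""

    def __init__(self) -> None:
        self.logged: List[Tuple[int, float, List[Dict[str, Any]]]] = []

    def log_batch(self, step: int, batch_loss: float, batch_outputs: List[Dict[str, Any]]) -> None:
        self.logged.append((step, batch_loss, batch_outputs))


callback = RecordingCallback()
batch_loss, batch_outputs = run_batch([[1.0, 2.0], [3.0]])
callback.log_batch(0, batch_loss, batch_outputs)
```

The callback receives one entry in batch_outputs per micro batch, so a monitoring callback can inspect predictions or any other output the model chooses to return.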

Fixed ✅

  • Fixed a bug that did not allow a wandb artifact's type to be set from a step's metadata dictionary.
  • Fixed a bug with how the Beaker executor streams log lines from Beaker which sometimes resulted in messages missing some starting characters, and tqdm lines being duplicated.
  • Fixed a bug in the Beaker workspace where the lock dataset wouldn't be removed if the step
    was found to be in an invalid state.
  • Improved cluster choice logic in BeakerExecutor to ensure greater diversity of clusters when submitting many steps at once.
  • Fixed bug where sub-processes of the multicore executor would use the wrong executor if executor was defined in a tango.yml file.

Commits

4f89d55 Improve Beaker cluster choice logic (#392)
e1ceae2 Display URL to logs for failed steps (#390)
3dc9591 Bump black from 22.6.0 to 22.8.0 (#380)
c9ce257 Catch when Beaker experiments are stopped (#389)
0fe12e9 Fix issues with WandbWorkspace causing CI crash (#388)
342eb26 Keep parameters in Params objects to make error messages more readable (#375)
92f0354 Simplified beaker logging (#383)
fd9d3cc Only the Beaker executor needs clean working directories (#373)
06f26ae Update wandb artifact type (#378)
f6a6b70 Update base images, get us out of the latest infinite loop of pip madness (#382)
306986b Catch all errors when attempting log record decode (#379)
628caff Allowing indexing into step results in config (#371)
858cef8 Minor improvement to Beaker logging (#377)
7a5619e Add Workspace.step_result() method (#374)
0750d76 Fix bugs with how BeakerExecutor streams logs (#372)
6e8b107 Detailed train outputs (#369)
bcd50d8 Update pytorch-lightning requirement from <1.7,>=1.6 to >=1.6,<1.8 (#349)
8ed0c86 Bump fairscale from 0.4.6 to 0.4.8 (#347)
62f2746 Python minimum version is 3.8 (#368)
45e02fe Auto import local Python modules when searching for registered classes (#367)

v0.12.0

24 Aug 00:06

What's new

Added 🎉

  • Step resources:
    • Added a step_resources parameter to the Step class which should be used to describe the computational resources required to run a step.
      Executor implementations can use this information. For example, if your step needs 2 GPUs, you should set
      step_resources=StepResources(gpu_count=2) ("step_resources": {"gpu_count": 2} in the configuration language).
    • Added a Step.resources() property method. By default this returns the value specified by the step_resources parameter.
      If your step implementation always requires the same resources, you can just override this method so you don't have to provide
      the step_resources parameter.
  • Step execution:
    • Added an executor field to the tango.yml settings. You can use this to define the executor you want to use by default.
    • Added a Beaker Executor to the Beaker integration, registered as an Executor with the name "beaker".
      To use this executor, add these lines to your tango.yml file:
      executor:
        type: beaker
        beaker_workspace: ai2/my-workspace
        clusters:
          - ai2/general-cirrascale
      See the docs for the BeakerExecutor for more information on the input parameters.
  • Step class:
    • Added a metadata field to the step class API. This can be set through the class
      variable METADATA or through the constructor argument step_metadata.
  • Weights & Biases integration:
    • You can now change the artifact kind for step result artifacts by adding a field
      called "artifact_kind" to a step's metadata.
      For models, setting "artifact_kind" to "model" will add the corresponding artifact to W&B's new model zoo.
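
The two ways of declaring step resources described above can be sketched as follows. The classes are local stand-ins defined for this example and are not imported from Tango; only the gpu_count field is taken from the notes, and the default of 0 GPUs is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepResources:
    """Stand-in resource description; the default of 0 GPUs is an assumption."""

    gpu_count: int = 0


class Step:
    """Stand-in base class that accepts resources through the constructor."""

    def __init__(self, step_resources: Optional[StepResources] = None) -> None:
        self.step_resources = step_resources

    @property
    def resources(self) -> StepResources:
        # By default, return whatever was passed as step_resources.
        return self.step_resources or StepResources()


class TwoGpuStep(Step):
    """A step that always needs two GPUs, so it overrides the property."""

    @property
    def resources(self) -> StepResources:
        return StepResources(gpu_count=2)


configured = Step(step_resources=StepResources(gpu_count=2))
hardcoded = TwoGpuStep()
```

Both configured.resources and hardcoded.resources report two GPUs; the second form saves callers from passing step_resources every time.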

Changed ⚠️

  • CLI:
    • The tango run command will throw an error if you have uncommitted changes in your repository, unless
      you use the --allow-dirty flag.
    • The tango run command will use the lightweight base executor (single process) by default.
      To use the multi-process executor, set -j/--parallelism to 1 or higher, or to -1 to use all available CPU cores.
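
The mapping from the -j/--parallelism value to an executor could look roughly like this. It is a sketch of the rule stated above, not the CLI's code: choose_executor is a hypothetical function, the executor names are illustrative labels, and treating 0 like an unset value is an assumption.

```python
import os
from typing import Optional, Tuple


def choose_executor(parallelism: Optional[int]) -> Tuple[str, int]:
    """Return (executor_label, worker_count) for a given -j value (hypothetical)."""
    if parallelism is None or parallelism == 0:
        # No -j given: lightweight single-process executor.
        # Handling 0 the same way is an assumption of this sketch.
        return ("default", 1)
    if parallelism == -1:
        # -1 means "use all available CPU cores".
        return ("multicore", os.cpu_count() or 1)
    if parallelism >= 1:
        return ("multicore", parallelism)
    raise ValueError(f"unsupported parallelism value: {parallelism}")
```

For example, choose_executor(None) selects the single-process path, while choose_executor(4) selects the multi-process path with four workers.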

Fixed ✅

  • Fixed bug where StepInfo environment and platform metadata could be out-of-date if a step is run again due to failure.
  • Fixed a bug where an unfortunate combination of early stopping and decreasing model performance could result in a crash in the torch trainer.

Commits

befb00a Add workspace_metadata arg to Step class, allow changing artifact kind in W&B workspace (#363)
5ab1c2a Fix undefined behavior with TorchTrainStep (#366)
bf3c1a0 Update filelock requirement from <3.8,>=3.4 to >=3.4,<3.9 (#354)
b4e48a7 Update jsonpickle requirement from <2.2.0,>=2.1.0 to >=2.1.0,<2.3.0 (#351)
1c491f0 Update wandb requirement from <0.13,>=0.12 to >=0.12,<0.14 (#350)
93d5eb4 Bump allenai/setup-beaker from 1 to 2 (#359)
dc0f89a Fix #355 - ensure git metadata is up-to-date (#361)
258e880 Raise better error msg from step_result_for_run() (#360)
43916d1 Print debugging information about the repo used. (#353)
928aa7a Add BeakerExecutor (#340)

v0.11.0

04 Aug 22:32

What's new

Added 🎉

  • Added a Flax integration along with an example config.

Commits

b4cd2b3 Flax Integration (#313)
b9a7422 Bump sphinx from 5.0.2 to 5.1.1 (#346)
d7952ef Bump mypy from 0.961 to 0.971 (#339)
6a58bfd Put PIP install instructions first (#348)

v0.10.1

26 Jul 20:57

What's new

Fixed ✅

  • Fixed issue where the StepInfo config argument could be parsed into a Step.
  • Restored capability to run tests out-of-tree.

Commits

2498318 Fix issue where StepInfo config could be parsed into a Step (#344)
57096b2 Make tests runnable out-of-tree for help with conda-packaging (#307)

v0.10.0

07 Jul 21:04

What's new

Changed ⚠️

  • Renamed workspace parameter of BeakerWorkspace class to beaker_workspace.
  • Executor class is now a Registrable base class. MulticoreExecutor is registered as "multicore".
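
What a registrable base class makes possible can be shown with a minimal name-to-class registry. This is a generic sketch of the pattern with local stand-in classes; the register and by_name helpers below are written for this example and may not match the signatures of Tango's Registrable.

```python
from typing import Callable, Dict, Type


class Executor:
    """Stand-in base class holding a registry of named subclasses."""

    _registry: Dict[str, Type["Executor"]] = {}

    @classmethod
    def register(cls, name: str) -> Callable[[Type["Executor"]], Type["Executor"]]:
        def decorator(subclass: Type["Executor"]) -> Type["Executor"]:
            cls._registry[name] = subclass
            return subclass

        return decorator

    @classmethod
    def by_name(cls, name: str) -> Type["Executor"]:
        return cls._registry[name]


@Executor.register("multicore")
class MulticoreExecutor(Executor):
    def __init__(self, parallelism: int = 1) -> None:
        self.parallelism = parallelism


# A config such as {"type": "multicore", "parallelism": 2} can now be resolved by name.
executor = Executor.by_name("multicore")(parallelism=2)
```

Looking classes up by a registered name is what lets an executor be selected from a configuration file instead of from code.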

Removed 👋

  • Removed StepExecutionMetadata. Its fields have been absorbed into StepInfo.

Fixed ✅

  • Improved Step.ensure_result() such that the step's result doesn't have to be read from the cache.
  • Fixed an issue with the output from MulticoreExecutor such that it's now consistent with the default Executor for steps that were found in the cache.
  • One of our error messages referred to a configuration file that no longer exists.
  • Improved performance of BeakerWorkspace.

Added 🎉

  • Added the ability to train a Model instance directly instead of just a Lazy[Model].

Commits

4e809f5 Eager models (#319)
361777b Metadata changes, make executor registrable (#331)
a6b0be9 Beaker workspace performance (#328)
f43e5ea Update torch requirement from <1.12,>=1.9 to >=1.9,<1.13 (#330)
8495c64 update dev dependencies (#333)
712d862 Make multicore executor output consistent with default (#325)
903569c Refer to the right config file (#324)
bd9e4be Modernize our issue templates (#323)

v0.9.1

24 Jun 16:05

What's new

Fixed ✅

  • Fixed non-deterministic behavior in TorchTrainStep.
  • Fixed bug in BeakerWorkspace where .step_info(step) would raise a KeyError if the step hasn't been registered as part of a run yet.
  • Fixed a bug in BeakerWorkspace where it would send too many requests to the beaker service.
  • Fixed a bug where WandbWorkspace.step_finished() or .step_failed() would crash if called
    from a different process than .step_starting().
  • Fixed a bug in WandbWorkspace.step_finished() which led to a RuntimeError sometimes while
    caching the result of a step.

Commits

c6fc5be Fix bugs with Workspace and WandbWorkspace, specifically (#321)
80c90ca Beaker DOS fix (#315)
8b75591 Log from BeakerStepLock at WARNING level (#316)
4d46d67 fix non-deterministic behavior in TorchTrainStep (#314)
c59b6b3 Bump actions/setup-python from 3 to 4 (#311)
b02cf40 Bump sphinx from 4.5.0 to 5.0.1 (#305)
4501815 Bump furo from 2022.6.4 to 2022.6.4.1 (#309)
da9c29c Fix bug in Beaker workspace (#312)
e8422cb Bump mypy from 0.960 to 0.961 (#308)
8256a74 Bump myst-parser from 0.17.2 to 0.18.0 (#310)
44ae92e Bump furo from 2022.4.7 to 2022.6.4 (#306)
39923ae Update protobuf requirement from <=3.20.0 to <4.22.0 (#301)
e7ef1f5 Registerables first steps eg (#304)

v0.9.0

01 Jun 21:34

What's new

Added 🎉

  • Added a Beaker integration that comes with BeakerWorkspace, a remote Workspace implementation that uses Beaker Datasets under the hood.
  • Added a datasets::dataset_remix step that provides the split remixing functionality of tango.steps.dataset_remix.DatasetRemixStep for Huggingface DatasetDict objects.

Changed ⚠️

  • If you try to import something from a tango integration that is not fully installed due to missing dependencies, an IntegrationMissingError will be raised
    instead of ModuleNotFound.
  • You can now set -j 0 in tango run to disable multicore execution altogether.

Fixed ✅

  • Improved how steps and workspaces handle race conditions when different processes are competing to execute the same step. This would result in a RuntimeError before with most workspaces, but now it's handled gracefully.
  • Fixed bug which caused GradScaler state to not be saved and loaded with checkpoints.

Commits

0ddd2ac Add Beaker integration (#296)
6bdd1dd Updates the Euler example (#297)
bc89470 GradScaler state saving and loading (#293)
b8562db fix old filename in CONTRIBUTING.md (#300)
4aff1bb Dataset remix (#298)
eb1fcd8 Bump mypy from 0.950 to 0.960 (#295)
903741e Update filelock requirement from <3.7,>=3.4 to >=3.4,<3.8 (#284)
b58b823 Handle missing integrations (#292)

v0.8.0

20 May 00:52

What's new

Added 🎉

  • Added a Weights & Biases remote Workspace implementation: WandbWorkspace, registered as "wandb".
    This can be instantiated from a workspace URL in the form "wandb://entity/project".
  • Added a method Workspace.step_result_for_run which gives the result of a step given the run name and step name within that run.
  • Added property Workspace.url, which returns a URL for the workspace that can be used to instantiate the exact same workspace using Workspace.from_url(). Subclasses must implement this.
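
The "wandb://entity/project" URL form can be taken apart with the standard library. This is a sketch of one way to read such a URL; parse_wandb_url is a hypothetical helper, and Tango's own parsing may differ.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_wandb_url(url: str) -> Tuple[str, str]:
    """Split a 'wandb://entity/project' URL into (entity, project)."""
    parsed = urlparse(url)
    if parsed.scheme != "wandb":
        raise ValueError(f"expected a wandb:// URL, got {url!r}")
    entity = parsed.netloc
    project = parsed.path.strip("/")
    if not entity or not project:
        raise ValueError(f"URL must look like wandb://entity/project, got {url!r}")
    return entity, project


entity, project = parse_wandb_url("wandb://entity/project")
```

A workspace that exposes the same string through its url property can then be re-created from it, which is the round trip the Workspace.url property is meant to support.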

Changed ⚠️

  • StepInfo start and end times will now always be in UTC.
  • WandbTrainCallback now logs system metrics from each worker process in distributed training.
  • StepCache.__contains__() and StepCache.__getitem__() now accept either a Step or StepInfo as an argument (Union[Step, StepInfo]).
  • Refactored tango.step_graph.StepGraph to allow initialization from a Dict[str, Step].
  • Executor.execute_step_graph() now attempts to execute all steps and summarizes success/failures.
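
The "attempt everything, then summarize" behaviour can be sketched as a loop that records failures instead of stopping at the first one. RunSummary and execute_all are hypothetical names, and the sketch ignores dependencies between steps, which a real executor has to respect.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RunSummary:
    """Hypothetical summary of which steps succeeded and which failed."""

    successful: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)


def execute_all(steps: Dict[str, Callable[[], object]]) -> RunSummary:
    """Run every step, recording failures rather than aborting on the first one."""
    summary = RunSummary()
    for name, run in steps.items():
        try:
            run()
        except Exception:
            summary.failed.append(name)
        else:
            summary.successful.append(name)
    return summary


def broken_step() -> None:
    raise RuntimeError("boom")


summary = execute_all({"load": lambda: [1, 2, 3], "train": broken_step, "report": lambda: "done"})
```

In this run the failure of "train" does not prevent "report" from being attempted, and the summary lists both outcomes.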

Fixed ✅

  • Fixed bug with LocalWorkspace.from_parsed_url() (#278).
  • Deprecation warnings will now be logged from tango CLI.
  • Fixed the text format in the case of serializing an iterator of strings.
  • Added missing default value of None to TangoGlobalSettings.find_or_default().
  • Mypy has become incompatible with transformers and datasets, so we have to disable the checks in some places.
  • The VERSION member of step arguments that were wrapped in Lazy was not respected. Now it is.

Commits

3069226 Makes sure the VERSION parameter of classes is respected even when we construct them inside of a Lazy object. (#289)
dd71446 Add Weights & Baises remote workspace (#232)
e3f2bd2 Adds a dependency that's missing from transformers (#285)
25919e1 Fixes the text format (#283)
381de74 Add missing default to TangoGlobalSettings.find_or_default() (#282)
9ac708a Update click requirement from <8.1.3,>=7.0 to >=7.0,<8.1.4 (#277)
749357e Bump mypy from 0.942 to 0.950 (#276)
2c59c96 Bump allenai/beaker-run-action from 1.0 to 1.1 (#274)
53ffe80 refactor (#275)

v0.7.0

19 Apr 22:52

What's new

Added 🎉

  • Added the "-n/--name" option to tango run. This option allows the user to give the run an arbitrary name.
  • Added a convenience property .workspace to Step class that can be called from a step's .run() method to get the current Workspace being used.
  • Gave FromParams objects (which include all Registrable objects) the ability to version themselves.
  • Added CLI option to run a single step in a config using --step-name or -s.
  • Added a MulticoreExecutor that executes steps in parallel.
  • Added an ExecutorOutput dataclass that is returned by Executor.execute_step_graph().
  • StepGraph now prints itself in a readable way.
  • Tango now automatically detects when it's running under a debugger, and disables multicore support accordingly. Many debuggers can't properly follow sub-processes, so this is a convenience for people who love debuggers.
  • Added more models to the stuff we can import from the transformers library.
  • Added new example for finetuning text-to-text models.
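
One common way to detect a debugger in Python is to check whether a trace function is installed. The sketch below uses that heuristic to turn off parallelism; it is an assumption that this resembles what Tango does, and both function names are hypothetical.

```python
import sys
from typing import Optional


def running_under_debugger() -> bool:
    """Heuristic: debuggers usually install a trace function via sys.settrace."""
    return sys.gettrace() is not None


def effective_parallelism(requested: int, debugging: Optional[bool] = None) -> int:
    """Return 0 (no sub-processes) when debugging, otherwise the requested value.

    `debugging` can be passed explicitly to override the heuristic.
    """
    if debugging is None:
        debugging = running_under_debugger()
    return 0 if debugging else requested
```

Note that coverage tools and profilers can also install trace functions, so a heuristic like this may report a debugger when none is attached.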

Changed ⚠️

  • Renamed click_logger to cli_logger, and we now use rich's logging Handler as the default handler, which means prettier output, better tracebacks, and you can use rich's markup syntax with the cli_logger to easily add style to text.
  • Refactored tango.step_graph.StepGraph to allow initialization from a Dict[str, Step].
  • Executor.execute_step_graph() now attempts to execute all steps and summarizes success/failures.
  • Upgraded PyTorch version in tango Docker image to latest v1.11.0+cu113.
  • RunGeneration now allows model object as input.

Fixed ✅

  • Fixed bug that mistakenly disallowed fully-qualified names containing "_" (underscores) in the config.
  • Fixed bug where TorchTrainStep working directory would be left in an unrecoverable state if training failed after saving the final model weights.
  • Fixed bug in FromParams where **kwargs might be passed down to the constructors of arguments.
  • Fixed bug in the way dependencies are tracked between steps.
  • Fixed bug that caused MulticoreExecutor to hang in case of a failing step that was required recursively (not directly) downstream.
  • Compatibility with PyTorch Lightning 1.6

Commits

1083049 Finetuning (#255)
42b1dba Bug fix with failing steps (#257)
7bd251a Bump myst-parser from 0.17.0 to 0.17.2 (#273)
cc9a1dd Bump actions/upload-artifact from 2 to 3 (#262)
66777d9 Bump actions/download-artifact from 2 to 3 (#261)
14d4adb use new beaker-action for building test image (#265)
af47287 Update pytorch-lightning requirement from <1.6,>=1.5 to >=1.5,<1.7 (#248)
b1df9a4 use beaker-run action for GPU Tests (#263)
0a7468e fix release job (#260)
c1b16b2 Bump furo from 2022.3.4 to 2022.4.7 (#259)
b55aaf2 use beaker-py to submit GPU tests (#258)
b2a93a9 Logging part 2: denoising run logging and making Dirk happy (#252)
ff6be8d Update click requirement from <=8.0.4,>=7.0 to >=7.0,<8.1.3 (#254)
83d78cc Bump mypy from 0.941 to 0.942 (#243)
3769327 Bump sphinx from 4.4.0 to 4.5.0 (#245)
81fc5c5 Bump black from 21.12b0 to 22.3.0 (#246)
e46059b Update tqdm requirement from <4.64,>=4.62 to >=4.62,<4.65 (#256)
bbdeb6f Revert "Set $TEMP (#241)"
b9fd9e9 Fix tracking dependencies between steps (#249)
53502e1 Pretty-print a step graph (#250)
d5328c9 Fix dissimilar objects hashing to the same thing (#240)
ccc37ce Autodetect debugger and turn off multicore (#251)
5c39f61 Pin click
5bb0fad Logging improvements (#233)
037e4a0 fix bug with FromParams (#242)
e142530 Bump actions/cache from 2 to 3 (#236)
878402d Set $TEMP (#241)
2d9fa0c fix bug w/ TorchTrainStep working dir (#238)
410faeb Multicore Parallelism (#204)
9e8e99f Update datasets requirement from <2,>=1.12 to >=1.12,<3 (#234)
40e0a1a Bump mypy from 0.940 to 0.941 (#230)
ede7428 add name to changelog workflow
4bb659b Bump actions/setup-python from 2 to 3 (#229)
5db1a6a Bump actions/checkout from 1 to 3 (#228)
8049104 Update torch version where it's hard-coded, add an automatic remind to do this stuff in the future (#227)
fe05449 add back intersphinx inventory links for HF libraries (#222)
9927749 Bump mypy from 0.931 to 0.940 (#226)
29ab68b Update torch requirement from <1.11,>=1.9 to >=1.9,<1.12 (#225)
a3fc83b Bump furo from 2022.2.23 to 2022.3.4 (#218)
28e839e Bump fairscale from 0.4.5 to 0.4.6 (#224)
f18d393 Update tqdm requirement from <4.63,>=4.62 to >=4.62,<4.64 (#213)
54c4a8d automatically keep copyright up-to-date (#221)
06adb07 Allow setting the run name as a command-line option (#212)
71e0639 Update cached-path requirement from <1.1,>=1.0 to >=1.0,<1.2 (#217)
5d4660a Temporarily remove intersphinx links to HF docs (#220)
13c7f3f Merge pull request #216 from allenai/VersionForFromParams
0027cb2 Merge pull request #215 from allenai/fix-fully-qualified-name-recognition
76f9922 Add "Step.workspace" property (#210)

v0.6.0

25 Feb 21:35

What's new

Added 🎉

  • New example that finetunes a pre-trained ResNet model on the Cats & Dogs dataset.
  • Added a '@requires_gpus' decorator for marking tests as needing GPUs. Tests marked with this will be run in the "GPU Tests" workflow
    on dual k80 GPUs via Beaker.
  • Added the "-w/--workspace" option to tango run and tango server commands. This option takes a path or URL, and instantiates the workspace from the URL using the newly added Workspace.from_url() method.
  • Added the "workspace" field to TangoGlobalSettings.
  • Added the "environment" field to TangoGlobalSettings for setting environment variables each
    time tango is run.
  • Added a utility function to get a StepGraph directly from a file.
  • Added tango.settings module and tango settings group of commands.
  • Added a format for storing sequences as SqliteSparseSequence.
  • Added a way to massage kwargs before they determine the unique ID of a Step.
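
The idea of massaging kwargs before they feed into a unique ID can be sketched with a hash over a filtered copy of the arguments. This is not Tango's hashing scheme: massage_kwargs, unique_id, and the JSON-plus-SHA-256 approach are assumptions made for illustration, and the ignored names come from the TorchTrainStep fix listed in this release.

```python
import hashlib
import json
from typing import Any, Dict

# Parameters that should not change a step's identity (taken from this release's notes).
IGNORED_FOR_ID = {"distributed_port", "log_every"}


def massage_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Drop arguments that affect how a step runs but not what it produces."""
    return {k: v for k, v in kwargs.items() if k not in IGNORED_FOR_ID}


def unique_id(step_name: str, kwargs: Dict[str, Any]) -> str:
    """Hash the massaged kwargs into a short, deterministic identifier."""
    payload = json.dumps(massage_kwargs(kwargs), sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"{step_name}-{digest[:16]}"


id_a = unique_id("train", {"lr": 0.1, "log_every": 10})
id_b = unique_id("train", {"lr": 0.1, "log_every": 100})
id_c = unique_id("train", {"lr": 0.2, "log_every": 10})
```

Changing log_every leaves the ID untouched, so a cached result can be reused, while changing lr produces a different ID.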

Changed ⚠️

  • local_workspace.ExecutorMetadata renamed to StepExecutionMetadata and now saved as execution-metadata.json.
  • tango run without the option "-w/--workspace" or "-d/--workspace-dir" will now use a MemoryWorkspace instead of a LocalWorkspace in a temp directory, unless you've specified
    a default workspace in a TangoGlobalSettings file.
  • Moved tango.workspace.MemoryWorkspace and tango.local_workspace.LocalWorkspace to tango.workspaces.*.
  • Moved tango.step_cache.MemoryStepCache and tango.step_cache.LocalStepCache to tango.step_caches.*.
  • Deprecated the -d/--workspace-dir command-line option. Please use -w/--workspace instead.

Fixed ✅

  • Fixed a small bug where LocalWorkspace would fail to capture the conda environment in our Docker image.
  • Fixed activation of FILE_FRIENDLY_LOGGING when set from the corresponding environment variable.
  • Fixed setting log level via the environment variable TANGO_LOG_LEVEL.
  • Use relative paths within the work_dir for symbolic links to the latest and the best checkpoints in TorchTrainStep.
  • Fixed some scenarios where Tango can hang after finishing all steps.
  • distributed_port and log_every parameters won't factor into TorchTrainStep's unique ID.
  • MappedSequence now works with slicing.
  • MappedSequence now works with Huggingface Dataset.
  • Uncacheable steps are now visible in Tango UI.
  • Fixed bug in Registrable.list_available() where an error might be raised if the default implementation hadn't been explicitly imported.
  • Fixed issue where having a default argument to the run() method wasn't getting applied to the step's unique ID.
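
Slicing support in a lazily mapped sequence can be sketched as follows. LazyMappedSequence is a hypothetical class written for this example, not Tango's MappedSequence: a slice returns another lazy view, and the function is applied only when an element is read.

```python
from collections.abc import Sequence
from typing import Any, Callable


class LazyMappedSequence(Sequence):
    """Apply `fn` to items of `inner` on access instead of up front."""

    def __init__(self, fn: Callable[[Any], Any], inner: Sequence) -> None:
        self.fn = fn
        self.inner = inner

    def __getitem__(self, index: Any) -> Any:
        if isinstance(index, slice):
            # Slice the underlying data and stay lazy.
            return LazyMappedSequence(self.fn, self.inner[index])
        return self.fn(self.inner[index])

    def __len__(self) -> int:
        return len(self.inner)


doubled = LazyMappedSequence(lambda x: 2 * x, [1, 2, 3, 4])
middle = doubled[1:3]
```

Here middle is itself a lazy sequence of length 2, and list(middle) evaluates the function only for the two selected items.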

Commits

f9da0af Merge pull request #211 from allenai/Massage
e78dcbe Allow setting environment variables in tango settings, fix bug with TANGO_LOG_LEVEL env var (#209)
82404b6 Re-create LICENSE so GitHub will show it (#208)
0fadecf Bump furo from 2022.2.14.1 to 2022.2.23 (#207)
787b6e6 Merge pull request #206 from allenai/settings
c3401f2 Merge pull request #205 from allenai/RobustnessFixes
7ceda9c Merge pull request #201 from allenai/workspace-prep
6dd7d86 Merge pull request #200 from allenai/uncacheable-steps-in-server
5ad3f44 Bump furo from 2022.1.2 to 2022.2.14.1 (#199)
3528230 Update filelock requirement from <3.5,>=3.4 to >=3.4,<3.7 (#202)
21d6d40 Merge pull request #193 from allenai/StepGraphFromFile
258a7d2 skip 'distributed_port' and 'log_every' in unique ID (#197)
dd4c47f Merge pull request #192 from allenai/CloseSqliteHarder
cc94e1c Merge pull request #156 from allenai/DocumentationRefresh
5cc86b8 Rename "ExecutorMetadata" -> "StepExecutionMetadata" (#195)
6aecab7 Bump myst-parser from 0.16.1 to 0.17.0 (#191)
6478293 make pushing test image to Beaker more robust (#190)
7c1ac5b Finetune resnet Example for Tango (#150)
5187b01 update docs for integration tests and gpu tests timeout
95b78b5 Add new manually triggered workflow for integration tests, other bug fixes (#188)
19f7b31 Merge pull request #189 from allenai/fix-checkpoint-path-link
a438b26 Workflow quickfix
671a6dc verify exit code of beaker job (#187)
7ccad94 Merge pull request #186 from allenai/add-tests
bf6ecd0 Run GPU tests on Beaker (#183)