
files as dependency #867

Open
alperyilmaz opened this issue Jun 11, 2021 · 21 comments · May be fixed by #1906

Comments

@alperyilmaz

Hi,
I'm glad to have discovered just, since make is painful to write. However, I'd still like some dependencies to be actual files: the recipe should run if the file is missing or has been updated. "File exists" is easy to work around, and there are workarounds for the "if file changed" case, such as watchexec or the tricks mentioned in #424. But I was still wondering: is it possible to use a file as a dependency, so that dependents run if the file has changed or is missing?

Just a hypothetical example:

process-data: input.csv
    #!/usr/bin/env Rscript
    library(tidyverse)
    input <- read_csv("input.csv")
    result <- input  # some analysis would go here
    write_csv(result, "output.csv")

Is it possible that just process-data does not run if input.csv has not changed?

@casey
Owner

casey commented Jun 11, 2021

I definitely see why this would be useful, and it's a useful feature of Make.

It's tricky though! Let's use a simplified example. In this justfile, I'm using a string dependency to mean "this dependency is actually a path to an input file". Here's a simple recipe that just copies an input file to an output file:

foo: "input.csv"
  cat input.csv > output.csv

However, this isn't yet enough information. just needs to be told what the output file is, because it should only run this recipe if the output file is older than the input file. Here's the same justfile, but with > indicating an output file:

foo: "input.csv" > "output.csv"
  cat input.csv > output.csv

Now just knows the input file and the output file, and can run the recipe only when input.csv is newer than output.csv.
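The check just would have to perform for that `>` syntax can be sketched in plain shell (file names are hypothetical, and we create them in a scratch directory so the example is self-contained; `-nt` is supported by bash and most `test` implementations, though not strictly POSIX):

```shell
# scratch directory so the example doesn't touch real files
demo=$(mktemp -d); cd "$demo"
echo "a,b" > input.csv

# run the recipe only if the output is missing or older than the input
if [ ! -e output.csv ] || [ input.csv -nt output.csv ]; then
  cat input.csv > output.csv
fi
```

Running this a second time without modifying input.csv skips the copy, which is exactly the freshness behavior being proposed.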

There are some gotchas, though. If you were using a compiler and updated it to a new version, just wouldn't know to re-run the recipe: the output file depends on both the input file and the installed compiler, so it should be rebuilt even though, strictly speaking, the output file was fresher than the input file:

foo: "main.c" > "a.out"
  cc main.c

Similarly, if you changed the compiler flags, just would need to consider those flags part of the "input". For example, if you changed -O3 to -Os in the following justfile:

foo: "main.c" > "a.out"
  cc -O3 main.c

My hunch is that this is just too hard to get right enough for it to be worth adding, and that there would be a confusing long-tail of issues.

It's not a satisfying answer, but you could always create a Makefile and call it from your justfile. I've also used tup which is like Make but a bit saner, so that could be another option. There's also ninja, which I think is not meant to be used directly, but lots of people do anyways.

In order for me to be convinced that this was a good idea, I think someone would have to present a design which was relatively simple, with straightforward limitations that would make sense to users, and didn't look like it would lead to a long-tail of weird issues.

I'll leave this open in the hope that someone comes up with such a design!

@casey
Owner

casey commented Jun 11, 2021

You could also embed a ninja file inside a justfile, which is actually kind of nice:

ninja:
  #!/usr/bin/env ninja -f
  cflags = -Wall

  rule cc
    command = gcc $cflags -c $in -o $out

  build main.o: cc main.c

@casey
Owner

casey commented Jun 11, 2021

(Perhaps not super useful though, since you're using a shebang recipe, and I don't think it's possible to embed non-shell commands in ninja.)

@9SMTM6

9SMTM6 commented Oct 20, 2022

This is the reason I'm not really considering just despite otherwise agreeing with what I've seen from its documentation.

Input-to-output date-based caching isn't perfect, but it's portable, it's easy to understand, and it doesn't rely on some stateful process that won't survive a reboot.

It's not as if just is perfect either. You mentioned the possibility of a compiler update. Well, considering you call commands just by name, a simple alias (perhaps not a shell alias, but still) with different behavior can break things just as well.

Provided your commands don't crap out with unexpected input, at worst input-to-output caching will fail with unexpected error output, and you'll have to fall back to the only behavior currently officially supported: just (is that where the name comes from?) running everything.

I agree that the default behavior, and often the syntax, of make are problematic. But that basic behavior, while not perfect, is too valuable for me to give up. Not every command reuses old results like e.g. cargo does, and in that case just with properly declared dependencies becomes unusable for many workflows. The proposed solutions (using another build tool, or keeping a watch process running) are not sufficient IMO and have more issues than input-to-output support would.

So, after that long rant: I'm far from a finished solution, but I believe the idea might be beneficial (though it would probably require a lot of work, even if hopefully only additive in behavior).

One could add a way to enable caching when some "was changed" validation returns false. Combined with a helper (shipped with appropriate warnings that it's not perfect) that implements the input-date vs. output-date validation, and perhaps an easy way to define other helpers (e.g. one that checks for changes to the compilers used) and combine them, this would solve the issue for me and might be acceptable to you.

Where I see the most issues is in syntax, and in how to handle these validators without introducing a complex system. Perhaps one could define them similarly to normal tasks, but they'd still behave differently, since they have to somehow signify true/false.

@casey
Owner

casey commented Oct 20, 2022

I'm open to this being implemented, but it's a lot of work, and there are a lot of tricky open questions. Make is a kitchen sink of weird features needed to support edge cases, and many aspects of its design are widely regarded as very bad.

The best way forward is for someone to figure out how to make just work with redo. Redo is a simple make-like build system with a few implementations, and which has a number of nice properties.

I suspect that just + redo can be made to work relatively easily, and with minimal changes to just.

@runeimp

runeimp commented Oct 21, 2022

I believe Redo, Ninja, and others keep a hash for each source file; when processing is requested, the current source file's hash is checked against the cached hash to know whether it's been modified since the last processing run.

I created a standalone tool that lets shell scripts and such have the same functionality: https://github.com/runeimp/filedelta . This sort of functionality obviates the need to check whether the source is newer than the target: only the source needs to be checked, which is handy for dependency checking, and it still works if the target is accidentally "touched" after the last modification to the source. The basic process is pretty simple and could be extended to take a "context" hash or string as well, for cases where the source file hasn't changed but needs to be run in a new context.
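The hash-then-compare flow described above can be sketched with sha256sum (a sketch of the general approach, not filedelta's actual implementation; file names and the cache-file name are made up):

```shell
# scratch directory with a made-up source file
demo=$(mktemp -d); cd "$demo"
echo 'int main(void) { return 0; }' > main.c

cache=.hash-cache
new=$(sha256sum main.c)
old=$(cat "$cache" 2>/dev/null || true)

if [ "$new" != "$old" ]; then
  # source changed (or was never hashed): run the build step, then record the hash
  echo "rebuilding"
  printf '%s\n' "$new" > "$cache"
else
  echo "up to date"
fi
```

Note that only main.c and the small cache file are consulted; the build output itself never has to exist for the check to work.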

@naps62

naps62 commented Dec 2, 2022

Anytime I use a justfile for a project, I end up limited by this issue.
Don't get me wrong, the tool is great, but I think this is really worth considering more deeply, despite all the caveats already mentioned. So let me try to provide some input.

The only concrete syntax proposed here was:

foo: "input.csv" > "output.csv"
  cat input.csv > output.csv

which I'm personally not a fan of. For a few reasons:

  • in my opinion, it makes the rules much less readable
  • it mixes with the syntax for dependencies on other recipes (test: build); will this lead to the need for a PHONY feature as well?
  • how would it even play with the && operator?

Attributes syntax

Looking at the existing syntax already supported by Just, I think a good place to think about this would be Recipe Attributes. A quick (and probably bad) example:

[if(updated(input.csv))]
foo:
  cat input.csv > output.csv

I disagree with @casey that the output file is needed for this to work. We can choose to compute a hash of the input, store it somewhere, and run only once it changes.
We could also provide something like an --always flag, which could be a good quality-of-life feature to skip all these conditions explicitly and do a full run after a compiler update.

Custom user commands

As an alternative, or maybe even in addition, we could also open up the possibility of people specifying their own arbitrary conditions:

needs_build := `find . -name input.csv -newer output.csv`

[if(needs_build)]
build:
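For reference, here is how that find-based condition behaves in plain shell (a sketch with made-up files; note that if output.csv does not exist at all, find errors out, so a real check would also test for its existence):

```shell
# scratch directory with output older than input
demo=$(mktemp -d); cd "$demo"
echo data > output.csv
sleep 1                      # ensure a later mtime even on 1-second-resolution filesystems
echo data > input.csv

# non-empty output from find means input.csv is newer than output.csv
needs_build=$(find . -name input.csv -newer output.csv)
if [ -n "$needs_build" ]; then
  echo "rebuild needed"
fi
```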

@xonixx

xonixx commented Dec 9, 2022

just could consider incorporating an approach from makesure (full disclosure: I'm the author).

Makesure has @reached_if directive that can be added to any goal. This allows skipping goal execution if it's already satisfied.

Thus you can easily use bash's -nt (newer than) / -ot (older than) test operators like so:

@goal regenerate_file2_based_on_file1
@reached_if [[ file2 -nt file1 ]]
	do_something file1 > file2

Here is an example from real project:

@goal inventory_generated
@depends_on output_ready__instance_public_ip
@reached_if [[ -f "$INVENTORY" ]] && [[ "$INVENTORY" -nt 'inventory.tpl.yml' ]]
@doc 'inventory-<ENV>.yml generated from terraform data'
  instance_public_ip="$(cat "$INSTANCE_PUBLIC_IP")"
  awk -v instance_public_ip="$instance_public_ip" '
{
  gsub(/\$instance_public_ip/, instance_public_ip)
  gsub(/\$ENV/, ENVIRON["ENV"])
  print
}
' inventory.tpl.yml > "$INVENTORY"

Obvious cons of this approach: -nt/-ot are bash-specific, not present in POSIX shell, and I know nothing of PowerShell.
Pros: the @reached_if approach is more generic (and so more powerful) than just this case, since it can evaluate any scriptlet as a condition check.
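On the portability con above: POSIX find can express the same "newer than" test without bashisms (a sketch with made-up files, as one possible stand-in for `[[ file2 -nt file1 ]]`):

```shell
# scratch directory with two files whose mtimes differ
demo=$(mktemp -d); cd "$demo"
echo old > file1
sleep 1          # make sure the mtimes differ even with 1-second resolution
echo new > file2

# POSIX-portable equivalent of bash's [[ file2 -nt file1 ]]:
# find prints "file2" only if it is newer than file1
if [ -n "$(find file2 -newer file1)" ]; then
  echo "file2 is newer"
fi
```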

@arve0

arve0 commented Jan 11, 2023

I disagree with @casey that the output file is needed info for this to work. We can choose to compute a hash for the input, store it somewhere, and run only once it changes.

I like this idea.

What’s the benefit of storing a hash vs the last modified time?

One could also store a checksum for the justfile task itself, to check whether you've changed it since the last run. Storing modified times for the binaries used should also be possible.

@runeimp

runeimp commented Jan 14, 2023

The main benefit of hashing is simply that it doesn't require access to the target file to determine whether the source changed. This can be important on certain file systems where you can't guarantee modification times are stable. That's rarely the case, but when it is, hashing the source can guarantee a needed build (or whatever) happens.

It can also be important when building the source is incredibly costly in some way and you only want to rebuild (or whatever) when the source has definitely changed, as opposed to someone opening the source to review it, making NO changes, and accidentally saving it out of habit, thus changing its modification time to newer than the target's.
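That "accidental save" case is easy to demonstrate (a sketch; sha256sum stands in for whatever hash function the tool actually uses):

```shell
# scratch directory with a made-up source file
demo=$(mktemp -d); cd "$demo"
echo "content" > source.txt
before=$(sha256sum source.txt | cut -d' ' -f1)

touch source.txt   # updates the mtime without changing the content

after=$(sha256sum source.txt | cut -d' ' -f1)
# an mtime-based tool would rebuild here; a hash-based check correctly skips it
if [ "$before" = "$after" ]; then
  echo "hash unchanged: no rebuild needed"
fi
```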

@tgross35
Contributor

tgross35 commented Dec 5, 2023

I think it could be nice to use a function-like syntax to indicate source files. That way it's unambiguous: you don't have questions about whether your shell variable gets treated as a string, and you can mix and match file targets with other targets:

othertarget:
    echo done

foo: files(a.txt, b.c, d.csv, $foo_file) othertarget > files(x.o)
    cc ...

Or, just create something that is the opposite of .PHONY in makefiles. Often this kind of usage requires wildcards in the target, which I don't think is currently supported (any form of file-based targets probably needs some kind of wildcard support regardless of syntax).

[file]
%.c:

[file]
%.o: %.c
    gcc ...

# Maybe regex could be used? Much more flexible and better known than Make's `%`
[file]
.*\.o: $1\.c
    gcc ...

Doing these sort of things with Make is always horrid because of how easy it is to mistype a pattern and get "no rules to make target" somewhere in the graph. Hopefully Just could support something better.

@nmay231

nmay231 commented Feb 10, 2024

I think I discovered the simplest solution with the most power:

Add a recipe attribute [cached] that reruns the recipe only when its content changes, where the content is checked after just variables are expanded! (or when the last run of the recipe failed)

It's simple to understand with no extra syntax introduced.

Why is this so powerful? Because it gives access to EVERY function you can use in just

  • sha256_file() (and the upcoming blake3_file())
  • path_exists()
  • we can compare build tool versions with build-tool --version
  • we can add a last_modified() and/or is_older_than(a, b) function(s) to support that work flow
  • we can also make sha256_file() take a directory or a glob to specify recursive hashing, etc.
  • we can use env_var() if an environment variable is used implicitly by the commands in the recipe.

How do we include these "change detectors" in the recipe content without changing what runs? Put it in a comment.

[cached]
build:
   @# {{sha256_file("input.c")}} Prefix with @ to not print the comment, or don't if, for example, you want to see the build-tool version
   gcc input.c -o output

(We can also add something in the rare case a #! shebang process doesn't have a syntax for comments)

The only weakness of this strategy is when recipe arguments change what is done, rather than how it is done. For example:

[cached]
build BIN *FLAGS:
   RUSTFLAGS="{{FLAGS}}" cargo build --release --bin {{BIN}}

In the example, building binary1, then binary2 does not mean you need to rebuild binary1 (unless there is some weird dependency between them). However, changing FLAGS certainly means you have to rebuild any previous binary.

The only way I can think of to solve it is to allow the [cached] attribute to provide ways to "expand" the cache key. For example:

[cached("BIN")]
build BIN *FLAGS:

will evaluate {{BIN}} and include that as part of the cache key, along with the recipe name and project directory absolute path.
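A rough shell approximation of that lookup (the cache layout and file names here are hypothetical, not what just would actually implement): the cache key includes the evaluated BIN, so each binary gets its own entry, while a change to FLAGS changes the expanded body and therefore every entry's hash.

```shell
demo=$(mktemp -d); cd "$demo"

# hypothetical expanded recipe body for `just build client`
body='RUSTFLAGS="" cargo build --release --bin client'
key='build:BIN=client'
hash=$(printf '%s' "$body" | sha256sum | cut -d' ' -f1)

cache=".cache-$key"
if [ "$(cat "$cache" 2>/dev/null)" = "$hash" ]; then
  echo "cache hit: skip recipe"
else
  echo "cache miss: run recipe"
  printf '%s' "$hash" > "$cache"
fi
```

Running `build server` afterwards would use the key build:BIN=server, so it would not evict the client entry, matching the behavior described above.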

I'm gonna start implementing this if there are no major objections and after I finish some other open-source work I've been doing.

@casey
Owner

casey commented Feb 11, 2024

This is pretty interesting! It seems like it could be a low-impact way to support this.

@tgross35
Contributor

@nmay231 this is what I suggested at #1861, and Casey mentioned a few extra possible issues there. It sounds like a great idea though, awesome if you can implement it :)

@nmay231

nmay231 commented Feb 12, 2024

@tgross35 You're absolutely right 😅; they're practically the same. I had apparently only skimmed your solution without fully understanding it (subconscious plagiarism?)

I think the main difference between yours and mine is that yours would only rerun if the directory name changes, rather than when the directory contents change (sha256() vs sha256_file()). Of course, that would be more of a user error and not the fault of just.

@tgross35
Contributor

@tgross35 You're absolutely right 😅; they're practically the same. I had apparently only skimmed your solution without fully understanding it (subconscious plagiarism?)

Great minds think alike 🙂

I think the main difference between yours and mine is that yours would only rerun if the directory name changes, rather than when the directory contents change (sha256() vs sha256_file()). Of course, that would be more of a user error and not the fault of just.

Indeed, that was just an example of how I am implementing pseudo-cached recipes (cmake handles the actual changed files in that example). I think that if there is a basic implementation of cached recipes, then it should be easy enough to improve the ergonomics later. Maybe:

  • A builtin glob+hash cache_key("**/*.c") function to make purpose obvious
  • Something like cache_env("CARGO_.*") to hash the value of environment variables

@nmay231

nmay231 commented Feb 13, 2024

A builtin glob+hash cache_key("**/*.c") function to make purpose obvious

Instead of a whole new function, I was thinking of expanding sha256_file/blake3_file to allow directory names for recursive checks and/or to allow globs (currently, we error when given a directory). @casey, what are your thoughts? If it is a new function, I would bikeshed that it should be hash_files or files_hash instead of cache_key, so that it sounds less like a verb ("cache this key: ...").

Something like cache_env for a glob/regex of env vars will definitely be needed though.

In any case, I'll start working on recipe caching based on evaluated content since that isn't affected by these decisions.

Sidenote @tgross35 : We have used the term "cache key" in different ways (don't know/care which is "correct").

  • Yours was: when a cache key changes, the recipe will be rerun. So, a just-variable is a cache key for example.
  • Mine: I meant it like you append it to the literal key when looking up the previous recipe hash, e.g. recipe_hashes[justfile_directory, recipe_name, example_cache_key1, ...] = calc_hash(). This allows multiple targets of a recipe to be cached without requiring reruns each time (like my example above).

Just wanted to mention it before it caused confusion.

@tgross35
Contributor

Instead of a whole new function, I was thinking of expanding sha256_file/blake3_file to allow directory names for recursive checks and/or to allow globs (currently, we error when given a directory). @casey, what are your thoughts? If it is a new function, I would bikeshed that it should be hash_files or files_hash instead of cache_key, so that it sounds less like a verb ("cache this key: ...").

Something like cache_env for a glob/regex of env vars will definitely be needed though.

In any case, I'll start working on recipe caching based on evaluated content since that isn't affected by these decisions.

I was loosely thinking that a specific cache_key function could be used to explicitly append something to the cache key, rather than implicitly adding all interpolations (cache_env would append to the key as well). Agreed that having the *_file functions accept a glob could be better; my example really should have been cache_key(blake3_files("**/*.c")).

Sidenote @tgross35 : We have used the term "cache key" in different ways (don't know/care which is "correct").

* Yours was: when a cache key changes, the recipe will be rerun. So, a just-variable is a cache key for example.

* Mine: I meant it like you append it to the literal key when looking up the previous recipe hash, e.g. `recipe_hashes[justfile_directory, recipe_name, example_cache_key1, ...] = calc_hash()`. This allows multiple targets of a recipe to be cached without requiring reruns each time (like my example above).

By cache key I just meant whatever gets hashed to represent the state - variables/expressions, other computed hashes (file contents), or string literals. I'm not exactly sure what the recipe_hashes example is showing, but I think that is getting at the same thing.

All in all, I imagined something like this being stored in the cache directory, if key_hash gets changed then the recipe gets rerun:

{
  "cached_recipes": [
    { 
      "path": "/home/user/project/justfile",
      "recipe": "configure",
      // this is `blake3sum(blake3_files("**/*.c") + hash_env("CARGO.*") + some_other_var + ...)`
      "key_hash": "f7b2f545fb75d120c0dac039aff99ff472c21b170bf3d0714e7b9a34113e7f04",
      "last_run": "2024-01-21T08:40:52Z"
    }
    // ...
  ]
}

@nmay231 nmay231 linked a pull request Feb 16, 2024 that will close this issue
@nmay231

nmay231 commented Feb 16, 2024

I was loosely thinking that a specific cache_key function could be used to explicitly append something to the cache key,
rather than implicitly adding all interpolations

I see. I recently thought of cache_ignore() as a function that acts as the identity function when running the recipe but is excluded when calculating the recipe hash. I think this would be better overall, since you'll most likely want to exclude fewer things from the hash than you want to include.

I'm not exactly sure what the recipe_hashes example is showing, but I think that is getting at the same thing.

Actually, it's kinda the opposite :) Here is the basic json structure I went with for the cache file:

{
  "working_directory": "/typically/the/project/dir",
  "recipe_caches": {
    "recipe_name": {
      "hash": "2a680e74556e82b6f206e3e39c6abe3e28a6530c8494ea46e642a41b9ef7424a"
    }
  }
}

So the hash is still the hash of the whole body, but when arguments are added to [cached] (assuming my PR is eventually approved), this is how the file would change using my very first example (depending on implementation details):

{
  "build:BIN=client": {
    "hash": "..."
  },
  "build:BIN=server": {
    "hash": "..."
  }
}

Note: every justfile gets its own cache file (per working directory if --working-directory is set). Also, I haven't added a last_run field yet, and I don't know whether it's really necessary. Maybe it could be helpful for debugging, but I didn't want to confuse the user about what causes a recipe to be cached.

@qtfkwk

qtfkwk commented Jun 9, 2024

I created https://crates.io/crates/mkrs primarily for this reason (and to write the Makefile in markdown); it seems the same approach may not fit just's design. Mkrs distinguishes between file and non-file targets, and runs the recipe for a file target if -B (force processing) is given, the file does not exist, or the file is outdated (any of the target's source files are newer than the target file). Please feel free to use it as a reference. Thanks!

@al1-ce

al1-ce commented Jul 9, 2024

Might I contribute a temporary and compact solution that doesn't require any dependencies beyond what you'd usually find on a Linux machine (md5sum, head, and grep)? It works only on Linux, and possibly under MinGW:

# This is task that you'd call
build: (file "main.c") (file "other.c")
    echo "Finished compiling"

# Use this task to add hashed file dependencies
# And do actual compilation
[private]
file filepath: (track filepath) && (hash filepath)
    echo "Compiling {{filepath}}"

# Don't forget to add '.hashes' to gitignore
[private]
[no-exit-message]
track file:
    #!/usr/bin/env bash
    [ ! -f .hashes ] && touch .hashes
    [[ "$(md5sum {{file}} | head -c 32)" == "$(grep " {{file}}$" .hashes | head -c 32)" ]] && exit 1 || exit 0

[private]
hash file: (track file)
    #!/usr/bin/env bash
    echo "$(grep -v " {{file}}$" .hashes)" > .hashes && md5sum {{file}} >> .hashes
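For anyone adapting this: the core of the track/hash pair above boils down to the following standalone sketch (same md5sum/grep bookkeeping, with a made-up file; the first run reports "changed" and records the hash, a second run would report "unchanged", mirroring track's exit 1):

```shell
demo=$(mktemp -d); cd "$demo"
echo 'int main(void) { return 0; }' > main.c
touch .hashes
file=main.c

# compare the file's current md5 against the one recorded in .hashes
if [ "$(md5sum "$file" | head -c 32)" = "$(grep " $file\$" .hashes | head -c 32)" ]; then
  echo "unchanged: skip"      # what `track`'s exit 1 signals
else
  echo "changed: recompile"   # then record the new hash, as `hash` does
  grep -v " $file\$" .hashes > .hashes.tmp || true
  mv .hashes.tmp .hashes
  md5sum "$file" >> .hashes
fi
```

One caveat of the justfile version above worth knowing: when track exits 1 for an unchanged file, the whole dependency chain aborts (silently, thanks to [no-exit-message]), so unchanged inputs stop the build rather than merely skipping one step.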
