
listjobs.json: args in running jobs and finished jobs #529

Open
jpmckinney opened this issue Jul 26, 2024 · 0 comments

Comments

@jpmckinney
Contributor

jpmckinney commented Jul 26, 2024

#205 added the user-submitted args and settings to the listjobs.json response for pending jobs. We can consider doing the same for other jobs. I had started work on this, but noticed a few things to consider.

Running (and finished) jobs differ from pending jobs in that their args look like:

["/usr/bin/python", "-m", "scrapyd.runner", "crawl", "s2", "-s", "DOWNLOAD_DELAY=2", "-a", "arg1=val1"]
  • This is specific to the implementation of the default Launcher service. Other configurations might not have the same format. We need to be careful about hardcoding behavior that is specific to the default Launcher.
  • It contains implementation-specific details that the user never submitted, like the Python path. This information can be hard for users to read or use, since it doesn't match what they submitted.
  • If we add args, we'll need to implement a migration in SqliteJobStorage, similar to the one in 1a0cb2b#diff-40a7dd64b23747429cf84c808a761ad8185bd2a1b96400a512800b7bb0ae6f8fR145-R152
    def ensure_insert_time_column(self):
        q = "SELECT sql FROM sqlite_master WHERE type='table' AND name='%s'" % self.table
        if 'insert_time TIMESTAMP' not in self.conn.execute(q).fetchone()[0]:
            q = "ALTER TABLE %s ADD COLUMN insert_time TIMESTAMP" % self.table
            self.conn.execute(q)
            q = "UPDATE %s SET insert_time=CURRENT_TIMESTAMP" % self.table
            self.conn.execute(q)
            self.conn.commit()
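If an args column were added, the migration could follow the same pattern as the quoted code. A minimal sketch, assuming a JSON-encoded TEXT column; the column name, table name, and encoding are assumptions, not Scrapyd's actual schema:

```python
import json
import sqlite3


def ensure_args_column(conn, table="jobs"):
    """Add a TEXT column for JSON-encoded args if it is missing.

    Sketch only: the ``args`` column name, ``jobs`` table name, and JSON
    encoding are illustrative assumptions.
    """
    q = "SELECT sql FROM sqlite_master WHERE type='table' AND name=?"
    schema = conn.execute(q, (table,)).fetchone()[0]
    if "args TEXT" not in schema:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN args TEXT")
        # Backfill existing rows with an empty list: the originally
        # submitted args were never stored, so they cannot be recovered.
        conn.execute(f"UPDATE {table} SET args = ?", (json.dumps([]),))
        conn.commit()
```

Like ensure_insert_time_column above, this checks the table's schema in sqlite_master before altering, so calling it repeatedly on startup is harmless.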

Opening issue for discussion.
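One option worth discussing is stripping the launcher-specific prefix back out before publishing. A minimal sketch, assuming the default Launcher's command layout shown above (it would misread any customized launcher, which is part of the concern raised in this issue):

```python
def split_runner_args(cmd):
    """Recover user-submitted settings (-s) and spider args (-a) from a
    default-Launcher command line.

    Sketch only: assumes the ["python", "-m", "scrapyd.runner", "crawl",
    spider, ...] layout of the default Launcher; other Launcher
    implementations may format the command differently.
    """
    settings, args = {}, {}
    tokens = iter(cmd)
    for token in tokens:
        if token in ("-s", "-a"):
            # Each flag is followed by a single KEY=VALUE token.
            key, _, value = next(tokens).partition("=")
            (settings if token == "-s" else args)[key] = value
    return settings, args
```

For the example above, this yields {"DOWNLOAD_DELAY": "2"} as settings and {"arg1": "val1"} as spider args, which matches what the user originally submitted.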


Note: Running (and finished) jobs also store an env, but this should not be published via the API, because Scrapyd adds the main process' environment variables, which might contain secrets (e.g. an admin might deploy Scrapyd with secrets in environment variables that spiders need in order to log in to remote services, and Scrapyd users might not otherwise have access to those).
