Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors that we should handle more gracefully #149

Open
trevorcampbell opened this issue Oct 8, 2021 · 5 comments
Open

Errors that we should handle more gracefully #149

trevorcampbell opened this issue Oct 8, 2021 · 5 comments

Comments

@trevorcampbell
Copy link
Contributor

The current vsn of rudaux (prefect_by_assignment branch, which will soon be on master) had the following error happen:

[2021-10-07 23:21:06+0000] INFO - prefect.TaskRunner | FAIL signal raised: FAIL('Docker error autograding submission dsci100-001-num: tutorial_04-num : 
Student: exited with status exited,  
[AutogradeApp | INFO] Creating/updating student with ID \'stuentx\': {}\n
[AutogradeApp | INFO] Overwriting files with master versions from the source directory\n
[AutogradeApp | INFO] Copying /home/jupyter/source/./tutorial_04/mcdonalds.jpg -> /home/jupyter/autograded/studentx/tutorial_04/mcdonalds.jpg\n
[AutogradeAp | INFO] Copying /home/jupyter/source/./tutorial_04/tests_tutorial_04.R -> /home/jupyter/autograded/studentx/tutorial_04/tests_tutorial_04.R\n
[AutogradeApp | INFO] Copying /home/jupyter/source/./tutorial_04/data/fast_food.csv -> /home/jupyter/autograded/studentx/tutorial_04/data/fast_food.csv\n
[AutogradeApp | INFO] Copying /home/jupyter/source/./tutorial_04/data/insurance.csv -> /home/jupyter/autograded/studentx/tutorial_04/data/insurance.csv\n
[AutogradeApp | INFO] Copying /home/jupyter/source/./tutorial_04/clea$up_tutorial_04.R -> /home/jupyter/autograded/studentx/tutorial_04/cleanup_tutorial_04.R\n
[AutogradeApp | INFO] Sanitizing /home/jupyter/submitted/studentx/tutorial_04/tutorial_04.ipynb\n
[AutogradeApp | INFO] Converting notebook /home/jupyter/submitted/stu$entx/tutorial_04/tutorial_04.ipynb\n
[AutogradeApp | WARNING] Attribute \'checksum\' for cell cell-a5dc720f9e5813bd has changed! (should be: ad910ac09ba62c22f1e065b68b22ee7c, got: 0616a476231c2f9c184924467c066b11)\n
[AutogradeApp | INFO] Writing 43686 bytes to /h$me/jupyter/autograded/studentx/tutorial_04/tutorial_04.ipynb\n
[AutogradeApp | INFO] Autograding /home/jupyter/autograded/studentx/tutorial_04/tutorial_04.ipynb\n
[AutogradeApp | INFO] Converting notebook /home/jupyter/autograded/studentx/tutorial_04/tutorial_04.ipynb\nR_zmq_bind errno: 98 strerror: Address already in use\n
[AutogradeApp | INFO] Executing notebook with kernel: ir\nWarning message:\nIn zmq.bind(sockets$iopub, url_with_port("iopub_port")) :\n  zmq.bind fails, tcp://127.0.0.1:54176\n
[AutogradeApp | ERROR] There was an error processing assignment: /home/jupyter/submitted/studentx/tutorial_04\n
[AutogradeApp | ERROR] Traceback (most recent call last):\n      File "/opt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py", line 589, in run_cell\n
        msg = self.kc.iopub_channel.get_msg(timeout=timeout)\n      File "/opt/conda/lib/python3.8/site-packages/jupyter_client/blocking/channels.py", line 54, in get_msg\n        raise Empty\n    _queue.Empty\n    \n    During handling of the above exception, another
 exception occurred:\n    \n    Traceback (most recent call last):\n      File "/opt/conda/lib/python3.8/site-packages/nbgrader/converters/base.py", line 336, in convert_notebooks\n        self.convert_single_notebook(notebook_filename)\n      File "/opt/conda/lib/pyt
hon3.8/site-packages/nbgrader/converters/autograde.py", line 195, in convert_single_notebook\n        super(Autograde, self).convert_single_notebook(notebook_filename)\n      File "/opt/conda/lib/python3.8/site-packages/nbgrader/converters/base.py", line 292, in conve
rt_single_notebook\n        output, resources = self.exporter.from_filename(notebook_filename, resources=resources)\n      File "/opt/conda/lib/python3.8/site-packages/nbconvert/exporters/exporter.py", line 179, in from_filename\n        return self.from_file(f, resou
rces=resources, **kw)\n      File "/opt/conda/lib/python3.8/site-packages/nbconvert/exporters/exporter.py", line 197, in from_file\n        return self.from_notebook_node(nbformat.read(file_stream, as_version=4), resources=resources, **kw)\n      File "/opt/conda/lib/
python3.8/site-packages/nbconvert/exporters/notebook.py", line 32, in from_notebook_node\n        nb_copy, resources = super(NotebookExporter, self).from_notebook_node(nb, resources, **kw)\n      File "/opt/conda/lib/python3.8/site-packages/nbconvert/exporters/exporte
r.py", line 139, in from_notebook_node\n        nb_copy, resources = self._preprocess(nb_copy, resources)\n      File "/opt/conda/lib/python3.8/site-packages/nbconvert/exporters/exporter.py", line 316, in _preprocess\n        nbc, resc = preprocessor(nbc, resc)\n     
 File "/opt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/base.py", line 47, in __call__\n        return self.preprocess(nb, resources)\n      File "/opt/conda/lib/python3.8/site-packages/nbgrader/preprocessors/execute.py", line 41, in preprocess\n        
output = super(Execute, self).preprocess(nb, resources)\n      File "/opt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py", line 405, in preprocess\n        nb, resources = super(ExecutePreprocessor, self).preprocess(nb, resources)\n      File "/o
pt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/base.py", line 69, in preprocess\n        nb.cells[index], resources = self.preprocess_cell(cell, resources, index)\n      File "/opt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py", li
ne 438, in preprocess_cell\n        reply, outputs = self.run_cell(cell, cell_index, store_history)\n      File "/opt/conda/lib/python3.8/site-packages/nbconvert/preprocessors/execute.py", line 597, in run_cell\n        raise TimeoutError("Timeout waiting for IOPub ou
tput")\n    TimeoutError: Timeout waiting for IOPub output\n    \n
[AutogradeApp | WARNING] Removing failed assignment: /home/jupyter/autograded/studentx/tutorial_04\n
[AutogradeApp | ERROR] There was an error processing assignment \'tutorial_04\' for student \'st
udentx\'\n
[AutogradeApp | ERROR] Please see the the above traceback for details on the specific errors on the above failures.\n',)
[2021-10-07 23:21:06+0000] INFO - prefect.TaskRunner | Task 'autograde[15]': Finished task run for task with final state: 'Failed'

Rudaux currently just halts processing that assignment. Probably should do something a bit more graceful there. I think in the case of this particular error, a retry or two is warranted (when I reran the flow afterwards, they were graded properly).

@trevorcampbell trevorcampbell changed the title Error that we should handle more gracefully Errors that we should handle more gracefully Oct 14, 2021
@trevorcampbell
Copy link
Contributor Author

Another one:

[2021-10-14 15:02:17+0000] INFO - dsci100-004-snap - prefect.TaskRunner | Task 'get_assignments': Finished task run for task with final state: 'Success'
[2021-10-14 15:02:17+0000] INFO - dsci100-004-snap - prefect.TaskRunner | Task 'get_existing_snapshots': Starting task run...
[2021-10-14 15:02:17+0000] INFO - dsci100-004-snap - prefect.get_existing_snapshots | Opening ssh connection to STUDENT_URL
[2021-10-14 15:02:18+0000] ERROR - dsci100-004-snap - prefect.TaskRunner | Task 'get_existing_snapshots': Exception encountered during task execution!
Traceback (most recent call last):
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/transport.py", line 2211, in _check_banner
    buf = self.packetizer.readline(timeout)
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/packet.py", line 380, in readline
    buf += self._read_timeout(timeout)
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/packet.py", line 609, in _read_timeout
    raise EOFError()
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/prefect/engine/task_runner.py", line 863, in get_task_run_state
    logger=self.logger,
  File "/usr/local/lib/python3.6/site-packages/prefect/utilities/executors.py", line 445, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "/usr/local/lib/python3.6/site-packages/rudaux/snapshot.py", line 147, in get_existing_snapshots
    existing_snaps = _ssh_list_snapshot_names(config, course_id)
  File "/usr/local/lib/python3.6/site-packages/rudaux/snapshot.py", line 104, in _ssh_list_snapshot_names
    client = _ssh_open(config, course_id)
  File "/usr/local/lib/python3.6/site-packages/rudaux/snapshot.py", line 41, in _ssh_open
    client.connect(stu_ssh['hostname'], stu_ssh['port'], stu_ssh['user'], allow_agent=True)
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/client.py", line 406, in connect
    t.start_client(timeout=timeout)
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/transport.py", line 660, in start_client
    raise e
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/transport.py", line 2039, in run
    self._check_banner()
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/transport.py", line 2216, in _check_banner
    "Error reading SSH protocol banner" + str(e)
paramiko.ssh_exception.SSHException: Error reading SSH protocol banner
[2021-10-14 15:02:18+0000] INFO - dsci100-004-snap - prefect.TaskRunner | Task 'get_existing_snapshots': Finished task run for task with final state: 'Failed'

Can probably just wait and retry connection for things like that

@trevorcampbell
Copy link
Contributor Author

trevorcampbell commented Oct 15, 2021

EDIT: this one has been handled.

When a student hasn't clicked any of the worksheet links (so they don't have a home folder) and they have a special extension (so rudaux tries to snapshot their dataset individually):

[2021-10-15 16:17:27+0000] INFO - dsci100-001-snap - prefect.TaskRunner | Task 'take_snapshot[55]': Starting task run...
[2021-10-15 16:17:27+0000] INFO - dsci100-001-snap - prefect.TaskRunner | SKIP signal raised: SKIP('Snapshot dsci100-001-worksheet_03-number has already been taken; skipping',)
[2021-10-15 16:17:27+0000] INFO - dsci100-001-snap - prefect.TaskRunner | Task 'take_snapshot[55]': Finished task run for task with final state: 'Skipped'
[2021-10-15 16:17:27+0000] INFO - dsci100-001-snap - prefect.TaskRunner | Task 'take_snapshot[56]': Starting task run...
[2021-10-15 16:17:27+0000] INFO - dsci100-001-snap - prefect.take_snapshot[56] | Snapshot dsci100-001-worksheet_03-number does not exist and deadline 2021-10-15T06:59:59+00:00 has passed; taking & verifying snapshot.
[2021-10-15 16:17:27+0000] INFO - dsci100-001-snap - prefect.take_snapshot[56] | Opening ssh connection to dsci-100-student.stat.ubc.ca
[2021-10-15 16:17:28+0000] INFO - dsci100-001-snap - prefect.take_snapshot[56] | Taking snapshot
[2021-10-15 16:17:28+0000] INFO - dsci100-001-snap - prefect.take_snapshot[56] | Running ssh command /usr/sbin/zfs snapshot -r tank/home/dsci100/number@dsci100-001-worksheet_03-number
[2021-10-15 16:17:28+0000] INFO - dsci100-001-snap - prefect.take_snapshot[56] | Command exit codes: out = 2  err = 2
[2021-10-15 16:17:28+0000] INFO - dsci100-001-snap - prefect.TaskRunner | FAIL signal raised: FAIL('Paramiko SSH command error: nonzero status.\nstderr\n["cannot open \'tank/home/dsci100/number\': dataset does not exist\\n", \'usage:\\n\', \'\\tsnapshot [-r] [-o property=value] ... <filesystem|volume>@<snap> ...\\n\', \'\\n\', \'For the property list, run: zfs set|get\\n\', \'\\n\', \'For the delegated permission list, run: zfs allow|unallow\\n\']\nstdout\n[]',)

@trevorcampbell
Copy link
Contributor Author

FAIL signal raised: FAIL("Docker error autograding submission dsci100-001-num : worksheet_06-num : name-num: exited with status exited,  
[AutogradeApp | INFO] Creating/updating student with ID 'student-num': {}\n
[AutogradeApp | INFO] Overwriting files with master versions from the source directory\n
[AutogradeApp | INFO] Copying /home/jupyter/source/./worksheet_06/data/clean-wdbc-data.csv -> /home/jupyter/autograded/student-num/worksheet_06/data/clean-wdbc-data.csv\n
[AutogradeApp | INFO] Copying /home/jupyter/source/./workshee
t_06/tests_worksheet_06.R -> /home/jupyter/autograded/student-num/worksheet_06/tests_worksheet_06.R\n
[AutogradeApp | INFO] Copying /home/jupyter/source/./worksheet_06/cleanup_worksheet_06.R -> /home/j
upyter/autograded/student-num/worksheet_06/cleanup_worksheet_06.R\n
[AutogradeApp | INFO] Sanitizing /home/jupyter/submitted/student-num/worksheet_06/worksheet_06.ipynb\n
[AutogradeApp | INFO] Conver
ting notebook /home/jupyter/submitted/student-num/worksheet_06/worksheet_06.ipynb\n
[AutogradeApp | INFO] Writing 72048 bytes to /home/jupyter/autograded/student-num/worksheet_06/worksheet_06.ipynb\
n
[AutogradeApp | INFO] Autograding /home/jupyter/autograded/student-num/worksheet_06/worksheet_06.ipynb\n
[AutogradeApp | INFO] Converting notebook /home/jupyter/autograded/student-num/worksheet_06/
worksheet_06.ipynb\n
[AutogradeApp | INFO] Executing notebook with kernel: ir\nError: C stack usage  7971252 is too close to the limit\nExecution halted\n
[AutogradeApp | ERROR] Kernel died while waiting f
or execute reply.\n
[AutogradeApp | ERROR] While processing assignment /home/jupyter/submitted/student-num/worksheet_06, the kernel became unresponsive and we could not interrupt it. This probably mean
s that the students' code has an infinite loop that consumes a lot of memory or something similar. nbgrader doesn't know how to deal with this problem, so you will have to manually edit the students' cod
e (for example, to just throw an error rather than enter an infinite loop). \n
[AutogradeApp | WARNING] Removing failed assignment: /home/jupyter/autograded/student-num/worksheet_06\n
[AutogradeApp | ER
ROR] There was an error processing assignment 'worksheet_06' for student 'student-num'\n
[AutogradeApp | ERROR] Please see the the above traceback for details on the specific errors on the above failur
es.\n",)

@trevorcampbell
Copy link
Contributor Author

trevorcampbell commented Dec 2, 2021

[2021-12-02 06:32:51+0000] INFO - dsci100-004-snap - prefect.get_assignments | GET request to URL: https://canvas.ubc.ca/api/v1/courses/xxxx/assignments/yyyy/overrides
[2021-12-02 06:32:51+0000] INFO - dsci100-001-snap - prefect.get_assignments | GET request to URL: https://canvas.ubc.ca/api/v1/courses/xxxx/assignments/yyyy/overrides
[2021-12-02 06:32:51+0000] INFO - dsci100-004-snap - prefect.get_assignments | Retrieved 21 assignments from LMS for dsci100-004
[2021-12-02 06:32:52+0000] INFO - dsci100-004-snap - prefect.TaskRunner | Task 'get_assignments': Finished task run for task with final state: 'Success'
[2021-12-02 06:32:52+0000] INFO - dsci100-004-snap - prefect.TaskRunner | Task 'get_all_snapshots': Starting task run...
[2021-12-02 06:32:52+0000] INFO - dsci100-004-snap - prefect.TaskRunner | Task 'get_all_snapshots': Finished task run for task with final state: 'Success'
[2021-12-02 06:32:52+0000] INFO - dsci100-004-snap - prefect.TaskRunner | Task 'get_existing_snapshots': Starting task run...
[2021-12-02 06:32:52+0000] INFO - dsci100-004-snap - prefect.get_existing_snapshots | Opening ssh connection to student-url.com
[2021-12-02 06:32:52+0000] ERROR - dsci100-004-snap - prefect.TaskRunner | Task 'get_existing_snapshots': Exception encountered during task execution!
Traceback (most recent call last):
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/transport.py", line 2211, in _check_banner
    buf = self.packetizer.readline(timeout)
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/packet.py", line 380, in readline
    buf += self._read_timeout(timeout)
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/packet.py", line 609, in _read_timeout
    raise EOFError()
EOFError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/rudaux/snapshot.py", line 110, in _ssh_list_snapshot_names
    client = _ssh_open(config, course_id)
  File "/usr/local/lib/python3.6/site-packages/rudaux/snapshot.py", line 41, in _ssh_open
    client.connect(stu_ssh['hostname'], stu_ssh['port'], stu_ssh['user'], allow_agent=True)
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/client.py", line 406, in connect
    t.start_client(timeout=timeout)
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/transport.py", line 660, in start_client
    raise e
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/transport.py", line 2039, in run
    self._check_banner()
  File "/home/stty2u/.local/lib/python3.6/site-packages/paramiko/transport.py", line 2216, in _check_banner
    "Error reading SSH protocol banner" + str(e)
paramiko.ssh_exception.SSHException: Error reading SSH protocol banner

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/prefect/engine/task_runner.py", line 863, in get_task_run_state
    logger=self.logger,
  File "/usr/local/lib/python3.6/site-packages/prefect/utilities/executors.py", line 445, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "/usr/local/lib/python3.6/site-packages/rudaux/snapshot.py", line 153, in get_existing_snapshots
    existing_snaps = _ssh_list_snapshot_names(config, course_id)
  File "/usr/local/lib/python3.6/site-packages/rudaux/snapshot.py", line 116, in _ssh_list_snapshot_names
    client.close()
UnboundLocalError: local variable 'client' referenced before assignment
[2021-12-02 06:32:52+0000] INFO - dsci100-004-snap - prefect.TaskRunner | Task 'get_existing_snapshots': Finished task run for task with final state: 'Failed'

@trevorcampbell
Copy link
Contributor Author

trevorcampbell commented Dec 21, 2021

Dec 21 of the Fall 2021 semester, after final exam -- For some reason some of my students are appearing with registration dates of Dec 20. Rudaux then tries to give them an extension to Dec 27.

moved this into its own issue #153 since the issues in this thread are mostly "nice to fix but not critical", but 153 I would consider critical.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant