Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/920: Implemented 2D convolution #1007

Draft
wants to merge 69 commits into
base: main
Choose a base branch
from

Conversation

shahpratham
Copy link
Collaborator

Description

Implemented 2D convolution with distributed kernel support.

Issue resolved: #920

skip ci

@ghost
Copy link

ghost commented Aug 18, 2022

👇 Click on the image for a new way to code review

Review these changes using an interactive CodeSee Map

Legend

CodeSee Map legend

@shahpratham shahpratham changed the title Implemented 2D covolution Feature/920: Implemented 2D covolution Aug 18, 2022
@shahpratham shahpratham marked this pull request as ready for review September 7, 2022 07:40
@ClaudiaComito ClaudiaComito changed the title Feature/920: Implemented 2D covolution Feature/920: Implemented 2D convolution Sep 9, 2022
Comment on lines 289 to 290
if a.shape[0] < v.shape[0] or a.shape[1] < v.shape[1]:
raise ValueError("Filter size must not be greater than the signal size")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still valid?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, when a is of dimension (4, 15) and v is of dimension (5, 10)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a more precise Error message: "Filter size must not be larger in one dimension and smaller in the other" as the user might not be aware of the swapping in case of the filter size beeing larger in both dimensions

@shahpratham shahpratham force-pushed the feature920/distributed-2D-convolution branch from 34551a7 to 8ccdf3e Compare September 11, 2022 19:02
@ClaudiaComito
Copy link
Contributor

ClaudiaComito commented Mar 27, 2024

There seems to be a new problem in the CI:

from . import pypocketfft as pfft
E   ImportError: /opt/conda/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /opt/conda/lib/python3.10/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-310-x86_64-linux-gnu.so)

The problem is discussed here among many other places: https://stackoverflow.com/questions/72540359/glibcxx-3-4-30-not-found-for-librosa-in-conda-virtual-environment-after-tryin
I'm not sure what the best way to handle this is, within our CI images.

It's the cuda one only. Conda installs some older packages and outputs some warning there: WARNING conda.models.version:get_matcher(556): Using .* with relational operator is superfluous and deprecated and will be removed in a future version of conda. Your spec was 1.7.1., but conda is ignoring the . and treating it as 1.7.1

I don't understand @mtar . What spec?

Thanks for looking into this btw.

@mtar
Copy link
Collaborator

mtar commented Mar 27, 2024

The issue happens only on the Cuda Runner, likely because of outdated libraries installed by conda. The failure on the amd runner is caused by a device mismatch in ht.convolve[2d] instead.

Copy link
Contributor

github-actions bot commented Apr 3, 2024

Thank you for the PR!

Copy link
Contributor

Thank you for the PR!

@mtar mtar added PR talk To be discussed Requires discussion in project meeting first labels Apr 12, 2024
@mtar
Copy link
Collaborator

mtar commented Apr 12, 2024

@ClaudiaComito
Tests on cuda runner working; now same errors as on amd runner

@ClaudiaComito ClaudiaComito modified the milestones: 1.4.0, 1.5.0 Apr 12, 2024
@ClaudiaComito ClaudiaComito removed PR talk signal To be discussed Requires discussion in project meeting first labels Apr 12, 2024
Copy link
Contributor

This pull request is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Jun 17, 2024
Copy link
Contributor

Thank you for the PR!

@ClaudiaComito
Copy link
Contributor

I cannot reproduce the CUDA problems locally, but even on CPU I run into different problems with mismatching split axes (and local shapes):

======================================================================
======================================================================
ERROR: test_convolve2d (heat.core.tests.test_signal.TestSignal)
ERROR: test_convolve2d (heat.core.tests.test_signal.TestSignal)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/c.comito/devel/local/heat/heat/core/tests/test_signal.py", line 226, in test_convolve2d
    conv = ht.convolve2d(signal, dis_kernel_odd, mode=mode)
  File "/Users/c.comito/devel/local/heat/heat/core/signal.py", line 635, in convolve2d
    signal_filtered += global_signal_filtered[start_idx : start_idx + gshape[0]]
  File "/Users/c.comito/devel/local/heat/heat/core/arithmetics.py", line 188, in add_
    return _operations.__binary_op(wrap_add_, t1, t2, out=t1)
  File "/Users/c.comito/devel/local/heat/heat/core/_operations.py", line 197, in __binary_op
    sanitation.sanitize_out(out, output_shape, output_split, output_device, output_comm)
  File "/Users/c.comito/devel/local/heat/heat/core/sanitation.py", line 300, in sanitize_out
    raise ValueError(f"Expecting output buffer of split {output_split}, got {out.split}")
ValueError: Expecting output buffer of split 0, got None

(tests run on 2 processes).

@ClaudiaComito ClaudiaComito removed this from the 1.5.0 milestone Aug 15, 2024
@ClaudiaComito
Copy link
Contributor

I'm setting this PR to draft and marking it for next Monday's discussion. My thoughts so far (fed from discussions with @mrfh92 in the past):

  • code still needs significant debugging
  • fully distributed (signal and kernel) scaling behaviour unclear
  • fully distributed via FFT is being implemented (Distributed Convolution via FFT #1291 )

Options for this PR:

  1. fix bugs and merge independent of scaling behaviour (needs work)
  2. throw out distributed-kernel implementation, keep distributed-signal option only (needs work but will cover most use cases)
  3. do not merge
  4. ...??

What do you think @mrfh92 @krajsek @mtar ?

@ClaudiaComito ClaudiaComito marked this pull request as draft August 15, 2024 10:54
@github-actions github-actions bot removed the stale label Aug 19, 2024
@ClaudiaComito ClaudiaComito added this to the 1.6 milestone Aug 26, 2024
Copy link
Contributor

This pull request is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement convolve2d()
6 participants