Skip to content

Vectorization #640

Open
Open
@certik

Description

@certik

We need to implement ASR->ASR passes that implement vectorization. It seems we write a pragma (comment) which will direct the optimizer to apply a given transformation and specify exactly which one and which parameters for the given loop. Much later we can think about creating some heuristic optimizer (possibly profile driven). In this issue we simply have to implement the mechanism and implement all options. So the design space is relatively limited. The hard part is to figure out how to best represent the vector operations in ASR.

Examples of loops:

Array copy

a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
# Equivalently: A[:] = B[:]
for i in range(n):
    A[i] = B[i]

->

a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
for i in range(0,n,1024):
    A[i*1024:(i+1)*1024] = B[i*1024:(i+1)*1024]
# plus a remainder loop

Possibly this can be transformed into:

a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
for i in range(0,n,1024):
    vector_copy(1024, A[i*1024:(i+1)*1024], B[i*1024:(i+1)*1024])
# plus a remainder loop

This all happens at the ASR level. Then the backend very straighforwardly transforms vector_copy(n, A, B) with n=1024 (or other well defined vector lengths) into a loop over AVX-512 or ARM vector instructions. Or generates LLVM code that LLVM itself knows how to efficiently vectorize.

Matrix-vector multiply

for i in range(n):
    y[i] = 0
    for j in range(n):
        y[i] += A[i,j] * x[j]

We can vectorize the inner loop:

for i in range(n):
    y[i] = 0
    for j in range(0,n,1024):
        y[i] += A[i,j*1024:(j+1)*1024] * x[j*1024:(j+1)*1024]

Or the outer loop:

for i in range(0,n,1024):
    y[i*1024:(i+1)*1024] = 0
    for j in range(n):
        y[i*1024:(i+1)*1024] += A[i*1024:(i+1)*1024,j] * x[j]

We can also permute the loop order, such as:

y[:] = 0
for j in range(n):
    for i in range(n):
        y[i] += A[i,j] * x[j]

And again we can vectorize the inner and outer loop.

So there are at least 4 possibilities and we have to implement all 4 and allow the user to select using a pragma. There might be more options (I can't remember, but one might perhaps unroll the outer loop but vectorize the inner loop, or something like that).

For the remainder loop there are again at least two options: either iterate exactly over the correct length and not vectorize, or pad it to the nearest vector length and also vectorize.

Metadata

Metadata

Assignees

Labels

asrASR related changesasr_passASR pass related changesoptimizationASR optimization passes

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions