How to handle dragonfly_pipeline_queue_length that hangs forever (pipeline hangs)?
#3997
-
Problem
(below is Python code, but I think it is irrelevant here, as the problem is related only to the DragonflyDB internal pipeline execution)

pipe = redis_client.pipeline()
key_count = 65536
for i in count():
    if i == key_count:
        break
    pipe.hset(hash_name, mapping={str(i): str(j) for j in range(10_000)})
pipe.execute()

Running pipe.execute() results in … Later in the log I have …
Versions

env:
  - name: DFLY_cache_mode
    value: "false"
  - name: DFLY_enable_heartbeat_eviction
    value: "false"
  - name: DFLY_dbnum
    value: "1"
  - name: DFLY_proactor_threads
    value: "2"
  - name: DFLY_dbfilename
    value: dump
  - name: DFLY_maxmemory
    value: "5100273664"
  - name: DFLY_logtostdout
    value: "true"
  - name: DFLY_aclfile
    value: /dragonfly/snapshots/acl.file
  - name: HEALTHCHECK_PORT
    value: "9999"
image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.24.0
Question
What should I do if a pipeline hangs forever inside DragonflyDB after I call pipe.execute()?

Update:
A smaller pipeline worked. My assumption now is that DragonflyDB silently hangs forever on a pipeline that is too large. How can it abort the pipeline and raise an error to the client?

Related
I searched for "pipeline", "dragonfly_pipeline_queue_length", and "Some commands are still being dispatched", but found no related open issues or discussions.
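As a client-side mitigation only (a minimal sketch assuming redis-py; the host, port, timeout, and batch size are placeholder assumptions, and this does not change DragonflyDB's behaviour), a socket timeout makes a stalled execute() raise an error instead of blocking forever, and splitting the work into smaller pipelines keeps each batch well below the size that hangs:

import redis

# socket_timeout (seconds) turns an indefinite stall into redis.exceptions.TimeoutError.
# Host, port, timeout and batch size below are placeholders, not recommended values.
redis_client = redis.Redis(host="localhost", port=6379, socket_timeout=30)

hash_name = "big-hash"   # placeholder key name
key_count = 65536
batch_size = 1024        # arbitrary; chosen because "a smaller pipeline worked"

pipe = redis_client.pipeline()
for i in range(key_count):
    pipe.hset(hash_name, mapping={str(i): str(j) for j in range(10_000)})
    if (i + 1) % batch_size == 0:
        # Drain replies before queuing more work; redis-py resets the
        # pipeline after execute(), so the same object can be reused.
        pipe.execute()
pipe.execute()  # flush any final partial batch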
-
Can you please provide a minimal reproducible example? This snippet is not clear to me. Are you saying it happens all the time?
-
Thanks again for putting in the effort to present the problem in the clearest way, and even exploring whether this is a regression and when it appeared 🍻
I have not yet explained why there is a regression. The PR that caused it is #3152
Before that, Dragonfly avoided the deadlock scenario above by reading all the input data from the socket into its memory buffers. Once it did that, the Python client could proceed with consuming the replies and the deadlock did not happen.
So Dragonfly simply read an unbounded number of requests - a weakness that could potentially lead to OOM.
This PR introduced limits to that: Dragonfly stopped reading requests if it had more than K bytes in pipeline buffers per IO thread.
pipeline_buffer_limit is the flag that controls that, and I just confirmed that docker run --network=host docker.dragonflydb.io/dragonflydb/dragon…
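Assuming the same DFLY_<flag_name> environment-variable convention shown in the Versions section above also applies to this flag, the limit could presumably be raised in the deployment config along the lines below; the variable name and value are illustrative assumptions, and the accepted value format should be checked against dragonfly --help:

  - name: DFLY_pipeline_buffer_limit
    value: "134217728"   # assumed byte count (128 MiB); verify the expected format first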