Problems on 15TB ZFS flatfs repo: mutating MFS dir with 25k entries, ipfs add hang #10588

Open
3 tasks done
ProximaNova opened this issue Nov 17, 2024 · 21 comments
Assignees: hsanjuan
Labels: kind/bug (A bug in existing code, including security flaws) · need/author-input (Needs input from the original author) · need/triage (Needs initial labeling and prioritization)

Comments


ProximaNova commented Nov 17, 2024

Checklist

Installation method

dist.ipfs.tech or ipfs-update

Version

Kubo version: 0.32.1
Repo version: 16
System version: amd64/linux
Golang version: go1.23.3

Config

{
  "API": {
    "HTTPHeaders": {
      "Access-Control-Allow-Methods": [
        "PUT",
        "POST",
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "http://10.0.0.232:5001",
        "http://localhost:3000",
        "http://127.0.0.1:5001",
        "https://webui.ipfs.io"
      ]
    }
  },
  "Addresses": {
    "API": "/ip4/10.0.0.232/tcp/5001",
    "Announce": null,
    "AppendAnnounce": null,
    "Gateway": "/ip4/10.0.0.232/tcp/8080",
    "NoAnnounce": null,
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/webrtc-direct",
      "/ip4/0.0.0.0/udp/4001/quic-v1",
      "/ip4/0.0.0.0/udp/4001/quic-v1/webtransport",
      "/ip6/::/udp/4001/webrtc-direct",
      "/ip6/::/udp/4001/quic-v1",
      "/ip6/::/udp/4001/quic-v1/webtransport"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic-v1/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
  ],
  "DNS": {
    "Resolvers": {}
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "99999999999999999999h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 908070605040302000,
    "StorageMax": "12345678910123456789GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": true
    }
  },
  "Experimental": {
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "OptimisticProvide": false,
    "OptimisticProvideJobsPoolSize": 0,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "DeserializedResponses": null,
    "DisableHTMLErrors": null,
    "ExposeRoutingAPI": null,
    "HTTPHeaders": {},
    "NoDNSLink": false,
    "NoFetch": true,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": ""
  },
  "Identity": {
    "PeerID": "12D3KooWLowyP9yZLaxzactf2WAhR6cT1nahBmeRM23zPQR3CP4T"
  },
  "Internal": {},
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Ipwb": {
    "Replay": {
      "Host": "localhost",
      "Index": "/ipfs/bafkreic44im7s6qqtzq7gh3ose7nwkmmrohgmbyhqvxsdp3w5m6j6qdgeu",
      "Port": 2016
    }
  },
  "Migration": {
    "DownloadSources": [],
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": true,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": null
  },
  "Pinning": {
    "RemoteServices": {}
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {},
  "Routing": {
    "AcceleratedDHTClient": false,
    "Methods": null,
    "Routers": null
  },
  "Swarm": {
    "AddrFilters": null,
    "ConnMgr": {},
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": false,
    "RelayClient": {},
    "RelayService": {},
    "ResourceMgr": {},
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Description

Things were working fairly fast and OK (not great, but OK); then, after a certain event a day ago, everything got much slower or stopped working. Setup: a ZFS mirror pool of two 18 TB HDDs, mostly containing IPFS data (about 15 TB of it). Things were only "OK" because pinning stopped working about a month ago; I occasionally saw an error along the lines of "cannot fix 1800 pins". A day ago I was running the following against a list of 1,105,578 IPFS CIDs (totaling 1.2 TB):
$ cat /home/u/Downloads/t5.txt | xargs -d "\n" sh -c 'for args do cid="$(echo "$args" | sed "s/ .*//g")"; fn="$(echo "$args" | sed "s/.* //g")"; date -u; ipfs files cp -p "/ipfs/$cid" "/dup/organize/4chan/mlp/$(echo "$fn" | perl -pE "s/^(....).*/\1/g")/$(echo "$fn" | perl -pE "s/^....(..).*/\1/g")/$fn"; date -u; done' _ >> /home/u/Downloads/t6.txt

What that command does: the input is many lines, each of the form "[raw-block CID] [Unix-timestamp filename]", where each file is 1 KB to 4 MB in size. The command was running in offline mode yesterday; no ipfs daemon was running. For every line it copies the file into an MFS path such as ipfs files cp -p /ipfs/[cid] /dup/organize/4chan/mlp/1419/00/1419000480902.jpg (the two directory levels come from the first four and the next two characters of the filename), and it logs the timestamp of each "ipfs files cp" to the file "t6.txt".
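
For illustration, here is roughly the same loop as a small Go program. This is only a sketch, not the command that was actually run: it assumes the same "[cid] [filename]" input format (no spaces inside filenames), the same /dup/organize/4chan/mlp/<first 4 chars>/<next 2 chars>/ layout, and it simply shells out to the same ipfs files cp -p used above.

// mfscopy.go: a rough Go rendition of the xargs loop above (sketch only).
// It reads "[cid] [filename]" lines from t5.txt and shells out to
// `ipfs files cp -p` with a destination derived from the filename prefix.
package main

import (
    "bufio"
    "fmt"
    "os"
    "os/exec"
    "strings"
    "time"
)

func main() {
    in, err := os.Open("/home/u/Downloads/t5.txt")
    if err != nil {
        panic(err)
    }
    defer in.Close()

    sc := bufio.NewScanner(in)
    for sc.Scan() {
        // Each line: "<raw-block CID> <unix-timestamp filename>"
        fields := strings.Fields(sc.Text())
        if len(fields) != 2 || len(fields[1]) < 6 {
            continue
        }
        cid, fn := fields[0], fields[1]
        // e.g. 1419000480902.jpg -> /dup/organize/4chan/mlp/1419/00/1419000480902.jpg
        dst := fmt.Sprintf("/dup/organize/4chan/mlp/%s/%s/%s", fn[:4], fn[4:6], fn)

        fmt.Println(time.Now().UTC().Format(time.RFC3339)) // plays the role of the `date -u` calls
        out, err := exec.Command("ipfs", "files", "cp", "-p", "/ipfs/"+cid, dst).CombinedOutput()
        if err != nil {
            fmt.Fprintf(os.Stderr, "cp %s -> %s failed: %v\n%s", cid, dst, err, out)
        }
    }
}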

That command run is the event which I think messed things up. It completed 25,859 MFS copy operations. Since I canceled it 24 or 48 hours ago I have had persistent problems with my IPFS setup, such as the daemon not starting or hanging (ipfs/ipfs-docs#1956; not a problem anymore). The problem I do still have: adding a small file to IPFS never finishes. The following ran for about 30 minutes and didn't exit:

$ utc; ipfs add -rH --cid-version=1 --chunker=size-1048576 /home/u/Downloads/98-gfa.gif; utc
2024-11-17T22:58:52.545136781Z
added bafybeibfcytwdefk2hmatub3ab4wvfyei34xkwqz5ubzrqwslxi3d5ehau 98-gfa.gif
 1.35 MiB / 1.35 MiB [======================================] 100.00%
^C
Error: canceled
2024-11-17T23:23:04.879763382Z
$ # hit ctrl+c

And as said above, pinning doesn't work, so ipfs --offline pin add --progress bafybeibfcytwdefk2hmatub3ab4wvfyei34xkwqz5ubzrqwslxi3d5ehau is always stuck at "Fetched/Processed 0 nodes". About the 25,859 operations before things went bad: at the start of the log file you can see that files were copied to MFS quickly, and by the end it had become much slower:

$ head -n8 /home/u/Downloads/t6.txt
Sat Nov 16 02:58:39 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
Sat Nov 16 02:58:40 AM UTC 2024
$ tail -n8 /home/u/Downloads/t6.txt
Sun Nov 17 05:28:35 AM UTC 2024
Sun Nov 17 05:28:40 AM UTC 2024
Sun Nov 17 05:28:40 AM UTC 2024
Sun Nov 17 05:28:44 AM UTC 2024
Sun Nov 17 05:28:44 AM UTC 2024
Sun Nov 17 05:28:48 AM UTC 2024
Sun Nov 17 05:28:48 AM UTC 2024
$ # slow MFS

About a week ago I saw an error about a dirty flag not being cleared. I have attached the output file of "$ ipfs diag profile" for more details. If there's something to be learned from this, I guess it's not to copy many files to MFS without the IPFS daemon running. I was trying to copy more than one million files but only copied about 25,000. I've also seen some weirdness with the "ipfs files" set of commands (copy/move) in the past.

Related issue: #10383 titled "Improve data onbaording speed: ipfs add and ipfs dag import|export" (I recommend using raw blocks instead).

@ProximaNova added the kind/bug (A bug in existing code, including security flaws) and need/triage (Needs initial labeling and prioritization) labels on Nov 17, 2024

ProximaNova commented Nov 17, 2024

GitHub said "File size too big: 25 MB are allowed, 50 MB were attempted to upload." So here's "ipfs-profile-2024-11-17-utc.zip":
https://files.catbox.moe/ic3ztm.zip or at ipfs://bafybeiarzkmarspokfjnhzxy7eeyquozca22robaazotpegjlabzlhmime

Update/edit: this really is worse than expected. The add command hanging on a small file (and sometimes OOM-killing things) is one thing, but missing data is far worse. The ipfs datastore index (or whatever tracks blocks) cannot find the following, meaning the block is either completely gone or the index got messed up:

$ ipfs --offline ls -s bafybeiftetdkn3mxxfpi5jvtcxxlzjxcpezhhu7oemspuszlqdb6fkqm2m/selenium 1>/dev/null
Error: block was not found locally (offline): ipld: could not find bafybeiayzj2qrplhxgmircbo7mn7d6imcx2tceiiwjrvvmn72cenzrv7wa
$

I had 100% of that folder in the past:
https://web.archive.org/web/20241108195911/https://trustless-gateway.link/ipfs/bafybeiftetdkn3mxxfpi5jvtcxxlzjxcpezhhu7oemspuszlqdb6fkqm2m?format=car

Timeline:
2024-11-08: had 100% of that (this is before the ZFS message about done creating a mirror pool on the next line)
2024-11-11: resilvered 15.0T in 2 days 08:20:30 with 0 errors
2024-11-16: bad event with MFS running completely offline.

The point is that no filesystem-level corruption or data loss happened other than IPFS's own. That bafybeif...qm2m folder = 321 blocks (32,722,333 bytes). It corresponds to these .data files:
[snip]
That's 322 of them. I have 78 but am missing 244 (tested by using two different repos on two different computers). So those files got deleted somehow; good thing there was a remote capture of that CAR file. I can only hope this is an isolated error and that I'm not missing anything else. What deleted them? If I run $ ipfs files stat / then I can see CumulativeSize: 3673471960804799. This 3.67 PB number is not really true; it is just the non-deduplicated size, and it got that large because I occasionally captured the CID of the MFS root and put it back into an MFS folder. This matters because StorageGCWatermark=908070605040302000 and StorageMax=12345678910123456789GB: that StorageMax (roughly twelve million zettabytes) is far larger than 3.67 PB, so garbage collection (gc) should not have triggered.

I didn't run any gc command, and as far as I know, no gc was run in the background. There's no point in running mirrored HDDs if the data gets deleted by some software (RAID is not a backup). This is really annoying and frustrating.

@lidel changed the title from "ipfs suddenly became very slow or nonfunctional (Kubo 0.32.1)" to "ipfs suddenly became very slow or nonfunctional (15TB ZFS + MFS)" on Nov 19, 2024
@lidel changed the title from "ipfs suddenly became very slow or nonfunctional (15TB ZFS + MFS)" to "Various problems with 15TB flatfs/ZFS repo (mutating MFS dir with 25k entries, ipfs add hangs)" on Nov 19, 2024
@lidel changed the title from "Various problems with 15TB flatfs/ZFS repo (mutating MFS dir with 25k entries, ipfs add hangs)" to "Problems on 15TB ZFS flatfs repo: mutating MFS dir with 25k entries, ipfs add hang" on Nov 19, 2024

gammazero commented Nov 19, 2024

Note: It is not that the 244 out of 322 blocks were deleted. Adding files to MFS does not necessarily download all the data, but rather creates references to the data. Did you explicitly preload the data after ipfs files cp using either ipfs refs -r, ipfs dag stat, or ipfs pin add?

It'd be helpful to get profiles during the bad events such as:

  • While getting close to OOMing
  • While it's hanging trying to add a small file

@gammazero added the need/author-input (Needs input from the original author) label on Nov 19, 2024
@hsanjuan (Contributor)

I think your leveldb datastore blew up.

Does add work with --pin=false ?

leveldb is used to store the pinset and the mfs root, probably it is hanging there.

That doesn't explain your "data missing" part, as the blocks are stored in flatfs (the blocks folder). Please convert your CID to a file name with https://go.dev/play/p/WOuGq_mJWbU, then look whether somewhere in blocks there is a file named like that (ending in .data). Afaik the only way IPFS would have removed it is GC'ing, but as you said, that shouldn't have triggered.
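
For reference, here is a minimal Go sketch of that CID-to-filename conversion (my own sketch, not the playground code itself). It assumes the default flatfs layout shown in the config above: the block's file name is the CID's multihash encoded as unpadded uppercase base32, and the "next-to-last/2" shard function puts it in a directory named after the two characters before the last one. For sha2-256-based CIDs this yields the familiar CIQ....data names.

// cid2flatfs.go: map a CID to its expected flatfs block file, assuming the
// "/repo/flatfs/shard/v1/next-to-last/2" shard function from the config above.
// Requires the go-cid module (go get github.com/ipfs/go-cid).
package main

import (
    "encoding/base32"
    "fmt"
    "os"

    "github.com/ipfs/go-cid"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: cid2flatfs <cid>")
        os.Exit(1)
    }
    c, err := cid.Decode(os.Args[1])
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    // flatfs keys are the multihash (not the whole CID) in unpadded uppercase base32.
    name := base32.StdEncoding.WithPadding(base32.NoPadding).EncodeToString(c.Hash())
    // next-to-last/2: the shard directory is the two characters before the last one.
    shard := name[len(name)-3 : len(name)-1]
    fmt.Printf("$IPFS_PATH/blocks/%s/%s.data\n", shard, name)
}

One consequence of keying on the multihash: two CIDs wrapping the same multihash (for example a CIDv0 and its CIDv1 form) map to the same .data file.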


ProximaNova commented Nov 24, 2024

@gammazero

Note: It is not that the 244 out of 322 blocks were deleted. Adding files to MFS does not necessarily download all the data, but rather creates references to the data. Did you explicitly preload the data after ipfs files cp using either ipfs refs -r, ipfs dag stat, or ipfs pin add?

You are referring to the 32 MB folder I wrote about: I probably didn't use MFS at all with that one (if I did, I only copied its root into MFS after $ ipfs add [folder name] finished). I have a manifest text file in that folder which shows all the paths as "/mnt/whatever/path/...". That means I had all of the files locally and added them via $ ipfs add -rH --cid-version=1 --chunker=size-1048576 . ; it is not a case of "I added this Wikipedia CID to MFS and I only have the small parts of it that I downloaded, now that I'm offline". I almost certainly had 100% of it on 2024-11-08. Those $IPFS_PATH/blocks/*/*.data files somehow got deleted, which is what I'm confused about. Perhaps garbage collection ran as part of this process (which deleted them):
2024-11-19T15:51:01.592 ERROR pin dspinner/pin.go:1069 checked 13417 pins for invalid indexes, repaired 0 pins

It'd be helpful to get profiles during the bad events such as:

  • While getting close to OOMing
  • While it's hanging trying to add a small file

It's basically hanging again. $ ipfs daemon --mount has been running for 30 minutes and is stuck at "PeerID: [ID]"; that startup should normally reach the last message/line, "Daemon is ready", in under a minute. Adding a bunch of files to MFS is again what preceded this bug, which I am now experiencing a second time. I ran that command to add many files to MFS while the daemon was running, the daemon got oom-killed (putting me in offline mode), the command kept adding files to MFS while offline, and then I hit ctrl+c to cancel the xargs-related command. After it was canceled I ran $ ps aux | grep ipfs | tail -n2 and saw

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
u        1834203  8.4  2.0 1959552 79716 pts/0   Sl   07:22   0:21 ipfs files cp -p /ipfs/id... /dup/organize/...

The "Sl" status reported by $ ps -ef $ ps aux is bad, IIRC. (Edit: Sl = "S interruptible sleep (waiting for an event to complete)" + "l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)" stated in https://askubuntu.com/questions/1144036/process-status-of-s-s-s-sl and Z/zombie status is bad.) Last time the daemon startup was "hanging" it eventually did startup and I think you can only run $ ipfs diag profile while it's online. I expect it to "hang" again while adding a small file and I can share a profile of that while that's happening. (Maybe I should say sorry for the late reply; I kinda fell into a void, so to speak.)


ProximaNova commented Nov 24, 2024

@hsanjuan

I think your leveldb datastore blew up.

Does add work with --pin=false ?

leveldb is used to store the pinset and the mfs root, probably it is hanging there.

That doesn't explain your "data missing" part, as the blocks are stored in flatfs (the blocks folder). Please convert your CID to a file name with https://go.dev/play/p/WOuGq_mJWbU, then look whether somewhere in blocks there is a file named like that (ending in .data). Afaik the only way IPFS would have removed it is GC'ing, but as you said, that shouldn't have triggered.

Roughly 24 or 48 hours after $ ipfs add ... started hanging, it stopped hanging. A computer reboot didn't fix it, but running $ ipfs add ... multiple times did (IIRC) and it went back to working normally/quickly. So this time I ran it multiple times while it was in that bad state, and both times something was OOM-killed. I'm likely experiencing this problem again, so I can reproduce it and maybe get more clarity or insight on it. About my previous post: $ ipfs daemon --mount reached "WebUI: http://...", which is just before "Daemon is ready", after 37 minutes (2024-11-24 14:27:36 UTC to 2024-11-24 15:04:18 UTC).

About pinning being broken: in the past I was able to run $ ipfs pin add --progress CID (it would say "pinned CID success"), then run that command again and it would immediately report that the CID, a one-gigabyte or hundred-gigabyte folder, is pinned. Now, when I run the command again, it just says "Fetched/processed n nodes" until it eventually says "pin success", even though it was pinned successfully in the recent past. I've observed this brokenness multiple times.

That go.dev tool looks helpful; I had wished for something like it in the past! Anyway, I already did that conversion of the CID into the corresponding CIQ*.data files; I said so in the post above containing the text "It corresponds to these .data files:". How I did it: (1) saw that I didn't have all of a CID in "repo A" even though I did in the past; (2) downloaded a .car file of that CID from an HTTP-only website; (3) deleted everything in "repo B", leaving an empty repo; (4) imported said CAR file into repo B; (5) got a list of all .data files in repo B; (6) checked repo A to see which ones were missing: it only had 78 out of the 322. (Repo A and repo B are on two different computers.)

@ProximaNova (Author)

$ ipfs daemon --mount took 26 minutes to get to "Daemon is ready" this time (after I canceled the previous ipfs daemon --mount). (Saw running time via $ ps aux | grep ipfs -> $ ps -o etime 1835062.) Command $ ipfs add --cid-version=1 ... is again hanging on adding a small file. Here's TZ=UTC ipfs diag profile for that:
https://cdn3.filehaus.su/files/1732464038_66024/ipfs-profile-2024-11-24T15_53_19Z_00.zip also at the below IPFS link which is ipfs://bafybeigkenmsvcusn4gxqhedpva36ylzmxv2kbrhvt3glja7rr72saaxsm/ipfs-profile-2024-11-24T15_53_19Z_00.zip

Timeline:
2024-11-24 15:51:42 UTC: command to add a small file started
2024-11-24 15:53:11 UTC: $ ipfs diag profile command began
2024-11-24 15:53:51 UTC: $ ipfs diag profile command finished; it writes a file to the current working directory
2024-11-24 16:04:22 UTC: command to add a small file still running, NOT finished

What that looks like - Terminal tab 2:

$ utc; TZ=UTC ipfs diag profile; utc
2024-11-24T15:53:11.212678506Z
Wrote profiles to: ipfs-profile-2024-11-24T15_53_19Z_00.zip
2024-11-24T15:53:51.299314648Z

Terminal tab 1:

$ utc; ipfs add -rHw --cid-version=1 --chunker=size-1048576 cid.sh; utc
2024-11-24T15:51:42.067077766Z
added bafkreifb47p62choz45thl3hx7ybwxv6icj23ep2faj5f4wlott4srmnla cid.sh
added bafybeiafvmyfzr4pa5qwqswqq4ushkodgvqlmckr6ftxtb2lrcn7egdaz4 
 994 B / 994 B [===============] 100.00%
[hanging so I hit ctrl+c]^C
Error: canceled
2024-11-24T16:06:36.464197711Z
$
$ utc; ipfs add -rHw --cid-version=1 --chunker=size-1048576 ipfs-profile-2024-11-24T15_53_19Z_00.zip; utc
2024-11-24T16:01:21.076519169Z
added bafybeigsyvwtpuza7mht36xy4yzv4g6qktq7d2wegnskkk5qnbdxncew6e ipfs-profile-2024-11-24T15_53_19Z_00.zip
added bafybeigkenmsvcusn4gxqhedpva36ylzmxv2kbrhvt3glja7rr72saaxsm 
 50.20 MiB / 50.20 MiB [================] 100.00%
^C
Error: canceled
2024-11-24T16:09:06.687213647Z
$ # next: test with --pin=false per @hsanjuan

I then ran $ utc; ipfs add --pin=false -rHw --cid-version=1 --chunker=size-1048576 cid.sh; utc and it made no difference:
2024-11-24T16:10:46.280780203Z: start
2024-11-24T16:14:12.189227523Z: still running
2024-11-24T16:14:24.598811241Z: ^C \ Error: canceled [by me pressing ctrl+c]

@hsanjuan (Contributor)

A quick view on your stack:

  • I think leveldb is compacting. Leveldb is known to work badly at scale. They also suffer a lot of read-amplification penalty when not tuned correctly. It may be you are suffering both because you hit a pinset of certain size.
  • The reprovider is trying to read keys
  • The pinner is trying to read all the pins (perhaps for the reprovider)
  • Bitswap server is trying to read blocks via getBlockSizes
  • And the pinner is trying to pin (which it cannot do while it is also trying to read all the pins)

Try disabling the reprovider by setting the Interval to 0 (https://github.com/ipfs/kubo/blob/master/docs/config.md#reproviderinterval) and enabling StrategicProviding (https://github.com/ipfs/kubo/blob/master/docs/experimental-features.md#strategic-providing), which additionally disables bitswap providing, and see if anything improves.

Otherwise, backup the leveldb altogether and/or backup the list of pins. Then delete it and then try again and tell me if it keeps hanging. I'm still fairly convinced leveldb is the bottleneck.


ProximaNova commented Nov 26, 2024

Above, I said "Like a week ago I saw some error about a dirty flag not being cleared." This refers to this error, also had it today:
2024-11-26T06:27:56.794-0000 ERROR pin dspinner/pin.go:925 failed to set pin dirty flag: leveldb: closed

Also above, I described the ipfs daemon taking too long to start up (30 minutes or longer). A less-than-ideal workaround: start and stop the daemon 3 to 6 times, which takes about an hour and a half (it must reach "Daemon is ready" each time). After that, it starts up in less than 10 seconds. This method also seems to "fix" ipfs add hanging. ipfs add can apparently also hang in connection with "...failed to set pin dirty flag...", but that's a separate problem. If I add a bunch of files to MFS again, this taking-too-long-or-forever problem will likely come back, either because it happens in all cases or because it happens when adding while completely offline. Possible solution: similar to $ ipfs --offline ... (which I find helpful), there could be an $ ipfs --online ... setting. ipfs --online does not exist, but if it did, it would only perform operations while online.

Other problem: pinning appears to be broken. A less-than-ideal solution: roll back to a previous ZFS snapshot (I made none, so I can't). If I could roll back, the snapshot might also contain the .data files that were mysteriously deleted.

@hsanjuan

I think leveldb is compacting. Leveldb is known to work badly at scale. They also suffer a lot of read-amplification penalty when not tuned correctly. It may be you are suffering both because you hit a pinset of certain size.

By "pinset of certain size" I think you mean the total data / file size. I have 13,643 pins as reported by $ ipfs --offline pin ls --type=recursive.

Try disabling the reprovider by setting the Interval to 0 (https://github.com/ipfs/kubo/blob/master/docs/config.md#reproviderinterval) and enabling StrategicProviding (https://github.com/ipfs/kubo/blob/master/docs/experimental-features.md#strategic-providing), which additionally disables bitswap providing, and see if anything improves.

Will do, thanks for suggesting things that may help.

Then delete it and then try again and tell me if it keeps hanging. I'm still fairly convinced leveldb is the bottleneck.

I'm kinda out of storage space. Need ~$400 for another 18TB HDD (price+tax): maybe month(s) later. Edit: read next post.


ProximaNova commented Nov 26, 2024

Then delete it and then try again and tell me if it keeps hanging. I'm still fairly convinced leveldb is the bottleneck.

I'm kinda out of storage space. Need ~$400 for another 18TB HDD (price+tax): maybe month(s) later.

You posted "Otherwise, backup the leveldb altogether and/or backup the list of pins. Then delete it and then try again and tell me if it keeps hanging. I'm still fairly convinced leveldb is the bottleneck." I may have misinterpreted this as meaning "delete everything (after copying it elsewhere) so you can redo it". I think you actually meant that I should duplicate/backup then delete "$IPFS_PATH/datastore/*" then see if it works better. (datastore folder = like the same thing as leveldb as far as I know.)

For above changes ( #10588 (comment) ):
$ ipfs config Reprovider.Interval 0 ### disabling the reprovider setting the Interval to 0
$ ipfs config --json Experimental.StrategicProviding true ### Replaces the existing provide mechanism with a robust, strategic provider system. Currently enabling this option will provide nothing.
Because that disables things or limits functionality, this may be a temporary (or "permanent") solution to this problem.

@hsanjuan (Contributor)

I think you actually meant that I should duplicate/backup then delete "$IPFS_PATH/datastore/*"

Yes sorry, I meant that folder only. It does not contain the blocks, it only contains metadata (pinset, mfs root, reprovider queues).

I'm trying to establish if the hanging behavior you see is related to leveldb.

  • For that, trying with a new empty datastore folder and seeing if the issue still exists will tell me it's a leveldb problem
  • Disabling the reprovider etc. causes less contention (fewer reads), and it is interesting to know if things work without that too.
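
In case it helps, here is a rough Go sketch of that backup step. It rests on assumptions worth double-checking before moving anything: $IPFS_PATH points at the repo (defaulting to ~/.ipfs), ipfs shutdown stops a running daemon, and Kubo will create a fresh leveldb under datastore/ on the next start. It only moves the folder aside; nothing is deleted.

// backup_datastore.go: record the MFS root and pinset, stop the daemon, then
// move the leveldb datastore folder aside (a sketch of the suggestion above).
package main

import (
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "time"
)

func run(name string, args ...string) []byte {
    out, err := exec.Command(name, args...).CombinedOutput()
    if err != nil {
        fmt.Fprintf(os.Stderr, "%s %v failed: %v\n%s", name, args, err, out)
        os.Exit(1)
    }
    return out
}

func main() {
    repo := os.Getenv("IPFS_PATH")
    if repo == "" {
        repo = filepath.Join(os.Getenv("HOME"), ".ipfs")
    }

    // Save the current MFS root CID and the recursive pinset before touching anything.
    os.WriteFile("mfs-root.txt", run("ipfs", "files", "stat", "--hash", "/"), 0o644)
    os.WriteFile("pins.txt", run("ipfs", "pin", "ls", "--type=recursive"), 0o644)

    // Stop the daemon so leveldb is closed cleanly (ignore the error if none is running).
    exec.Command("ipfs", "shutdown").Run()
    time.Sleep(5 * time.Second)

    // Move datastore/ aside instead of deleting it, so it can be restored by moving it back.
    src := filepath.Join(repo, "datastore")
    dst := fmt.Sprintf("%s.bak-%d", src, time.Now().Unix())
    if err := os.Rename(src, dst); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    fmt.Println("moved", src, "to", dst)
    // To restore state into the fresh datastore later: re-add pins from pins.txt and
    // copy the saved root back, e.g. `ipfs files cp /ipfs/<root-cid> /restored-root`.
}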

@ProximaNova (Author)

Days ago I did some tests with

$ ipfs config Reprovider.Interval 0
$ ipfs config --json Experimental.StrategicProviding true

but I think I'm going to have to ignore the results I got with that because it wrote the following to the config:

    "StrategicProviding": true,

which is fine, but

  "Reprovider": {
    "Interval": "0"
  },

may be wrong. The string for "0" may not be the same as the JSON number type for 0, which is "Interval": 0

Here are the results anyway, which should maybe be ignored: around the same time I tried adding many files to MFS, and both attempts resulted in OOM. The out-of-memory condition either froze the entire computer or got the daemon killed. Around that time I also tried to get a profile by running ipfs diag profile before the ipfs daemon (or something else) was about to get OOM'd, but that ipfs diag profile made the computer freeze, likely due to using too much memory. (Then, when I rebooted the computer, my boot partition was gone; fixed after hours of work restoring that partition.) Before running ipfs diag profile while lots of RAM was being used, here's what htop looked like: ipfs://bafkreidzgeyxrlbhbrz34ns23gg6fx2in23ysg2fluix7a6bcixufb7ir4 - also at http://ongtrong.xyz/4XwsdZi4_3v7np2i4sr-0wNi3Ji3RfY6kXDkSR34ENE


hsanjuan commented Dec 5, 2024

"Interval": "0"

This is fine because this gets parsed as a duration (e.g. "0s" or "10m"), so it should be a string.

Did you try with an empty leveldb datastore? And does it still hang when adding something small?

@ProximaNova (Author)

This is fine because this gets parsed as a duration (e.g. "0s" or "10m"), so it should be a string.

OK, and I assume non-string 0 also works.

And does it still hang when adding something small?

Not currently. But it will hang again if I add many files to MFS (with that command) and re-trigger the bug; that has happened twice so far, as documented in this thread.

Did you try with an empty leveldb datastore?

No. What I was doing was running a command which automatically ran ipfs files cp -p /ipfs/src /mfs/dest on 1,105,578 files which I already have in full in my ipfs repo, just not at the desired MFS paths. Roughly: 0.3 MB/file, 300 MB per 1,000 files, and 30 GB per 100,000 files. In order to test this with an empty leveldb datastore I would have to do this: (1) have a copy of the files, 30 to 90 GB of them for roughly 200,000 files, via ipfs get or something; (2) make a backup/copy of datastore/ and then empty "datastore/"; (3) import the ~90 GB worth of CAR files, or pin a remote CID which corresponds to those files, or otherwise re-add those CIDs so they are pinned/recorded in the repo's datastore; (4) run that cat /home/u/Downloads/t5.txt | xargs -d "\n" sh -c 'for args do ... command and see if it works any better.

So that would take some effort and time to do. There are now 384,226 lines in the log file; tail shows that adding files to MFS slowed down at the end of the most recent attempt:

$ tail /home/u/Downloads/t6.txt | uniq
Thu Nov 28 08:02:06 PM UTC 2024
Thu Nov 28 08:02:14 PM UTC 2024
Thu Nov 28 08:02:17 PM UTC 2024
Thu Nov 28 08:02:52 PM UTC 2024
Thu Nov 28 08:03:59 PM UTC 2024
$ # Try again, but only add ~50,000 at a time?

Roughly 300,000 out of 1 million are done. If I had logged it better I could say more confidently whether "Reprovider.Interval 0" and "Experimental.StrategicProviding true" helped at all. It seems like they did, but I could try again with better logging this time. (I can always test the thing I'm trying to do with an empty leveldb datastore; I just have to put in the effort.)

@ProximaNova (Author)

@hsanjuan

I think leveldb is compacting. Leveldb is known to work badly at scale.

So leveldb does not work well at large scale. What works better, then? I've heard of badgerdb, IIRC; maybe that works better. badgerdb is used in some IPFS implementations/setups.


hsanjuan commented Dec 5, 2024

In order to test this with an empty leveldb datastore I would have to do this: (1) ...

I think we already established that deleting the leveldb folder does not delete the data (which is in the blocks folder), only the list of pins, the mfs root and the pending-reprovide entries. You should see that this folder is much smaller in size, and you can back it up easily.

OK, and I assume non-string 0 also works.

Probably not.

I would like you to start with an empty leveldb datastore and the reprovider settings I mentioned, and see if simple things work. If they do, you can try your major operations, and when they degrade please get profiles so that we know where the memory is going.

@ProximaNova (Author)

I think we already established that deleting the leveldb folder does not delete the data (which is in the blocks folder), only the list of pins, the mfs root and the pending-reprovide entries. You should see that this folder is much smaller in size, and you can back it up easily.

Yes, I know. I essentially knew that multiple posts ago and also months ago. The point of the steps posted in #10588 (comment) was to lay out the process I would need to go through in order to test that.

Probably not.

Alright, so string zero it is.


hsanjuan commented Dec 6, 2024

Ok, I am going to assume it is leveldb. Kubo now supports pebble, so you should be able to replace the leveldb part of the configuration with pebble. You can play with that and let us know if things get out of control again.

@ProximaNova (Author)

ProximaNova commented Dec 7, 2024

About "I think leveldb is compacting." Seems I know what you mean by that. I did another command where many files are added to MFS. It oom'd the daemon and kept working in offline mode: I let that run for >12 hours. What the datastore folder looks like after I recently canceled that command:

$ find ./datastore -type f | wc -l
32003
$ du -sh ./datastore
171M [...]

That's 32,003 files totaling 171 MB. Normally it's roughly 60 files at around 50 MB. This should explain why the ipfs daemon takes so long to start up and why ipfs add acts slowly in this condition. The hang-like behavior is likely a combination of it taking a while to read a directory containing that many files and then compacting that data into fewer files. I can post more details once I fix this again, which takes about 90 minutes. Details of interest after putting it back to normal: how many files are in "datastore/" and what is their total size?

@hsanjuan (Contributor)

Hello @ProximaNova, can you try starting kubo with export MFS_PIN_POLL_INTERVAL=99999999m set beforehand?

I think this will fix the "copying to mfs gets slow" problem.

@hsanjuan hsanjuan self-assigned this Dec 11, 2024

lidel commented Dec 17, 2024

hsanjuan added a commit that referenced this issue Dec 18, 2024
This is a mitigation for increased MFS memory usage in the course of many write operations.

The underlying issue is the unbounded growth of the mfs directory cache in
boxo. In the latest boxo version, this cache can be cleared by calling Flush()
on the folder. In order to trigger that, we call Flush() on the parent folder
of the file/folder where the write operations are happening.

Not flushing the parent folder allows it to grow unbounded. Then, any read
operation on that folder or its parents (i.e. stat) will trigger a sync operation to match
the cache to the underlying unixfs structure (and obtain the correct node CID).

This sync operation must visit every item in the cache. When the cache has grown too much,
and the underlying unixfs-folder has switched into a HAMT, the operation can take minutes.

Thus, we should clear the cache often and the Flush flag is a good indicator
that we can let it go. Users can always run with --flush=false and flush at
regular intervals during their MFS writes if they want to extract some performance.

Fixes #8694, #10588.
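
To make that last point concrete, here is a hedged sketch of a batched variant of the copy loop from the top of this issue. Assumptions on my part: ipfs files --flush=false cp -p combined with a periodic ipfs files flush of the parent directory behaves as the message above describes; the /dup/organize parent path comes from the original command; and the batch size of 1000 is arbitrary.

// batched_mfscopy.go: the "--flush=false plus periodic flush" pattern (sketch).
// Reads "[cid] [filename]" lines on stdin, copies each into MFS without
// flushing, and flushes the parent directory every batchSize operations.
package main

import (
    "bufio"
    "fmt"
    "os"
    "os/exec"
    "strings"
)

const batchSize = 1000 // arbitrary; tune to taste

// flushParent syncs the cached MFS directory state to disk so the in-memory
// directory cache does not grow without bound.
func flushParent() {
    exec.Command("ipfs", "files", "flush", "/dup/organize").Run()
}

func main() {
    n := 0
    sc := bufio.NewScanner(os.Stdin)
    for sc.Scan() {
        f := strings.Fields(sc.Text())
        if len(f) != 2 || len(f[1]) < 6 {
            continue
        }
        dst := fmt.Sprintf("/dup/organize/4chan/mlp/%s/%s/%s", f[1][:4], f[1][4:6], f[1])
        // --flush=false: skip writing out the updated MFS root on every single copy.
        cmd := exec.Command("ipfs", "files", "--flush=false", "cp", "-p", "/ipfs/"+f[0], dst)
        if out, err := cmd.CombinedOutput(); err != nil {
            fmt.Fprintf(os.Stderr, "cp %s failed: %v\n%s", f[0], err, out)
            continue
        }
        n++
        if n%batchSize == 0 {
            flushParent()
        }
    }
    flushParent() // persist the final partial batch too
}

Usage would be something like: cat /home/u/Downloads/t5.txt | go run batched_mfscopy.go
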
@hsanjuan (Contributor)

See summary: #8694 (comment)

gammazero pushed a commit that referenced this issue Dec 19, 2024

* cmd/files: flush parent folders (commit message identical to the one above)

* cmd/files: docs and changelog for --flush changes