IPFS Frequently Hangs #8409

Closed
3 tasks done
kevincox opened this issue Sep 5, 2021 · 21 comments
Labels
kind/bug (A bug in existing code, including security flaws), kind/stale, need/author-input (Needs input from the original author)

Comments

@kevincox

kevincox commented Sep 5, 2021

Checklist

Installation method

third-party binary

Version

go-ipfs version: 0.9.1
Repo version: 11
System version: amd64/linux
Golang version: go1.16.7

Config

Note: ipfs config show hangs while this problem is occurring. The config below was obtained by:

  • Killing the ipfs daemon.
  • Observing that ipfs config show still hangs.
  • Restarting the ipfs daemon (which appears to fix the repo state).
  • Running ipfs config show again, which returned the following.
{
  "API": {
    "HTTPHeaders": {
      "Access-Control-Allow-Origin": [
        "REDACTED"
      ]
    }
  },
  "Addresses": {
    "API": "/unix/run/ipfs-api.sock",
    "Announce": [],
    "Gateway": "/unix/run/ipfs-gateway.sock",
    "NoAnnounce": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic",
      "/ip6/::/udp/4001/quic"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt"
  ],
  "DNS": {
    "Resolvers": null
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "10GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false,
      "Interval": 10
    }
  },
  "Experimental": {
    "AcceleratedDHTClient": false,
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "ShardingEnabled": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range",
        "User-Agent"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "REDACTED"
  },
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Migration": {
    "DownloadSources": null,
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": null
  },
  "Pinning": {},
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  },
  "Routing": {
    "Type": "dht"
  },
  "Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 900,
      "LowWater": 600,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "EnableAutoRelay": false,
    "EnableRelayHop": false,
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Description

After running for a while go-ipfs hangs.

I tried following the steps here: https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md

% ipfs diag profile
Error: expected 0 argument(s), got 1

USAGE
  ipfs diag - Generate diagnostic reports.

  ipfs diag

SUBCOMMANDS
  ipfs diag cmds - List commands run on this IPFS node.
  ipfs diag sys  - Print system diagnostic information.

  For more information about each command, use:
  'ipfs diag <subcmd> --help'

All of the following hang.

  • curl localhost:5001/debug/pprof/goroutine?debug=2 > ipfs.stacks
  • curl localhost:5001/debug/pprof/profile > ipfs.cpuprof
  • curl localhost:5001/debug/pprof/heap > ipfs.heap
  • curl localhost:5001/debug/vars > ipfs.vars
  • ipfs diag sys > ipfs.sysinfo

I tried getting backtraces from GDB but the results only contain the go runtime. Let me know if you want a copy of those.

My monitoring shows an interesting symptom. The issue can reliably be spotted when all of the memory transitions from file and slab to anon, and the CPU and network traffic become very irregular, occurring only in bursts.

[screenshot: monitoring graphs of memory, CPU, and network usage]

Killing the process normally doesn't work; it must be killed with SIGKILL.

kevincox added the kind/bug (A bug in existing code, including security flaws) and need/triage (Needs initial labeling and prioritization) labels on Sep 5, 2021
@kevincox
Author

kevincox commented Sep 5, 2021

I also have the following instance where the memory change started to happen but IPFS appears to have kept working properly. The event correlates with re-adding a large folder (~40GiB) to IPFS, which I don't think would have created any new blocks.

[screenshot: monitoring graphs]

@kevincox
Author

kevincox commented Sep 5, 2021

I can't see anything strongly related with the last occurrence.

@Stebalien
Member

I tried following the steps here: https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md

Ah, sorry, those are the instructions for the latest master (next release). Here are the instructions for v0.9.1:

https://github.com/ipfs/go-ipfs/blob/v0.9.1/docs/debug-guide.md, the script is in https://github.com/ipfs/go-ipfs/blob/v0.9.1/bin/collect-profiles.sh.
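
(For reference, on releases that follow the newer guide linked earlier, collection is a single command; a minimal sketch, assuming a go-ipfs/kubo build recent enough to include it — it is not present in v0.9.1:

  # Bundles goroutine, CPU, and heap profiles into one archive on newer releases.
  % ipfs diag profile
)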

@kevincox
Author

kevincox commented Sep 6, 2021

Thanks, but I think this has the same problem of requiring the daemon to be responsive, as it just asks the daemon to produce the dumps.

Furthermore, I am serving the API via a Unix socket for access control, so the old script wouldn't be helpful even if ipfs were responsive.
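
(For completeness, the pprof endpoints can in principle still be reached over the Unix socket with curl's --unix-socket flag; a rough sketch, assuming the /unix/run/ipfs-api.sock multiaddr in the config above maps to the filesystem path /run/ipfs-api.sock, and that the daemon is responsive enough to answer at all:

  # curl connects to the socket; the URL hostname only sets the Host header.
  % curl --unix-socket /run/ipfs-api.sock 'http://localhost/debug/pprof/goroutine?debug=2' > ipfs.stacks
  % curl --unix-socket /run/ipfs-api.sock 'http://localhost/debug/pprof/heap' > ipfs.heap
)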

@Stebalien
Member

All of the following hang.

That is bizarre. I've seen this kind of thing before, but it usually means IPFS is out of memory (which doesn't look like the case here).

Next time this happens, could you kill go-ipfs with a SIGQUIT and capture STDOUT+STDERR? That'll kill IPFS and dump go stack traces.

I'd also check for any suspicious messages in dmesg.
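
(A rough sketch of capturing that, assuming the daemon was started by hand — for a systemd unit the dump lands in the journal instead:

  # Start the daemon with stdout+stderr redirected to a file...
  % ipfs daemon > ipfs.out 2>&1 &
  # ...then, when it hangs, ask the Go runtime to dump all goroutine stacks and exit.
  % kill -QUIT $(pidof ipfs)
)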

@kevincox
Author

kevincox commented Sep 6, 2021

I checked and there was no dmesg output from that time. Is there any particular reason you are interested in dmesg? Seems like an odd place to look for problems.

I'll try to send SIGQUIT next time this occurs.

@Winterhuman
Contributor

Winterhuman commented Sep 6, 2021

Just a wild guess, but maybe check journalctl and specifically look for OOM kills; the node might not be using all the RAM, but the system might be preventing it from doing so. Also, does your fstab file put any size limits on tmpfs filesystems?
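
(A couple of quick checks along those lines; a sketch, and the exact kernel log phrasing varies:

  # Look for OOM-killer activity in the kernel log.
  % journalctl -k | grep -iE 'out of memory|oom'
  # List tmpfs mounts and any size limits applied to them.
  % findmnt -t tmpfs
)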

@kevincox
Author

kevincox commented Sep 6, 2021

Nothing. Also, this system is almost always lightly loaded on RAM (<4 of 16 GiB used).

Although IIUC go-ipfs is just one process, so if it were OOM-killed it shouldn't hang; it would just be shut down.

@Stebalien
Member

Stebalien commented Sep 6, 2021 via email

@Stebalien
Member

Stebalien commented Sep 6, 2021 via email

@kevincox
Author

kevincox commented Sep 6, 2021

The rest of the system is operating fine, so unless go-ipfs is somehow particularly sensitive, that is an unlikely cause.

@kevincox
Author

kevincox commented Sep 8, 2021

I killed it with SIGQUIT and got call traces.

ipfs.log

Oddly, generating these traces took a very long time, about 12 hours. Other processes on the system were very responsive, and while there were a small number of minor page faults, I can't see anything that would explain the slowness. Furthermore, the network traffic and CPU usage didn't stop, so it seems like some background work was still going on and was maybe somehow slowing down the trace handler?

The graphs look similar, though interestingly the CPU usage is trending down. I'm curious what would happen if I left it hung for 48 hours or so to see what happens to those bursts of CPU usage. They do appear to be flattening out, but it is not completely clear.

[screenshot: monitoring graphs]

@Stebalien
Member

Ok, so, it looks like you're receiving a lot of DHT provider puts.

The strange thing is that all of these operations seem to have been stalled for 10 hours. The worker that's supposed to handle them is processing a provider get, but that get doesn't appear to be stalled.

The simplest workaround is to turn on DHT "client" mode (set Routing.Type to "dhtclient") while we figure out what's going on here.
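
(For reference, that workaround amounts to something like the following; a sketch, and the daemon must be restarted for the change to take effect, with the restart command depending on how the daemon is managed:

  % ipfs config Routing.Type dhtclient
  % systemctl restart ipfs    # or restart the daemon however it is run
)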

@kevincox
Author

kevincox commented Sep 8, 2021

Ok, I'll try setting that conf and see if it mitigates the issue.

@kevincox
Author

I haven't seen this hang in a couple of days so it does seem that the mitigation is working.

@aschmahmann
Contributor

@Stebalien is this related to libp2p/go-libp2p-kad-dht#729?

@Stebalien
Member

That and timing out/giving up, yes.

@guseggert
Contributor

Following up, have you tried newer versions of kubo (aka go-ipfs)?

guseggert added the need/author-input (Needs input from the original author) label and removed the need/triage (Needs initial labeling and prioritization) label on Aug 5, 2022
@kevincox
Author

kevincox commented Aug 5, 2022

I haven't removed the override and have shut down my IPFS node for now.

@github-actions

Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days.

@github-actions

This issue was closed because it is missing author input.
