Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Observed NPEs during contract invoke #1585

Open
SamMayWork opened this issue Sep 19, 2024 · 2 comments
Open

Observed NPEs during contract invoke #1585

SamMayWork opened this issue Sep 19, 2024 · 2 comments
Assignees

Comments

@SamMayWork
Copy link
Contributor

SamMayWork commented Sep 19, 2024

When invoking a method on a contract, we've occasionally seen FireFly get into a state where we get an NPE from the syncasync bridge and then no response is returned. It's not clear at the moment what causes FireFly to get into this state, but hard restarting the FireFly process seems to get it out of this state and allow invoke operations to succeed.

We've seen this issue in 1.2.2 but suspect the same kind of problem might happen in 1.3.X as it's not currently fixed yet.

Stack Trace
http2: panic serving: runtime error: invalid memory address or nil pointer dereference
goroutine 142759 [running]:
net/http.(*http2serverConn).runHandler.func1()
	/usr/local/go/src/net/http/h2_bundle.go:5903 +0x125
panic({0xf54100, 0x18a3e20})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/hyperledger/firefly/internal/syncasync.(*syncAsyncBridge).addInFlight(0xc00037ce00, {0xc0005161a0, 0xd}, 0xc000e996b0, 0x6)
	/firefly/internal/syncasync/sync_async_bridge.go:142 +0x1f1
github.com/hyperledger/firefly/internal/syncasync.(*syncAsyncBridge).sendAndWait(0xc00037ce00, {0x12c37b8, 0xc000acce40}, {0xc0005161a0, 0xd}, 0x20?, 0xfcf9c0?, 0xc000c70680)
	/firefly/internal/syncasync/sync_async_bridge.go:622 +0x8a
github.com/hyperledger/firefly/internal/syncasync.(*syncAsyncBridge).WaitForInvokeOperation(0xc00020d880?, {0x12c37b8?, 0xc000acce40?}, 0xc001889440?, 0x1?)
	/firefly/internal/syncasync/sync_async_bridge.go:700 +0x45
github.com/hyperledger/firefly/internal/contracts.(*contractManager).InvokeContract(0xc00020d880, {0x12c37b8, 0xc000acce40}, 0xc001889440, 0x1)
	/firefly/internal/contracts/manager.go:444 +0x525
github.com/hyperledger/firefly/internal/contracts.(*contractManager).InvokeContractAPI(0xc00020d880, {0x12c37b8, 0xc000acce40}, {0xc000feb6a6, 0x31}, {0xc000feb6df, 0x7}, 0xc001889440, 0x40?)
	/firefly/internal/contracts/manager.go:468 +0x145
github.com/hyperledger/firefly/internal/apiserver.glob..func167(0xc0003cc120, 0xc001391500)
	/firefly/internal/apiserver/route_post_contract_api_invoke.go:53 +0x1b5
github.com/hyperledger/firefly/internal/apiserver.(*apiServer).routeHandler.func1(0xc0003cc120)
	/firefly/internal/apiserver/server.go:232 +0x3a9
github.com/hyperledger/firefly-common/pkg/ffapi.(*HandlerFactory).RouteHandler.func1({0x12c2968, 0xc000897fa0}, 0xc001a6ce00)
	/go/pkg/mod/github.com/hyperledger/[email protected]/pkg/ffapi/handler.go:201 +0x73f
github.com/hyperledger/firefly-common/pkg/ffapi.(*HandlerFactory).APIWrapper.func1({0x12c2968?, 0xc000897fa0}, 0xc001a6cc00)
	/go/pkg/mod/github.com/hyperledger/[email protected]/pkg/ffapi/handler.go:303 +0x362
net/http.HandlerFunc.ServeHTTP(0xc001a6cc00?, {0x12c2968?, 0xc000897fa0?}, 0x0?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
gitlab.com/hfuss/mux-prometheus/pkg/middleware.(*Instrumentation).Middleware.func1({0x12c2c68?, 0xc00179c230}, 0xc001a6cc00)
	/go/pkg/mod/gitlab.com/hfuss/[email protected]/pkg/middleware/middleware.go:69 +0xef
net/http.HandlerFunc.ServeHTTP(0xc001a6ca00?, {0x12c2c68?, 0xc00179c230?}, 0x417225?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/gorilla/mux.(*Router).ServeHTTP(0xc0005a15c0, {0x12c2c68, 0xc00179c230}, 0xc000f53000)
	/go/pkg/mod/github.com/gorilla/[email protected]/mux.go:210 +0x1cf
github.com/rs/cors.(*Cors).Handler.func1({0x12c2c68, 0xc00179c230}, 0xc000f53000)
	/go/pkg/mod/github.com/rs/[email protected]/cors.go:236 +0x1c4
net/http.HandlerFunc.ServeHTTP(0xc000feb6e7?, {0x12c2c68?, 0xc00179c230?}, 0x5?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0x632b00?}, {0x12c2c68, 0xc00179c230}, 0xc000f53000)
	/usr/local/go/src/net/http/server.go:2947 +0x30c
net/http.initALPNRequest.ServeHTTP({{0x12c37b8?, 0xc0008afb00?}, 0xc0019e9880?, {0xc0006320f0?}}, {0x12c2c68, 0xc00179c230}, 0xc000f53000)
	/usr/local/go/src/net/http/server.go:3556 +0x245
net/http.(*http2serverConn).runHandler(0x0?, 0xc0018c5320?, 0x0?, 0x0?)
	/usr/local/go/src/net/http/h2_bundle.go:5910 +0x78
created by net/http.(*http2serverConn).processHeaders
	/usr/local/go/src/net/http/h2_bundle.go:5645 +0x5b9
@EnriqueL8
Copy link
Contributor

EnriqueL8 commented Sep 19, 2024

Investigation and Outcomes

OS: Docker Composer (observed in K8s as well)
Blockchain: Ethereum

Scenario

User of FireFly tried to invoke a smart contract through the API deployed using confirm=true, FireFly hits a Null Pointer Exception in the code. The confirm path adds the transaction to a queue and calls sysevents which is the line in the stack trace that has the null pointer exception.

Why is this sysevents nil when the namespace has started correctly?

Steps to Reproduce

We have a running FireFly stack at version at unreleased v1.3.0 using the CLI which is connected to a transaction manager. FireFly has at least one contract listener created. Transaction manager is unreachable by FireFly for some unknown reason, maybe due to the networking timeouts. For this scenario, we took down the EVMConnect container. We have configured FireFly to start the blockchain plugin in backgroundStart mode and with retry which means the FireFly namespace will start up correctly but the blockchain plugin will keep retrying in a background process.

We then tried to invoke a Contract API with confirm=true and were able to reproduce the NPE.

Found the cause of the issue

After enabling debug mode in VSCode with FireFly, we were able to step through the initialisation process of FireFly setting up the plugins. In turns out there is a bug in the way the Contract Manager (responsible of handing contract listeners, etc...) is initialised. As part of the initialisation of Contract Manager, it will verify that the contract listener exist in the transaction manager and it takes a syncasyncbridge plugin for the scenarios of confirm=true. Since the TM is down, it fails to do so but on return of the NewContractManager function it returns both the object/struct and an error. The code catches the error correctly and returns it top level to retry this init step but it also stores the ContractManager reference to the struct in the Orchestrator. This struct containers a pointer to the syncasyncbridge object passed in at initialisation.

Link to creation of the Contract Manager

or.contracts, err = contracts.NewContractManager(ctx, or.namespace.Name, or.database(), or.blockchain(), or.data, or.broadcast, or.messaging, or.batch, or.identity, or.operations, or.txHelper, or.txWriter, or.syncasync, or.cacheManager)

Now FireFly retried the initialisation of the namespace by going through the same flow and starting the plugins. As part of this, it creates a new syncasyncbridge plugin and it checks if the ContractManager in the Orchestrator is nil but as we have explained above it was set in the previous run. So now the orchestrator has a new sysevents plugin and old ContractManager plugin pointing to an old syncasyncbridge plugin. Furthermore down in the initialisation of namespace, the code calls syncasyncbridge.Init(events plugin) using the new syncasyncbridge.Init plugin.

Link to where Init is called

or.syncasync.Init(or.events)

When a user calls an invoke on Contract API with confirm=true, the ContractManager struct uses the syncasyncbridge plugin to manage inflight transactions. As part of that flow, the syncasyncbridge plugin will try and call the events plugin. Because the ContractManager is pointing to an old version of the syncasyncbridge plugin for which Init(events plugin) was not called, it tried to actually a nil events plugin

Proposed Fix

In the InitComponents function, when initialising the ContractManager plugin make sure not to set the orchestrator.Contracts if an error has occurred. So that the next time the init happens, it gets recreated correctly. Also validate this for other plugins, because I've seen this pattern used before...

@SamMayWork
Copy link
Contributor Author

Recreate

Since this is a kind of nasty issue to try and recreate, I'll provide here the full configuration for getting this working plus the exact steps that have been followed.

  1. Create a new FireFly stack using the FF CLI, create 1 node with multiparty set to false
  2. Start the stack to generate the /runtime config in the stack directory
  3. Remove FireFly core from the Docker Compose file
  4. Change the config to point to the 127.0.0.1 addresses for all the services rather than through the Docker network
  5. Change the DB location to point to somewhere on your local machine
  6. Make sure Background Start is enabled for FireFly in your configuration *1
  7. Stop the FireFly stack
  8. Start the FireFly stack
  9. Start FireFly through a local VSCode with the config pointing to the config for the stack on your local machine
  10. Navigate to the API for FireFly
  11. Deploy a contract and create a blockchain listener for it*2
  12. Stop the EVM Connect container from the command line
  13. Restart FireFly

*1 - Use this configuration to enable background start

plugins:
  blockchain:
  - ethereum:
      ethconnect:
        retry:
          enabled: true
        backgroundStart:
          enabled: true
        topic: "0"
        url: http://127.0.0.1:5102
    name: blockchain0
    type: ethereum
  ...

*2 - For FireFly 1.3 use this guide, for FireFly 1.2 use this guide

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants