Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: for/range with iterator results in program that never terminates #69434

Closed
vsivsi opened this issue Sep 12, 2024 · 9 comments
Closed
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. NextMinor A critical problem that affects the availability or correctness of production systems built using Go release-blocker
Milestone

Comments

@vsivsi
Copy link

vsivsi commented Sep 12, 2024

Go version

go version go1.23.1 darwin/amd64

Output of go env in your module/workspace:

go env
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/Users/user/Library/Caches/go-build'
GOENV='/Users/user/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/user/go/pkg/mod'
GONOPROXY='github.com/vsivsi'
GONOSUMDB='github.com/vsivsi'
GOOS='darwin'
GOPATH='/Users/user/go'
GOPRIVATE='github.com/vsivsi'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/Cellar/go/1.23.1/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/local/Cellar/go/1.23.1/libexec/pkg/tool/darwin_amd64'
GOVCS=''
GOVERSION='go1.23.1'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/user/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/Users/user/wordcounter/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/h5/p_k0bq5d55g3r_v_sz5ghz5h0000gn/T/go-build693439929=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

Modified an existing "pull" style io.Reader loop to instead use a go 1.23 iterator, with no other changes to the following logic in the loop body.

So this:

for {
	n, err := wordReader.Read(buf)
	if err == io.EOF {
		break
	} else if err != nil {
		panic(err)
	}
	word := string(buf[:n])
		
	// rest of loop logic
}

Became this:

for word := range wordReader.All() {

 	// rest of loop logic (same as case above)

}

Where new wordReader.All() iterator implementation is based on wordReader.Read():

// All returns an iterator allowing the caller to iterate over the WordReader using for/range.
func (wr *WordReader) All() iter.Seq[string] {
	word := make([]byte, 1024)
	return func(yield func(string) bool) {
		var err error
		var n int
		for n, err = wr.Read(word); err == nil; n, err = wr.Read(word) {
			if !yield(string(word[:n])) {
				return
			}
		}
		if err != io.EOF {
			fmt.Fprintf(os.Stderr, "error reading word: %v\n", err)
		}
	}
}

A full reproduction including both the original "pull" and new "iterator" versions of the loop is available here:
https://go.dev/play/p/ENULYowKCLS

What did you see happen?

Executing with the new range/iterator loop does not terminate. In the go playground link above this manifests as the original "pull" style loop completing immediately (in EstimateUniqueWordsPull()) , and then the program times-out the playground while running EstimateUniqueWordsIter().

If run locally as a CLI, the program does not terminate.

Note! Frustratingly there are many cases that rescue this non-termination behavior. Running under a debug build or using the race detector works as expected.

As do even seemingly trivial changes to the code, e.g. in the Playground repro:

     // fmt.Println("Rounds", rounds) // ######### Uncomment line to rescue

Simply adding a single Println reverts to the expected behavior. This feels very racy to me, but there are no goroutine calls in the reproducing code, so this is a pretty clear indication that something bad is happening out of sight.

What did you expect to see?

I expect the two loops documented above as implemented in EstimateUniqueWordsPull() and EstimateUniqueWordsIter() to behave essentially identically, each running nearly instantaneously on a small test input, and the program should return a result for each and properly terminate.

Note, in the reproduction code in the playground, it would be normal for either of the above functions to return values resulting in either the "PASS" case or the "expected X, got Y" mismatch. This code implements a probabilistic counting estimation algorithm that uses randomness, as described here:
https://www.quantamagazine.org/computer-scientists-invent-an-efficient-new-way-to-count-20240516/

@gabyhelp
Copy link

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 12, 2024
@adonovan
Copy link
Member

When I run this program it spends a lot of time in refreshCounter trying to count to three hundred billion:

		for i := 0; i < pr.n-1; i++ {
			pr.bitmask &= rand.Uint64()
		}

I don't think the problem is in the compiler or runtime.

@vsivsi
Copy link
Author

vsivsi commented Sep 13, 2024

@adonovan Thanks for taking a look. Good sleuthing. I agree with the symptom that you uncovered, but continue to suspect something fishy is going on here. I sincerely wish I had a simpler reproduction of this, but well, I assume that this new iterator feature is not blatantly broken... And so a simpler reproduction may not be in the cards. Anyhow, based on your clue, here is some possible firmer evidence that something is amiss:

I've added a check method to the struct holding the value of pr.n used in the loop you highlighted above. Because this issue seemingly can't be reproduced in the debugger, we'll resort to debugging with panics.

// check panics if pr.n is not the expected value
func (pr *ProbabilisticSkipper) check(n int) {
	if pr.n != n {
		panic(fmt.Sprintf("check: pr.n != n  %d != %d", pr.n, n))
	}
}

If I place a check call at the top and bottom of the range loop with the iterator, it appears that the value of pr.n is getting clobbered between the end of one iteration of the loop and the start of another, because the panic always appears to happen at the top.

So, the loop now looks like this:

wordSkipper.check(rounds)     // No panic

for word := range wordReader.All() {
	wordSkipper.check(rounds)       //  Panic always occurs here

	// ... all other loop logic...

	wordSkipper.check(rounds)      //  Never panics here
}

So the mystery for me is, why is this struct member value seemingly "randomly" changing between iterations of the "iterator" loop (but not the "pull" loop)? And why not for debug builds?

My iterator function doesn't have anything to do with this wordSkipper value, it literally makes no reference to that value or any value of its type in any way, so I don't understand how wordSkipper is getting modified by any code I can see.

Here's the new reproduction code with the above changes: https://go.dev/play/p/iEMVpdhy3go

Thanks again for taking a look. I've been programming for along time, so I've been humbled enough times that I'm certainly not beyond considering that there's some subtle issue with my code. But I have spent a bunch of time looking at this now, and boy is it not obvious what's going on here. Like I said above, this feels super racy no me, but I'm not doing anything overtly concurrent...

@randall77
Copy link
Contributor

This does look like a compiler bug to me. The variable wordSkipper should live in the parent frame and be captured by address in the loop body closure so the body can update it appropriately. But looking at the assembly the function body allocates its own copy of wordSkipper on the stack. It isn't at the same location as the outer one (and isn't copied back when done).

@drchase

@xieyuschen
Copy link

xieyuschen commented Sep 13, 2024

It's likely a compiler issue, but it's not the issue of range-over-func blocks forever, instead, it's caused by the generated code doesn't handle the parameter properly, which triggers a long-time running for-loop inside users code. I tried to clear the code in this go playground. Probably you can focus on this code snippet:

	if len(words) >= memorySize {
		rounds++ // generated store ssa instruction doesn't take effect, rounds value is wrong as 309237645415
		// fmt.Println(rounds) // generated load ssa instruction take effect
		wordSkipper = NewProbabilisticSkipper(rounds)
		for w := range words {
			if roundRemover.ShouldSkip() {
				delete(words, w)
			}
		}
	}
  • case1: use assignment of rounds rounds++ and comment the fmt.Println
	rounds++ // generated store ssa instruction doesn't take effect, ...
	// fmt.Println(rounds) // generated load ssa instruction take effect
  • case2: comment assignment of rounds and use fmt.Println
	// rounds++ // generated store ssa instruction doesn't take effect, ...
	fmt.Println(rounds) // generated load ssa instruction take effect
  • more cases:
rand.Intn(rounds) // 3. cannot work, used to check whether std lib matters
load(rounds) // 4. cannot work, when it's custom function
readInterface(rounds) // 5. work, which receives an interface 

func readInterface(r interface{}) {
	fmt.Println(r)
}
func load(r int) {
	fmt.Println(r)
	_ = r
}

The output is shown below when I use gotip run main.go on my local. Amazingly, the output of case1 is the same at the remote go playground. So I wonder is there something special for value 309237645415, do you have some ideas about it? @randall77

image

// output for case1
do bit operation 0 times when refresh counter in refreshCounter
do bit operation 1 times when refresh counter in refreshCounter
do bit operation 309237645415 times when refresh counter in refreshCounter

// output for case2
do bit operation 0 times when refresh counter in refreshCounter
0
Iter: expected 6 unique words, got 5

Looks like the compiler hasn't passed the allocation to the function call properly. Besides, currently if there is a load instruction and the parameter is passed as an interface before function call it works well. But the store instruction doesn't take effects. Not sure the fix is to respect the store instruction here as well.

@adonovan
Copy link
Member

(Apologies for my dismissive response. If Keith says it's a compiler bug, it's a compiler bug.)

@dmitshur dmitshur added this to the Go1.24 milestone Sep 13, 2024
@timothy-king timothy-king added the NeedsFix The path to resolution is known, but the work has not been done. label Sep 13, 2024
@cuonglm cuonglm self-assigned this Sep 18, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/614096 mentions this issue: cmd/compile: fix wrong esacpe analysis for rangefunc

@cherrymui cherrymui added the NextMinor A critical problem that affects the availability or correctness of production systems built using Go label Sep 18, 2024
@cuonglm
Copy link
Member

cuonglm commented Sep 18, 2024

This issue does not need a backport, #69511 will do the job.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/614195 mentions this issue: [release-branch.go1.23] cmd/compile: fix wrong esacpe analysis for rangefunc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. NextMinor A critical problem that affects the availability or correctness of production systems built using Go release-blocker
Projects
None yet
Development

No branches or pull requests

10 participants