Skip to content

Commit

Permalink
Add a retry budget to flaky tests (linkerd#1872)
Browse files Browse the repository at this point in the history
A number of Linkerd's end-to-end tests are known to sometimes fail spuriously on CI (linkerd#1224, linkerd#1225, linkerd#1504, etc). If a test known to be flaky fails, the build is typically restarted, sometimes repeatedly if the test fails twice in a row. This is problematic for the following reasons:
+  This lengthens the CI feedback loop considerably, as we need to recompile Linkerd and re-run *all* the other tests, as well as re-running the failed test.
+ Furthermore, rerunning the whole test suite also creates an opportunity for *other* flaky tests to also fail, prolonging the process.
+ Spurious failures can decrease overall confidence in the test suite.
+ If a new contributor makes a minor change in a pull request and a flaky test fails, the new contributor typically doesn't know which test failures are spurious, and may assume that their changes somehow caused an unrelated part of the codebase to break. This is an uncomfortable experience, particularly if it takes a few hours for someone "in the know" to notice and restart the build.

This PR adds a retry budget to tests known to be flaky. It builds upon ScalaTest's [`Retries`](http://doc.scalatest.org/3.0.0/index.html#org.scalatest.Retries) trait, but adds the notion of a retry *budget*, rather than retrying tests only a single time. This is probably necessary as we've observed some of the flaky tests to fail spuriously multiple times in a row.

My rationale behind this is as follows: obviously we'd all like to write tests that never fail spuriously. With that said, it's proved very difficult to identify the cause behind these flaky test failures, so it doesn't seem like this problem will be fixed easily any time soon. Alternatively, we could consider skipping the known flaky tests entirely on CI.  However, a pretty large number of our E2E tests have been observed to fail spuriously at least occasionally, and I, for one, would be uncomfortable skipping such a large portion of the test suite on CI. Finally, note that the retry budget behaviour will result in more or less the same outcome as our current manual restarting of failed flaky tests, but will shorten the feedback loop significantly, and won't result in broken builds due to flaky test failures.

Fixes linkerd#1225. Fixes linkerd#1504.

Signed-off-by: Eliza Weisman <[email protected]>
  • Loading branch information
hawkw authored Mar 21, 2018
1 parent b8efcd4 commit 863fc93
Show file tree
Hide file tree
Showing 8 changed files with 92 additions and 57 deletions.
Original file line number Diff line number Diff line change
@@ -1,8 +1,9 @@
package io.buoyant.grpc.interop

import com.twitter.util.{Future, Return, Throw, Try}
import com.twitter.util.Future
import grpc.{testing => pb}
import io.buoyant.test.FunSuite
import org.scalatest.tagobjects.Retryable

trait InteropTestBase { _: FunSuite =>

Expand Down Expand Up @@ -32,13 +33,13 @@ trait InteropTestBase { _: FunSuite =>
if (only.nonEmpty && !only(name)) ignore(name) {}
else todo.get(name) match {
case Some(msg) =>
test(name) {
test(name, Retryable) {
assertThrows[Throwable](await(withClient(run)))
cancel(s"TODO: $msg")
}

case None =>
test(name) { await(withClient(run)) }
test(name, Retryable) { await(withClient(run)) }
}

}
Expand Down
Original file line number Diff line number Diff line change
@@ -1,13 +1,13 @@
package io.buoyant.grpc.interop

import java.net.InetSocketAddress
import com.twitter.conversions.storage._
import com.twitter.finagle.buoyant.{H2, h2}
import com.twitter.util.Future
import grpc.{testing => pb}
import io.buoyant.test.FunSuite
import java.net.InetSocketAddress
import io.buoyant.test.{BudgetedRetries, FunSuite}

class NetworkedInteropTest extends FunSuite with InteropTestBase {
class NetworkedInteropTest extends FunSuite with InteropTestBase with BudgetedRetries {

// override def only = Set("large_unary")

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -15,13 +15,15 @@ import io.buoyant.linkerd._
import io.buoyant.linkerd.usage.UsageMessage
import io.buoyant.namer.{NamerInitializer, TestNamerInitializer}
import io.buoyant.telemetry.MetricsTree
import io.buoyant.test.{Awaits, FunSuite}
import io.buoyant.test.{Awaits, BudgetedRetries, FunSuite}
import java.net.InetSocketAddress
import java.text.SimpleDateFormat
import java.util.Date
import org.scalatest.tagobjects.Retryable
import scala.util.Try

class UsageDataTelemeterEndToEndTest extends FunSuite with Awaits {
class UsageDataTelemeterEndToEndTest extends FunSuite with Awaits with BudgetedRetries {

case class Downstream(name: String, server: ListeningServer, service: Service[Request, Response]) {
val address = server.boundAddress.asInstanceOf[InetSocketAddress]
val port = address.getPort
Expand All @@ -43,7 +45,7 @@ class UsageDataTelemeterEndToEndTest extends FunSuite with Awaits {
namers: Seq[NamerInitializer] = Seq(TestNamerInitializer)
) = Linker.Initializers(protocol = protos, namer = namers)

test("telemeter sends metrics") {
test("telemeter sends metrics", Retryable) {
val promise = new Promise[UsageMessage]
val proxy = Downstream.mk("proxy") { r =>
val Buf.ByteBuffer.Owned(bb) = Buf.ByteBuffer.coerce(r.content)
Expand Down
Original file line number Diff line number Diff line change
@@ -1,24 +1,28 @@
package io.buoyant.linkerd
package protocol

import com.twitter.conversions.time._
import com.twitter.finagle.{Http => FinagleHttp, Status => _, http => _, _}
import java.io.File
import java.net.InetSocketAddress
import com.twitter.finagle.buoyant.linkerd.Headers
import com.twitter.finagle.http.{param => _, _}
import com.twitter.finagle.http.Method._
import com.twitter.finagle.http.{param => _, _}
import com.twitter.finagle.stats.{InMemoryStatsReceiver, NullStatsReceiver}
import com.twitter.finagle.tracing.{Annotation, BufferingTracer, NullTracer}
import com.twitter.finagle.{Http => FinagleHttp, Status => _, http => _, _}
import com.twitter.util._
import io.buoyant.router.{Http, RoutingFactory}
import io.buoyant.router.http.MethodAndHostIdentifier
import io.buoyant.test.Awaits
import java.io.File
import java.net.InetSocketAddress
import io.buoyant.test.{Awaits, BudgetedRetries}
import org.scalatest.tagobjects.Retryable
import org.scalatest.{FunSuite, MustMatchers, OptionValues}
import scala.io.Source
import scala.util.Random

class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with OptionValues {
class HttpEndToEndTest
extends FunSuite
with Awaits
with MustMatchers
with OptionValues
with BudgetedRetries {


case class Downstream(name: String, server: ListeningServer) {
val address = server.boundAddress.asInstanceOf[InetSocketAddress]
Expand Down Expand Up @@ -79,7 +83,7 @@ class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with Optio
case Annotation.Message(m) if Seq("l5d.retryable", "l5d.failure").contains(m) => m
}

test("linking") {
test("linking", Retryable) {
val stats = NullStatsReceiver
val tracer = new BufferingTracer
def withAnnotations(f: Seq[Annotation] => Unit): Unit = {
Expand Down Expand Up @@ -150,7 +154,7 @@ class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with Optio
}


test("marks 5XX as failure by default") {
test("marks 5XX as failure by default", Retryable) {
val stats = new InMemoryStatsReceiver
val tracer = NullTracer

Expand Down Expand Up @@ -207,7 +211,7 @@ class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with Optio
}
}

test("marks exceptions as failure by default") {
test("marks exceptions as failure by default", Retryable) {
val stats = new InMemoryStatsReceiver
val tracer = NullTracer

Expand Down Expand Up @@ -349,22 +353,22 @@ class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with Optio
}
}

test("retries retryableIdempotent5XX") {
test("retries retryableIdempotent5XX", Retryable) {
retryTest("io.l5d.http.retryableIdempotent5XX", idempotentMethods)
}

test("retries retryablRead5XX") {
test("retries retryablRead5XX", Retryable) {
retryTest("io.l5d.http.retryableRead5XX", readMethods)
}

test("retries nonRetryable5XX") {
test("retries nonRetryable5XX", Retryable) {
retryTest("io.l5d.http.nonRetryable5XX", Set.empty)
}

val dtabReadHeaders = Seq("l5d-dtab", "l5d-ctx-dtab")
val dtabWriteHeader = "l5d-ctx-dtab"

for (readHeader <- dtabReadHeaders) test(s"dtab read from $readHeader header") {
for (readHeader <- dtabReadHeaders) test(s"dtab read from $readHeader header", Retryable) {
val stats = NullStatsReceiver
val tracer = new BufferingTracer

Expand Down Expand Up @@ -397,7 +401,7 @@ class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with Optio
assert(!headers.contains("dtab-local"))
}

test("dtab-local header is ignored") {
test("dtab-local header is ignored", Retryable) {
val stats = NullStatsReceiver
val tracer = new BufferingTracer

Expand Down Expand Up @@ -427,7 +431,7 @@ class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with Optio
assert(!headers.contains(dtabWriteHeader))
}

test("with clearContext") {
test("with clearContext", Retryable) {
val downstream = Downstream.mk("dog") { req =>
val rsp = Response()
rsp.contentString = req.headerMap.collect {
Expand Down Expand Up @@ -472,7 +476,7 @@ class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with Optio
))
}

test("clearContext will remove linkerd error headers and body") {
test("clearContext will remove linkerd error headers and body", Retryable) {
val yaml =
s"""|routers:
|- protocol: http
Expand Down Expand Up @@ -500,7 +504,7 @@ class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with Optio

}

test("timestampHeader adds header") {
test("timestampHeader adds header", Retryable) {
@volatile var headers: Option[HeaderMap] = None
val downstream = Downstream.mk("test") {
req =>
Expand Down Expand Up @@ -539,7 +543,7 @@ class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with Optio
}
}

test("no timestampHeader does not add timestamp header") {
test("no timestampHeader does not add timestamp header", Retryable) {
@volatile var headers: Option[HeaderMap] = None
val downstream = Downstream.mk("test") {
req =>
Expand Down Expand Up @@ -576,7 +580,7 @@ class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with Optio
}
}

test("without clearContext") {
test("without clearContext", Retryable) {
val downstream = Downstream.mk("dog") { req =>
val rsp = Response()
rsp.contentString = req.headerMap.collect {
Expand Down Expand Up @@ -623,7 +627,7 @@ class HttpEndToEndTest extends FunSuite with Awaits with MustMatchers with Optio
assert(headers.get("l5d-ctx-dtab") == Some(localDtab))
}

test("logs to correct files") {
test("logs to correct files", Retryable) {
val downstream = Downstream.mk("test") {
req =>
val rsp = Response()
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,22 +4,20 @@ import java.io.{File, InputStream}
import java.net.InetSocketAddress
import java.nio.file.StandardCopyOption.REPLACE_EXISTING
import java.nio.file.{Files, Paths}

import com.twitter.finagle._
import com.twitter.finagle.buoyant.TlsClientConfig
import com.twitter.finagle.http._
import com.twitter.finagle.{Http => FinagleHttp, Status => _, http => _, _}
import com.twitter.finagle.http.{param => _, _}
import com.twitter.finagle.http.{Method, Status}
import com.twitter.finagle.http.{Method, Status, param => _, _}
import com.twitter.finagle.ssl.client.SslClientConfiguration
import com.twitter.finagle.ssl.server.SslServerConfiguration
import com.twitter.finagle.ssl.{ClientAuth, KeyCredentials, TrustCredentials}
import com.twitter.finagle.transport.Transport
import com.twitter.finagle.{Http => FinagleHttp, Status => _, http => _, _}
import com.twitter.util._
import io.buoyant.config.Parser
import io.buoyant.test.FunSuite
import io.buoyant.test.{BudgetedRetries, FunSuite}
import org.scalatest.tagobjects.Retryable


class HttpTlsEndToEndTest extends FunSuite {
class HttpTlsEndToEndTest extends FunSuite with BudgetedRetries {

private[this] def loadResource(p: String): InputStream =
getClass.getResourceAsStream(p)
Expand All @@ -36,7 +34,7 @@ class HttpTlsEndToEndTest extends FunSuite {
Future.value(Response())
}

test("client/server works with TLS") {
test("client/server works with TLS", Retryable) {
val srv = {
val srvCert = loadPem("linkerd-tls-e2e-cert")
val srvKey = loadPem("linkerd-tls-e2e-key")
Expand Down
Original file line number Diff line number Diff line change
@@ -1,21 +1,23 @@
package io.buoyant.router
package h2

import java.net.InetSocketAddress
import com.twitter.concurrent.AsyncQueue
import com.twitter.finagle.{ChannelClosedException, Dtab, Failure, Path}
import com.twitter.finagle.buoyant.Dst
import com.twitter.finagle.buoyant.h2._
import com.twitter.finagle.{ChannelClosedException, Dtab, Failure, Path}
import com.twitter.logging.Level
import com.twitter.util.{Future, Promise, Throw}
import io.buoyant.router.h2.ClassifiedRetries.BufferSize
import io.buoyant.test.FunSuite
import java.net.InetSocketAddress
import io.buoyant.test.{BudgetedRetries, FunSuite}
import org.scalatest.tagobjects.Retryable

class RouterEndToEndTest
extends FunSuite
with ClientServerHelpers {
with ClientServerHelpers
with BudgetedRetries {

test("simple prior knowledge") {
test("simple prior knowledge", Retryable) {
val cat = Downstream.const("cat", "meow")
val dog = Downstream.const("dog", "woof")
val dtab = Dtab.read(s"""
Expand Down Expand Up @@ -47,7 +49,7 @@ class RouterEndToEndTest
}
}

test("fails requests with connection-headers") {
test("fails requests with connection-headers", Retryable) {
val dog = Downstream.const("dog", "woof")
val dtab = Dtab.read(s"""
/p/dog => /$$/inet/127.1/${dog.port} ;
Expand All @@ -73,7 +75,7 @@ class RouterEndToEndTest
}
}

test("resets downstream on upstream cancelation") {
test("resets downstream on upstream cancelation", Retryable) {
val dogReqP = new Promise[Stream]
val dogRspP = new Promise[Stream]
@volatile var serverInterrupted: Option[Throwable] = None
Expand Down Expand Up @@ -116,7 +118,7 @@ class RouterEndToEndTest
}
}

test("resets downstream on upstream disconnect") {
test("resets downstream on upstream disconnect", Retryable) {
val dogReqP = new Promise[Stream]
val dogRspP = new Promise[Stream]
val dog = Downstream.service("dog") { req =>
Expand Down Expand Up @@ -164,7 +166,7 @@ class RouterEndToEndTest
}
}

test("resets upstream on downstream failure") {
test("resets upstream on downstream failure", Retryable) {
val dogReqP = new Promise[Stream]
val dogRspP = new Promise[Stream]
val dog = Downstream.service("dog") { req =>
Expand Down
15 changes: 8 additions & 7 deletions router/mux/src/e2e/scala/io/buoyant/router/MuxEndToEndTest.scala
Original file line number Diff line number Diff line change
@@ -1,16 +1,17 @@
package io.buoyant.router

import com.twitter.conversions.time._
import com.twitter.finagle.{Mux=>FinagleMux, _}
import java.net.InetSocketAddress
import com.twitter.finagle.stats.NullStatsReceiver
import com.twitter.finagle.tracing.{BufferingTracer, NullTracer}
import com.twitter.finagle.{Mux => FinagleMux, _}
import com.twitter.io.Buf
import com.twitter.util._
import io.buoyant.test.Awaits
import java.net.InetSocketAddress
import io.buoyant.test.{Awaits, BudgetedRetries}
import org.scalatest.FunSuite
import org.scalatest.tagobjects.Retryable

class MuxEndToEndTest extends FunSuite with Awaits with BudgetedRetries {

class MuxEndToEndTest extends FunSuite with Awaits {

/*
* A bunch of utility/setup. The test is configured as follows:
Expand Down Expand Up @@ -86,15 +87,15 @@ class MuxEndToEndTest extends FunSuite with Awaits {
}

// sanity check without router
test("client-server no routing") {
test("client-server no routing", Retryable) {
val horse = Downstream.const("horse", "neigh")
val client = Upstream(horse)
assert(client("mred") == "neigh")
await(horse.server.close())
await(client.client.close())
}

test("end-to-end mux routing") {
test("end-to-end mux routing", Retryable) {
// downstream services
val cat = Downstream.const("cat", "meow")
val dog = Downstream.const("dog", "woof")
Expand Down
27 changes: 27 additions & 0 deletions test-util/src/main/scala/io/buoyant/test/BudgetedRetries.scala
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
package io.buoyant.test

import org.scalatest.{Canceled, Failed, Outcome, Retries}

/**
* Mixin trait for tests to support a retry budget.
*/
trait BudgetedRetries extends FunSuite with Retries {

/**
* The number of retries permitted before a test is failed.
*
* Tests that mix in `BudgetedRetries`
*/
def retries = 4

override def withFixture(test: NoArgTest) =
if (isRetryable(test)) withRetries(test, retries)
else super.withFixture(test)

private[this] def withRetries (test: NoArgTest, remaining: Int): Outcome =
super.withFixture(test) match {
case Failed(_) | Canceled(_) if remaining == 1 => super.withFixture(test)
case Failed(_) | Canceled(_) => withRetries(test, remaining - 1)
case other => other
}
}

0 comments on commit 863fc93

Please sign in to comment.