Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Beam Core IO: WriteFiles stuck #21492

Closed
damccorm opened this issue Jun 4, 2022 · 9 comments · Fixed by #23300
Closed

Beam Core IO: WriteFiles stuck #21492

damccorm opened this issue Jun 4, 2022 · 9 comments · Fixed by #23300

Comments

@damccorm
Copy link
Contributor

damccorm commented Jun 4, 2022

In our pipeline, workers are sometimes stuck at step WriteShardsIntoTempFiles. We believe this happens when GCS fails to respond for more than 10 minutes and Google's VPC cuts the connection. The Java code will just hang forever because it's waiting on FIN or RST from the peer. 

Environment: Java 8, Beam 2.34/2.35, Dataflow runner.

See below for relevant information from logs, threads, netstat. 

Log snippet:


Operation ongoing in step Write Events To File/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles
for at least 09h15m00s without outputting or completing in state process
  at org.conscrypt.NativeCrypto.SSL_do_handshake(Native
Method)
  at org.conscrypt.NativeSsl.doHandshake(NativeSsl.java:392)
  at org.conscrypt.ConscryptFileDescriptorSocket.startHandshake(ConscryptFileDescriptorSocket.java:225)
 
at org.conscrypt.ConscryptFileDescriptorSocket.waitForHandshake(ConscryptFileDescriptorSocket.java:474)
 
at org.conscrypt.ConscryptFileDescriptorSocket.getOutputStream(ConscryptFileDescriptorSocket.java:461)
 
at sun.net.www.http.HttpClient.openServer(HttpClient.java:465)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
 
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
  at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
 
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
 
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
  at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
 
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
 
at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162)
  at
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:148)
  at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
 
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
  at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
 
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
 
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
 
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1955)
 
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1857)
 
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1926)
 
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:374)
  at
org.apache.beam.sdk.extensions.gcp.util.GcsUtil.create(GcsUtil.java:476)
  at org.apache.beam.sdk.extensions.gcp.util.GcsUtil.create(GcsUtil.java:446)
 
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.create(GcsFileSystem.java:127)
  at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.create(GcsFileSystem.java:60)
 
at org.apache.beam.sdk.io.FileSystems.create(FileSystems.java:239)
  at org.apache.beam.sdk.io.FileSystems.create(FileSystems.java:226)
 
at org.apache.beam.sdk.io.FileBasedSink$Writer.open(FileBasedSink.java:935)
  at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:768)
 
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown
Source)

 


Operation ongoing in step Write Events To File/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles
for at least 04h55m00s without outputting or completing in state process
  at sun.misc.Unsafe.park(Native
Method)
  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
  at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
 
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
  at com.google.cloud.hadoop.util.BaseAbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(BaseAbstractGoogleAsyncWriteChannel.java:247)
 
at com.google.cloud.hadoop.util.BaseAbstractGoogleAsyncWriteChannel.close(BaseAbstractGoogleAsyncWriteChannel.java:168)
 
at org.apache.beam.sdk.io.FileBasedSink$Writer.close(FileBasedSink.java:1029)
  at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:932)
 
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown
Source) 

 

Threads snippet:

 


"Thread-140" - Thread t@204
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native
Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
 
      at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920)
 
      at org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884)
 
      at org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706)
 
      at org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230)
        at
org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209)
        - locked
<27a37219> (a java.lang.Object)
        at org.conscrypt.ConscryptEngineSocket.waitForHandshake(ConscryptEngineSocket.java:547)
 
      at org.conscrypt.ConscryptEngineSocket.getOutputStream(ConscryptEngineSocket.java:290)
     
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:465)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
 
      - locked <1edf4d14> (a sun.net.www.protocol.https.HttpsClient)
        at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
 
      at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
        at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
 
      at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
   
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
     
  at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
 
      at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162)
 
      at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:148)
       
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
        at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
 
      at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
 
      at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
 
      at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
 
      at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1970)
 
      at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1864)
 
      at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1938)
 
      at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:411)
 
      at org.apache.beam.sdk.extensions.gcp.util.GcsUtil.create(GcsUtil.java:490)
        at org.apache.beam.sdk.extensions.gcp.util.GcsUtil.create(GcsUtil.java:471)
 
      at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.create(GcsFileSystem.java:144)
 
      at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.create(GcsFileSystem.java:62)
   
    at org.apache.beam.sdk.io.FileSystems.create(FileSystems.java:243)
        at org.apache.beam.sdk.io.FileSystems.create(FileSystems.java:230)
 
      at org.apache.beam.sdk.io.FileBasedSink$Writer.open(FileBasedSink.java:951)
        at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:921)
 
      at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown
Source)
        at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:232)
 
      at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:188)
 
      at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:339)
 
      at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
 
      at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
 
      at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:185)
 
      at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108)
 
      at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1058)
 
      at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner$$Lambda$1284/1132939800.output(Unknown
Source)
        at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:445)
 
      at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:130)
 
      at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1061)
 
      at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:771)
 
      at org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:97)
 
      at org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:43)
 
      at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121)
 
      at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
 
      at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:80)
 
      at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:137)
 
      at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
 
      at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
 
      at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:212)
 
      at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:163)
 
      at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92)
 
      at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1437)
 
      at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:165)
 
      at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1113)
 
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748) 
 Locked ownable synchronizers:
        - locked <4b11862a> (a java.util.concurrent.ThreadPoolExecutor$Worker

 

Netstat snippet:


tcp        1      0 our-pipeline:34942    internal-ip:443        CLOSE_WAIT 

 

Imported from Jira BEAM-14067. Original Jira may contain additional context.
Reported by: yianni.

@scwhittle
Copy link
Contributor

I believe this is google/conscrypt#864 and will be addressed once GoogleCloudDataproc/hadoop-connectors#865 is in a 2.2.x release and beam depedencies are so google_cloud_bigdataoss_version includes this version.

In the meantime this can be avoided by disabling conscrypt

@scwhittle
Copy link
Contributor

The fix should be in the 2.2.8 hadoop-connectors release, no ETA yet

@scwhittle
Copy link
Contributor

.take-issue

@cnauroth
Copy link

Hello @scwhittle . We have released GCS connector 2.2.8 containing the fix for Conscrypt socket hangs.

https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop3-2.2.8

@scwhittle
Copy link
Contributor

Thanks Chris! #23300 to update Beam

@scwhittle
Copy link
Contributor

.reopen-issue

@github-actions github-actions bot reopened this Oct 20, 2022
@scwhittle
Copy link
Contributor

Reopening as #23300 is being rolled back due to #23588

Workarounds in the meantime are to:

  • not enable the conscrypt security provider via enable_conscrypt_security_provider experiment
  • update the bigdataoss:gcsio library used manually to 2.2.8 (and not use use_grpc_for_gcs experiment)

@bvolpato
Copy link
Contributor

The reason for rollbacks here (use_grpc_for_gcs experiment) broke for other reasons, and I'm attempting an upgrade now to 2.2.16: #27679

@scwhittle
Copy link
Contributor

This should be fixed by upgrade to 2.2.16 or higher above

@github-actions github-actions bot added this to the 2.61.0 Release milestone Oct 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants