
HDFS table lock not working: broken downloads #325

Open
MattBlissett opened this issue Nov 14, 2023 · 5 comments

@MattBlissett
Member

Some downloads can fail around 06:00Z when the HDFS table build completes.

Error: java.io.IOException: java.io.FileNotFoundException: File does not exist: hdfs://ha-nn/user/hive/warehouse/prod_h.db/occurrence_multimedia/001455_0
	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:254)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:438)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://ha-nn/user/hive/warehouse/prod_h.db/occurrence_multimedia/001455_0
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1270)
	at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1262)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1262)
	at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
	at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:372)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSplit(ParquetRecordReaderWrapper.java:252)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:95)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:81)
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:72)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:252)
	... 8 more

There are a few of these from recent weeks, but it's not necessarily a new problem.

@muttcg muttcg self-assigned this Nov 14, 2023
@muttcg
Member

muttcg commented Nov 15, 2023

We rely on INSERT OVERWRITE, which takes an exclusive lock on the target table according to the documentation. I ran a simple test, and everything appears to work as we expect.

The lock is present when data is being inserted:

SHOW LOCKS uat.test_lock

tab_name mode
uat@test_lock EXCLUSIVE

SQL queries against that table wait for the lock to be released while the insert is running.
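
For anyone wanting to repeat this check from a script, here is a minimal sketch that parses SHOW LOCKS output and reports whether a table is exclusively locked. It assumes the output has been captured as plain text with whitespace-separated `tab_name` and `mode` columns, as in the sample above; the function names are hypothetical, and how the output is captured (beeline, JDBC, etc.) is left out.

```python
def parse_show_locks(output):
    """Return (tab_name, mode) pairs from whitespace-separated SHOW LOCKS rows."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Skip blank lines and the 'tab_name mode' header row.
        if len(parts) != 2 or parts[0] == "tab_name":
            continue
        rows.append((parts[0], parts[1]))
    return rows


def is_exclusively_locked(output, database, table):
    """True if any row shows an EXCLUSIVE lock on database@table."""
    target = f"{database}@{table}"
    return any(name == target and mode == "EXCLUSIVE"
               for name, mode in parse_show_locks(output))


# Sample text copied from the test above.
sample = """tab_name mode
uat@test_lock EXCLUSIVE
"""
print(is_exclusively_locked(sample, "uat", "test_lock"))
```

Polling this during the nightly build would show whether the exclusive lock is actually held on the production tables at the moment the files are swapped.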

@timrobertson100
Member

Thanks @muttcg

Do we know if the error is always related to the occurrence_multimedia table?
If so, perhaps there is something we're overlooking in locking behavior when using JOIN queries and replacing both tables - might be something to test too.

@muttcg
Member

muttcg commented Nov 15, 2023

@timrobertson100
Since YARN keeps only the last 5 days of logs (if I'm not mistaken), I couldn't find any more similar cases. I also tried simulating an insert into the multimedia table from another table, and it worked correctly, so there is no clear answer yet as to why it failed.

@timrobertson100
Member

timrobertson100 commented Nov 15, 2023

Thanks. I found this in Slack from 2 months ago, which confirms it is not just the multimedia table:

The download that failed this morning at 8:06 has error Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/hive/warehouse/prod_h.db/occurrence/003993_0
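
To answer the "which tables?" question more systematically next time, a small sketch like the following could tally distinct missing files per table from whatever log lines are still available. The regular expression is an assumption based only on the two message shapes seen in this thread (with and without the `hdfs://ha-nn` prefix), and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the path after "File does not exist:", with or without an hdfs://host prefix.
MISSING_FILE = re.compile(r"File does not exist: (?:hdfs://[^/\s]+)?(/\S+)")


def missing_files_by_table(log_lines):
    """Count distinct missing data files per table directory."""
    seen = set()
    counts = Counter()
    for line in log_lines:
        match = MISSING_FILE.search(line)
        if not match:
            continue
        path = match.group(1)
        if path in seen:
            # The same file shows up in both the "Error:" and "Caused by:" lines.
            continue
        seen.add(path)
        # The table is the parent directory of the data file.
        table = path.rsplit("/", 2)[-2]
        counts[table] += 1
    return counts


# Lines taken from the errors quoted in this issue.
lines = [
    "Error: java.io.IOException: java.io.FileNotFoundException: File does not exist: hdfs://ha-nn/user/hive/warehouse/prod_h.db/occurrence_multimedia/001455_0",
    "Caused by: java.io.FileNotFoundException: File does not exist: hdfs://ha-nn/user/hive/warehouse/prod_h.db/occurrence_multimedia/001455_0",
    "Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/hive/warehouse/prod_h.db/occurrence/003993_0",
]
print(missing_files_by_table(lines))
```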

@timrobertson100
Member

timrobertson100 commented Nov 16, 2023

I have just seen this when trying a clustering run. The clustering run is slightly different than a download in that it is doing a Spark SQL job, sourced from the Hive metastore. It could be that Spark SQL doesn't lock (or perhaps our environment is not configured to lock) the same way as the Oozie-launched MR jobs.

23/11/16 06:15:51 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-a7a68671-faf6-422e-b105-b98435477dbe
...
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/hive/warehouse/prod_h.db/occurrence/000852_0
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)

The nightly table build job appears to still be in the create-avro stage.

image

Edited after the build table completed to add:

The create-avro stage launches 2 child jobs, and it may be noteworthy that the first (INSERT OVERWRITE TABLE occ...occurrence_avro(Stage-1)) finished 21 secs before the error above, while the second (INSERT OVERWRITE TABLE occurrenc...mm_record(Stage-1)) did not start until a minute later.

Also likely relevant: the file that is missing in my query was actually created nearly an hour before the error, and before I submitted my clustering job. Presumably it was sitting in a job tmp directory and was moved into place as the MR job completed (hdfs mv should preserve the creation time rather than record the time of the move).

hdfs dfs -ls /user/hive/warehouse/prod_h.db/occurrence | grep "000852_0"
-rwxrwxrwt   3 hdfs hive   36353064 2023-11-16 05:23 /user/hive/warehouse/prod_h.db/occurrence/000852_0
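
One sequence of events that would be consistent with these observations can be sketched on a local filesystem. This is only an illustration under assumptions, not a confirmed diagnosis: it assumes the reader plans its work from a directory listing without holding a lock, that the old file is removed before the new one with the same name is moved in, and that a local rename behaves like the HDFS move with respect to timestamps. The directory names and the function name are hypothetical.

```python
import os
import tempfile
import time


def simulate_swap_gap():
    """Reader opens a planned file in the gap between delete and move."""
    with tempfile.TemporaryDirectory() as root:
        table_dir = os.path.join(root, "occurrence")
        staging_dir = os.path.join(root, "job_tmp")
        os.mkdir(table_dir)
        os.mkdir(staging_dir)
        live = os.path.join(table_dir, "000852_0")
        staged = os.path.join(staging_dir, "000852_0")

        with open(live, "w") as f:
            f.write("previous build")

        # The new build writes its output to a tmp dir well before the swap;
        # backdate it by an hour to stand in for the 05:23 creation time.
        with open(staged, "w") as f:
            f.write("new build")
        created = int(time.time()) - 3600
        os.utime(staged, (created, created))

        # Reader plans its work from a listing, holding no lock.
        planned = [os.path.join(table_dir, n) for n in os.listdir(table_dir)]

        # Overwrite, step 1: the old file is removed.
        os.remove(live)

        # Reader opens its planned files inside the gap.
        missing = []
        for path in planned:
            try:
                with open(path) as f:
                    f.read()
            except FileNotFoundError:
                missing.append(os.path.basename(path))

        # Overwrite, step 2: the staged file is moved into place.
        os.rename(staged, live)

        return {
            "missing_during_gap": missing,
            "exists_afterwards": os.path.exists(live),
            "mtime_is_creation_time": int(os.path.getmtime(live)) == created,
        }


print(simulate_swap_gap())
```

In this sketch the reader fails on a file that is present again moments later, and a listing afterwards shows a timestamp from before the failure, which is the same picture as the `hdfs dfs -ls` output above.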
