Alluxio metadata is inconsistent #18716

Open
YaAYadeer opened this issue Dec 19, 2024 · 2 comments
Labels
type-bug This issue is about a bug

YaAYadeer commented Dec 19, 2024

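Problem scenario: inconsistent metadata

  1. A PySpark job uses some temporary directories. While the job was running, deleting a temporary directory failed: Alluxio reported that a temporary file did not exist, but the file was present in the underlying HDFS. Because of this metadata inconsistency, the directory could not be deleted: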
java.io.IOException: alluxio.exception.DirectoryNotEmptyException: Failed to delete 3 paths from the under file system: 

/kdl/pre-kde/C4CA4238A0B923820DCC509A6F75849B/hive/idcp_prod/driving_score_rcd_datatx_merge/_temporary/0/_temporary (UFS dir not in sync. Sync UFS, or delete with unchecked flag.), 
/kdl/pre-kde/C4CA4238A0B923820DCC509A6F75849B/hive/idcp_prod/driving_score_rcd_datatx_merge/_temporary/0 (Directory not empty), 
/kdl/pre-kde/C4CA4238A0B923820DCC509A6F75849B/hive/idcp_prod/driving_score_rcd_datatx_merge/_temporary (Directory not empty)
Underlying HDFS:
hdfs dfs -ls  /pre-kcde/C4CA4238A0B923820DCC509A6F75849B/hive/idcp_prod/driving_score_rcd_datatx_merge/_temporary/0/_temporary/attempt_202412171250252523373634165305884_0002_m_008133_8137/date1=20241215 |more

-rw-r--r--   3 alluxio       3659 2024-12-17 13:05 /pre-kcde/C4CA4238A0B923820DCC509A6F75849B/hive/idcp_prod/driving_score_rcd_datatx_merge/_temporary/0/_temporary/attempt_202412171250252523373634165305884_0002_m_008133_8137/date1=20241215/part-08133-0855a9dc-b3a2-411b-ad4c-cdbf6a3349fc.c000.snappy.orc
Cause of the inconsistency:
alluxio fs checkConsistency  /kdl/pre-kde/C4CA4238A0B923820DCC509A6F75849B/hive/idcp_prod/driving_score_rcd_datatx_merge/_temporary/0/_temporary/attempt_202412171250252523373634165305884_0002_m_008133_8137/date1=20241215/part-08133-0855a9dc-b3a2-411b-ad4c-cdbf6a3349fc.c000.snappy.orc
Path "/kdl/pre-kde/C4CA4238A0B923820DCC509A6F75849B/hive/idcp_prod/driving_score_rcd_datatx_merge/_temporary/0/_temporary/attempt_202412171250252523373634165305884_0002_m_008133_8137/date1=20241215/part-08133-0855a9dc-b3a2-411b-ad4c-cdbf6a3349fc.c000.snappy.orc" does not exist.
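When triaging a failure like the one above, it can help to separate the paths Alluxio flags as out of sync with the UFS from the parents that merely report "Directory not empty". The following is a minimal sketch, assuming the exception text keeps the `<path> (<reason>)` layout shown in this issue; the function names and the shortened sample paths are hypothetical, not part of Alluxio.

```python
import re
from typing import List, Tuple

# Each failed path is reported as "<path> (<reason>)".
# Assumption: paths contain no spaces and reasons contain no parentheses,
# which holds for the message quoted in this issue.
_ENTRY = re.compile(r"(/\S+) \(([^()]*)\)")


def parse_delete_failures(message: str) -> List[Tuple[str, str]]:
    """Return (path, reason) pairs found in the exception message."""
    return [(m.group(1), m.group(2)) for m in _ENTRY.finditer(message)]


def out_of_sync_paths(message: str) -> List[str]:
    """Return only the paths whose reason mentions a UFS sync problem."""
    return [path for path, reason in parse_delete_failures(message)
            if "not in sync" in reason]


if __name__ == "__main__":
    # Hypothetical, shortened sample in the same shape as the real error.
    sample = (
        "Failed to delete 3 paths from the under file system: \n\n"
        "/t/_temporary/0/_temporary (UFS dir not in sync. Sync UFS, or delete with unchecked flag.), \n"
        "/t/_temporary/0 (Directory not empty), \n"
        "/t/_temporary (Directory not empty)"
    )
    for path in out_of_sync_paths(sample):
        print(path)
```

In the error quoted above, only the innermost `_temporary` directory is reported as out of sync; the two parent directories fail only because that child could not be removed.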
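  2. Run a PySpark job (submitted in YARN mode) that writes data to a Hive table (a partitioned Hive table with several small files under a single partition). The job addresses its data source through an Alluxio path, and Alluxio mounts HDFS as its under storage. After the job finished, the owner of the files written through PySpark was inconsistent: on HDFS the file owner is alluxio, while Alluxio reports the owner as yarn.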
    image (1)

PySpark job

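```python
import random
import string

from pyspark.sql import SparkSession

# The session setup and the helper below were not included in the original
# post; they are assumed here so that the script can run on its own.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()


def generate_fixed_length_string(length):
    """Return a random ASCII-letter string of the given length (assumed helper)."""
    return "".join(random.choices(string.ascii_letters, k=length))


spark.sql("""
CREATE TABLE IF NOT EXISTS tmx_hive1.day_table_orc12(
id INT,
content STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC
""")

rows_per_file = 2
total_files = 6
total_rows = rows_per_file * total_files

# All rows go to the single partition dt=20241213.
data = [(i, generate_fixed_length_string(100), "20241213") for i in range(total_rows)]
columns = ["id", "content", "dt"]
df = spark.createDataFrame(data, columns)
# print(f"Original number of partitions: {df.rdd.getNumPartitions()}")

# Repartition by id so that the partition is written as several small files.
df_repartitioned = df.repartition(total_files, "id")

df_repartitioned.write \
  .mode("append") \
  .format("hive") \
  .partitionBy("dt") \
  .saveAsTable("tmx_hive1.day_table_orc12")

spark.stop()
```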

Alluxio logs while the PySpark job was running

master-rpc-executor-TPE-thread-159
  • Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/kcde/hdfs/tmx/C20AD4D76FE97759AA27A0C99BFF6710/hive/tmx_hive1/day_table_orc12/dt=20241213, desiredLockPattern=READ, shouldSync={Should sync: false, Last sync time: 1734508706428}}, descendantType=NONE, commonOptions=syncIntervalMs: -1
    ttl: -1
    ttlAction: DELETE
    , forceSync=true} because it does not exist on the UFS or in Alluxio

2024-12-18 16:32:02,973 INFO
master-rpc-executor-TPE-thread-169
  • Updating inode 'part-00001-e18c745a-1855-4efb-832c-16ec0f0985e9.c000' mode bits from rw-r--r-- to rwxrwxrwx

2024-12-18 16:32:02,980 INFO
master-rpc-executor-TPE-thread-85
  • Updating inode 'part-00002-e18c745a-1855-4efb-832c-16ec0f0985e9.c000' mode bits from rw-r--r-- to rwxrwxrwx

2024-12-18 16:32:02,985 INFO
master-rpc-executor-TPE-thread-187
  • Updating inode 'part-00003-e18c745a-1855-4efb-832c-16ec0f0985e9.c000' mode bits from rw-r--r-- to rwxrwxrwx

2024-12-18 16:32:02,990 INFO
master-rpc-executor-TPE-thread-183
  • Updating inode 'part-00005-e18c745a-1855-4efb-832c-16ec0f0985e9.c000' mode bits from rw-r--r-- to rwxrwxrwx

2024-12-18 16:32:04,670 WARN
master-rpc-executor-TPE-thread-92
  • Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/kdl/kcde/spark/spark3-history/0/application_1731383355780_0129_1, desiredLockPattern=READ, shouldSync={Should sync: false, Last sync time: 0}}, descendantType=NONE, commonOptions=syncIntervalMs: -1
    ttl: -1
    ttlAction: DELETE
    , forceSync=true} because it does not exist on the UFS or in Alluxio
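To see at a glance which files had their mode bits rewritten and which paths failed to sync, the master log excerpt can be summarized with two regular expressions. This is a rough sketch that assumes the message wording shown in the excerpt above; the function names and the sample text are hypothetical, and other Alluxio versions may word these messages differently.

```python
import re
from typing import List, Tuple

# Matches: Updating inode '<name>' mode bits from <old> to <new>
_MODE_CHANGE = re.compile(
    r"Updating inode '([^']+)' mode bits from (\S+) to (\S+)"
)

# Matches the path inside: Failed to sync metadata on root path
# InodeSyncStream{rootPath=LockingScheme{path=<path>, ...
_SYNC_FAILURE = re.compile(
    r"Failed to sync metadata on root path "
    r"InodeSyncStream\{rootPath=LockingScheme\{path=([^,]+),"
)


def mode_changes(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (inode name, old mode, new mode) for each mode-bit update."""
    return [m.groups() for m in _MODE_CHANGE.finditer(log_text)]


def failed_sync_paths(log_text: str) -> List[str]:
    """Return the Alluxio paths for which a metadata sync failure was logged."""
    return _SYNC_FAILURE.findall(log_text)


if __name__ == "__main__":
    # Hypothetical, shortened sample in the same shape as the real log.
    sample = (
        "Failed to sync metadata on root path InodeSyncStream{rootPath="
        "LockingScheme{path=/t/dt=20241213, desiredLockPattern=READ, ...\n"
        "Updating inode 'part-00001.c000' mode bits from rw-r--r-- to rwxrwxrwx\n"
    )
    print(failed_sync_paths(sample))
    print(mode_changes(sample))
```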
YaAYadeer added the type-bug label Dec 19, 2024
YaAYadeer (Author) commented

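Comparison of metadata before and after the repair

  • The file was repaired by manually triggering a metadata sync with the alluxio fs ls command.
  • In the image, the left side is before the repair and the right side is after.
    • Red boxes mark block-related fields and black boxes mark permission-related fields. The remaining differences concern the file ID, modification time, cache information and so on. inAlluxioPercentage=100 means the file's data has been cached in Alluxio.
    • Comparing the two, the owner is the only field that is inconsistent before and after the repair.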
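The manual repair described above can be scripted. The sketch below only builds and runs an `alluxio fs ls` command. It assumes the `alluxio` CLI is on `PATH`, and it passes `-Dalluxio.user.file.metadata.sync.interval=0` to request a sync on that call. That property is taken from the Alluxio 2.x documentation and is not mentioned in the issue, so check that it applies to your version. The function names are hypothetical.

```python
import subprocess
from typing import List

# Assumption: a sync interval of 0 asks Alluxio to sync metadata with the
# UFS on this call. The issue itself only mentions a plain "alluxio fs ls".
SYNC_NOW = "-Dalluxio.user.file.metadata.sync.interval=0"


def build_sync_command(path: str, recursive: bool = True) -> List[str]:
    """Build an 'alluxio fs ls' command that should trigger a metadata sync."""
    cmd = ["alluxio", "fs", "ls"]
    if recursive:
        cmd.append("-R")
    cmd += [SYNC_NOW, path]
    return cmd


def sync_metadata(path: str, recursive: bool = True) -> subprocess.CompletedProcess:
    """Run the command and return the completed process without raising."""
    return subprocess.run(
        build_sync_command(path, recursive),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(build_sync_command("/path/in/alluxio")))
```

Building the command separately from running it keeps the argument list easy to inspect or log before anything touches the cluster.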

download_image

YaAYadeer (Author) commented

@YichuanSun
