More than 100% data in Alluxio #18577

Open
ziyangRen opened this issue Apr 17, 2024 · 17 comments
Labels
type-bug This issue is about a bug

Comments

@ziyangRen

Alluxio Version:
2.9.3

Describe the bug
The error log I observed in Spark was:
Protocol message tag had invalid wire type.
The error log I observed in Trino was:

Error opening Hive split alluxio:/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000 (offset=67108864, length=67108864): Incorrect file size (270589145) for file (end of stream not reached): alluxio:/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000

I checked this file's information in Alluxio (see below). The file's in-Alluxio percentage is more than 100%: in Alluxio it is made up of two 256 MB blocks, while its size in HDFS is 258.1 MB. Note that the data in HDFS was written through Alluxio; I use alluxio.user.file.metadata.sync.interval=216000000 and alluxio.user.file.writetype.default=CACHE_THROUGH.
When I switched the table's metadata back to HDFS, the job worked, meaning the HDFS data is fine and it is Alluxio's data that causes the problem.

Here's how this file looks in alluxio:

bin/alluxio fs stat /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000
/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000 is a file path.
URIStatus{info=FileInfo{fileId=164846448410623, name=part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000, path=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000, ufsPath=hdfs://intsig-bigdata-nameservice/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000, length=270589145, blockSizeBytes=268435456, creationTimeMs=1713296722044, completed=true, folder=false, pinned=false, pinnedlocation=[], cacheable=true, persisted=true, blockIds=[164846431633408, 164846431633409], inMemoryPercentage=198, lastModificationTimesMs=1713296987573, ttl=-1, lastAccessTimesMs=1713296987573, ttlAction=FREE, owner=core_adm, group=hive, mode=440, persistenceState=PERSISTED, mountPoint=false, replicationMax=-1, replicationMin=0, fileBlockInfos=[FileBlockInfo{blockInfo=BlockInfo{id=164846431633408, length=268435456, locations=[BlockLocation{workerId=4789052418356342161, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-106.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-106.intsig.internal, rack=null)}, tierAlias=MEM, mediumType=MEM}, BlockLocation{workerId=7686509082663303020, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-136.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-136.intsig.internal, rack=null)}, tierAlias=MEM, mediumType=MEM}, BlockLocation{workerId=8389319337060975339, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-139.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-139.intsig.internal, rack=null)}, tierAlias=MEM, mediumType=MEM}, BlockLocation{workerId=252882622198499095, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-112.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-112.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=318529424824982519, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-115.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-115.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=751232389065403260, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-99.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-99.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=884009842569883069, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-130.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-130.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=1322556044732826109, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-108.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, 
tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-108.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=1659740148412970156, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-103.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-103.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=1750661044104421740, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-129.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-129.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2212130573833630043, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-96.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-96.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2323607761968839547, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-117.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-117.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2537207964641668843, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-105.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-105.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2915095790252411072, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-97.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-97.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=3097974485333619764, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-101.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-101.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=3571195479121079143, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-95.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-95.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=5001574657684482277, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-94.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-94.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=6859183490568364892, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-107.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-107.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, 
BlockLocation{workerId=7009847752086549463, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-111.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-111.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7273818547467537089, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-110.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-110.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7438420668927043862, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-116.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-116.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7504197730422955713, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-104.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-104.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8039274055332598606, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-113.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-113.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8379003440035830949, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-100.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-100.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8964743355741175974, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-119.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-119.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=9006306898128303596, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-114.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-114.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}]}, offset=0, ufsLocations=[]}, FileBlockInfo{blockInfo=BlockInfo{id=164846431633409, length=268435456, locations=[BlockLocation{workerId=4789052418356342161, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-106.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-106.intsig.internal, rack=null)}, tierAlias=MEM, mediumType=MEM}, BlockLocation{workerId=8389319337060975339, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-139.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-139.intsig.internal, rack=null)}, tierAlias=MEM, mediumType=MEM}, 
BlockLocation{workerId=252882622198499095, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-112.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-112.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=318529424824982519, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-115.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-115.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=751232389065403260, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-99.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-99.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=884009842569883069, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-130.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-130.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=1322556044732826109, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-108.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-108.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=1659740148412970156, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-103.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-103.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2212130573833630043, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-96.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-96.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2323607761968839547, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-117.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-117.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2537207964641668843, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-105.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-105.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2915095790252411072, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-97.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-97.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=3097974485333619764, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-101.intsig.internal, containerHost=, 
rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-101.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=3571195479121079143, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-95.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-95.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=5001574657684482277, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-94.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-94.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=6859183490568364892, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-107.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-107.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7009847752086549463, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-111.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-111.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7273818547467537089, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-110.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-110.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7438420668927043862, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-116.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-116.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7504197730422955713, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-104.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-104.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7686509082663303020, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-136.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-136.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8039274055332598606, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-113.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-113.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8379003440035830949, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-100.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, 
tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-100.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8964743355741175974, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-119.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-119.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=9006306898128303596, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-114.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-114.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}]}, offset=268435456, ufsLocations=[]}], mountId=1, inAlluxioPercentage=198, ufsFingerprint=TYPE|FILE UFS|hdfs OWNER|prod_tong_liu GROUP|prod_tong_liu MODE|432 CONTENT_HASH|(len:270589145,_modtime:1713296987554) , acl=user::rw-,group::rwx,other::---,group:intsig:r-x,group:prod:rwx,mask::rw-, defaultAcl=}, cacheContext=null}
Containing the following blocks: 
BlockInfo{id=164846431633408, length=268435456, locations=[BlockLocation{workerId=4789052418356342161, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-106.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-106.intsig.internal, rack=null)}, tierAlias=MEM, mediumType=MEM}, BlockLocation{workerId=7686509082663303020, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-136.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-136.intsig.internal, rack=null)}, tierAlias=MEM, mediumType=MEM}, BlockLocation{workerId=8389319337060975339, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-139.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-139.intsig.internal, rack=null)}, tierAlias=MEM, mediumType=MEM}, BlockLocation{workerId=252882622198499095, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-112.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-112.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=318529424824982519, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-115.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-115.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=751232389065403260, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-99.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-99.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=884009842569883069, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-130.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-130.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=1322556044732826109, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-108.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-108.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=1659740148412970156, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-103.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-103.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=1750661044104421740, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-129.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-129.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2212130573833630043, 
address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-96.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-96.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2323607761968839547, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-117.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-117.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2537207964641668843, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-105.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-105.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2915095790252411072, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-97.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-97.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=3097974485333619764, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-101.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-101.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=3571195479121079143, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-95.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-95.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=5001574657684482277, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-94.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-94.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=6859183490568364892, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-107.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-107.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7009847752086549463, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-111.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-111.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7273818547467537089, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-110.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-110.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7438420668927043862, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-116.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, 
domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-116.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7504197730422955713, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-104.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-104.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8039274055332598606, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-113.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-113.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8379003440035830949, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-100.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-100.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8964743355741175974, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-119.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-119.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=9006306898128303596, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-114.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-114.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}]}
BlockInfo{id=164846431633409, length=268435456, locations=[BlockLocation{workerId=4789052418356342161, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-106.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-106.intsig.internal, rack=null)}, tierAlias=MEM, mediumType=MEM}, BlockLocation{workerId=8389319337060975339, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-139.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-139.intsig.internal, rack=null)}, tierAlias=MEM, mediumType=MEM}, BlockLocation{workerId=252882622198499095, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-112.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-112.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=318529424824982519, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-115.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-115.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=751232389065403260, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-99.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-99.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=884009842569883069, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-130.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-130.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=1322556044732826109, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-108.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-108.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=1659740148412970156, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-103.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-103.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2212130573833630043, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-96.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-96.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2323607761968839547, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-117.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-117.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2537207964641668843, 
address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-105.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-105.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=2915095790252411072, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-97.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-97.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=3097974485333619764, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-101.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-101.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=3571195479121079143, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-95.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-95.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=5001574657684482277, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-94.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-94.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=6859183490568364892, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-107.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-107.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7009847752086549463, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-111.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-111.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7273818547467537089, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-110.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-110.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7438420668927043862, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-116.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-116.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7504197730422955713, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-104.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-104.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=7686509082663303020, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-136.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, 
domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-136.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8039274055332598606, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-113.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-113.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8379003440035830949, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-100.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-100.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=8964743355741175974, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-119.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-119.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}, BlockLocation{workerId=9006306898128303596, address=WorkerNetAddress{host=centos-bigdata-datanode-10-24-2-114.intsig.internal, containerHost=, rpcPort=29999, dataPort=29999, webPort=30010, domainSocketPath=, tieredIdentity=TieredIdentity(node=centos-bigdata-datanode-10-24-2-114.intsig.internal, rack=null)}, tierAlias=SSD, mediumType=SSD}]}

In addition, neither checksum nor copyToLocal succeeded for this file, and I didn't see any errors from the master or workers when the Spark and Trino jobs failed.
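
For reference, the attempts looked like this (a sketch against the Alluxio 2.x fs shell; /tmp/part-00957.c000 below is just an example local destination):

bin/alluxio fs checksum /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000
bin/alluxio fs copyToLocal /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000 /tmp/part-00957.c000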

@ziyangRen ziyangRen added the type-bug This issue is about a bug label Apr 17, 2024
@YichuanSun
Contributor

Which Alluxio version are you using?

@jasondrogba
Contributor

You need to refresh the metadata.
Set alluxio.user.file.metadata.sync.interval=0 to sync the metadata.
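
In practice that could look like this (a sketch; Alluxio 2.x fs shell, using the partition directory from this issue):

bin/alluxio fs loadMetadata -R /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416

or, in alluxio-site.properties on the client, force a sync against the UFS on every access:

alluxio.user.file.metadata.sync.interval=0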

@ziyangRen
Author

Thank you for your reply!
@YichuanSun I used 2.9.3.
@jasondrogba I checked the metadata with checkConsistency, but it showed no difference from the UFS, and a manual loadMetadata did not fix the data either.

/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000 is consistent with the under storage system.

I'm wondering what's causing this. I never bypass Alluxio to write to the UFS directly, so there should be no inconsistency in the metadata.
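
For context, the consistency check above came from the standard shell command (a sketch, same path as before):

bin/alluxio fs checkConsistency /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000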

@ziyangRen
Author

In addition, the file was written by Spark 2.4.8 through Hive (the Hive table's metadata points to Alluxio). The Alluxio logs from before and after the file was generated are as follows:

grep part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000 /opt/alluxio/alluxio-2.9.3/logs/*log*|grep "2024-04-17 03:"
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.81:2024-04-17 03:58:42,943 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=true allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.200:52778   cmd=getFileInfo src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000     dst=null        perm=core_adm:hive:rw-rwx---    executionTimeUs=302     proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.82:2024-04-17 03:58:09,365 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=false        allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.113:56684   cmd=getFileInfo src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000 dst=null        perm=null       executionTimeUs=1366      proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.82:2024-04-17 03:58:09,373 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=true allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.113:56684   cmd=rename      src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/task_20240417030152_0021_m_000957/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000  dst=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000   perm=core_adm:hive:rw-rwx---    executionTimeUs=6462    proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.82:2024-04-17 03:58:36,183 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=false        allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.113:56684   cmd=delete      src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000     dst=null        perm=null       executionTimeUs=1067    proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.82:2024-04-17 03:58:36,189 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=true allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.113:56684   cmd=rename      src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000 dst=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000     perm=core_adm:hive:rw-rwx---    executionTimeUs=5159    proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.82:2024-04-17 03:58:36,190 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=true allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.113:56684   cmd=getFileInfo src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000     dst=null        perm=core_adm:hive:rw-rwx---    executionTimeUs=293     proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.85:2024-04-17 03:49:45,387 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=true allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.135:58148   cmd=getNewBlockIdForFile        src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_000957_34607/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000        dst=null        perm=core_adm:hive:rw-rwx---    executionTimeUs=38      proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.85:2024-04-17 03:49:47,574 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=true allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.135:58148   cmd=completeFile        src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_000957_34607/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000        dst=null        perm=core_adm:hive:rw-rwx---    executionTimeUs=1265    proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.85:2024-04-17 03:49:47,581 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=true allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.135:58148   cmd=getFileInfo src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_000957_34607/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000        dst=null        perm=core_adm:hive:rw-rwx---    executionTimeUs=4358    proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.86:2024-04-17 03:45:22,046 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=true allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.135:58148   cmd=createFile  src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_000957_34607/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000        dst=null        perm=core_adm:hive:rw-rwx---    executionTimeUs=22677   proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.86:2024-04-17 03:45:22,057 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=true allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.135:58148   cmd=getNewBlockIdForFile        src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_000957_34607/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000        dst=null        perm=core_adm:hive:rw-rwx---    executionTimeUs=48      proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master_audit.log.86:2024-04-17 03:45:52,078 INFO  [AsyncUserAccessAuditLogger](AsyncUserAccessAuditLogWriter.java:126) - succeeded=true allowed=true      ugi=prod_tong_liu,prod_tong_liu (AUTH=SIMPLE)   ip=/10.24.2.135:58148   cmd=getNewBlockIdForFile        src=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_000957_34607/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000        dst=null        perm=core_adm:hive:rw-rwx---    executionTimeUs=35      proto=rpc
/opt/alluxio/alluxio-2.9.3/logs/master.log:2024-04-17 03:58:36,183 WARN  [master-rpc-executor-TPE-thread-319](InodeSyncStream.java:503) - Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000, desiredLockPattern=READ, shouldSync={Should sync: true, Last sync time: 1712631045797}}, descendantType=ALL, commonOptions=syncIntervalMs: 216000000
/opt/alluxio/alluxio-2.9.3/logs/master.log:2024-04-17 03:58:36,183 WARN  [master-rpc-executor-TPE-thread-319](RpcUtils.java:197) - Exit (Error): Remove: request=path: "/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000"
/opt/alluxio/alluxio-2.9.3/logs/master.log:2024-04-17 03:58:36,185 WARN  [master-rpc-executor-TPE-thread-216](InodeSyncStream.java:503) - Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000, desiredLockPattern=READ, shouldSync={Should sync: true, Last sync time: 1712631045797}}, descendantType=ONE, commonOptions=syncIntervalMs: 216000000
/opt/alluxio/alluxio-2.9.3/logs/master.log.1:2024-04-17 03:45:22,024 WARN  [master-rpc-executor-TPE-thread-131](InodeSyncStream.java:503) - Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_000957_34607/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000, desiredLockPattern=READ, shouldSync={Should sync: true, Last sync time: 1712631045797}}, descendantType=ONE, commonOptions=syncIntervalMs: 216000000
/opt/alluxio/alluxio-2.9.3/logs/master.log.1:2024-04-17 03:58:09,365 WARN  [master-rpc-executor-TPE-thread-180](InodeSyncStream.java:503) - Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000, desiredLockPattern=READ, shouldSync={Should sync: true, Last sync time: 1712631045797}}, descendantType=NONE, commonOptions=syncIntervalMs: 216000000
/opt/alluxio/alluxio-2.9.3/logs/master.log.1:2024-04-17 03:58:09,367 WARN  [master-rpc-executor-TPE-thread-407](InodeSyncStream.java:503) - Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000, desiredLockPattern=READ, shouldSync={Should sync: true, Last sync time: 1712631045797}}, descendantType=ONE, commonOptions=syncIntervalMs: 216000000

@jasondrogba
Contributor

jasondrogba commented Apr 18, 2024

Have you tried refreshing Hive's metadata? I think the problem is caused by a difference between the metadata in Hive and the metadata in Alluxio.
In addition, according to the final WARN, it may be that Alluxio does not have permission to access /user/hive/warehouse and therefore failed to sync the metadata.
The blog post Metadata Synchronization in Alluxio: Design, Implementation and Optimization has more information about metadata sync.
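
As a sketch of the kind of refresh I mean (the table name edw_user.adm_cs_device_tag_df is inferred from the paths above; assumes the Spark SQL and Hive CLIs are available):

spark-sql -e "REFRESH TABLE edw_user.adm_cs_device_tag_df"
hive -e "MSCK REPAIR TABLE edw_user.adm_cs_device_tag_df"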

@jasondrogba
Contributor

jasondrogba commented Apr 18, 2024

Could you share more information from master.log?

 WARN  [master-rpc-executor-TPE-thread-407](InodeSyncStream.java:503) - Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/........

https://github.com/Alluxio/alluxio/blame/26919b8894d251b803c82513cb1eeee562bace0a/core/server/master/src/main/java/alluxio/master/file/InodeSyncStream.java#L503
and you can check the code here; this warning means the file does not exist on the UFS or in Alluxio

@ziyangRen
Author

@jasondrogba Of course, I'll share more logs. The following two log entries show up repeatedly in master.log. Note that every file in /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416 gets the same two warnings:

2024-04-17 03:58:41,516 WARN  [master-rpc-executor-TPE-thread-273](RpcUtils.java:197) - Exit (Error): Remove: request=path: "/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-01388-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000"
options {
  recursive: true
  alluxioOnly: false
  unchecked: false
  commonOptions {
    syncIntervalMs: 216000000
    ttl: -1
    ttlAction: FREE
    operationId {
      mostSignificantBits: -7539285028618024718
      leastSignificantBits: -6183483581502312092
    }
  }
}
, Error=alluxio.exception.FileDoesNotExistException: Path "/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-01388-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000" does not exist.
2024-04-17 03:58:41,517 WARN  [master-rpc-executor-TPE-thread-53](InodeSyncStream.java:503) - Failed to sync metadata on root path InodeSyncStream{rootPath=LockingScheme{path=/user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/batch=20240416/part-01388-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000, desiredLockPattern=READ, shouldSync={Should sync: true, Last sync time: 1712631045797}}, descendantType=ONE, commonOptions=syncIntervalMs: 216000000
ttl: -1
ttlAction: FREE
operationId {
  mostSignificantBits: 3717584004956243185
  leastSignificantBits: -6223852750410610288
}
, forceSync=false} because it does not exist on the UFS or in Alluxio

One thing that bothers me is that the logs say the sync failed because the files didn't exist, yet whenever I checked manually the files were there. Additionally, the Alluxio process is started as the HDFS superuser, which has full access to /user/hive/warehouse.

After that, I ran the following commands to make sure I hadn't missed any critical logs:

cat /opt/alluxio/alluxio-2.9.3/logs/master.log.1 |grep 'adm_cs_device_tag_df' |wc -l
cat /opt/alluxio/alluxio-2.9.3/logs/master.log.1 |grep 'adm_cs_device_tag_df' | grep  'WARN' |wc -l
cat /opt/alluxio/alluxio-2.9.3/logs/master.log.1 |grep 'adm_cs_device_tag_df' | grep  'Error' |wc -l
cat /opt/alluxio/alluxio-2.9.3/logs/master.log.1 |grep 'adm_cs_device_tag_df' | grep  'WARN' |grep "Failed to sync metadata" | wc -l
cat /opt/alluxio/alluxio-2.9.3/logs/master.log.1 |grep 'adm_cs_device_tag_df' | grep  'WARN' |grep -v "Failed to sync metadata" | head -100

Could this be caused by the ACLs I set manually? Although I haven't seen any permission errors, for all tables written through Alluxio the user and group differ from those written without Alluxio (note that the ACLs I set manually in Alluxio match the ACLs in HDFS).
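
A quick way to compare the two sides would be something like this (a sketch; getfacl exists in both the Alluxio 2.x and HDFS shells):

bin/alluxio fs getfacl /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df
hdfs dfs -getfacl /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df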

@ziyangRen
Author

@jasondrogba Hi, did you forget this issue? The problem still exists. If possible, please help confirm whether this is a bug and how to solve it. If you need any logs or information, please feel free to tell me.

@jasondrogba
Contributor

jasondrogba commented Apr 23, 2024

Oh! @ziyangRen, can you share the worker log? And why do you have so many block replicas?
If convenient, could you also please share the alluxio-site.properties file?

@ziyangRen
Author

Hi @jasondrogba, I did my best to gather the worker logs and didn't find any errors, but I've summarized a few recurring events from that time that might help:

The first is a large number of block transfers; the two block IDs with the bad data went through this process during that window:

2024-04-17 03:33:49,402 WARN  [block-management-task-47](TieredBlockStore.java:569) - Target tier: BlockStoreLocation{TierAlias=SSD, DirIndex=0, MediumType=SSD} has no available space to store 134255858 bytes for session: -5163578156016566513
2024-04-17 03:33:49,402 WARN  [block-management-task-47](BlockTransferExecutor.java:146) - Transfer-order: BlockTransferInfo{TransferType=SWAP, SrcBlockId=164604722282496, DstBlockId=164753401970688, SrcLocation=BlockStoreLocation{TierAlias=MEM, DirIndex=0, MediumType=MEM}, DstLocation=BlockStoreLocation{TierAlias=SSD, DirIndex=0, MediumType=SSD}} failed. alluxio.exception.runtime.ResourceExhaustedRuntimeException: Failed to find space in BlockStoreLocation{TierAlias=SSD, DirIndex=0, MediumType=SSD} to move blockId 164604722282496
2024-04-17 03:33:49,402 WARN  [block-management-task-47](AlignTask.java:100) - Insufficient space for worker swap space, swap restore task called.

The following logs appear to be normal HDFS read and write traffic; I've listed them below:

2024-04-17 04:07:31,271 WARN  [worker-rpc-executor-TPE-thread-55447](LogUtils.java:135) - Exception occurred while processing read request onError sessionId: null, null: io.grpc.StatusRuntimeException: CANCELLED: client cancelled
2024-04-17 03:47:43,513 INFO  [DataStreamer for file /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_000957_34607/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000](SaslDataTransferClient.java:239) - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-04-17 03:49:47,027 INFO  [DataStreamer for file /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_000957_34607/batch=20240416/part-00957-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000](SaslDataTransferClient.java:239) - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2024-04-17 03:55:23,777 INFO  [DataStreamer for file /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_001238_34888/batch=20240416/part-01238-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000](DataStreamer.java:1790) - Exception in createBlockOutputStream blk_6388178153_5380773515
java.net.SocketTimeoutException: 75000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.24.2.119:51092 remote=/10.24.2.119:1004]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:548)
        at org.apache.hadoop.hdfs.DataStreamer.createBlockOutputStream(DataStreamer.java:1762)
        at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1679)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:716)
2024-04-17 03:55:23,778 WARN  [DataStreamer for file /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_001238_34888/batch=20240416/part-01238-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000](DataStreamer.java:1683) - Abandoning BP-1902924606-10.2.5.100-1516956632926:blk_6388178153_5380773515
2024-04-17 03:55:23,788 WARN  [DataStreamer for file /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_001238_34888/batch=20240416/part-01238-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000](DataStreamer.java:1688) - Excluding datanode DatanodeInfoWithStorage[10.24.2.119:1004,DS-68690278-d955-4978-9cdf-4c9ec80a3d6e,DISK]
2024-04-17 03:55:23,778 WARN  [DataStreamer for file /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_001238_34888/batch=20240416/part-01238-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000](DataStreamer.java:1683) - Abandoning BP-1902924606-10.2.5.100-1516956632926:blk_6388178153_5380773515
2024-04-17 03:55:23,788 WARN  [DataStreamer for file /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_001238_34888/batch=20240416/part-01238-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000](DataStreamer.java:1688) - Excluding datanode DatanodeInfoWithStorage[10.24.2.119:1004,DS-68690278-d955-4978-9cdf-4c9ec80a3d6e,DISK]
2024-04-17 03:55:23,796 INFO  [DataStreamer for file /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df/.hive-staging_hive_2024-04-17_03-01-46_626_2294969227079732485-1/-ext-10000/_temporary/0/_temporary/attempt_20240417030152_0021_m_001238_34888/batch=20240416/part-01238-8a201bff-21ad-4a8a-9cff-cdec51ed1657.c000](SaslDataTransferClient.java:239) - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false

As for the excessive replicas you mentioned, I am also confused. Although clients do a large amount of concurrent reading and writing in our actual workload, the following Alluxio configuration should prevent a large number of replicas:

alluxio.master.ufs.block.location.cache.capacity=0

# User properties
alluxio.user.file.metadata.sync.interval=216000000
alluxio.user.file.writetype.default=CACHE_THROUGH
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.DeterministicHashPolicy
alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards=3
alluxio.user.block.write.location.policy.class=alluxio.client.block.policy.MostAvailableFirstPolicy
alluxio.user.file.replication.max=3
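
One more client setting that may be relevant here (an assumption on my side, I have not verified it on this cluster): with passive caching left at its default of true, every client that reads a block from a remote worker writes an extra local copy, which can multiply replicas beyond what the read policy alone suggests:

# sketch: disable client-side passive caching so remote reads do not create
# extra local replicas (alluxio.user.file.passive.cache.enabled defaults to true)
alluxio.user.file.passive.cache.enabled=false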

If you need anything else, please let me know at any time and I will provide the relevant information as soon as possible.

@jasondrogba
Contributor

I see you're using different medium types, MEM and SSD, and based on your URIStatus there are many block replicas. I suspect this might be due to your multi-tier storage configuration.
Based on the information you've shown in the worker log, I noticed an error,
ResourceExhaustedRuntimeException: Failed to find space in SSD.
My guess is that the 258.1 MB HDFS file is split into two blocks: the first block, 256 MB, is stored in MEM, while the second is stored on SSD. Due to insufficient SSD space, the second block gets evicted, leading to the Trino error: "Incorrect file size (270589145) for file (end of stream not reached)."
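
To make the expected split concrete, here is the arithmetic on the sizes quoted above, as a minimal Python sketch:

# expected block layout for a 270589145-byte file with a 256 MiB block size
length, block_size = 270589145, 268435456
full_blocks, tail = divmod(length, block_size)
print(full_blocks, tail)  # -> (1, 2153689): one full block plus a ~2.05 MiB tail

So a consistent entry should be one 268435456-byte block plus one 2153689-byte tail block; a second full-size block is exactly what would push the reported size past 100%.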

@ziyangRen
Author

@jasondrogba Thanks for your quick reply. But if that is the case, I have three questions:

  1. I can understand a cache miss caused by the lack of SSD space, but when Trino queries, Alluxio should be able to fall back to the UFS for the evicted block. Instead, the metadata is inconsistent and Alluxio is not aware that the data is missing. What prevents Alluxio from detecting the metadata inconsistency?
  2. Why was a second 256 MB block generated?
  3. Many other blocks went through the same processing during that period without any data problems, so the explanation you proposed does not seem to account for why only this block was affected.

@jasondrogba
Contributor

  1. According to the master log you shared, Alluxio tried to sync the data but failed:
    Error=alluxio.exception.FileDoesNotExistException: Path
    Failed to sync metadata on root path .... because it does not exist on the UFS or in Alluxio
  2. You've set the block size to 256 MB, so for a 258 MB file, anything beyond the first 256 MB is placed in a second block.
  3. I think this also shows that it is a special case; it is most likely an issue with this particular HDFS file.

@ziyangRen
Author

ziyangRen commented Apr 24, 2024

@jasondrogba Thanks again for your patience, but I still have some follow-up questions:

  1. I manually synced the metadata and it did not help, because Alluxio did not recognize the inconsistency. Even after running checkConsistency, Alluxio reports the metadata as consistent, which is not expected. What causes this? (The exact commands I mean are sketched at the end of this comment.)
  2. I know the block size is 256 MB. My question is: if the block beyond the first 256 MB is lost (the evicted block is only 2.1 MB), why does Alluxio report another full 256 MB of data, making the file larger in Alluxio than the 258.1 MB in HDFS? If the tail block were evicted for lack of SSD capacity, I would expect the file to be left with only one 256 MB block, and the missing part could be re-read from the UFS.
  3. Could I avoid this problem by using a single storage tier and reducing the metadata sync interval? For example:
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=SSD
alluxio.worker.tieredstore.level0.dirs.path=/data1/alluxio-ssd-cache,/data2/alluxio-ssd-cache,/data3/alluxio-ssd-cache
alluxio.worker.tieredstore.level0.dirs.quota=800g,800g,800g
alluxio.user.file.metadata.sync.interval=36000
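
Regarding question 1, the re-sync attempts I mean were of this form (a sketch only; the path is our table and the exact flags may differ across versions):

# report and, with -r, attempt to repair Alluxio-vs-UFS metadata inconsistencies
bin/alluxio fs checkConsistency -r /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df
# force a metadata sync on this listing by zeroing the sync interval
bin/alluxio fs ls -R -Dalluxio.user.file.metadata.sync.interval=0 /user/hive/warehouse/edw_user.db/adm_cs_device_tag_df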

@jasondrogba
Contributor

jasondrogba commented Apr 24, 2024

@yuzhu You have more wisdom and experience.
Do you have any idea about this issue?

@jasondrogba
Contributor

@ziyangRen we recommend using a single storage tier; you can try that.

@ziyangRen
Author

@jasondrogba Thank you for your suggestions and patient answers. I will switch to a single storage tier. If the same problem still occurs after the change, I will update this issue.
