Bugs found during the use of Alluxio Java SDK #18540

Open
Never-D opened this issue Mar 6, 2024 · 15 comments
Labels
type-bug This issue is about a bug

Comments

@Never-D

Never-D commented Mar 6, 2024

Alluxio Version:
Server version is 2.9.3 (Java SDK dependency: implementation("org.alluxio:alluxio-shaded-client:2.9.3"))

Describe the bug
My service wraps the SDK in an HTTP file download endpoint and sits behind nginx as a reverse proxy. When testing this endpoint with 100 to 1000 concurrent requests (the downloaded package is approximately 160 MB), some downloads fail. The specific error message is: java.io.IOException: Broken pipe; org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe.

My code is as follows:

public void testAlluxioDownload(HttpServletResponse response, String path) throws Throwable {
    AlluxioProperties alluxioProperties = new AlluxioProperties();
    alluxioProperties.set(PropertyKey.MASTER_EMBEDDED_JOURNAL_ADDRESSES, alluxioConfig.getMasterAddress());
    InstancedConfiguration conf = new InstancedConfiguration(alluxioProperties);
    // A new client is created for every request.
    FileSystem fileSystem = FileSystem.Factory.create(conf);
    URIStatus status;
    AlluxioURI alluxioURI = new AlluxioURI(path);
    try {
        if (!fileSystem.exists(alluxioURI)) {
            // Not known to Alluxio yet: try loading its metadata first.
            fileSystem.loadMetadata(alluxioURI);

            if (!fileSystem.exists(alluxioURI)) {
                HttpTool.setFailedResponseMessage(response, HttpStatus.NOT_FOUND.value(), "File does not exist");
                return;
            }
        }
        status = fileSystem.getStatus(alluxioURI);
    } catch (Throwable e) {
        throw new Throwable(e);
    }
    try (FileInStream fileInputStream = fileSystem.openFile(alluxioURI);
         ServletOutputStream outputStream = response.getOutputStream()) {
        response.setContentType(MediaType.APPLICATION_OCTET_STREAM_VALUE);
        long fileSize = status.getLength();
        if (fileSize != -1) {
            response.setHeader("Content-Length", String.valueOf(fileSize));
        }
        response.setHeader(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename="
                + new String(alluxioURI.getName().getBytes(StandardCharsets.UTF_8)));
        // Copy the file to the response with a 1 KB buffer.
        IOUtils.copy(fileInputStream, outputStream, 1024);
        response.setStatus(HttpStatus.OK.value());
    } catch (Throwable e) {
        // Covers OpenDirectoryException, FileDoesNotExistException,
        // FileIncompleteException, and any other failure.
        throw new Throwable(e);
    }
}

To Reproduce
Start a Spring Boot service using the above code and the SDK, then run many wget commands concurrently against it to reproduce the behavior described above.

Expected behavior
A solution or a fix plan.

Urgency
This bug prevents our Alluxio deployment from serving high-concurrency file downloads, which seriously affects usage.

Are you planning to fix it
I am currently unsure whether the problem is caused by missing client or server configuration or by a bug in the code itself, so I do not have a fix plan yet.

Additional context
If you cannot fix it soon, you can also send me the fix or a workaround, and I can try to apply it myself.

Never-D added the type-bug label Mar 6, 2024
@YichuanSun
Contributor

Could you share your Alluxio configuration? Then I can check whether it is just a configuration problem.

@Never-D
Author

Never-D commented Mar 11, 2024

@YichuanSun
Master and worker node specs: 4 cores, 32 GB memory each

master configuration:

alluxio.master.hostname=${localip}
alluxio.master.embedded.journal.addresses=${alluxio_master_ip01}:19200,${alluxio_master_ip02}:19200,${alluxio_master_ip03}:19200  
alluxio.master.mount.table.root.ufs=cos://lt-cubesats-alluxio-prod/alluxio/
fs.cos.access.key=${cos_cubesats_alluxio_accessKeyId}
fs.cos.app.id=1259571579
fs.cos.connection.max=4096
fs.cos.connection.timeout=50sec
fs.cos.region=ap-nanjing
fs.cos.secret.key=${cos_cubesats_alluxio_secretKey}
fs.cos.socket.timeout=50sec

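# Enable automatic cache loading and configure the cache directory: files in this directory are cached immediately after upload or after being discovered in object storage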
alluxio.master.data.async.cache.enabled=true
alluxio.master.data.async.cache.file.path=/shein-os/cos-alluxio/data


alluxio.user.file.replication.durable=2
alluxio.master.worker.timeout=180sec
# Metadata refresh interval
alluxio.user.file.metadata.sync.interval=30min
# Metadata management
alluxio.master.metastore.dir=/data01/metastore
alluxio.master.journal.folder=/data01/journal
alluxio.security.authorization.permission.enabled=false
# User impersonation
alluxio.master.security.impersonation.hadoop.users=*
alluxio.master.security.impersonation.hadoop.groups=*
alluxio.master.security.impersonation.client.users=*
alluxio.master.security.impersonation.client.groups=*
alluxio.master.security.impersonation.yarn.users=*
alluxio.master.security.impersonation.yarn.groups=*

# Disable local caching of remotely stored data on Alluxio
#alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.writetype.default=THROUGH
alluxio.user.file.readtype.default=CACHE

# Work around zero-byte empty files being unexpectedly added to the mounted bucket
alluxio.underfs.object.store.breadcrumbs.enabled=false  
# FUSE monitoring configuration
alluxio.fuse.web.enabled=true
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.CapacityBasedDeterministicHashPolicy
alluxio.user.client.cache.enabled=true
alluxio.user.client.cache.store.type=LOCAL
alluxio.user.client.cache.dirs=/home/hadoop
alluxio.user.client.cache.size=10GB
alluxio.user.client.cache.page.size=4MB


alluxio.master.shell.copy.file.buffer.size=8388608
alluxio.underfs.object.store.breadcrumbs.enabled=false
alluxio.user.network.writer.chunk.size.bytes=4MB
alluxio.user.client.cache.async.write.threads=32
alluxio.user.client.cache.timeout.threads=64
alluxio.user.client.cache.timeout.duration=30min
alluxio.user.network.reader.chunk.size.bytes=4MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB

#alluxio.user.block.size.bytes.default=16MB

# Increase the number of worker web handler threads
alluxio.web.threads=4000
alluxio.network.connection.health.check.timeout.ms=180sec

alluxio.web.threaddump.log.enabled=true

alluxio.master.rpc.executor.max.pool.size=4000
alluxio.master.rpc.executor.core.pool.size=4000

alluxio.user.network.data.timeout.ms=30min
alluxio.user.streaming.data.timeout=30min

worker configuration:

alluxio.master.embedded.journal.addresses=${alluxio_master_ip01}:19200,${alluxio_master_ip02}:19200,${alluxio_master_ip03}:19200   

# Cache configuration: enable two-tier caching
alluxio.worker.tieredstore.levels=2
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.worker.tieredstore.level0.dirs.path=/mnt/ramdisk
alluxio.worker.tieredstore.level0.dirs.mediumtype=MEM
alluxio.worker.tieredstore.level0.dirs.quota=1GB
alluxio.worker.tieredstore.level1.alias=HDD
alluxio.worker.tieredstore.level1.dirs.path=/data01/alluxio-namespace
alluxio.worker.tieredstore.level1.dirs.mediumtype=HDD
alluxio.worker.tieredstore.level1.dirs.quota=700GB

# Disable local caching of remotely stored data on Alluxio
alluxio.user.file.passive.cache.enabled=false
alluxio.user.file.writetype.default=THROUGH
alluxio.user.file.readtype.default=CACHE
alluxio.worker.tieredstore.level0.watermark.high.ratio=0.70
alluxio.worker.tieredstore.level1.watermark.high.ratio=0.70


alluxio.security.authorization.permission.enabled=false
alluxio.network.ip.address.used=true

# Work around zero-byte empty files being unexpectedly added to the mounted bucket
alluxio.underfs.object.store.breadcrumbs.enabled=false

alluxio.master.shell.copy.file.buffer.size=8388608
alluxio.underfs.object.store.breadcrumbs.enabled=false
alluxio.user.network.writer.chunk.size.bytes=4MB
alluxio.user.client.cache.async.write.threads=32
alluxio.user.client.cache.timeout.threads=64
alluxio.user.network.reader.chunk.size.bytes=4MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
alluxio.user.streaming.reader.chunk.size.bytes=2MB
alluxio.user.streaming.reader.close.timeout=30s

alluxio.consul.enabled=true
alluxio.consul.url=http://xxxx
alluxio.consul.service.name=prod-alluxio-server-ci-east-worker
alluxio.service.env.type=prod
alluxio.consul.service.tag=type=type=worker,disk=ssd,model=m6,cmdb-app-name=ci-alluxio,cmdb-name=ci-alluxio-cneast-prod-main

# Increase the number of worker web handler threads
alluxio.web.threads=4000

alluxio.network.connection.health.check.timeout.ms=180sec

alluxio.web.threaddump.log.enabled=true

alluxio.worker.management.load.detection.cool.down.time=60sec
alluxio.worker.free.space.timeout=180sec
alluxio.worker.master.periodical.rpc.timeout=30min
alluxio.worker.memory.size=21GB
alluxio.worker.network.block.reader.threads.max=4000
alluxio.worker.network.keepalive.time=30min
alluxio.worker.network.keepalive.timeout=30min
alluxio.worker.network.permit.keepalive.time=30min
alluxio.worker.network.netty.worker.threads=8
alluxio.worker.block.master.client.pool.size=30
alluxio.worker.rpc.executor.core.pool.size=4000
alluxio.worker.rpc.executor.max.pool.size=4000

The client is configured only with the master addresses.
In addition, we added an HTTP download interface to the worker nodes. If only one worker serves the downloads, no problems occur. However, if multiple worker nodes serve downloads at the same time, some downloads fail, similar to what we see with the SDK.

The code for the download interface is as follows:

  @GET
  @Path(PATH_PARAM)
  @ApiOperation(value = "Download the given file at the path", response = java.io.InputStream.class)
  @Produces(MediaType.APPLICATION_OCTET_STREAM)
  public Response downloadFile(@PathParam("path") final String path) throws IOException, AlluxioException {
    AlluxioURI uri = new AlluxioURI("/" + path);
    URIStatus status;
    try {
      if (!mFsClient.exists(uri)) {
        mFsClient.loadMetadata(uri);

        if (!mFsClient.exists(uri)) {
          return Response.noContent().build();
        }
      }

      status = mFsClient.getStatus(uri);
    } catch (IOException | AlluxioException e) {
      return Response.status(500).entity(e.getMessage()).build();
    }

    // The file is opened lazily, when the container starts writing the response body.
    StreamingOutput fileStream = output -> {
      try (FileInStream input = mFsClient.openFile(uri)) {
        byte[] buffer = new byte[1024];
        int length;
        while ((length = input.read(buffer)) != -1) {
          output.write(buffer, 0, length);
          output.flush();
        }
      } catch (AlluxioException e) {
        throw new RuntimeException(e);
      }
    };

    try {
      return Response.ok(fileStream)
          .header("Content-Disposition", "attachment; filename=" + uri.getName())
          .header("Content-Length", status.getLength())
          .build();
    } catch (Exception e) {
      return Response.status(500).entity(e.getMessage()).build();
    }
  }

@jasondrogba
Contributor

jasondrogba commented Mar 12, 2024

Have you found any errors in the master, worker, or proxy logs? Can you share the logs?

> if only one worker is used for downloading, no problems are found; However, if multiple worker nodes are called to download files at the same time, some of the files will fail to download, similar to the phenomenon of using SDK.

You can try increasing the alluxio.user.block.master.client.pool.size.max property.

@Never-D
Author

Never-D commented Mar 13, 2024

@jasondrogba The error log is: java.io.IOException: Broken pipe; org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe.

@Never-D
Author

Never-D commented Mar 13, 2024

@jasondrogba Is there any solution to the concurrency issue that appears after adding a proxy-like download interface to the worker nodes?

@jasondrogba
Contributor

> @jasondrogba error log is: java.io.IOException: Broken pipe; org.apache.catalina.connector.ClientAbortException: java.io.IOException: Broken pipe.

@Never-D I guess this error message comes from Spring Boot? The timeout configured in Tomcat or nginx may be too short.
https://stackoverflow.com/questions/43825908/org-apache-catalina-connector-clientabortexception-java-io-ioexception-apr-err

> Most likely, your server is taking too long to respond and the client is getting bored and closing the connection.
> A bit more explanation: Tomcat receives a request on a connection and tries to fulfill it. Imagine this takes 3 minutes. If the client has a timeout of, say, 2 minutes, it will close the connection, and when Tomcat finally comes back to write the response, the connection is closed and it throws an org.apache.catalina.connector.ClientAbortException.

I think you can increase the timeouts of the Spring Boot server and nginx, or add CPU and memory to the Alluxio nodes.
Please share the Alluxio logs so we can look at what causes the timeouts under concurrent load. It is difficult to determine the cause from the error message alone. Have you found any errors in master.log and worker.log under alluxio/logs?

@Never-D
Author

Never-D commented Mar 19, 2024

@jasondrogba 2024-03-19 11:18:58,227 INFO ALLUXIO-PROXY-WEB-SERVICE-224 - Alluxio S3 API received GET request: URI=http://alluxio-test-proxy.dev.sheincorp.cn/api/v1/paths/%2Fshein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz/download-file User=null Media Type=null Query Parameters={} Path Parameters={}
2024-03-19 11:19:15,682 WARN ALLUXIO-PROXY-WEB-SERVICE-157 - Failed to read block 21508390913 of file /shein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz from worker WorkerNetAddress{host=10.121.0.207, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.121.0.207, rack=null)}. This worker will be skipped for future read operations, will retry: alluxio.exception.status.UnavailableException: io exception.

@Never-D
Author

Never-D commented Mar 19, 2024

@jasondrogba When downloading a 162 MB file, the downloaded size is incorrect, but there is no error message.
2024-03-19 12:50:27 (1.60 MB/s) - ‘aws-sdk-cpp-v1.0.tar.gz.184’ saved [58884016]
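
A truncated download like this one (58,884,016 bytes saved for a file of roughly 162 MB) can pass silently if the service never compares what it wrote with the length it advertised. Below is a minimal sketch, using only the JDK, of a copy loop that counts bytes and fails loudly on a mismatch. `LengthCheckedCopy`, `copyExactly`, and `expectedLength` are hypothetical names; in the handler above the expected value would presumably come from `status.getLength()`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class LengthCheckedCopy {

    /**
     * Copies everything from in to out and returns the number of bytes copied.
     * Throws IOException if that number differs from expectedLength, so a
     * short read is logged as an error instead of ending the response quietly.
     */
    static long copyExactly(InputStream in, OutputStream out, long expectedLength)
            throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        if (total != expectedLength) {
            throw new IOException(
                "length mismatch: expected " + expectedLength + " bytes, copied " + total);
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1_000_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copyExactly(new ByteArrayInputStream(data), sink, data.length);
        System.out.println("copied " + copied + " bytes");
    }
}
```

The check does not repair the download, but it turns a silent short file into a server-side exception that can be correlated with the proxy and worker logs.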

@jasondrogba
Contributor

jasondrogba commented Mar 19, 2024

> This worker will be skipped for future read operations

Based on this line, I traced the error to AlluxioFileInStream.
You can take a look at #16094 and #16096.

export ALLUXIO_FUSE_JAVA_OPTS="-XX:MaxDirectMemorySize=128m"

You can try increasing MaxDirectMemorySize.
@secfree Hi, do you have any idea about this error? I think you have more experience with this and can help with this issue.
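
Whether direct memory is actually the limiting factor can be checked from inside a JVM. Below is a minimal sketch using the JDK's `BufferPoolMXBean`. It reports on the JVM it runs in, so for an Alluxio process the same bean would have to be read from within that process or over JMX, which is an assumption about the deployment. `DirectMemoryReport` and `directMemoryUsed` are hypothetical names.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

class DirectMemoryReport {

    /**
     * Returns the bytes currently held by the JVM's direct buffer pool,
     * or -1 if no pool named "direct" is exposed (the name is what HotSpot
     * uses; other JVMs may differ).
     */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        // Allocate 16 MB off-heap to make the change visible.
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        long after = directMemoryUsed();
        System.out.println("direct memory before: " + before + " bytes");
        System.out.println("direct memory after:  " + after + " bytes");
        System.out.println("buffer capacity:      " + buffer.capacity() + " bytes");
    }
}
```

If the reported value sits close to the configured `-XX:MaxDirectMemorySize` under load, raising the limit is a plausible next step; if it stays far below, the cause is more likely elsewhere.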

@YichuanSun
Contributor

One possible reason: you need to close the FileSystem instance at the end of your code; otherwise these FileSystem objects leak resources. @Never-D

@YichuanSun
Contributor

> One possible reason: you need to close the FileSystem instance at the end of your code; otherwise these FileSystem objects leak resources. @Never-D

This matters especially in such a high-concurrency case.
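
The pattern suggested here can be sketched with try-with-resources, assuming the client type implements `java.io.Closeable`. The example uses a hypothetical stand-in, `FakeClient`, rather than the real SDK so that it runs with the JDK alone; the counter only exists to make the leak visible.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

class ClientLifecycleSketch {

    /** Hypothetical stand-in for a per-request client that holds connections. */
    static final class FakeClient implements Closeable {
        static final AtomicInteger OPEN = new AtomicInteger();

        FakeClient() {
            OPEN.incrementAndGet();
        }

        byte[] read(String path) throws IOException {
            if (path.isEmpty()) {
                throw new IOException("empty path");
            }
            return path.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    /** Leaky variant: a client is created per call and never closed. */
    static int leakyDownload(String path) throws IOException {
        FakeClient client = new FakeClient();
        return client.read(path).length;
    }

    /** try-with-resources closes the client on success and on failure. */
    static int safeDownload(String path) throws IOException {
        try (FakeClient client = new FakeClient()) {
            return client.read(path).length;
        }
    }

    public static void main(String[] args) throws IOException {
        for (int i = 0; i < 1000; i++) {
            leakyDownload("/a");
        }
        System.out.println("open clients after leaky calls: " + FakeClient.OPEN.get());

        FakeClient.OPEN.set(0);
        for (int i = 0; i < 1000; i++) {
            safeDownload("/a");
        }
        System.out.println("open clients after safe calls: " + FakeClient.OPEN.get());
    }
}
```

Under hundreds of concurrent requests, each unclosed client keeps its resources alive until garbage collection at best, which is why the leak is more visible at high concurrency.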

@secfree
Contributor

secfree commented Mar 19, 2024

Hi @Never-D

> 2024-03-19 11:19:15,682 WARN ALLUXIO-PROXY-WEB-SERVICE-157 - Failed to read block 21508390913 of file /shein-os/cos-alluxio-test/data/upload-test/2/nexus-test/aws-sdk-cpp-v1.0.tar.gz from worker WorkerNetAddress{host=10.121.0.207, containerHost=, rpcPort=29999, dataPort=29999, webPort=30000, domainSocketPath=, tieredIdentity=TieredIdentity(node=10.121.0.207, rack=null)}. This worker will be skipped for future read operations, will retry: alluxio.exception.status.UnavailableException: io exception.

Can you check the alluxio-worker log? This is generally caused by a shortage of direct memory on the alluxio-worker side under concurrent reads. Increasing the value of -XX:MaxDirectMemorySize for the alluxio-worker process may help.

For the IOException: Broken pipe exception, one idea is to catch it and retry on your HTTP service side.
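
On the retry idea: once part of the body has been sent, the HTTP response cannot be restarted, so a server-side retry has to resume reading from the offset already written. Below is a minimal sketch under that assumption, using only the JDK. `RangeSource`, `copyWithResume`, `flakyArraySource`, and `maxRetries` are hypothetical names, not Alluxio APIs. It retries only failures while reading from the source; a broken pipe on the output side means the client has gone away, so write errors are left to propagate.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class ResumableCopy {

    /** Hypothetical source that can be reopened at an arbitrary byte offset. */
    interface RangeSource {
        InputStream openAt(long offset) throws IOException;
    }

    /** Copies length bytes, reopening the source at the current offset after a read error. */
    static long copyWithResume(RangeSource source, OutputStream out, long length, int maxRetries)
            throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long written = 0;
        int failures = 0;
        while (written < length) {
            try (InputStream in = source.openAt(written)) {
                while (written < length) {
                    int want = (int) Math.min(buffer.length, length - written);
                    int n;
                    try {
                        n = in.read(buffer, 0, want);
                    } catch (IOException readError) {
                        if (++failures > maxRetries) {
                            throw readError;
                        }
                        break; // leave the inner loop and reopen at the current offset
                    }
                    if (n == -1) {
                        throw new EOFException("source ended at " + written + " of " + length);
                    }
                    out.write(buffer, 0, n); // write errors propagate: the client is gone
                    written += n;
                }
            }
        }
        return written;
    }

    /** Demo source over a byte array whose first stream fails after failAfter bytes. */
    static RangeSource flakyArraySource(byte[] data, int failAfter) {
        boolean[] alreadyFailed = {false};
        return offset -> {
            InputStream base =
                new ByteArrayInputStream(data, (int) offset, data.length - (int) offset);
            if (alreadyFailed[0]) {
                return base;
            }
            alreadyFailed[0] = true;
            return new InputStream() {
                private int served = 0;

                @Override
                public int read() throws IOException {
                    if (served >= failAfter) {
                        throw new IOException("simulated read failure");
                    }
                    served++;
                    return base.read();
                }
            };
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[10_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copyWithResume(flakyArraySource(data, 1234), sink, data.length, 3);
        System.out.println("copied " + copied + " bytes despite one simulated failure");
    }
}
```

The retry budget is shared across the whole copy rather than reset per attempt, so a source that keeps failing still ends the request with an error.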

@Never-D
Author

Never-D commented Mar 19, 2024

@YichuanSun The error I sent came from the proxy.

@Never-D
Author

Never-D commented Mar 19, 2024

There are no error logs on the worker nodes.

@Never-D
Author

Never-D commented Mar 19, 2024

When downloading a 162 MB file, the downloaded size is incorrect, but there is no error message.
2024-03-19 12:50:27 (1.60 MB/s) - ‘aws-sdk-cpp-v1.0.tar.gz.184’ saved [58884016]
