Do filters support batch processing, and how do I set batch_size? #285

Closed
3 tasks done
Yang-QW opened this issue Mar 29, 2024 · 6 comments
Labels
enhancement New feature or request question Further information is requested stale-issue

Comments


Yang-QW commented Mar 29, 2024

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code of the main branch and run it again, and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

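I am using the latest Docker image: v0.2.0.
Do Filter operators support batch processing? Following the documentation, I set self._batched_op = True, but the samples that compute_stats receives are not lists. Comparing the two classes, I found that Mapper defines an is_batched_op method while Filter does not. After adding an is_batched_op method to Filter, modeled on the one in Mapper, compute_stats does receive lists, but each list has length 1, so this still does not make my custom operator any faster. How do I set the batch size?

Additional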

No response

@Yang-QW Yang-QW added the question Further information is requested label Mar 29, 2024

Yang-QW commented Mar 29, 2024

image
I think I have found where batch_size is set. I hope it can be added to the config file in the future.

@BeachWang
Collaborator

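Hi, batching here is mainly meant for the case where a mapper generates multiple samples from a single sample, so the result has to be wrapped as a batch when it is returned. Currently only mappers support batching, and the input batch size is fixed at 1. It would indeed be more reasonable for every type of op to support batching, and the batch size should be exposed to users as a setting. However, users may need to consider the overhead of building batches: if the speedup from the batched op is not enough to cover that overhead, processing may end up slower.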
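The packing cost mentioned above can be made concrete with a minimal sketch in plain Python. It assumes a batch is a column-wise dict of lists and that samples are row-wise dicts; `to_batch`, `from_batch`, `iter_batches`, `compute_stats_batched`, and `filter_by_length` are hypothetical names for illustration and are not Data-Juicer's API.

```python
from typing import Dict, Iterator, List

Sample = Dict[str, object]   # one row, e.g. {"text": "..."}
Batch = Dict[str, list]      # column-wise: {"text": ["...", "..."]}


def to_batch(samples: List[Sample]) -> Batch:
    """Pack row-wise samples into one column-wise batch (extra work per batch)."""
    return {key: [s[key] for s in samples] for key in samples[0]}


def from_batch(batch: Batch) -> List[Sample]:
    """Unpack a column-wise batch back into row-wise samples."""
    size = len(next(iter(batch.values())))
    return [{key: col[i] for key, col in batch.items()} for i in range(size)]


def iter_batches(samples: List[Sample], batch_size: int) -> Iterator[Batch]:
    """Yield consecutive batches of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield to_batch(samples[start:start + batch_size])


def compute_stats_batched(batch: Batch) -> Batch:
    """Hypothetical batched stats step: record the text length of every sample."""
    batch["text_len"] = [len(text) for text in batch["text"]]
    return batch


def filter_by_length(samples: List[Sample], min_len: int,
                     batch_size: int) -> List[Sample]:
    """Keep samples whose text has at least min_len characters."""
    kept = []
    for batch in iter_batches(samples, batch_size):
        for sample in from_batch(compute_stats_batched(batch)):
            if sample["text_len"] >= min_len:
                kept.append(sample)
    return kept


data = [{"text": "a"}, {"text": "abcd"}, {"text": "abcdef"}]
kept = filter_by_length(data, min_len=3, batch_size=2)
```

The result is the same for any `batch_size`; only the amount of packing and unpacking changes. When the per-sample computation is as cheap as `len`, `to_batch` and `from_batch` can dominate the runtime, which is the case where batching makes things slower rather than faster.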

@BeachWang BeachWang added the enhancement New feature or request label Mar 29, 2024
@sherrytonger

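> Hi, batching here is mainly meant for the case where a mapper generates multiple samples from a single sample, so the result has to be wrapped as a batch when it is returned. Currently only mappers support batching, and the input batch size is fixed at 1. It would indeed be more reasonable for every type of op to support batching, and the batch size should be exposed to users as a setting. However, users may need to consider the overhead of building batches: if the speedup from the batched op is not enough to cover that overhead, processing may end up slower.

What does the overhead of batching consist of? Memory usage?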

@HYLcool
Collaborator

HYLcool commented Apr 24, 2024

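> What does the overhead of batching consist of? Memory usage?

Yes, memory is one factor. At the same degree of parallelism, a larger batch size means more data is being processed at the same time, so memory usage may be higher.

At present most Filter operators can only process samples one at a time, so increasing the batch size leaves relatively little room for speedup. If memory and other resources allow, increasing the parallelism np is the better option.

In addition, the main reason some Mappers are batched OPs is that they are used for data augmentation or data generation. Unlike the 1->1 mapping of an ordinary Mapper, they need a 1->N mapping, and we use batching to support this new type.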
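As a rough illustration of the 1->N case, the sketch below assumes the same column-wise batch layout (a dict of lists) and an input batch of size 1. `augment_batched` is a hypothetical function, not one of Data-Juicer's Mappers, and the uppercase copy merely stands in for a real augmentation.

```python
from typing import Dict, List


def augment_batched(batch: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the original text plus one augmented variant for every input sample."""
    out: Dict[str, List[str]] = {"text": []}
    for text in batch["text"]:
        out["text"].append(text)          # keep the original sample
        out["text"].append(text.upper())  # stand-in for a real augmentation
    return out


# Input batch of size 1, output batch of size 2: a 1->N mapping with N = 2.
result = augment_batched({"text": ["hello"]})
```

A function that takes and returns a single sample cannot express this, because it has nowhere to put the extra rows. Returning a batch lets the output hold more samples than the input.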


This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.


Close this stale issue.
