[Java|Go] Check string is ascii before using meta string encoding #1619

Closed
chaokunyang opened this issue May 9, 2024 · 6 comments · Fixed by #1631
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@chaokunyang
Collaborator

Is your feature request related to a problem? Please describe.

In #1514 and #1566, we encode every char using 5 or 6 bits. But we didn't check that the string is ASCII. In a UTF-8 string, some bytes fall in the ASCII range and some do not. Could we take a UTF-8 string as a meta string by accident?
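
To illustrate the concern, here is a small standalone sketch (not Fury code; the class and method names are hypothetical) that counts how many UTF-8 bytes of a string fall outside the 7-bit ASCII range. Any string with a non-zero count cannot be represented safely by an encoding that assumes ASCII input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Fury.
final class Utf8ByteDemo {
  /** Counts the UTF-8 bytes of s whose high bit is set, i.e. bytes outside 7-bit ASCII. */
  static int countNonAsciiBytes(String s) {
    int count = 0;
    for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
      if ((b & 0x80) != 0) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Pure ASCII: every byte is below 0x80.
    System.out.println(countNonAsciiBytes("org.apache.fury"));
    // U+00EF is encoded as two UTF-8 bytes (0xC3 0xAF), both above 0x7F.
    System.out.println(countNonAsciiBytes("na\u00EFve"));
  }
}
```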

Describe the solution you'd like

Check that the string is ASCII before using meta string encoding.
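
A minimal sketch of such a check, assuming the input is a Java `String` and that "ASCII" means every UTF-16 code unit is below `0x80`. The class and method names are hypothetical and are not taken from the Fury codebase.

```java
// Hypothetical helper, not part of Fury.
final class AsciiCheck {
  /** Returns true if every UTF-16 code unit of s is below 0x80 (7-bit ASCII). */
  static boolean isAscii(String s) {
    for (int i = 0; i < s.length(); i++) {
      // Surrogate halves are >= 0xD800, so supplementary characters are rejected too.
      if (s.charAt(i) >= 0x80) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isAscii("org.apache.fury"));
    System.out.println(isAscii("na\u00EFve"));
  }
}
```

The loop exits on the first non-ASCII character, so the cost is at most one pass over the string.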

Additional context

@chaokunyang chaokunyang added enhancement New feature or request good first issue Good for newcomers labels May 9, 2024
@jasonmokk
Contributor

jasonmokk commented May 9, 2024

Hi @chaokunyang I'd like to contribute to Apache Fury, can you please assign this issue to me?

@LiangliangSui
Contributor

Great! Thanks for your willingness to contribute to Fury.

chaokunyang pushed a commit that referenced this issue May 10, 2024

## What does this PR do?

This PR introduces a validation method to ensure that all input strings
to the `MetaString` encoder are ASCII.

## Related issues

- #1619


## Does this PR introduce any user-facing change?


- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?


## Benchmark


---------

Signed-off-by: Jason Mok <jjasonmok1@gmail.com>
@LiangliangSui
Contributor

// org/apache/fury/meta/MetaStringEncoder.java
public MetaString encode(String input) {
  if (input.isEmpty()) {
    return new MetaString(input, Encoding.UTF_8, specialChar1, specialChar2, new byte[0]);
  }
  Encoding encoding = computeEncoding(input);
  return encode(input, encoding);
}

Could we check here whether the input is entirely ASCII? If it is not, we can return a UTF-8 encoded MetaString directly, which saves the time spent in computeEncoding and encode.

WDYT @chaokunyang
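
The control flow being proposed could look roughly like the sketch below. It is self-contained and deliberately does not use Fury's real classes: `Kind` and `Result` are hypothetical stand-ins for `Encoding` and `MetaString`, and the ASCII branch only marks where `computeEncoding` and the bit-packing step would run.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the early-return idea, not Fury's MetaStringEncoder.
final class EarlyReturnSketch {
  // Stand-in for Fury's Encoding enum; COMPACT represents any 5/6-bit encoding.
  enum Kind { UTF_8, COMPACT }

  // Stand-in for MetaString: records which path was taken and the resulting bytes.
  static final class Result {
    final Kind kind;
    final byte[] bytes;

    Result(Kind kind, byte[] bytes) {
      this.kind = kind;
      this.bytes = bytes;
    }
  }

  static boolean isAscii(String s) {
    for (int i = 0; i < s.length(); i++) {
      if (s.charAt(i) >= 0x80) {
        return false;
      }
    }
    return true;
  }

  static Result encode(String input) {
    if (input.isEmpty()) {
      return new Result(Kind.UTF_8, new byte[0]);
    }
    if (!isAscii(input)) {
      // Early return: non-ASCII input skips encoding analysis entirely.
      return new Result(Kind.UTF_8, input.getBytes(StandardCharsets.UTF_8));
    }
    // Placeholder only: the real encoder would call computeEncoding and bit-pack here.
    return new Result(Kind.COMPACT, input.getBytes(StandardCharsets.US_ASCII));
  }
}
```

With this shape, the character-class analysis runs only for inputs that can actually benefit from the compact encodings.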

@LiangliangSui
Contributor

In addition, we need to add unit tests to cover this case.

@LiangliangSui LiangliangSui reopened this May 10, 2024
@jasonmokk
Contributor

jasonmokk commented May 10, 2024

@LiangliangSui Thanks for bringing that up. Can I submit another PR to correct that/add unit tests? I also think it would be best to do the ASCII check early and directly return a UTF-8 encoded MetaString.

@LiangliangSui
Contributor

can I submit another PR to correct that/add unit tests?

Sure, that is great!

LiangliangSui pushed a commit that referenced this issue May 14, 2024

## What does this PR do?

This PR enhances the current ASCII check (before meta string encoding) I
implemented in #1620 to return a UTF-8 encoded `MetaString` early if the
input is non-ASCII. This improves efficiency and saves time on
`computeEncoding` and `encode`. Unit tests are also added.




## Related issues
#1619

## Does this PR introduce any user-facing change?


- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?


## Benchmark


---------

Signed-off-by: Jason Mok <jjasonmok1@gmail.com>
@LiangliangSui LiangliangSui linked a pull request May 14, 2024 that will close this issue
3 participants