Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add takestring! and make String(::Memory) not truncate #54372

Open
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

jakobnissen
Copy link
Contributor

@jakobnissen jakobnissen commented May 6, 2024

This function creates a string from a byte vector, truncating the vector and reusing its memory where possible.

This also removes the truncating behaviour of String(::Memory) - see #54369

Still needs to be done

  • Make tests pass
  • Comb through Base to find places where take_string!(::Memory) should be used instead of String(::Memory) for efficiency [edit: couldn't find any]

Closes #54369

@jakobnissen jakobnissen marked this pull request as ready for review May 6, 2024 09:28
@jakobnissen jakobnissen added the status:awaiting review PR is complete and seems ready to merge. Has tests and news/compat if needed. CI failures unrelated. label May 6, 2024
@nhz2
Copy link
Contributor

nhz2 commented May 6, 2024

Also, check for uses of StringMemory for example:

slug = Base.StringMemory(len)
chars = b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for i = 1:len
slug[i] = chars[(Libc.rand() % length(chars)) + 1]
end
return String(slug)

#53962

base/strings/string.jl Outdated Show resolved Hide resolved
base/strings/string.jl Outdated Show resolved Hide resolved
@stevengj
Copy link
Member

stevengj commented May 6, 2024

The name makes me think that take_string!(iobuffer) should be a replacement/synonym for String(take!(iobuffer))?

@oscardssmith oscardssmith added the status:triage This should be discussed on a triage call label May 7, 2024
@oscardssmith
Copy link
Member

I would be in favor of that. Every time I have used an IOBuffer I have had to go to the docs to figure out how to get the data out of it correctly.

@jariji
Copy link
Contributor

jariji commented May 7, 2024

So what's the specification of String(_): does it truncate or not? Or does it depend on the specific argument type? An "it depends" is not a very nice property for generic programming.

@jakobnissen
Copy link
Contributor Author

It does not truncate - except when the argument is a Vector{UInt8}. Yes, that's bad and in my opinion it shouldn't have been like that. But it's documented, and therefore can't change. I made this PR to make sure more exceptions aren't added.

@stevengj
Copy link
Member

stevengj commented May 9, 2024

Do we need an underscore here? takestring! is pretty readable

base/strings/string.jl Outdated Show resolved Hide resolved
@oscardssmith
Copy link
Member

oscardssmith commented May 9, 2024

Triage had a long talk about this, #54369, #54156 and #54273. The conclusions we came to are

  1. String(Memory) should copy
  2. String(Array) should truncate the array but not the Memory
  3. triage was right last time wrt view and reshape of Memory
  4. using the an Array aliased with another Array that is turned into a String is UB.
  5. take_string! is a much better API for IOBuffer et al.

@oscardssmith oscardssmith added domain:arrays [a, r, r, a, y, s] domain:strings "Strings!" and removed status:triage This should be discussed on a triage call status:awaiting review PR is complete and seems ready to merge. Has tests and news/compat if needed. CI failures unrelated. labels May 9, 2024
@nhz2 nhz2 mentioned this pull request May 9, 2024
@stevengj
Copy link
Member

stevengj commented May 9, 2024

The style guide says not to use underscores "when readable" without, which I think applies here.

@jakobnissen jakobnissen changed the title Add take_string! and make String(::Memory) not truncate Add takestring! and make String(::Memory) not truncate May 10, 2024
@topolarity
Copy link
Member

topolarity commented May 10, 2024

  1. using the an Array aliased with another Array that is turned into a String is UB.

Can this be refined to "mutating an Array after its alias has been turned into a String is UB"?

edit: For the record , I don't think it's right for us to put this contract on an API that's not marked unsafe. I expect that this 'contract' is broken much more often than the UB is actually triggered though, which is probably what's saving us in practice.

@oscardssmith
Copy link
Member

Yes. observing it is not great but not UB. the only UB is if you mutate it.

@nhz2
Copy link
Contributor

nhz2 commented May 10, 2024

takestring! can mutate the bytes in the underlying memory of a Vector input, it is not just making a String view.

@jakobnissen
Copy link
Contributor Author

jakobnissen commented May 10, 2024

After triage, this PR now contains the following changes:

  • Add takestring!. This function can be used with IOBuffer and Vector{UInt8}, returning a String. It resets the input to its initial state. This means it empties vectors, and empties IOBuffer if and only if the buffer is writable. This behaviour of not empying the input if the input is not writable is slightly weird, but matches take!
  • Add the internal methods Base.unsafe_takestring! and Base.unsafe_takestring, which are similar to takestring!, except they leave the argument (IOBuffer or Memory) in an unusable corrupt state. They are useful for the common pattern when code creates a buffer, takes a string from the buffer and then returns the string without touching the buffer again.
  • String(::Memory) now no longer truncates the memory
  • String(::Vector{UInt8}) now no longer truncates the underlying memory of the string. It still empties the vector, and assigns a new (empty) memory to the vector just like before.
  • The many different calls to String(take!(::IOBuffer)) and String(_unsafe_take!(::IOBuffer)) has been replaced with takestring! and the two internal methods Base.unsafe_takestring! and Base.unsafe_takestring). Not all calls to String(take!(...)) has been replaced, as there are many, many of such calls in the test suite and no need to change them.
  • Replace most uses of String(take!(::IOBuffer)) in Base and the stdlibs
  • Add tests

jakobnissen and others added 14 commits May 13, 2024 13:36
This commit builds upon JuliaLang#54438, re-using the old String(::Vector{UInt8}) impl,
which now no longer truncates.
Also remove takestring!(::Memory) - that function probably should not be defined
Co-authored-by: Steven G. Johnson <stevenj@mit.edu>
And use it unstead of takestring!(::Memory)
@jakobnissen jakobnissen added the status:merge me PR is reviewed. Merge when all tests are passing label May 13, 2024
@jakobnissen
Copy link
Contributor Author

This should be good to go as soon as tests pass. See this comment for a complete list of changes in this PR: #54372 (comment)

@KristofferC
Copy link
Sponsor Member

KristofferC commented May 13, 2024

The current behavior is already approved by triage (#24388 (comment)) so it feels a bit odd that there is a new triage later that just goes and overrides this without any reference (at least from what I can see) to the original decision.

This also lacks review from people heavily invested in the original design like @JeffBezanson.

base/iobuffer.jl Show resolved Hide resolved
@KristofferC KristofferC removed the status:merge me PR is reviewed. Merge when all tests are passing label May 13, 2024
@oscardssmith
Copy link
Member

@KristofferC It is useful context and was not discussed in the previous triage call. That said, I think none of what triage said in this call contradicts the previous call. Almost all of the decision points for triage here were about the behavior when Memory is involved which didn't exist in 2017. The reason re-discussion of such a similar issue was needed is that as of the previous conversation, the solution was to 0 out the length field which was relatively bullet-proof since Julia at the time didn't let you access the underlying fixed length thing upon which Vector is built. As of Julia 1.11, Array is a "normal" julia object, and Memory is the fixed length backing, so the conversation got pushed a layer further down, at which point the previous solution no longer works without some modification.

The one part where the current triage call somewhat diverged from the previous is the decision that a new function that does the combination of removing data from an IOBuffer and turning that into a String is a good idea. That doesn't seem to have been on the radar as a possible option in the original triage decision, and I do not know why.

@topolarity
Copy link
Member

topolarity commented May 13, 2024

Is the problem here then that we don't have any way to do a safe runtime "move" out of a Memory?

I'm aware of three options for a copy/performance tradeoff like this:

  • guarantee a copy
  • leave behind an empty object (C++-style "move")
  • leave behind an invalid object (e.g. one which may be potentially freed or whose mutation violates immutability, etc. - triggers UB)

I think our policy on the ownership theft of String was that it was allowed because we implemented a runtime "move" in the style of (2). The user might see unexpected mutation/moves, but usage of the left-over object doesn't directly trigger UB.

I'm assuming that's not an option for Memory - because of the assumption that it's always allocated and fixed-size?

@nhz2
Copy link
Contributor

nhz2 commented May 13, 2024

String(Memory) has to copy according to the docs of String since Memory is an AbstractVector{UInt8} and isn't a Vector{UInt8}.

@jakobnissen
Copy link
Contributor Author

jakobnissen commented May 14, 2024

On @KristofferC 's recommendation, I have now read through the old decisions in #24388, #25241, #25846 and #26093.
I still believe the changes in this PR and #54452 are for the better.

In #24388, the two main points are:
1. It should not be possible to mutate strings except through unsafe_.
2. Always requiring a copy is a performance disaster

The other PRs are less relevant - they pertain to codeunits and unsafe_wrap, except #26093 which reiterates that strings should not be mutated, and also says:

One place where something like String! or string! would be valuable though is to replace String(take!(iobuffer)). Perhaps take!(iobuffer, String). That would allow us to avoid an extra allocation since we know the caller doesn't want the vector, just the string underneath, so the buffer can reuse the vector object (just not its storage).

Which essentially suggests encapsulating this whole sketchy thing with creating strings from mutable data in a single function, as done here.

So, I just don't see how the old triage decisions conflict with the PR here, and given that triage has approved this already, I think we should move forward with it.

To motivate it a little more, I think the status quo is not acceptable because it leads to memory bugs too easily:

Here is a way that works on released versions of Julia unless this PR or #54452 is applied to mutate strings

b = IOBuffer()
write(b, "abc")
v1 = take!(b)
v2 = view(v1, :)
s = String(v1) # "abc"
fill!(v2, 0)
s # now all zeros

Here is a bug on master where a vector now is able to read from out of bounds (fixed by this PR):

v1 = Memory{UInt8}(codeunits("abc"))
v2 = view(v1, 1:3)
s = String(v1)
println(isempty(v2.ref.mem)) # true
v2[1] # accesses v2.ref.mem OOB

More broadly speaking, I'm not suggesting that we compromise on our APIs in order to categorically prevent unsafe behaviour. We're not Rust, and need to be practical about what is unsafe. However:

  1. This is not some crazy edge case, it's pretty easily accessible memory bugs that can happen by accident way too easily through no obvious fault of the user

  2. This is not an API compromise. In my opinion, takestring! provided here, and the change in Do not give StringVectors to users #54452 makes Julia's API more consistent, and less surprising, because it allows users to not have to rely on the weird magic behaviour of truncating String, while not providing people with vectors that are sometimes string-backed and sometimes not.

base/iobuffer.jl Outdated Show resolved Hide resolved
@JeffBezanson
Copy link
Sponsor Member

@jakobnissen Thank you for your tireless efforts here. I do like takestring! and I agree with the approach recommended by triage. See comments about ensuring copies are avoided where possible.

From the implementation here the difference between takestring! and unsafe_takestring! is not clear to me. Can an unsafe version share data in more cases? And my understanding is that the unsafe version would be used internally when we know the IOBuffer has no other references? However in the changes here it is not consistently used that way. Maybe we don't need both functions for IOBuffer?

@jakobnissen
Copy link
Contributor Author

jakobnissen commented May 18, 2024

Thanks for the feedback. I've now changed both takestring! and unsafe_takestring! to avoid making copies, where possible. I didn't do that previously, because I mistakenly thought that IOBuffers' memory wasn't backed by strings so it would copy anyway. That was weird, because I did know that it was supposed to be zero-copy, but I guess after revising it a few times I lost the forest for the trees.

The unsafe variant is a tiny bit faster because a) it assumes the buffer is writable and seekable, and b) it doesn't spend time resetting the buffer to a usable state. In benchmarks, it's only about 10% faster, but still, worth having IMO.

Edit: I think the 32-bit Windows test failure is spurious. Tests do seem to pass.

Edit2: With these changes, takestring! is now a little faster than String(take!(buf)):
The following code

function foo(data)
    buf = IOBuffer()
    write(buf, data)
    String(take!(buf)) # or takestring!(buf) or unsafe_takestring!(buf)
end

@benchmark foo("abc")

Takes the following time:

  • String(take!(buf)) on 1.11-beta1: 74 ns
  • String(take!(buf)) on this PR: 64 ns
  • takestring!(buf): 54 ns
  • unsafe_takestring!(buf): 48 ns

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:arrays [a, r, r, a, y, s] domain:strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

String(v::Memory{UInt8}) doesn't make a copy
8 participants