Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove liboauthcpp #8030

Merged
merged 2 commits into from May 20, 2024
Merged

Remove liboauthcpp #8030

merged 2 commits into from May 20, 2024

Conversation

Osyotr
Copy link
Contributor

@Osyotr Osyotr commented Apr 29, 2024

Depends on #8048

@Osyotr Osyotr marked this pull request as draft April 29, 2024 06:02
coding/base64.cpp Outdated Show resolved Hide resolved
coding/sha1.cpp Show resolved Hide resolved
@Osyotr Osyotr marked this pull request as ready for review April 29, 2024 21:30
coding/sha1.hpp Show resolved Hide resolved
coding/base64.cpp Outdated Show resolved Hide resolved
coding/coding_tests/sha1_test.cpp Show resolved Hide resolved
coding/coding_tests/sha1_test.cpp Show resolved Hide resolved
coding/coding_tests/sha1_test.cpp Outdated Show resolved Hide resolved
uint32_t digest[5];
sha1.get_digest(digest);
for (auto & b : digest)
b = boost::core::byteswap(b);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Why do you think a byte swap is necessary compared to the previous implementation? Why are implementations different?
  2. How critical is ExtractHash performance? Does it make sense to paste here boost's implementation and fix it to avoid bits rotation? Or maybe paste here the necessary part from liboauth implementation? Which implementation is faster, or are they the same?

Copy link
Contributor Author

@Osyotr Osyotr Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think a byte swap is necessary compared to the previous implementation?

Digest from liboauthcpp is little-endian (controlled by SHA1_LITTLE_ENDIAN).

#if !defined(SHA1_LITTLE_ENDIAN) && !defined(SHA1_BIG_ENDIAN)
#define SHA1_LITTLE_ENDIAN
#endif

Digest from boost is little-endian big-endian.

Why are implementations different?

I can only speculate but it's probably just a preference.

How critical is ExtractHash performance?

Not critical. It's used to verify downloaded map data in a separate thread.

Does it make sense to paste here boost's implementation and fix it to avoid bits rotation? Or maybe paste here the necessary part from liboauth implementation?

It's up to you to decide ;)

Which implementation is faster, or are they the same?

Since the code is not critical, I don't see a reason to write benchmarks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Digest from boost is little-endian.

You likely meant big-endian? Does the result depend on the current arch?

Maybe use boost::endian::endian_reverse_inplace ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You likely meant big-endian?

Yes.

Does the result depend on the current arch?

Hmm I have no idea how to verify that.

Osyotr added 2 commits May 1, 2024 02:09
Signed-off-by: Osyotr <Osyotr@users.noreply.github.com>
Signed-off-by: Osyotr <Osyotr@users.noreply.github.com>
{
SHA1::Hash ExtractHash(boost::uuids::detail::sha1 & sha1)
{
uint32_t digest[5];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

boost::uuids::detail::sha1::digest_type ?

uint32_t digest[5];
sha1.get_digest(digest);
for (auto & b : digest)
b = boost::core::byteswap(b);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Digest from boost is little-endian.

You likely meant big-endian? Does the result depend on the current arch?

Maybe use boost::endian::endian_reverse_inplace ?


SHA1::Hash result;
static_assert(result.size() == sizeof(digest));
std::copy_n(reinterpret_cast<uint8_t const *>(digest), sizeof(digest), std::begin(result));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be avoided in favor of writing into result directly?

@Osyotr
Copy link
Contributor Author

Osyotr commented May 2, 2024

I'm inclined to C&P one of implementations into the codebase, but you need to decide:

  1. Keep as-is
    • ugly conversions
  2. C&P and modify boost::uuids::detail::sha1:
    • Is in namespace detail, so there's one in a million chance that it's API and/or behavior will change
    • BSL-1.0 licenced
  3. C&P CSHA1 from liboauthcpp
    • Does not require additional changes compared to the original implementation
    • 100% free public domain for whatever that means

@biodranik
Copy link
Member

Lincenses are not a big deal, they're already included in copyright.txt

Does the currently printed sha match standard Linux/mac utilities output? I've checked it with

echo 'Organic Maps is the ultimate companion app for travellers, tourists, hikers, and cyclists!' | sha1sum -b
d73a52fac194560cfe61583253f38b0518e279b9 *-

... and the output doesn't match. Does Boost's default output without modifications match it?

@vng the checksum calculation can be safely migrated to a different algo, right? A new version will contain a new algo with newly generated checksums.

Are there better/faster algos with a shorter hash?

@biodranik
Copy link
Member

If the algo can be easily replaced, then this PR can be merged, and replacement can be done later. If it's better to leave the same implementation, then less ugly/faster one is preferred.

The worst case is when a user downloads 65+Gb of mwm (whole planet) and all hashes are calculated.

@vng
Copy link
Member

vng commented May 2, 2024

Why SHA is different? unit test shows the same sha? It should be the same for current countries.txt

@Osyotr
Copy link
Contributor Author

Osyotr commented May 2, 2024

and the output doesn't match

I honestly don't know why. PowerShell and various online services give the correct output.

Are there better/faster algos with a shorter hash?

https://github.com/Cyan4973/xxHash?tab=readme-ov-file#benchmarks

@biodranik
Copy link
Member

Why SHA is different? unit test shows the same sha? It should be the same for current countries.txt

My bad, I forgot -n parameter to echo and it included a newline. SHA1 matches.

@biodranik
Copy link
Member

https://github.com/Cyan4973/xxHash?tab=readme-ov-file#benchmarks

Thanks! What's the best/simplest can be used?

@Osyotr
Copy link
Contributor Author

Osyotr commented May 4, 2024

Thanks! What's the best/simplest can be used?

Depends on the usage.
Non-cryptographic: xxhash64
Cryptographic: blake3

https://jolynch.github.io/posts/use_fast_data_algorithms/

@biodranik
Copy link
Member

We don't need cryptographic strength here. Would it be hard to integrate https://xxhash.com/ ?

@Osyotr
Copy link
Contributor Author

Osyotr commented May 4, 2024

We don't need cryptographic strength here. Would it be hard to integrate https://xxhash.com/ ?

Here's drop-in replacement:

3party/CMakeLists.txt:

include(FetchContent)
FetchContent_Declare(
  xxHash
  GIT_REPOSITORY https://github.com/Cyan4973/xxHash.git
  GIT_TAG        bbb27a5efb85b92a0486cf361a8635715a53f6ba # v0.8.2
  SOURCE_SUBDIR cmake_unofficial
  #FIND_PACKAGE_ARGS NAMES xxHash # CMake 3.24+
)

set(BUILD_SHARED_LIBS OFF)
set(XXHASH_BUILD_XXHSUM OFF)
FetchContent_MakeAvailable(xxHash)

coding/CMakeLists.txt

target_link_libraries(${PROJECT_NAME} xxHash::xxhash)
uint64_t Calculate(std::string const & filePath)
{
  uint32_t constexpr kFileBufferSize = 8192;
  try
  {
    base::FileData file(filePath, base::FileData::OP_READ);
    uint64_t const fileSize = file.Size();

    struct XXH3StateDeleter
    {
      void operator()(XXH3_state_t * p) const
      {
        (void)XXH3_freeState(p);
      }
    };
    using XXH3StatePtr = std::unique_ptr<XXH3_state_t, XXH3StateDeleter>;
    XXH3StatePtr state = XXH3StatePtr(XXH3_createState());
    if (!state || XXH3_64bits_reset(state.get()) != XXH_OK)
    {
      LOG(LERROR, ("Could not create XXH3 state."));
      return {};
    }

    uint64_t currSize = 0;
    unsigned char buffer[kFileBufferSize];
    while (currSize < fileSize)
    {
      auto const toRead = std::min(kFileBufferSize, static_cast<uint32_t>(fileSize - currSize));
      file.Read(currSize, buffer, toRead);
      if (XXH3_64bits_update(state.get(), buffer, toRead) != XXH_OK)
      {
        LOG(LERROR, ("Could not update XXH3 state."));
        return {};
      }
      currSize += toRead;
    }

    return XXH3_64bits_digest(state.get());
  }
  catch (Reader::Exception const & ex)
  {
    LOG(LERROR, ("Error reading file:", filePath, ex.what()));
  }
  return {};
}

uint64_t CalculateForString(std::string_view data)
{
  return XXH3_64bits(data.data(), data.size());
}

Note that you need to update countries.txt to reflect changes (I don't know the current process so can't help here).

Copy link
Member

@vng vng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the next steps here? LGTM, but first of all we can ensure that tests only commit works on current master.

Copy link
Member

@biodranik biodranik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Getting rid of some 3p library is good
  • Introducing a change brings potential risks
  • Changing to boost instead of already working 3p lib without clear benefits brings a bit of risk
  • Changing to a faster, more optimal implementation may bring benefits that overweight risks of introducing changes

Using FetchContent in CMake looks risky to me compared to using submodules or plain copies of the code in the main repo, because if there's no internet connection then you're stuck. Maybe we need to get used to it first.

Regarding the choice of the implementation: we don't need constexpr calculations at compile time, do we?

There is BSD-licensed C++17 header-only implementation, that can be easily copied into the main repo without any dependencies and complexities: https://github.com/RedSpah/xxhash_cpp

Changing client's code will also require modifying the code in the python generator part in tools/python/post_generation/hierarchy_to_countries.py:

def get_mwm_hash(path, name):
    filename = os.path.join(path, f"{name}.mwm")
    h = hashlib.sha1()
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            h.update(chunk)
    return str(base64.b64encode(h.digest()), "utf-8")

We likely may use this drop-in replacement: https://pypi.org/project/xxhash/ , it will require adding one more dependency to the generator.

WDYT about it?

@vng
Copy link
Member

vng commented May 20, 2024

This PR makes our codebase much better and doesn't break an existing behavior.

Changing hash for countries is a more complex task and may break existing behavior. Can investigate in a separate PR.

@vng vng merged commit 621eaaf into organicmaps:master May 20, 2024
15 checks passed
@Osyotr Osyotr deleted the liboauthcpp-removal branch May 20, 2024 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants