Add an MD5() function to NebulaGraph #5840

Open
QingZ11 opened this issue Mar 21, 2024 · 5 comments
Labels
good first issue Community: perfect as the first pull request type/feature req Type: feature request

Comments

@QingZ11
Contributor

QingZ11 commented Mar 21, 2024

The idea was provided by @AntiTopQuark. Thank you!

Requirement Description

MD5 (Message-Digest Algorithm 5) is a widely used hash function that produces a 128-bit digest, usually written as 32 hexadecimal characters. It is commonly used as a checksum to verify the integrity and consistency of transmitted data, although it is no longer considered collision-resistant and should not be relied on for security-sensitive purposes. In graph databases, MD5 can be used to compare node or edge attribute values quickly, to derive compact identifiers from attribute values, and so on. Implementing an MD5 function will therefore enhance NebulaGraph's data processing capabilities and provide more flexibility in data operations.
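To make the "128 bits" concrete: a digest is 16 raw bytes, and the string form that the tests later in this issue compare against is those bytes written as 32 lowercase hexadecimal characters. The sketch below shows only that encoding step. The function name toHex is a hypothetical helper made up for this illustration and is not part of NebulaGraph or proxygen.

```cpp
#include <array>
#include <cstdint>
#include <string>

// Render a 16-byte (128-bit) digest as 32 lowercase hexadecimal characters.
// Hypothetical helper for illustration only.
std::string toHex(const std::array<std::uint8_t, 16>& digest) {
  static const char kHexDigits[] = "0123456789abcdef";
  std::string out;
  out.reserve(digest.size() * 2);
  for (std::uint8_t byte : digest) {
    out.push_back(kHexDigits[byte >> 4]);    // high nibble first
    out.push_back(kHexDigits[byte & 0x0F]);  // then low nibble
  }
  return out;
}
```

Each byte contributes exactly two characters, so the result always has length 32 regardless of the input that was hashed.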

NebulaGraph already supports a variety of functions and expressions, such as split and concat. The implementations of these functions provide us with a reference framework for writing custom functions, which includes but is not limited to:

  • Function registration: Understanding how to register new functions into NebulaGraph's function library.

  • Argument handling: Becoming familiar with how to process input arguments, including type checking and conversion.

  • Result returning: Mastering how to compute results and return them to the caller in the appropriate data type.
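The three concerns listed above can be sketched with a toy registry. This is a minimal, self-contained illustration and not NebulaGraph's actual FunctionManager: Value, Null, BadType, FunctionAttr, Registry, makeRegistry, and the length function are all hypothetical names invented for the example.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT NebulaGraph types.
struct Null {};     // a NULL argument or result
struct BadType {};  // an argument of an unsupported type, or a bad call
using Value = std::variant<Null, BadType, long long, std::string>;

struct FunctionAttr {
  std::size_t minArity = 0;
  std::size_t maxArity = 0;
  std::function<Value(const std::vector<Value>&)> body;
};

class Registry {
 public:
  // Function registration: store arity bounds and body under a name.
  void add(const std::string& name, FunctionAttr attr) {
    functions_[name] = std::move(attr);
  }

  // Look the function up, check the argument count, then run the body.
  Value call(const std::string& name, const std::vector<Value>& args) const {
    auto it = functions_.find(name);
    if (it == functions_.end()) return BadType{};
    const FunctionAttr& attr = it->second;
    if (args.size() < attr.minArity || args.size() > attr.maxArity) {
      return BadType{};
    }
    return attr.body(args);
  }

 private:
  std::unordered_map<std::string, FunctionAttr> functions_;
};

// Build a registry holding one toy function, `length`.
Registry makeRegistry() {
  Registry registry;
  registry.add("length", {1, 1, [](const std::vector<Value>& args) -> Value {
    const Value& arg = args[0];
    // Argument handling: NULL in, NULL out.
    if (std::holds_alternative<Null>(arg)) return Null{};
    // Result returning: compute and wrap in the result type.
    if (const auto* s = std::get_if<std::string>(&arg)) {
      return static_cast<long long>(s->size());
    }
    return BadType{};  // any other argument type is rejected
  }});
  return registry;
}
```

Calling makeRegistry().call("length", ...) with a string argument yields its length, a Null argument yields Null, and an integer argument or a wrong argument count yields BadType. The md5 implementation later in this issue follows the same three-way split on the argument type.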

Implementation Approach

  1. Refer to Existing Functions

The official documentation shows that NebulaGraph already implements many functions and expressions, such as the string functions. These implementations are valuable references: searching the codebase for a specific function name, such as json_extract, quickly leads to its implementation and to the commits and pull request that introduced it.

By analyzing these historical commits, we learn that implementing a new function involves the following steps:

  • Add the registration and computational logic of the function in the src/common/function/FunctionManager.cpp file.

  • Add corresponding unit tests in the src/common/function/test/FunctionManagerTest.cpp file.

  • For new features, add TCK integration tests.

  • (Optional) Update the documentation in the nebula-docs repository, providing explanations and examples for the new feature to help users better understand and use the new functionality.


  2. MD5 Function Implementation

We may not need to implement the MD5 algorithm from scratch. A global search of the codebase for existing MD5 usages quickly turns up implementations we can reuse, such as proxygen::md5Encode, which the code in the next step calls.


  3. Related Code Implementation

Drawing from the implementation of the json_extract function, we can write the relevant code for computing MD5 and compile and test it to verify its correctness.

# git diff src/common/function/FunctionManager.cpp
diff --git a/src/common/function/FunctionManager.cpp b/src/common/function/FunctionManager.cpp
index cc2393a77..6ae2b445e 100644
--- a/src/common/function/FunctionManager.cpp
+++ b/src/common/function/FunctionManager.cpp
@@ -11,6 +11,7 @@
 #include <boost/algorithm/string.hpp>
 #include <boost/algorithm/string/replace.hpp>
 #include <cstdint>
+#include <proxygen/lib/utils/CryptUtil.h>

 #include "FunctionUdfManager.h"
 #include "common/base/Base.h"
@@ -442,6 +443,7 @@ std::unordered_map<std::string, std::vector<TypeSignature>> FunctionManager::typ
      {TypeSignature({Value::Type::STRING}, Value::Type::MAP),
       TypeSignature({Value::Type::STRING}, Value::Type::NULLVALUE)}},
     {"score", {TypeSignature({}, Value::Type::__EMPTY__)}},
+    {"md5", {TypeSignature({Value::Type::STRING}, Value::Type::STRING)}},  // takes a STRING, returns a STRING
 };

 // static
@@ -3000,6 +3002,27 @@ FunctionManager::FunctionManager() {
       return Value::kNullValue;
     };
   }
+  // md5 function
+  {
+    auto &attr = functions_["md5"];
+    attr.minArity_ = 1;
+    attr.maxArity_ = 1;
+    attr.isAlwaysPure_ = true;
+    attr.body_ = [](const auto &args) -> Value {
+      switch (args[0].get().type()) {
+        case Value::Type::NULLVALUE: {
+          return Value::kNullValue;
+        }
+        case Value::Type::STRING: {
+          std::string value(args[0].get().getStr());
+          return proxygen::md5Encode(folly::StringPiece(value));
+        }
+        default: {
+          return Value::kNullBadType;
+        }
+      }
+    };
+  }
 }  // NOLINT

 // static

After starting the server, we need to verify that the implemented feature meets the expectations.


  4. Add Unit Tests

Referring to the test cases of the json_extract function, we added some unit tests for the MD5 function to ensure that it can correctly handle NULL values and regular values.

# git diff src/common/function/test/FunctionManagerTest.cpp
diff --git a/src/common/function/test/FunctionManagerTest.cpp b/src/common/function/test/FunctionManagerTest.cpp
index 88ff49888..5e15b8fba 100644
--- a/src/common/function/test/FunctionManagerTest.cpp
+++ b/src/common/function/test/FunctionManagerTest.cpp
@@ -170,7 +170,8 @@ std::unordered_map<std::string, std::vector<Value>> FunctionManagerTest::args_ =
     {"json_extract1", {"{\"a\": 1, \"b\": 0.2, \"c\": {\"d\": true}}"}},
     {"json_extract2", {"_"}},
     {"json_extract3", {"{a: 1, \"b\": 0.2}"}},
-    {"json_extract4", {"{\"a\": \"foo\", \"b\": 0.2, \"c\": {\"d\": {\"e\": 0.1}}}"}}};
+    {"json_extract4", {"{\"a\": \"foo\", \"b\": 0.2, \"c\": {\"d\": {\"e\": 0.1}}}"}},
+    {"md5", {"abcdefghijkl"}}};

 #define TEST_FUNCTION(expr, ...)                   \
   do {                                             \
@@ -248,6 +249,7 @@ TEST_F(FunctionManagerTest, testNull) {
   TEST_FUNCTION(concat, args_["nullvalue"], Value::kNullValue);
   TEST_FUNCTION(concat_ws, std::vector<Value>({Value::kNullValue, 1, 2}), Value::kNullValue);
   TEST_FUNCTION(concat_ws, std::vector<Value>({1, 1, 2}), Value::kNullValue);
+  TEST_FUNCTION(md5, args_["nullvalue"], Value::kNullValue);
 }

 TEST_F(FunctionManagerTest, functionCall) {
@@ -474,6 +476,10 @@ TEST_F(FunctionManagerTest, functionCall) {
                   args_["json_extract4"],
                   Value(Map({{"a", Value("foo")}, {"b", Value(0.2)}, {"c", Value(Map())}})));
   }
+  {
+     TEST_FUNCTION(
+        md5, args_["md5"], "9fc9d606912030dca86582ed62595cf7");
+  }
   {
     auto result = FunctionManager::get("hash", 1);
     ASSERT_TRUE(result.ok());
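
When adding expected digests to a test, it can help to cross-check them against an implementation that does not depend on proxygen. The sketch below is a compact, standalone MD5 following RFC 1321. It is not the code NebulaGraph uses (the implementation above calls proxygen::md5Encode), and the name md5Hex is made up for this example. The per-round constants are derived from the sine function, as the RFC defines them, rather than hard-coded.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Rotate a 32-bit word left; `bits` is always between 4 and 23 here.
static std::uint32_t rotl(std::uint32_t x, unsigned bits) {
  return (x << bits) | (x >> (32 - bits));
}

// Standalone MD5 (RFC 1321); returns 32 lowercase hex characters.
// Hypothetical helper for cross-checking test values only.
std::string md5Hex(const std::string& input) {
  static const unsigned kShift[64] = {
      7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22,
      5, 9,  14, 20, 5, 9,  14, 20, 5, 9,  14, 20, 5, 9,  14, 20,
      4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23,
      6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21};

  // Padding: 0x80, zeros up to 56 mod 64, then the bit length (64-bit LE).
  std::vector<std::uint8_t> msg(input.begin(), input.end());
  const std::uint64_t bitLen = static_cast<std::uint64_t>(input.size()) * 8;
  msg.push_back(0x80);
  while (msg.size() % 64 != 56) msg.push_back(0x00);
  for (int i = 0; i < 8; ++i) {
    msg.push_back(static_cast<std::uint8_t>(bitLen >> (8 * i)));
  }

  std::uint32_t state[4] = {0x67452301u, 0xefcdab89u, 0x98badcfeu, 0x10325476u};

  for (std::size_t off = 0; off < msg.size(); off += 64) {
    // Decode the 64-byte block into sixteen little-endian 32-bit words.
    std::uint32_t m[16] = {0};
    for (int j = 0; j < 16; ++j) {
      for (int k = 0; k < 4; ++k) {
        m[j] |= static_cast<std::uint32_t>(msg[off + 4 * j + k]) << (8 * k);
      }
    }
    std::uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
    for (unsigned i = 0; i < 64; ++i) {
      std::uint32_t f;
      unsigned g;
      if (i < 16) {
        f = (b & c) | (~b & d);
        g = i;
      } else if (i < 32) {
        f = (d & b) | (~d & c);
        g = (5 * i + 1) % 16;
      } else if (i < 48) {
        f = b ^ c ^ d;
        g = (3 * i + 5) % 16;
      } else {
        f = c ^ (b | ~d);
        g = (7 * i) % 16;
      }
      // Round constant: floor(2^32 * |sin(i + 1)|), with i + 1 in radians.
      const std::uint32_t k = static_cast<std::uint32_t>(
          std::floor(std::fabs(std::sin(i + 1.0)) * 4294967296.0));
      const std::uint32_t sum = a + f + k + m[g];
      a = d;
      d = c;
      c = b;
      b = b + rotl(sum, kShift[i]);
    }
    state[0] += a;
    state[1] += b;
    state[2] += c;
    state[3] += d;
  }

  // Emit the four state words in little-endian byte order as hex.
  static const char kHex[] = "0123456789abcdef";
  std::string out;
  for (std::uint32_t word : state) {
    for (int i = 0; i < 4; ++i) {
      const unsigned byte = (word >> (8 * i)) & 0xFFu;
      out.push_back(kHex[byte >> 4]);
      out.push_back(kHex[byte & 0x0Fu]);
    }
  }
  return out;
}
```

A helper like this can be compiled on its own and called with the same inputs as the unit and TCK tests, for example md5Hex("abcdefg"), to confirm the digest before committing it as an expected value.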

The MD5 implementation uses the third-party proxygen::md5Encode function. Compiling the unit tests directly may fail with link errors such as undefined references, so we need to add the proxygen libraries to the unit tests' CMakeLists.txt.

# git diff src/common/function/test/CMakeLists.txt
diff --git a/src/common/function/test/CMakeLists.txt b/src/common/function/test/CMakeLists.txt
index ab547cf44..c0410a30c 100644
--- a/src/common/function/test/CMakeLists.txt
+++ b/src/common/function/test/CMakeLists.txt
@@ -19,6 +19,7 @@ nebula_add_test(
     LIBRARIES
         gtest
         gtest_main
+        ${PROXYGEN_LIBRARIES}
 )

 nebula_add_test(
@@ -37,6 +38,7 @@ nebula_add_test(
         $<TARGET_OBJECTS:fs_obj>
     LIBRARIES
         gtest
+        ${PROXYGEN_LIBRARIES}
 )

 nebula_add_test(
@@ -52,5 +54,6 @@ nebula_add_test(
     LIBRARIES
         gtest
         gtest_main
+        ${PROXYGEN_LIBRARIES}
 )

Execute the unit tests to see if the results meet the expectations:

root@19e780468222:/home/build/src/common/function/test# pwd
/home/build/src/common/function/test
root@19e780468222:/home/build/src/common/function/test# make test
Running tests...
Test project /home/build/src/common/function/test
    Start 1: function_manager_test
1/3 Test #1: function_manager_test ............   Passed    0.04 sec
    Start 2: twice_timezone_conversion_test
2/3 Test #2: twice_timezone_conversion_test ...   Passed    0.03 sec
    Start 3: agg_function_manager_test
3/3 Test #3: agg_function_manager_test ........   Passed    0.02 sec

100% tests passed, 0 tests failed out of 3

Label Time Summary:
common/function    =   0.08 sec*proc (3 tests)

Total Test time (real) =   0.09 sec

Great, the results are as expected.

  5. Add TCK Tests

Referring to the article on how to add a test case to NebulaGraph, we added TCK integration tests for the MD5 function and verified its correctness.

Feature: md5 Function

  Background:
    Test md5 function

  Scenario: Test md5 function Positive Cases
    When executing query:
      """
      YIELD md5('abcdefg') AS result;
      """
    Then the result should be, in any order:
      | result                             |
      | "7ac66c0f148de9519b8bd264312c4d64" |
    When executing query:
      """
      YIELD md5('1234567') AS result;
      """
    Then the result should be, in any order:
      | result                             |
      | "fcea920f7412b5da7be0cf42b8c93759" |
  Scenario: Test md5 function Cases With NULL
    When executing query:
      """
      YIELD md5(null) AS result;
      """
    Then the result should be, in any order:
      | result   |
      | NULL     |

Execute the integration tests and verify the results to ensure everything works as expected.


  6. (Optional) Add New Feature Description to nebula-docs

Writing clear documentation is very valuable to help users better understand and use the newly added MD5 function. We encourage each contributor to write related documentation when implementing new features, which is beneficial not only to users but also enhances the contributor's sense of achievement.


@QingZ11 QingZ11 added good first issue Community: perfect as the first pull request type/feature req Type: feature request labels Mar 21, 2024
@fansehep

Hi @QingZ11, thanks for the helpful guide. I tried to follow it, but after finishing I got a linker error when compiling.

(screenshot of the linker error)
I built it in nebula-dev:ubuntu2004.
It seems the library fails to link against wangle. I don't have any experience with CMake. Can you help me solve it?

@QingZ11
Contributor Author

QingZ11 commented Mar 31, 2024

Hi @fansehep, thank you for trying this out in practice. I am glad to help. @Salieri-004 is a core developer of NebulaGraph, and he will help you with this issue.

@Salieri-004
Contributor

@fansehep Hello, have you tried modifying the CMake files as described in the Implementation Approach?

@fansehep

Yes, I modified it, but that change only affects the test binaries. I think the error occurs because the library is not linked against wangle correctly.

nebula_add_library(
    function_manager_obj OBJECT
    FunctionManager.cpp
    ../geo/GeoFunction.cpp
    FunctionUdfManager.cpp
    GraphFunction.h
)
nebula_add_library(
    agg_function_manager_obj OBJECT
    AggFunctionManager.cpp
)
nebula_add_subdirectory(test)

Could the error be related to this?

@Salieri-004
Contributor

Salieri-004 commented Mar 31, 2024

From the error messages in your screenshot, some test targets fail to link because they cannot find the implementation of md5Encode. I believe adding ${PROXYGEN_LIBRARIES} to those targets may be enough to solve your problem.
For example:

nebula_add_executable(
    NAME arena_bm
    SOURCES ArenaBenchmark.cpp
    OBJECTS
        $<TARGET_OBJECTS:base_obj>
        $<TARGET_OBJECTS:datatypes_obj>
        $<TARGET_OBJECTS:expression_obj>
        $<TARGET_OBJECTS:function_manager_obj>
        $<TARGET_OBJECTS:agg_function_manager_obj>
        $<TARGET_OBJECTS:memory_obj>
        $<TARGET_OBJECTS:time_obj>
        $<TARGET_OBJECTS:time_utils_obj>
        $<TARGET_OBJECTS:fs_obj>
        $<TARGET_OBJECTS:ast_match_path_obj>
        $<TARGET_OBJECTS:wkt_wkb_io_obj>
        $<TARGET_OBJECTS:datetime_parser_obj>
    LIBRARIES
        follybenchmark
        ${THRIFT_LIBRARIES}
        ${PROXYGEN_LIBRARIES}
)
