Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows Agentd Integration tests fails for 4.8.0 RC2 #23465

Closed
Rebits opened this issue May 16, 2024 · 9 comments
Closed

Windows Agentd Integration tests fails for 4.8.0 RC2 #23465

Rebits opened this issue May 16, 2024 · 9 comments
Assignees
Labels

Comments

@Rebits
Copy link
Member

Rebits commented May 16, 2024

Wazuh version Component Install type Install method Platform
v4.8.0-rc2 Authd Agent Packages Windows Server 2016

Description

During Release 4.8.0 - RC 2 - Integration tests it has been detected that Windows Agentd integration tests fails.
These tests were migrated from Jenkins to GHA #17696. However, these are not available for 4.8.0, so it is necessary a manual review

Cases

It has been only fails the test test_agentd/test_agentd_reconnection.py::test_agentd_reconection_enrollment_no_keys_file[tcp]. Regarding the report, it seems the agent was never able to enroll to the manager

    
        # Wait until Agent asks keys for the first time
>       log_monitor.start(timeout=50, callback=wait_enrollment,
                          error_message="Agent never enrolled for the first time.")

test_agentd\test_agentd_reconnection.py:385: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
C:\Python311\Lib\site-packages\wazuh_testing\tools\monitoring.py:197: in start
    self._result = monitor.start(timeout=timeout, callback=callback, accum_results=accum_results,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <wazuh_testing.tools.monitoring.QueueMonitor object at 0x00000213230B10D0>
timeout = 50, callback = <function wait_enrollment at 0x00000213231353A0>
accum_results = 1, update_position = True, timeout_extra = 0
error_message = 'Agent never enrolled for the first time.'

    def start(self, timeout=-1, callback=_callback_default, accum_results=1, update_position=True, timeout_extra=0,
              error_message=''):
        """Start the queue monitoring until the stop method is called."""
        if not self._continue:
            self._continue = True
            self._abort = False
            result = None
    
            while self._continue:
                if self._abort:
                    self.stop()
                    if error_message:
                        logger.error(error_message)
                        logger.error(f"Results accumulated: "
                                     f"{len(result) if isinstance(result, list) else 0}")
                        logger.error(f"Results expected: {accum_results}")
>                   raise TimeoutError(error_message)
E                   TimeoutError: Agent never enrolled for the first time.

Evidences

Report: agent_windows_html_report_Test_integration_B47206(2).zip

@nbertoldo
Copy link
Member

This test deletes the 'clients.keys' file in order to force agentd to request a new key and then verifies that this new key has been received by monitoring the message "Valid key received" in the ossec.log file:

def wait_enrollment(line):
    """Callback function to wait for enrollment."""
    if 'Valid key received' in line:
        return line
    return None

The log has been analyzed and while this message does not appear, neither does the message “Requesting a key from server: ”. However, the agent connects to the manager without problems:

2024/05/15 10:45:44 wazuh-agent[4620] start_agent.c:93 at connect_server(): INFO: Trying to connect to server ([127.0.0.1]:1514/tcp).
2024/05/15 10:45:44 wazuh-agent[4620] start_agent.c:365 at agent_handshake_to_server(): INFO: (4102): Connected to the server ([127.0.0.1]:1514/tcp).
2024/05/15 10:45:44 wazuh-agent[4620] notify.c:198 at run_notify(): DEBUG: Sending keep alive: #!-Microsoft Windows Server 2016 Datacenter [Ver: 10.0.14393.3204] - Wazuh v4.8.0 / ab73af41699f13fdd81903b5f23d8d00
4a8724b20dee0124ff9656783c490c4e merged.mg
#"_agent_ip":127.0.0.1
2024/05/15 10:45:44 wazuh-agent[4620] receiver.c:96 at receive_msg(): DEBUG: Received message: '#!-agent ack '

This indicates that agentd does not try to enroll again because it already has a valid key. Therefore, it seems that the 'client.keys' file is not being deleted.

It can be seen that the function does not confirm that the file has been deleted:

def delete_keys():
    """Remove the agent's client.keys file."""
    os.remove(CLIENT_KEYS_PATH)
    sleep(1)

Apparently this is a flaky test. We will re-launch the test to confirm it and make changes to make it more robust.

@nbertoldo
Copy link
Member

We have run the test again: https://ci.wazuh.info/job/Test_integration/47353/. In this case the test test_agentd_reconection_enrollment_no_keys_file[tcp] did not fail:

image

I confirm that it is a flaky test. Therefore, I am going to modify it to make it more robust.

@cborla
Copy link
Member

cborla commented May 21, 2024

Testing the integration test with new changes in jenkins

New job: https://ci.wazuh.info/job/Test_integration/47361/
Changes: wazuh/wazuh-qa@4.8.0...fix/23465-windows-agentd-integration-tests-fails-for-480-rc2
image

@nbertoldo
Copy link
Member

Update

I have run the test test_agentd_reconnection.py locally several times and no errors occurred:

image

Report

I have manually deleted the file client.keys and run test_agentd_reconection_enrollment_no_keys_file[tcp]. In this case an exception occurs when os.remove() tries to remove the file that does not exist:

image

Report-removed-keys

I then manually opened the file and run the test. In this case the test does not fail:

image

Report-opened-keys.zip

That is, in none of the cases has it been possible to reproduce the same failure reported in the issue.

@vikman90
Copy link
Member

vikman90 commented May 22, 2024

Multiple hypothesis

  • File client.keys was not properly removed:
    If the file was not deleted, we expected the test to raise any exception. For this, either try to lock the file (Error Could not move (ossec.log) to (logs/2023/Oct/ossec-08.log) in Windows Footprint tests #19603 (comment)) or add a statement like os.remove('C:/random.txt').
    In this case, we would see a "Trying to connect" log just after the agent startup.
  • Manager response delay issues:
    If this was happening, the agent's log would get stuck just after a log like "Requesting a key from server". This is not what we're seeing here.

As we're unable to reproduce this problem reliably, in order to find out the cause of the problem, it's indispensable to have access to the environment where this issue occurred. Please open an issue to QA team.

@nbertoldo
Copy link
Member

Update

Hypothesis: File client.keys was not properly removed

In order to confirm this hypothesis, this tests have been performed:

1. Open client.keys file without SHARE_DELETE option

open.c
#include <windows.h>
#include <stdio.h>

int main() {
    const char * TARGET = "C:\\Program Files (x86)\\ossec-agent\\client.keys";

    printf("Creating target file\n");

    const DWORD dwDesiredAccess = GENERIC_READ | GENERIC_WRITE;
    const DWORD dwShareMode = FILE_SHARE_READ | FILE_SHARE_WRITE;
    const DWORD dwCreationDisposition = CREATE_ALWAYS;
    const DWORD dwFlagsAndAttributes = FILE_ATTRIBUTE_NORMAL;

    HANDLE hFile = CreateFile(TARGET, dwDesiredAccess, dwShareMode, NULL, dwCreationDisposition, dwFlagsAndAttributes, NULL);

    if (hFile == INVALID_HANDLE_VALUE) {
        printf("CreateFile(): %u\n", (unsigned)GetLastError());
        return 1;
    }

    Sleep(10000);

    CloseHandle(hFile);
    printf("OK\n");

    return 0;
}

Run test_agentd_reconnection.py. The test fails raising an exception:

def delete_keys():
"""Remove the agent's client.keys file."""
> os.remove(CLIENT_KEYS_PATH)
E PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Program Files (x86)\\ossec-agent\\client.keys'

test_agentd\test_agentd_reconnection.py:178: PermissionError

Report1.zip

2. Open client.keys file with SHARE_DELETE option

open-delete.c
#include <windows.h>
#include <stdio.h>

int main() {
    const char * TARGET = "C:\\Program Files (x86)\\ossec-agent\\client.keys";

    printf("Creating target file\n");

    const DWORD dwDesiredAccess = GENERIC_READ | GENERIC_WRITE;
    const DWORD dwShareMode = FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE;
    const DWORD dwCreationDisposition = CREATE_ALWAYS;
    const DWORD dwFlagsAndAttributes = FILE_ATTRIBUTE_NORMAL;

    HANDLE hFile = CreateFile(TARGET, dwDesiredAccess, dwShareMode, NULL, dwCreationDisposition, dwFlagsAndAttributes, NULL);

    if (hFile == INVALID_HANDLE_VALUE) {
        printf("CreateFile(): %u\n", (unsigned)GetLastError());
        return 1;
    }

    Sleep(10000);

    CloseHandle(hFile);
    printf("OK\n");

    return 0;
}

Run test_agentd_reconnection.py. the file is deleted and the test passes

image

Therefore, this hypothesis has not been confirmed.

Hypothesis: Manager response delay issues

As @vikman90 points out, this hypothesis has already been discarded previously, since the agent's log shows that there is a file with a valid key, so it does not try to enroll and connects with that key:

2024/05/15 10:45:43 wazuh-agent[4620] win_utils.c:156 at local_start(): INFO: (1410): Reading authentication keys file.
2024/05/15 10:45:44 wazuh-agent[4620] start_agent.c:93 at connect_server(): INFO: Trying to connect to server ([127.0.0.1]:1514/tcp)
2024/05/15 10:45:44 wazuh-agent[4620] start_agent.c:365 at agent_handshake_to_server(): INFO: (4102): Connected to the server ([127.0.0.1]:1514/tcp).

@vikman90
Copy link
Member

vikman90 commented May 24, 2024

Local execution

I've launched this test locally lots of times, it has always passed.

🟢 local ossec.log
🔴 critical ossec.log

We can see that the critical log is truncated.

Meet with @Rebits

  • This test failed once only.
  • This test has been migrated (4.9.0).
  • We do not believe that there is any case of overlap between different tests.

Conclusion

  • The log was truncated.
  • This should have been a flaky-test condition.
  • No changes related to enrollment have been introduced in 4.8.0.
  • The test was migrated. The new version never failed.

That's why, unless we find any new evidence today, we can close this issue.

@vikman90 vikman90 self-assigned this May 24, 2024
@vikman90
Copy link
Member

We're blocking this issue until checking "Release 4.8.0 - RC 3 - Integration tests"

@vikman90
Copy link
Member

vikman90 commented Jun 3, 2024

The failed result at #23434 that raised this issue is not positive: #23823.

So, either the problem was solved in another way, or it was a sporadic error whose cause we have not been able to find. In any case, I proceed to close this issue.

@vikman90 vikman90 closed this as not planned Won't fix, can't repro, duplicate, stale Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants