Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SSLIOP Transports are used after being cached with entry state IDLE #2184

Open
eommc opened this issue Jan 23, 2024 · 0 comments
Open

SSLIOP Transports are used after being cached with entry state IDLE #2184

eommc opened this issue Jan 23, 2024 · 0 comments
Labels
needs review Needs to be reviewed question

Comments

@eommc
Copy link

eommc commented Jan 23, 2024

Version

We are using the ACE ORB supported by OCI, https://theaceorb.com/, wich is TAO 2.2. But looking at the code here on github, it seems that the code is the same as the one on master branch here, with TAO 3.1.

Host machine and operating system

Built on an almalinux 9.2 docker image, run on Red Hat Enterprise Linux release 9.3

Compiler name and version (including patch level)

g++ (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)

The $ACE_ROOT/ace/config.h file

#if defined FD_SETSIZE
#       undef FD_SETSIZE
#endif
#define FD_SETSIZE 8192
#define ACE_HAS_IPV6 1
#define TAO_HAS_ZIOP 1
#include "config-linux.h"

The $ACE_ROOT/include/makeinclude/platform_macros.GNU file

exceptions=1
optimize=0
threads=1
ssl=1
zlib=1
include $(ACE_ROOT)/include/makeinclude/platform_linux.GNU
buildbits = 64

Contents of $ACE_ROOT/bin/MakeProjectCreator/config/default.features

ssl=1
zlib = 1

Our application acts as a server for various other programs, but also delegates a few functions to another service. In this interaction, the application is acting as a client towards the other service, and multiple threads can perform requests concurrently. The ORBs use an svc config file with the following content:

    static Client_Strategy_Factory "-ORBClientConnectionHandler rw -ORBTransportMuxStrategy exclusive -ORBConnectStrategy blocked -ORBConnectionHandlerCleanup 1"
    static Resource_Factory "-ORBFlushingStrategy blocking -ORBProtocolFactory SSLIOP_Factory"
    dynamic SSLIOP_Factory Service_Object*
        TAO_SSLIOP:_make_TAO_SSLIOP_Protocol_Factory() "-v -SSLAuthenticate SERVER_AND_CLIENT -SSLCertificate ... -SSLPrivateKey ... -SSLPassword ... -SSLCAfile ... -SSLVersionList TLSv1.2"

We've seing a few threads remaining stuck waiting for a replies, and these errors in the logs:

    Exclusive_TMS::dispatch_reply - <2 != 1>

I've raised the log levels and investigated the issue. Whent the SSLIOP connector creates a new Transport, it is added to the Transport Cache Manager with state ENTRY_IDLE_AND_PURGABLE. Under heavy load, a different thread can fetch the same Transport from tha cache while the first thread is using it to send a request.

There are a few consequences:

  • when using the mux strategy exclusive, the code will discard one of the responses, because it violates the constraint that there can only be one pending operation at the time. This is what causes our threads to remain stuck
  • when using the muxed strategy, replies are not discarded, but we can see a few SSL errors in the logs, because the effect is not just interleaved requests and responses, but requests can be sent concurrently on the same transport

Analyzing the code I found a big difference between IIOP and SSLIOP connections.

Both TAO_IIOP_Connector and TAO::SSLIOP::Connector inherit from TAO_Connector. In the former case, the transport creation passes through the base class, TAO_Connector::connect. This method has an infinte while loop that wraps all code. If an idle transport is found in the cache, it is returned. When a new Transport is created, it is not returned to the caller, instead it is cached and then the loop runs once again, and hopefully fetches the transport from the cache before a different thread snatches it.
The Transport is created by TAO_IIOP_Connector::make_connection, which adds the transport in the cache manager using the default sate, ENTRY_IDLE_AND_PURGABLE (a default paramter). The state of the transport is changed to BUSY later, by the method Transport_Cache_Manager_T::find_transport, which ends up calling find_i.

    // Transport_Cache_Manager_T.cpp : 301
    entry->item ().recycle_state (ENTRY_BUSY);

On the other hand, TAO::SSLIOP::Connector overwrites the connect method (SSLIOP_Connector.cpp line 83), which internally calls ssliop_connect (line 190). Neither the method connect, nor ssliop_connect, have the same loop as the base class. Instead, the transport is cached with the default state ENTRY_IDLE_AND_PURGABLE, and then returned and used as is

    int retval =
      this->orb_core ()->
        lane_resources ().transport_cache ().cache_transport (desc,
                                                              transport);

With this pattern it is possible that a different thread finds the same transport in the cache, in idle state, and starts using it concurrently.

I tried a quick patch, adding the state ENTRY_BUSY as third parameter to the cache_transport invocation, in SSLIOP_Connector.cpp line 713, and our problem seems to be solved. All I saw seems to hint to a bug, but the code is 13 years old, so maybe there's something wrong in our configuration.

@jwillemsen jwillemsen added needs review Needs to be reviewed question labels Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs review Needs to be reviewed question
Development

No branches or pull requests

2 participants