Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sporadic failures due used tcp ports #5008

Open
shrids opened this issue Jul 31, 2020 · 12 comments · Fixed by #5007 · May be fixed by #7008
Open

Sporadic failures due used tcp ports #5008

shrids opened this issue Jul 31, 2020 · 12 comments · Fixed by #5007 · May be fixed by #7008
Assignees
Labels
tag/flakeyTest Test is not reliable, and causes build errors

Comments

@shrids
Copy link
Contributor

shrids commented Jul 31, 2020

Problem description
Sporadic failures with ControllerServiceStarter

java.lang.IllegalStateException: Expected the service ControllerServiceStarter [FAILED] to be RUNNING, but the service has FAILED
	at com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:366)
	at com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:302)
	at com.google.common.util.concurrent.AbstractIdleService.awaitRunning(AbstractIdleService.java:165)
	at io.pravega.test.integration.demo.ControllerWrapper.awaitRunning(ControllerWrapper.java:183)
	at io.pravega.test.integration.ReaderGroupNotificationTest.setUp(ReaderGroupNotificationTest.java:102)
Caused by: java.lang.IllegalStateException: Expected the service GRPCServer [FAILED] to be RUNNING, but the service has FAILED
	at com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:366)
	at com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:302)
	at com.google.common.util.concurrent.AbstractIdleService.awaitRunning(AbstractIdleService.java:165)
Caused by: java.net.BindException: Address already in use
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:433)
	at sun.nio.ch.Net.bind(Net.java:425)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)

Problem location
Free port detection logic

Suggestions for an improvement
Fix the issue.

@derekm
Copy link
Contributor

derekm commented Jul 31, 2020

I've also seen this in ExtendedS3StorageTest. (See #4998.)

2020-07-29T19:34:49.6376770Z io.pravega.storage.extendeds3.ExtendedS3StorageTest$RollingStorageTests > testOpen FAILED
2020-07-29T19:34:49.6377117Z     java.net.BindException: Address already in use
2020-07-29T19:34:49.6377391Z         at java.base/sun.nio.ch.Net.bind0(Native Method)
2020-07-29T19:34:49.6377668Z         at java.base/sun.nio.ch.Net.bind(Net.java:479)
2020-07-29T19:34:49.6379548Z         at java.base/sun.nio.ch.Net.bind(Net.java:468)
2020-07-29T19:34:49.6382851Z         at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:220)
2020-07-29T19:34:49.6383205Z         at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:88)
2020-07-29T19:34:49.6383519Z         at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:321)
2020-07-29T19:34:49.6383847Z         at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
2020-07-29T19:34:49.6384321Z         at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:236)
2020-07-29T19:34:49.6385056Z         at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
2020-07-29T19:34:49.6385547Z         at org.eclipse.jetty.server.Server.doStart(Server.java:366)
2020-07-29T19:34:49.6386042Z         at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
2020-07-29T19:34:49.6387180Z         at org.gaul.s3proxy.S3Proxy.start(S3Proxy.java:389)
2020-07-29T19:34:49.6387548Z         at io.pravega.storage.extendeds3.S3ProxyImpl.start(S3ProxyImpl.java:83)
2020-07-29T19:34:49.6387911Z         at io.pravega.storage.extendeds3.ExtendedS3StorageTest$TestContext.<init>(ExtendedS3StorageTest.java:336)
2020-07-29T19:34:49.6388536Z         at io.pravega.storage.extendeds3.ExtendedS3StorageTest$RollingStorageTests.setUp(ExtendedS3StorageTest.java:299)

@andreipaduroiu
Copy link
Member

ExtendedS3TestContext uses TestUtils.getAvailableListenPort already. So that means this method may not return a good value?

@derekm
Copy link
Contributor

derekm commented Aug 1, 2020

@andreipaduroiu -- Or somehow there's a race for the port it found? I was testing in a constrained environment, so if closing the tested port is async, maybe it isn't closed yet by the time the test tries to open the port that was found for it?

Binding to port 0 and then looking up what port the OS assigned is probably a better strategy.

@shrids
Copy link
Contributor Author

shrids commented Aug 3, 2020

Updates:

  • With the approach in PR:5007 I am not able to see the BindException errors on Jenkins (tried a couple of runs)
  • Binding to port 0 would also require the change setting SO_REUSEADDR to true. I believe the code being suggested is....
          try {
                ServerSocket serverSocket = new ServerSocket(0);
                int candidatePort = serverSocket.getLocalPort(); // -1 is returned if not bound...
                //Enabling SO_REUSEADDR prior to binding the socket using bind(SocketAddress) allows the socket
                //to be bound even though a previous connection is in a timeout state.
                serverSocket.setReuseAddress(true);
                serverSocket.close();
                return candidatePort;
            } catch (IOException e) {
               // Do nothing. Try another port.
            }

@derekm
Copy link
Contributor

derekm commented Aug 3, 2020

@shrids -- I was testing in a constrained environment, so I was running into every intermittent failure that results from coarse-grained timing issues (bind failure was just one of many observed failures, see #4998).

My suggestion was more to do with moving away from opening and closing a throwaway socket, but that would require refactoring of our server components to expose getLocalPort() so we can know what port it is using.

I expect the timing failure I was seeing around bind has more to do with the open/close of the socket and less to do with how the port number was being determined.

Maybe either case needs setReuseAddress(true)? (EDIT: Looking at the PR, that's what you've done, used setReuseAddress(true) on the throwaway socket, rather than change how the free port number was determined. 👍)

@derekm
Copy link
Contributor

derekm commented Jan 10, 2021

Seen in recent failed Actions build:

io.pravega.storage.extendeds3.ExtendedS3SimpleStorageTests$ExtendedS3ChunkStorageSystemJournalTests > testSimpleBootstrapWithTruncateSingleChunk FAILED
    java.io.IOException: Failed to bind to /127.0.0.1:49812
        at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:346)
        at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:307)
        at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
        at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:231)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
        at org.eclipse.jetty.server.Server.doStart(Server.java:385)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
        at org.gaul.s3proxy.S3Proxy.start(S3Proxy.java:389)
        at io.pravega.storage.extendeds3.S3ProxyImpl.start(S3ProxyImpl.java:83)
        at io.pravega.storage.extendeds3.ExtendedS3TestContext.<init>(ExtendedS3TestContext.java:47)
        at io.pravega.storage.extendeds3.ExtendedS3SimpleStorageTests$ExtendedS3ChunkStorageSystemJournalTests.before(ExtendedS3SimpleStorageTests.java:117)

        Caused by:
        java.net.BindException: Address already in use
            at java.base/sun.nio.ch.Net.bind0(Native Method)
            at java.base/sun.nio.ch.Net.bind(Net.java:455)
            at java.base/sun.nio.ch.Net.bind(Net.java:447)
            at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
            at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
            at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:342)
            ... 10 more

    java.lang.NullPointerException
        at io.pravega.test.common.ThreadPooledTestSuite.after(ThreadPooledTestSuite.java:34)
        at io.pravega.segmentstore.storage.chunklayer.SystemJournalTests.after(SystemJournalTests.java:53)
        at io.pravega.storage.extendeds3.ExtendedS3SimpleStorageTests$ExtendedS3ChunkStorageSystemJournalTests.after(ExtendedS3SimpleStorageTests.java:126)

@derekm derekm reopened this Jan 10, 2021
@derekm derekm added the tag/flakeyTest Test is not reliable, and causes build errors label Jan 10, 2021
@derekm
Copy link
Contributor

derekm commented Jan 10, 2021

io.pravega.storage.extendeds3.ExtendedS3SimpleStorageTests$ExtendedS3ChunkStorageTests > testDeleteAfterOpen FAILED
    java.io.IOException: Failed to bind to /127.0.0.1:36664
        at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:346)
        at org.eclipse.jetty.server.ServerConnector.open(ServerConnector.java:307)
        at org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:80)
        at org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:231)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
        at org.eclipse.jetty.server.Server.doStart(Server.java:385)
        at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
        at org.gaul.s3proxy.S3Proxy.start(S3Proxy.java:389)
        at io.pravega.storage.extendeds3.S3ProxyImpl.start(S3ProxyImpl.java:83)
        at io.pravega.storage.extendeds3.ExtendedS3TestContext.<init>(ExtendedS3TestContext.java:47)
        at io.pravega.storage.extendeds3.ExtendedS3SimpleStorageTests$ExtendedS3ChunkStorageTests.before(ExtendedS3SimpleStorageTests.java:81)

        Caused by:
        java.net.BindException: Address already in use
            at java.base/sun.nio.ch.Net.bind0(Native Method)
            at java.base/sun.nio.ch.Net.bind(Net.java:455)
            at java.base/sun.nio.ch.Net.bind(Net.java:447)
            at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
            at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
            at org.eclipse.jetty.server.ServerConnector.openAcceptChannel(ServerConnector.java:342)
            ... 10 more

@andreipaduroiu
Copy link
Member

This change couldn't have possibly made any difference. The current code:

  for (int i = 0; i < MAX_PORT_COUNT; i++) {
            int candidatePort = BASE_PORT + NEXT_PORT.getAndIncrement() % MAX_PORT_COUNT;
            try {
                ServerSocket serverSocket = new ServerSocket(candidatePort);
                //Enabling SO_REUSEADDR prior to binding the socket using bind(SocketAddress) allows the socket
                //to be bound even though a previous connection is in a timeout state.
                serverSocket.setReuseAddress(true);
                serverSocket.close();
                return candidatePort;
            } catch (IOException e) {
                // Do nothing. Try another port.
            }
        }

... sets the setReuseAddress(true) after the new ServerSocket(candidatePort) is invoked. But if you decompile that constructor, you'll see it automatically tries to bind to that port. So this "new" option will not even be executed if it fails to bind.

This leads me to believe the PR associated with this issue had no effect. Whatever the root cause is for this problem, it is still unaddressed.

@shrids
Copy link
Contributor Author

shrids commented Jan 20, 2021

Another reproduction in the jenkins build.

io.pravega.test.integration.SingleThreadEndToEndTest > testReadWrite FAILED
02:24:26      java.lang.IllegalStateException: Expected the service ControllerServiceMain [FAILED] to be TERMINATED, but the service has FAILED
02:24:26          at com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:366)
02:24:26          at com.google.common.util.concurrent.AbstractService.awaitTerminated(AbstractService.java:329)
02:24:26          at com.google.common.util.concurrent.AbstractExecutionThreadService.awaitTerminated(AbstractExecutionThreadService.java:223)
02:24:26          at io.pravega.test.integration.demo.ControllerWrapper.close(ControllerWrapper.java:237)
02:24:26          at io.pravega.test.integration.utils.SetupUtils.stopAllServices(SetupUtils.java:138)
02:24:26          at io.pravega.test.integration.SingleThreadEndToEndTest.testReadWrite(SingleThreadEndToEndTest.java:29)
02:24:26  
02:24:26          Caused by:
02:24:26          java.lang.IllegalStateException: Expected the service ControllerServiceStarter [FAILED] to be RUNNING, but the service has FAILED
02:24:26              at com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:366)
02:24:26              at com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:302)
02:24:26              at com.google.common.util.concurrent.AbstractIdleService.awaitRunning(AbstractIdleService.java:165)
02:24:26              at io.pravega.controller.server.ControllerServiceMain.run(ControllerServiceMain.java:126)
02:24:26              at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66)
02:24:26              at com.google.common.util.concurrent.Callables$4.run(Callables.java:119)
02:24:26              at java.base/java.lang.Thread.run(Thread.java:834)
02:24:26  
02:24:26              Caused by:
02:24:26              java.lang.IllegalStateException: Expected the service RESTServer [FAILED] to be RUNNING, but the service has FAILED
02:24:26                  at com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:366)
02:24:26                  at com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:302)
02:24:26                  at com.google.common.util.concurrent.AbstractIdleService.awaitRunning(AbstractIdleService.java:165)
02:24:26                  at io.pravega.controller.server.ControllerServiceStarter.startUp(ControllerServiceStarter.java:345)
02:24:26                  at com.google.common.util.concurrent.AbstractIdleService$DelegateService$1.run(AbstractIdleService.java:62)
02:24:26                  ... 2 more
02:24:26  
02:24:26                  Caused by:
02:24:26                  javax.ws.rs.ProcessingException: Failed to start Grizzly HTTP server: Address already in use
02:24:26                      at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpServerFactory.createHttpServer(GrizzlyHttpServerFactory.java:270)
02:24:26                      at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpServerFactory.createHttpServer(GrizzlyHttpServerFactory.java:114)
02:24:26                      at io.pravega.controller.server.rest.RESTServer.startUp(RESTServer.java:83)
02:24:26                      ... 3 more
02:24:26  
02:24:26                      Caused by:
02:24:26                      java.net.BindException: Address already in use
02:24:26                          at java.base/sun.nio.ch.Net.bind0(Native Method)
02:24:26                          at java.base/sun.nio.ch.Net.bind(Net.java:455)
02:24:26                          at java.base/sun.nio.ch.Net.bind(Net.java:447)
02:24:26                          at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
02:24:26                          at java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
02:24:26                          at org.glassfish.grizzly.nio.transport.TCPNIOBindingHandler.bindToChannelAndAddress(TCPNIOBindingHandler.java:107)
02:24:26                          at org.glassfish.grizzly.nio.transport.TCPNIOBindingHandler.bind(TCPNIOBindingHandler.java:64)
02:24:26                          at org.glassfish.grizzly.nio.transport.TCPNIOTransport.bind(TCPNIOTransport.java:215)
02:24:26                          at org.glassfish.grizzly.nio.transport.TCPNIOTransport.bind(TCPNIOTransport.java:195)
02:24:26                          at org.glassfish.grizzly.nio.transport.TCPNIOTransport.bind(TCPNIOTransport.java:186)
02:24:26                          at org.glassfish.grizzly.http.server.NetworkListener.start(NetworkListener.java:711)
02:24:26                          at org.glassfish.grizzly.http.server.HttpServer.start(HttpServer.java:256)
02:24:26                          at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpServerFactory.createHttpServer(GrizzlyHttpServerFactory.java:267)
02:24:26                          ... 5 more

@RaulGracia
Copy link
Contributor

Fixed via #4999.

@RaulGracia
Copy link
Contributor

We still see problems binding ports, so reopening.

@tkaitchuck
Copy link
Member

Every location that is encountering this should be sure that it is setting setReuseAddress(true)

@kuldeepk3 kuldeepk3 linked a pull request Mar 20, 2023 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tag/flakeyTest Test is not reliable, and causes build errors
Projects
None yet
5 participants