
blocked-at is unusable on express web servers - spikes CPU to 100% #36

Open
powersjcb opened this issue Mar 25, 2024 · 6 comments


powersjcb commented Mar 25, 2024

I did some independent testing and confirmed we are having the same performance issues described in this post. Our CPU utilization spiked to 100% and the Node.js process fell over when I tried to use this library to find blocking stack traces in a dev environment.

https://www.ashbyhq.com/blog/engineering/detecting-event-loop-blockers

I was running node:18.18.1-alpine in k8s with express 4.17.3

@powersjcb powersjcb changed the title this library is unusable on express web servers - spikes CPU to 100% blocked-at is unusable on express web servers - spikes CPU to 100% Mar 25, 2024
naugtur (Owner) commented Mar 25, 2024

Yes, async hooks are expensive. The more promises per request, the worse it gets.
This tool is not fit for running in prod.

I think I could improve its performance a lot, though. Care to share a minimal reproduction case?
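
For a sense of scale, here is a rough, illustrative micro-benchmark (not code from this thread) showing the overhead that just enabling an async_hooks init hook adds to promise-heavy code; blocked-at does strictly more work than this per resource, since it also captures a stack trace in init. Numbers will vary by machine and Node version.

```js
// Illustrative micro-benchmark: overhead of an async_hooks init hook on promise-heavy code.
const async_hooks = require('async_hooks');

async function burn(n) {
  // each awaited promise triggers init (and the other hooks) once hooks are enabled
  for (let i = 0; i < n; i++) await Promise.resolve(i);
}

async function time(label) {
  const start = process.hrtime.bigint();
  await burn(1e6);
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label}: ${ms.toFixed(1)} ms`);
}

(async () => {
  await time('no hooks');

  const hook = async_hooks.createHook({
    init(asyncId, type) {
      // blocked-at also captures a stack trace here, which costs far more than this
      new Error(type);
    },
  });
  hook.enable();
  await time('with init hook');
  hook.disable();
})();
```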

powersjcb (Author) commented Mar 25, 2024

Sorry, I don't have any public repos I'm testing with.

The configuration I tested looks similar to this:

```js
const blocked = require('blocked-at')
const express = require('express')

// app-specific modules (not part of this reproduction): a metrics client and a logger
const datadogLib = require('./datadog');
const logger = require('./logger');

const app = express()
const port = 3000

blocked((delay, stack, resource) => {
  datadogLib.histogram('node.eventloop.blocked', delay);
  logger.warn('blocked event loop', { delay, stack, resource });
}, { threshold: 100 }); // ms

app.get('/', (req, res) => {
  // todo: make API requests
  res.send('Hello World!')
})

app.listen(port, () => {
  console.log(`Example app listening on port ${port}`)
})
```

I was seeing baseline CPU utilization increase from 3 millicores to 400 millicores (memory usage remained about the same).

naugtur (Owner) commented Mar 25, 2024

Datadog might be key here. They do a great deal of monkey patching and use async hooks too, so the two might be clashing.
If you could test without Datadog, that would be helpful.

powersjcb (Author) commented

Yeah, that seems sensible. I will give that a shot.

I'm seeing that many dd-trace frames are present in our stack traces, so that wouldn't surprise me.

  "parsed": {
    "_timestamp": "2024-03-25T19:34:29.029Z",
    "blockedMs": 102.66803601074218,
    "level": "error",
    "message": "node event loop blocked",
    "resource": {
      "type": "HTTPINCOMINGMESSAGE"
    },
    "stack": [
      "    at AsyncHook.init (/app/node_modules/blocked-at/index.js:31:11)",
      "    at HTTPServerAsyncResource.emitInitNative (node:internal/async_hooks:202:43)",
      "    at connectionListenerInternal (node:_http_server:645:10)",
      "    at defaultTriggerAsyncIdScope (node:internal/async_hooks:463:18)",
      "    at Server.connectionListener (node:_http_server:618:3)",
      "    at Server.emit (node:events:529:35)",
      "    at Server.emit (/app/node_modules/dd-trace/packages/datadog-instrumentations/src/http/server.js:74:17)",
      "    at TCP.onconnection (node:net:2104:8)",
      "    at TCP.callbackTrampoline (node:internal/async_hooks:130:17)"
    ]
    "level": "error",
    "message": "node event loop blocked",
    "resource": {
      "type": "PROMISE"
    },
    "stack": [
      "    at AsyncHook.init (/app/node_modules/blocked-at/index.js:31:11)",
      "    at emitInitNative (node:internal/async_hooks:202:43)",
      "    at emitInitScript (node:internal/async_hooks:504:3)",
      "    at promiseInitHook (node:internal/async_hooks:324:3)",
      "    at promiseInitHookWithDestroyTracking (node:internal/async_hooks:328:3)",
... business logic redacted
      "    at /app/node_modules/dd-trace/packages/datadog-instrumentations/src/router.js:50:25",
      "    at /app/node_modules/dd-trace/packages/datadog-shimmer/src/shimmer.js:26:21",
      "    at Layer.handle [as handle_request] (/app/node_modules/express/lib/router/layer.js:95:5)",
      "    at next (/app/node_modules/express/lib/router/route.js:137:13)",
      "    at /app/node_modules/dd-trace/packages/datadog-instrumentations/src/router.js:100:12",
      "    at /app/node_modules/express-validation/lib/index.js:31:18",
      "    at processTicksAndRejections (node:internal/process/task_queues:95:5)"
    ]
    ```

powersjcb (Author) commented Mar 25, 2024

My intuition is that we are probably seeing the performance issues because of this line:

https://github.com/naugtur/blocked-at/blob/master/index.js#L36C1-L37C1

That's where we invoke a fairly expensive call to capture and store a stack trace on every async resource init (this agrees with what the article above mentions).

Do we know if there is a way to capture stack trace information only if the processing is too slow?

(testing a fork now to see if the stack collection is the problem)
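
One possible shape for that (a sketch only, not blocked-at's actual implementation): create the Error in init but read .stack only for resources whose callback exceeded the threshold, relying on V8 formatting the stack string lazily. Capturing the frames at init is still not free, and entries for resources whose callbacks never run would need separate cleanup, but the expensive formatting is deferred to the slow cases. THRESHOLD_MS and the writeSync logging below are assumptions for illustration.

```js
// Hypothetical sketch: capture the Error at init, but only format .stack when the
// time measured between before/after exceeds the threshold.
const async_hooks = require('async_hooks');
const fs = require('fs');

const THRESHOLD_MS = 100;        // assumed threshold, mirroring the config above
const entries = new Map();       // asyncId -> { err, start }

const hook = async_hooks.createHook({
  init(asyncId, type) {
    // V8 records the frames here but does not build the stack string until .stack is read
    entries.set(asyncId, { err: new Error(type), start: 0n });
  },
  before(asyncId) {
    const entry = entries.get(asyncId);
    if (entry) entry.start = process.hrtime.bigint();
  },
  after(asyncId) {
    const entry = entries.get(asyncId);
    if (!entry || entry.start === 0n) return;
    const ms = Number(process.hrtime.bigint() - entry.start) / 1e6;
    if (ms > THRESHOLD_MS) {
      // .stack is only materialized for the slow resources; writeSync is used because
      // console.log inside async_hooks callbacks is unsafe
      fs.writeSync(process.stderr.fd, `blocked for ${ms.toFixed(1)}ms\n${entry.err.stack}\n`);
    }
    entries.delete(asyncId);     // clean up on report instead of relying on a destroy hook
  },
});
hook.enable();
```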

naugtur (Owner) commented Mar 31, 2024

The most likely problem is async hooks; they slow things down. The destroy hook is the most expensive, and I know I can probably get rid of it by making things clean up after themselves on report, or by using a WeakMap.
But looking forward to your findings.
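
For reference, a minimal sketch of the WeakMap idea (hypothetical, not the library's code): asyncId is a number and cannot key a WeakMap, but the resource object passed to init can, so per-resource state is reclaimed by GC along with the resource and no destroy hook is needed.

```js
// Sketch only: key bookkeeping on the resource object so GC reclaims it,
// instead of registering a destroy hook to delete entries by asyncId.
const async_hooks = require('async_hooks');

const stacks = new WeakMap();      // resource object -> Error captured at init

const hook = async_hooks.createHook({
  init(asyncId, type, triggerAsyncId, resource) {
    stacks.set(resource, new Error(type));
  },
  // a full version still has to correlate asyncId -> resource in before/after to
  // report; that part is omitted here, but no destroy hook is needed for cleanup
});
hook.enable();
```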
