on self-hosted new session recordings are dropped due to high_water_mark_partition #22291
Comments
Thinking out loud, could it be that the misconfiguration

We do de-duplication on playback, so you should be able to either delete the offset in Redis or set it to some sensible value.

What we should observe then is that ingestion proceeds correctly and the high-water mark protects against re-ingestion caused by rebalances.

Can you test that? If that isn't the case then there's definitely something else up here... my hope is the installations have got into an unexpected state and "normal operations" are then unhelpful.
So, what I did today:
My question would be: is this expected behaviour, with some migration steps required to reset the watermark count, or is there a bug in the code that fails to reset the watermark when needed?
restarted containers upon update:
hmmm we might be missing something here for sure. if we were to do kafka maintenance such that the offset was reset we'd manually reset the redis watermark... (we try to avoid duplication of ingestion but the playback is relatively good at deduplicating so we might be more complex than that in reality but it's certainly close enough for rock and roll...) so there's something that needs fixing or documenting here.
can you clarify what this step is? why you'd do it? (thanks again for these super clear, deep reports)
We update

web:
    extends:
        file: docker-compose.base.yml
        service: web
    command: /compose/start
    volumes:
        - ./compose:/compose
    image: posthog/posthog:c2046d5dd1e2fa01614c6077e30cf06c4b8897f1 <---- here

and run
ooh and that resets the kafka offset 🤯 @frankh @fuziontech I'm well beyond my docker knowledge here... but there's a gap between how we operate session replay and what docker compose is doing for self-hosted... is there a way to avoid the kafka offset reset here? |
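The failure mode discussed in this exchange can be illustrated with a toy simulation. This is only a sketch: `ToyRecordingConsumer` and its methods are hypothetical names, nothing here talks to Kafka or Redis, and the "at or below the mark" comparison is an assumption rather than the actual plugin-server logic. It shows the watermark doing its job after a rebalance-style replay, then dropping everything once offsets restart from zero, until the watermark is reset by hand.

```typescript
// Toy simulation; all names are hypothetical and no real Kafka or Redis is involved.
class ToyRecordingConsumer {
    // Stands in for the high-water mark that the thread says is kept in Redis.
    private highWaterMark: number | undefined = undefined;

    // Process a batch of message offsets and return the ones that were accepted.
    consume(offsets: number[]): number[] {
        const accepted: number[] = [];
        for (const offset of offsets) {
            if (this.highWaterMark !== undefined && offset <= this.highWaterMark) {
                continue; // treated as already ingested, so the message is dropped
            }
            accepted.push(offset);
            this.highWaterMark = offset;
        }
        return accepted;
    }

    // Stands in for manually deleting or resetting the stored watermark.
    resetWatermark(): void {
        this.highWaterMark = undefined;
    }
}

const consumer = new ToyRecordingConsumer();

const firstRun = consumer.consume([0, 1, 2, 3, 4]); // fresh install: everything is ingested
const replayed = consumer.consume([3, 4]); // rebalance replays old offsets: dropped, as intended
const afterOffsetReset = consumer.consume([0, 1, 2]); // offsets restarted from 0: new data is dropped too

consumer.resetWatermark();
const afterWatermarkReset = consumer.consume([0, 1, 2]); // ingestion proceeds again

console.log(firstRun.length, replayed.length, afterOffsetReset.length, afterWatermarkReset.length);
// prints: 5 0 0 3
```

In this model the two stores only stay consistent if both survive a container rebuild, or if both are reset together.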
Nice troubleshooting! Just wanted to add my 2 cents, since we were struggling to get a stable self-hosted deployment recently. I think we got everything working smoothly now, but we did apply quite a few changes.

Part of our docker-compose.yml (we also bind-mount to host filesystem instead of using docker volumes):

redis:
    extends:
        file: docker-compose.base.yml
        service: redis
    restart: always
    volumes:
        - /srv/posthog/redis:/data

We've also added persistence for Kafka:

kafka:
    extends:
        file: docker-compose.base.yml
        service: kafka
    restart: always
    depends_on:
        - zookeeper
    volumes:
        - /srv/posthog/kafka/data:/bitnami/kafka
Redis persistence is what is needed here I think, good catch @jgoclawski , will try this on next update. |
I guess both, kafka and redis, need to write to disk in order for watermark values to survive rebuilds... will test and create PR if it works. |
This is to address PostHog#22291, but potentially fixes other redis/kafka related conditions over containers updates / rebuilds
* add persistence to redis and kafka on hobby
  This is to address #22291, but potentially fixes other redis/kafka related conditions over containers updates / rebuilds
* proper path to kafka data volume
* add volume definitions

Co-authored-by: Frank Hamand <frank@posthog.com>
Thanks for another fix @feedanal 😍 I'll close this (but folk can re-open if needed) |
Bug description
This is related to #21391 which was closed, but not resolved.
The main symptom is that, on self-hosted installs, session recordings stop being collected very soon after the initial PostHog install. While there were a number of misconfiguration problems related to blob storage (solved in #22268), new recordings still stopped arriving after a certain period.
Additional context
A brief look at the code gave me the idea that this might be related to:
posthog/plugin-server/src/main/ingestion-queues/session-recording/session-recordings-consumer.ts
Line 296 in 06297ca
After I enabled debug mode, this surely was the case. Instance logs:
In other words, some condition results in a very high highWaterMarks value, which prevents new recordings from being ingested until message offsets reach this value. I observed session recordings sporadically start collecting again, only to stop later. If I remove the message-dropping condition, recordings start flowing in again.
I could not grasp the watermark-setting logic, but clearly this should not be happening. Kafka runs fine, MinIO is available and stores / serves past recordings fine.
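To make the reported behaviour concrete, here is a minimal sketch of a per-partition high-water mark check. It is not the code from session-recordings-consumer.ts: the function names, the `example_topic:0` key format, and the `<=` comparison are all assumptions made for illustration. It only shows why a stored mark far above the current offsets causes every new message to be dropped until the offsets catch up.

```typescript
// Toy model of a per-partition high-water mark check.
// All names are hypothetical; this is not the plugin-server implementation.

// Highest offset already ingested, keyed by "topic:partition" (assumed key format).
const highWaterMarks = new Map<string, number>();

// Record that `offset` has been ingested; the mark never moves backwards.
function markIngested(key: string, offset: number): void {
    const current = highWaterMarks.get(key);
    if (current === undefined || offset > current) {
        highWaterMarks.set(key, offset);
    }
}

// A message is dropped when its offset is at or below the stored mark
// (assumed comparison; the real check may differ in detail).
function shouldDrop(key: string, offset: number): boolean {
    const mark = highWaterMarks.get(key);
    return mark !== undefined && offset <= mark;
}

// A stale mark far above the current offsets drops everything until offsets catch up.
markIngested("example_topic:0", 50000);
console.log(shouldDrop("example_topic:0", 120)); // true: new recording data is discarded
console.log(shouldDrop("example_topic:0", 50001)); // false: ingestion resumes past the mark
```

This matches the sporadic pattern described above: recordings resume only once offsets pass the stored value, and stop again if the offsets fall back below it.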
This was occurring on all recent installs (at least 2 weeks back), i.e. this bug is consistent and experienced by many users.
Not sure how helpful, but Kafka's status page:
Debug info
No response