
[doc][yba] 2024.1 High availability updates #22186

Open
wants to merge 335 commits into base: master

Conversation

Contributor

@ddhodge ddhodge commented Apr 29, 2024

High availability updates
DOC-270

@netlify /preview/yugabyte-platform/administer-yugabyte-platform/high-availability/

anmalysh-yb and others added 30 commits April 23, 2024 22:58
Summary: Subj

Test Plan: manually

Reviewers: rmadhavan

Reviewed By: rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34432
…8s resource spec.

Summary: [PLAT-13508] Calculate k8s ybc throttle params correctly when using k8s resource spec.

Test Plan: manual

Reviewers: vkumar

Reviewed By: vkumar

Subscribers: vkumar, sanketh, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34391
…tables

Summary:
Commit fad94f7 introduces a new tablet metadata field `skip_table_tombstone_check`, but it doesn't set this field for colocated tables. Fix this by setting the field in `AsyncAddTableToTablet()`.
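
For context, a minimal YSQL sketch of the setup this fix applies to (database and table names are hypothetical, and the `COLOCATION` syntax is assumed from recent YSQL versions): every table in a colocation-enabled database is added to a shared tablet, which is the `AsyncAddTableToTablet()` path patched here.

```
CREATE DATABASE codb WITH COLOCATION = true;  -- hypothetical colocation-enabled database
-- Connected to codb: the new table is added to the database's shared colocation
-- tablet, so its metadata must also carry skip_table_tombstone_check.
CREATE TABLE t (k INT PRIMARY KEY, v TEXT);
```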

Backport-to: 2024.1
Jira: DB-11043

Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_mini-test --gtest_filter PgMiniTest.SkipTableTombstoneCheckMetadata

Reviewers: asrivastava

Reviewed By: asrivastava

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D34428
Summary:
Add changes to `yb-server-ctl.yml` to make ansibleDestroy idempotent.

The order of clean up was as follows:
1. Stop yb-master, yb-tserver, yb-controller services (if systemd)
2. Delete systemd units for (1) + other services
3. Stop + delete systemd units for node exporter + otel collector
4. Clean/Clean logs for (1)
5. Clean out the data directory + home directory. At the end, delete the `yb-server-ctl.sh`

The issue is that once step 2 has been performed, retrying ansibleDestroy will always fail, because step 1 can no longer stop services whose systemd units were deleted. The same applies to retrying (3). Also, at the end of step 5, we delete `yb-server-ctl.sh`; if we then retry, steps 4 and 5 will fail, since they both use the cleanup script.

For the case where the systemd units no longer exist, we first check the status of the systemd unit before stopping it. As for the cleanup script, if it does not exist, we skip the cleanup (this is acceptable; in the worst case, we catch the issue on node addition when the preflight check fails).

Test Plan:
1. Create a 3 node rf3 on-prem universe. Stop one of the VMs from the cloud provider. Perform a replace node. Make sure that the node is in decommissioned state. Then start the VM for the node in the decommissioned state. Perform a 'recommission' action. Make sure that this succeeds.

2. Create a 3 node rf3 on-prem universe. Inject an error at the end of the `OnPremDestroyInstancesMethod` method, to make sure it fails. Then replace a node. The node will be placed into the decommissioned state because it failed. At this point, all the systemd units, clean up scripts, home/data directories are already cleaned up. Performing a 'recommission' action will still succeed (before it would not).

3. Verify that the correct systemd type is used, i.e. either user-level or system-level systemd.

Reviewers: sanketh, nsingh

Reviewed By: nsingh

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34039
…me config on universe form.

Summary:
Fixes a bug where a fetch for provider-level runtime config was not requesting inherited values. This led to an incorrect read of the runtime config flag if no value was set at the provider level.

Test Plan:
Set `yb.universe.geo_partitioning_enabled` to `true` at the customer level.
Verify that the universe form recognizes the inherited `true` setting.
{F171727}
{F171728}
{F171726}

Set `yb.universe.geo_partitioning_enabled` to `false` at the customer level
and `true` at the provider level.
Verify that the universe form reads the runtime value from the provider.
{F171729}
{F171730}
{F171731}

Reviewers: anijhawan, rmadhavan, kkannan

Reviewed By: rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34312
Summary:
When `ysql_ddl_transaction_wait_for_ddl_verification` is enabled, the PG client performs `WaitForDdlVerificationToFinish`, which waits for the txn state to be cleared in the master.
The `TestCleanUpCDCStreamsMetadataDuringTabletSplit` tests work by blocking the catalog manager background task, which used to cause the table delete to get stuck.

This was because `YsqlDdlTxnDropTableHelper` does not clear the txn state after it has deleted the table. This function is run by `TableSchemaVerificationTask` or `ReportYsqlDdlTxnStatus`. The table delete does not clear the txn state synchronously because `CatalogManager::CheckTableDeleted` skips the cleanup, since it does a `HasTasks` check on the table which will return true due to the `TableSchemaVerificationTask` itself.

Other tests don't hit this issue because the background task also runs `CatalogManager::CleanUpDeletedTables`, which calls `RemoveDdlTransactionState` and eventually unblocks `WaitForDdlVerificationToFinish`.

Even without the test blocking the task, this is not optimal. We should call `RemoveDdlTransactionState` as soon as the table is marked as DELETING and unblock the client, instead of making it wait on the background task, which is slow and can take arbitrarily long.

This change fixes the issue by performing `RemoveDdlTransactionState` in `YsqlDdlTxnDropTableHelper`.
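
As a rough sketch of the user-visible shape (table name hypothetical): with `ysql_ddl_transaction_wait_for_ddl_verification` enabled, the client's DROP returns only after the txn state is cleared, which previously could wait on the background task.

```
CREATE TABLE t (k INT PRIMARY KEY);
-- Before this fix, the DROP below could block in WaitForDdlVerificationToFinish
-- until the background task ran; now the txn state is cleared as soon as the
-- table is marked DELETING.
DROP TABLE t;
```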

Fixes yugabyte#22095
Jira: DB-11021

Test Plan:
CDCSDKTabletSplitTest.TestCleanUpCDCStreamsMetadataDuringTabletSplitImplicit
CDCSDKTabletSplitTest.TestCleanUpCDCStreamsMetadataDuringTabletSplitExplicit

Reviewers: myang

Reviewed By: myang

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D34431
Summary: These are some of the leaks discovered. We will need to watch out for leaks like this.

Test Plan: Trivial.

Reviewers: amalyshev, nbhatia, sanketh, muthu, anijhawan

Reviewed By: anijhawan

Subscribers: anijhawan, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34435
* simplifying sys catalog

* cleaning up sys catalog

* removing store term for views

* info about information schema

* icons for table/view

* update to 5/3 model

* Apply suggestions from code review

---------

Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>
… time of table creation

Summary:
In order to support dynamic table addition with PG replication consumption, for tables created after stream creation, we need to set retention barriers on the tablets of such tables at the time of table creation.

If replication slot consumption is enabled, then whenever a new table is created we check if at least one stream exists on the namespace. If it does, we set the field `cdc_sdk_require_history_cutoff` to true in the CreateTablet request.

The tablet_service creates the tablet and, based on the value of `cdc_sdk_require_history_cutoff`, calls `SetAllInitialCDCSDKRetentionBarriers()` to set the retention barriers on the tablet. The method `PopulateCDCStateTableOnNewTableCreation()` is called in the callback for CreateTablet and adds entries for the tablet in the cdc_state table.
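
A minimal sketch of the dynamic-table scenario (slot, plugin, and table names hypothetical): because the stream exists before the table, the new table's tablets get retention barriers at creation time.

```
-- A stream already exists on the namespace...
SELECT * FROM pg_create_logical_replication_slot('test_slot', 'pgoutput');
-- ...so this table, created after the slot, gets cdc_sdk_require_history_cutoff
-- set in its CreateTablet request and retention barriers set at creation.
CREATE TABLE orders (id INT PRIMARY KEY, total NUMERIC);
```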

**Upgrade/Rollback safety:**
The retention barriers will only be set on the tablets of a newly added table if replication slot consumption is enabled. This is guarded by the flag `ysql_TEST_enable_replication_slot_consumption`.

Protos modified:
 - CreateTabletRequestPB: An optional boolean field cdc_sdk_require_history_cutoff has been added.
 - CreateTabletResponsePB: An optional OpIdPB field cdc_sdk_safe_op_id has been added.
Jira: DB-10538

Test Plan:
./yb_build.sh --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestDynamicTablesAdditionForTableCreatedAfterStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestRetentionBarrierRaceWithUpdatePeersAndMetrics
./yb_build.sh --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestFailureSettingRetentionBarrierOnDynamicTable
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testDynamicTableAdditionForAllTablesPublication'

Reviewers: asrinivasan, skumar, stiwary, sergei

Reviewed By: sergei

Subscribers: ybase, ycdcxcluster, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33813
Summary:
These releases include:

	Upgrade cqlsh version to v3.10-yb-20 : https://github.com/yugabyte/cqlsh/releases/tag/v3.10-yb-20

Test Plan: Existing Tests

Reviewers: steve.varnau, asrivastava

Reviewed By: steve.varnau, asrivastava

Differential Revision: https://phorge.dev.yugabyte.com/D34415
…er attributes as well

Summary:
Currently, during our YBA ↔︎ YBDB LDAP sync, we assume that the user name we want to sync is present on the DN. While that is true for most scenarios, some customers may want to sync a user name present on a different attribute, for example `sAMAccountName`.

This diff performs the sync based on the attribute the user specified in the payload. If the user has specified `ldapUserfieldAttribute`, the user name will always be retrieved from this attribute on the LDAP server. If this is not specified, `ldapUserfield` should be specified (to get the user name from the DN); otherwise the sync fails with the message: `Either of the ldapUserfield or ldapUserfieldAttribute is necessary to perform the sync`

Test Plan:
  - Triggered the sync with the `ldapUserfieldAttribute` and synced only the users that have this attribute set on the LDAP server
  - Triggered the sync with the `ldapUserfield` and observed the sync where the user name is retrieved from the dn
  - Triggered the sync with both the `ldapUserfieldAttribute` and the `ldapUserfield` configured, and verified that preference is given to `ldapUserfieldAttribute`
  - Triggered the sync specifying neither `ldapUserfieldAttribute` nor `ldapUserfield`, and observed the exception thrown.

Reviewers: #yba-api-review!, svarshney

Reviewed By: svarshney

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34412
Summary:
The `SELECT … FOR UPDATE` command, when applied to multiple keys, currently locks each row serially. This approach results in increased latency due to multiple round-trip communications (RPC requests) to the DocDB storage layer for lock acquisition. This can significantly impact the performance of applications relying on transactional consistency for multi-row operations.

The primary goal of this revision is to enhance the performance of multi-key `SELECT … FOR UPDATE` queries by implementing a batched locking mechanism. This approach will aggregate lock requests for multiple rows and execute them in a single RPC call to the DocDB layer, thereby reducing latency and improving overall transaction throughput.

This revision plans to apply the optimization to all forms of explicit locking supported by PostgreSQL (`FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, FOR KEY SHARE`). In terms of implementation, this means buffering operations for many types of `RowMarkType`.
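
For reference, the four PostgreSQL explicit locking forms in question, shown on a hypothetical table `t` with integer key `k`:

```
SELECT * FROM t WHERE k <= 100 FOR UPDATE;
SELECT * FROM t WHERE k <= 100 FOR NO KEY UPDATE;
SELECT * FROM t WHERE k <= 100 FOR SHARE;
SELECT * FROM t WHERE k <= 100 FOR KEY SHARE;
```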

To control the batch size, use the gflag as follows: `SET yb_explicit_row_locking_batch_size = <size>;`, where `<size>` is a positive integer. Note that this flag is set to 1 by default, which disables the feature.

As an example, consider the following table with a single primary key column in which we insert 100 rows:

```
CREATE TABLE tbl (k INT PRIMARY KEY);
INSERT INTO tbl (SELECT i FROM generate_series(1, 100) AS i);
```

Currently, explicitly acquiring row-level locks for all 100 rows results in `Storage Read Requests: 101`, as we are performing one initial read, and then one read for every row we intend to acquire a lock for:

```
yugabyte=# EXPLAIN (ANALYZE, DIST) SELECT * FROM tbl WHERE k <= 100 FOR UPDATE;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 LockRows  (cost=0.00..112.50 rows=1000 width=36) (actual time=7.027..156.969 rows=100 loops=1)
   ->  Seq Scan on tbl  (cost=0.00..102.50 rows=1000 width=36) (actual time=3.021..3.606 rows=100 loops=1)
         Remote Filter: (k <= 100)
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 2.488 ms
         Storage Table Rows Scanned: 100
 Planning Time: 0.098 ms
 Execution Time: 157.274 ms
 Storage Read Requests: 101
 Storage Read Execution Time: 139.504 ms
 Storage Rows Scanned: 200
 Storage Write Requests: 0
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 139.504 ms
 Peak Memory Usage: 24 kB
(17 rows)
```

By reducing the number of RPCs with this optimization, we end up with `Storage Read Requests: 2`. This is because we perform one initial read request followed by a single request for the locks, significantly reducing the total execution time:

```
yugabyte=# SET yb_explicit_row_locking_batch_size = 1024;
SET
yugabyte=# EXPLAIN (ANALYZE, DIST) SELECT * FROM tbl WHERE k <= 100 FOR UPDATE;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 LockRows  (cost=0.00..112.50 rows=1000 width=36) (actual time=3.883..19.285 rows=100 loops=1)
   ->  Seq Scan on tbl  (cost=0.00..102.50 rows=1000 width=36) (actual time=3.810..4.375 rows=100 loops=1)
         Remote Filter: (k <= 100)
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 2.532 ms
         Storage Table Rows Scanned: 100
 Planning Time: 0.970 ms
 Execution Time: 19.621 ms
 Storage Read Requests: 2
 Storage Read Execution Time: 2.535 ms
 Storage Rows Scanned: 200
 Storage Write Requests: 0
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 2.535 ms
 Peak Memory Usage: 24 kB
(17 rows)
```
Jira: DB-9512

Test Plan:
Added a new SQL regress test `yb_explicit_row_lock_batching.sql/.out` to `yb_misc_serial4_schedule`, which can be run with the following command:

`./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressMisc#testPgRegressMiscSerial4'`

The test is based off `yb_explicit_row_lock_planning.sql/.out`, but includes `EXPLAIN (ANALYZE, DIST)` commands with deterministic fields to track the number of requests, ensuring that we are flushing once.
Also, there are some newly added cases, such as:
- Simple `JOIN` with top-level locking
- `JOIN` with leaf-level locking (sub-query)
- When `LIMIT` returns fewer rows than the filtered query
- Filter on the Postgres side, with `NOW()`

Reviewers: kramanathan, dmitry

Reviewed By: kramanathan, dmitry

Subscribers: yql, smishra, patnaik.balivada

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32543
…ion_slots view

Summary:
This is related to the project to support Replication slot API in YSQL (yugabyte#18724).
(https://phorge.dev.yugabyte.com/D29194).
This is also related to the PG Compatible Logical Replication Consumption project.

The schema of the pg_replication_slots view has been modified by adding an extra yb-specific column, yb_restart_commit_ht, which is an int8.

The value of this column is a uint64 representation of the commit Hybrid Time corresponding to the restart_lsn. It can be used by a client (like the YB-PG Connector) to perform a consistent snapshot (as of the consistent_point) when a replication slot already exists.
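
As an illustration, the new column can be read alongside the standard view columns (output values depend on the slot state):

```
-- yb_restart_commit_ht is the uint64 commit hybrid time matching restart_lsn.
SELECT slot_name, restart_lsn, yb_restart_commit_ht FROM pg_replication_slots;
```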

UPGRADE/ROLLBACK SAFETY:
These changes are protected via the preview flag: ysql_yb_enable_replication_commands
Jira: DB-10956

Test Plan:
Manual Testing
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressReplicationSlot'
./yb_build.sh --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotent'
./yb_build.sh --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotentSingleConn'

Reviewers: stiwary, skumar

Reviewed By: stiwary

Subscribers: yql, ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D34279
Summary: Added required data-testid for os patching ui automation

Test Plan: Tested manually

Reviewers: kkannan

Reviewed By: kkannan

Subscribers: ui, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34417
…ot in Publication list

Summary:
Currently, the update peers and metrics thread does not move the retention barriers forward for tables not included in the publication's table list. This can hold up resources unnecessarily.

This diff makes the necessary changes in the update peers and metrics path to make sure that the retention barriers on such tables are also moved forward. The `record_id_commit_time` in the slot entry of the cdc_state table gives us the commit time of the last acknowledged transaction for a slot. Since the virtual WAL will not ship any records with commit time less than this, we can safely move the `cdc_sdk_safe_time` forward to this time. Even though `cdcsdk_producer` might receive WAL records with commit time less than this safe time (due to the unsorted nature of the WAL), it will filter these out based on the commit_time_threshold. When multiple slots (streams) exist on a namespace, the minimum `record_id_commit_time` among all the slots is chosen for `cdc_sdk_safe_time`.

These changes are applicable only to the replication slot model of consumption and are guarded by the flag `ysql_TEST_enable_replication_slot_consumption`. Also, in the same database environment, we will not support both the yb-connector and the pg-connector simultaneously, because this algorithm would move the retention barriers too aggressively for yb-connector consumption when the yb-connector is lagging.
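
As a small sketch of the scenario (table and publication names hypothetical): a table can be outside the publication's table list while still being in the stream's namespace, and its retention barriers now also advance.

```
CREATE TABLE t1 (id INT PRIMARY KEY);
CREATE TABLE t2 (id INT PRIMARY KEY);
-- t2 is not in the publication; before this diff its retention barriers were
-- never moved forward by the update peers and metrics thread.
CREATE PUBLICATION pub FOR TABLE t1;
```
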
Jira: DB-10691

Test Plan:
Jenkins: test regex: .*CDCSDK.*
./yb_build.sh --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestRetentionBarrierMovementForTablesNotInPublication

Reviewers: skumar, asrinivasan, siddharth.shah, stiwary

Reviewed By: asrinivasan, stiwary

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34021
Summary:
Added a safety check for cluster membership when deleting pods. Reuses the same logic as VMs:
(1) For master: checks if the Pod is in the master config
(2) For tserver: checks if the pod is hosting any tablets

Test Plan:
- Modified UT to accommodate the check
- Manually verified

Reviewers: anijhawan, nsingh

Reviewed By: anijhawan, nsingh

Subscribers: nsingh, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34024
Summary:
Added the AreNodesSafeToTakeDown check (1) at the task precheck before freeze, and (2) before upgrading each node.

Test Plan: Verified manually the check works as expected. Modified UTs for correct task flows

Reviewers: anijhawan, sanketh, cwang

Reviewed By: anijhawan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D33937
…neImage is used

Summary:
In YBM, we still use the centos7 AMIs that are passed as the `machineImage` param during universe create. The imageBundle will have `ec2-user` configured as the sshUser, given that we do not specify the AMI during provider creation, which ends up creating YBA-managed bundles.

This diff falls back to the sshUser configured in the provider when `machineImage` is passed in the universe create params.

Test Plan: Manually verified

Reviewers: vbansal

Reviewed By: vbansal

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34448
…icated masters

Summary:
- Added check for disk space during K8s scale down operations
- Modified existing checks to handle dedicated nodes cases
- We first check whether the task requires a preflight disk check: if there are pods to remove, or if the existing volume size is reduced.
- For Kubernetes, the query we run is as follows, based on the list of namespaces to check and the server type:
```
"sum(kubelet_volume_stats_used_bytes{namespace=~\"%s\","
          + " persistentvolumeclaim=~\"(.*)-yb-%s-(.*)\"})/1073741824";
```
- For the dedicated nodes case, we now provide explicit `exported_instance` names to gather only the required volume metrics

Test Plan: - Added UTs to check for positive/negative check scenarios

Reviewers: anijhawan, #yba-api-review, nsingh

Reviewed By: anijhawan, #yba-api-review, nsingh

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34202
Summary: Allow manual role change once auto-create user is turned off and RBAC is on.

Test Plan: manually verified that we are able to change the role once auto-create user is turned off and RBAC is on.

Reviewers: #yba-api-review!, svarshney

Reviewed By: svarshney

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34453
…tion Slots, tabs become skewed and eventually hidden

Summary:
Bug:
The action button uses absolute positioning. When we add new items to the menu bar, those menu items hide behind the "action" button.
Fix:
Re-arranged the action buttons and removed the absolute positioning.

Test Plan:
Tested manually

Before:
{F173054}
After:
{F173062}

Reviewers: lsangappa, jmak, rmadhavan

Reviewed By: lsangappa

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34461
Summary: Avoiding empty filters in graph requests.

Test Plan: tested manually

Reviewers: rmadhavan

Reviewed By: rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34477
Summary:
This PR fixes the DB audit log capture for multi-line YCQL queries. We now use the multiline config in the Filelog Receiver to split based on the audit log regex pattern prefix.

When we run the following YCQL query before this PR:
```
CREATE TABLE emp5(
  id int primary key);
```

We get an output log in Datadog like:
```
I0423 15:09:47.628911 16863 audit_logger.cc:518] AUDIT: user:anonymous|host:10.9.74.144:9042|source:10.9.74.144|port:43754|timestamp:1713884987628|type:CREATE_TABLE|category:DDL|ks:mydatabase|scope:emp5|operation:create table emp5(
```
Notice the truncated query.

Test Plan:
Manually tested the following scenario for YCQL before and after this change:
Flow:

1. Create universe
2. Enable DB audit logging
3. Run a single-line YCQL query:
```
create table emp12( id int PRIMARY KEY );
```
4. Run a multi-line YCQL query:
```
create table emp13(
  id int PRIMARY KEY
);
```
Step 3 datadog output:
```
I0423 15:48:50.103888 33418 audit_logger.cc:518] AUDIT: user:anonymous|host:10.9.74.144:9042|source:10.9.74.144|port:36596|timestamp:1713887330103|type:CREATE_TABLE|category:DDL|ks:mydatabase|scope:emp12|operation:create table emp12( id int PRIMARY KEY );
```

Step 4 datadog output:
```
I0423 15:55:22.701594 33417 audit_logger.cc:518] AUDIT: user:anonymous|host:10.9.74.144:9042|source:10.9.74.144|port:36596|timestamp:1713887722701|type:CREATE_TABLE|category:DDL|ks:mydatabase|scope:emp13|operation:create table emp13(
id int PRIMARY KEY
);
```

Reviewers: amalyshev

Reviewed By: amalyshev

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34425
* change the basedir to /Users/pthangamani/var/node from /tmp/ybd

* updating yes/no/partial icons to fa-sharp
…n table drop

Summary:
Whenever a table is dropped that is part of the CDC stream, `CleanUpCDCSDKStreamsMetadata()` is called to remove the cdc state table entries and sys catalog entries.

CleanUpCDCSDKStreamsMetadata computes tablets from two sources:
Set A - GetTablets() on all tables part of the stream metadata.
Set B - Read cdc_state table for the stream.

Remove entries from B that are not present in A.

We recently introduced a new state table entry for the replication slot. Based on the above algorithm, that entry is deleted from the cdc state table whenever a table is dropped from the stream. To prevent this deletion, we simply exclude the slot entry from the above algorithm.
Jira: DB-11044

Test Plan:
Jenkins: test regex: .*CDCSDKConsumptionConsistentChangesTest.*
./yb_build.sh --cxx-test cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestConsumptionAfterDroppingTableNotInPublication

Reviewers: asrinivasan, stiwary

Reviewed By: asrinivasan

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34423
Summary:
The ListLiveTabletServers API that was used for precheck functionality is not implemented for DB versions earlier than 2.8. We should skip the checks for those versions.

Test Plan:
create a universe with 2.6 version
upgrade to 2.8 -> success

Reviewers: cwang

Reviewed By: cwang

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34420
Summary: ASH integration for TS framework (RCA and Lock Contention)

Test Plan: ASH integration for TS framework

Reviewers: amalyshev, cdavid

Reviewed By: cdavid

Differential Revision: https://phorge.dev.yugabyte.com/D34401
Summary: Some fixes/hacks for demo discussed with Raj

Test Plan: unit tested

Reviewers: rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34486
Summary: Update YBA to use the latest TS Framework version

Test Plan: Update YBA to use the latest TS Framework version

Reviewers: amalyshev, cdavid

Reviewed By: cdavid

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34489
Summary: Fix invalid metadata

Test Plan: tested manually

Reviewers: rmadhavan

Reviewed By: rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34490
andrei-mart and others added 26 commits May 7, 2024 14:11
Summary:
New GUC variable to fine-tune the parallel range size.

The new variable yb_parallel_range_size is sent to DocDB, where it is used to determine where the next parallel range boundary should be. DocDB has the actual data files and can accurately calculate the parallel range size. While it is generally not possible to make ranges of exact size, DocDB tries to make them as close as possible to the requested size.

It is quite different from the similarly named yb_parallel_range_rows, which is handled on the Postgres side. Postgres compares the value of yb_parallel_range_rows to the number of table tuples in the optimizer's statistics to decide whether to use parallel read from the table at all.
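
A hedged usage sketch (the values and units shown are illustrative assumptions; the defaults are not stated here):

```
-- DocDB-side target for each parallel range boundary (value illustrative).
SET yb_parallel_range_size = '1MB';
-- Postgres-side knob that gates whether parallel reads are used at all
-- (value illustrative).
SET yb_parallel_range_rows = 10000;
```
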
Jira: DB-10843

Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressParallel#testPgRegressParallel'

Reviewers: timur, tnayak

Reviewed By: tnayak

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34043
Summary:
The code for tcmalloc profiling is currently in src/yb/server/pprof-path-handlers-util.h / .cc. It should be in the utils folder instead.
Jira: DB-11176

Test Plan: Built with `--use_gperftools_tcmalloc` and `--use_google_tcmalloc` and ran `tcmalloc_profile-test.cc` tests.

Reviewers: kfranz

Reviewed By: kfranz

Subscribers: esheng, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D34747
Summary: This fixes the build broken by D34747. The landed diff was not the most recent diff.

Test Plan: jenkins: skip (already ran jenkins on the latest version of the previous diff)

Reviewers: steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D34839
Summary:
The support bundle CRD now allows leaving out "components" fields, which will default
to collecting all components.

The auto-created provider should now also appear under the "managed kubernetes services" tab.

Test Plan: tested support bundle create

Reviewers: anijhawan

Reviewed By: anijhawan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34005
Summary: This makes sure we do not run already submitted upgrade tasks again.

Test Plan: Trivial

Reviewers: muthu, cwang

Reviewed By: cwang

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34842
…nel dialog

Summary:
Currently `smtpPort` accepts non-numeric characters, which is not expected. This diff adds validation for this in the frontend and throws the appropriate error before sending the request to the backend.

Validation added for smtpPort:
  - Must be an integer between 1 and 65535 inclusive

Test Plan:
Tested manually
  - smtpPort=0 => error: `SMTP Port must be between 1 and 65535`

Reviewers: kkannan, nbhatia, svarshney, lsangappa, ianderson

Reviewed By: kkannan, lsangappa, ianderson

Subscribers: ianderson, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34818
Clarify helm chart options for RBAC 

---------

Co-authored-by: Aishwarya Chakravarthy <achakravarthy@yugabyte.com>
…to-flag

Summary:
Combination of two commits: yugabyte@decb104111 + yugabyte@6de97906 introduced a compatibility issue between pre-decb104111 and post-6de97906 tserver versions if they are running at the same time during upgrade.

With the post-6de97906 logic, when the isolation level is not `SERIALIZABLE_ISOLATION`, `WriteQuery::DoCompleteExecute` can add a WRITE_OP that contains both `write_pairs` and `read_pairs` to the Raft log. The corresponding `read_pair` will have its key set to the encoded row key and its value set to `KeyEntryTypeAsChar::kNullLow`.

This WRITE_OP is then processed by `TransactionalWriter::Apply` and corresponding intents are generated and written to the intents DB.

In the post-6de97906 version, `TransactionalWriter::Apply` processes such a `read_pair` and as a result generates an intent `<row> -> kNullLow` of type `[kStrongRead]`.
But in the pre-decb104111 version, `TransactionalWriter::Apply` processes such a `read_pair` differently and generates an intent `<row> -> kNullLow` of type `[kStrongRead, kStrongWrite]`.

Then `ApplyIntentsContext::Entry` processes the intents; in the pre-decb104111 version, due to the presence of the `kStrongWrite` type, this intent gets written into the regular DB and results in the record
`<row> -> kNullLow`, which is an incorrect regular DB record. The effect of DocDB handling this record differs depending on the WHERE filter and the presence of aggregation; it may result in affected rows either not being visible to the user statement, or being visible but with non-PK columns set to null.

Given that `TransactionalWriter::Apply` only writes `read_pairs` for non-SERIALIZABLE_ISOLATION when new logic is enabled by `ysql_skip_row_lock_for_update`, the solution is to convert it to an auto-flag, so new logic is only enabled after all nodes are on the new version.

Also added `PgsqlWriteOperation::use_row_lock_for_update_`, which is initialized in the constructor to avoid changing behaviour within the same `PgsqlWriteOperation` instance (since the flag is now runtime and the auto-flag will change at runtime from false to true).
Jira: DB-10979

Test Plan: Run TPCC workload in parallel with an upgrade from 2.18.7.0-b38 to a build with this fix incorporated.

Reviewers: pjain, smishra, hsunder, dmitry

Reviewed By: pjain, hsunder, dmitry

Subscribers: yql, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34733
…idation

Summary: Added an Ignore and Save option for Azure, K8s, and GCP provider validation.

Test Plan: Tested manually

Reviewers: kkannan, jmak

Reviewed By: kkannan

Subscribers: ui, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34819
Summary: Removed console log

Test Plan: Tested manually

Reviewers: kkannan

Reviewed By: kkannan

Subscribers: ui, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34864
Summary:
This revision introduces support for the 'test_decoding' output plugin with the PG compatible logical replication support.

Key points:
1. The plugin does not take a publication list in its parameters, which differs from the pgoutput plugin. Here, we assume the client is interested in all the tables of the database. A future feature can optionally accept the publication list as well.

2. The plugin does not send the relation object. So the schema refresh callback is a NOOP.
Jira: DB-11193

Test Plan:
Jenkins: test regex: .*ReplicationSlot.*

./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testWithTestDecodingPlugin'

---
Testing with pg_recvlogical
1. Start the cluster
2. Create the table

```
CREATE TABLE dummy_table (
    id SERIAL PRIMARY KEY,
    col_integer INT,
    col_bigint BIGINT,
    col_decimal DECIMAL(10,2),
    col_text TEXT,
    col_boolean BOOLEAN,
    col_timestamp TIMESTAMP,
    col_date DATE,
    col_time TIME,
    col_json JSONB,
    col_array INT[]
);
```

3. Create the slot

```
build/latest/postgres/bin/pg_recvlogical -d yugabyte --slot=test --create-slot
```

4. Insert some data in the table
5. Receive data

```
build/latest/postgres/bin/pg_recvlogical -d yugabyte --slot=test --start -f -
```

Output
```
BEGIN 2
table public.dummy_table: INSERT: id[integer]:1 col_integer[integer]:1 col_bigint[bigint]:10 col_decimal[numeric]:100.5 col_text[text]:'Dummy Text 1' col_boolean[boolean]:false col_timestamp[timestamp without time zone]:'2024-05-08 14:33:48.52087' col_date[date]:'2024-05-07' col_time[time without time zone]:'13:34:48.52087' col_json[jsonb]:'{"key": "value", "key2": "value2"}' col_array[integer[]]:'{1,2,3,4,5,6,7,8,9,10}'
COMMIT 2
BEGIN 3
table public.dummy_table: INSERT: id[integer]:2 col_integer[integer]:2 col_bigint[bigint]:20 col_decimal[numeric]:201 col_text[text]:'Dummy Text 2' col_boolean[boolean]:true col_timestamp[timestamp without time zone]:'2024-05-08 14:32:49.717402' col_date[date]:'2024-05-06' col_time[time without time zone]:'12:34:49.717402' col_json[jsonb]:'{"key": "value", "key2": "value2"}' col_array[integer[]]:'{2,4,6,8,10,12,14,16,18,20}'
COMMIT 3
.....
```

Reviewers: asrinivasan

Reviewed By: asrinivasan

Subscribers: ycdcxcluster, yql

Differential Revision: https://phorge.dev.yugabyte.com/D34796
Test Plan: Manually verified that provider validation works after the change

Reviewers: svarshney, dkumar

Reviewed By: dkumar

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34853
…dex filters on bound columns

Summary:
In `yb_get_batched_index_paths` it was wrongly assumed that any batchable clause with its LHS belonging to an indexed column in the restriction list must be bound to the index as an index condition. However, such batchable clauses might have the LHS as `indexed_col::bpchar`, where `indexed_col` is an indexed text column. In such cases, the LHS doesn't exactly match an index column, leaving the whole clause to be treated as a qpqual/index filter.

Because of this assumption, `yb_get_batched_index_paths` misses checking whether such unbound batchable clauses are present. There is already a check to prevent batching in the presence of a qual whose LHS column is not bound to any batched clause (i.e., not present in `batched_inner_attnos`). This diff extends that logic to also prevent batching in the other case, where such a qual has an LHS column that is already bound to a batched clause.

One can technically also achieve the desired effect by maintaining a list of bound index conditions and checking amongst the quals to see which one is bound but this would be a bit more computationally intensive.
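
A hedged sketch of the problematic shape (tables and columns hypothetical): the cast keeps the clause's LHS from exactly matching the indexed column.

```
CREATE TABLE t1 (a TEXT PRIMARY KEY);  -- a is the indexed text column
CREATE TABLE t2 (b CHAR(10));
-- The batchable join clause has LHS t1.a::bpchar rather than t1.a itself, so it
-- is treated as a qpqual/index filter, not a bound index condition.
SELECT * FROM t2 JOIN t1 ON t1.a::bpchar = t2.b;
```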

Needs backports to: 2024.1, 2.20, 2.18
Jira: DB-10870

Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressJoin'

Reviewers: mtakahara

Reviewed By: mtakahara

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D34809
* test change

* added warning

* changed for 2.21 as well
Summary: [PLAT-13873] Adding exception handling to the support bundle creation CRD

Test Plan:
itest.
Caused an exception, verified YBA did not crash.

```
Reconciler no end date given, setting to current time
dev:yba logs  yugaware: 2024-05-08T21:09:01.291Z  [error]  SupportBundleReconciler.java:81 [-1887041250-pool-17-thread-25] com.yugabyte.yw.common.operator.SupportBundleReconciler Failed to add support bundle
dev:yba logs  yugaware: java.lang.NullPointerException: Cannot invoke "java.util.List.stream()" because the return value of "io.yugabyte.operator.v1alpha1.SupportBundleSpec.getComponents()" is null
dev:yba logs  yugaware:         at com.yugabyte.yw.common.operator.SupportBundleReconciler.onAddInternal(SupportBundleReconciler.java:124)
dev:yba logs  yugaware:         at com.yugabyte.yw.common.operator.SupportBundleReconciler.onAdd(SupportBundleReconciler.java:79)
dev:yba logs  yugaware:         at com.yugabyte.yw.common.operator.SupportBundleReconciler.onAdd(SupportBundleReconciler.java:31)
dev:yba logs  yugaware:         at io.fabric8.kubernetes.client.informers.impl.cache.ProcessorListener$AddNotification.handle(ProcessorListener.java:103)
dev:yba logs  yugaware:         at io.fabric8.kubernetes.client.informers.impl.cache.ProcessorListener.add(ProcessorListener.java:50)
dev:yba logs  yugaware:         at io.fabric8.kubernetes.client.informers.impl.cache.SharedProcessor.lambda$distribute$0(SharedProcessor.java:91)
dev:yba logs  yugaware:         at io.fabric8.kubernetes.client.informers.impl.cache.SharedProcessor.lambda$distribute$1(SharedProcessor.java:114)
dev:yba logs  yugaware:         at io.fabric8.kubernetes.client.utils.internal.SerialExecutor.lambda$execute$0(SerialExecutor.java:58)
dev:yba logs  yugaware:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
dev:yba logs  yugaware:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
dev:yba logs  yugaware:         at java.base/java.lang.Thread.run(Thread.java:833)
dev:yba logs  yugaware: 2024-05-08T21:09:01.321Z  [warn]  SupportBundleReconciler.java:164 [-1887041250-pool-17-thread-24] com.yugabyte.yw.common.operator.SupportBundleReconciler updating support bundle is not supported

```

Reviewers: dshubin, vkumar

Reviewed By: dshubin, vkumar

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34841
…ions that have arm architecture

Summary:
The List DB Versions API call fails to return DB versions that have arm architecture. Today, selecting the CPU architecture based on the provider is enabled only for AWS.

With the new release workflow, two issues happen:
1. When we select a specific CPU architecture, the DB version (release API) does not filter the versions based on it.
2. When we select AWS, where OS patching is enabled, the Arch value defaults to null even though the pointer points to the x86 architecture (an already existing issue I found today).

So I ensured that x86_64 is passed as the default arch type when the AWS provider is selected and OS patching is enabled.

Test Plan:
Please refer to the video
{F176414}

Reviewers: jmak, lsangappa

Reviewed By: jmak

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34836
…sion tests

Summary:
For colocation regression tests, query plan costs are irrelevant, so remove costs from the test output to benefit the PG15 merge.
Jira: DB-11225

Test Plan:
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressLegacyColocation#testPgRegressLegacyColocation'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressColocation#testPgRegressColocation'

Reviewers: jason

Reviewed By: jason

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34884
Summary:
The UI always assumed that the backend sends the timestamp in UTC. However, the backend gets the timestamp from the underlying OS. If the underlying OS uses a timezone other than UTC, the backend sends the timestamp as is, and moment fails to parse it.

Fix:

This fix specifies the exact input format (ddd MMM DD HH:mm:ss z YYYY) to moment so that it can parse the timestamp.

Test Plan: Tested on both UTC and non-UTC timestamps

Reviewers: lsangappa, jmak, rmadhavan

Reviewed By: lsangappa, jmak, rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34471
Summary: The Current Lag column was not sorting properly because we did not have a currentLag field in the HeaderColumn.

Test Plan:
Tested manually

**Screenshots**

{F176925}

{F176926}

Reviewers: kkannan

Reviewed By: kkannan

Subscribers: ui, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34895
…the catalog manager.

Summary:
This diff introduces a new class, MasterClusterHandler, to contain some RPC endpoint methods formerly in the CatalogManager.
Jira: DB-8520

Test Plan: existing tests

Reviewers: asrivastava

Reviewed By: asrivastava

Subscribers: ybase, slingam

Differential Revision: https://phorge.dev.yugabyte.com/D34749
…from Primary Cluster in Overview Page

Summary: Make the universe name a link after clicking the Details link from the Primary Cluster in the Overview page.

Test Plan:
Please refer to screenshot
Focusing on the header text makes it orange and clickable, consistent with other places that already do the same.
{F175713}

Reviewers: jmak

Reviewed By: jmak

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34742
Labels: area/documentation
Projects: Documentation (In progress)