
[doc][yba] 2024.1 High availability updates #22186

Open
wants to merge 335 commits into base: master

Conversation

Contributor

@ddhodge ddhodge commented Apr 29, 2024

High availability updates
DOC-270

@netlify /preview/yugabyte-platform/administer-yugabyte-platform/high-availability/

anmalysh-yb and others added 30 commits April 23, 2024 22:58
Summary: Subj

Test Plan: manually

Reviewers: rmadhavan

Reviewed By: rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34432
…8s resource spec.

Summary: [PLAT-13508] Calculate k8s ybc throttle params correctly when using k8s resource spec.

Test Plan: manual

Reviewers: vkumar

Reviewed By: vkumar

Subscribers: vkumar, sanketh, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34391
…tables

Summary:
Commit fad94f7 introduces a new tablet metadata field `skip_table_tombstone_check`, but it doesn't set this field for colocated tables. Fix this by setting the field in `AsyncAddTableToTablet()`.
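
For context, a minimal YSQL sketch of the setup this fix applies to (database and table names are hypothetical, and the `COLOCATION` syntax is assumed from recent YSQL versions): every table in a colocation-enabled database is added to a shared tablet, which is the `AsyncAddTableToTablet()` path patched here.

```
CREATE DATABASE codb WITH COLOCATION = true;  -- hypothetical colocation-enabled database
-- Connected to codb: the new table is added to the database's shared colocation
-- tablet, so its metadata must also carry skip_table_tombstone_check.
CREATE TABLE t (k INT PRIMARY KEY, v TEXT);
```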

Backport-to: 2024.1
Jira: DB-11043

Test Plan: ./yb_build.sh --cxx-test pgwrapper_pg_mini-test --gtest_filter PgMiniTest.SkipTableTombstoneCheckMetadata

Reviewers: asrivastava

Reviewed By: asrivastava

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D34428
Summary:
Add changes to `yb-server-ctl.yml` to make ansibleDestroy idempotent.

The order of clean up was as follows:
1. Stop yb-master, yb-tserver, yb-controller services (if systemd)
2. Delete systemd units for (1) + other services
3. Stop + delete systemd units for node exporter + otel collector
4. Clean/Clean logs for (1)
5. Clean out the data directory + home directory. At the end, delete the `yb-server-ctl.sh`

The issue is that once step 2 has been performed, retrying ansibleDestroy will always fail, because step 1 can no longer stop services whose systemd units were deleted. The same applies to retrying (3). Also, at the end of step 5, we delete `yb-server-ctl.sh`; if we then retry, steps 4 and 5 will fail, since they both use the cleanup script.

For the case where the systemd units no longer exist, we first check the status of the systemd unit before stopping it. As for the cleanup script, if it does not exist, we skip the cleanup (this is acceptable; in the worst case, we catch the issue on node addition when the preflight check fails).

Test Plan:
1. Create a 3 node rf3 on-prem universe. Stop one of the VMs from the cloud provider. Perform a replace node. Make sure that the node is in decommissioned state. Then start the VM for the node in the decommissioned state. Perform a 'recommission' action. Make sure that this succeeds.

2. Create a 3 node rf3 on-prem universe. Inject an error at the end of the `OnPremDestroyInstancesMethod` method, to make sure it fails. Then replace a node. The node will be placed into the decommissioned state because it failed. At this point, all the systemd units, clean up scripts, home/data directories are already cleaned up. Performing a 'recommission' action will still succeed (before it would not).

3. Verify that the correct systemd type is used, i.e. either user-level or system-level systemd.

Reviewers: sanketh, nsingh

Reviewed By: nsingh

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34039
…me config on universe form.

Summary:
Fixes a bug where a fetch for provider-level runtime config was not requesting inherited values. This led to an incorrect read of the runtime config flag if no value was set at the provider level.

Test Plan:
Set `yb.universe.geo_partitioning_enabled` to `true` at the customer level.
Verify that the universe form recognizes the inherited `true` setting.
{F171727}
{F171728}
{F171726}

Set `yb.universe.geo_partitioning_enabled` to `false` at the customer level
and `true` at the provider level.
Verify that the universe form reads the runtime value from the provider.
{F171729}
{F171730}
{F171731}

Reviewers: anijhawan, rmadhavan, kkannan

Reviewed By: rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34312
Summary:
When `ysql_ddl_transaction_wait_for_ddl_verification` is enabled, the PG client performs `WaitForDdlVerificationToFinish`, which waits for the txn state to be cleared in the master.
The `TestCleanUpCDCStreamsMetadataDuringTabletSplit` tests work by blocking the catalog manager background task, which used to cause the table delete to get stuck.

This was because `YsqlDdlTxnDropTableHelper` does not clear the txn state after it has deleted the table. This function is run by `TableSchemaVerificationTask` or `ReportYsqlDdlTxnStatus`. The table delete does not clear the txn state synchronously because `CatalogManager::CheckTableDeleted` skips the cleanup, since it does a `HasTasks` check on the table which will return true due to the `TableSchemaVerificationTask` itself.

Other tests don't hit this issue because the background task also runs `CatalogManager::CleanUpDeletedTables`, which calls `RemoveDdlTransactionState` and eventually unblocks `WaitForDdlVerificationToFinish`.

Even without the test blocking the task, this is not optimal. We should call `RemoveDdlTransactionState` as soon as the table is marked as DELETING and unblock the client, instead of making it wait on the background task, which is slow and can take arbitrarily long.

This change fixes the issue by performing `RemoveDdlTransactionState` in `YsqlDdlTxnDropTableHelper`.
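
As a rough sketch of the user-visible shape (table name hypothetical): with `ysql_ddl_transaction_wait_for_ddl_verification` enabled, the client's DROP returns only after the txn state is cleared, which previously could wait on the background task.

```
CREATE TABLE t (k INT PRIMARY KEY);
-- Before this fix, the DROP below could block in WaitForDdlVerificationToFinish
-- until the background task ran; now the txn state is cleared as soon as the
-- table is marked DELETING.
DROP TABLE t;
```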

Fixes yugabyte#22095
Jira: DB-11021

Test Plan:
CDCSDKTabletSplitTest.TestCleanUpCDCStreamsMetadataDuringTabletSplitImplicit
CDCSDKTabletSplitTest.TestCleanUpCDCStreamsMetadataDuringTabletSplitExplicit

Reviewers: myang

Reviewed By: myang

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D34431
Summary: These are some of the leaks discovered. We will need to watch out for leaks like this.

Test Plan: Trivial.

Reviewers: amalyshev, nbhatia, sanketh, muthu, anijhawan

Reviewed By: anijhawan

Subscribers: anijhawan, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34435
* simplifying sys catalog

* cleaning up sys catalog

* removing store term for views

* info about information schema

* icons for table/view

* update to 5/3 model

* Apply suggestions from code review

---------

Co-authored-by: Dwight Hodge <79169168+ddhodge@users.noreply.github.com>
… time of table creation

Summary:
In order to support dynamic table addition with PG replication consumption, for tables created after stream creation, we need to set retention barriers on the tablets of such tables at the time of table creation.

If replication slot consumption is enabled, then whenever a new table is created we check if at least one stream exists on the namespace. If it does, we set the field `cdc_sdk_require_history_cutoff` to true in the CreateTablet request.

The tablet_service creates the tablet and, based on the value of `cdc_sdk_require_history_cutoff`, calls `SetAllInitialCDCSDKRetentionBarriers()` to set the retention barriers on the tablet. The method `PopulateCDCStateTableOnNewTableCreation()` is called in the callback for CreateTablet and adds entries for the tablet in the cdc_state table.
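
A minimal sketch of the dynamic-table scenario (slot, plugin, and table names hypothetical): because the stream exists before the table, the new table's tablets get retention barriers at creation time.

```
-- A stream already exists on the namespace...
SELECT * FROM pg_create_logical_replication_slot('test_slot', 'pgoutput');
-- ...so this table, created after the slot, gets cdc_sdk_require_history_cutoff
-- set in its CreateTablet request and retention barriers set at creation.
CREATE TABLE orders (id INT PRIMARY KEY, total NUMERIC);
```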

**Upgrade/Rollback safety:**
The retention barriers will only be set on the tablets of a newly added table if replication slot consumption is enabled. This is guarded by the flag `ysql_TEST_enable_replication_slot_consumption`.

Protos modified:
 - CreateTabletRequestPB: An optional boolean field cdc_sdk_require_history_cutoff has been added.
 - CreateTabletResponsePB: An optional OpIdPB field cdc_sdk_safe_op_id has been added.
Jira: DB-10538

Test Plan:
./yb_build.sh --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestDynamicTablesAdditionForTableCreatedAfterStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestRetentionBarrierRaceWithUpdatePeersAndMetrics
./yb_build.sh --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestFailureSettingRetentionBarrierOnDynamicTable
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testDynamicTableAdditionForAllTablesPublication'

Reviewers: asrinivasan, skumar, stiwary, sergei

Reviewed By: sergei

Subscribers: ybase, ycdcxcluster, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D33813
Summary:
These releases include:

	Upgrade cqlsh version to v3.10-yb-20 : https://github.com/yugabyte/cqlsh/releases/tag/v3.10-yb-20

Test Plan: Existing Tests

Reviewers: steve.varnau, asrivastava

Reviewed By: steve.varnau, asrivastava

Differential Revision: https://phorge.dev.yugabyte.com/D34415
…er attributes as well

Summary:
Currently, during our YBA ↔︎ YBDB LDAP sync, we assume that the user name we want to sync is present on the DN. While that is true for most scenarios, some customers may want to sync a user name present on a different attribute, for example `sAMAccountName`.

This diff performs the sync based on the attribute the user specified in the payload. If the user has specified `ldapUserfieldAttribute`, the user name will always be retrieved from this attribute on the LDAP server. If this is not specified, `ldapUserfield` should be specified (to get the user name from the DN); otherwise the sync fails with the message: `Either of the ldapUserfield or ldapUserfieldAttribute is necessary to perform the sync`

Test Plan:
  - Triggered the sync with the `ldapUserfieldAttribute` and synced only the users that have this attribute set on the LDAP server
  - Triggered the sync with the `ldapUserfield` and observed the sync where the user name is retrieved from the dn
  - Triggered the sync with both the `ldapUserfieldAttribute` and the `ldapUserfield` configured, and verified that preference is given to `ldapUserfieldAttribute`
  - Triggered the sync specifying neither `ldapUserfieldAttribute` nor `ldapUserfield`, and observed the exception thrown.

Reviewers: #yba-api-review!, svarshney

Reviewed By: svarshney

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34412
Summary:
The `SELECT … FOR UPDATE` command, when applied to multiple keys, currently locks each row serially. This approach results in increased latency due to multiple round-trip communications (RPC requests) to the DocDB storage layer for lock acquisition. This can significantly impact the performance of applications relying on transactional consistency for multi-row operations.

The primary goal of this revision is to enhance the performance of multi-key `SELECT … FOR UPDATE` queries by implementing a batched locking mechanism. This approach will aggregate lock requests for multiple rows and execute them in a single RPC call to the DocDB layer, thereby reducing latency and improving overall transaction throughput.

This revision plans to apply the optimization to all forms of explicit locking supported by PostgreSQL (`FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, FOR KEY SHARE`). In terms of implementation, this means buffering operations for many types of `RowMarkType`.
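
For reference, the four PostgreSQL explicit locking forms in question, shown on a hypothetical table `t` with integer key `k`:

```
SELECT * FROM t WHERE k <= 100 FOR UPDATE;
SELECT * FROM t WHERE k <= 100 FOR NO KEY UPDATE;
SELECT * FROM t WHERE k <= 100 FOR SHARE;
SELECT * FROM t WHERE k <= 100 FOR KEY SHARE;
```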

To control the batch size, use the gflag as follows: `SET yb_explicit_row_locking_batch_size = <size>;`, where `<size>` is a positive integer. Note that this flag is set to 1 by default, which disables the feature.

As an example, consider the following table with a single primary key column in which we insert 100 rows:

```
CREATE TABLE tbl (k INT PRIMARY KEY);
INSERT INTO tbl (SELECT i FROM generate_series(1, 100) AS i);
```

Currently, explicitly acquiring row-level locks for all 100 rows results in `Storage Read Requests: 101`, as we are performing one initial read, and then one read for every row we intend to acquire a lock for:

```
yugabyte=# EXPLAIN (ANALYZE, DIST) SELECT * FROM tbl WHERE k <= 100 FOR UPDATE;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 LockRows  (cost=0.00..112.50 rows=1000 width=36) (actual time=7.027..156.969 rows=100 loops=1)
   ->  Seq Scan on tbl  (cost=0.00..102.50 rows=1000 width=36) (actual time=3.021..3.606 rows=100 loops=1)
         Remote Filter: (k <= 100)
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 2.488 ms
         Storage Table Rows Scanned: 100
 Planning Time: 0.098 ms
 Execution Time: 157.274 ms
 Storage Read Requests: 101
 Storage Read Execution Time: 139.504 ms
 Storage Rows Scanned: 200
 Storage Write Requests: 0
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 139.504 ms
 Peak Memory Usage: 24 kB
(17 rows)
```

By reducing the number of RPCs with this optimization, we end up with `Storage Read Requests: 2`. This is because we perform one initial read request followed by a single request for the locks, significantly reducing the total execution time:

```
yugabyte=# SET yb_explicit_row_locking_batch_size = 1024;
SET
yugabyte=# EXPLAIN (ANALYZE, DIST) SELECT * FROM tbl WHERE k <= 100 FOR UPDATE;
                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 LockRows  (cost=0.00..112.50 rows=1000 width=36) (actual time=3.883..19.285 rows=100 loops=1)
   ->  Seq Scan on tbl  (cost=0.00..102.50 rows=1000 width=36) (actual time=3.810..4.375 rows=100 loops=1)
         Remote Filter: (k <= 100)
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 2.532 ms
         Storage Table Rows Scanned: 100
 Planning Time: 0.970 ms
 Execution Time: 19.621 ms
 Storage Read Requests: 2
 Storage Read Execution Time: 2.535 ms
 Storage Rows Scanned: 200
 Storage Write Requests: 0
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Storage Execution Time: 2.535 ms
 Peak Memory Usage: 24 kB
(17 rows)
```
Jira: DB-9512

Test Plan:
Added a new SQL regress test `yb_explicit_row_lock_batching.sql/.out` to `yb_misc_serial4_schedule`, which can be run with the following command:

`./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressMisc#testPgRegressMiscSerial4'`

The test is based off `yb_explicit_row_lock_planning.sql/.out`, but includes `EXPLAIN (ANALYZE, DIST)` commands with deterministic fields to track the number of requests, ensuring that we are flushing once.
Also, there are some newly added cases, such as:
- Simple `JOIN` with top-level locking
- `JOIN` with leaf-level locking (sub-query)
- When `LIMIT` returns fewer rows than the filtered query
- Filter on the Postgres side, with `NOW()`

Reviewers: kramanathan, dmitry

Reviewed By: kramanathan, dmitry

Subscribers: yql, smishra, patnaik.balivada

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D32543
…ion_slots view

Summary:
This is related to the project to support Replication slot API in YSQL (yugabyte#18724).
(https://phorge.dev.yugabyte.com/D29194).
This is also related to the PG Compatible Logical Replication Consumption project.

The schema of the pg_replication_slots view has been modified by adding an extra yb-specific column, yb_restart_commit_ht, which is an int8.

The value of this column is a uint64 representation of the commit Hybrid Time corresponding to the restart_lsn. It can be used by a client (like the YB-PG Connector) to perform a consistent snapshot (as of the consistent_point) when a replication slot already exists.
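
As an illustration, the new column can be read alongside the standard view columns (output values depend on the slot state):

```
-- yb_restart_commit_ht is the uint64 commit hybrid time matching restart_lsn.
SELECT slot_name, restart_lsn, yb_restart_commit_ht FROM pg_replication_slots;
```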

UPGRADE/ROLLBACK SAFETY:
These changes are protected via the preview flag: ysql_yb_enable_replication_commands
Jira: DB-10956

Test Plan:
Manual Testing
./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressReplicationSlot'
./yb_build.sh --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotent'
./yb_build.sh --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotentSingleConn'

Reviewers: stiwary, skumar

Reviewed By: stiwary

Subscribers: yql, ycdcxcluster

Differential Revision: https://phorge.dev.yugabyte.com/D34279
Summary: Added required data-testid for os patching ui automation

Test Plan: Tested manually

Reviewers: kkannan

Reviewed By: kkannan

Subscribers: ui, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34417
…ot in Publication list

Summary:
Currently, the update peers and metrics thread does not move the retention barriers forward for tables not included in the publication's table list. This can hold up resources unnecessarily.

This diff makes the necessary changes in the update peers and metrics path to make sure that the retention barriers on such tables are also moved forward. The `record_id_commit_time` in the slot entry of the cdc_state table gives us the commit time of the last acknowledged transaction for a slot. Since the virtual WAL will not ship any records with commit time less than this, we can safely move the `cdc_sdk_safe_time` forward to this time. Even though `cdcsdk_producer` might receive WAL records with commit time less than this safe time (due to the unsorted nature of the WAL), it will filter these out based on the commit_time_threshold. When multiple slots (streams) exist on a namespace, the minimum `record_id_commit_time` among all the slots is chosen for `cdc_sdk_safe_time`.

These changes are applicable only to the replication slot model of consumption and are guarded by the flag `ysql_TEST_enable_replication_slot_consumption`. Also, in the same database environment, we will not support both the yb-connector and the pg-connector simultaneously, because this algorithm would move the retention barriers too aggressively for yb-connector consumption when the yb-connector is lagging.
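
As a small sketch of the scenario (table and publication names hypothetical): a table can be outside the publication's table list while still being in the stream's namespace, and its retention barriers now also advance.

```
CREATE TABLE t1 (id INT PRIMARY KEY);
CREATE TABLE t2 (id INT PRIMARY KEY);
-- t2 is not in the publication; before this diff its retention barriers were
-- never moved forward by the update peers and metrics thread.
CREATE PUBLICATION pub FOR TABLE t1;
```
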
Jira: DB-10691

Test Plan:
Jenkins: test regex: .*CDCSDK.*
./yb_build.sh --cxx-test integration-tests_cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestRetentionBarrierMovementForTablesNotInPublication

Reviewers: skumar, asrinivasan, siddharth.shah, stiwary

Reviewed By: asrinivasan, stiwary

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34021
Summary:
Added a safety check for cluster membership when deleting pods. Reuses the same logic as VMs:
(1) For master: checks if the Pod is in the master config
(2) For tserver: checks if the pod is hosting any tablets

Test Plan:
- Modified UT to accommodate the check
- Manually verified

Reviewers: anijhawan, nsingh

Reviewed By: anijhawan, nsingh

Subscribers: nsingh, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34024
Summary:
Added the AreNodesSafeToTakeDown check (1) at the task precheck before freeze, and (2) before upgrading each node.

Test Plan: Verified manually the check works as expected. Modified UTs for correct task flows

Reviewers: anijhawan, sanketh, cwang

Reviewed By: anijhawan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D33937
…neImage is used

Summary:
In YBM, we still use the centos7 AMIs that are passed as the `machineImage` param during universe create. The imageBundle will have `ec2-user` configured as the sshUser, given that we do not specify the AMI during provider creation, which ends up creating YBA-managed bundles.

This diff falls back to the sshUser configured in the provider when `machineImage` is passed in the universe create params.

Test Plan: Manually verified

Reviewers: vbansal

Reviewed By: vbansal

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34448
…icated masters

Summary:
- Added check for disk space during K8s scale down operations
- Modified existing checks to handle dedicated nodes cases
- We first check whether the task requires a preflight disk check: if there are pods to remove, or if the existing volume size is reduced.
- For Kubernetes, the query we run is as follows, based on the list of namespaces to check and the server type:
```
"sum(kubelet_volume_stats_used_bytes{namespace=~\"%s\","
          + " persistentvolumeclaim=~\"(.*)-yb-%s-(.*)\"})/1073741824";
```
- For the dedicated nodes case, we now provide explicit `exported_instance` names to gather only the required volume metrics

Test Plan: - Added UTs to check for positive/negative check scenarios

Reviewers: anijhawan, #yba-api-review, nsingh

Reviewed By: anijhawan, #yba-api-review, nsingh

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34202
Summary: Allow manual role change once auto-create user is turned off and RBAC is on.

Test Plan: manually verified that we are able to change the role once auto-create user is turned off and RBAC is on.

Reviewers: #yba-api-review!, svarshney

Reviewed By: svarshney

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34453
…tion Slots, tabs become skewed and eventually hidden

Summary:
Bug:
The action button uses absolute positioning. When we add new items to the menu bar, those menu items hide behind the "action" button.
Fix:
Re-arranged the action buttons and removed the absolute positioning.

Test Plan:
Tested manually

Before:
{F173054}
After:
{F173062}

Reviewers: lsangappa, jmak, rmadhavan

Reviewed By: lsangappa

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34461
Summary: Avoiding empty filters in graph requests.

Test Plan: tested manually

Reviewers: rmadhavan

Reviewed By: rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34477
Summary:
This PR fixes the DB audit log capture for multi-line YCQL queries. We now use the multiline config in the Filelog Receiver to split based on the audit log regex pattern prefix.

When we run the following YCQL query before this PR:
```
CREATE TABLE emp5(
  id int primary key);
```

We get an output log in Datadog like:
```
I0423 15:09:47.628911 16863 audit_logger.cc:518] AUDIT: user:anonymous|host:10.9.74.144:9042|source:10.9.74.144|port:43754|timestamp:1713884987628|type:CREATE_TABLE|category:DDL|ks:mydatabase|scope:emp5|operation:create table emp5(
```
Notice the truncated query.

Test Plan:
Manually tested the following scenario for YCQL before and after this change:
Flow:

1. Create universe
2. Enable DB audit logging
3. Run a single-line YCQL query:
```
create table emp12( id int PRIMARY KEY );
```
4. Run a multi-line YCQL query:
```
create table emp13(
  id int PRIMARY KEY
);
```
Step 3 datadog output:
```
I0423 15:48:50.103888 33418 audit_logger.cc:518] AUDIT: user:anonymous|host:10.9.74.144:9042|source:10.9.74.144|port:36596|timestamp:1713887330103|type:CREATE_TABLE|category:DDL|ks:mydatabase|scope:emp12|operation:create table emp12( id int PRIMARY KEY );
```

Step 4 datadog output:
```
I0423 15:55:22.701594 33417 audit_logger.cc:518] AUDIT: user:anonymous|host:10.9.74.144:9042|source:10.9.74.144|port:36596|timestamp:1713887722701|type:CREATE_TABLE|category:DDL|ks:mydatabase|scope:emp13|operation:create table emp13(
id int PRIMARY KEY
);
```

Reviewers: amalyshev

Reviewed By: amalyshev

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34425
* change the basedir to /Users/pthangamani/var/node from /tmp/ybd

* updating yes/no/partial icons to fa-sharp
…n table drop

Summary:
Whenever a table is dropped that is part of the CDC stream, `CleanUpCDCSDKStreamsMetadata()` is called to remove the cdc state table entries and sys catalog entries.

CleanUpCDCSDKStreamsMetadata computes tablets from two sources:
Set A - GetTablets() on all tables part of the stream metadata.
Set B - Read cdc_state table for the stream.

Remove entries from B that are not present in A.

We recently introduced a new state table entry for the replication slot. Based on the above algorithm, that entry is deleted from the cdc state table whenever a table is dropped from the stream. To prevent this deletion, we simply exclude the slot entry from the above algorithm.
Jira: DB-11044

Test Plan:
Jenkins: test regex: .*CDCSDKConsumptionConsistentChangesTest.*
./yb_build.sh --cxx-test cdcsdk_consumption_consistent_changes-test --gtest_filter CDCSDKConsumptionConsistentChangesTest.TestConsumptionAfterDroppingTableNotInPublication

Reviewers: asrinivasan, stiwary

Reviewed By: asrinivasan

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34423
Summary:
The ListLiveTabletServers API that was used for precheck functionality is not implemented for DB versions earlier than 2.8. We should skip the checks for those versions.

Test Plan:
create a universe with 2.6 version
upgrade to 2.8 -> success

Reviewers: cwang

Reviewed By: cwang

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34420
Summary: ASH integration for TS framework (RCA and Lock Contention)

Test Plan: ASH integration for TS framework

Reviewers: amalyshev, cdavid

Reviewed By: cdavid

Differential Revision: https://phorge.dev.yugabyte.com/D34401
Summary: Some fixes/hacks for demo discussed with Raj

Test Plan: unit tested

Reviewers: rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34486
Summary: Update YBA to use the latest TS Framework version

Test Plan: Update YBA to use the latest TS Framework version

Reviewers: amalyshev, cdavid

Reviewed By: cdavid

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34489
Summary: Fix invalid metadata

Test Plan: tested manually

Reviewers: rmadhavan

Reviewed By: rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34490
andrei-mart and others added 26 commits May 7, 2024 14:11
Summary:
New GUC variable to fine-tune the parallel range size.

The new variable yb_parallel_range_size is sent to DocDB, where it is used to determine where the next parallel range boundary should be. DocDB has the actual data files and can accurately calculate the parallel range size. While it is generally not possible to make ranges of exact size, DocDB tries to make them as close as possible to the requested size.

It is quite different from the similarly named yb_parallel_range_rows, which is handled on the Postgres side. Postgres compares the value of yb_parallel_range_rows to the number of table tuples in the optimizer's statistics to decide whether to use parallel read from the table at all.
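
A hedged usage sketch (the values and units shown are illustrative assumptions; the defaults are not stated here):

```
-- DocDB-side target for each parallel range boundary (value illustrative).
SET yb_parallel_range_size = '1MB';
-- Postgres-side knob that gates whether parallel reads are used at all
-- (value illustrative).
SET yb_parallel_range_rows = 10000;
```
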
Jira: DB-10843

Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressParallel#testPgRegressParallel'

Reviewers: timur, tnayak

Reviewed By: tnayak

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34043
Summary:
The code for tcmalloc profiling is currently in src/yb/server/pprof-path-handlers-util.h / .cc. It should be in the utils folder instead.
Jira: DB-11176

Test Plan: Built with `--use_gperftools_tcmalloc` and `--use_google_tcmalloc` and ran `tcmalloc_profile-test.cc` tests.

Reviewers: kfranz

Reviewed By: kfranz

Subscribers: esheng, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D34747
Summary: This fixes the build broken by D34747. The landed diff was not the most recent diff.

Test Plan: jenkins: skip (already ran jenkins on the latest version of the previous diff)

Reviewers: steve.varnau

Reviewed By: steve.varnau

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D34839
Summary:
The support bundle CRD now allows leaving out "components" fields, which will default
to collecting all components.

The auto-created provider should now also appear under the "managed kubernetes services" tab.

Test Plan: tested support bundle create

Reviewers: anijhawan

Reviewed By: anijhawan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34005
Summary: This makes sure we do not run already submitted upgrade tasks again.

Test Plan: Trivial

Reviewers: muthu, cwang

Reviewed By: cwang

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34842
…nel dialog

Summary:
Currently `smtpPort` accepts non-numeric characters, which is not expected. This diff adds validation for this in the frontend and throws the appropriate error before sending the request to the backend.

Validation added for smtpPort:
  - Must be an integer between 1 and 65535 inclusive

Test Plan:
Tested manually
  - smtpPort=0 => error: `SMTP Port must be between 1 and 65535`

Reviewers: kkannan, nbhatia, svarshney, lsangappa, ianderson

Reviewed By: kkannan, lsangappa, ianderson

Subscribers: ianderson, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34818
Clarify helm chart options for RBAC 

---------

Co-authored-by: Aishwarya Chakravarthy <achakravarthy@yugabyte.com>
…to-flag

Summary:
Combination of two commits: yugabyte@decb104111 + yugabyte@6de97906 introduced a compatibility issue between pre-decb104111 and post-6de97906 tserver versions if they are running at the same time during upgrade.

With the post-6de97906 logic, when the isolation level is not `SERIALIZABLE_ISOLATION`, `WriteQuery::DoCompleteExecute` can add a WRITE_OP that contains both `write_pairs` and `read_pairs` to the Raft log. The corresponding `read_pair` will have its key set to the encoded row key and its value set to `KeyEntryTypeAsChar::kNullLow`.

This WRITE_OP is then processed by `TransactionalWriter::Apply` and corresponding intents are generated and written to the intents DB.

In the post-6de97906 version, `TransactionalWriter::Apply` processes such a `read_pair` and as a result generates an intent `<row> -> kNullLow` of type `[kStrongRead]`.
But in the pre-decb104111 version, `TransactionalWriter::Apply` processes such a `read_pair` differently and generates an intent `<row> -> kNullLow` of type `[kStrongRead, kStrongWrite]`.

Then `ApplyIntentsContext::Entry` processes the intents; in the pre-decb104111 version, due to the presence of the `kStrongWrite` type, this intent gets written into the regular DB and results in the record
`<row> -> kNullLow`, which is an incorrect regular DB record. The effect of DocDB handling this record differs depending on the WHERE filter and the presence of aggregation; it may result in affected rows either not being visible to the user statement, or being visible but with non-PK columns set to null.

Given that `TransactionalWriter::Apply` only writes `read_pairs` for non-SERIALIZABLE_ISOLATION when new logic is enabled by `ysql_skip_row_lock_for_update`, the solution is to convert it to an auto-flag, so new logic is only enabled after all nodes are on the new version.

Also added `PgsqlWriteOperation::use_row_lock_for_update_`, which is initialized in the constructor to avoid changing behaviour within the same `PgsqlWriteOperation` instance (since the flag is now runtime and the auto-flag will change at runtime from false to true).
Jira: DB-10979

Test Plan: Run TPCC workload in parallel with an upgrade from 2.18.7.0-b38 to a build with this fix incorporated.

Reviewers: pjain, smishra, hsunder, dmitry

Reviewed By: pjain, hsunder, dmitry

Subscribers: yql, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34733
…idation

Summary: Added an Ignore and Save option for Azure, K8s, and GCP provider validation.

Test Plan: Tested manually

Reviewers: kkannan, jmak

Reviewed By: kkannan

Subscribers: ui, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34819
Summary: Removed console log

Test Plan: Tested manually

Reviewers: kkannan

Reviewed By: kkannan

Subscribers: ui, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34864
Summary:
This revision introduces support for the 'test_decoding' output plugin with the PG compatible logical replication support.

Key points:
1. The plugin does not take a publication list in its parameters, which differs from the pgoutput plugin. Here, we assume the client is interested in all the tables of the database. A future feature can optionally accept the publication list as well.

2. The plugin does not send the relation object. So the schema refresh callback is a NOOP.
Jira: DB-11193

Test Plan:
Jenkins: test regex: .*ReplicationSlot.*

./yb_build.sh --java-test 'org.yb.pgsql.TestPgReplicationSlot#testWithTestDecodingPlugin'

---
Testing with pg_recvlogical
1. Start the cluster
2. Create the table

```
CREATE TABLE dummy_table (
    id SERIAL PRIMARY KEY,
    col_integer INT,
    col_bigint BIGINT,
    col_decimal DECIMAL(10,2),
    col_text TEXT,
    col_boolean BOOLEAN,
    col_timestamp TIMESTAMP,
    col_date DATE,
    col_time TIME,
    col_json JSONB,
    col_array INT[]
);
```

3. Create the slot

```
build/latest/postgres/bin/pg_recvlogical -d yugabyte --slot=test --create-slot
```

4. Insert some data in the table
5. Receive data

```
build/latest/postgres/bin/pg_recvlogical -d yugabyte --slot=test --start -f -
```

Output
```
BEGIN 2
table public.dummy_table: INSERT: id[integer]:1 col_integer[integer]:1 col_bigint[bigint]:10 col_decimal[numeric]:100.5 col_text[text]:'Dummy Text 1' col_boolean[boolean]:false col_timestamp[timestamp without time zone]:'2024-05-08 14:33:48.52087' col_date[date]:'2024-05-07' col_time[time without time zone]:'13:34:48.52087' col_json[jsonb]:'{"key": "value", "key2": "value2"}' col_array[integer[]]:'{1,2,3,4,5,6,7,8,9,10}'
COMMIT 2
BEGIN 3
table public.dummy_table: INSERT: id[integer]:2 col_integer[integer]:2 col_bigint[bigint]:20 col_decimal[numeric]:201 col_text[text]:'Dummy Text 2' col_boolean[boolean]:true col_timestamp[timestamp without time zone]:'2024-05-08 14:32:49.717402' col_date[date]:'2024-05-06' col_time[time without time zone]:'12:34:49.717402' col_json[jsonb]:'{"key": "value", "key2": "value2"}' col_array[integer[]]:'{2,4,6,8,10,12,14,16,18,20}'
COMMIT 3
.....
```

Reviewers: asrinivasan

Reviewed By: asrinivasan

Subscribers: ycdcxcluster, yql

Differential Revision: https://phorge.dev.yugabyte.com/D34796
Test Plan: Manually verified that provider validation works after the change

Reviewers: svarshney, dkumar

Reviewed By: dkumar

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34853
…dex filters on bound columns

Summary:
In `yb_get_batched_index_paths` it was wrongly assumed that any batchable clause with its LHS belonging to an indexed column in the restriction list must be bound to the index as an index condition. However, such batchable clauses might have the LHS as `indexed_col::bpchar`, where `indexed_col` is an indexed text column. In such cases, the LHS doesn't exactly match an index column, leaving the whole clause to be treated as a qpqual/index filter.

Because of this assumption, `yb_get_batched_index_paths` misses checking whether such unbound batchable clauses are present. There is already a check to prevent batching in the presence of a qual whose LHS column is not bound to any batched clause (i.e., not present in `batched_inner_attnos`). This diff extends that logic to also prevent batching in the other case, where such a qual has an LHS column that is already bound to a batched clause.

One can technically also achieve the desired effect by maintaining a list of bound index conditions and checking amongst the quals to see which one is bound but this would be a bit more computationally intensive.
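
A hedged sketch of the problematic shape (tables and columns hypothetical): the cast keeps the clause's LHS from exactly matching the indexed column.

```
CREATE TABLE t1 (a TEXT PRIMARY KEY);  -- a is the indexed text column
CREATE TABLE t2 (b CHAR(10));
-- The batchable join clause has LHS t1.a::bpchar rather than t1.a itself, so it
-- is treated as a qpqual/index filter, not a bound index condition.
SELECT * FROM t2 JOIN t1 ON t1.a::bpchar = t2.b;
```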

Needs backports to: 2024.1, 2.20, 2.18
Jira: DB-10870

Test Plan: ./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressJoin'

Reviewers: mtakahara

Reviewed By: mtakahara

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D34809
* test change

* added warning

* changed for 2.21 as well
Summary: [PLAT-13873] Adding exception handling to the support bundle creation CRD

Test Plan:
itest.
Caused an exception, verified YBA did not crash.

```
Reconciler no end date given, setting to current time
dev:yba logs  yugaware: 2024-05-08T21:09:01.291Z  [error]  SupportBundleReconciler.java:81 [-1887041250-pool-17-thread-25] com.yugabyte.yw.common.operator.SupportBundleReconciler Failed to add support bundle
dev:yba logs  yugaware: java.lang.NullPointerException: Cannot invoke "java.util.List.stream()" because the return value of "io.yugabyte.operator.v1alpha1.SupportBundleSpec.getComponents()" is null
dev:yba logs  yugaware:         at com.yugabyte.yw.common.operator.SupportBundleReconciler.onAddInternal(SupportBundleReconciler.java:124)
dev:yba logs  yugaware:         at com.yugabyte.yw.common.operator.SupportBundleReconciler.onAdd(SupportBundleReconciler.java:79)
dev:yba logs  yugaware:         at com.yugabyte.yw.common.operator.SupportBundleReconciler.onAdd(SupportBundleReconciler.java:31)
dev:yba logs  yugaware:         at io.fabric8.kubernetes.client.informers.impl.cache.ProcessorListener$AddNotification.handle(ProcessorListener.java:103)
dev:yba logs  yugaware:         at io.fabric8.kubernetes.client.informers.impl.cache.ProcessorListener.add(ProcessorListener.java:50)
dev:yba logs  yugaware:         at io.fabric8.kubernetes.client.informers.impl.cache.SharedProcessor.lambda$distribute$0(SharedProcessor.java:91)
dev:yba logs  yugaware:         at io.fabric8.kubernetes.client.informers.impl.cache.SharedProcessor.lambda$distribute$1(SharedProcessor.java:114)
dev:yba logs  yugaware:         at io.fabric8.kubernetes.client.utils.internal.SerialExecutor.lambda$execute$0(SerialExecutor.java:58)
dev:yba logs  yugaware:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
dev:yba logs  yugaware:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
dev:yba logs  yugaware:         at java.base/java.lang.Thread.run(Thread.java:833)
dev:yba logs  yugaware: 2024-05-08T21:09:01.321Z  [warn]  SupportBundleReconciler.java:164 [-1887041250-pool-17-thread-24] com.yugabyte.yw.common.operator.SupportBundleReconciler updating support bundle is not supported

```

Reviewers: dshubin, vkumar

Reviewed By: dshubin, vkumar

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34841
…ions that have arm architecture

Summary:
The List DB Versions API call fails to return DB versions that have arm architecture. Today, selecting the CPU architecture based on the provider is enabled only for AWS.

With the new release workflow, two issues happen:
1. When we select a specific CPU architecture, the DB version (release API) does not filter the versions based on it.
2. When we select AWS, where OS patching is enabled, the Arch value defaults to null even though the pointer points to the x86 architecture (an already existing issue I found today).

So I ensured that x86_64 is passed as the default arch type when the AWS provider is selected and OS patching is enabled.

Test Plan:
Please refer to the video
{F176414}

Reviewers: jmak, lsangappa

Reviewed By: jmak

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34836
…sion tests

Summary:
For colocation regression tests, query plan costs are irrelevant, so remove costs from the test output to benefit the PG15 merge.
Jira: DB-11225

Test Plan:
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressLegacyColocation#testPgRegressLegacyColocation'
./yb_build.sh --java-test 'org.yb.pgsql.TestPgRegressColocation#testPgRegressColocation'

Reviewers: jason

Reviewed By: jason

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D34884
Summary:
The UI always assumed that the backend sends the timestamp in UTC. However, the backend gets the timestamp from the underlying OS. If the underlying OS uses a timezone other than UTC, the backend sends the timestamp as is, and moment fails to parse it.

Fix:

This fix specifies the exact input format (ddd MMM DD HH:mm:ss z YYYY) to moment so that it can parse the timestamp.

Test Plan: Tested on both UTC and non-UTC timestamps

Reviewers: lsangappa, jmak, rmadhavan

Reviewed By: lsangappa, jmak, rmadhavan

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34471
Summary: The Current Lag column was not sorting properly because we did not have a currentLag field in the HeaderColumn.

Test Plan:
Tested manually

**Screenshots**

{F176925}

{F176926}

Reviewers: kkannan

Reviewed By: kkannan

Subscribers: ui, yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34895
…the catalog manager.

Summary:
This diff introduces a new class, MasterClusterHandler, to contain some RPC endpoint methods formerly in the CatalogManager.
Jira: DB-8520

Test Plan: existing tests

Reviewers: asrivastava

Reviewed By: asrivastava

Subscribers: ybase, slingam

Differential Revision: https://phorge.dev.yugabyte.com/D34749
…from Primary Cluster in Overview Page

Summary: Make the universe name a link after clicking the Details link from the Primary Cluster in the Overview page.

Test Plan:
Please refer to screenshot
Focusing on the header text makes it orange and clickable, consistent with other places that already do the same.
{F175713}

Reviewers: jmak

Reviewed By: jmak

Subscribers: yugaware

Differential Revision: https://phorge.dev.yugabyte.com/D34742
Labels: area/documentation
Projects: Documentation (In progress)