
Operations timeout while inserting data into ScyllaDB cluster at very low throughput #18632

Closed
amitesh88 opened this issue May 12, 2024 · 11 comments

@amitesh88

I have a 3-node ScyllaDB cluster:
32 CPUs, 64 GB RAM, Scylla version 5.4.3
io_properties.yaml:
read_iops: 36764
read_bandwidth: 769690880
write_iops: 42064
write_bandwidth: 767818944
When the application increased write operations from 1,200 to 10,000 TPS, which is far below the claimed write_iops, it started getting the error below:
Error inserting Data : Operation timed out for xxx_xxx.xxx_xxxxx_240512 - received only 1 responses from 2 CL=QUORUM.
The only notable log line on the ScyllaDB nodes is:
[shard 8:comp] large_data - Writing large partition xxx_xxx.xxx_xxxxx_240512: xxx (37041816 bytes) to me-3gg2_13mq_3jyhc2r2wxx7hvxxw4-big-Data.db
CPU utilisation on each node is barely 15%, yet the application fails to write.
Note: the RF of system_auth and the other keyspaces is already equal to the number of nodes.
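
For reference, the partitions Scylla has flagged as large can also be listed directly on a node; the query below is only a sketch, assuming the system.large_partitions table available in this Scylla version (<node_ip> is a placeholder):

cqlsh <node_ip> -e "SELECT * FROM system.large_partitions LIMIT 10;"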

Need insights on this
Thanks in advance

@mykaul
Contributor

mykaul commented May 13, 2024

Hi @amitesh88 - you can't compare the io_properties IOPS to CQL ops in any way - ScyllaDB performs a lot more 'raw' IOPS per CQL transaction, for example for the commit log or for compaction.
However, I do encourage you to test the disk with fio - it may be that iotune is configuring far fewer IOPS than the disks can sustain, and you may be able to raise the numbers somewhat. Unsure if that will solve your issue, but it's worth a try.
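
As a starting point, a minimal fio write test might look like the sketch below (file path, block size and queue depth are assumptions to adjust for your setup; delete the scratch file afterwards):

fio --name=disk_write_test --filename=/var/lib/scylla/fio_test_file --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=64 --numjobs=8 --size=1G --runtime=60 --time_based --group_reporting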

@amitesh88
Author

[image: GCP SSD disk I/O throughput figures]

We are using SSD disks on GCP VMs, which have good I/O throughput - see the image above.

Could writing large partitions be the issue, i.e. the partition key not distributing the load properly?

@mykaul
Contributor

mykaul commented May 14, 2024

@amitesh88 - as you can see, the numbers quoted above and the iotune results are vastly different. I'd also compare with fio. If fio is substantially better, I'd change the numbers manually to higher values and try again. See scylladb/seastar#1297 for reference.
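
A rough sketch of that manual override (the path assumes the default package layout and the numbers are placeholders, to be validated with fio first):

# edit the generated properties, then restart Scylla; the values below are illustrative only
sudo vi /etc/scylla.d/io_properties.yaml
#   disks:
#     - mountpoint: /var/lib/scylla
#       write_iops: 60000            # example: raised after confirming with fio
#       write_bandwidth: 900000000   # example
sudo systemctl restart scylla-server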

@amitesh88
Author

Using fio, I am getting the result below:
scylla_io_2: (groupid=0, jobs=16): err= 0: pid=16100: Wed May 15 16:28:36 2024
write: IOPS=37.8k, BW=185MiB/s (193MB/s)(10.8GiB/60008msec); 0 zone resets

@mykaul
Contributor

mykaul commented May 15, 2024

Using FIO , I am getting below result scylla_io_2: (groupid=0, jobs=16): err= 0: pid=16100: Wed May 15 16:28:36 2024 write: IOPS=37.8k, BW=185MiB/s (193MB/s)(10.8GiB/60008msec); 0 zone resets

That's a bit low - I expected more. Can you share the full fio command line and results?

@amitesh88
Author

Below is the command with output:

fio --filename=/var/lib/scylla/a --direct=1 --rw=randrw --refill_buffers --size=1G --norandommap --randrepeat=0 --ioengine=libaio --bs=5kb --rwmixread=0 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=scylla_io_2
scylla_io_2: (g=0): rw=randrw, bs=(R) 5120B-5120B, (W) 5120B-5120B, (T) 5120B-5120B, ioengine=libaio, iodepth=16
...
fio-3.16
Starting 16 processes
scylla_io_2: Laying out IO file (1 file / 1024MiB)
Jobs: 16 (f=16): [w(16)][100.0%][w=179MiB/s][w=36.8k IOPS][eta 00m:00s]
scylla_io_2: (groupid=0, jobs=16): err= 0: pid=16100: Wed May 15 16:28:36 2024
write: IOPS=37.8k, BW=185MiB/s (193MB/s)(10.8GiB/60008msec); 0 zone resets
slat (usec): min=3, max=1636, avg=10.87, stdev=13.92
clat (usec): min=391, max=26669, avg=6759.47, stdev=1146.82
lat (usec): min=549, max=26699, avg=6770.58, stdev=1146.98
clat percentiles (usec):
| 1.00th=[ 2245], 5.00th=[ 4621], 10.00th=[ 5932], 20.00th=[ 6390],
| 30.00th=[ 6587], 40.00th=[ 6783], 50.00th=[ 6915], 60.00th=[ 7046],
| 70.00th=[ 7177], 80.00th=[ 7373], 90.00th=[ 7635], 95.00th=[ 7963],
| 99.00th=[ 9110], 99.50th=[10421], 99.90th=[13566], 99.95th=[15008],
| 99.99th=[20579]
bw ( KiB/s): min=179810, max=321463, per=99.99%, avg=188943.08, stdev=1561.74, samples=1920
iops : min=35962, max=64289, avg=37788.25, stdev=312.34, samples=1920
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.02%
lat (msec) : 2=0.62%, 4=3.40%, 10=95.36%, 20=0.58%, 50=0.01%
cpu : usr=1.39%, sys=3.04%, ctx=1606741, majf=0, minf=194
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2267856,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
WRITE: bw=185MiB/s (193MB/s), 185MiB/s-185MiB/s (193MB/s-193MB/s), io=10.8GiB (11.6GB), run=60008-60008msec

Disk stats (read/write):
sdb: ios=0/2266088, merge=0/0, ticks=0/15152759, in_queue=15152760, util=99.87%

@mykaul
Contributor

mykaul commented May 15, 2024

Very strange. This is what I'm getting on my laptop:
Run status group 0 (all jobs):
WRITE: bw=3684MiB/s (3863MB/s), 3684MiB/s-3684MiB/s (3863MB/s-3863MB/s), io=16.0GiB (17.2GB), run=4447-4447msec

And of course, if I switch to 4KB bs, it's slightly better.
Run status group 0 (all jobs):
WRITE: bw=4011MiB/s (4206MB/s), 4011MiB/s-4011MiB/s (4206MB/s-4206MB/s), io=16.0GiB (17.2GB), run=4085-4085msec

@avikivity
Member

Please check the advanced dashboard in per-shard view mode to see if some shard is the bottleneck.
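
If the dashboards are not handy, per-shard counters can also be pulled straight from a node's Prometheus endpoint (9180 is the default metrics port; the grep pattern below is only an example, and <node_ip> is a placeholder):

# every metric carries a shard label, so an overloaded shard stands out here
curl -s http://<node_ip>:9180/metrics | grep -E 'timeout|queue' | head -n 20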

@amitesh88
Author

Thanks a lot.
Can we check this on open source Scylla?

@mykaul
Contributor

mykaul commented May 15, 2024

Thanks a lot. Can we check this on open source Scylla?

Yes, you can use the monitor with open source Scylla.
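
A rough outline of bringing up the monitoring stack against an existing cluster (script flags can differ between releases, so treat this as a sketch):

git clone https://github.com/scylladb/scylla-monitoring.git
cd scylla-monitoring
# list the cluster nodes in prometheus/scylla_servers.yml, then start the stack for the matching Scylla version
./start-all.sh -v 5.4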

@amitesh88
Author

I found the issue. It was due to the partition key, which did not distribute the data evenly across the nodes - that is why we were seeing
large_data - Writing large partition
We have changed the partition key to a uuid and the data is now evenly distributed across both DC1 and DC2.
Thanks for your time.
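
For anyone hitting the same symptom, the shape of the change was roughly the following (keyspace, table and column names are made up purely for illustration):

cqlsh -e "
CREATE TABLE demo_ks.events_v2 (
    id uuid,                       -- high-cardinality partition key spreads writes across nodes and shards
    event_time timestamp,
    payload text,
    PRIMARY KEY (id, event_time)
);"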
