Hello guys,
I ran into a really bad problem fetching all the data from millions of hash keys, each hash holding around 9 key/value pairs.
Yes, Redis is very fast for simple SET/GET with msgpack+gzip compression. But what if we only want hashes (HSET, HGET, HGETALL, ...)?
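For context, each record here is one Redis hash of about 9 fields. A hypothetical shape (the field names are invented; assumes a redis-py client named client, set up below):

# Hypothetical record: one Redis hash with ~9 field/value pairs
client.hset("hash::key1", mapping={"id": "1", "name": "item-1", "price": "9.99"})  # ... up to ~9 fields
client.hgetall("hash::key1")   # -> {b'id': b'1', b'name': b'item-1', b'price': b'9.99'}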
From my tests, fetching 1 million rows (9 key/value pairs per hash) on an i9-9900K takes about 44 seconds.
What should I do?
It's crazy but simple: we make a group of replicas from 1 master. The more replicas, the more speed we can gain.
Let's build a group of Redis servers with 1 master and ... 20 replicas :P
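Each replica only needs to point at the master. A minimal redis.conf sketch for one replica (the hostname redismaster is an example):

# redis.conf on every replica -- redismaster is an example hostname
replicaof redismaster 6379
replica-read-only yes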
import concurrent.futures
import redis

MAX_WORKERS = 16
client = redis.Redis(host="haproxy-host", port=6379)  # example host: the HAProxy front end described below
My_Data = list()

In Python, first I need a chunking function to split the list of hash keys:

def build_chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]
For example, zrange_key is a list of 1,000,000 hash keys:
zrange_key = ["hash::key1", "hash::key2", ... "hash::key1000000"]
partitions = list(build_chunks(zrange_key, 10000))
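Quick sanity check on the partitioning:

print(len(partitions))     # 100 chunks
print(len(partitions[0]))  # 10000 keys in each chunk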
Why split 1,000,000 keys into chunks of 10,000 each? Because I will make 100 parallel calls against that group of 20 Redis replicas:
def hgetall_data(chunk):
    # One pipeline per chunk: all 10,000 HGETALLs go to the server
    # in a single round trip.
    chunk_pipeline = client.pipeline()
    chunk_pipeline.multi()
    ### Loop 10,000 keys per chunk ###
    for key in chunk:
        chunk_pipeline.hgetall(key)
    # execute() returns the replies in the order the commands were
    # queued, so this is the list of 10,000 hash dicts.
    return chunk_pipeline.execute()
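Each call fetches one whole chunk in a single round trip. A quick usage check:

sample = hgetall_data(partitions[0])
print(len(sample))   # 10000 -- one dict per hash key in the chunk
print(sample[0])     # e.g. {b'field1': b'value1', ...} (redis-py returns bytes by default)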
Why? Each HGETALL pipeline is one long call that pulls every field of 10,000 hashes. By spreading the chunks over 20 replicas, those long calls run at the same time, so the total response time drops.
### Use ThreadPoolExecutor, MAX_WORKERS = 16 ###
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    ### partitions = [list of 10,000 keys, list of 10,000 keys, ... list of 10,000 keys] ###
    ### chunk = one list of 10,000 keys ###
    result_list = [executor.submit(hgetall_data, chunk) for chunk in partitions]
    for c_result_list in result_list:
        My_Data.extend(c_result_list.result())
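If you'd rather consume each chunk as soon as it finishes instead of in submit order, concurrent.futures.as_completed is a drop-in variant (note that My_Data then fills in completion order, not submit order):

with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = [executor.submit(hgetall_data, chunk) for chunk in partitions]
    for future in concurrent.futures.as_completed(futures):
        My_Data.extend(future.result())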
Alright, this is a pretty good ThreadPoolExecutor setup for running HGETALL against 20 Redis replicas at the same time. I wish someone would teach me better code; this is all I've got 👍 from my experience of ... STACK OVERFLOWWWW ...
Now we put HAProxy in front of those 20 replicas with round-robin balancing. Our Python client connects to the HAProxy server, so we no longer have to care about round-robin or connection handling ourselves.
My HAProxy config (the role:slave health check keeps the master out of the read rotation):
frontend ft_redis
    mode tcp
    bind *:6379
    default_backend bk_redis

backend bk_redis
    mode tcp
    balance roundrobin
    option tcp-check
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:slave
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server redisreplica1 redisreplica1:6379 maxconn 20000 check inter 1s
    server redisreplica2 redisreplica2:6379 maxconn 20000 check inter 1s
    server redisreplica3 redisreplica3:6379 maxconn 20000 check inter 1s
    server redisreplica4 redisreplica4:6379 maxconn 20000 check inter 1s
    server redisreplica5 redisreplica5:6379 maxconn 20000 check inter 1s
    server redisreplica6 redisreplica6:6379 maxconn 20000 check inter 1s
    server redisreplica7 redisreplica7:6379 maxconn 20000 check inter 1s
    server redisreplica8 redisreplica8:6379 maxconn 20000 check inter 1s
    server redisreplica9 redisreplica9:6379 maxconn 20000 check inter 1s
    server redisreplica10 redisreplica10:6379 maxconn 20000 check inter 1s
    server redisreplica11 redisreplica11:6379 maxconn 20000 check inter 1s
    server redisreplica12 redisreplica12:6379 maxconn 20000 check inter 1s
    server redisreplica13 redisreplica13:6379 maxconn 20000 check inter 1s
    server redisreplica14 redisreplica14:6379 maxconn 20000 check inter 1s
    server redisreplica15 redisreplica15:6379 maxconn 20000 check inter 1s
    server redisreplica16 redisreplica16:6379 maxconn 20000 check inter 1s
    server redisreplica17 redisreplica17:6379 maxconn 20000 check inter 1s
    server redisreplica18 redisreplica18:6379 maxconn 20000 check inter 1s
    server redisreplica19 redisreplica19:6379 maxconn 20000 check inter 1s
    server redisreplica20 redisreplica20:6379 maxconn 20000 check inter 1s
And the results:

1 master + 3 replicas:
query speed: 44.63 seconds
1 master + 20 replicas:
query speed: 4.87 seconds
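The query speeds are wall-clock times; a minimal sketch of how to reproduce the measurement, wrapping the executor block from above:

import time

t0 = time.perf_counter()
# ... run the ThreadPoolExecutor block from above ...
print(f"query speed: {time.perf_counter() - t0:.2f} seconds")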