
is_bucket_to_bucket backup for s3.sink only #5054

Open · wants to merge 20 commits into master
Conversation

@itsfarbod (Contributor) commented Nov 27, 2023

What problem are we solving?

There was a problem with filer.backup using the s3.sink: it backed up all the buckets into one bucket under one path.
So I added an option called is_bucket_to_bucket to solve this problem.

How are we solving the problem?

I added an option called is_bucket_to_bucket. When it is enabled, the sourcePath is changed to /buckets, and the excludedPaths must also start with /buckets.
Note that this solution only backs up the contents of /buckets, not other paths.
When you enable is_bucket_to_bucket, it replaces sourcePath with /buckets and then validates the excludedPaths. It then sets the directory to an empty string so that key building works correctly, and sets the bucket name in the Initializer to an empty string, which is ignored and replaced with the specific bucket name.
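
A minimal sketch of the option handling described above; BackupOptions, applyBucketToBucket, and the log messages are illustrative names of my own, not the actual PR code:

    package main

    import (
        "log"
        "strings"
    )

    type BackupOptions struct {
        IsBucketToBucket bool
        SourcePath       string
        ExcludedPaths    []string
    }

    // applyBucketToBucket forces the source path to /buckets and drops
    // excluded paths that could never match under it.
    func applyBucketToBucket(opts *BackupOptions) {
        if !opts.IsBucketToBucket {
            return
        }
        if opts.SourcePath != "/buckets" {
            log.Printf("is_bucket_to_bucket is enabled: overriding source path %q with /buckets", opts.SourcePath)
            opts.SourcePath = "/buckets"
        }
        kept := opts.ExcludedPaths[:0]
        for _, p := range opts.ExcludedPaths {
            if strings.HasPrefix(p, "/buckets") {
                kept = append(kept, p)
            } else {
                log.Printf("excluded path %q does not start with /buckets and will have no effect", p)
            }
        }
        opts.ExcludedPaths = kept
    }

    func main() {
        opts := &BackupOptions{
            IsBucketToBucket: true,
            SourcePath:       "/home/data",
            ExcludedPaths:    []string{"/buckets/tmp", "/home/cache"},
        }
        applyBucketToBucket(opts)
        log.Printf("source=%s excluded=%v", opts.SourcePath, opts.ExcludedPaths)
    }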

How is the PR tested?

I tested it with 100k PUT and DELETE operations using warp.
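
For reference, a hypothetical warp invocation along these lines; the endpoint, credentials, object size, and duration are placeholders, not the exact commands used:

    warp put --host=<s3_endpoint> --access-key=<access_key> --secret-key=<secret_key> --obj.size=1KiB --duration=5m
    warp delete --host=<s3_endpoint> --access-key=<access_key> --secret-key=<secret_key>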

Checks

  • I will add the related wiki documentation changes and link them to this PR after merging.

itsfarbod and others added 12 commits November 26, 2023 13:54
	modified:   weed/replication/sink/azuresink/azure_sink.go
	modified:   weed/replication/sink/b2sink/b2_sink.go
	modified:   weed/replication/sink/filersink/filer_sink.go
	modified:   weed/replication/sink/gcssink/gcs_sink.go
	modified:   weed/replication/sink/localsink/local_sink.go
	modified:   weed/replication/sink/replication_sink.go
	modified:   weed/replication/sink/s3sink/s3_sink.go

    Added is_bucket_to_bucket option for replication.toml sinks
    Changed Initializer of s3.sink to be compatible with the is_bucket_to_bucket option in replication.toml
    Added bucket creation and deletion functions in s3.sink
    Logic of bucket handling when is_bucket_to_bucket is enabled in s3.sink
    Modified filer_backup to set the source path to /buckets when is_bucket_to_bucket is enabled
    The logic behind this is that it should only back up buckets, not other paths
Added a log message to show the default values when is_bucket_to_bucket is true
is_bucket_to_bucket option
is_bucket_to_bucket only for s3.sink
@chrislusf (Collaborator)

Please explain more about what the problem is.

@itsfarbod (Contributor Author)

Hi,
It's a new feature of s3.sink: when it backs up the S3 buckets in SeaweedFS, each bucket gets a corresponding bucket in the destination backup S3. When is_bucket_to_bucket is enabled, the configured sourcePath is ignored and changed to /buckets so that only the buckets are backed up. I did not find a solution for backing up other paths into separate buckets in the backup S3, so I decided to back up only the buckets of the SeaweedFS S3, putting each bucket and its data into a corresponding destination bucket. I have tested it and it works fine: it recognizes the creation and deletion of buckets and writes each object into its actual bucket in the destination. In conclusion, we get a full S3-level sync between the SeaweedFS S3 and a backup S3, which can be MinIO etc.
The behavior of the code when is_bucket_to_bucket is enabled:
First, it checks the source path, changes it to /buckets, and logs this change for the user.
Second, it checks the excludedPaths and warns if some of them do not start with /buckets, logging that they will not affect the backup because the sourcePath is /buckets.
Third, it sets the initial bucket and directory of s3.sink to empty strings. The empty directory helps to properly build the keys for writes, and the empty bucket is replaced with the corresponding bucket on every change.
When the code needs to write into a bucket, it checks whether the bucket exists and creates it if it does not; on deletion, it checks whether the bucket exists and deletes it if it does (see the sketch below).
It also ignores the creation and deletion of paths (I mean the prefixes of files) inside buckets, because they do not need to be handled separately: they take effect through the writes and deletes of the keys themselves.
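
A minimal sketch of that create-if-missing / delete-if-exists behavior, assuming the AWS SDK for Go that the s3 sink already depends on; ensureBucket and deleteBucketIfExists are hypothetical helper names, not the PR's actual functions:

    package main

    import (
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/awserr"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    // ensureBucket creates the destination bucket if it does not exist yet.
    func ensureBucket(svc *s3.S3, name string) error {
        if _, err := svc.HeadBucket(&s3.HeadBucketInput{Bucket: aws.String(name)}); err == nil {
            return nil // bucket already exists
        }
        _, err := svc.CreateBucket(&s3.CreateBucketInput{Bucket: aws.String(name)})
        if aerr, ok := err.(awserr.Error); ok && aerr.Code() == s3.ErrCodeBucketAlreadyOwnedByYou {
            return nil // lost a race with another creator; treat as success
        }
        return err
    }

    // deleteBucketIfExists removes the destination bucket, ignoring the
    // case where it is already gone.
    func deleteBucketIfExists(svc *s3.S3, name string) error {
        _, err := svc.DeleteBucket(&s3.DeleteBucketInput{Bucket: aws.String(name)})
        if aerr, ok := err.(awserr.Error); ok && aerr.Code() == s3.ErrCodeNoSuchBucket {
            return nil
        }
        return err
    }

    func main() {
        sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
        svc := s3.New(sess)
        if err := ensureBucket(svc, "example-bucket"); err != nil {
            log.Fatal(err)
        }
    }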
All steps of the process log their errors and results for the user, so everything is visible.
Feel free to review the code and send me any feature requests.
Thanks,

@chrislusf (Collaborator)

  • It's a change only to the s3 sink. Can you make the change specific to S3, with no changes to the code of the other sinks?
  • The attribute name is_bucket_to_bucket is not descriptive.

@itsfarbod (Contributor Author) commented Nov 28, 2023

  • I have to mention that adding is_bucket_to_bucket to the sink struct makes the implementation much easier; for sinks other than S3, I will log a warning if the option is enabled, saying that it will be ignored.
  • OK, what do you suggest? What about s3_sync?

@itsfarbod (Contributor Author) commented Nov 29, 2023

Any suggestions?
And what about the option name? Is s3_sync OK?

itsfarbod added 3 commits November 29, 2023 14:03
    sourcePath replacement and excludedPaths checks only for the `s3` sink

    It does not affect other sinks; if the option is enabled there, it logs a warning and is ignored
    Changed is_bucket_to_bucket to sync_s3_to_s3
    Now only sink.s3 has the sync_s3_to_s3 option
    Other sinks are not affected by sync_s3_to_s3
@itsfarbod (Contributor Author)

As you requested, I changed the code:
The option name changed from is_bucket_to_bucket to sync_s3_to_s3.
I reverted ReplicationSink to its initial state.
Now only sink.s3 has the option, without changing the code of the other sinks.
Please review.
Thanks,

@chrislusf (Collaborator)

How about this option?

prefix_to_trim=["/buckets","/home"]

@chrislusf (Collaborator)

Or add a pair of prefix_from and prefix_to, for more generic operations.
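
For illustration, a minimal sketch of what such a generic mapping could look like; mapKey and its parameter names are hypothetical, not code from this PR:

    package main

    import (
        "fmt"
        "strings"
    )

    // mapKey rewrites a filer path into a destination key by replacing
    // prefixFrom with prefixTo. With prefixTo == "" it behaves like one
    // entry of the prefix_to_trim list above.
    func mapKey(path, prefixFrom, prefixTo string) (string, bool) {
        if !strings.HasPrefix(path, prefixFrom) {
            return "", false // path is outside the mapped subtree
        }
        return prefixTo + strings.TrimPrefix(path, prefixFrom), true
    }

    func main() {
        if key, ok := mapKey("/buckets/photos/cat.jpg", "/buckets", ""); ok {
            fmt.Println(key) // prints /photos/cat.jpg
        }
    }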

@itsfarbod (Contributor Author)

I didn't understand your request.
Because it's a sync between S3s, you shouldn't be able to change the source and destination paths. It's a full sync of the S3. For other filters, the user can use the old logic you implemented by setting sync_s3_to_s3 to false.
When you enable sync_s3_to_s3, it backs up all buckets into their corresponding buckets in the destination. So a prefix can't be handled: the logic only supports /buckets, and setting a prefix for the destination would change the names of the buckets there. When you set a prefix for source or destination other than the defaults of sync_s3_to_s3, there is no straightforward or standard way to map those paths into separate buckets.
I have to mention that this feature is for fully syncing the S3 interface, so that the system stays accessible from the backup S3 when SeaweedFS is down. I am also investigating active-passive and active-active models for these situations.
So if a user only wants a passive backup of S3 that is accessible when the main S3 is down, this feature is good for them. It allows the user to read the data when something is wrong with SeaweedFS's S3.
So with this use case, the user gets a passive backup of the S3, not of other paths in the filer.

@chrislusf (Collaborator)

How about naming this feature/option "support_multiple_buckets"?

@itsfarbod (Contributor Author)

Actually it will not support multiple buckets in all scenarios; I think "syncing S3" conveys the meaning better. If we name it support_multiple_buckets, the user will think they can set custom buckets for the backup, which is not the case. It is multiple buckets, but only for S3, and the multiple buckets here are the buckets in the SeaweedFS S3.
Also remember that this is an S3-only feature for syncing the buckets into another S3.

@itsfarbod (Contributor Author)

Any ideas? Or suggestions?

@chrislusf (Collaborator)

The variable name is still confusing and I couldn't tell what it does.

@itsfarbod (Contributor Author)

Is your suggestion support_multiple_buckets OK? Do you approve?

@chrislusf (Collaborator)

That also sounds like a misunderstanding, as you mentioned. So it's still very confusing.

@amir-latifi

Hi. I followed this thread. As this is a valuable feature in my opinion, I have some suggestions and alternative solutions; I hope they are helpful:

  1. Use sink_to_multiple_buckets as the variable name.
  2. Implement weed s3.backup instead of weed filer.backup. It is clearer, and it would be easy if the core logic of filer.backup could be reused in s3.backup.
  3. Add another section to replication.toml named [sink.s3_buckets], specifically for syncing the SeaweedFS S3 (the /buckets directory in the filer) to another S3 storage, e.g. another SeaweedFS S3 or MinIO.

@itsfarbod (Contributor Author)

That also sounds like a misunderstanding, as you mentioned. So it's still very confusing.

If you take a look at my changes to replication.toml, I wrote a verbose description of this feature, so it is fairly clear to the user what the feature does:

# It will back up all the buckets under the /buckets path to the specified S3.
# Each bucket will have its corresponding bucket in the backup S3.
# Source path will be ignored and automatically set to /buckets, and is_incremental is always 
# Exclude paths on the filer.backup command will only take effect if they start with /buckets
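
For reference, the resulting [sink.s3] section might look roughly like this; apart from sync_s3_to_s3, the keys are assumed from the stock replication.toml and may differ by version:

    [sink.s3]
    enabled = true
    aws_access_key_id = ""     # if empty, loads from the shared credentials file
    aws_secret_access_key = ""
    region = "us-east-2"
    endpoint = ""
    # bucket and directory are ignored when sync_s3_to_s3 is enabled:
    # each source bucket maps to a same-named destination bucket
    bucket = ""
    directory = ""
    is_incremental = false
    sync_s3_to_s3 = true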

Also these suggestions are helpful:

Hi. I followed this thread. As this is a valuable feature in my opinion, I have some suggestions and alternative solutions; I hope they are helpful:

1. Use `sink_to_multiple_buckets` as the variable name.

2. Implement `weed s3.backup` instead of `weed filer.backup`. It is clearer, and it would be easy if the core logic of filer.backup could be reused in s3.backup.

3. Add another section to replication.toml named `[sink.s3_buckets]`, specifically for syncing the SeaweedFS S3 (the /buckets directory in the filer) to another S3 storage, e.g. another SeaweedFS S3 or MinIO.

We could have a different sink for this option, or even a different command for it, but I suggest using this built-in option in s3.sink, because it is embedded in the code, does not require many changes to structs or interfaces, and is easy to use.
If we separate this option from the current logic, maintaining the code will be harder.

@chrislusf (Collaborator)

Actually it will not support multiple buckets in all scenarios; I think "syncing S3" conveys the meaning better. If we name it support_multiple_buckets, the user will think they can set custom buckets for the backup, which is not the case. It is multiple buckets, but only for S3, and the multiple buckets here are the buckets in the SeaweedFS S3.
Also remember that this is an S3-only feature for syncing the buckets into another S3.

Why is this limited to the s3 sink only?

@itsfarbod (Contributor Author)

Actually it will not support multiple buckets in all scenarios; I think "syncing S3" conveys the meaning better. If we name it support_multiple_buckets, the user will think they can set custom buckets for the backup, which is not the case. It is multiple buckets, but only for S3, and the multiple buckets here are the buckets in the SeaweedFS S3.
Also remember that this is an S3-only feature for syncing the buckets into another S3.

Why is this limited to the s3 sink only?

Because you have an S3 on SeaweedFS and another sink that is also an S3, you can use this option to more or less mirror the S3, syncing the two together. The backup S3 will be available when the SeaweedFS S3 is down.

You could mirror other sinks too, but you would have to check the structure of the destination backup sink. Here the source and the destination (backup) are both S3 and both work with buckets.

@chrislusf (Collaborator)

@itsfarbod (Contributor Author)

While I was checking filer.remote.gateway, I encountered an issue.
As described in the documentation, I configured a remote storage of type s3 like this in weed shell:
> remote.configure -name=myremote -type=s3 -s3.endpoint=<endpoint> -s3.access_key=<access_key> -s3.secret_key=<secret_key>
So it was possible for me to see the remote storage when executing remote.storage in weed shell:

"type":  "s3",
  "name":  "myremote",
  "s3AccessKey":  "ftR43YYrCwb2s7Td2f32",
  "s3SecretKey":  "****************************************",
  "s3Region":  "us-east-2",
  "s3Endpoint":  "<s3_endpoint>",
  "s3StorageClass":  "",
  "s3ForcePathStyle":  true,
  "s3SupportTagging":  true,
  "s3V4Signature":  false,
  "gcsGoogleApplicationCredentials":  "",
  "gcsProjectId":  "",
  "azureAccountName":  "",
  "azureAccountKey":  "",
  "backblazeKeyId":  "",
  "backblazeApplicationKey":  "",
  "backblazeEndpoint":  "",
  "backblazeRegion":  "us-west-002",
  "aliyunAccessKey":  "",
  "aliyunSecretKey":  "",
  "aliyunEndpoint":  "",
  "aliyunRegion":  "",
  "tencentSecretId":  "",
  "tencentSecretKey":  "",
  "tencentEndpoint":  "",
  "baiduAccessKey":  "",
  "baiduSecretKey":  "",
  "baiduEndpoint":  "",
  "baiduRegion":  "",
  "wasabiAccessKey":  "",
  "wasabiSecretKey":  "",
  "wasabiEndpoint":  "",
  "wasabiRegion":  "",
  "filebaseAccessKey":  "",
  "filebaseSecretKey":  "",
  "filebaseEndpoint":  "",
  "storjAccessKey":  "",
  "storjSecretKey":  "",
  "storjEndpoint":  "",
  "contaboAccessKey":  "",
  "contaboSecretKey":  "",
  "contaboEndpoint":  "",
  "contaboRegion":  ""

Then I moved on to running weed filer.remote.gateway.
The first time, I created a separate instance with this command:
weed filer.remote.gateway -filer=<filer_ip>:<filer_port> -createBucketAt=myremote
The second time, I tried it inside the filer:
weed filer.remote.gateway -createBucketAt=myremote

In both attempts I encountered this error:
read mount info: remote storage is not configured in filer server

It seems it tries to read the remote storage from the filer, but it is not present there; yet when I run remote.configure in weed shell, I see the remote storage as in the list above.

Something important to mention: I ran all the weed shell commands on the master.

So I was not actually able to test it out and fully compare it to the current sink.s3 solution.

I would appreciate it if you could guide me through this.
Thanks,

@itsfarbod (Contributor Author)

So I checked filer.remote.gateway and encountered some errors, which I solved in these PRs:
#5100
#5102

After getting filer.remote.gateway to work, I tested its performance. Unfortunately, it seems to have performance issues and is much slower than filer.backup, maybe because of mount overhead or something similar.
E.g., in a 5-minute write run with a 1 KB file size, I wrote 140k files, and the sync took 8 hours to complete (140,000 files in 28,800 s is roughly 4.9 files/s)!
On the other hand, configuring filer.remote.gateway for backup is somewhat complicated, while filer.backup is much easier to use.
Also, this backup process (s3.sink) is passive, unlike the mount-based features: with remote.gateway, changes on the backup S3 can cause conflicts in the local S3 and manipulate the local data. Here we have only a backup, just like the traditional s3.sink, except that it backs up into multiple buckets instead of one.

Please check this issue and do a little code review of what I did. Thanks,

@itsfarbod (Contributor Author)

So the performance issue of filer.remote.gateway was found and solved by me in this PR: #5123

Now this feature (filer.remote.gateway) is enough for me and does the job fine.
But I still think sync_s3_to_s3 is good to have in the sinks.
I will also improve its performance to match filer.remote.gateway and filer.remote.sync.
The isolation level of this feature is better than the mount options, so it is still an interesting feature to me.
I will support it, develop further features, and fix any bugs we encounter.
So I would be thankful to have this feature merged before the next release. Thanks
