
is_bucket_to_bucket backup for s3.sink only #5054

Open · wants to merge 20 commits into master
Conversation

@itsfarbod (Contributor) commented Nov 27, 2023

What problem are we solving?

There was a problem with filer.backup using the s3.sink: it backed up all the buckets into one bucket under one path.
So I added an option called is_bucket_to_bucket to solve this problem.

How are we solving the problem?

I added an option called is_bucket_to_bucket. When it is enabled, the sourcePath is changed to /buckets, and the excludedPaths must also start with /buckets.
Note that this solution only backs up the contents of /buckets, not other paths.
When you enable is_bucket_to_bucket, it replaces sourcePath with /buckets and then validates the excludedPaths. It then sets the directory to an empty string so that key building works correctly, and sets the bucket name in the Initializer to an empty string, which is ignored and replaced with the specific bucket name.
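
A minimal sketch of the option handling described above; BackupOptions, applyBucketToBucket, and the log messages are illustrative names of my own, not the actual PR code:

    package main

    import (
        "log"
        "strings"
    )

    type BackupOptions struct {
        IsBucketToBucket bool
        SourcePath       string
        ExcludedPaths    []string
    }

    // applyBucketToBucket forces the source path to /buckets and drops
    // excluded paths that could never match under it.
    func applyBucketToBucket(opts *BackupOptions) {
        if !opts.IsBucketToBucket {
            return
        }
        if opts.SourcePath != "/buckets" {
            log.Printf("is_bucket_to_bucket is enabled: overriding source path %q with /buckets", opts.SourcePath)
            opts.SourcePath = "/buckets"
        }
        kept := opts.ExcludedPaths[:0]
        for _, p := range opts.ExcludedPaths {
            if strings.HasPrefix(p, "/buckets") {
                kept = append(kept, p)
            } else {
                log.Printf("excluded path %q does not start with /buckets and will have no effect", p)
            }
        }
        opts.ExcludedPaths = kept
    }

    func main() {
        opts := &BackupOptions{
            IsBucketToBucket: true,
            SourcePath:       "/home/data",
            ExcludedPaths:    []string{"/buckets/tmp", "/home/cache"},
        }
        applyBucketToBucket(opts)
        log.Printf("source=%s excluded=%v", opts.SourcePath, opts.ExcludedPaths)
    }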

How is the PR tested?

I tested it with 100k PUT and DELETE operations using warp.
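
For reference, a hypothetical warp invocation along these lines; the endpoint, credentials, object size, and duration are placeholders, not the exact commands used:

    warp put --host=<s3_endpoint> --access-key=<access_key> --secret-key=<secret_key> --obj.size=1KiB --duration=5m
    warp delete --host=<s3_endpoint> --access-key=<access_key> --secret-key=<secret_key>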

Checks

  • I will add the related wiki documentation changes and link them to this PR after merging.

itsfarbod and others added 12 commits November 26, 2023 13:54
	modified:   weed/replication/sink/azuresink/azure_sink.go
	modified:   weed/replication/sink/b2sink/b2_sink.go
	modified:   weed/replication/sink/filersink/filer_sink.go
	modified:   weed/replication/sink/gcssink/gcs_sink.go
	modified:   weed/replication/sink/localsink/local_sink.go
	modified:   weed/replication/sink/replication_sink.go
	modified:   weed/replication/sink/s3sink/s3_sink.go

    Added is_bucket_to_bucket option for replication.toml sinks
    Changed Initializer of s3.sink to be compatible with the is_bucket_to_bucket option in replication.toml
    Added bucket creation and deletion functions in s3.sink
    Logic of bucket handling when is_bucket_to_bucket is enabled in s3.sink
    Modified filer_backup to set the source path to /buckets when is_bucket_to_bucket is enabled
    The logic behind this is that it should only back up buckets, not other paths
Added a log message to show the default values when is_bucket_to_bucket is true
is_bucket_to_bucket option
is_bucket_to_bucket only for s3.sink
@chrislusf (Collaborator)

Please explain more about what the problem is.

@itsfarbod (Contributor Author)

Hi,
It's a new feature of s3.sink: when it backs up the S3 buckets in SeaweedFS, each bucket gets a corresponding bucket in the destination backup S3. When is_bucket_to_bucket is enabled, the configured sourcePath is ignored and changed to /buckets so that only the buckets are backed up. I did not find a solution for backing up other paths into separate buckets in the backup S3, so I decided to back up only the buckets of the SeaweedFS S3, putting each bucket and its data into a corresponding destination bucket. I have tested it and it works fine: it recognizes the creation and deletion of buckets and writes each object into its actual bucket in the destination. In conclusion, we get a full S3-level sync between the SeaweedFS S3 and a backup S3, which can be MinIO etc.
The behavior of the code when is_bucket_to_bucket is enabled:
First, it checks the source path, changes it to /buckets, and logs this change for the user.
Second, it checks the excludedPaths and warns if some of them do not start with /buckets, logging that they will not affect the backup because the sourcePath is /buckets.
Third, it sets the initial bucket and directory of s3.sink to empty strings. The empty directory helps to properly build the keys for writes, and the empty bucket is replaced with the corresponding bucket on every change.
When the code needs to write into a bucket, it checks whether the bucket exists and creates it if it does not; on deletion, it checks whether the bucket exists and deletes it if it does (see the sketch below).
It also ignores the creation and deletion of paths (I mean the prefixes of files) inside buckets, because they do not need to be handled separately: they take effect through the writes and deletes of the keys themselves.
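
A minimal sketch of that create-if-missing / delete-if-exists behavior, assuming the AWS SDK for Go that the s3 sink already depends on; ensureBucket and deleteBucketIfExists are hypothetical helper names, not the PR's actual functions:

    package main

    import (
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/awserr"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
    )

    // ensureBucket creates the destination bucket if it does not exist yet.
    func ensureBucket(svc *s3.S3, name string) error {
        if _, err := svc.HeadBucket(&s3.HeadBucketInput{Bucket: aws.String(name)}); err == nil {
            return nil // bucket already exists
        }
        _, err := svc.CreateBucket(&s3.CreateBucketInput{Bucket: aws.String(name)})
        if aerr, ok := err.(awserr.Error); ok && aerr.Code() == s3.ErrCodeBucketAlreadyOwnedByYou {
            return nil // lost a race with another creator; treat as success
        }
        return err
    }

    // deleteBucketIfExists removes the destination bucket, ignoring the
    // case where it is already gone.
    func deleteBucketIfExists(svc *s3.S3, name string) error {
        _, err := svc.DeleteBucket(&s3.DeleteBucketInput{Bucket: aws.String(name)})
        if aerr, ok := err.(awserr.Error); ok && aerr.Code() == s3.ErrCodeNoSuchBucket {
            return nil
        }
        return err
    }

    func main() {
        sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-east-1")}))
        svc := s3.New(sess)
        if err := ensureBucket(svc, "example-bucket"); err != nil {
            log.Fatal(err)
        }
    }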
All steps of the process log their errors and results for the user, so everything is visible.
Feel free to review the code and send me any feature requests.
Thanks,

@chrislusf (Collaborator)

  • It's a change only to the s3 sink. Can you make the change specific to S3, with no changes to the code of the other sinks?
  • The attribute name is_bucket_to_bucket is not descriptive.

@itsfarbod (Contributor Author) commented Nov 28, 2023

  • I have to mention that adding is_bucket_to_bucket to the sink struct makes the implementation much easier; for sinks other than S3, I will log a warning if the option is enabled, saying that it will be ignored.
  • OK, what do you suggest? What about s3_sync?

@itsfarbod (Contributor Author) commented Nov 29, 2023

Any suggestions?
And what about the option name? Is s3_sync OK?

itsfarbod added 3 commits November 29, 2023 14:03
    sourcePath replacement and excludedPaths checks only for the `s3` sink

    It does not affect other sinks; if the option is enabled there, it logs a warning and is ignored
    Changed is_bucket_to_bucket to sync_s3_to_s3
    Now only sink.s3 has the sync_s3_to_s3 option
    Other sinks are not affected by sync_s3_to_s3
@itsfarbod (Contributor Author)

As you requested, I changed the code:
The option name changed from is_bucket_to_bucket to sync_s3_to_s3.
I reverted ReplicationSink to its initial state.
Now only sink.s3 has the option, without changing the code of the other sinks.
Please review.
Thanks,

@chrislusf (Collaborator)

How about this option?

prefix_to_trim=["/buckets","/home"]

@chrislusf (Collaborator)

Or add a pair of prefix_from and prefix_to, for more generic operations.
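
For illustration, a minimal sketch of what such a generic mapping could look like; mapKey and its parameter names are hypothetical, not code from this PR:

    package main

    import (
        "fmt"
        "strings"
    )

    // mapKey rewrites a filer path into a destination key by replacing
    // prefixFrom with prefixTo. With prefixTo == "" it behaves like one
    // entry of the prefix_to_trim list above.
    func mapKey(path, prefixFrom, prefixTo string) (string, bool) {
        if !strings.HasPrefix(path, prefixFrom) {
            return "", false // path is outside the mapped subtree
        }
        return prefixTo + strings.TrimPrefix(path, prefixFrom), true
    }

    func main() {
        if key, ok := mapKey("/buckets/photos/cat.jpg", "/buckets", ""); ok {
            fmt.Println(key) // prints /photos/cat.jpg
        }
    }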

@itsfarbod (Contributor Author)

I didn't understand your request.
Because it's a sync between S3s, you shouldn't be able to change the source and destination paths. It's a full sync of the S3. For other filters, the user can use the old logic you implemented by setting sync_s3_to_s3 to false.
When you enable sync_s3_to_s3, it backs up all buckets into their corresponding buckets in the destination. So a prefix can't be handled: the logic only supports /buckets, and setting a prefix for the destination would change the names of the buckets there. When you set a prefix for source or destination other than the defaults of sync_s3_to_s3, there is no straightforward or standard way to map those paths into separate buckets.
I have to mention that this feature is for fully syncing the S3 interface, so that the system stays accessible from the backup S3 when SeaweedFS is down. I am also investigating active-passive and active-active models for these situations.
So if a user only wants a passive backup of S3 that is accessible when the main S3 is down, this feature is good for them. It allows the user to read the data when something is wrong with SeaweedFS's S3.
So with this use case, the user gets a passive backup of the S3, not of other paths in the filer.

@chrislusf (Collaborator)

How about naming this feature/option "support_multiple_buckets"?

@itsfarbod (Contributor Author)

Actually it will not support multiple buckets in all scenarios; I think "syncing S3" conveys the meaning better. If we name it support_multiple_buckets, the user will think they can set custom buckets for the backup, which is not the case. It is multiple buckets, but only for S3, and the multiple buckets here are the buckets in the SeaweedFS S3.
Also remember that this is an S3-only feature for syncing the buckets into another S3.

@itsfarbod (Contributor Author)

Any ideas? Or suggestions?

@chrislusf (Collaborator)

The variable name is still confusing and I couldn't tell what it does.

@itsfarbod (Contributor Author)

Is your suggestion support_multiple_buckets OK? Do you approve?

@chrislusf (Collaborator)

That also sounds like a misunderstanding, as you mentioned. So it's still very confusing.

@amir-latifi

Hi. I followed this thread. As this is a valuable feature in my opinion, I have some suggestions and alternative solutions; I hope they are helpful:

  1. Use sink_to_multiple_buckets as the variable name.
  2. Implement weed s3.backup instead of weed filer.backup. It is clearer, and it would be easy if the core logic of filer.backup could be reused in s3.backup.
  3. Add another section to replication.toml named [sink.s3_buckets], specifically for syncing the SeaweedFS S3 (the /buckets directory in the filer) to another S3 storage, e.g. another SeaweedFS S3 or MinIO.

@itsfarbod (Contributor Author)

That also sounds like a misunderstanding, as you mentioned. So it's still very confusing.

If you take a look at my changes to replication.toml, I wrote a verbose description of this feature, so it is fairly clear to the user what the feature does:

# It will back up all the buckets under the /buckets path to the specified S3.
# Each bucket will have its corresponding bucket in the backup S3.
# Source path will be ignored and automatically set to /buckets, and is_incremental is always 
# Exclude paths on the filer.backup command will only take effect if they start with /buckets
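
For reference, the resulting [sink.s3] section might look roughly like this; apart from sync_s3_to_s3, the keys are assumed from the stock replication.toml and may differ by version:

    [sink.s3]
    enabled = true
    aws_access_key_id = ""     # if empty, loads from the shared credentials file
    aws_secret_access_key = ""
    region = "us-east-2"
    endpoint = ""
    # bucket and directory are ignored when sync_s3_to_s3 is enabled:
    # each source bucket maps to a same-named destination bucket
    bucket = ""
    directory = ""
    is_incremental = false
    sync_s3_to_s3 = true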

Also these suggestions are helpful:

Hi. I followed this thread. As this is a valuable feature in my opinion, I have some suggestions and alternative solutions; I hope they are helpful:

1. Use `sink_to_multiple_buckets` as the variable name.

2. Implement `weed s3.backup` instead of `weed filer.backup`. It is clearer, and it would be easy if the core logic of filer.backup could be reused in s3.backup.

3. Add another section to replication.toml named `[sink.s3_buckets]`, specifically for syncing the SeaweedFS S3 (the /buckets directory in the filer) to another S3 storage, e.g. another SeaweedFS S3 or MinIO.

We could have a different sink for this option, or even a different command for it, but I suggest using this built-in option in s3.sink, because it is embedded in the code, does not require many changes to structs or interfaces, and is easy to use.
If we separate this option from the current logic, maintaining the code will be harder.

@chrislusf (Collaborator)

Actually it will not support multiple buckets in all scenarios; I think "syncing S3" conveys the meaning better. If we name it support_multiple_buckets, the user will think they can set custom buckets for the backup, which is not the case. It is multiple buckets, but only for S3, and the multiple buckets here are the buckets in the SeaweedFS S3.
Also remember that this is an S3-only feature for syncing the buckets into another S3.

Why is this limited to the s3 sink only?

@itsfarbod (Contributor Author)

Actually it will not support multiple buckets in all scenarios; I think "syncing S3" conveys the meaning better. If we name it support_multiple_buckets, the user will think they can set custom buckets for the backup, which is not the case. It is multiple buckets, but only for S3, and the multiple buckets here are the buckets in the SeaweedFS S3.
Also remember that this is an S3-only feature for syncing the buckets into another S3.

Why is this limited to the s3 sink only?

Because you have an S3 on SeaweedFS and another sink that is also an S3, you can use this option to more or less mirror the S3, syncing the two together. The backup S3 will be available when the SeaweedFS S3 is down.

You could mirror other sinks too, but you would have to check the structure of the destination backup sink. Here the source and the destination (backup) are both S3 and both work with buckets.

@chrislusf (Collaborator)

@itsfarbod (Contributor Author)

While I was checking filer.remote.gateway, I encountered an issue.
As described in the documentation, I configured a remote storage of type s3 like this in weed shell:
> remote.configure -name=myremote -type=s3 -s3.endpoint=<endpoint> -s3.access_key=<access_key> -s3.secret_key=<secret_key>
So it was possible for me to see the remote storage when executing remote.storage in weed shell:

"type":  "s3",
  "name":  "myremote",
  "s3AccessKey":  "ftR43YYrCwb2s7Td2f32",
  "s3SecretKey":  "****************************************",
  "s3Region":  "us-east-2",
  "s3Endpoint":  "<s3_endpoint>",
  "s3StorageClass":  "",
  "s3ForcePathStyle":  true,
  "s3SupportTagging":  true,
  "s3V4Signature":  false,
  "gcsGoogleApplicationCredentials":  "",
  "gcsProjectId":  "",
  "azureAccountName":  "",
  "azureAccountKey":  "",
  "backblazeKeyId":  "",
  "backblazeApplicationKey":  "",
  "backblazeEndpoint":  "",
  "backblazeRegion":  "us-west-002",
  "aliyunAccessKey":  "",
  "aliyunSecretKey":  "",
  "aliyunEndpoint":  "",
  "aliyunRegion":  "",
  "tencentSecretId":  "",
  "tencentSecretKey":  "",
  "tencentEndpoint":  "",
  "baiduAccessKey":  "",
  "baiduSecretKey":  "",
  "baiduEndpoint":  "",
  "baiduRegion":  "",
  "wasabiAccessKey":  "",
  "wasabiSecretKey":  "",
  "wasabiEndpoint":  "",
  "wasabiRegion":  "",
  "filebaseAccessKey":  "",
  "filebaseSecretKey":  "",
  "filebaseEndpoint":  "",
  "storjAccessKey":  "",
  "storjSecretKey":  "",
  "storjEndpoint":  "",
  "contaboAccessKey":  "",
  "contaboSecretKey":  "",
  "contaboEndpoint":  "",
  "contaboRegion":  ""

Then I moved on to running weed filer.remote.gateway.
The first time, I created a separate instance with this command:
weed filer.remote.gateway -filer=<filer_ip>:<filer_port> -createBucketAt=myremote
The second time, I tried it inside the filer:
weed filer.remote.gateway -createBucketAt=myremote

In both attempts I encountered this error:
read mount info: remote storage is not configured in filer server

It seems it tries to read the remote storage from the filer, but it is not present there; yet when I run remote.configure in weed shell, I see the remote storage as in the list above.

Something important to mention: I ran all the weed shell commands on the master.

So I was not actually able to test it out and fully compare it to the current sink.s3 solution.

I would appreciate it if you could guide me through this.
Thanks,

@itsfarbod (Contributor Author)

So I checked filer.remote.gateway and encountered some errors, which I solved in these PRs:
#5100
#5102

After getting filer.remote.gateway to work, I tested its performance. Unfortunately, it seems to have performance issues and is much slower than filer.backup, maybe because of mount overhead or something similar.
E.g., in a 5-minute write run with a 1 KB file size, I wrote 140k files, and the sync took 8 hours to complete (140,000 files in 28,800 s is roughly 4.9 files/s)!
On the other hand, configuring filer.remote.gateway for backup is somewhat complicated, while filer.backup is much easier to use.
Also, this backup process (s3.sink) is passive, unlike the mount-based features: with remote.gateway, changes on the backup S3 can cause conflicts in the local S3 and manipulate the local data. Here we have only a backup, just like the traditional s3.sink, except that it backs up into multiple buckets instead of one.

Please check this issue and do a little code review of what I did. Thanks,

@itsfarbod (Contributor Author)

So the performance issue of filer.remote.gateway was found and solved by me in this PR: #5123

Now this feature (filer.remote.gateway) is enough for me and does the job fine.
But I still think sync_s3_to_s3 is good to have in the sinks.
I will also improve its performance to match filer.remote.gateway and filer.remote.sync.
The isolation level of this feature is better than the mount options, so it is still an interesting feature to me.
I will support it, develop further features, and fix any bugs we encounter.
So I would be thankful to have this feature merged before the next release. Thanks
