
[Bug]: Lemmy instances sending every activity twice #4609

Open · 5 tasks done
freamon opened this issue Apr 8, 2024 · 17 comments
Labels
area: federation (support federation via activitypub) · bug (Something isn't working)

Comments

freamon commented Apr 8, 2024

Requirements

  • Is this a bug report? For questions or discussions use https://lemmy.ml/c/lemmy_support
  • Did you check to see if this issue already exists?
  • Is this only a single bug? Do not put multiple bugs in one issue.
  • Do you agree to follow the rules in our Code of Conduct?
  • Is this a backend issue? Use the lemmy-ui repo for UI / frontend issues.

Summary

This is discussed in this post on Lemmy - you can see me protesting that it should be me filing the report, but I've slept since then, and Nutomic said please (?!). There's a question of whether it's due to instance configuration, Lemmy itself, or some combination of the two (it only started with 0.19.x, but not all instances do it, and there isn't always a correlation between it happening and the number of containers an instance is running). I was reluctant to make this Issue because there's information you'll need about instance config that I'm not going to be able to give you, because I'm not running Lemmy, so this is all second-hand info and observation.

The post on Lemmy has a video. The background to it is that I wanted to use ActivityPub for an Automod on lemmy.world/c/starwarsmemes, so I sent a Follow request to that community (using a Ruby script). When anything happens, lemmy.world sends an Announce activity to my server, and the Automod can, amongst other things, send a notification to my phone when a new post is made. In the video, the red terminal is showing a tail -f on the nginx log; the green terminal is showing the .object.id, .object.type, and .object.actor from the JSON contained inside each request. starwarsmemes is pretty quiet at the minute, so I was able to monitor the results of activity that I myself was generating as a user on lemmy.world. Hopefully, the video makes it clear that duplicate activity is being sent every time.
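
(For anyone who wants to replicate the green terminal: it's doing roughly this - a minimal sketch, assuming each request body has already been dumped to a file, and using the same field names shown in the video.)

#!/usr/bin/env ruby
# Minimal sketch: print the fields shown in the video for one incoming
# activity. Takes the path of a dumped request body as its only argument.

require 'json'

activity = JSON.parse(File.read(ARGV[0]))

# For a Lemmy Announce, .object is the wrapped activity (Create, Like, etc.)
object = activity['object']
object = {} unless object.is_a?(Hash)

puts "object.id:    #{object['id']}"
puts "object.type:  #{object['type']}"
puts "object.actor: #{object['actor']}"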

I've seen similar behaviour before: post on lemmy.ca and post on endlesstalk.org. The background to those posts is that I was running a fake lemmy community (best viewed from here), that was using a basic ActivityPub server to federate comments and votes, and I could tell from that that those instances (among others) were duplicating everything.

I saw it first with lemmy.world when they upgraded to 0.19.3, and my dumb AutoMod pinged my phone 4 times to tell me about the same new post (I mentioned it to the admins in this comment).

Outside of my server, you can also see duplicate activity trip other recipients up. One place is Lemmy modlogs. For example, this modlog shows me manually restoring a comment that the AutoMod removed. I did this with my own fair hand, using the lemmy-ui, so I know that whilst I didn't do the same thing twice, activity was sent twice to endlesstalk.org about it. Another is PieFed. They've tightened up their ability to filter out duplicates now, but - if they've not been deleted since - you can still see what happened here and here, as well as here and here (the originals weren't updated, btw). In both circumstances (modlogs and piefed), I'm guessing it's hit-and-miss because it depends on how much time there was between the original activity and its duplicate.

It's entirely possible that this is down to instance mis-configuration (in which case, quite a few instances are making the same mistake, which is a problem in itself). The only indication that the software itself is at fault comes from some slightly vague suggestions by admins. That's why I'd prefer it was them making this Issue instead of me, but I've been requested to do it, so here it is.

Steps to Reproduce

  1. Be a server admin, using software that lets you monitor incoming activity
  2. Subscribe to a community on an affected instance like lemmy.world
  3. Generate some activity using an actor unrelated to your own server, and see how many times the instance Announces it (a counting sketch follows this list)
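
If it helps, step 3 amounts to something like this (a sketch only - it assumes each incoming request body has been dumped to its own JSON file, using the same 'activity_from_*' naming as the listings further down this thread):

#!/usr/bin/env ruby
# Minimal sketch: count how many times each activity arrived, assuming every
# incoming request body was dumped to its own .json file.

require 'json'

counts = Hash.new(0)

Dir.glob('activity_from_*.json').each do |path|
  activity = JSON.parse(File.read(path))
  object   = activity['object']
  # Announces wrap the original activity, so key on the inner id if present
  id = object.is_a?(Hash) ? object['id'] : activity['id']
  counts[id] += 1 if id
end

counts.select { |_, n| n > 1 }.each { |id, n| puts "#{n}x #{id}" }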

Technical Details

(I can't tell you much here, for the aforementioned I'm-not-running-Lemmy reasons)

Version

0.19.3

Lemmy Instance URL

lemmy.world

freamon added the bug label Apr 8, 2024
Nutomic added the area: federation label Apr 8, 2024
Nutomic (Member) commented Apr 9, 2024

This seems like a bug in Lemmy, so you're definitely right to report it here. It doesn't matter if you observe it from another platform. It looks like all requests go to /inbox and you respond with status 200 without producing error codes or timeouts, so it doesn't look like a problem on your side.

@phiresky Do you have any idea about this? All I can think of is that there's more than one federation worker active for a single target instance, but the code looks correct.

Edit: It could help to write unit tests for the outgoing federation queue, to ensure that it starts and stops workers as expected, and sends activities correctly.

phiresky (Collaborator) commented Apr 9, 2024

I know that lemmy.world definitely had a misconfiguration for a bit where they had multiple instances of Lemmy running without the appropriate federation-worker flags set (since those were not required before 0.19).

If this issue doesn't happen always and for every instance then it's likely a misconfiguration in those instances.

That doesn't mean we shouldn't do something to improve the situation. For example, we could add a heuristic to the federation workers that detects and logs an error when the federation_queue_state table has been updated by someone else, which would mean overlapping federation workers are running.
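
Roughly, the heuristic would be: remember what this worker last wrote, and warn if the row has moved underneath it. A sketch (in Ruby rather than Lemmy's Rust; federation_queue_state is the table named above, and last_successful_id is a stand-in for whatever column actually tracks progress):

# Sketch of the overlap heuristic: remember the id this worker last wrote,
# and warn if the row has been advanced by someone else in the meantime.
require 'pg'   # needs 'gem install pg'

def check_for_overlapping_worker(conn, instance_id, last_id_we_wrote)
  row = conn.exec_params(
    'SELECT last_successful_id FROM federation_queue_state WHERE instance_id = $1',
    [instance_id]
  ).first
  return if row.nil?

  current = row['last_successful_id'].to_i
  if current != last_id_we_wrote
    warn "federation_queue_state for instance #{instance_id} advanced to " \
         "#{current} by someone else (we last wrote #{last_id_we_wrote}) - " \
         "another federation worker may be running for the same target"
  end
end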

The best solution would be a horizontal scaling method that does not require manual configuration of federation instance ranges but I can't think of a way to do that without huge complexity.

Nutomic (Member) commented Apr 9, 2024

The bug also seems to happen for instances that are running only a single Lemmy backend, e.g. mander.xyz.

freamon (Author) commented Apr 10, 2024

Have done some more testing.

Test 1

Sent a Follow from my server to a community on mander.xyz using a Ruby script.

Script in here
 #!/usr/bin/env ruby

require 'securerandom'
require 'openssl'
require 'base64'        # for Base64.strict_encode64
require 'time'          # for Time#httpdate
require 'http'          # needs 'gem install http'

# change these
my_site                 = "lemmon.website"
my_dir                  = "u"
my_actor                = "freamon"
my_private_key_location = "/home/andrew/keys/#{my_actor}/private.pem"
my_public_key_location  = "https://#{my_site}/#{my_dir}/#{my_actor}"

target_site             = "mander.xyz"
target_dir              = "c"
target_actor            = "printforgood"

uuid = SecureRandom.uuid

json = <<~EOT
{
  "@context": [
    "https://w3id.org/security/v1",
    "https://www.w3.org/ns/activitystreams"
  ],
  "id":"https://#{my_site}/follows/#{uuid}",
  "type":"Follow",
  "actor":"https://#{my_site}/#{my_dir}/#{my_actor}",
  "object":"https://#{target_site}/#{target_dir}/#{target_actor}"
}
EOT

date          = Time.now.utc.httpdate
digest        = Base64.strict_encode64(OpenSSL::Digest.digest('SHA256', json))
digest.prepend("SHA-256=")
keypair       = OpenSSL::PKey::RSA.new(File.read("#{my_private_key_location}"))
signed_string = "(request-target): post /#{target_dir}/#{target_actor}/inbox\ncontent-type: application/activity+json\ndate: #{date}\ndigest: #{digest}\nhost: #{target_site}"
signature     = Base64.strict_encode64(keypair.sign(OpenSSL::Digest::SHA256.new, signed_string))
header        = 'keyId="' + my_public_key_location + '#main-key",algorithm="rsa-sha256",headers="(request-target) content-type date digest host",signature="' + signature + '"'

response = HTTP.headers({ 'Host': target_site, 'User-Agent': 'ActivityPub +https://' + my_site, 'Content-Type': 'application/activity+json', 'Digest': digest, 'Date': date, 'Signature': header })
               .post("https://#{target_site}/#{target_dir}/#{target_actor}/inbox", body: json)

p response        # header
p response.to_s   # body (if any)

Received 3 duplicate Accepts in response.
Nginx log:

ip.address - - [09/Apr/2024:18:30:27 +0000] "POST /inbox HTTP/1.1" 200 0 "-" "Lemmy/0.19.3; +https://mander.xyz"
ip.address - - [09/Apr/2024:18:30:27 +0000] "POST /inbox HTTP/1.1" 200 0 "-" "Lemmy/0.19.3; +https://mander.xyz"
ip.address - - [09/Apr/2024:18:30:27 +0000] "POST /inbox HTTP/1.1" 200 0 "-" "Lemmy/0.19.3; +https://mander.xyz"

My backend server just gives each request a UUID and dumps the bodies into a file:

-rw-r--r-- 1 andrew andrew    475 Apr  9 18:30 activity_from_printforgood_737b341e-54f4-42fe-9bdb-ae89c7f89222.json
-rw-r--r-- 1 andrew andrew    475 Apr  9 18:30 activity_from_printforgood_38b15127-ba42-4b4a-b552-946a6707d2e9.json
-rw-r--r-- 1 andrew andrew    475 Apr  9 18:30 activity_from_printforgood_f8a6f4ed-376c-4ade-9abe-87d9035435fc.json

Each file contains the same Accept JSON.

JSON in here
{
  "@context": [
    "https://join-lemmy.org/context.json",
    "https://www.w3.org/ns/activitystreams"
  ],
  "actor": "https://mander.xyz/c/printforgood",
  "to": [
    "https://lemmon.website/u/freamon"
  ],
  "object": {
    "actor": "https://lemmon.website/u/freamon",
    "to": null,
    "object": "https://mander.xyz/c/printforgood",
    "type": "Follow",
    "id": "https://lemmon.website/follows/194d816b-1599-4587-95ae-d274c1d4209c"
  },
  "type": "Accept",
  "id": "https://mander.xyz/activities/accept/b0f94ef2-f411-4b09-8548-9d81b98b01c7"
}
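
(For reference, the backend doing the dumping is nothing clever - roughly this, as a sketch; the real script differs in its details:)

#!/usr/bin/env ruby
# Minimal sketch of the dump-everything backend: each POST body goes to its
# own file, named after the last path segment of the sending actor plus a
# random UUID. WEBrick stands in for whatever actually sits behind nginx.

require 'webrick'       # needs 'gem install webrick' on Ruby 3+
require 'securerandom'
require 'json'

server = WEBrick::HTTPServer.new(Port: 3000)

server.mount_proc '/inbox' do |req, res|
  body  = req.body.to_s
  actor = (JSON.parse(body)['actor'].to_s.split('/').last rescue 'unknown')
  File.write("activity_from_#{actor}_#{SecureRandom.uuid}.json", body)
  res.status = 200
end

trap('INT') { server.shutdown }
server.start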

Test 2

Sent a Follow from a Lemmy instance running inside a VM and tunnelled through via ngrok.com, to a community on mander.xyz, using the lemmy-ui.

Received only 1 Accept in response:

screenshot in here

[screenshot: Lemmy 0.19.3 on DESKTOP-2B5PCTI - Virtual Machine Connection, 09/04/2024 22:30]

Test 3

Still on the VM. Extracted the private key for my lemmy user, and sent a Follow to a different community on mander.xyz (that was conveniently stuck at 'Pending'), using the Ruby script from Test 1.

Received 2 Accepts in response, one of which Lemmy dutifully rejected.

screenshot in here

[screenshot: Lemmy 0.19.3 on DESKTOP-2B5PCTI - Virtual Machine Connection, 09/04/2024 23:38]

Test 4

Used the lemmy-ui again to subscribe to a different community on mander.xyz. Also received 2 Accepts in response, one of which Lemmy rejected.

Test 5

Same as Tests 2-4, but for communities on lemmy.ml instead. Follows were sent using the lemmy-ui and the Ruby script, but only 1 Accept was ever sent back.

Test 6

Sent a Follow from my server to a community on lemmy.ml using the Ruby script. Received only 1 Accept back in response.

Thoughts

I've considered the possibility that my use of lemmon.website to experiment with ActivityPub is a factor - that Lemmy instances have more than one idea of what the site is, and are sending activity for each version. The argument against this is that I've interacted way more with lemmy.ml than with mander.xyz, and there are no duplications coming from lemmy.ml. Also, when lemmy.world was on 0.18.5, it likewise never sent me dupes. I even tried to insert another record for 'lemmon.website' into the lemmy DB on the VM I was using, and - unsurprisingly - it was rejected for not being unique.

Likewise the Ruby script I've used: it works without issue for lemmy.ml (and Mastodon instances, Pixelfed instances, etc.), but my using it in Test 3 seemed to be the trigger that tripped mander.xyz up into sending me activity twice.

I haven't been able to form anything conclusive from this, but maybe something will stand out to people more familiar with Lemmy's code.

Nothing4You (Contributor):

I've seen duplicate activities on Lemmy <-> Lemmy before; it seems unlikely to be related to the software on the receiving side.

phiresky (Collaborator):

It would be good to get a reproduction with an instance that someone actually controls, because otherwise it's hard to say whether there is a misconfiguration or not, and to get the relevant logs / DB entries (which are on the sending side).

phiresky (Collaborator) commented Apr 10, 2024

it seems unlikely to be related to the software on the receiving side.

It might be relevant since in general every actor has their own inbox, and the merging of these inboxes into a single site-wide inbox might not work correctly depending on how the other end presents its inboxes. If Lemmy sees inboxes as separate then it may send events to multiple inboxes.
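
In ActivityPub terms, that merging boils down to preferring each recipient's endpoints.sharedInbox over their personal inbox and then deduplicating - roughly like this (illustrative only, not Lemmy's actual code):

# Sketch of inbox merging: prefer sharedInbox, fall back to the personal
# inbox, and deduplicate. If two actors on the same site advertise different
# inbox URLs, the activity is delivered once per distinct URL.
def delivery_inboxes(recipient_actors)
  recipient_actors.map { |actor|
    actor.dig('endpoints', 'sharedInbox') || actor['inbox']
  }.compact.uniq
end

# Example: both actors share https://example.org/inbox, so one delivery.
actors = [
  { 'inbox' => 'https://example.org/u/alice/inbox',
    'endpoints' => { 'sharedInbox' => 'https://example.org/inbox' } },
  { 'inbox' => 'https://example.org/c/memes/inbox',
    'endpoints' => { 'sharedInbox' => 'https://example.org/inbox' } },
]
p delivery_inboxes(actors)  # => ["https://example.org/inbox"]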

Nutomic (Member) commented Apr 11, 2024

It might be relevant since in general every actor has their own inbox, and the merging of these inboxes into a single site-wide inbox might not work correctly depending on how the other end presents its inboxes. If Lemmy sees inboxes as separate then it may send events to multiple inboxes.

I also thought about that, but based on the logs posted by @freamon, all activities were sent to /inbox. So I don't see any way for deduplication to fail. Also, all inbox requests receive response status 200, so it can't be retries due to errors.

I messaged the admin of mander.xyz to help investigate, but haven't received a response so far.

freamon (Author) commented Apr 26, 2024

Some good news / bad news for you - lemmy.ml is duplicating activities. I subscribed to asklemmy@lemmy.ml from a test Lemmy instance tunnelled through via ngrok, so I could screenshot the 'inspect' window - you can see a pattern of 200 OK, 400 Bad Request for every activity. I've included two images to demonstrate that they are the same activity.

[screenshot: Lemmy 0.19.3 on DESKTOP-2B5PCTI - Virtual Machine Connection, 26/04/2024 15:15]

[screenshot: Lemmy 0.19.3 on DESKTOP-2B5PCTI - Virtual Machine Connection, 26/04/2024 15:16]

(I don't imagine it's a config issue for you guys, but LW changed their config to fix the problem when they were doing it)

jlandahl:

Note that the duplication has returned: https://lemmy.ml/comment/11010288

Anyone know if lemmy.world is testing that config change again or something?

Nothing4You (Contributor):

@freamon brought this up in their Matrix chat the other day, and the answer was that nothing had changed on lemmy.world's end.

freamon (Author) commented May 14, 2024

@jlandahl - LW had to reset a container, so you shouldn't be getting any more duplicate activity from them. It was the same with lemmy.ml incidentally - a few comments up I was saying that they were duplicating too, but then they reset as part of testing an upgrade to 0.19.4, and that stopped it.


As an Issue that seems to start randomly, and is stopped by 'turning it off and on again', I suspect this will be a hard problem to solve.

I noticed, when I was recording the video the other day (of which this is an excerpt), that there was a recurring pattern: 30 seconds of nothing, then a flood of activity, then back to 30 seconds of nothing, etc. I imagine this is by design, and my instance has to wait its turn to get a bunch in one go, rather than getting activities as they are created. It made me wonder whether 'version 1' of an activity was being generated but not actually sent, just queued up and stuck somewhere. Then Lemmy was generating 'version 2' because it thought the first version wasn't delivered successfully, and then both versions were being sent at the same time.
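
To make that guess concrete, the failure mode I'm imagining looks roughly like this (purely illustrative, not Lemmy's actual code):

# Sketch of the guessed failure mode: an at-least-once queue re-sends anything
# not marked as delivered, so if the delivered-marker is lost (restart, or a
# second worker keeping its own copy), the original and the "retry" can both
# end up in the same 30-second flush.
class NaiveSendQueue
  def initialize
    @pending   = []   # activities waiting for the next batch
    @delivered = {}   # activity id => true once a 200 comes back
  end

  def enqueue(activity)
    @pending << activity unless @delivered[activity[:id]]
  end

  # deliver is a block that POSTs one activity and returns true on success
  def flush(&deliver)
    batch, @pending = @pending, []
    batch.each do |activity|
      if deliver.call(activity)
        @delivered[activity[:id]] = true
      else
        @pending << activity   # try again in the next batch
      end
    end
  end
end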

MrKaplan-lw:

Hello,

I'm part of the Lemmy.World infra team.

I'd be happy to help find the root cause of this issue, but it's currently not clear what information we should be looking for.

We're currently running with the default log level for our federation container and there are no logs even mentioning lemmon.website, except for some unrelated html error pages from a kbin instance.

Nutomic (Member) commented May 15, 2024

To fix this we have to narrow down what exactly is causing the duplicate activities. My understanding is that the bug doesn't happen with the default Ansible configuration, and affected instances have some different config. Some things worth trying on a test instance:

  • Does the bug get triggered by running multiple Lemmy processes with --disable-activity-sending, --disable-http-server etc?
  • Is the bug random and goes away after restarting Lemmy?
  • Are all activities sent out multiple times, or only specific activity types?

MrKaplan-lw:

Does the bug get triggered by running multiple Lemmy processes with --disable-activity-sending, --disable-http-server etc?

We've previously been using multiple federation containers; now we are using a single federation container.
As far as we know, this issue has been present both with multiple federation containers and with a single federation container.
We're using --disable-http-server --disable-scheduled-tasks for the single container.

Is the bug random and goes away after restarting Lemmy?

It seems to be random. According to @freamon's feedback, restarting the container fixes it.

Are all activities sent out multiple times, or only specific activity types?

This is probably something @freamon can answer.
As I understand from our Matrix chat, this affects at least posts, comments, and dislikes - probably votes in general.

freamon (Author) commented May 15, 2024

Yeah - restarting seems to fix it (at least from my perspective).

It's for all activities - anything a remote community might Announce, and - for my one local community - anything a user might send.

(I just verified that last part by upvoting a random post on endlesstalk.org. You can tell how broken that instance is anyway by how out-of-sync it is with votes generally (e.g. compare https://endlesstalk.org/post/34649989 with the fedi-link), but the upvote I gave for a post on !tails@lemmon.website was sent to me a sweet 7 times. There's some info here about his container setup, with a comment that also suggests that a restart fixes it.)

Nutomic (Member) commented May 20, 2024

I added some extra logging in #4726 to figure out where the problem is coming from. Once that is merged and released, run Lemmy with RUST_LOG="warn,lemmy_server=debug" or RUST_LOG="warn,lemmy_server=trace" (preferably with a separate process only for federation). Then see if any of these log lines are printed more than once, and report them here.
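
A quick way to spot repeats in that output (a sketch - it only assumes that each relevant log line contains the activity URL somewhere):

# Count repeated activity ids in piped-in log output, e.g.
#   docker logs <your federation container> 2>&1 | ruby count_dupes.rb
counts = Hash.new(0)

ARGF.each_line do |line|
  # Lemmy activity ids look like https://example.org/activities/<type>/<uuid>
  if (id = line[%r{https://\S+/activities/[\w/-]+}])
    counts[id] += 1
  end
end

counts.select { |_, n| n > 1 }
      .sort_by { |_, n| -n }
      .each { |id, n| puts "#{n}x #{id}" }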

dessalines pushed a commit that referenced this issue May 21, 2024
* Extra logging to debug duplicate activities (ref #4609)

* Fix logging for api tests

* fmt