My main account is dark_arc@social.packetloss.gg. However, as of roughly 24 hours ago (though it seems this has actually been going on since March 10th and gotten worse since), the server has stopped properly retrieving content from lemmy.world.
It’s been running smoothly for well over 9 months and (I think) is still working fine for content coming in from other instances. So I’m curious whether anyone else has experienced anything strange with lemmy.world federation recently?
Setup Description
The server flow in my case is as follows:
[Public Internet] <-> [Digital Ocean Droplet] <-> [ZeroTier] <-> [Physical Machine in my Basement (HW Info)]
The Digital Ocean droplet is a virtual host machine that forwards requests via nginx to the physical machine where a second nginx server (running the standard lemmy nginx config) then forwards the request to the lemmy server software itself.
Current Status
Lemmy Internal Error
I’ve found this in my lemmy logs:
2024-03-24T00:42:10.062274Z WARN lemmy_utils: error in spawn: Unknown: Request limit was reached during fetch
0: lemmy_apub::objects::community::from_json
at crates/apub/src/objects/community.rs:126
1: lemmy_apub::fetcher::user_or_community::from_json
at crates/apub/src/fetcher/user_or_community.rs:87
2: lemmy_server::root_span_builder::HTTP request
with http.method=POST http.scheme="http" http.host=social.packetloss.gg http.target=/inbox otel.kind="server" request_id=688ad030-f892-4925-9ce9-fc4f3070a967
at src/root_span_builder.rs:16
I’m thinking this could be the cause … though I’m not sure how to raise the limit (it seems to be hard coded). I opened an issue with the Lemmy devs but I’ve since closed it while gathering more information/making sure this is truly an issue with the Lemmy server software.
Nginx 408 and 499s
I’m seeing the Digital Ocean nginx server reporting 499s on various “/inbox” route requests, and the nginx running on the physical machine that talks directly to Lemmy reporting 408s on the same route.
There are some examples in this comment: https://lemmy.world/comment/8728858
Checking lemmy.world’s federation status at https://phiresky.github.io/lemmy-federation-state/site?domain=lemmy.world, you can see that social.packetloss.gg is listed among the “407 failing instances”.
You’ll need to check whether your server is actually configured to receive federation traffic. If you’re using Cloudflare or some other web application firewall, make sure it isn’t applying any anti-bot measures to the /inbox endpoint. For example, in Cloudflare, create a new WAF rule (Security -> WAF) for /inbox and set it to skip all security.
If you don’t use any web application firewall at all, did you upgrade your instance from v18.x to v19.x recently, right before experiencing the federation issue? v19.x has increased resource consumption and will have problems running on a small server after running for a while. For a small VPS (~4 GB of RAM), you might want to adjust the database pool_size to <30 in the lemmy.hjson file. Restarting Lemmy AND Postgres every once in a while also helps if you’re on a small VPS.
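As a point of reference, here is a minimal sketch of that change; the file layout and service names below come from the standard docker-compose setup and may differ on your deployment:
# in lemmy.hjson, inside the database block (the value here is just an example):
#   database: {
#     pool_size: 20
#   }
# then restart both services so the new pool size takes effect:
docker compose restart lemmy postgres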
It’s been up on 19.x for a few months now. It’s also a full-on bare metal server with a ton of resources; it’s not at all starved.
It’s almost like someone posted something somewhere that “jammed” Lemmy, and it just won’t get past it, but I’m not sure how to figure out what that would be or how to unjam things.
Could be postgres-related. Federation is only “jammed” if the source instance thinks your instance is having an issue because it takes too long to respond. Maybe enabling slow query log on postgres and then reviewing that log could point you in the right direction.
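For what it’s worth, a quick sketch of enabling that; the service name and database user are assumptions based on the standard docker-compose files:
# log any statement slower than 500 ms (pick a threshold that suits you)
docker compose exec postgres psql -U lemmy -c "ALTER SYSTEM SET log_min_duration_statement = '500ms';"
docker compose exec postgres psql -U lemmy -c "SELECT pg_reload_conf();"
# slow statements will then show up in the postgres container's log
docker compose logs -f postgres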
Using that site and checking my instance against lemmy.world, your site is listed under the working ones. I know spam has been on the rise again on big instances, so that could be playing a part in this too.
I’m one of a whole 2 users at lemmy.starlightkel.xyz and we’re seeing lemmy.world content no problem right now.
Yeah I’m basically the only user of this server. Good data point that you’re not having issues though.
Similar for my instance.
Here’s a handy site to check your federation status: https://phiresky.github.io/lemmy-federation-state/site?domain=social.packetloss.gg
It looks like all the sites are lagging, which can sometimes mean there is an issue. What are the specs of the server? Can you try restarting it and seeing if that helps?
There are issues for servers hosted in places like Australia because of the latency in communicating with .world, given how big it is and how much data it sends as a result.
This is the right answer
So, I think this is a (helpful) general comment but wrong in this/my specific case.
The server is so small it’s not really going to register on a 10-minute frequency for outgoing content – I’m not that much of a lemmy addict! haha.
You can see in a comment here my most recent comment to lemmy.world did sync: https://lemmy.world/comment/8728858
I’m not having any issues with outgoing content, or with Beehaw, the KDE instance, and several others. It’s just lemmy.world that’s acting up (which is unfortunate because it’s my favorite; I mod/run several communities there and donate to them, haha).
I’m not having issues that I’m aware of, but that site always returns Network Request Failed and I haven’t figured out why.
If you’re failing on lemmy.world like others suggest, but federation is working with other instances, it’s likely you’re getting large queries that are timing out with lemmy.world. You can check for timeouts in your proxy’s logs.
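For example, something along these lines against the nginx access log (default path shown; adjust for your setup) will surface the timeout-style status codes on the inbox route:
# list recent /inbox requests that ended in 408/499/504 (status is field 9 in the default combined log format)
grep "POST /inbox" /var/log/nginx/access.log | awk '$9 ~ /^(408|499|504)$/' | tail -n 20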
This guide is for tuning postgres if you find it is the issue. It has good tips, including on how to analyze issues in postgres, though it’s oriented toward Matrix tuning. work_mem and temp files will likely need constant tuning to meet your demands if you have limited memory for postgres.
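One rough way to see whether temp files are actually piling up; the service name and database name here are assumptions from the standard setup:
# cumulative temp-file usage for the lemmy database since the last stats reset
docker compose exec postgres psql -U lemmy -c "SELECT datname, temp_files, pg_size_pretty(temp_bytes) FROM pg_stat_database WHERE datname = 'lemmy';"
# if those numbers keep climbing, raising work_mem is the usual lever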
Lastly, I wanted to note that Lemmy does application-side connection pooling, so if you have a pooler service in front of postgres, don’t use it; let Lemmy connect directly to postgres.
Hope these help. I recently went from a 1 million federation backlog with lemmy.world to near none. I believe I’m at a RAM limit, and a couple of expensive queries (deleting posts, for example) are creating temp files and taking too long.
So, I think you’re the most on the right track of the responses…
It seems to be exclusively incoming content from lemmy.world. If you look here, my most recent comment did make it to lemmy.world:
https://social.packetloss.gg/comment/1415801 https://lemmy.world/comment/8710941
The instance just isn’t getting any new posts, comments, or votes back from lemmy.world.
Every time I shut down the lemmy server I see this:
2024-03-23T17:34:33.774333Z WARN lemmy_server: Received ctrl-c, shutting down gracefully...
2024-03-23T17:34:33.774912Z WARN lemmy_federate: Waiting for 1618 workers (30.00s max)
That number never seems to move; there are always 1618 workers. I’m not sure if that means anything or not regarding pending processing or what have you.
I am seeing in my publicly facing nginx logs:
135.181.143.221 - - [22/Mar/2024:12:23:28 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:12:23:40 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:12:23:54 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:12:24:12 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:12:24:38 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:12:25:21 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:12:26:35 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:12:28:53 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:12:33:19 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:12:42:01 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:12:59:15 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:13:33:33 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
135.181.143.221 - - [22/Mar/2024:21:31:55 +0000] "POST /inbox HTTP/1.1" 499 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
There’s then an internal nginx server that sees:
10.241.127.2 - - [22/Mar/2024:12:23:18 +0000] "POST /inbox HTTP/1.1" 200 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:12:24:19 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:12:24:31 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:12:24:45 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:12:25:03 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:12:25:29 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:12:26:11 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:12:27:25 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:12:29:43 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:12:34:09 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:12:42:51 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:13:00:06 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:13:34:24 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:14:42:49 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:16:59:32 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [22/Mar/2024:21:32:45 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
10.241.127.2 - - [23/Mar/2024:06:39:03 +0000] "POST /inbox HTTP/1.1" 408 0 "-" "Lemmy/0.19.3; +https://lemmy.world"
So, things did start timing out. I’m not sure what to do about that though.
This server is not resource starved:
load average: 0.04, 0.09, 0.10
               total        used        free      shared  buff/cache   available
Mem:           31507        7651        1092         164       22764       23239
Swap:          16087           1       16086
It’s just this lemmy.world data that’s suddenly out of whack after months of normal operation (both on Lemmy 18 and 19).
It feels like a bad payload that the server just can’t move past for some reason and lemmy.world keeps sending.
I had logging on the lemmy container itself piped to /dev/null because it’s just such a noisy log. I turned it back on… I’ll see if I can find more information next time lemmy.world posts.
Another note: you want the federation round trip to complete in under 3 seconds. Those 499s are likely 10-second timeouts from lemmy.world’s side, mostly with a long query being the culprit. Lemmy federation in 0.19 processes changes in order per instance, so lemmy.world won’t sync newer stuff and will keep sending the same large request. I believe there is a retry limit, though.
Yeah, I mean things should be fine in general; like I said this has been working for quite a long time now without issue.
The machine that’s actually doing the work here is quite powerful and is used to run several game servers in addition to Lemmy … Lemmy really isn’t much more than a footnote in resource usage:
CPU:
  Info: 8-core model: Intel Core i7-10700 bits: 64 type: MT MCP cache: L2: 2 MiB
  Speed (MHz): avg: 4653 min/max: 800/4800 cores: 1: 4698 2: 4685 3: 4786 4: 4704 5: 4694 6: 4700 7: 4800 8: 4801 9: 4802 10: 3408 11: 4756 12: 4713 13: 4706 14: 4707 15: 4798 16: 4703
Drives:
  Local Storage: total: 931.51 GiB used: 380.39 GiB (40.8%)
  ID-1: /dev/nvme0n1 vendor: Western Digital model: WDS100T2B0C-00PXH0 size: 931.51 GiB
Partition:
  ID-1: / size: 914.18 GiB used: 380.02 GiB (41.6%) fs: xfs dev: /dev/dm-0
  ID-2: /boot size: 1014 MiB used: 370 MiB (36.5%) fs: xfs dev: /dev/nvme0n1p2
  ID-3: /boot/efi size: 598.8 MiB used: 5.8 MiB (1.0%) fs: vfat dev: /dev/nvme0n1p1
Swap:
  ID-1: swap-1 type: partition size: 15.71 GiB used: 1.2 MiB (0.0%) dev: /dev/dm-1
Sensors:
  System Temperatures: cpu: 28.0 C pch: 26.0 C mobo: N/A
  Fan Speeds (rpm): N/A
Info:
  Processes: 358 Uptime: 16h 39m Memory: total: 32 GiB note: est. available: 30.77 GiB used: 8.54 GiB (27.8%)
  Init: systemd target: multi-user (3) Shell: fish inxi: 3.3.30
How much RAM is postgres using? I’m running on an 11th-gen i7 with 16 GB of RAM, so your hardware is fine. This is the resource usage of my setup. The first six are Lemmy itself, broken up to scale and for redundancy. I have a node down right now; normally I’m using three http ones. The http ones receive federation from other instances, and the lemmy-instance2 pods are the databases, set up to be highly available as well.
NAME CPU(cores) MEMORY(bytes)
lemmy-6b9fbd75f6-4q4n8 1m 5Mi
lemmy-fed-0 2m 18Mi
lemmy-fed-1 1m 18Mi
lemmy-fed-2 6m 17Mi
lemmy-http-57886f784f-5x9vb 4m 73Mi
lemmy-http-57886f784f-tpbtb 13m 121Mi
lemmy-instance2-czc6-0 47m 3362Mi
lemmy-instance2-jpw7-0 13m 1017Mi
Not sure what you’re using to generate that list; the formatting is a bit difficult to read.
I don’t have a cluster, since it’s effectively single user + @Auto_Post_Bot@social.packetloss.gg (in theory a few other people have access, but they’re not active), on a single machine. It’s just more or less the out-of-the-box Docker stuff on a bare metal machine in my basement + a Digital Ocean droplet.
The droplet is what I’m using to get a static IP and avoid dynamic DNS nonsense, and it provides some level of protection against a naive DDoS attack of the sort random fediverse servers get (since, in the worst case, I can get on my phone and sever the ZeroTier connection that’s used to connect the droplet to my basement server).
I’m pretty confident whatever is going on is payload related at this point.
PID   USER PR NI VIRT   RES    SHR    S %CPU %MEM TIME+   COMMAND
50622 70   20 0  330264 240200 201512 S 0.0  0.7  0:25.21 postgres
50636 70   20 0  327804 239520 201296 S 0.0  0.7  0:26.55 postgres
50627 70   20 0  327204 239152 201592 S 0.0  0.7  0:24.75 postgres
50454 70   20 0  328932 238720 200872 S 0.0  0.7  0:26.61 postgres
50639 70   20 0  313528 217800 193792 S 0.0  0.7  0:03.13 postgres
50641 70   20 0  313284 217336 194204 S 0.0  0.7  0:03.15 postgres
50626 70   20 0  313592 216604 193636 S 0.0  0.7  0:05.07 postgres
50632 70   20 0  313236 216460 193968 S 0.0  0.7  0:04.52 postgres
50638 70   20 0  310368 216084 193856 S 0.0  0.7  0:04.20 postgres
50614 70   20 0  310520 216072 193840 S 0.0  0.7  0:02.88 postgres
50642 70   20 0  312200 215920 194068 S 0.0  0.7  0:04.46 postgres
50640 70   20 0  312584 215724 193676 S 0.0  0.7  0:03.32 postgres
50635 70   20 0  309744 215404 193764 S 0.0  0.7  0:02.72 postgres
50630 70   20 0  312168 215224 193488 S 0.0  0.7  0:02.67 postgres
50621 70   20 0  309560 215096 193772 S 0.0  0.7  0:02.97 postgres
50646 70   20 0  309492 215008 193560 S 0.0  0.7  0:04.66 postgres
50625 70   20 0  309760 215004 193368 S 0.0  0.7  0:03.08 postgres
50637 70   20 0  309296 214992 193848 S 0.0  0.7  0:02.87 postgres
50616 70   20 0  310596 214984 192700 S 0.0  0.7  0:04.17 postgres
50643 70   20 0  310392 214940 194008 S 0.0  0.7  0:04.14 postgres
50624 70   20 0  310128 214880 192928 S 0.0  0.7  0:04.15 postgres
50631 70   20 0  310220 214596 192576 S 0.0  0.7  0:02.71 postgres
50613 70   20 0  309364 213880 192520 S 0.0  0.7  0:04.06 postgres
50628 70   20 0  309852 213236 191504 S 0.0  0.7  0:03.04 postgres
50634 70   20 0  187772 163388 149428 S 0.0  0.5  0:02.87 postgres
50644 70   20 0  189684 162892 148508 S 0.0  0.5  0:04.11 postgres
50633 70   20 0  186096 162544 149324 S 0.0  0.5  0:03.20 postgres
50629 70   20 0  185644 162112 149296 S 0.0  0.5  0:04.62 postgres
50618 70   20 0  186264 160576 147928 S 0.0  0.5  0:04.10 postgres
50582 70   20 0  185708 160236 147592 S 0.0  0.5  0:04.10 postgres
3108  70   20 0  172072 144092 142256 S 0.0  0.4  0:04.46 postgres
3109  70   20 0  172024 142404 140632 S 0.0  0.4  0:02.24 postgres
2408  70   20 0  171856 23660  22020  S 0.0  0.1  0:00.76 postgres
3113  70   20 0  173536 9472   7436   S 0.0  0.0  0:00.15 postgres
3112  70   20 0  171936 8732   7020   S 0.0  0.0  0:01.54 postgres
3114  70   20 0  173472 5624   3684   S 0.0  0.0  0:00.00 postgres
I’ve got quite a bit of experience with postgres; I don’t see any indication it’s the problem.
I have the same issue. I have a very small instance and lemmy.world seems to not work no matter what I do. I can get lemmy.ml with no issues, and some of the others, but for some reason lemmy.world just won’t work with any communities, like everything is blacklisted or something.
Acronyms, initialisms, abbreviations, contractions, and other phrases which expand to something larger, that I’ve seen in this thread:
Fewer Letters   More Letters
DNS             Domain Name Service/System
HTTP            Hypertext Transfer Protocol, the Web
IP              Internet Protocol
VPS             Virtual Private Server (opposed to shared hosting)
nginx           Popular HTTP server
How can you check?
If you’re running it in Docker you can just check the logs. I do it like this: docker compose logs -f lemmy, and then see if you have requests from any instance in the log stream. For me it goes pretty fast, but you can always ctrl+c to exit and scroll up to see what you’ve missed. Might not be the most optimal way, but it works for me.
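If the stream is too noisy, you can filter it for the one instance you care about; a small sketch (adjust the service name to whatever your compose file uses):
# follow the Lemmy logs and only show lines mentioning lemmy.world
docker compose logs -f lemmy | grep -F "lemmy.world"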