We’ve been dealing with this one for a while now, and we have no immediate answers, meaning we’re punishing ourselves needlessly.
Most of our workload goes via eosd, on Citrine 4.2.29. We use machine-based sss auth, which is why we’re using eosd.
We have a cluster of Minio servers that talk to the MGM over several protocols: HTTP against /proc/user for fileinfo and some other queries, xrdcp for pushing large files into the system, and WebDAV PUTs for pushing small files in.
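For context, the fileinfo traffic is just query strings against the MGM's /proc/user interface. A rough sketch of the kind of URL involved (the endpoint and exact parameters here are illustrative, not a dump of what Minio actually sends):

```python
# Illustrative construction of a /proc/user fileinfo query URL.
# The MGM host/port and the exact parameter set are assumptions for
# illustration -- not captured from our actual S3 gateway traffic.
from urllib.parse import urlencode

MGM_URL = "http://mgm.example.org:8000"  # hypothetical MGM endpoint

def fileinfo_url(path: str) -> str:
    """Build a /proc/user fileinfo query URL (parameters illustrative)."""
    query = urlencode({"mgm.cmd": "fileinfo", "mgm.path": path})
    return f"{MGM_URL}/proc/user/?{query}"

print(fileinfo_url("/eos/ourinstance/somefile"))
```

During a spike we see bursts of requests of this shape, one per object the S3 clients touch.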
We occasionally get request spikes from S3 clients, mostly tied to fileinfo queries against /proc/user, and the eosd clients then fail to operate in a timely manner. The S3 traffic keeps hammering away, the mix of xrootd/HTTP keeps working fine, but every eosd client stalls or becomes incredibly slow. IO backs up, and eventually we have a service failure.
As our ownCloud service talks to the MGM via eosd, the 30-second response timeout is eventually hit, and we end up with a thundering herd of reconnecting clients and the same requests piling up.
If we kill off all our S3 clients, things return to normal. Interaction with the eos cli is acceptable.
There are no obvious reports in the EOS MGM logs, even at the debug level, and the MQ is not reporting anything out of the ordinary. eosd logs aren’t helpful at all either.
What it looks like is that the xrootd/HTTP queries are somehow stealing all the queue slots, starving the eosd clients of any ability to interact.
Throughout the service degradation, we can perform IO to the system, it’s just “slow”.
The difficulty is that we have been unable to isolate exactly which set of commands is either locking, or consuming the whole queue. We do know S3 clients can outperform ownCloud or ownCloud-WebDAV clients by an order of magnitude. We’ve tried throttling the number of concurrent connections an S3 client can make, but we’ve had limited success in actually stopping a service DoS.
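To be concrete about what "throttling" means on our side: it amounts to capping in-flight requests from the client side, roughly like the sketch below (the class and limit are hypothetical, not our production code):

```python
# Minimal sketch of client-side throttling: cap the number of requests
# that can be in flight at once with a semaphore. The wrapper and the
# limit of 4 are hypothetical illustrations, not our actual tooling.
import threading

class ConcurrencyThrottle:
    def __init__(self, max_in_flight: int):
        self._sem = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest observed concurrency, for verification

    def run(self, fn, *args):
        with self._sem:  # blocks once max_in_flight requests are active
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return fn(*args)
            finally:
                with self._lock:
                    self._active -= 1

throttle = ConcurrencyThrottle(max_in_flight=4)
```

Even with this kind of cap in place, enough parallel S3 clients in aggregate still manage to degrade the MGM.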
One correlation we’ve identified: the larger our metadata has grown (it hasn’t been compacted), the lower the workload threshold needed to trigger a failure.
So, hat in hand, I come to you with an incomplete-bug-report-we-don’t-exactly-know-how-to-reproduce issue.
This leads me to a few questions:
- We see a lot of fixes in later Citrine versions to do with lock contention; is there a version later than 4.2.29 we should be running?
- We hypothesise there’s a relationship between the size of the un-compacted MGM metadata and performance. Is this plausible?
- mad1sum posted earlier about queue depths, and given the behaviour of 0mq, is it possible we’re losing queries off the queue?
- Assuming the issue really is the fileinfo queries on /proc/user, would pointing our S3 instances at a slave MGM be a reasonable way to contain the blast radius?
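On the 0mq question: ZeroMQ sockets do have a high-water mark, and for some socket types (PUB, notably) messages past it are dropped rather than queued, so silent loss under a spike is at least plausible. As a stdlib analogy of that failure mode (this is an illustration, not the actual EOS MQ code path):

```python
# Stdlib analogy for high-water-mark behaviour: a bounded queue that
# drops new messages once full, the way a 0mq PUB socket sheds messages
# past its HWM. Illustration only -- not the EOS MQ implementation.
from queue import Queue, Full

def offer(q: Queue, msg) -> bool:
    """Try to enqueue; drop (return False) if the queue is at capacity."""
    try:
        q.put_nowait(msg)
        return True
    except Full:
        return False

hwm = Queue(maxsize=3)  # a tiny high-water mark, for illustration
results = [offer(hwm, i) for i in range(5)]
dropped = results.count(False)  # messages silently lost past the HWM
```

If something like this is happening inside the MQ during a spike, it would explain queries vanishing without any corresponding log entry.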