Unusual DoS-like events when mixing xrdcp, HTTP and eosd workloads

We’ve been dealing with this one for a while now and have no immediate answers, which means we’re punishing ourselves needlessly.

Most of our workload is via eosd, on Citrine 4.2.29. We use machine-based sss auth, which is why we’re using eosd.

We have a cluster of Minio servers that talk to the MGM over several protocols: HTTP against /proc/user for fileinfo and other queries, xrdcp for pushing large files into the system, and WebDAV PUTs for pushing small files in.
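
For context, the fileinfo lookups are plain /proc/user requests of roughly this shape (host, port and path below are placeholders, and the exact opaque parameters our gateway sends may differ slightly):

# illustrative only: a fileinfo query against the MGM proc interface over HTTP
curl -s "http://mgm.example.org:8000/proc/user/?mgm.cmd=fileinfo&mgm.path=/eos/ourinstance/some/file"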

We occasionally get request spikes from the S3 clients, mostly related to fileinfo queries against /proc/user, and the eosd clients then fail to respond in a timely manner. The S3 traffic keeps hammering away and the xrootd/HTTP mix continues to work fine, but every eosd client stalls or becomes incredibly slow. IO backs up and eventually we have a service failure.

Since our ownCloud service talks to the MGM via eosd, the 30-second response timeout is eventually hit, and we end up with a thundering herd of reconnecting clients and the same requests piling up.

If we kill off all our S3 clients, things return to normal. Interaction with the eos CLI remains acceptable throughout.

There is nothing obvious in the EOS MGM logs, even at debug level, and the MQ is not reporting anything out of the ordinary. The eosd logs aren’t helpful either.

It looks like the xrootd/HTTP queries are somehow taking all the queue slots, starving the eosd clients of any chance to interact.

Throughout the service degradation we can still perform IO against the system; it’s just “slow”.

The difficulty is that we have been unable to isolate exactly which set of commands is either holding locks or consuming the whole queue. We do know that S3 clients can outperform ownCloud or ownCloud-WebDAV clients by an order of magnitude. We’ve tried throttling the number of concurrent connections an S3 client can open, but with limited success in actually preventing a service DoS.

One correlation we’ve identified: the larger our metadata has grown (because it isn’t being compacted), the lower the workload threshold needed to cause a failure.
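
For reference, this is roughly how we check the namespace size and trigger a compaction; the exact eos ns compact arguments vary between versions, so treat this as a sketch rather than our literal invocation:

# show namespace statistics, including the size of the changelog
eos ns
# schedule an online compaction of the in-memory namespace changelog
# (assumed syntax: delay in seconds before the compaction starts)
eos ns compact on 60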

So, hat in hand, I come to you with an incomplete-bug-report-we-don’t-exactly-know-how-to-reproduce issue :slight_smile:

This leads me to a few questions:

  1. We see a lot of fixes in later Citrine versions relating to lock contention; is there a version later than 4.2.29 we should be using?
  2. We hypothesise there is a relationship between the size of the un-compacted MGM metadata and performance. Is this possible?
  3. mad1sum posted earlier about queue depths; given the behaviour of 0MQ, is it possible we’re losing queries off the queue?
  4. Assuming the issue really is related to fileinfo queries on /proc/user, would pointing our S3 instances at a slave MGM help contain the blast radius?

Hi David,

your problem still comes from lock contention caused by the S3 activity.

The namespace has per-call millisecond counters, and the answer to this kind of problem can always be found there.

When you are in the degraded situation, run:
eos ns stat --reset

and then look at the counters again after a few seconds:

eos ns stat

The exec(ms) column is probably in the hundreds or thousands of milliseconds somewhere … if you get me these numbers, we can probably figure out the bottleneck.
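
Put together as one quick check (the 30-second sample window and the grep on the fileinfo counter are just suggestions, adjust them to your case):

# reset the namespace counters while the instance is degraded
eos ns stat --reset
# let the counters accumulate for a short while
sleep 30
# look for commands with a high exec(ms) value, e.g. the fileinfo calls
eos ns stat | grep -i fileinfo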

If I remember right, I made you the 4.2.29 tag with some improvements in the listing locking (fileinfo --json calls).

We now have 4.3.12 in production for the LHC instances and 4.4.4 in the HOME instances.

For your questions:

  1. I will discuss this with Elvin and add the answer here.
  2. There is some memory fragmentation if your MGM runs for a long time … there could be some slowdown over time.
  3. 0MQ is not involved here.
  4. You can certainly send the S3 queries to the SLAVE; that would resolve the problem for all the other clients.

Cheers Andreas.

Hi David,

checking your MGM configuration, I noticed that your thread pool is relatively small for the load you might experience:

xrd.sched mint 8 maxt 256 idle 64

In our CERNBox instance we have the following:

xrd.sched mint 64 maxt 4096 idle 300

This might explain why the HTTP traffic (which uses a different thread pool) continues working while the rest remains clogged.
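
If you want to try it, the directive goes into the MGM xrootd configuration and requires an MGM restart; the file path and service name below assume the standard EOS packaging:

# /etc/xrd.cf.mgm (assumed default location of the MGM configuration)
xrd.sched mint 64 maxt 4096 idle 300

# restart the MGM to pick up the change (assumed systemd unit name)
systemctl restart eos@mgm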

Cheers,
Luca

Thank you both @apeters and @luca.mascetti. We’ll change the xrd.sched settings in our test environment and see if we can break it the same way.

We’ll also take a closer look at the other elements and see if we can confirm anything.

Is the system behaving better with these changes?

@luca.mascetti, yes it is, thank you. We’ve had no repeats of the same issues since changing the xrd.sched settings.

I think our big question at the moment is why an HTTP call can consistently beat an xrootd call to a place in the queue. It’s not of major consequence, but it would be good to know the answer so we can accommodate it in the future.

They use different thread pools; probably your standard xrootd thread pool is busier than the HTTP one.

I explained this in a direct conversation with Michael. The HTTP thread pool is much smaller than the XROOT thread pool. The performance share comes from the lock time of the namespace mutex, and that depends on which calls get executed via HTTP or XROOTD. It is not only the number of threads that is relevant.