Can one of the EOS team please do a quick rundown of how the EOS scheduler works or point to some documentation or slides with that information?
I'm looking to have three particular questions answered, but a general overview would be good to have. I saw the presentation at the workshop that gave a brief overview, but it seemed to cover file placement more than anything else.
My questions in this case are: How would EOS handle repeated (say 1,000 - 10,000) accesses of a single file (with one or more replicas) via xrdcp? Would Aquamarine and Citrine behave differently in this case? Are there configuration settings that affect how the scheduler works in a case like this?
The new Citrine scheduler also includes a penalties subsystem. It applies penalties to the filesystems (FSs) on every access/placement, so as to avoid saturating an FS during bursts of accesses or file placements like the one you described.
You can configure the penalties system with the penaltyUpdateRate parameter; the values for the access penalties are configured via accessDlScorePenalty and accessUlScorePenalty.
There is also a skipSaturatedAccess parameter: if 0, the optimal FS is selected for access regardless of whether it is IO-saturated; if 1, the engine first tries to find an IO-unsaturated FS and only falls back to saturated ones if none is available.
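As a rough illustration of how such a score/penalty scheme behaves under a burst of accesses (a simplified sketch, not the actual Citrine code; the class, function names and score values are invented for the example, and only the parameter names mirror the configuration knobs mentioned above):

```python
# Simplified sketch of a score/penalty scheme (NOT the actual EOS code).
# Each FS carries upload/download scores; every access subtracts a
# penalty, so a burst of accesses lowers a hot FS's score and steers
# the next accesses to other replicas. In real EOS the scores are also
# refreshed periodically (cf. penaltyUpdateRate).

class FS:
    def __init__(self, name, ul_score=100, dl_score=100):
        self.name = name
        self.ul_score = ul_score
        self.dl_score = dl_score

    def is_saturated(self, threshold=10):
        # "IO-saturated" here just means the score dropped below a floor.
        return self.dl_score < threshold

def apply_access_penalty(fs, dl_penalty=10, ul_penalty=10):
    # Mirrors the idea of accessDlScorePenalty / accessUlScorePenalty.
    fs.dl_score = max(0, fs.dl_score - dl_penalty)
    fs.ul_score = max(0, fs.ul_score - ul_penalty)

def select_for_access(replicas, skip_saturated_access=1):
    # skipSaturatedAccess semantics: with 1, prefer unsaturated FSs and
    # fall back to saturated ones; with 0, always take the best score.
    candidates = replicas
    if skip_saturated_access:
        unsaturated = [fs for fs in replicas if not fs.is_saturated()]
        if unsaturated:
            candidates = unsaturated
    best = max(candidates, key=lambda fs: fs.dl_score)
    apply_access_penalty(best)
    return best

# A burst of repeated accesses alternates between the two replicas,
# because each access penalizes the FS that served it.
replicas = [FS("fs1"), FS("fs2")]
picks = [select_for_access(replicas).name for _ in range(6)]
```

The point of the flag only shows up once every replica's score has fallen below the threshold; until then, penalizing the FS that served each access is what spreads a burst over the replicas.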
Could there be some reason that some scheduling groups are not selected for new files? I just realized that half of our groups are not being filled any more, resulting in quite a large difference in fill rate between our groups, whereas previously (with Aquamarine) the balance was very good.
Hi Frank,
The selection of the scheduling groups for file placement is round-robin. Once a group is selected, the GeoTreeEngine tries to select the best FSs for the placement; if it fails, the next group is selected.
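The flow described above can be sketched like this (a minimal illustration of the described behaviour, not the real implementation; the function name, the `can_place` predicate, and the full-group setup are invented for the example):

```python
# Minimal sketch of round-robin group selection with fallback: groups
# are tried in rotation, and if the chosen group cannot host the file,
# the next group is tried.

def place_file(groups, start_index, can_place):
    """Try groups round-robin from start_index; can_place stands in for
    the GeoTreeEngine finding suitable FSs in the group."""
    n = len(groups)
    for offset in range(n):
        idx = (start_index + offset) % n
        if can_place(groups[idx]):
            # Advance the cursor past the chosen group for the next file.
            return groups[idx], (idx + 1) % n
    raise RuntimeError("could not place file in any scheduling group")

groups = ["default.0", "default.1", "default.2"]
full = {"default.1"}          # pretend this group has no usable FSs
cursor, placed = 0, []
for _ in range(4):
    group, cursor = place_file(groups, cursor, lambda g: g not in full)
    placed.append(group)
# placements keep rotating over the remaining groups
```

In this model a group is only ever skipped when the placement inside it fails, which is why a group that is silently never selected points at a problem before the per-group FS selection.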
Are all your scheduling groups enabled? If you run this command: geosched show snapshot plct, we can investigate what the issue could be.
cheers
Andrea
Thank you for your answer.
Yes, all 24 groups are enabled as far as I know (status is on in the group ls output).
I still need to learn how to deal with this new GeoTreeEngine, so this case will probably help.
This command returns empty output. Could it help to send the full output of the snapshot command? It is quite large, and at first glance I don't see any obvious difference between the groups that are filled up and the ones that are not.
As an additional observation, I can say that we see many messages of this kind in the log:
source=GeoTreeEngine:1374 tident=<service> sec= uid=0 gid=0 name= geo="" Last fs update for fs 332 is older than older penalty : it could happen as a transition but should not happen permanently.
Here are the first lines of the command geosched show snapshot default.X plct for two groups.
This one is being filled up:
### scheduling snapshot for scheduling group default.0 and operation 'Placement' :
--------default.0/( free:17|repl:0|pidx:0|status:OK|ulSc:9|dlSc:16|filR:0|totS:1.1991e+13)
`----------JRC/( free:17|repl:0|pidx:0|status:OK|ulSc:9|dlSc:16|filR:0|totS:1.1991e+13)
`----------DC1/( free:17|repl:0|pidx:16|status:OK|ulSc:9|dlSc:16|filR:0|totS:1.1991e+13)
This one is not:
### scheduling snapshot for scheduling group default.1 and operation 'Placement' :
--------default.1/( free:17|repl:0|pidx:0|status:OK|ulSc:79|dlSc:96|filR:0|totS:2.74565e+13)
`----------JRC/( free:17|repl:0|pidx:0|status:OK|ulSc:79|dlSc:96|filR:0|totS:2.74565e+13)
`----------DC1/( free:17|repl:0|pidx:6|status:OK|ulSc:79|dlSc:96|filR:0|totS:2.74565e+13)
I also noted that for all other operations except plctdrain, the free value is 0 for all groups…
Another observation: in geosched show snapshot, all disks have status:DinRW, but in geosched show tree, I see [1,1,DinRO] (or [1,1,UnvDinRO] for some of them, all on the same FSTs, it seems).
Hi Frank
From the output of the scheduling group default.1 I also see no problem (there are 17 free slots), so placing files there should not be an issue.
I don't understand what you mean by "I also noted that for all other operations except plctdrain, the free value is 0 for all groups", because you pasted two groups with free = 17 for the plct operation, so I don't see 0 there.
Regarding status:DinRW: this means the FS is in RW and can accept files to drain. The difference between trees and snapshots is not an issue: the trees are updated infrequently, only when FSs are removed/added or moved to a different group, and the GeoTreeEngine uses the snapshot tree to take its decisions.
Would it be possible for you to enable debug logs on the MGM just for a while, to see whether you get this kind of log:
"could not place all replica(s) for file in subgroup group", and send us a portion around it?
Yes, I didn't paste these so as not to overwhelm too much, but this is what I mean:
# eos geosched show snapshot default.0 accsrw
### scheduling snapshot for scheduling group default.0 and operation 'Access RW' :
--------default.0/( free:0|repl:0|pidx:0|status:OK|ulSc:10|dlSc:25|filR:0|totS:1.18829e+13)
`----------JRC/( free:0|repl:0|pidx:0|status:OK|ulSc:10|dlSc:25|filR:0|totS:1.18829e+13)
`----------DC1/( free:0|repl:0|pidx:16|status:OK|ulSc:10|dlSc:25|filR:0|totS:1.18829e+13)
See free:0. It is the same for all subgroups and all operations, except plct and plctdrain. It might be normal.
OK, I did that, but I can't see any of these messages. However, I can confirm that some subgroups are never considered for placement in the logs: during 30 seconds at debug level, I counted 480 placements ("placing replicas for /path/ in subgroup default.x" occurrences in the log), but none of them with the groups that are not filled up.
Hi Frank,
The free slots are reported only for placement operations, and at the moment only plct and plctdrain use the GeoTreeEngine, so this is normal (previously you reported that only plctdrain had free slots, which sounded strange to me).
The fact that you don't see any error from the GeoTreeEngine means that some scheduling groups are not even selected when trying the placement… I will check that part of the code and I'll get back to you.
cheers
Andrea
No, I didn't observe any obvious pattern. You'll see the group ls output below. All groups with ~77% avg(filled) are not selected; the ones selected are around 88%.
The scheduling of the group does not take into account the info from the GeoTreeEngine, so that value is not filled at group level, only at FS level.
The issue is happening before the GeoTreeEngine comes into play, so any information coming from it is not relevant.
It appears that when the group statuses are modified, the list of groups selected when creating files changes, but then it stays the same as long as the groups are not touched. We can't disable/enable groups continuously, though; that would impact production.
So currently we still have the behaviour that, when creating a file, only half of the available groups (12 out of 24) are selected. We work around the problem by activating group balancing, which avoids having half of the groups fill up too fast.
Maybe I have found a way to reproduce our issue, because I could observe the same behaviour on our test instance. The result changes depending on whether we have an odd or even number of groups.
The instance initially had 3 groups, without this problem. Then I added groups up to 10 and started to observe it. I added one more, and it started to behave correctly again.
The test I do is this one, on a FUSE mount:
# create 400 files
/eos/instance/path $ for i in {000..399}; do echo $i > file-$i; done
# check where they were placed using file info, and count them (total is twice the number of files because of replicas)
/eos/instance/path $ for i in file-*; do eos file info /eos/instance/path/$i; done | grep default | awk '{print $4}' | sort | uniq -c
The result looks like this when 11 groups are on (correctly spread):
And when I disable or remove a group (here I set eos group set default.0 off), I get this (you first need to remove the previously created files, or work in another folder):
Hi Frank,
thanks a lot for your testing!
Actually, now that I remember, I already opened a ticket quite some time ago because the scheduler is broken when there are only 2 scheduling groups (https://its.cern.ch/jira/projects/EOS/issues/EOS-1882), but as you showed, this is more general: it applies to any even number of groups!
It must be that at CERN all our installations have an odd number of groups (for sure also in my test installation, as I have 3…). Now that you have found a way to reproduce it, I'm sure we can fix it soon.
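For the record, here is one way such a parity-dependent bug can arise (purely a hypothetical illustration, not necessarily the actual EOS-1882 mechanism): if a cyclic selector advances its cursor with an effective stride that shares a factor with the group count, it only ever visits n // gcd(stride, n) of the n groups. With a stride of 2, an odd group count is fully covered, while an even one leaves half the groups permanently unselected:

```python
# Hypothetical illustration of a parity bug: a cyclic selector whose
# cursor advances by a fixed stride visits only n // gcd(stride, n)
# of the n groups, no matter how many placements are made.

from math import gcd

def visited_groups(n_groups, stride=2):
    cursor, seen = 0, set()
    for _ in range(10 * n_groups):      # far more picks than groups
        seen.add(cursor)
        cursor = (cursor + stride) % n_groups
    return seen

def coverage(n_groups, stride=2):
    return n_groups // gcd(stride, n_groups)

odd = visited_groups(11)    # stride 2, 11 groups: all groups visited
even = visited_groups(10)   # stride 2, 10 groups: only half visited
```

This matches the symptom in the thread: with 11 groups everything is covered, with 10 (or 24) exactly half the groups never receive a file, and the affected half stays stable until the group configuration is touched.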
thanks a lot
Andrea