EOS File Scheduling

Can one of the EOS team please do a quick rundown of how the EOS scheduler works or point to some documentation or slides with that information?

I’m looking to have three particular questions answered, but a general overview would be good to have. I saw the presentation at the workshop that did a brief overview, but that seemed to cover file placement more than anything.

My questions in this case are: How would EOS handle repeated (say 1,000 - 10,000) accesses of a single file (with one or more replicas) via xrdcp? Would aquamarine and citrine behave differently in this case? Are there configuration settings that affect how the scheduler works in a case like this?


Hi Dan,
sorry for the late reply,

The new Citrine scheduler also includes a penalties subsystem. It applies a penalty to a filesystem (FS) for every access or placement, to avoid saturating an FS during bursts of accesses or file placements like the one you described.

You can configure the penalties system with the parameter penaltyUpdateRate; the values for the access penalties are configured via accessDlScorePenalty and accessUlScorePenalty.

There is also a parameter skipSaturatedAccess: if 0, the optimal FS is selected for access regardless of whether it is IO-saturated; if 1, the scheduler first tries to find an IO-unsaturated FS and then falls back to a saturated one.
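To make the penalty idea concrete, here is a toy model in Python. It is not the EOS implementation: the class, the 0 to 100 score scale, the starting scores and the penalty value are assumptions made up for illustration, and only the name accessDlScorePenalty is taken from the parameters mentioned above. skipSaturatedAccess and the periodic refresh of scores are not modelled.

```python
from dataclasses import dataclass


@dataclass
class FileSystem:
    fsid: int
    dl_score: float  # hypothetical download score, 0..100, higher is better


def pick_for_read(replicas, access_dl_score_penalty=10.0):
    """Pick the replica holder with the best score, then penalise it.

    Toy model only: the penalty makes the chosen filesystem less attractive
    for the next request, so a burst of reads alternates between replicas.
    """
    chosen = max(replicas, key=lambda fs: fs.dl_score)
    chosen.dl_score = max(0.0, chosen.dl_score - access_dl_score_penalty)
    return chosen


# One file with two replicas, read six times in a row.
replicas = [FileSystem(1, 90.0), FileSystem(2, 85.0)]
picks = [pick_for_read(replicas).fsid for _ in range(6)]
print(picks)  # [1, 2, 1, 2, 1, 2]
```

Without the penalty, every request in the burst would go to filesystem 1, since nothing else changes its score between requests in this sketch.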

the doc is at




One question, again about file scheduling :

Could there be a reason why some scheduling groups are not selected for new files? I just realized that half of our groups are no longer being filled, resulting in quite a large difference in fill rate between our groups, whereas previously (with aquamarine) the balance was very good.

Hi Frank,
The selection of the scheduling groups for file placement is round-robin. Once a group is selected, the GeoTreeEngine tries to select the best FSs for the placement; if it fails, the next group is selected.
Are all your scheduling groups enabled? If you run this command: geosched show snapshot plct, we can investigate what the issue could be.
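The round-robin selection with fallback described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the cursor handling and the free_slots dictionary (a stand-in for whatever the GeoTreeEngine checks inside a group) are hypothetical and not taken from the EOS code.

```python
def place_file(groups, start, n_replicas, free_slots):
    """Try groups in round-robin order starting at index `start`.

    Returns (chosen_group, next_start). A group is accepted when it has
    at least `n_replicas` free slots; otherwise the next group is tried.
    """
    n = len(groups)
    for offset in range(n):
        idx = (start + offset) % n
        group = groups[idx]
        if free_slots.get(group, 0) >= n_replicas:
            return group, (idx + 1) % n
    return None, start  # no group could take the file


groups = ["default.0", "default.1", "default.2"]
# Hypothetical state: default.1 cannot host two replicas.
free_slots = {"default.0": 17, "default.1": 1, "default.2": 17}

cursor = 0
chosen = []
for _ in range(4):
    group, cursor = place_file(groups, cursor, n_replicas=2, free_slots=free_slots)
    chosen.append(group)
print(chosen)  # ['default.0', 'default.2', 'default.0', 'default.2']
```

In this model a group is skipped only when placement inside it fails, which is why a failure message from the per-group step is the first thing to look for in the logs.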

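Sorry, somehow one parameter was not printed.
The command is geosched show snapshot scheduling_group_not_picked_up plct, where scheduling_group_not_picked_up stands for the name of a group that is not being selected.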

Dear Andrea,

Thank you for your answer.
Yes, all 24 groups are enabled, as far as I know (status is on in the group ls output).

I still need to learn how to deal with this new GeoTreeEngine, so this case will probably help.

This command returns empty output. Would it help to send the full output of the snapshot command? It is quite large, and at first look I don’t see any obvious difference between the groups that are being filled and the ones that are not.

As an additional observation, I can say that we see many messages of this kind in the log :

source=GeoTreeEngine:1374             tident=<service> sec=      uid=0 gid=0 name= geo="" Last fs update for fs 332 is older than older penalty : it could happen as a transition but should not happen permanently.

Maybe it could be connected ?

eos version used for mgm is 4.2.12

Here are the first lines of command geosched show snapshot default.X plct for two groups.

This one is being filled up :

### scheduling snapshot for scheduling group default.0 and operation 'Placement' :
--------default.0/( free:17|repl:0|pidx:0|status:OK|ulSc:9|dlSc:16|filR:0|totS:1.1991e+13)
       `----------JRC/( free:17|repl:0|pidx:0|status:OK|ulSc:9|dlSc:16|filR:0|totS:1.1991e+13)
                 `----------DC1/( free:17|repl:0|pidx:16|status:OK|ulSc:9|dlSc:16|filR:0|totS:1.1991e+13)

This one not :

### scheduling snapshot for scheduling group default.1 and operation 'Placement' :
--------default.1/( free:17|repl:0|pidx:0|status:OK|ulSc:79|dlSc:96|filR:0|totS:2.74565e+13)
       `----------JRC/( free:17|repl:0|pidx:0|status:OK|ulSc:79|dlSc:96|filR:0|totS:2.74565e+13)
                 `----------DC1/( free:17|repl:0|pidx:6|status:OK|ulSc:79|dlSc:96|filR:0|totS:2.74565e+13)

I also noted that for all other operations except plctdrain, the free value is 0 for all groups…

Another observation : in geosched show snapshot, all disks have status:DinRW, but in geosched show tree, I see [1,1,DinRO] (or [1,1,UnvDinRO] for some of them, all on the same FSTs, it seems)

Hi Frank
from the output of the scheduling group default.1 I also see no problem (there are 17 free slots), so it should not be a problem placing files there.
I don’t understand what you mean by “I also noted that for all other operations except plctdrain, the free value is 0 for all groups”, because you pasted 2 groups with free = 17 for the plct operation, so I don’t see 0 there.

Regarding status:DinRW, this means that the FS is in RW and can accept files as a drain target. The difference between trees and snapshots is not an issue: the trees are not updated frequently, only when FSs are removed, added, or moved to a different group, and the GeoTreeEngine uses the snapshot tree to make decisions.

Would it be possible for you to enable debug logs on the MGM for a short while, check whether this kind of message appears:

“could not place all replica(s) for file in subgroup group”

and send us a portion of the log around it?


Yes, I didn’t paste these so as not to overwhelm too much, but this is what I mean :

# eos geosched show snapshot default.0 accsrw                                                                                                                   
### scheduling snapshot for scheduling group default.0 and operation 'Access RW' :
--------default.0/( free:0|repl:0|pidx:0|status:OK|ulSc:10|dlSc:25|filR:0|totS:1.18829e+13)
       `----------JRC/( free:0|repl:0|pidx:0|status:OK|ulSc:10|dlSc:25|filR:0|totS:1.18829e+13)
                 `----------DC1/( free:0|repl:0|pidx:16|status:OK|ulSc:10|dlSc:25|filR:0|totS:1.18829e+13)

See free:0. It is the same for all subgroups and all operations except plct and plctdrain. It might be normal.

OK, I did that, but I can’t see any of these messages. However, I can confirm that some subgroups never appear in the logs for placement: during 30 seconds with debug level active, I counted 480 placements (occurrences of “placing replicas for /path/ in subgroup default.x” in the log), but none of them involve the groups that are not being filled.
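A count like the one above can be automated with a few lines of Python. This is a rough sketch: the regular expression is built from the message wording quoted in this thread, and the sample lines are synthetic stand-ins, not real MGM log output, so the pattern may need adjusting to the actual log format.

```python
import re
from collections import Counter

# Pattern derived from the message wording quoted in the thread (assumption).
PLACEMENT = re.compile(r"placing replicas for \S+ in subgroup (\S+)")


def count_placements(log_lines):
    """Count placement messages per scheduling group."""
    counts = Counter()
    for line in log_lines:
        match = PLACEMENT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def never_selected(counts, all_groups):
    """Return the groups that do not appear in any placement message."""
    return [group for group in all_groups if counts[group] == 0]


# Synthetic sample lines, for illustration only.
sample = [
    "placing replicas for /eos/test/a in subgroup default.0",
    "placing replicas for /eos/test/b in subgroup default.2",
    "placing replicas for /eos/test/c in subgroup default.0",
    "some unrelated debug message",
]
counts = count_placements(sample)
missing = never_selected(counts, ["default.0", "default.1", "default.2"])
print(dict(counts), missing)
```

On a real log the lines would come from the MGM log file, for example by iterating over an open file object instead of the sample list.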

Hi Frank,
The free slots are reported only for placement operations, and at the moment only plct and plctdrain use the GeoTreeEngine, so this is normal (previously you reported that only plctdrain had free slots, which sounded strange to me).

The fact that you don’t see any error from the GeoTreeEngine means that some scheduling groups are not even selected when trying the placement… I will check that part of the code and get back to you.

Hi again
Which file layout are you using?
Which groups are not selected? Do you see a pattern?

No, I didn’t observe any obvious pattern. You’ll see below the group ls output. All groups with ~77% avg(filled) are not selected. The ones selected are around 88%.

│type      │            name│      status│ N(fs)│ dev(filled)│ avg(filled)│ sig(filled)│ balancing│   bal-shd│ drain-shd│
 groupview         default.0           on     17         6.64        88.78         4.07       idle          0          0 
 groupview         default.1           on     17        32.98        77.64        14.62       idle          0          0 
 groupview        default.10           on     17        31.85        77.45        14.35       idle          0          0 
 groupview        default.11           on     17         6.25        88.67         3.95       idle          0          0 
 groupview        default.12           on     17         4.60        87.45         3.39       idle          0          0 
 groupview        default.13           on     17        31.53        77.34        14.27       idle          0          0 
 groupview        default.14           on     17         6.05        88.30         3.63       idle          0          0 
 groupview        default.15           on     17        32.29        77.98        14.05       idle          0          0 
 groupview        default.16           on     17         5.72        88.02         3.34       idle          0          0 
 groupview        default.17           on     17         5.66        88.12         4.09       idle          0          0 
 groupview        default.18           on     17         5.81        88.13         3.55       idle          0          0 
 groupview        default.19           on     17        30.65        76.90        14.37       idle          0          0 
 groupview         default.2           on     17         4.58        87.14         2.65       idle          0          0 
 groupview        default.20           on     17         5.31        87.66         3.44       idle          0          0 
 groupview        default.21           on     17        31.50        76.78        14.21       idle          0          0 
 groupview        default.22           on     17         5.93        88.15         3.52       idle          0          0 
 groupview        default.23           on     17        31.15        76.58        14.26       idle          0          0 
 groupview         default.3           on     17        30.73        77.49        14.21       idle          0          0 
 groupview         default.4           on     17        30.36        77.10        14.10       idle          0          0 
 groupview         default.5           on     17         7.36        89.44         4.59       idle          0          0 
 groupview         default.6           on     17         9.36        89.77         3.93       idle          0          0 
 groupview         default.7           on     17        30.74        76.53        14.14       idle          0          0 
 groupview         default.8           on     17        31.25        77.28        14.11       idle          0          0 
 groupview         default.9           on     17        30.15        77.71        13.31       idle          0          0 

That makes
selected : 0, 2, 5, 6, 11, 12, 14, 16, 17, 18, 20, 22
not selected : 1, 3, 4, 7, 8, 9, 10, 13, 15, 19, 21, 23
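The split above can be derived mechanically from the avg(filled) column. The sketch below uses four rows copied from the group ls output above; the 83% threshold is an assumption, chosen only because it lies between the two clusters (around 77% and around 88%).

```python
# Four rows copied from the group ls output above (whitespace normalised).
ROWS = """\
groupview default.0 on 17 6.64 88.78 4.07 idle 0 0
groupview default.1 on 17 32.98 77.64 14.62 idle 0 0
groupview default.2 on 17 4.58 87.14 2.65 idle 0 0
groupview default.3 on 17 30.73 77.49 14.21 idle 0 0
"""


def split_by_fill(text, threshold=83.0):
    """Split group indices by avg(filled), the sixth column of each row."""
    fuller, emptier = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] != "groupview":
            continue  # skip headers and anything that is not a group row
        index = int(fields[1].rsplit(".", 1)[1])
        avg_filled = float(fields[5])
        (fuller if avg_filled >= threshold else emptier).append(index)
    return sorted(fuller), sorted(emptier)


fuller, emptier = split_by_fill(ROWS)
print("selected (around 88%):", fuller)       # [0, 2]
print("not selected (around 77%):", emptier)  # [1, 3]
```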

Layout used is replica 2.

could you also send us the portion of debug logs from the MGM that you have?

Sure, I sent you a PM.

I just thought of something. Does filR:0 mean that the scheduler thinks the group is not filled? All the groups report this 0 value…

If there is a penalty for more filled groups, would that mean that no penalty is applied, and it always selects the same groups?

The group scheduling does not take into account the info from the GeoTreeEngine, so that value is not filled at group level but only at FS level.
The issue happens before the GeoTreeEngine comes into play, so any information coming from it is not relevant.

It appears that when the status of a group is modified, the list of groups selected when creating files changes, but it then stays the same as long as the groups are not touched. But we can’t disable/enable groups continuously; that would impact production.

So currently, we still have the behaviour that when creating a file, only half of the available groups (12 out of 24) are selected. We work around the problem by activating group balancing, which prevents half of the groups from filling up too fast.

Hi Frank,
I’m going to add more debugging logs in the next release so we can hopefully understand what’s happening on your instance.

Hi Andrea,

Maybe I found a way to reproduce our issue, because I could observe the same behaviour on our test instance. And the result changes depending on whether we have an odd or even number of groups.
The instance initially had 3 groups, without this problem. Then I added groups up to 10, and I started to observe it. I added another one, and it started to behave correctly again.

The test I do is this one, on a fuse mount :

# create 400 files
/eos/instance/path $ for i in {000..399}; do echo $i > file-$i; done
# check where they were placed using file info, and count them (total is twice number of files because of replicas)
/eos/instance/path $ for i in file-*; do eos file info /eos/instance/path/$i; done | grep default | awk '{print $4}' |sort | uniq -c

Result is this when 11 groups are on (correctly spread) :

     72 default.0
     72 default.1
     72 default.10
     74 default.2
     74 default.3
     72 default.4
     72 default.5
     72 default.6
     74 default.7
     72 default.8
     74 default.9

And when I disable or remove a group (here I set eos group set default.0 off), I get this (you first need to remove the previously created files, or work in another folder) :

    160 default.1
    160 default.10
    160 default.4
    160 default.6
    160 default.8

incorrectly spread, only 5 groups selected.
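The two results can be compared with a short script. The counts are copied from the two listings above; the summarize helper and its output format are just one hypothetical way to present them.

```python
# Replica counts per group, copied from the two listings above.
ELEVEN_ON = {
    "default.0": 72, "default.1": 72, "default.10": 72, "default.2": 74,
    "default.3": 74, "default.4": 72, "default.5": 72, "default.6": 72,
    "default.7": 74, "default.8": 72, "default.9": 74,
}
TEN_ON = {
    "default.1": 160, "default.10": 160, "default.4": 160,
    "default.6": 160, "default.8": 160,
}


def summarize(counts, enabled):
    """Return (total replicas, number of groups used, unused enabled groups)."""
    unused = sorted(
        (group for group in enabled if counts.get(group, 0) == 0),
        key=lambda group: int(group.split(".")[1]),
    )
    return sum(counts.values()), len(enabled) - len(unused), unused


all_groups = [f"default.{i}" for i in range(11)]
eleven = summarize(ELEVEN_ON, all_groups)
ten = summarize(TEN_ON, all_groups[1:])  # default.0 switched off
print(eleven)
print(ten)
```

Both runs account for all 800 replicas (400 files with 2 replicas each); with default.0 off, groups 2, 3, 5, 7 and 9 receive nothing.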

But if I disable more groups, even if I have an odd number of groups activated, it still selects only half the available groups.

Can you also observe this on your side ?

Server version is 4.2.20.

Hi Frank,
thanks a lot for your testing!
Actually, now that I remember, I already opened a ticket quite some time ago because the scheduler is broken when there are only 2 scheduling groups (https://its.cern.ch/jira/projects/EOS/issues/EOS-1882), but as you showed, this is more general and applies to an even number of groups!
It must be that in all our installations at CERN we have an odd number of groups (for sure also in my testing installation, as I have 3…). Now that you have found a way to reproduce it, I’m sure we can fix it soon.
thanks a lot