I would like to open a discussion about the xsmap files. I understood that they have been deprecated for replica files, and indeed we had a period working with aquamarine in which they completely disappeared. However, they started to come back at some point, and after our citrine upgrade they are still being created. They don’t really disturb anything, except that they are not updated when the files are modified, and in this situation balancing or draining operations fail because they still check the new content of the file against the xsmap file that corresponds to the old content.
Does anyone else observe that they come back when balancing files? What could the reason be? Could it be linked to a misconfiguration? Is there a way to get rid of them for replica files only, not for rain files?
What is the exact version of eos that you are currently running?
Can you tell me if for a newly written 2-replica file you still get the block-xs files?
Now, can you pick an existing file that you know has a blockxs file and paste the output of the following commands?
eos file info <file_path> --fullpath
# Go to the first fst and list
ls -lrt <first_fst_physical_path>*
eos file convert --rewrite <file_path>
eos file info <file_path> --fullpath
# Go to the new first fst and list
ls -lrt <first_fst_physical_path>*
After conversion (strange… a first conversion attempt failed with the message [tpc]: [FATAL] Socket error: Connection reset by peer 0 in the xrdlog.mgm) the xsmap files are not there any more:
Oh, this seems to have caused a crash of the FST that had the xsmap file:
pure virtual method called
terminate called without an active exception
error: received signal 6:
/lib64/libXrdEosFst.so(_ZN3eos3fst9XrdFstOfs20xrdfstofs_stacktraceEi+0x49)[0x7fec411e55a9]
/lib64/libc.so.6(+0x35270)[0x7fec45258270]
/lib64/libc.so.6(gsignal+0x37)[0x7fec452581f7]
/lib64/libc.so.6(abort+0x148)[0x7fec452598e8]
/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x165)[0x7fec45b5eac5]
/lib64/libstdc++.so.6(+0x5ea36)[0x7fec45b5ca36]
/lib64/libstdc++.so.6(+0x5ea63)[0x7fec45b5ca63]
/lib64/libstdc++.so.6(+0x5f5cf)[0x7fec45b5d5cf]
/lib64/libXrdServer.so.2(_ZN17XrdXrootdProtocol7fsErrorEicR13XrdOucErrInfoPKcPc+0x2e8)[0x7fec46714c08]
/lib64/libXrdUtils.so.2(_ZN7XrdLink4DoItEv+0x19)[0x7fec46496149]
I saw this “pure virtual function call” mentioned in the changelog of version 4.2.18, so it is probably the same thing?
The FST server kept running for 30 seconds, produced a stack trace, then stopped and restarted by itself, but all its filesystems went through a full boot.
The crash related to “pure virtual method called” is a side effect of a TPC transfer (i.e. the conversion) crashing. This has been recently fixed and can affect any type of TPC transfer:
It’s available in 4.2.18. Sorry for the crash! The booting of the FSes is normal, since the leveldb database was probably not shut down properly.
I will have another look at the balancing in 4.2.12, but could you tell me whether the replica which was balanced, in this particular case the “undeleted” one on FS 135, also has a blockxs file or not? Maybe pick another balancing job and look at the dropped replica …
No need to do anything else, I’ve figured out where the problem comes from. I will soon push a fix for this.
Thanks again for your time in helping me reproduce this.
Thank you for diagnosing this; I wasn’t sure whether it was a problem or not, which is why I preferred to ask before filing an issue. So could it be that you also have this problem on your instances, or is it linked to some specificity of ours?
OK about the crash, we will avoid using TPC until we upgrade. Is it only necessary to upgrade the FSTs for this, or also the MGM?
About the full booting of the FSTs: we observed that LevelDB seems more prone than SQLite was to not being shut down properly. Could that be the case?
Yes, the problem is generic to any EOS instance. I’ve committed a fix for this and this behaviour should go away in 4.2.19. The corresponding commit: https://gitlab.cern.ch/dss/eos/commit/d562fd5c40395ba43762ebfdaf8c64edbaf3f051
The fix for the TPC requires an update of the FSTs.
Whenever the FST crashes with a segmentation fault and the proper shut-down procedure is not followed, this will result in a resync at start-up. It’s just that we’re now more careful than before. It doesn’t mean that the leveldb is actually corrupted; we just want to be on the safe side.
Good, thank you! This fix requires only an update of the MGM, correct?
But even with this fix, the previously created xsmap files will still remain? Would they be ignored even if present (that would be ideal)? Otherwise, is there any way to remove them? Since we have to keep the ones for rain6 files, a mere find over the filesystems doesn’t seem an option. Maybe they could be removed on demand during some verify or resync command? Unless that makes no sense…
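To illustrate the “mere find is not an option” point: what could be spotted locally on an FST is not which layout a file has, but which xsmap files are stale, i.e. whose companion data file was modified after the checksum map was written. Below is a minimal sketch of that idea. The helper name, the demo directory, and the assumption that the map lives next to the data file as `<data_file>.xsmap` are mine, not confirmed by this thread; deciding replica vs. rain would still require asking the namespace.

```shell
# Hypothetical helper: print .xsmap files whose companion data file is newer,
# i.e. block-checksum maps that became stale after a file modification.
list_stale_xsmap() {
    find "$1" -name '*.xsmap' | while read -r xsmap; do
        data="${xsmap%.xsmap}"
        # "-nt" is true when the data file was modified after its xsmap
        if [ -e "$data" ] && [ "$data" -nt "$xsmap" ]; then
            echo "stale: $xsmap"
        fi
    done
}

# Demo on a throw-away directory mimicking an FST mount; on a real FST you
# would point the helper at the filesystem mount point instead.
demo=$(mktemp -d)
touch -d '2020-01-01' "$demo/0001abcd.xsmap"   # old checksum map
touch -d '2021-01-01' "$demo/0001abcd"         # data file modified later
list_stale_xsmap "$demo"
```

This only flags candidates; it deliberately does not delete anything, since the rain-layout maps must be kept.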
OK for the TPC part, and for the full boot. Indeed, the full resync now seems almost automatic if the FST wasn’t shut down correctly. Previously we could avoid a full resync most of the time, even in our cases of sudden reboot of the system. But I take your point: better to be on the safe side, this sounds right.
OK, thank you. So after the upgrade we would no longer need to bother about the xsmap files: no more errors while draining or balancing, and no more scandisk reports of block checksum corruption, right?
I realized that on a separate test instance, running mixed 4.2.12 and 4.2.17, balancing files does not create the xsmap files. How can that be? Is there some specific condition that triggers them? Some historical parameter of the files?
What makes the difference is that setting the default “replica” layout in recent versions no longer sets the sys.blockchecksum attribute. That’s why when you test on a new instance, or in a newly created directory where you do "attr set default=replica", you won’t see the blockxs files being created even when files are drained or balanced.
In your (initial) case, those directories were probably created (a long time ago, in the beryl_aquamarine version) when the command also enforced the sys.blockchecksum extended attribute.
Ah OK, so we could mitigate the problem immediately by changing this attribute on the most used folders, to avoid the creation of block checksums for new files?
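For anyone finding this thread later, the mitigation discussed above could be sketched roughly as follows, assuming the standard `eos attr` CLI syntax. The directory names are made up, and the `run` wrapper with its `DRY_RUN` guard is my own convenience: by default it only prints the commands, so nothing is touched until you unset it on a node with an eos client pointed at your MGM.

```shell
# Dry-run sketch of removing the sys.blockchecksum attribute from the most
# used replica directories (hypothetical paths). DRY_RUN=1 (default) only
# echoes the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "${DRY_RUN}" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

for dir in /eos/myinstance/data /eos/myinstance/users; do
    run eos attr ls "$dir"                    # inspect current attributes first
    run eos attr rm sys.blockchecksum "$dir"  # stop enforcing block checksums here
done
```

Note that, per the discussion above, this only prevents blockxs files for newly written files in those directories; it does not remove the maps that already exist on the FSTs.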