Eos lost data

miramir · July 4, 2022, 7:00am

Hi,
We have an EOS cluster with QuarkDB. After the MGM restarted, files disappeared from the /eos/ directory tree.
The files are still visible when queried by file ID:

eos-m01:~ # eos fileinfo fxid:0b19cd0b
  File: '/eos/scg/1_Хранилище (создавать всё внутри)/1_ИСХОДНИКИ/3_СПЕЦПРОЕКТЫ/2021/Марафон по школам Дубны 2021/Марафон/исх/13 12 2021 ОИЯИ в лицее Кадышевского/камера Женя/-/C0015.MP4'  Flags: 0764  Clock: 16fd7c066eeca3a9
  Size: 138418448
Modify: Mon Dec 13 06:50:38 2021 Timestamp: 1639367438.000000000
Change: Wed May 18 15:39:46 2022 Timestamp: 1652877586.331433099
 Birth: Wed May 18 15:39:44 2022 Timestamp: 1652877584.537794598
  CUid: 0 CGid: 860 Fxid: 0b19cd0b Fid: 186240267 Pid: 5498326 Pxid: 0053e5d6
XStype: adler    XS: e9 1d e8 a7    ETAGs: "49993490997706752:e91de8a7"
Layout: replica Stripes: 2 Blocksize: 4k LayoutId: 00100112 Redundancy: d2::t0
  #Rep: 2
┌───┬──────┬────────────────────────┬────────────────┬────────────────┬──────────┬──────────────┬────────────┬────────┬────────────────────────┐
│no.│ fs-id│                    host│      schedgroup│ path│      boot│  configstatus│       drain│  active│ geotag│
└───┴──────┴────────────────────────┴────────────────┴────────────────┴──────────┴──────────────┴────────────┴────────┴────────────────────────┘
 0      197         eos-f049.jinr.ru        default.1           /e/p01 booted             rw      nodrain   online            RU::JINR::LIT
 1      285         eos-f071.jinr.ru        default.1           /e/p01 booted             rw      nodrain   online            RU::JINR::LIT

*******
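
For reference, the Fxid and Pxid fields in the fileinfo output above are the hexadecimal forms of the decimal Fid and Pid. A minimal Python sketch of the conversion follows; the helper names are hypothetical and not part of EOS.

```python
def fxid_to_fid(fxid: str) -> int:
    """Convert a hexadecimal fxid/pxid string to its decimal id."""
    return int(fxid, 16)


def fid_to_fxid(fid: int) -> str:
    """Convert a decimal id to the zero-padded 8-digit hex form shown by fileinfo."""
    return f"{fid:08x}"


print(fxid_to_fid("0b19cd0b"))  # 186240267 (Fid in the output above)
print(fxid_to_fid("0053e5d6"))  # 5498326 (Pid in the output above)
print(fid_to_fxid(186240267))   # 0b19cd0b
```

This is handy when one tool reports hex IDs (eos fileinfo) and another expects decimal ones (eos-ns-inspect --cid).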

But the directory is empty.

We tried to restore it and found that there are two containers for the same directory: an old one (cid 5452838, with data) and a new one (cid 5596529, empty):

eos-b02:/home #  eos-ns-inspect print --members localhost:7777 --cid 5596529
ID: 5596529
Parent ID: 2
Name: scg
uid: 0, gid: 0
ctime: Wed Jun  1 13:28:02 2022 Timestamp: 1654079282.45653590
mtime: Thu Jan  1 03:00:00 1970 Timestamp: 0.0
stime: Thu Jan  1 03:00:00 1970 Timestamp: 0.0
Tree size: 0
Mode: 16877
Flags: 1

eos-b02:/home # eos-ns-inspect print --members localhost:7777 --cid 5452838
ID: 5452838
Parent ID: 2
Name: scg
uid: 0, gid: 860
ctime: Fri May 27 10:43:36 2022 Timestamp: 1653637416.33695994
mtime: Mon May 30 13:07:01 2022 Timestamp: 1653905221.476521648
stime: Thu Jan  1 03:00:00 1970 Timestamp: 0.0
Tree size: 4161269237640
Mode: 16893
Flags: 1
Extended attributes (11):
    sys.eos.btime=1651237100.622317027
    sys.owner.auth=*
    sys.forced.layout=replica
    sys.forced.blocksize=4k
    sys.mask=775
    user.acl=
    sys.forced.nstripes=2
    sys.forced.space=default
    sys.forced.stripes=16
    sys.forced.checksum=adler
    sys.acl=g:860:rwxmqc
Full path: /eos/scg/
------------------------------------------------
FileMap:
------------------------------------------------
ContainerMap:
1_Хранилище (создавать всё внутри): 5476312
2_Файлообмен (забрал и удалил): 5498474
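
The Mode values printed above are decimal. Assuming they follow the usual POSIX st_mode encoding (an assumption, not something stated in the output), they can be decoded with Python's standard stat module. The helper name describe_mode is hypothetical.

```python
import stat


def describe_mode(mode: int) -> str:
    """Render a decimal st_mode value as type, octal permissions and ls-style string."""
    kind = "directory" if stat.S_ISDIR(mode) else "other"
    return f"{kind} {oct(stat.S_IMODE(mode))} {stat.filemode(mode)}"


print(describe_mode(16877))  # directory 0o755 drwxr-xr-x (new empty container 5596529)
print(describe_mode(16893))  # directory 0o775 drwxrwxr-x (old container 5452838)
```

Under that assumption, the two containers differ in permissions as well as in ownership: the old one is group-writable.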

We tried to remove the new empty directory with eos-ns-inspect:

eos-ns-inspect drop-empty-cid --members localhost:7777 --cid  5596529

But the old directory did not reappear.
Has anyone encountered this problem?

miramir · July 5, 2022, 10:08am

I would like to add that the lost directory tree contains 38K files, and we urgently need to restore its contents.
Since the cid of /eos/scg/ is still present in QuarkDB, we hope recovery is possible, but we don't know how to do it.
Any help would be greatly appreciated!

apeters · July 5, 2022, 11:35am

Hi Ivan,

If you run eos-ns-inspect with 'check-orphans', it should print directory 5452838 as orphaned.

There is no tool implemented to attach an orphan using eos-ns-inspect.

One can do this manually with the redis client. I will do a dry run here and then tell you how to do it in a short while …

miramir · July 5, 2022, 12:48pm

Hi Andreas,
I ran

eos-ns-inspect check-orphans --members localhost:7777

and received:

file-id=25452838 invalid-parent-id=0 size=0 locations= unlinked-locations=

Thank you. I’ll be waiting.

apeters · July 5, 2022, 1:05pm

After this command,
did you restart the MGM or run 'eos ns cache drop -d'?

If not, you have to do that!

apeters · July 5, 2022, 1:06pm

I mean, after you dropped the additional /eos/scg/ directory …

miramir · July 5, 2022, 1:37pm

I restarted mgm and the directory /eos/scg appeared, but it is empty:

EOS Console [root://eos.jinr.ru] |/eos/flnp-admin/> fileinfo /eos/scg
  Directory: '/eos/scg'  Treesize: 0
  Container: 0  Files: 0  Flags: 40755
Modify: Thu Jan  1 03:00:00 1970 Timestamp: 0.000000000
Change: Tue Jul  5 16:24:07 2022 Timestamp: 1657027447.639821534
Sync  : Thu Jan  1 03:00:00 1970 Timestamp: 0.000000000
Birth : Thu Jan  1 03:00:00 1970 Timestamp: 0.000000000
  CUid: 0 CGid: 0 Fxid: 005c533d Fid: 6050621 Pid: 2 Pxid: 00000002
  ETAG: 5c533d:0.000

apeters · July 6, 2022, 8:32am

Ok,
sorry for the delay.

What you do now is:

1. Make a backup of QDB

2. eos rmdir /eos/scg/

3. eos-ns-inspect [add connection settings] fix-detached-parent --destination-path /eos/scg --cid 5452838

miramir · July 6, 2022, 11:09am

That’s all right

Make a backup - DONE

I removed /eos/scg with the drop-empty-cid command.
Or should I have done it with rmdir /eos/scg/ instead?

eos-b02:/bkp # eos-ns-inspect fix-detached-parent --members localhost:7777 --destination-path /eos/scg --cid 5452838
Destination path '/eos/scg' does not exist.

miramir · July 7, 2022, 8:24am

I checked everything and found a mistake.
The ID of the directory /eos/scg does not appear in the check-orphans list.
I mixed up 25452838 and 5452838.
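
A mix-up like this is easy to catch by comparing IDs programmatically instead of by eye. The sketch below parses the single check-orphans line quoted earlier in the thread. It assumes only the key=value layout visible in that line; the function name parse_orphan_line is hypothetical.

```python
def parse_orphan_line(line: str) -> dict:
    """Split a 'key=value key=value ...' line into a dict; empty values are kept."""
    fields = {}
    for token in line.split():
        key, _, value = token.partition("=")
        fields[key] = value
    return fields


line = "file-id=25452838 invalid-parent-id=0 size=0 locations= unlinked-locations="
entry = parse_orphan_line(line)

wanted_cid = 5452838  # the container holding the lost data
print(entry["file-id"])                     # 25452838
print(int(entry["file-id"]) == wanted_cid)  # False
```

The reported entry is a file ID, and its number differs from the container ID, so container 5452838 was never listed as an orphan.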

apeters · July 7, 2022, 9:12am

Can you do the following:

mkdir /eos/tmp/

eos-ns-inspect fix-detached-parent --members localhost:7777 --destination-path /eos/tmp --cid 5452838

And then check what is in /eos/tmp/ …

miramir · July 7, 2022, 11:37am

eos-b02:~ # eos-ns-inspect fix-detached-parent --members localhost:7777 --destination-path /eos/tmp --cid 5452838
Finding all parents of Container #5452838...
scg: #5452838 with parent #2
eos: #2 with parent #1
Unable to continue - given container (5452838) looks fine? No changes have been made.

EOS Console [root://localhost] |/eos/tmp/> ls -la
drwxrwxr-+   1 root     root                0 Jul  7 14:33 .
drwxrwxr-+   1 root     root     2645894117260142 Jul  7 14:33 ..

apeters · July 7, 2022, 12:22pm

Argh … ok, this does not do what we want … I need to give you the REDIS commands then …

miramir · July 7, 2022, 12:38pm

Ok. I’m waiting…

apeters · July 7, 2022, 12:46pm

Run these commands:

eos file info /eos/scg

this should show fid:2

redis-cli -p 7777 -h localhost HSET 2:map_conts "1_Хранилище (создавать всё внутри)" 5476312
redis-cli -p 7777 -h localhost HSET 2:map_conts "2_Файлообмен (забрал и удалил)" 5498474

if you get ‘MOVED … ’ run the same commands on that host

drop MGM cache

eos ns cache drop-single-container 2

list /eos/scg

eos ls -la /eos/scg/

Do you see the two directories?

apeters · July 7, 2022, 12:47pm

No wait … that is screwed …

apeters · July 7, 2022, 12:50pm

No, you have to attach directory 5452838 to /eos (2),
so you just do this redis command:

redis-cli -p 7777 -h localhost HSET 2:map_conts scg 5476312

(If you already ran the two commands from before, undo them by running HDEL with only the key, not the number at the end:
HDEL 2:map_conts "1_…"
and
HDEL 2:map_conts "2_…"
)
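
A note on the IDs: the text above names container 5452838 as the one to attach under /eos (2), while the command as posted passes 5476312, which the earlier eos-ns-inspect output lists as a child of 5452838. The sketch below builds the redis-cli command line from its parts so that the ID is stated once and shell quoting is handled automatically. It uses 5452838 on the assumption that the prose reflects the intent. The helper name hset_map_conts is hypothetical, and the host and port are taken from the commands in this thread. Check against your own namespace and back up QDB before running anything.

```python
import shlex


def hset_map_conts(parent_id, name, child_id, host="localhost", port=7777):
    """Build the argv for 'HSET <parent>:map_conts <name> <child>' via redis-cli."""
    return [
        "redis-cli", "-p", str(port), "-h", host,
        "HSET", f"{parent_id}:map_conts", name, str(child_id),
    ]


# Assumption: attach container 5452838 (old scg) under container 2 (/eos).
argv = hset_map_conts(2, "scg", 5452838)
print(shlex.join(argv))
# redis-cli -p 7777 -h localhost HSET 2:map_conts scg 5452838
```

Passing the list to subprocess.run(argv) instead of pasting the printed string avoids quoting problems with names that contain spaces or parentheses.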

miramir · July 7, 2022, 12:53pm

eos-b02:~ # redis-cli -p 7777 -h localhost HSET 2:map_conts scg 5476312
(integer) 1

EOS Console [root://localhost] |/eos/> ls /eos/scg
Unable to stat /eos/scg; No such file or directory (errc=2) (No such file or directory)

miramir · July 7, 2022, 12:54pm

No, I ran only:

redis-cli -p 7777 -h localhost HSET 2:map_conts scg 5476312

apeters · July 7, 2022, 12:56pm

Did you drop the cache for inode 2 before listing?
eos ns cache drop-single-container 2
