作者 | 时间 | QQ技术交流群 |
---|---|---|
[email protected] | 2020/12/01 | 672152841 |
-
glusterfs shd进程的主要职责是修复副本卷和EC卷的数据,shd存在于每个gluster集群的节点中(如果开启heal),不是基于卷的。比如一个节点既部署了副本的brick,同时也部署了ec的brick,基本就一个shd进程
-
glusterfs shd进程会加载io-stats这个xlator,这个xlator的孩子可以是副本卷的xlator,也可以是ec卷的xlator.如果展示了glustershd的加载的xlator
// EC卷的第一个glusterfsd volume ec-vol-client-0 type protocol/client option ping-timeout 42 option remote-host 172.16.84.37 option remote-subvolume /data/ec/brick end-volume // EC卷的第二个glusterfsd volume ec-vol-client-1 type protocol/client option ping-timeout 42 option remote-host 172.16.84.41 option remote-subvolume /data/ec/brick end-volume // EC卷的sglusterfsd的集合 volume ec-vol-disperse-0 type cluster/disperse option redundancy 1 option self-heal-daemon enable option iam-self-heal-daemon yes subvolumes ec-vol-client-0 ec-vol-client-1 ec-vol-client-2 end-volume // io-stats的xlator volume ec-vol type debug/io-stats option log-level INFO option threads 16 subvolumes ec-vol-disperse-0 end-volume
Volume Name: ec-vol Type: Disperse Volume ID: 8ab95850-a90f-4e1d-b7f4-5932c52f7dce Status: Started Snapshot Count: 0 Number of Bricks: 1 x (2 + 1) = 3 Transport-type: tcp Bricks: Brick1: 172.16.84.37:/data/ec/brick Brick2: 172.16.84.41:/data/ec/brick Brick3: 172.16.84.42:/data/ec/brick Options Reconfigured: cluster.disperse-self-heal-daemon: enable storage.fips-mode-rchecksum: on transport.address-family: inet nfs.disable: on
-
shd进程有两种self-heal crawls,一种是index heal,另外一种是full heal.每个文件在crawling时候,会执行metadata、data、entry的修复。metadata修复文件的属性、权限、mode。data的修复会修复文件内容;entry修复会修复entry里面的目录
-
glusterfs中一般会有三种情况会触发index heal
1.每隔60s会触发一次,可以通过gluster volume xxx set cluster.heal-timeout {value}来设置 2.当客户端执行gluster volume heal {volume}会触发 3.当副本卷所在节点或者glusterfsd宕机后,又再次恢复过来会触发
在一个时间段内只会有一个shd进程来做修复。当此次shd修复完成后才可以触发shd进程进行下一次的修复,没完成之前会把请求进入队列。如果在cli执行了gluster volume heal {volume}后,此次修复没有完成,再次执行gluster volume heal {volume} full,只有等上一次完成之后才能进行full的修复
-
在index heal逻辑里面,gluster shd进程读取每个brick下的.glusterfs/indices/xattrop这个目录下的目录,加锁触发修复。.glusterfs/indices/xattrop目录中包含了以"xattrop-"开头的目录。这个目录下的其他木库是以gfid开头的,这些是需要修复的数据
//在节点2上干掉一个glusterfsd进程 [root@node2 ~]# ps -ef|grep glusterfsd root 1401177 1 1 10:54 ? 00:03:15 /usr/local/sbin/glusterfsd -s 172.16.84.41 --volfile-id replica-vol.172.16.84.41.data-replica-brick -p /var/run/gluster/vols/replica-vol/172.16.84.41-data-replica-brick.pid [root@node2 ~]# kill -9 1401177 // 节点1 上挂载然后,拷贝文件到这个卷中 [root@node1 glusterfs]# mount -t glusterfs 172.16.84.37:replica-vol /mnt/rep/ [root@node1 glusterfs]# cp cmd_history.log /mnt/rep/ // 然后发现这个卷下有2个文件 xattrop-22a5abc0-b16c-497e-b882-18c2c3486c5e和300f383b-db21-412b-9a10-6d70fd80fa7d [root@node1 xattrop]# pwd /data/replica/brick/.glusterfs/indices/xattrop [root@node1 xattrop]# ls -l total 0 ---------- 3 root root 0 Mar 11 10:57 00000000-0000-0000-0000-000000000001 ---------- 3 root root 0 Mar 11 10:57 300f383b-db21-412b-9a10-6d70fd80fa7d ---------- 3 root root 0 Mar 11 10:57 xattrop-22a5abc0-b16c-497e-b882-18c2c3486c5e
-
当客户端对这个cmd_history.log 文件进行操作请求发送到glusterfsd处理时候,在features/index这个xlator处理时候会添加硬链接在fop的pre-op事务处理阶段,如果修复处理成功也会删除硬链接在post-op处理阶段。因此当在.glusterfs/indices/xattrop目录中的没有IO操作的时候,对.glusterfs/indices/xattrop下面的文件进行修复,修复的逻辑如下
//修复过程 In shd process of each node { opendir +readdir (.glusterfs/indices/xattrop) for each entry inside it { self_heal_entry() //Explained below. } } self_heal_entry() { Call syncop_lookup(replicae subvolume) which eventually does { take appropriate locks determine source and sinks from AFR changelog xattrs perform whatever heal is needed (any of metadata, data and entry heal in that order) clear changelog xattrs and hardlink inside .glusterfs/indices/xattrop } }
// features/index xlator的信息 volume replica-vol-index type features/index option index-base /data/replica/brick/.glusterfs/indices option xattrop-dirty-watchlist trusted.afr.dirty option xattrop-pending-watchlist trusted.afr.replica-vol- subvolumes replica-vol-barrier end-volume
-
当gfid 的硬链接在副本卷的中2个副本的brick存在 .glusterfs/indices/xattrop,它会基于domain lock一个一个修复。当某个文件或者目录的父目录没有没有被修复,会跳过继续修复可以修复的文件或者目录。
-
如果一个文件出现data或者metadata出现脑裂,这些文件是不需要被修复的。
-
当在cli执行gluster volume heal {volume} full,会触发Full heal。这个命令比较粗暴,直接从健康的brick拷贝数据到被修复的brick.
-
这个和index heal修复模式不同,gluster集群中每个节点都有一个uuid.选择uuid最大的.也是最新最健康的的brick源端的数据源,选择后数据搬运到待修复的brick上.基本的修复逻辑如下
In shd process of highest UUID node per replica {
opendir +readdir ("/")
for each entry inside it {
self_heal_entry()
if (entry == directory) {
/ Recurse/
again opendir+readdir (directory) followed by self_heal_entry() of each entry.
}
}
-
replica 卷
[root@node1 ~]# gluster volume info Volume Name: replica-vol Type: Replicate Volume ID: dc294dff-7adb-46dd-a081-92b3a54d476e Status: Started Snapshot Count: 0 Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: 172.16.84.37:/data/replica/brick Brick2: 172.16.84.41:/data/replica/brick Brick3: 172.16.84.42:/data/replica/brick Options Reconfigured: cluster.self-heal-daemon: enable storage.reserve: 99 cluster.granular-entry-heal: on storage.fips-mode-rchecksum: on transport.address-family: inet nfs.disable: on performance.client-io-threads: off // 启动每个volume的shd进程 [root@node1 ~]# gluster volume status replica-vol Status of volume: replica-vol Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 172.16.84.37:/data/replica/brick 49152 0 Y 1511816 Brick 172.16.84.41:/data/replica/brick 49152 0 Y 1401177 Brick 172.16.84.42:/data/replica/brick 49152 0 Y 2731653 Self-heal Daemon on localhost N/A N/A Y 1511833 Self-heal Daemon on 172.16.84.41 N/A N/A Y 1401194 Self-heal Daemon on 172.16.84.42 N/A N/A Y 2731670 Task Status of Volume replica-vol ------------------------------------------------------------------------------ There are no active volume tasks
-
ec 卷
[root@node1 ~]# gluster volume info Volume Name: ec-vol Type: Disperse Volume ID: 8ab95850-a90f-4e1d-b7f4-5932c52f7dce Status: Started Snapshot Count: 0 Number of Bricks: 1 x (2 + 1) = 3 Transport-type: tcp Bricks: Brick1: 172.16.84.37:/data/ec/brick Brick2: 172.16.84.41:/data/ec/brick Brick3: 172.16.84.42:/data/ec/brick Options Reconfigured: cluster.disperse-self-heal-daemon: enable storage.fips-mode-rchecksum: on transport.address-family: inet nfs.disable: on [root@node1 ~]# gluster volume status ec-vol Status of volume: ec-vol Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick 172.16.84.37:/data/ec/brick 49153 0 Y 1569574 Brick 172.16.84.41:/data/ec/brick 49153 0 Y 1456710 Brick 172.16.84.42:/data/ec/brick 49153 0 Y 2833998 Self-heal Daemon on localhost N/A N/A Y 1511833 Self-heal Daemon on 172.16.84.41 N/A N/A Y 1401194 Self-heal Daemon on 172.16.84.42 N/A N/A Y 2731670 Task Status of Volume ec-vol ------------------------------------------------------------------------------ There are no active volume tasks
-
节点进程运行信息
// 每个节点运行1个shd进程 [root@node1 ~]# ps -ef|grep -v grep |grep glustershd root 1578547 1 0 11:46 ? 00:00:00 /usr/local/sbin/glusterfs -s localhost --volfile-id shd/ec-vol -p /var/run/gluster/shd/ec-vol/ec-vol-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/6cf91972718c43a8.socket --xlator-option *replicate*.node-uuid=a37ce536-32de-4829-9d7d-3f3a5ca83ef7 --process-name glustershd --client-pid=-6