EtherDrive(R) storage and Linux 2.6
Sam Hopkins and Ed L. Cashin {sah,ecashin}@coraid.com
April 2008
Using network data storage with ATA over Ethernet
<http://www.coraid.com/documents/AoEr10.txt> is easy after
understanding a few simple concepts. This document explains how to
use AoE targets from a Linux-based operating system, but the basic
principles are applicable to other systems that use AoE devices.
Below we begin by explaining the key components of the network
communication method, ATA over Ethernet (AoE). Next, we discuss the
way a Linux host uses AoE devices, providing several examples. A
list of frequently asked questions follows, and the document ends
with supplementary information.
______________________________________________________________________
Table of Contents
1. The EtherDrive System
2. How Linux Uses The EtherDrive System
3. The ATA over Ethernet Tools
3.1 Limiting AoE traffic to certain network interfaces
4. EtherDrive storage and Linux Software RAID
4.1 Example: RAID 5 with mdadm
4.2 Important notes
5. FAQ (contains important info)
5.1 Q: How does the system know about the AoE targets on the
network?
5.2 Q: How do I see what AoE devices the system knows about?
5.3 Q: What is the "closewait" state?
5.4 Q: How does the system know an AoE device has failed?
5.5 Q: How do I take an AoE device out of the failed state?
5.6 Q: How can I use LVM with my EtherDrive storage?
5.7 Q: I get an "invalid module format" error on modprobe.
Why?
5.8 Q: Can I allow multiple Linux hosts to use a filesystem that is
on my EtherDrive storage?
5.9 Q: Can you give me an overview of GFS and related software?
5.9.1 Background
5.9.2 Hardware
5.9.3 Software
5.9.4 Use
5.9.5 Fencing
5.10 Q: How can I make a RAID of more than 27 components?
5.11 Q: Why do my device nodes disappear after a reboot?
5.12 Q: Why does RAID initialization seem slow?
5.13 Q: I can only use shelf zero! Why won't e1.9 work?
5.14 Q: How can I start my AoE storage on boot and shut it down when
the system shuts down?
5.15 Q: Why do I get "permission denied" when I'm root?
5.16 Q: Why does fdisk ask me for the number of cylinders?
5.17 Q: Can I use AoE equipment with Oracle software?
5.18 Q: Why do I have intermittent problems?
5.19 Q: How can I avoid running out of memory when copying large
files?
5.20 Q: Why doesn't the aoe driver notice that an AoE device has
disappeared or changed size?
5.21 Q: My NFS client hangs when I export a filesystem on an AoE
device.
5.22 Q: Why do I see "unknown partition table" errors in my
logs?
5.23 Q: Why do I get better throughput to a file on an AoE device
than to the device itself?
5.24 Q: How can I boot diskless systems from my Coraid EtherDrive
devices?
5.25 Q: What filesystems do you recommend for very large block
devices?
5.26 Q: Why does umount say, "device is busy"?
5.27 Q: How do I use the multiple network path support in driver
versions 33 and up?
5.28 Q: Why does "xfs_check" say "out of memory"?
5.29 Q: Can virtual machines running on VMware ESX use AoE over
jumbo frames?
5.30 Q: Can I use SMART with my AoE devices?
6. Jumbo Frames
6.1 Linux NIC MTU
6.2 Network Switch MTU
6.3 SR MTU
7. Appendix A: Archives
7.1 Example: RAID 5 with the raidtools
7.2 Example: RAID 10 with mdadm
7.3 Important notes
7.4 Old FAQ List
7.4.1 Q: When I "modprobe aoe", it takes a long time. The
system seems to hang. What could be the problem?
______________________________________________________________________
1. The EtherDrive System
The ATA over Ethernet network protocol allows any type of data storage
to be used over a local ethernet network. An "AoE target" receives ATA
read and write commands, executes them, and returns responses to the
"AoE initiator" that is using the storage.
These AoE commands and responses appear on the network as ethernet
frames with type 0x88a2, the IEEE-registered Ethernet type for ATA
over Ethernet (AoE) <http://www.coraid.com/documents/AoEr10.txt>. An
AoE target is identified by a pair of numbers: the shelf address, and
the slot address.
For example, the Coraid SR appliance can perform RAID internally on
its SATA disks, making the resulting storage capacity available on the
ethernet network as one or more AoE targets. All of the targets will
have the same shelf address because they are all exported by the same
SR. They will have different AoE slot addresses, so that each AoE
target is individually addressable. The SR documentation calls each
target a "LUN". Each LUN behaves like a network disk.
Using EtherDrive technology like the SR appliance is as simple as
sending and receiving AoE packets.
To a Linux-based system running the "aoe" driver, it doesn't matter
what the remote AoE device really is. All that matters is that the AoE
protocol can be used to communicate with a device identified by a
certain shelf and slot address.
2. How Linux Uses The EtherDrive System
For security and performance reasons, many people use a second,
dedicated network interface card (NIC) for ATA over Ethernet traffic.
A NIC must be up before it can perform any networking, including AoE.
On examining the output of the ifconfig command, you should see your
AoE NIC listed as "UP" before attempting to use an AoE device
reachable via that NIC.
You can activate the NIC with a simple ifconfig eth1 up, using the
appropriate device name instead of "eth1". Note that assigning an IP
address is not necessary if the NIC is being used only for AoE
traffic, but having an IP address on a NIC used for AoE will not
interfere with AoE.
On a Linux system, block devices are used via special files called
device nodes. A familiar example is /dev/hda. When a block device node
is opened and used, the kernel translates operations on the file into
operations on the corresponding hardware EtherDrive.
Each accessible AoE target on your network is represented by a disk
device node in the /dev/etherd/ directory and can be used just like
any other direct attached disk. The "aoe" device driver is an open-
source loadable kernel module authored by Coraid. It translates system
reads/writes on a device into AoE request frames for the associated
remote EtherDrive storage device, retransmitting requests if needed.
When the AoE responses from the device are received, the appropriate
system read/write call is acknowledged as complete. The aoe device
driver handles retransmissions in the event of network congestion.
The association of AoE targets on your network to device nodes in
/dev/etherd/ follows a simple naming scheme. Each device node is named
eX.Y, where X represents a shelf address and Y represents a slot
address. Both X and Y are decimal integers. As an example, the
following command displays the first 4 KiB of data from the AoE target
with shelf address 0 and slot address 1.
dd if=/dev/etherd/e0.1 bs=1024 count=4 | hexdump -C
Creating an ext3 filesystem on the same AoE target is as simple as ...
mkfs.ext3 /dev/etherd/e0.1
Notice that the filesystem goes directly on the block device. There's
no need for any intermediate "format" or partitioning step.
Although partitions are not usually needed, they may be created using
a tool like fdisk or GNU parted. Please see the ``FAQ entry about
partition tables'' for important caveats.
Partitions are used by adding "p" and the partition number to the
device name. For example, /dev/etherd/e0.3p1 is the first partition on
the AoE target with shelf address zero and slot address three.
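The naming scheme is simple enough to compute in a script. Here is a
minimal shell sketch (the helper name aoe_dev is made up for
illustration) that maps a shelf, slot, and optional partition number
to the corresponding device node path:

```shell
#!/bin/sh
# aoe_dev: print the /dev/etherd/ node for a given shelf, slot, and
# optional partition number, following the eX.Y[pN] naming scheme.
aoe_dev() {
    shelf=$1; slot=$2; part=$3
    if [ -n "$part" ]; then
        echo "/dev/etherd/e${shelf}.${slot}p${part}"
    else
        echo "/dev/etherd/e${shelf}.${slot}"
    fi
}

aoe_dev 0 3 1    # prints /dev/etherd/e0.3p1
aoe_dev 10 9     # prints /dev/etherd/e10.9
```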
After creating a filesystem, it can be mounted in the normal way. It
is important to remember to unmount the filesystem before shutting
down your network devices. Without networking, there is no way to
unmount a filesystem that resides on a disk across the network.
It is best to update your init scripts so that filesystems on
EtherDrive storage are unmounted early in the system-shutdown
procedure, before network interfaces are shut down. ``An example'' is
found below in the ``list of Frequently Asked Questions''.
The device nodes in /dev/etherd/ are usually created in one of three
ways:
1. Most distributions today use udev to dynamically create device
nodes as needed. You can configure udev to create the device nodes
for your AoE disks. (For an example of udev configuration rules,
see ``Why do my device nodes disappear after a reboot?'' in the
``FAQ section'' below.)
2. If you are using the standalone aoe driver, as opposed to the one
distributed with the Linux kernel, and you are not using udev, the
Makefile will create device nodes for you when you do a "make
install".
3. If you are not using udev you can use static device nodes. Use the
aoe_dyndevs=0 module load option for the aoe driver. (You do not
need this option if your aoe driver is older than version aoe6-50.)
Then the aoe-mkdevs and aoe-mkshelf scripts in the aoetools
<http://aoetools.sourceforge.net/> package can be used to create
the static device nodes manually. It is very important to avoid
using these static device nodes with an aoe driver that has the
aoe_dyndevs module parameter set to 1, because you could
accidentally use the wrong device.
3. The ATA over Ethernet Tools
The aoe kernel driver allows Linux to do ATA over Ethernet. In
addition to the aoe driver, there is a collection of helpful programs
that operate outside of the kernel, in "user space". This collection
of tools and documentation is called the aoetools, and may be found at
http://aoetools.sourceforge.net/ <http://aoetools.sourceforge.net/>.
Current aoe drivers from the Coraid website are bundled with a
compatible version of the aoetools. This HOWTO may make reference to
commands from the aoetools, like the aoe-stat command.
3.1. Limiting AoE traffic to certain network interfaces
By default, the aoe driver will use any local network interface
available to reach an AoE target. Most of the time, though, the
administrator expects legitimate AoE targets to appear only on certain
ethernet interfaces, e.g., "eth1" and "eth2".
Using the aoe-interfaces command from the aoetools package allows the
administrator to limit AoE activity to a set list of ethernet
interfaces.
This configuration is especially important when some ethernet
interfaces are on networks where an unexpected AoE target with the
same shelf and slot address as a production AoE target might appear.
Please see the aoe-interfaces manpage for more information.
At module load time the list of allowable interfaces may be set with
the "aoe_iflist" module parameter.
modprobe aoe 'aoe_iflist=eth2 eth3'
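The same restriction can be made persistent by putting the parameter
in the modprobe configuration rather than on the command line. The
file name below is an assumption; your distribution may use
/etc/modprobe.conf or a different file under /etc/modprobe.d/.

```shell
# /etc/modprobe.d/aoe.conf (file name varies by distribution)
# Restrict the aoe driver to eth2 and eth3 at module load time.
options aoe aoe_iflist="eth2 eth3"
```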
4. EtherDrive storage and Linux Software RAID
Some AoE devices are internally redundant. A Coraid SR1521, for
example, might be exporting a 14-disk RAID 5 as a single 9.75 terabyte
LUN. In that case, the AoE target itself is performing RAID,
enhancing performance and reliability.
You can also perform RAID on the AoE initiator. Linux Software RAID
can increase performance by striping over multiple AoE targets and
reliability by using data redundancy. Reading the Linux Software RAID
HOWTO <http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html> before you
start to work with RAID will likely save time in the long run. The
Linux kernel has an "md" driver that performs the Software RAID, and
there are several tool sets that allow you to use this kernel feature.
The main software package for using the md driver is mdadm
<http://www.cse.unsw.edu.au/~neilb/source/mdadm/>. Less popular
alternatives include the older raidtools package ``(discussed in the
Archives below)'', and EVMS <http://evms.sourceforge.net/>.
4.1. Example: RAID 5 with mdadm
In this example we have five AoE targets in shelves 0-4, with each
shelf exporting a single LUN 0. The following mdadm command uses these
five AoE devices as RAID components, creating a level-5 RAID array.
The md configuration information is stored on the components
themselves in "md superblocks", which can be examined with another
mdadm command.
# mdadm -C -n 5 --level=raid5 --auto=md /dev/md0 /dev/etherd/e[0-4].0
mdadm: array /dev/md0 started.
# mdadm --examine /dev/etherd/e0.0
/dev/etherd/e0.0:
Magic : a92b4efc
Version : 00.90.00
UUID : 46079e2f:a285bc60:743438c8:144532aa (local to host ellijay)
...
The /proc/mdstat file contains summary information about the RAID as
reported by the kernel itself.
# cat /proc/mdstat
Personalities : [raid5] [raid4]
md0 : active raid5 etherd/e4.0[5] etherd/e3.0[3] etherd/e2.0[2] etherd/e1.0[1] etherd/e0.0[0]
5860638208 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
[>....................] recovery = 0.0% (150272/1465159552) finish=23605.3min speed=1032K/sec
unused devices: <none>
Until md finishes initializing the parity of the RAID, performance is
sub-optimal, and the RAID will not be usable if one of the components
fails during initialization. After initialization is complete, the md
device can continue to be used even if one component fails.
Later the array can be stopped in order to shut it down cleanly in
preparation for a system reboot or halt.
# mdadm -S /dev/md0
In a system init script (see ``the aoe-init example in the FAQ'') an
mdadm command can assemble the RAID components using the configuration
information that was stored on them when the RAID was created.
# mdadm -A /dev/md0 /dev/etherd/e[0-4].0
mdadm: /dev/md0 has been started with 5 drives.
To make an xfs filesystem on the RAID array and mount it, the
following commands can be issued:
# mkfs -t xfs /dev/md0
# mkdir /mnt/raid
# mount /dev/md0 /mnt/raid
Once md has finished initializing the RAID, the storage is single-
fault tolerant: Any of the components can fail without making the
storage unavailable. Once a single component has failed, the md device
is said to be in a "degraded" state. Using a degraded array is fine,
but a degraded array cannot remain usable if another component fails.
Adding hot spares makes the array even more robust. Having hot spares
allows md to bring a new component into the RAID as soon as one of its
components has failed so that the normal state may be achieved as
quickly as possible. You can check /proc/mdstat for information on the
initialization's progress.
The new write-intent bitmap feature can dramatically reduce the time
needed for re-initialization after a component fails and is later
added back to the array. Reducing the time the RAID spends in degraded
mode makes a double fault less likely. Please see the mdadm manpages
for details.
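As a sketch of what that looks like, an internal write-intent bitmap
can be added to an existing array with mdadm's --grow mode, and a
component that briefly disappeared can then be re-added so that only
the dirty regions are resynced:

```shell
# Add an internal write-intent bitmap to an existing array.
mdadm --grow --bitmap=internal /dev/md0

# Re-add a component after a transient failure (device name is an
# example); only regions written while it was missing are resynced.
mdadm /dev/md0 --re-add /dev/etherd/e2.0
```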
4.2. Important notes
1. Some Linux distributions come with an mdmonitor service running by
default. Unless you configure the mdmonitor to do what you want,
consider turning off this service with chkconfig mdmonitor off and
/etc/init.d/mdmonitor stop or your system's equivalent commands. If
mdadm is running in its "monitor" mode without being properly
configured, it may interfere with failover to hot spares, the
stopping of the RAID, and other actions.
2. There is a problem with the way some 2.6 kernels determine whether
an I/O device is idle. On these kernels, RAID initialization is
about five times slower than it needs to be.
On these kernels you can do the following to work around the
problem:
echo 100000 > /proc/sys/dev/raid/speed_limit_max
echo 100000 > /proc/sys/dev/raid/speed_limit_min
5. FAQ (contains important info)
5.1. Q: How does the system know about the AoE targets on the
network?
A: When an AoE target comes online, it emits a broadcast frame
indicating its presence. In addition to this mechanism, the AoE
initiator may send out a query frame to discover any new AoE targets.
The Linux aoe driver, for example, sends an AoE query once per minute.
The discovery can be triggered manually with the "aoe-discover" tool,
one of the aoetools <http://aoetools.sourceforge.net/>.
5.2. Q: How do I see what AoE devices the system knows about?
A: The /usr/sbin/aoe-stat program (from the aoetools
<http://aoetools.sourceforge.net/>) lists the devices the system
considers valid. It also displays the status of the device (up or
down). For example:
root@makki root# aoe-stat
e0.0 10995.116GB eth0 up
e0.1 10995.116GB eth0 up
e0.2 10995.116GB eth0 up
e1.0 1152.874GB eth0 up
e7.0 370.566GB eth0 up
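The columnar output lends itself to scripting. As a minimal sketch,
the following shell function (the name aoe_up is made up) prints only
the devices whose state column reads "up", assuming the output format
shown above:

```shell
#!/bin/sh
# aoe_up: read aoe-stat style lines (device, size, interface, state)
# on stdin and print the names of devices whose state is "up".
aoe_up() {
    awk '$NF == "up" { print $1 }'
}

# Example with canned input resembling the aoe-stat output above:
aoe_up <<'EOF'
e0.0    10995.116GB   eth0 up
e1.0     1152.874GB   eth0 down,closewait
e7.0      370.566GB   eth0 up
EOF
# prints e0.0 and e7.0
```

In practice you would pipe the real command through the filter, as in
aoe-stat | aoe_up.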
5.3. Q: What is the "closewait" state?
A: The "down,closewait" status means that the device went down but at
least one process still has it open. After all processes close the
device, it will become "up" again if the remote AoE device is
available and ready.
The user can also use the "aoe-revalidate" command to manually cause
the aoe driver to query the AoE device. If the AoE device is available
and ready, the device state on the Linux host will change from
"down,closewait" to "up".
5.4. Q: How does the system know an AoE device has failed?
A: When an AoE target cannot complete a requested command it will
indicate so in the response to the failed request. The Linux aoe
driver will mark the AoE device as failed upon reception of such a
response. In addition, if an AoE target has not responded to a prior
request within a default timeout (currently three minutes) the aoe
driver will fail the device.
5.5. Q: How do I take an AoE device out of the failed state?
A: If the aoe driver shows the device state to be "down", first check
the EtherDrive storage itself and the AoE network. Once any problem
has been rectified, you can use the "aoe-revalidate" command from the
aoetools <http://aoetools.sourceforge.net/> to ask the aoe driver to
change the state back to "up".
If the Linux Software RAID driver has marked the device as "failed"
(so that an "F" shows up in the output of "cat /proc/mdstat"), then
you first need to remove the device from the RAID using mdadm. Next
you add the device back to the array with mdadm.
An example follows, showing how (after manually failing e10.0) the
device is removed from the array and then added back. After adding it
back to the RAID, the md driver begins rebuilding the redundancy of
the array.
root@kokone ~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md0 : active raid1 etherd/e10.1[1] etherd/e10.0[0]
524224 blocks [2/2] [UU]
unused devices: <none>
root@kokone ~# mdadm --fail /dev/md0 /dev/etherd/e10.0
mdadm: set /dev/etherd/e10.0 faulty in /dev/md0
root@kokone ~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md0 : active raid1 etherd/e10.1[1] etherd/e10.0[2](F)
524224 blocks [2/1] [_U]
unused devices: <none>
root@kokone ~# mdadm --remove /dev/md0 /dev/etherd/e10.0
mdadm: hot removed /dev/etherd/e10.0
root@kokone ~# mdadm --add /dev/md0 /dev/etherd/e10.0
mdadm: hot added /dev/etherd/e10.0
root@kokone ~# cat /proc/mdstat
Personalities : [raid1] [raid5]
md0 : active raid1 etherd/e10.0[2] etherd/e10.1[1]
524224 blocks [2/1] [_U]
[=>...................] recovery = 5.0% (26944/524224) finish=0.6min speed=13472K/sec
unused devices: <none>
root@kokone ~#
5.6. Q: How can I use LVM with my EtherDrive storage?
A: With older LVM2 <http://sources.redhat.com/lvm2/> releases, you may
need to edit lvm.conf, but the current version of LVM2 supports AoE
devices "out of the box".
You can also create md devices from your aoe devices and tell LVM to
use the md devices.
It's necessary to understand LVM itself in order to use AoE devices
with LVM. Besides the manpages for the LVM commands, the LVM HOWTO
<http://tldp.org/HOWTO/LVM-HOWTO/> is a big help if you are just
starting out with LVM.
If you have an old LVM2 that does not already detect and work with AoE
devices, you can add this line to the "devices" block of your
lvm.conf.
types = [ "aoe", 16 ]
If you are creating physical volumes out of RAIDs over EtherDrive
storage, make sure to turn on md component detection so that LVM2
doesn't go snooping around on the underlying EtherDrive disks.
md_component_detection = 1
The snapshots feature in LVM2 did not work in early 2.6 kernels.
Lately, Coraid customers have reported success using snapshots on AoE-
backed logical volumes when using a recent kernel and aoe driver.
Older aoe drivers, like version 22, may need a fix
<https://bugzilla.redhat.com/attachment.cgi?id=311070> to work
correctly with snapshots.
Customers have reported data corruption and kernel panics when using
striped logical volumes (created with the "-i" option to lvcreate)
when using aoe driver versions prior to aoe6-48. No such problems
occur with normal logical volumes or with Software RAID's striping
(RAID 0).
Most systems have boot scripts that try to detect LVM physical
volumes early in the boot process, before AoE devices are available.
You may need to help LVM recognize AoE devices that are physical
volumes by running vgscan after loading the aoe module.
There have been reports that partitions can interfere with LVM's
ability to use an AoE device as a physical volume. For example, with
partitions e0.1p1 and e0.1p2 residing on e0.1, pvcreate
/dev/etherd/e0.1 might complain,
Device /dev/etherd/e0.1 not found.
Removing the partitions allows LVM to create a physical volume from
e0.1.
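Once LVM recognizes the device, using it is the same as with any
other block device. The following sketch shows a typical sequence;
the volume group and logical volume names are made up for
illustration:

```shell
# Turn the AoE device into an LVM physical volume, then build a
# volume group and a logical volume on it (names are examples).
pvcreate /dev/etherd/e0.1
vgcreate vg_aoe /dev/etherd/e0.1
lvcreate -L 100G -n lv_data vg_aoe
mkfs.ext3 /dev/vg_aoe/lv_data
```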
5.7. Q: I get an "invalid module format" error on modprobe. Why?
A: The aoe module and the kernel must be built to match one another.
On module load, the kernel version, SMP support (yes or no), the
compiler version, and the target processor must be the same for the
module as they were when the kernel was built.
5.8. Q: Can I allow multiple Linux hosts to use a filesystem that is
on my EtherDrive storage?
A: Yes, but you're now taking advantage of the flexibility of
EtherDrive storage, using it like a SAN. Your software must be
"cluster aware", like GFS <http://sources.redhat.com/cluster/gfs/>.
Otherwise, each host will assume it is the sole user of the filesystem
and data corruption will result.
5.9. Q: Can you give me an overview of GFS and related software?
A: Yes, here's a brief overview.
5.9.1. Background
GFS is a scalable, journaled filesystem designed to be used by more
than one computer at a time. There is a separate journal for each host
using the filesystem. All the hosts working together are called a
cluster, and each member of the cluster is called a cluster node.
To achieve acceptable performance, each cluster node remembers what
was on the block device the last time it looked. This is caching,
where copies of data in RAM are temporarily used instead of data read
directly from the block device.
To avoid chaos, the data in the RAM cache of every cluster node has to
match what's on the block device. The cluster nodes communicate over
TCP/IP to agree on who is in the cluster and who has the right to use
a particular part of the shared block device.
5.9.2. Hardware
To allow the cluster nodes to control membership in the cluster and to
control access to the shared block storage, "fencing" hardware can be
used.
Some network switches can be dynamically configured to turn single
ports on and off, effectively fencing a node off from the rest of the
network.
Remote power switches can be told to turn an outlet off, powering a
cluster node down, so that it is certainly not accessing the shared
storage.
5.9.3. Software
The RedHat Cluster Suite developers have created several pieces of
software besides the GFS filesystem itself to allow the cluster nodes
to coordinate cluster membership and to control access to the shared
block device.
These parts are listed here, on the GFS Project Page.
http://sources.redhat.com/cluster/gfs/
<http://sources.redhat.com/cluster/gfs/>
GFS and its related software are undergoing continuous heavy
development and are maturing slowly but steadily.
As might be expected, the developers working for RedHat target RedHat
Enterprise Linux as the ultimate platform for GFS and its related
software. They also use Fedora Core as a platform for testing and
innovation.
That means that when choosing a distribution for running GFS, recent
versions of Fedora Core, RedHat Enterprise Linux (RHEL), and RHEL
clones like CentOS should be considered. On these platforms, RPMs are
available that have a good chance of working "out of the box."
With a RedHat-based distro like Fedora Core, using GFS means seeking
out the appropriate documentation, installing the necessary RPMs, and
creating a few text files for configuring the software.
Here is a good overview of what the process is generally like. Note
that if you're using RPMs, then building and installing the software
will not be necessary.
http://sources.redhat.com/cluster/doc/usage.txt
<http://sources.redhat.com/cluster/doc/usage.txt>
5.9.4. Use
Once you have things ready, using the GFS is like using any other
filesystem.
Performance will be greatest when the filesystem operations of the
different nodes do not interfere with one another. For instance, if
all the nodes try to write to the same place in a directory or file,
much time will be spent in coordinating access (locking).
An easy way to eliminate a large amount of locking is to use the
"noatime" (no access time update) mount option. Even in traditional
filesystems the use of this option often results in a dramatic
performance benefit, because it eliminates the need to write to the
block storage just to record the time that the file was last accessed.
5.9.5. Fencing
There are several ways to keep a cluster node from accessing shared
storage when that node might have outdated assumptions about the state
of the cluster or the storage. Preventing the node from accessing the
storage is called "fencing", and it can be accomplished in several
ways.
One popular way is to simply kill the power to the fenced node by
using a remote power switch. Another is to use a network switch that
has ports that can be turned on and off remotely.
When the shared storage resource is a LUN on an SR, it is possible to
manipulate the LUN's mask list in order to accomplish fencing. You can
read about this technique in the Contributions area
</support/linux/contrib/>.
5.10. Q: How can I make a RAID of more than 27 components?
A: For Linux Software RAID, the kernel limits the number of disks in
one RAID to 27. However, you can easily overcome this limitation by
creating another level of RAID.
For example, to create a RAID 0 of thirty block devices, you may
create three ten-disk RAIDs (md1, md2, and md3) and then stripe across
them (md0 is a stripe over md1, md2, and md3).
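With mdadm, that layering can be sketched as follows. The device
names, and the assumption that each shelf exports ten LUNs in slots 0
through 9, are for illustration only:

```shell
# Three ten-component RAID 0 arrays, one per shelf ...
mdadm -C /dev/md1 --level=0 -n 10 /dev/etherd/e5.[0-9]
mdadm -C /dev/md2 --level=0 -n 10 /dev/etherd/e6.[0-9]
mdadm -C /dev/md3 --level=0 -n 10 /dev/etherd/e7.[0-9]
# ... then a RAID 0 striped across the three arrays.
mdadm -C /dev/md0 --level=0 -n 3 /dev/md1 /dev/md2 /dev/md3
```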
Here is an example raidtools configuration file that implements the
above scenario for shelves 5, 6, and 7: multi-level RAID 0
configuration file <raid0-30component.conf>. Non-trivial raidtab
configuration files are easier to generate from a script than to
create by hand.
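As a sketch of such a generating script, the shell function below
emits the repetitive per-device lines for one shelf's raiddev block,
assuming ten LUNs in slots 0 through 9. It produces only the device
list, not a complete raidtab:

```shell
#!/bin/sh
# gen_raid_devices: emit raidtab "device"/"raid-disk" lines for the
# ten LUNs (slots 0-9) of one shelf, for use inside a raiddev block.
gen_raid_devices() {
    shelf=$1
    slot=0
    while [ "$slot" -lt 10 ]; do
        echo "    device /dev/etherd/e${shelf}.${slot}"
        echo "    raid-disk $slot"
        slot=$((slot + 1))
    done
}

gen_raid_devices 5   # first line: "    device /dev/etherd/e5.0"
```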
EtherDrive storage gives you a lot of freedom, so be creative.
5.11. Q: Why do my device nodes disappear after a reboot?
A: Some Linux distributions create device nodes dynamically. The
upcoming method of choice is called "udev". The aoe driver and udev
work together when the following rules are installed.
These rules go into a file with a name like 60-aoe.rules. Look in
your udev.conf file (usually /etc/udev/udev.conf) for the line
starting with udev_rules= to find out where rules go (usually
/etc/udev/rules.d).
# These rules tell udev what device nodes to create for aoe support.
# They may be installed along the following lines. Check the section
# 8 udev manpage to see whether your udev supports SUBSYSTEM, and
# whether it uses one or two equal signs for SUBSYSTEM and KERNEL.
# aoe char devices
SUBSYSTEM=="aoe", KERNEL=="discover", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="err", NAME="etherd/%k", GROUP="disk", MODE="0440"
SUBSYSTEM=="aoe", KERNEL=="interfaces", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="revalidate", NAME="etherd/%k", GROUP="disk", MODE="0220"
SUBSYSTEM=="aoe", KERNEL=="flush", NAME="etherd/%k", GROUP="disk", MODE="0220"
# aoe block devices
KERNEL=="etherd*", NAME="%k", GROUP="disk"
Unfortunately the syntax for the udev rules file has changed several
times as new versions of udev appear. You will probably have to modify
the example above for your system, but the existing rules and the udev
documentation should help you.
There is an example script in the aoe driver,
linux/Documentation/aoe/udev-install.sh, that can install the rules on
most systems.
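If you want to locate the rules directory from a script of your own, the sketch below shows one way to read the udev_rules= setting described above. The function name is made up for this example, and it falls back to the common /etc/udev/rules.d default when the file or setting is missing.

```shell
#!/bin/sh
# rules_dir_for - print the udev rules directory named by the
# udev_rules= line in the given udev.conf, falling back to the
# usual /etc/udev/rules.d when the file or setting is absent.
rules_dir_for() {
	dir=
	if [ -r "$1" ]; then
		dir=`sed -n 's/^udev_rules=//p' "$1" | tr -d '"'`
	fi
	echo "${dir:-/etc/udev/rules.d}"
}

rules_dir_for /etc/udev/udev.conf
```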
The udev system can only work with the aoe driver if the aoe driver is
loaded. To avoid confusion, make sure that you load the aoe driver at
boot time.
5.12. Q: Why does RAID initialization seem slow?
A: The 2.6 Linux kernel has a problem with its RAID initialization
rate limiting feature. You can override this feature and speed up RAID
initialization by using the following commands. Note that these
commands change kernel memory, so the commands must be re-run after a
reboot.
echo 100000 > /proc/sys/dev/raid/speed_limit_max
echo 100000 > /proc/sys/dev/raid/speed_limit_min
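If your distribution reads /etc/sysctl.conf (or a file under /etc/sysctl.d) at boot, the same override can be made persistent with the equivalent sysctl settings. This is a configuration sketch; the values match the echo commands above and are in KiB/s.

```
# /etc/sysctl.conf fragment: raise the md resync speed limits at boot.
dev.raid.speed_limit_max = 100000
dev.raid.speed_limit_min = 100000
```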
5.13. Q: I can only use shelf zero! Why won't e1.9 work?
A: Every block device has a device file, usually in /dev, that has a
major and minor number. You can see these numbers using ls. Note the
high minor numbers (1744, 2400, and 2401) in the example below.
ecashin@makki ~$ ls -l /dev/etherd/
total 0
brw------- 1 root disk 152, 1744 Mar 1 14:35 e10.9
brw------- 1 root disk 152, 2400 Feb 28 12:21 e15.0
brw------- 1 root disk 152, 2401 Feb 28 12:21 e15.0p1
The 2.6 Linux kernel allows high minor device numbers like this, but
until recently, 255 was the highest minor number one could use. Some
distributions contain userland software that cannot understand the
high minor numbers that 2.6 makes possible.
Here's a crude but reliable test that can determine whether your
system is ready to use devices with high minor numbers. In the example
below, we tried to create a device node with a minor number of 1744,
but ls shows it as 208.
root@kokone ~# mknod e10.9 b 152 1744
root@kokone ~# ls -l e10.9
brw-r--r-- 1 root root 158, 208 Mar 2 15:13 e10.9
On systems like this, you can still use the aoe driver with up to 256
disks if you're willing to live without support for partitions.
Just make sure that the device nodes and the aoe driver are both
created with one partition per device.
The commands below show how to build the driver without partition
support and then create compatible device nodes for shelf 10.
make install AOE_PARTITIONS=1
rm -rf /dev/etherd
env n_partitions=1 aoe-mkshelf /dev/etherd 10
As of version 1.9.0, the mdadm command supports large minor device
numbers. The mdadm versions before 1.9.0 do not. If you would like to
use versions of mdadm older than 1.9.0, you can configure your driver
and device nodes as outlined above. Be aware that it's easy to
confuse yourself by creating a driver that doesn't match the device
nodes.
5.14. Q: How can I start my AoE storage on boot and shut it down when
the system shuts down?
A: That is really a question about your own system, so it's a question
you, as the system administrator, are in the best position to answer.
In general, though, many Linux distributions follow the same patterns
when it comes to system "init scripts". Most use a System V style.
The example below should help get you started if you have never
created and installed an init script. Start by reading the comments at
the top. Make sure you understand how your system works and what the
script does, because every system is different.
Here is an overview of what happens when the aoe module is loaded and
begins AoE device discovery. It should help you to understand the
example script below. Starting up the aoe module on
boot can be tricky if necessary parts of the system are not ready when
you want to use AoE.
To discover an AoE device, the aoe driver must receive a Query Config
response packet that indicates the device is available. A Coraid SR
broadcasts this response unsolicited when you run the online SR
command, but it is usually sent in response to an AoE initiator
broadcasting a Query Config command to discover devices on the
network. Once an AoE device has been discovered, the aoe driver sends
an ATA Device Identify command to get information about the disk
drive. When the disk size is known, the aoe driver will install the
new block device in the system.
The aoe driver will broadcast this AoE discovery command when loaded,
and then once a minute thereafter.
The AoE discovery that takes place on loading the aoe driver does not
take long, but it does take some time. That's why you'll see "sleep"
commands in the example aoe-init script below. If AoE discovery is
failing, try unloading the aoe module and tuning your init script by
invoking it at the command line.
You will often find that a delay is necessary after loading your
network drivers (and before loading the aoe driver). This delay allows
the network interface to initialize and to become usable. An
additional delay is necessary after loading the aoe driver, so that
AoE discovery has time to take place before any AoE storage is used.
Without such a delay, the initial AoE Config Query broadcast packet
might never go out onto the AoE network, and then the AoE initiator
will not know about any AoE targets until the next periodic Config
Query broadcast occurs, usually one minute later.
#! /bin/sh
# aoe-init - example init script for ATA over Ethernet storage
#
# Edit this script for your purposes. (Changing "eth1" to the
# appropriate interface name, adding commands, etc.) You might
# need to tune the sleep times.
#
# Install this script in /etc/init.d with the other init scripts.
#
# Make it executable:
# chmod 755 /etc/init.d/aoe-init
#
# Install symlinks for boot time:
# cd /etc/rc3.d && ln -s ../init.d/aoe-init S99aoe-init
# cd /etc/rc5.d && ln -s ../init.d/aoe-init S99aoe-init
#
# Install symlinks for shutdown time:
# cd /etc/rc0.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc1.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc2.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc6.d && ln -s ../init.d/aoe-init K01aoe-init
#
case "$1" in
"start")
# load any needed network drivers here
# replace "eth1" with your aoe network interface
ifconfig eth1 up
# time for network interface to come up
sleep 4
modprobe aoe
# time for AoE discovery and udev
sleep 7
# add your raid assemble commands here
# add any LVM commands if needed (e.g. vgchange)
# add your filesystem mount commands here
test -d /var/lock/subsys && touch /var/lock/subsys/aoe-init
;;
"stop")
# add your filesystem umount commands here
# deactivate LVM volume groups if needed
# add your raid stop commands here
rmmod aoe
rm -f /var/lock/subsys/aoe-init
;;
*)
echo "usage: `basename $0` {start|stop}" 1>&2
;;
esac
5.15. Q: Why do I get "permission denied" when I'm root?
A: Some newer systems come with SELinux (Security-Enhanced Linux),
which can limit what the root user can do.
SELinux is usually good about creating entries in the system logs when
it prevents root from doing something, so examine your logs for such
messages.
Check the SELinux documentation for information on how to configure or
disable SELinux according to your needs.
5.16. Q: Why does fdisk ask me for the number of cylinders?
A: Your fdisk is probably asking the kernel for the size of the disk
with a BLKGETSIZE block device ioctl, which returns the disk's sector
count as a 32-bit number. If the size of the disk exceeds what this
32-bit number can hold (2 TB is the limit), the ioctl fails with an
error (EFBIG). The error indicates that the program should try the
64-bit ioctl (BLKGETSIZE64), but when fdisk doesn't do that, it just
asks the user to supply the number of cylinders.
You can tell fdisk the number of cylinders yourself. The number to use
(sectors / (255 * 63)) is printed by the following commands. Use the
appropriate device instead of "e0.0".
sectors=`cat /sys/block/etherd\!e0.0/size`
echo $sectors 255 63 '*' / p | dc
But no MSDOS partition table can ever work with more than 2 TB. The
reason is that the numbers in the partition table itself are only 32
bits in size. That means you can't have a partition larger than 2 TB
in size or starting further than 2 TB from the beginning of the
device.
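The arithmetic behind that limit is easy to check in the shell: a
32-bit field can count at most 2^32 - 1 sectors, and each sector is
512 bytes.

```shell
#!/bin/sh
# The MSDOS partition table stores sector counts in 32-bit fields,
# and these devices use 512-byte sectors, so the ceiling is just
# under 2 TB.
max_sectors=4294967295			# 2^32 - 1, largest 32-bit value
bytes_per_sector=512
max_bytes=$((max_sectors * bytes_per_sector))
echo $max_bytes				# 2199023255040, just under 2 TB
```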
Some options for multi-terabyte volumes are:
1. doing without partitions, so that the filesystem is created
   directly on the AoE device itself (e.g., /dev/etherd/e1.0),
2. LVM2, the Logical Volume Manager, a sophisticated way of
   allocating storage to create logical volumes of desired sizes, and
3. GPT partition tables.
The last item in the list above is a new kind of partition table that
overcomes the limitations of the older MSDOS-style partition table.
Andrew Chernow has related his successful experiences using GPT
partition tables on large AoE devices in this contributed document
</support/linux/contrib/chernow/gpt.html>.
Please note that some versions of the GNU parted tool, such as version
1.8.6, have a bug. This bug allows the user to create an MSDOS-style
partition table with partitions larger than two terabytes even though
these partitions are too large for an MSDOS partition table. The
result is that the filesystems on these partitions will only be usable
until the next reboot.
5.17. Q: Can I use AoE equipment with Oracle software?
A: Oracle used to have an Oracle Storage Compatibility Program
<http://www.oracle.com/technology/deploy/availability/htdocs/oscp.html>,
but simple block-level storage technologies do not require Oracle
validation. ATA over Ethernet provides simple, block-level storage.
Oracle used to have a list of frequently asked questions about
running Oracle on Linux, but they have replaced it with documentation
about their own Linux distribution list covering