Ceph - howto, rbd, lvm, cluster
Install ceph
- Installation depends on which version you want, as they're all locked into fixed releases (argonaut, bobtail etc). So go here for install options for your distro and the version you want:
Video to ceph intro
https://www.youtube.com/watch?v=UXcZ2bnnGZg http://www.youtube.com/watch?v=BBOBHMvKfyc&feature=g-high
Rebooting node stops everything / Set number of replicas across all nodes
This issue is more or less fixed in Cuttlefish+
Make sure that the min replica count is set to nodes-1.
ceph osd pool set <poolname> min_size 1
Then the remaining node[s] will start up with just 1 node, even if everything else is down.
Keep in mind this can get ugly, as there are no replicas left at that point.
More info here: http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10481
Add disks (OSD) or entire nodes
Use ceph-deploy tools from cuttlefish+ instead for all these sorts of things unless you know what you're doing ( http://ceph.com/docs/master/rados/deployment/ )
Prepare the disk as usual (partition or entire disk) and format it with a filesystem of your choosing. Add it to fstab and mount it. Add it to /etc/ceph/ceph.conf and replicate the new conf to the other nodes.
Start the disk. I'm assuming we've added osd.12 (sdd) on ceph1 here.
## Prepare disk first, create partition and format it
<insert parted oneliner>
mkfs.xfs -f /dev/sdd1
## Create the disk
ceph osd create [uuid]
## Auth stuff to make sure that the OSD is accepted into the cluster:
mkdir /srv/ceph/[uuid_from_above]
ceph-osd -i 12 --mkfs --mkkey
ceph auth add osd.12 osd 'allow *' mon 'allow rwx' -i /etc/ceph/keyring.osd.12
## Start it
/etc/init.d/ceph start osd.12
## Add it to the cluster and allow replication based on the CRUSH map.
ceph osd crush set 12 osd.12 1.0 pool=default rack=unknownrack host=ceph1-osd
In the line above, if you exchange the pool/rack/host you can place your disk/node where you want.
If you add a new host entry, it will be the same as adding a new node (with the disk).
Check that it is in the right place with:
ceph osd tree
More info here:
- http://ceph.com/docs/master/rados/operations/pools/
- http://ceph.com/docs/master/rados/operations/add-or-rm-osds/
Delete pools/OSD
Use ceph-deploy tools from cuttlefish+ instead for all these sorts of things unless you know what you're doing ( http://ceph.com/docs/master/rados/deployment/ )
Make sure you have the right disk, run
ceph osd tree
to get an overview.
If OSD is member of a pool and/or active
Here we'll delete osd5
## Mark it out
ceph osd out 5
## Wait for data migration to complete (ceph -w), then stop it
service ceph -a stop osd.5
## Now it is marked out and down
Delete the OSD
## If deleting from active stack, be sure to follow the above to mark it out and down
ceph osd crush remove osd.5
## Remove auth for disk
ceph auth del osd.5
## Remove disk
ceph osd rm 5
## Remove from ceph.conf and copy new conf to all hosts
Add Monitor node/service
Use ceph-deploy tools from cuttlefish+ instead for all these sorts of things unless you know what you're doing ( http://ceph.com/docs/master/rados/deployment/ )
Install ceph, add keys, ceph.conf, host files and prepare a storage for containing the maps.
Then add the monitor into the system (To keep quorum, keep either 1 or 3+ - not 2). Examples are adding monitor mon.21 with ip 192.168.0.68
cd /tmp; mkdir add_monitor; cd add_monitor
ceph auth get mon. -o key
> exported keyring for mon.
ceph mon getmap -o map
> got latest monmap
ceph-mon -i 21 --mkfs --monmap map --keyring key
> ceph-mon: created monfs at /srv/ceph/mon21 for mon.21
ceph mon add 21 192.168.0.68
> port defaulted to 6789
> added mon.21 at 192.168.0.68:6789/0
/etc/init.d/ceph start mon.21
Add the info to the ceph.conf file:
[mon]
...
[mon.21]
host = ceph2-mon
mon addr = 192.168.0.68:6789
...
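The "1 or 3+, not 2" advice for monitors follows from majority quorum: more than half of the monitors must be reachable. A minimal sketch of that arithmetic is shown here; the function names are made up for this example and are not Ceph commands.

```shell
# Majority-quorum arithmetic for N monitors (illustration only, not a Ceph tool).

# Smallest number of monitors that forms a majority of $1.
mon_quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# How many monitors may fail while a majority is still reachable.
mon_failures_tolerated() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 1 2 3 4 5; do
    echo "mons=$n quorum=$(mon_quorum_size "$n") tolerated=$(mon_failures_tolerated "$n")"
done
```

With 2 monitors the quorum size is 2, so losing either one stalls the cluster; that is no better than running 1, and it doubles the chance of a failure.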
Replicating from OSD-based to replication across hosts in a ceph cluster
More info here: http://jcftang.github.com/2012/09/06/going-from-replicating-across-osds-to-replicating-across-hosts-in-a-ceph-cluster/
Replication - see current level per OSD
ceph osd dump
CRUSH maps
- Redistributing, [de]assembling and finetuning; more info here:
## Save current crushmap in binary
ceph osd getcrushmap -o crush.running.map
## Convert to txt
crushtool -d crush.running.map -o crush.map
## Edit it and re-convert to binary
crushtool -c crush.map -o crush.new.map
## Inject into running system
ceph osd setcrushmap -i crush.new.map
## If you've added a new ruleset and want to use that for a pool, do something like:
ceph osd pool set <poolname> crush_ruleset 4
KVM - add disk
- Substitute <host name='192.168.0.67' port='6789'/> with ip/port of your monitor[s]
- Make sure you have a distinct id for the slotNR: <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
- Generate new uuid: uuidgen
- Be sure to have added secret key to libvirt; detailed just below ( http://wiki.skytech.dk/index.php/Ceph_-_howto,_rbd,_lvm,_cluster#KVM_-_add_secret.2Fauth_for_use_with_ceph )
- Omit the <auth>..</auth> part if you're not using any auth
Per host:
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <auth username='libvirt'>
    <secret type='ceph' uuid='7a91dc24-b072-43c4-98fb-4b2415322b0f'/>
  </auth>
  <source protocol='rbd' name='sata/webserver'>
    <host name='192.168.0.67' port='6789'/>
    <host name='192.168.0.68' port='6789'/>
    <host name='192.168.0.69' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
  <boot order='1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
Per pool:
<pool type='rbd'>
  <name>rbd</name>
  <uuid>f959641f-f518-4505-9e85-17d994e2a399</uuid>
  <source>
    <host name='192.168.0.67' port='6789'/>
    <host name='192.168.0.68' port='6789'/>
    <host name='192.168.0.69' port='6789'/>
    <name>test</name>
    <auth username='admin' type='ceph'>
      <secret type='ceph' uuid='7a91dc24-b072-43c4-98fb-4b2415322b0f'/>
    </auth>
  </source>
</pool>
KVM - add secret/auth for use with ceph
If you're using any form of ceph auth, this needs to be added - else skip this part
Create ceph-auth for user (substitute pools sata/ssd with other rbd-based pools if needed)
cd /etc/ceph
ceph auth get-or-create client.libvirt mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=sata, allow rwx pool=ssd'
## Doublecheck it has correct access
ceph auth list
Create a secret.xml file (generate UUID below - like with uuidgen):
cat > secret.xml <<EOF
<secret ephemeral='no' private='no'>
  <uuid>7a91dc24-b072-43c4-98fb-4b2415322b0f</uuid>
  <usage type='ceph'>
    <name>client.libvirt secret</name>
  </usage>
</secret>
EOF
Use it (on kvm host):
## Define secret
virsh secret-define secret.xml
## Add correct key
ceph auth get-key client.libvirt | tee client.libvirt.key
virsh secret-set-value 7a91dc24-b072-43c4-98fb-4b2415322b0f --base64 $(cat client.libvirt.key)
The last key is the key from your /etc/ceph/keyring.admin
cat /etc/ceph/keyring.admin
[client.admin]
key = AQDAD8JQOLS9IxAAbox00eOmlM1h5ZLGPxHGHw==
Online resizing of KVM images (rbd)
Resize the desired block-image (here going from 30GB -> 40GB)
qemu-img resize -f rbd rbd:sata/disk3 40G
> Image resized.
Find the attached target device:
virsh domblklist rbd-test
> Target     Source
> ------------------------------------------------
> vdb        sata/disk2-qemu-5g:rbd_cache=1
> vdc        sata/disk3:rbd_cache=1
> hdc        -
Then use virsh to tell the guest that the disk has a new size:
virsh blockresize --domain rbd-test --path "vdc" --size 40G
> Block device 'vdc' is resized
Check raw rbd info
rbd --pool sata info disk3
> rbd image 'disk3':
>     size 40960 MB in 10240 objects
>     order 22 (4096 KB objects)
>     block_name_prefix: rb.0.13fb.23353e97
>     parent: (pool -1)
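The figures in the rbd info output can be checked by hand: the order is the base-2 logarithm of the object size in bytes, and the object count is the image size divided by that. A small sketch of the arithmetic follows; rbd_object_count is a name made up for this example.

```shell
# Number of objects backing an image of $1 MB with object order $2.
rbd_object_count() {
    object_bytes=$(( 1 << $2 ))            # order 22 -> 4194304 bytes (4096 KB)
    size_bytes=$(( $1 * 1024 * 1024 ))
    # Round up so a partially filled last object still counts.
    echo $(( (size_bytes + object_bytes - 1) / object_bytes ))
}

rbd_object_count 40960 22   # the 40GB image from the output above
```

For 40960 MB at order 22 this gives 10240, matching the "size 40960 MB in 10240 objects" line.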
Make sure you can see the change from dmesg (Guest should see the new size change).
dmesg
> [...]
> [75830.538557] vdb: detected capacity change from 118111600640 to 123480309760
> [...]
Then extend the partition. If it is a simple data volume, you can just run fdisk, remove the old partition, create a new one and accept the default values for start/end (note: this only applies to disks whose partition holds nothing else!)
ext3/ext4
Write the partition, fdisk -l to doublecheck the size, then remount the partition (the partition from above is mounted as a data dir under vdb1 in my case):
mount -o remount,rw /dev/vdb1
Check your fstab to make sure you get the correct options for the remount.
Afterwards, call the resize2fs:
resize2fs /dev/vdb1
> resize2fs 1.42.5 (29-Jul-2012)
> Filesystem at /dev/vdb1 is mounted on /home/mirroruser/mirror; on-line resizing required
> old_desc_blocks = 7, new_desc_blocks = 7
> The filesystem on /dev/vdb1 is now 28835584 blocks long.
Double-check via df -h or the like.
XFS
xfs_growfs /share
> data blocks changed from 118111600640 to 123480309760
All done
Add/move journal in running cluster
Before starting this, it will be a good idea to make sure the cluster does not automark OSD's out if they've been down for default 300s. You can do that by issuing:
ceph osd set nodown
I'll assume we want to update node #1, which has OSDs 0, 1 and 2, to put a 10GB journal on SSD. Each OSD currently holds its own 512MB journal.
Assuming we'll be mounting an SSD here: /srv/ceph/journal
-- This will then hold all journals as /srv/ceph/journal/osd$id/journal
A much better way is to create the above as partitions rather than files and use those instead - I'll show it inline:
# Relevant ceph.conf options
-- existing setup --
[osd]
osd data = /srv/ceph/osd$id
osd journal = /srv/ceph/osd$id/journal
osd journal size = 512
# Stop the OSDs:
/etc/init.d/ceph stop osd.0
/etc/init.d/ceph stop osd.1
/etc/init.d/ceph stop osd.2
# Flush the journal:
ceph-osd -i 0 --flush-journal
ceph-osd -i 1 --flush-journal
ceph-osd -i 2 --flush-journal
# Now update ceph.conf - this is very important or you'll just recreate the journal on the same disk again
-- change to [filebased journal] --
[osd]
osd data = /srv/ceph/osd$id
osd journal = /srv/ceph/journal/osd$id/journal
osd journal size = 10000
-- change to [partitionbased journal (journal in this case would be on /dev/sda2)] --
[osd]
osd data = /srv/ceph/osd$id
osd journal = /dev/sda2
osd journal size = 0
# Create new journal on each disk
ceph-osd -i 0 --mkjournal
ceph-osd -i 1 --mkjournal
ceph-osd -i 2 --mkjournal
# Done, now start all OSDs again
/etc/init.d/ceph start osd.0
/etc/init.d/ceph start osd.1
/etc/init.d/ceph start osd.2
If you set your cluster to not mark OSD's down, remember to remove it!
ceph osd unset nodown
Enjoy your new faster journal!
More info here (source of the above):
Prevent OSD from being marked down
This could be the case while adding new OSD/nodes and some osd keeps being marked down
ceph osd set nodown
Check it is set by issuing:
ceph osd dump | grep flags
> flags no-down
When done, remember to unset it or no OSD will ever get marked down!
ceph osd unset nodown
Doublecheck with ceph osd dump | grep flags
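Forgetting to unset the flag is the main risk here. A minimal sketch of a wrapper that sets a flag, runs one command and always unsets the flag again is shown below. The name with_osd_flag and the CEPH variable are assumptions made for this example (CEPH exists only so the sequence can be dry-run with CEPH=echo); only "ceph osd set" and "ceph osd unset" come from the text above.

```shell
# CEPH is a hypothetical override for illustration: CEPH=echo gives a dry run.
CEPH=${CEPH:-ceph}

# Set a cluster flag, run the given command, and always unset the flag afterwards.
# The body is a subshell, so the EXIT trap fires when the function finishes,
# even if the wrapped command fails.
with_osd_flag() (
    flag=$1; shift
    "$CEPH" osd set "$flag" || exit 1
    trap '"$CEPH" osd unset "$flag"' EXIT
    "$@"
)

# Example (not executed here):
#   with_osd_flag nodown /etc/init.d/ceph start osd.12
```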
OSD optimizations
A few things have helped make my cluster more stable:
Increase osd timeout
I've found the existing 30sec rule is sometimes too little for my 4disk lowend system.
Set the timeout accordingly (try and run debug first to determine if this is really the case).
osd op thread timeout = 60
Adjust transaction commit size
Default value is 300, which is a bit high for highly stressed system/lowcpu
It is adjustable inside the [osd] section:
[osd]
...
## Added new transactional size as default value of 300 is too much
osd target transaction size = 50
...
Change number of disks which can be in recovery
Since my 'cluster' is just 1 machine with 6 OSDs, having multiple recoveries running in parallel will effectively kill it, so I've adjusted these down from the default values:
## Just 2 OSDs backfilling kills a server
osd max backfills = 1
osd recovery max active = 1
More info on all the various settings here:
- http://ceph.com/docs/master/rados/configuration/osd-config-ref/#recovery
- http://ceph.com/docs/master/rados/configuration/osd-config-ref/#backfilling
Adjust kernel values
Adjust nr_requests in the queue (staying in mem - default is 128)
echo 1024 > /sys/block/sdb/queue/nr_requests
Change scheduler as per Inktank blogpost:
Full cluster - how to recover
- Add more disks
- Try lowering the weight of the full disk
- If everything fails, raise the full ratio and immediately change the weight of the disk
ceph pg set_full_ratio 0.98
- If you're really desperate, lower the replication level
ceph osd pool set <poolname> size existing_size-1
ceph osd pool set <poolname> min_size existing_min_size-1
- Keep in mind this is very dangerous if you hit 1 as min_size; then if any disk fails you'll lose all data!
Alternatively, if you want to change the default values (0.85 as near_full, 0.95 as full), you can add this to the mon section of your ceph.conf:
[mon]
...
mon osd nearfull ratio = 0.92
mon osd full ratio = 0.98
...
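To see how a given disk usage relates to these two thresholds, here is a small sketch that classifies a usage percentage. The function name classify_usage is made up for this example, it works on whole percentages, and treating "at or above the ratio" as the trigger is an assumption rather than something confirmed from the Ceph source.

```shell
# Classify a usage percentage against nearfull/full thresholds (whole percent).
# Defaults mirror the 0.85 / 0.95 values mentioned above.
classify_usage() {
    used=$1; nearfull=${2:-85}; full=${3:-95}
    if [ "$used" -ge "$full" ]; then
        echo full
    elif [ "$used" -ge "$nearfull" ]; then
        echo nearfull
    else
        echo ok
    fi
}

classify_usage 96          # with the defaults
classify_usage 96 92 98    # with the raised ratios from the [mon] example
```

The second call shows why raising the ratios buys some room: the same 96% disk drops from full back to nearfull.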
Losing 2 nodes in a 3 node system
This will cause the cluster to halt because of a new feature. Explained by Greg Farnum on the Ceph mailing list:
This took me a bit to work out as well, but you've run afoul of a new post-argonaut feature intended to prevent people from writing with insufficient durability. Pools now have a "min size" and PGs in that pool won't go active if they don't have that many OSDs to write on. The clue here is the "incomplete" state. You can change it with "ceph osd pool foo set min_size 1", where "foo" is the name of the pool whose min_size you wish to change (and this command sets the min size to 1, obviously). The default for new pools is controlled by the "osd pool default min size" config value (which you should put in the global section). By default it'll be half of your default pool size. So in your case your pools have a default size of 3, and the min size is (3/2 = 1.5 rounded up), and the OSDs are refusing to go active because of the dramatically reduced redundancy. You can set the min size down though and they will go active.
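The rounding described in the quote ("half of your default pool size", rounded up) can be written out as a one-liner. This is only a sketch of the rule as stated in the mail; default_min_size is a name made up for this example.

```shell
# Default min_size as described in the mail above: half the pool size, rounded up.
default_min_size() {
    echo $(( ($1 + 1) / 2 ))
}

for size in 1 2 3 4; do
    echo "size=$size min_size=$(default_min_size "$size")"
done
```

For size 3 this gives 2, which is why a 3 node cluster with two nodes down stops serving IO until min_size is lowered to 1.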
Import LVM image into rbd
Do give thought about which version of rbd to use! Format 2 gives additional [great] benefits over format 1
- http://ceph.com/docs/master/man/8/rbd/#parameters
- http://ceph.com/docs/master/man/8/rbd/#commands (check for "This requires format 2")
I've successfully snapshotted lots of running systems, but the catch is that any disk IO not yet flushed will not make it into the snapshot. You can use something like fsfreeze to first freeze the guest filesystem (ext4, btrfs, xfs). It basically flushes the FS state to disk and blocks any further write access while still allowing reads.
Turned out to be really simple
I started by grabbing a snapshot of a running lvm
lvcreate -L25G -s -n snap-webserver /dev/storage/webserver
And then just feeding that snapshot directly into rbd:
rbd import /dev/storage/snap-webserver sata/webserver
## Format 2:
rbd --format 2 import /dev/storage/snap-webserver sata/webserver
## Format 2 for cuttlefish+:
rbd --image-format 2 import /dev/storage/snap-webserver sata/webserver
Here sata/webserver means pool sata. The webserver image will be automatically created.
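The fsfreeze idea mentioned earlier can be combined with the lvcreate step so the guest is always thawed again. The sketch below assumes the guest is reachable over ssh and that the frozen mount point is a data volume (freezing the root filesystem over ssh may hang the session). The function name, the SSH and LVCREATE variables, and root@guest with /data are all made up for this example.

```shell
# SSH and LVCREATE are hypothetical overrides so the sequence can be dry-run.
SSH=${SSH:-ssh}
LVCREATE=${LVCREATE:-lvcreate}

# Freeze a filesystem inside the guest, snapshot the backing LV on the host,
# and always unfreeze the guest again (EXIT trap of the subshell body).
frozen_guest_snapshot() (
    guest=$1; mountpoint=$2; origin=$3; snap_name=$4; snap_size=$5
    "$SSH" "$guest" fsfreeze --freeze "$mountpoint" || exit 1
    trap '"$SSH" "$guest" fsfreeze --unfreeze "$mountpoint"' EXIT
    "$LVCREATE" -L"$snap_size" -s -n "$snap_name" "$origin"
)

# Example (not executed here):
#   frozen_guest_snapshot root@guest /data /dev/storage/webserver snap-webserver 25G
```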
Import LVM image via ssh to remote host
## Using format 2 rbd images
dd if=/dev/e0.0/snap-webhotel | pv | ssh root@remote-server 'rbd --image-format 2 import - sata/webserver'
This will run and show the total amount of data copied as well as the speed.
Import ceph image to LVM
Is pretty much just the reverse:
## Optional; create a snapshot first
rbd snap create sata/webserver@snap1
## Transfer image
rbd export sata/webserver@snap1 - | pv | ssh root@remote-server 'dd of=/dev/lvm-storage/webserver'
Import from ceph cluster to another cluster
Prob. lots of ways to do it; I did it the usual way with import/export
ssh root@remote-server 'rbd export sata/webserver -' | pv | rbd --image-format 2 import - sata/webserver
In both cases '-' stands for a standard stream: stdout for export, stdin for import.
Create pool with pg_num
It seems to be important to create pools with enough placement groups, and to pick a number that is a power of 2.
So 2048, 4096 etc.
ceph osd pool create sata 4096
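A small sketch for picking such a number: round a target up to the next power of two. The starting point used here (about 100 placement groups per OSD, divided by the replica count) is a commonly quoted rule of thumb and is an assumption of this example, as are the function names.

```shell
# Round a positive integer up to the next power of two.
next_pow2() {
    n=1
    while [ "$n" -lt "$1" ]; do
        n=$(( n * 2 ))
    done
    echo "$n"
}

# Rule of thumb (assumption): OSDs * 100 / replicas, rounded up to a power of two.
suggest_pg_num() {
    next_pow2 $(( $1 * 100 / $2 ))
}

suggest_pg_num 60 3   # hypothetical cluster: 60 OSDs, 3 replicas
```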
Mapping RBD volumes with kernel rbd-fuse client
Mapping volumes
It is recommended to use a 3.6+ kernel for this!
/sbin/modprobe rbd
/usr/bin/rbd map --pool sata webserver-data --id admin --keyring /etc/ceph/keyring.admin
Then it will show up as device /dev/rbd[X]
Then you can format it, partition it or do what you want - and then mount it like a normal device.
Unmapping volumes
If you want to remove a mapping for a device, issue:
## First unmount from system, then:
rbd unmap /dev/rbd[X]
Flashcache
- http://www.sebastien-han.fr/blog/2012/11/15/make-your-rbd-fly-with-flashcache/
- Options about how to use/mount the flashcache ( https://github.com/facebook/flashcache/blob/master/doc/flashcache-sa-guide.txt )
- Writethrough - safest, all writes are cached to ssd but also written to disk immediately. If your ssd has slower write performance than your disk (likely for early generation SSDs purchased in 2008-2010), this may limit your system write performance. All disk reads are cached (tunable).
- Writearound - again, very safe, writes are not written to ssd but directly to disk. Disk blocks will only be cached after they are read. All disk reads are cached (tunable).
- Writeback - fastest but less safe. Writes only go to the ssd initially, and based on various policies are written to disk later. All disk reads are cached (tunable).
Various stuff
Inflight IO
- You can peek at in-flight IO (and other state) with
cat /sys/kernel/debug/ceph/*/osdc
- Ask via socket admin call
- ceph --admin-daemon /var/run/ceph/ceph-osd.NN.asok dump_ops_in_flight
NFS over rbd (with pacemaker in 2x setup)
Backup VM's using rbd (req. format 2)
Bring down a server/rack/whatever for maint and stop ceph from redistributing data
Source: http://ceph.com/docs/master/rados/operations/troubleshooting-osd/#stopping-w-out-rebalancing
## Set the cluster to not mark OSDs out
ceph osd set noout
## Once the cluster is set to noout, you can begin stopping the OSDs within the failure domain that requires maintenance work.
service ceph stop osd.{num}
## Note: placement groups within the OSDs you stop will become degraded while you are addressing issues within the failure domain.
## Once you have completed your maintenance, restart the OSDs.
service ceph start osd.{num}
## Finally, unset noout on the cluster.
ceph osd unset noout
RBD
Using rbd
Creating a 50GB share in sata named rpmbuilder-ceph could be done like:
rbd create sata/rpmbuilder-ceph [--format 2*] --size 51200
qemu-img can also create rbd images; I fancy it works the same, but am not entirely sure.
* By default this creates format 1 rbd block devices; format 2 has to be specified explicitly and brings a lot of new utility for handling images (flattening, cloning etc). Use --format 2 to create format 2 images.
Using KVM
qemu-img create -f rbd rbd:sata/rpmbuilder-ceph 50G
QEMU / KVM
Stalled guests during recovery/calc. pg's
Note: Qemu 1.4.2+ has this patch incorporated, so if you're using that or anything newer, you're all set.
This is a known problem caused by qemu flushing synchronously ( http://git.qemu.org/?p=qemu.git;a=commit;h=dc7588c1eb3008bda53dde1d6b890cd299758155 ).
There is a patch ready here: http://git.qemu.org/?p=qemu.git;a=commitdiff;h=dc7588c1eb3008bda53dde1d6b890cd299758155
More on this from this thread: http://www.spinics.net/lists/ceph-users/msg01352.html
Compiling new qemu-kvm on ubuntu 13.04 patched with async patch
I suggest using a normal user to build
Install prereqs:
## Get and install deps
sudo apt-get install build-essential checkinstall
## I'm not sure if these are even needed
sudo apt-get install git-core mercurial
cd
mkdir build
cd build
sudo apt-get build-dep qemu-kvm
## Grab source and patch it
apt-get source qemu-kvm
cd qemu-1.4.0+dfsg/debian/patches/
wget "http://git.qemu.org/?p=qemu.git;a=patch;h=dc7588c1eb3008bda53dde1d6b890cd299758155" -O 0020-async-rbd.patch
patch < 0020-async-rbd.patch
## Build it
dpkg-buildpackage -rfakeroot -b
It will take some time to build. Afterwards you'll have a bunch of .deb packages in the build dir.
Just install the qemu-kvm_1.4.0+dfsg-1expubuntu4_amd64.deb and you should be good to try out the new async patch.
Download patched kvm
If you don't want the hassle of compiling yourself and you trust complete strangers with .deb package you can download the one I made here (only amd-64 bit - and only 1.4.0 stock ubuntu 13.04).
http://utils.skarta.net/qemu-kvm/qemu-kvm_1.4.0+dfsg-1expubuntu4_amd64.deb
Manual repair of PG
In Cuttlefish and later the repair procedure is much more intelligent and should be able to safely find the correct pg - so it should be safe to use instead
This is copied from the ceph-users mailing list
First part of the mail:
Some scrub errors showed up on our cluster last week. We had some issues with host stability a couple weeks ago; my guess is that errors were introduced at that point and a recent background scrub detected them. I was able to clear most of them via "ceph pg repair", but several remain. Based on some other posts, I'm guessing that they won't repair because it is the primary copy that has the error. All of our pools are set to size 3 so there _ought_ to be a way to verify and restore the correct data, right?
Below is some log output about one of the problem PG's. Can anyone suggest a way to fix the inconsistencies?
2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.00000000005b/5//19 digest 4289025870 != known digest 4190506501
2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.00000000005b/5//19 digest 4289025870 != known digest 4190506501
2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 19.1b repair 0 missing, 1 inconsistent objects
2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 19.1b repair 2 errors, 2 fixed
2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.00000000005b/5//19 digest 4289025870 != known digest 4190506501
2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.00000000005b/5//19 digest 4289025870 != known digest 4190506501
2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 19.1b deep-scrub 0 missing, 1 inconsistent objects
2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 19.1b deep-scrub 2 errors
Actual solution proposed by an Inktank employee:
You need to find out where the third copy is. Corrupt it. Then let repair copy the data from a good copy.
$ ceph pg map 19.1b
You should see something like this:
osdmap e158 pg 19.1b (19.1b) -> up [13, 22, xx] acting [13, 22, xx]
The osd xx that is NOT 13 or 22 has the corrupted copy. Connect to the node that has that osd. Find in the mount for osd xx your object with name "rb.0.6989.2ae8944a.00000000005b"
$ find /var/lib/ceph/osd/ceph-xx -name 'rb.0.6989.2ae8944a.00000000005b*' -ls
201326612 4 -rw-r--r-- 1 root root 255 May 22 14:11 /var/lib/ceph/osd/ceph-xx/current/19.1b_head/rb.0.6989.2ae8944a.00000000005b__head_XXXXXXXX__0
I would stop osd xx, first. In this case we find the file is 255 bytes long. In order to make sure this bad copy isn't used, let's make the file 1 byte longer.
$ truncate -s 256 /var/lib/ceph/osd/ceph-xx/current/19.1b_head/rb.0.6989.2ae8944a.00000000005b__head_XXXXXXXX__0
Restart osd xx. Not sure what command does that on your platform.
Verify that OSDs are all running. Shows all osds are up and in.
$ ceph -s | grep osdmap
osdmap e6: 6 osds: 6 up, 6 in
$ ceph pg repair 19.1b
instructing pg 19.1b on osd.13 to repair
Change read ahead buffer to improve read speeds
Reads go directly to the backing storage, not through the SSD (or whatever you have in front), so they will be as slow as reading from all your distributed devices.
Change it to what the majority of your ceph object size is (4MB default)
You should be able to push that performance a bit with changing
echo 4096 > /sys/block/vda/queue/read_ahead_kb
Change filestore timers to speed up spinning disks
Make sure you understand the implications of doing this. It can (and most likely will) be a huge performance increase to change the default timing of when the journal flushes to disk (either when it is half-full or when the max time has elapsed (5s)). By raising this you at least let the spinning disk bundle up writes instead of handling small bursts.
## Change the min and max time for osd.4 - use * if you are certain about these values.
## People on the mailing list report great success with 30-100s. Personally I've found 20s to be the sweet spot in our setup.
ceph tell osd.4 injectargs '--filestore_max_sync_interval 20'
ceph tell osd.4 injectargs '--filestore_min_sync_interval 2'
Replace disk with ceph-deploy
ceph osd set noout
stop ceph-osd id=12
... Replace the drive, and once done ...
start ceph-osd id=12
ceph osd unset noout
Define mount options using ceph-deploy
The mounting is actually done by "ceph-disk", which can also run from a udev rule. It gets options from the ceph configuration option "osd mount options {fstype}", which you can set globally or per-daemon as with any other ceph option.
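As an illustration of that option, the sketch below prints a ceph.conf fragment for one filesystem type. The function name is made up, and the xfs option string is only an example value, not a recommendation from the source.

```shell
# Print a ceph.conf fragment that sets mount options for one filesystem type.
# $1 = fstype (e.g. xfs), $2 = comma-separated mount options.
mount_options_fragment() {
    printf '[osd]\nosd mount options %s = %s\n' "$1" "$2"
}

mount_options_fragment xfs rw,noatime,inode64
```

The output can be merged into the [osd] section of ceph.conf (or a per-daemon section) before the disk is prepared.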
Ceph on debian wheezy
You have to either find another package or build a new one with rbd support. The latter is detailed here
Ceph cache pool (ssd cache)
Not validated; grabbed from ceph mailing list
ceph osd tier add satapool ssdpool
ceph osd tier cache-mode ssdpool writeback
ceph osd pool set ssdpool hit_set_type bloom
ceph osd pool set ssdpool hit_set_count 1
## In this example 80-85% of the cache pool is equal to 280GB
ceph osd pool set ssdpool target_max_bytes $((280*1024*1024*1024))
ceph osd tier set-overlay satapool ssdpool
ceph osd pool set ssdpool hit_set_period 300
ceph osd pool set ssdpool cache_min_flush_age 300 # 5 minutes
ceph osd pool set ssdpool cache_min_evict_age 1800 # 30 minutes
ceph osd pool set ssdpool cache_target_dirty_ratio .4
ceph osd pool set ssdpool cache_target_full_ratio .8
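The target_max_bytes value is just a share of the cache pool's usable capacity, expressed in bytes. A small sketch of that calculation follows; the function names and the 350 GiB capacity are assumptions chosen so that 80% comes out at the 280GB used in the example.

```shell
# Convert GiB to bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# target_max_bytes as a percentage ($2) of usable cache capacity in GiB ($1).
# Integer arithmetic: the GiB figure is truncated before conversion.
cache_target_bytes() {
    gib_to_bytes $(( $1 * $2 / 100 ))
}

cache_target_bytes 350 80   # hypothetical 350 GiB cache pool at 80%
```

The result, 300647710720, is the same number that $((280*1024*1024*1024)) expands to in the command list.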
CephFS
From 0.86 (giant RC) cephfs is now in semi-prod testing phase.
How to test with samba:
On Mon, 13 Oct 2014, Eric Eastman wrote:
> I would be interested in testing the Samba VFS and Ganesha NFS integration
> with CephFS. Are there any notes on how to configure these two interfaces
> with CephFS?
For samba, based on https://github.com/ceph/ceph-qa-suite/blob/master/tasks/samba.py#L106 I think you need something like
[myshare]
path = /
writeable = yes
vfs objects = ceph
ceph:config_file = /etc/ceph/ceph.conf
Not sure what the ganesha config looks like. Matt and the other folks at cohortfs would know more.
Putting journal on another device when using ceph-deploy
From ceph-users mailing list: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-August/042085.html
Create N partitions on your SSD for your N OSDs, then:
ceph osd set noout
sudo service ceph stop osd.$ID
ceph-osd -i $ID --flush-journal
rm -f /var/lib/ceph/osd/<osd-id>/journal
ln -s /dev/<ssd-partition-for-your-journal> /var/lib/ceph/osd/<osd-id>/journal
ceph-osd -i $ID --mkjournal
sudo service ceph start osd.$ID
ceph osd unset noout
Recently I was in need of a robust HA NFS setup to share out to a lot of vmware hosts. These hosts would each run 3-5 VMs with 10GB mem, 4 cores and 400GB storage. Each VM would need lots of bandwidth and high disk IO - both iops and throughput.
So a quick test was made with a standalone NFS server. With 20-25 VMs it worked fine and we decided to move it to a ceph solution. We tried ceph+kvm (which would then share out via NFS), but it was too slow. Not sure exactly where the issue was, but after a lot of tweaking we gave up.
Since these VMs are regularly re-instantiated via salt (every 2-3 weeks a new image is uploaded and started) we decided the storage wasn't super critical, so we tested out cephfs. And now we had the performance we wanted. The last bit of testing involved having multiple cephfs (mds) servers (one master, at least one in hot-standby) and HA NFS. It was surprisingly easy:
Install prereq
- Install keepalived
- Install NFS server
aptitude install nfs-kernel-server
aptitude install keepalived
Setup keepalived
A basic setup was required (we use this to switch the floating ip of the current NFS-master):
~# cat /etc/keepalived/keepalived.conf
vrrp_script chk_nfsd {           # Requires keepalived-1.1.13
    script "killall -0 nfsd"     # cheaper than pidof
    interval 2                   # check every 2 seconds
}
vrrp_instance VI7 {
    interface bond0
    state MASTER
    nopreempt
    virtual_router_id 222
    priority 101                 # 101 on master, 100 on backup
    virtual_ipaddress {
        10.45.8.191
    }
    track_script {
        chk_nfsd
    }
    notify /usr/local/bin/nfs_statechange.sh
}
~# cat /usr/local/bin/nfs_statechange.sh
#!/bin/bash
TYPE=$1
NAME=$2
STATE=$3
case $STATE in
    "MASTER") /etc/init.d/nfs-kernel-server start
              exit 0
              ;;
    "BACKUP") /etc/init.d/nfs-kernel-server status
              exit 0
              ;;
    "FAULT")  /etc/init.d/nfs-kernel-server restart
              exit 0
              ;;
    *)        echo "unknown state"
              exit 1
              ;;
esac
The statechange setup was altered quite a bit in testing and I am pretty sure it is no longer needed. Since we are in production I need to test on another system first to be sure though.
Setup NFS to be HA
Surprisingly, what took the most time was adding a tiny bit of info to the exports file. Without this change, one would get a "Stale NFS handle" when NFS switched.
/share/esxi 10.0.0.0/8(rw,no_root_squash,insecure,async,no_subtree_check,fsid=42)
Be sure to add a fsid=XXX on _both_ NFS-servers - otherwise they'll serve using different fsid and the client will be confused.
Lastly, and here we will rely on cephfs, we need to share the NFS state between the hosts. Mount the cephfs somewhere and symlink /var/lib/nfs to it on both nodes. Now both NFS servers will use the same state directory. This is my setup:
~# ls -l /var/lib/nfs
lrwxrwxrwx 1 statd nogroup 15 Mar 16 18:26 /var/lib/nfs -> /share/esxi/nfs
I have mounted /share/esxi as a cephfs filesystem and this is again shared out via NFS.
With this relatively simple setup we have tested how it works in HA mode. While writing to the system at 2gbit we could shut down ceph-mds (cephfs) and it would automatically activate the hot standby with no loss in performance. When shutting down the master NFS it would switch to the other one in ~3-4 sec. During this time no writes/reads are possible, but the VMs using the storage just wait and resume as if nothing happened. We tested this with 2gbit writes/reads as well, and besides the 3-4s gap with no disk IO it was completely transparent to all the guests (we had 20 VMs running on it at the time).
Monitoring ceph
Zabbix
Using systemd to control ceph
## Get a list of services
systemctl status ceph-osd@* -l
## Control specifics:
systemctl status ceph-osd@0.service
Verify crush rules before applying
$ crushtool -i crushmap.new --test --rule 6 --num-rep 4 --show-utilization
$ crushtool -i crushmap.new --test --rule 6 --num-rep 4 --show-mappings
Notable updates in hammer 0.94.10
* ceph-objectstore-tool and ceph-monstore-tool now enable user to rebuild the monitor database from OSDs. (This feature is especially useful when all monitors fail to boot due to leveldb corruption.)
* In RADOS Gateway, it is now possible to reshard an existing bucket's index using an off-line tool. Usage:
$ radosgw-admin bucket reshard --bucket=<bucket_name> --num_shards=<num_shards>
This will create a new linked bucket instance that points to the newly created index objects. The old bucket instance still exists and currently it's up to the user to manually remove the old bucket index objects. (Note that bucket resharding currently requires that all IO (especially writes) to the specific bucket is quiesced.)
Ceph improvements on SSD-pools
> On 28 April 2017 at 19:14, Sage Weil <sweil@redhat.com> wrote:
>
> Hi everyone,
>
> Are there any osd or filestore options that operators are tuning for
> all-SSD clusters? If so (and they make sense) we'd like to introduce them
> as defaults for ssd-backed OSDs.
>
osd_op_threads and osd_disk_threads. They can be increased to 2x or 4x to get performance improvements on SSDs.
> BlueStore already has different hdd and ssd default values for many
> options that it chooses based on the type of device; we'd like to do that
> for other options in either filestore or the OSD if it makes sense.
>
> Thanks!
> sage
Enable monitoring via mgr-daemon (luminous+)
ceph mgr module enable dashboard
## Then visit <server>:7000 for dashboard