Ceph - howto, rbd, lvm, cluster

Revision as of 21:47, 19 January 2013


Install ceph

wget -q -O- https://raw.github.com/ceph/ceph/master/keys/release.asc | apt-key add -
echo deb http://ceph.com/debian/ $(lsb_release -sc) main | tee /etc/apt/sources.list.d/ceph.list
apt-get update && apt-get install ceph


Ceph intro videos

https://www.youtube.com/watch?v=UXcZ2bnnGZg
http://www.youtube.com/watch?v=BBOBHMvKfyc&feature=g-high

Rebooting node stops everything / Set number of replicas across all nodes

Make sure that the pool's minimum replica count (min_size) is set to nodes-1.

ceph osd pool set <poolname> min_size 1

The remaining node[s] will then start up even if only 1 node is left and everything else is down.

Keep in mind that this is risky: while only one node is up, there are no replicas of the data.

More info here: http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10481

Add disks (OSD) or entire nodes

Prepare the disk as usual (a partition or the entire disk) and format it with the filesystem of your choice. Add it to fstab and mount it. Add the OSD to /etc/ceph/ceph.conf and replicate the new conf to the other nodes.

Start the disk. I'm assuming we've added osd.12 (sdd) on ceph1 here.

## Prepare disk first, create partition and format it
<insert parted oneliner>
mkfs.xfs -f /dev/sdd1

## Auth stuff to make sure that the OSD is accepted into the cluster:
mkdir /srv/ceph/osd12
ceph-osd -i 12 --mkfs --mkkey
ceph auth add osd.12 osd 'allow *' mon 'allow rwx' -i /etc/ceph/keyring.osd.12

## Create disk and start it
ceph osd create osd.12
/etc/init.d/ceph start osd.12

## Add it to the cluster and allow replication based on the CRUSH map.
ceph osd crush set 12 osd.12 1.0 pool=default rack=unknownrack host=ceph1

By changing the pool/rack/host values in the line above, you can place your disk/node where you want.

If you add a new host entry, it will be the same as adding a new node (with the disk).

Check that it is in the right place with:

ceph osd tree

More info here:

Delete pools/OSD

Make sure you have the right disk, run

ceph osd tree

to get an overview.

If the OSD is a member of a pool and/or active

Here we'll delete osd.5

## Mark it out
ceph osd out 5

## Wait for data migration to complete (ceph -w), then stop it
service ceph -a stop osd.5

## Now it is marked out and down

Delete the OSD:

## If deleting from active stack, be sure to follow the above to mark it out and down
ceph osd crush remove osd.5

## Remove auth for disk
ceph auth del osd.5

## Remove disk
ceph osd rm 5

## Remove from ceph.conf and copy new conf to all hosts
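
The removal steps above can be collected into one small dry-run sketch. This is only an illustration: osd_remove_plan is a made-up helper name, it prints the commands from this section for a given OSD id and runs none of them, so you can read the plan before pasting it.

```shell
# Hypothetical helper: print (do not run) the removal steps for one OSD id.
osd_remove_plan() {
    id="$1"
    # Accept only a plain numeric id such as 5 (not "osd.5").
    case "$id" in
        ''|*[!0-9]*) echo "usage: osd_remove_plan <numeric-osd-id>" >&2; return 1 ;;
    esac
    echo "ceph osd out $id"
    echo "# wait for data migration to complete (watch with: ceph -w)"
    echo "service ceph -a stop osd.$id"
    echo "ceph osd crush remove osd.$id"
    echo "ceph auth del osd.$id"
    echo "ceph osd rm $id"
}

osd_remove_plan 5
```

Removing the entry from ceph.conf and copying the new conf to all hosts is still a manual step afterwards.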

Add Monitor node/service

Install ceph, add keys, ceph.conf and host files, and prepare storage to hold the monitor maps.

Then add the monitor to the cluster. To keep quorum, run either 1 monitor or 3+, not 2. The example below adds monitor mon.21 with IP 192.168.0.68.

cd /tmp; mkdir add_monitor; cd add_monitor
ceph auth get mon. -o key
> exported keyring for mon.

ceph mon getmap -o map
> got latest monmap

ceph-mon -i 21 --mkfs --monmap map --keyring key 
> ceph-mon: created monfs at /srv/ceph/mon21 for mon.21

ceph mon add 21 192.168.0.68
> port defaulted to 6789added mon.21 at 192.168.0.68:6789/0

/etc/init.d/ceph start mon.21


Add the info to the ceph.conf file:

[mon]
...
[mon.21]
    host = ceph2-mon
    mon addr = 192.168.0.68:6789
...
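
The "1 or 3+, not 2" advice follows from monitors needing a strict majority to form a quorum. A minimal sketch of that arithmetic (the function names are just for illustration, they are not ceph commands):

```shell
# A quorum needs a strict majority of all configured monitors.
mon_majority() { echo $(( $1 / 2 + 1 )); }

# Number of monitors that can fail while a majority still survives.
mon_tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
    echo "monitors=$n majority=$(mon_majority "$n") tolerated_failures=$(mon_tolerated_failures "$n")"
done
```

With 2 monitors the majority is 2, so losing either one stops the cluster, which is no better than running a single monitor.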

Replicating from OSD-based to replication across hosts in a ceph cluster

More info here: http://jcftang.github.com/2012/09/06/going-from-replicating-across-osds-to-replicating-across-hosts-in-a-ceph-cluster/

Replication - see current level per OSD

ceph osd dump

CRUSH maps

Redistributing, [de]assembling and fine-tuning; more info here:

http://hpc.admin-magazine.com/Articles/RADOS-and-Ceph-Part-2

KVM - add disk

Per host:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <source protocol="rbd" name="test/disk2-qemu-5g:rbd_cache=1">
        <host name='192.168.0.67' port='6789'/>
        <host name='192.168.0.68' port='6789'/>
      </source>
      <auth username='admin' type='ceph'>
        <secret type='ceph' uuid='7a91dc24-b072-43c4-98fb-4b2415322b0f'/>
      </auth>
      <target dev='vdb' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>

Per pool:

<pool type='rbd'>
   <name>rbd</name>
   <uuid>f959641f-f518-4505-9e85-17d994e2a399</uuid>
   <source>
     <host name='192.168.0.67' port='6789'/>
     <host name='192.168.0.68' port='6789'/>
     <host name='192.168.0.69' port='6789'/>
     <name>test</name>
     <auth username='admin' type='ceph'>
       <secret type='ceph' uuid='7a91dc24-b072-43c4-98fb-4b2415322b0f'/>
     </auth>
   </source>
</pool>

KVM - add secret/auth for use with ceph

Create a secret.xml file:

<secret ephemeral='no' private='no'>
   <uuid>7a91dc24-b072-43c4-98fb-4b2415322b0f</uuid>
   <usage type='ceph'>
     <name>admin</name>
   </usage>
</secret>

Use it:

virsh secret-define secret.xml
virsh secret-set-value 7a91dc24-b072-43c4-98fb-4b2415322b0f AQDAD8JQOLS9IxAAbox00eOmlM1h5ZLGPxHGHw==

The last argument is the key from your /etc/ceph/keyring.admin:

cat /etc/ceph/keyring.admin 
[client.admin]
        key = AQDAD8JQOLS9IxAAbox00eOmlM1h5ZLGPxHGHw==
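
To avoid copying the key by hand, it can be pulled out of the keyring file with awk. This is a sketch that assumes the keyring looks like the one shown above (a [section] line followed by a "key = ..." line); keyring_key is a hypothetical helper name.

```shell
# Hypothetical helper: print the key of one entity (default client.admin)
# from a keyring file laid out like the example above.
keyring_key() {
    file="$1"
    entity="${2:-client.admin}"
    awk -v section="[$entity]" '
        $1 == section                          { in_section = 1; next }
        /^\[/                                  { in_section = 0 }
        in_section && $1 == "key" && $2 == "=" { print $3; exit }
    ' "$file"
}

# Usage, with the secret UUID and keyring path used on this page:
# virsh secret-set-value 7a91dc24-b072-43c4-98fb-4b2415322b0f "$(keyring_key /etc/ceph/keyring.admin)"
```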

Online resizing of KVM images (rbd)

Resize the desired block-image (here going from 30GB -> 40GB)

qemu-img resize -f rbd rbd:sata/disk3 40G
> Image resized.

Find the attached target device:

virsh domblklist rbd-test
> Target     Source
> ------------------------------------------------
> vdb        sata/disk2-qemu-5g:rbd_cache=1
> vdc        sata/disk3:rbd_cache=1
> hdc        -

Then use virsh to tell the guest that the disk has a new size:

virsh blockresize --domain rbd-test --path "vdc" --size 40G
> Block device 'vdc' is resized

Check raw rbd info

rbd --pool sata info disk3
> rbd image 'disk3':
>        size 40960 MB in 10240 objects
>        order 22 (4096 KB objects)
>        block_name_prefix: rb.0.13fb.23353e97
>        parent:  (pool -1)
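
The numbers in the rbd info output can be cross-checked by hand: the object size is 2^order bytes, and the object count is the image size divided by the object size. A small sketch of that arithmetic (the helper names are made up for this example):

```shell
# Object size in KB for a given rbd "order" (object size is 2^order bytes).
rbd_object_kb() { echo $(( (1 << $1) / 1024 )); }

# Number of objects for an image of <size in MB> at a given <order>.
rbd_object_count() { echo $(( $1 * 1024 / $(rbd_object_kb "$2") )); }

rbd_object_kb 22            # 4096
rbd_object_count 40960 22   # 10240
```

These match the "order 22 (4096 KB objects)" and "40960 MB in 10240 objects" lines above.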

Check dmesg inside the guest to confirm that it sees the new size.

dmesg
> [...]
> [75830.538557] vdb: detected capacity change from 118111600640 to 123480309760
> [...]
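
dmesg reports the capacity in bytes, which is awkward to compare with the size you asked for. A tiny sketch for converting such values to whole GiB (bytes_to_gib is a hypothetical helper, and it assumes a shell with 64-bit arithmetic):

```shell
# Convert a byte count to whole GiB (1 GiB = 1073741824 bytes).
bytes_to_gib() { echo $(( $1 / 1073741824 )); }

bytes_to_gib 118111600640   # 110
bytes_to_gib 123480309760   # 115
```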

Then extend the partition. If it is a simple data volume, you can just run fdisk, remove the old partition, create a new one and accept the default values for start/end. (Note: this only applies to partitions which hold nothing else!)

Write the partition table, run fdisk -l to double-check the size, then remount the partition (the partition from above is mounted as a data dir under vdb1 in my case):

mount -o remount,rw /dev/vdb1

Check your fstab to make sure you get the correct options for the remount.

Afterwards, run resize2fs:

resize2fs /dev/vdb1
> resize2fs 1.42.5 (29-Jul-2012)
> Filesystem at /dev/vdb1 is mounted on /home/mirroruser/mirror; on-line resizing required
> old_desc_blocks = 7, new_desc_blocks = 7
> The filesystem on /dev/vdb1 is now 28835584 blocks long.

Double-check via df -h or the like.

Add/move journal in running cluster

I'll assume we want to update node #1, which has OSDs 0, 1 and 2, to put the journals on an SSD with a journal size of 10GB. Each journal currently resides on its own OSD disk and is 512MB.
I'll also assume the SSD is mounted at /srv/ceph/journal; it will then hold all journals as /srv/ceph/journal/osd$id/journal.
# Relevant ceph.conf options
.. existing setup ..
[osd]
    osd data = /srv/ceph/osd$id
    osd journal = /srv/ceph/osd$id/journal
    osd journal size = 512

# Mark all OSDs out. You can do this for an entire node or for a single disk at a time; the latter requires a journal setting under each individual OSD rather than a global OSD option.
ceph osd out 0
ceph osd out 1
ceph osd out 2

# Make sure they're marked out ('''ceph osd tree'''), then stop them:
/etc/init.d/ceph -a stop osd.0
/etc/init.d/ceph -a stop osd.1
/etc/init.d/ceph -a stop osd.2

# Flush the journal:
ceph-osd -i 0 --flush-journal
ceph-osd -i 1 --flush-journal
ceph-osd -i 2 --flush-journal

# Now update ceph.conf - '''this is very important or you'll just recreate journal on the same disk again'''
.. change to ..
[osd]
    osd data = /srv/ceph/osd$id
    osd journal = /srv/ceph/journal/osd$id/journal
    osd journal size = 10000

# Create new journal on each disk
ceph-osd -i 0 --mkjournal
ceph-osd -i 1 --mkjournal
ceph-osd -i 2 --mkjournal

# Done, now start all OSD again
/etc/init.d/ceph -a start osd.0
/etc/init.d/ceph -a start osd.1
/etc/init.d/ceph -a start osd.2

# Mark them in again
ceph osd in 0
ceph osd in 1
ceph osd in 2

Enjoy your new faster journal!
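
Since every step is repeated once per OSD, the whole procedure can be sketched as a loop. The helper below is hypothetical and only prints the commands for the OSD ids you pass in; nothing is executed. It uses the "stop osd.N" / "start osd.N" argument order seen in the other init script calls on this page.

```shell
# Hypothetical helper: print the journal move steps for the given OSD ids.
journal_move_plan() {
    for id in "$@"; do echo "ceph osd out $id"; done
    for id in "$@"; do echo "/etc/init.d/ceph -a stop osd.$id"; done
    for id in "$@"; do echo "ceph-osd -i $id --flush-journal"; done
    echo "# edit 'osd journal' and 'osd journal size' in ceph.conf before continuing"
    for id in "$@"; do echo "ceph-osd -i $id --mkjournal"; done
    for id in "$@"; do echo "/etc/init.d/ceph -a start osd.$id"; done
    for id in "$@"; do echo "ceph osd in $id"; done
}

journal_move_plan 0 1 2
```

Printing first makes it harder to forget the ceph.conf edit between the flush and the mkjournal step.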

More info here (source of the above):

Prevent OSD from being marked down

This can help while adding new OSDs/nodes if some OSDs keep being marked down.

ceph osd set nodown

Check it is set by issuing:

ceph osd dump | grep flags
> flags no-down

When done, remember to unset it or no OSD will ever get marked down!

ceph osd unset nodown

Doublecheck with ceph osd dump | grep flags
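
Because forgetting the unset is the dangerous part, one option is to wrap the work in a function that always unsets the flag afterwards. This is only a sketch: with_nodown is a made-up name, and it will not help if the shell itself is killed halfway through.

```shell
# Hypothetical wrapper: set nodown, run the given command, then always
# unset nodown again and pass on the command's exit status.
with_nodown() {
    ceph osd set nodown || return 1
    "$@" && status=0 || status=$?
    ceph osd unset nodown
    return "$status"
}

# Usage: with_nodown /etc/init.d/ceph start osd.12
```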

Increase osd timeout

I've found that the existing 30-second timeout is sometimes too short for my low-end 4-disk system.

Set the timeout accordingly in ceph.conf (run with debug logging first to determine whether this is really the cause).

osd op thread timeout = 60

Full OSD - how to recover

  • Add more disks
  • Try lowering the weight of the full disk
  • If everything else fails, raise the full ratio and immediately change the weight of the disk
    • ceph pg set_full_ratio 0.98

Alternatively, if you want to change the default values (0.85 as near_full, 0.95 as full), you can add this to the mon section of your ceph.conf:

[mon]
...
    mon osd nearfull ratio = 0.92
    mon osd full ratio = 0.98
...
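
To make the two thresholds concrete, here is a small sketch that classifies a usage percentage against them. osd_fill_state is a hypothetical helper that works on whole percentages, with the defaults mentioned above (85 and 95); how Ceph itself treats a value exactly on the boundary is not something this sketch tries to reproduce.

```shell
# Hypothetical helper: classify <used %> against [nearfull %] and [full %].
osd_fill_state() {
    used="$1"
    nearfull="${2:-85}"
    full="${3:-95}"
    if [ "$used" -ge "$full" ]; then
        echo full
    elif [ "$used" -ge "$nearfull" ]; then
        echo nearfull
    else
        echo ok
    fi
}

osd_fill_state 80         # ok
osd_fill_state 93         # nearfull
osd_fill_state 96         # full
osd_fill_state 96 92 98   # nearfull (with the raised ratios from the example config)
```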

Losing 2 nodes in a 3 node system

This will cause the cluster to halt because of a new feature, as explained by Greg Farnum on the Ceph mailing list:

This took me a bit to work out as well, but you've run afoul of a new
post-argonaut feature intended to prevent people from writing with
insufficient durability. Pools now have a "min size" and PGs in that
pool won't go active if they don't have that many OSDs to write on.
The clue here is the "incomplete" state. You can change it with "ceph
osd pool foo set min_size 1", where "foo" is the name of the pool
whose min_size you wish to change (and this command sets the min size
to 1, obviously). The default for new pools is controlled by the "osd
pool default min size" config value (which you should put in the
global section). By default it'll be half of your default pool size.

So in your case your pools have a default size of 3, and the min size
is (3/2 = 1.5 rounded up), and the OSDs are refusing to go active
because of the dramatically reduced redundancy. You can set the min
size down though and they will go active.
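
Going by the explanation quoted above (half of the pool size, rounded up), the default min size can be worked out like this. The function name is made up for the example:

```shell
# Default min_size as described in the quote: pool size / 2, rounded up.
default_min_size() { echo $(( ($1 + 1) / 2 )); }

default_min_size 2   # 1
default_min_size 3   # 2
```

So a pool of size 3 needs 2 OSDs to go active, which is why a 3 node system with only 1 node left stays incomplete until min_size is lowered to 1.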

Import LVM image into rbd

This turned out to be really simple.

I started by grabbing a snapshot of a running LVM volume:

lvcreate -L25G -s -n snap-webserver /dev/storage/webserver

And then just feeding that snapshot directly into rbd:

rbd import /dev/storage/snap-webserver sata/webserver

Here sata/webserver means the image webserver in the pool sata. The webserver image will be created automatically.