To facilitate the transition from Ceph-Ansible to Rook, SCS has almost finished developing a migration tool called Rookify. This tool simplifies and streamlines the migration process, making it easier for users to switch from a Ceph-Ansible deployment to Rook. The tool is now under first technical preview and is being tested.
Features and Design
Statemachine
Rookify is a python package that uses a state-machine based on the transitions-library to migrate the various resources (such as mons, mgrs, osds, mds and anything else) to Rook. Each of these resources has a corresponding module in Rookify, which can be executed independently or in combination with other modules.
It’s important to note that most modules have dependencies on other modules and will implicitly run them as needed. For example, the mgirate-mons module needs the analyze-ceph module to run first (as indicated by the REQUIRES variable).This is necessary for Rookify to determine the current location of the mons and where they should be migrated to.
Rookify can be configured by editing a comprehensive config.yaml file, such as the provided confing.example.yaml. This configuration file specifies various configuration-dependencies (like SSH keys, Kubernetes and Ceph configurations) and allows users to easily decide which modules should be run (see the migration_modules section below in config.yaml).
Pickle support
Rookify optionally (and recommended) supports using a pickle file (see on top below the general section in config.example.yaml). Pickle is a model for object serialization that saves the state of progress externally, i.e. which modules have been run and information about the target machine. This means that Rookify can 'save' its progress:
- if a running migration is for some reason stopped, Rookify can use the pickle file to continue from the saved state.
- if a migration is run in parts (e.g. modules are run incrementally), then the pickle file allows to save the state of the cluster, i.e. the migration.
⚠️ Important:: This means that if the same Rookify installation should be used to migrate more than one cluster, or the one cluster has significantly changed suddenly, the pickle file should be deleted manually.
A simple CLI
Currently, Rookify only offers a straightforward CLI interface, which can be displayed by running rookify with -h or --help:
usage: Rookify [-h] [-d] [-m] [-s]
options:
-h, --help show this help message and exit
-d, --dry-run Preflight data analysis and migration validation.
-m, --migrate Run the migration.
-s, --show-states Show states of the modules.
The -m or --migrate argument has been added to (really) execute Rookify, while execution without arguments runs in preflight mode, i.e. with -d or --dry-run.
The -s or --show-states has been added to track the state of progress. The pickle file is read out for this purpose. Each module then reports the known status on stdout.
Rookify's general workflow: migrating monitors?
Rookify's main function is to migrate all of Ceph's resources to Rook. Let's take a look at the migration_modules section in the config.yaml file:
# config.yaml
migration_modules:
- migrate_mons
This configuration instructs Rookify to perform the following steps:
- Preflight Mode: The
migrate_monsmodule first runs in a preflight mode, which can also be manually triggered using therookify --dry-runcommand. During this phase, Rookify runs the preflight methods for the configured modules and their dependent modules. If the migration has already been successfully completed, the module stops here. - Dependency Check: If the migrate_mons module has not been run before (indicated by an empty pickle file), Rookify checks for dependencies, e.g. other modules that need to be run first. It execute those modules, first in preflight mode and then for real. The state of each module will optionally be saved in the pickle file.
ceph-analyzemodule: Rookify identifies that theanalyze-cephmodule needs to be run first in any case. Theanalyze-cephmodule collects data on the running Ceph resources and the Kubernetes environment with the Rook operator. Note that, like any other module,ceph-analyzefirst runs in preflight mode to check if the state has already been captured in the pickle file. If no state is found,analyze-cephgathers the necessary information.k8s_prerequisites_checkmodule: Validation for k8s namespace is done here, as it is required that the cluster namespace has been created manually. Only the CephCluster resource is generated in the next step bycreate_rook_cluster.create_rook_clustermodule: After successfully running theanalyze-cephandk8s_prerequisites_checkmodules, Rookify will check for other dependencies, such as thecreate_rook_clustermodule. This module creates theclustermap.yamlfor Rook based on information fromanalyze-cephandk8s_prerequisites_check.
- Migration Execution: Following the successful execution of
analyze_cephandcreate_cluster, themigrate_monsmodule is executed. Rookify shuts down the first running Ceph monitor on the first worker node (usingsudo systemctl disable --now ceph-mon.target) and immediately activates the equivalent monitor in Rook (by setting its metadata in the clustermap.yaml to true). - Monitor Migration: Rookify continues this process for each monitor until all have been migrated to Rook and are running. Optionally, the state can be saved in the pickle file. If, for example, the pickle file contains a finished state of migrate_mons, it will skip this module and only check if it has indeed bin run successfully.
For both managers and monitors, Rookify will use the just described approach: it will try to switch of the ceph resource after it has made sure that it can re-create an equivalent in the Rook cluster. For OSDs and MDS the migratory algorithm is a bit different.
Migrating OSDs
Here the "one-by-one"-algorithm described for managers and monitors does not work, because Rook has a container called rook-ceph-osd-prepare that always tries to find all osds on a path and build all of them at once. Note that there are configuration options that give the impression to handle this, like useAllNodes=false und useAllDevices=false (see rook docs). Both variables are set to false per default in the Rook deployment of OSISM, nevertheless rook-ceph-osd-prepare tries to scan and process all osds per node. This means in effect, that a device is busy-error will occur as well as a crashloop-feedback. This behavior has been mitigated by shutting down all OSD daemons per node:
- all OSD daemons per node have to switched of at once
- paths of osd devices have to be fed one by to
rook-ceph-osd-prepareto enforce sequential processing - wait for each osds on the node to be started before continuing
Migrating MDSs
The "one-by-one"-algorithm described for managers and monitors does not work here either, because Rook might want to update instances while the migration is in process: This might happen, for example, if one MDS's instance of the Ceph-Ansible deployment is shutoff and Rook is allowed to rebuild this instance within Kubernetes. Then Rook might want to update all MDS instances and will consequently try to switch of all the instances in order to update them: also the ones that are still running under Ceph-Ansible. Consequently, Rook will not reach these instances and errors will be thrown.
One way to solve this problem, would be to switch of all mds-instances under Ceph-Ansible and let Rook rebuild all of them. That would cause some minimal downtime though, and Rookify strives to cause 0 downtime.
That is why Rookify currently uses the following approach:
- Two mds instances (at least one has to be active) will be left under Ceph-Ansible, all others will be switched off. For example, in case of a total of three mds instances, only one mds instance will be switched off.
- Of the remaining two instances, the first one will now also be shut down and consequently made available for scheduling in Rook, so that a Rook instance starts. The system waits for this start before the last (remaining) instance is shut down and activated in Rook. All further instances will then be mapped in Rook.
Test run: Give it a try and help with testing process
To get started with Rookify, make sure to checkout the README.md in the repository.
If you'd like to try to test the current state of Rookify (much appreciated – feel free to report any issues to Github), you can use the testbed of OSISM.
Testbed setup
ℹ️ Info: The OSISM testbed is intended for testing purposes, which means it may be unstable. If Ceph and K3s cannot be deployed without errors, you may need to wait for a fix or find a workaround to test Rookify.
In order to setup the testbed, first consult OSISM's testbed documentation to ensure you meet all requirements. If everything is in order, clone the repository and use make ceph to set up a Ceph testbed. This command will automatically pull the necessary Ansible roles, prepare a virtual environment, build the infrastructure with OpenStack, create a manager node, and deploy Ceph on three worker nodes:
git clone github.com:osism/testbed.git
make ceph
Once the infrastructure for Ceph and the testbed has been deployed, log in with make login and deploy K3s as well as a Rook operator:
make login
osism apply k3s
osism apply rook-operator
If you want to modify any configurations, such as a Rook setting, refer to /opt/configuration/environments/rook/ and check OSISM's documentation on Rook for various settings.
Rookify Setup/Configuration for OSISM's Testbed
1. Clone the Rookify Repository
Start by cloning the Rookify repository, setting it up, and building the Python package in a virtual environment using the Makefile. You can simply run make without any arguments to see a list of helper functions that can assist you in setting up and configuring Rookify. The command make setup downloads the necessary virtual environment libraries and installs the Rookify package into .venv/bin/rookify within the working directory:
git clone https://github.com/SovereignCloudStack/rookify
cd rookify
make setup
ℹ️ Info:
If you encounter an error from the python-rados library, you can run make check-radoslib to check if the library is installed locally. If not, install the package manually. The python-rados library should be version 2.0.0 at the time of writing (check the README.md file of Rookify for the most up-to-date documentation). The library could not be integrated within the setup because Ceph currently offers no builds for pip.
2. Configure Rookify
Copy config.example.osism.yaml to config.yaml and modify the various configuration settings as needed. Rookify will require access to an SSH key (e.g., the .id_rsa file in the Terraform directory in the testbed repository), Ceph configuration files (see /etc/ceph/ on one of the worker nodes), and Kubernetes files (e.g., ~/.kube/config from the manager node). Check if the Makefile contains any helper functions to assist you: run make in the root of the working directory to see all options that the Makefile offers.
📝 Note: Ensure that Rookify can connect to the testbed. Refer to OSISM-documentation about how to setup a VPN connection.
general:
machine_pickle_file: data.pickle
logging:
level: INFO # level at which logging should start
format:
time: "%Y-%m-%d %H:%M.%S" # other example: "iso"
renderer: console # or: json
ceph:
config: ./.ceph/ceph.conf
keyring: ./.ceph/ceph.client.admin.keyring
# fill in correct path to private key
ssh:
private_key: /home/USER/.ssh/cloud.private
hosts:
testbed-node-0:
address: 192.168.16.10
user: dragon
testbed-node-1:
address: 192.168.16.11
user: dragon
testbed-node-2:
address: 192.168.16.12
user: dragon
kubernetes:
config: ./k8s/config
rook:
cluster:
name: osism-ceph
namespace: rook-ceph
mds_placement_label: node-role.osism.tech/rook-mds
mgr_placement_label: node-role.osism.tech/rook-mgr
mon_placement_label: node-role.osism.tech/rook-mon
osd_placement_label: node-role.osism.tech/rook-osd
rgw_placement_label: node-role.osism.tech/rook-rgw
ceph:
image: quay.io/ceph/ceph:v18.2.1
migration_modules: # this sets the modules that need to be run. Note that some of the modules require other modules to be run as well, this will happen automatically.
- analyze_ceph
3. Run Rookify:
Finally, run Rookify to test it. Rookify allows the use of --dry-run to run modules in preflight mode. Note that Rookify always checks for a successful run of the various modules in preflight mode before starting the migration.
Additionally it is always safe to run the analyze_ceph module, as it will not make any real changes.
📝 Note:
By default, Rookify runs in preflight mode. If you execute Rookify without any arguments --dry-run will be appended.
To be on the secure side, you can now run Rookify in preflight mode. In this case you could also execute it directly, i.e. by adding the argument --migrate (because the analyze_ceph module does not break anything):
# Preflight mode
.venv/bin/rookify --dry-run
# Execution mode
.venv/bin/rookify --migrate
⚠️ Important: It is advised to run any module in preflight mode first, to avoid any real changes.
If all is setup correctly you will see an output similar to this:
.venv/bin/rookify --migrate
2024-09-02 15:21.37 [info ] Execution started with machine pickle file
2024-09-02 15:21.37 [info ] AnalyzeCephHandler ran successfully.
Note that now there is a data.pickle file in the root of your working directory. The file should contain data:
du -sh data.pickle
8.0K data.pickle
At this point we can re-edit the config.yaml file to migrate the osds, mds, managers and radosgateway resources:
migration_modules:
- analyze_ceph
- create_rook_cluster
- migrate_mons
- migrate_osds
- migrate_osd_pools
- migrate_mds
- migrate_mds_pools
- migrate_mgrs
- migrate_rgws
- migrate_rgw_pools
ℹ️ Info:
Some of these are redundant in the sense that their REQUIRED variables contain the modules as their dependencies already. For example migrate_osds has the following REQUIRED-variable: REQUIRES = ["migrate_mons"], and migrate_mons has REQUIRES = ["analyze_ceph", "create_rook_cluster"]. So it would be fine to skip the first three modules. Still, the extra mention of the modules could improve clarity for the reader. In effect Rookify will run the modules only once, so it does not hurt to add them in the config.yaml.
We can first run Rookify with in preflight mode ( --dry-run ) to check if all is ok and then run it with --migration. Executing Rookify should then give you an output similar to this:
.venv/bin/rookify --migration
2024-09-04 08:52.02 [info ] Execution started with machine pickle file
2024-09-04 08:52.04 [info ] AnalyzeCephHandler ran successfully.
2024-09-04 08:52.04 [info ] Validated Ceph to expect cephx auth
2024-09-04 08:52.04 [warning ] Rook Ceph cluster will be configured without a public network and determine it automatically during runtime
2024-09-04 08:52.04 [info ] Rook Ceph cluster will be configured without a cluster network
2024-09-04 08:52.11 [warning ] ceph-mds filesystem 'cephfs' uses an incompatible pool metadata name 'cephfs_metadata' and can not be migrated to Rook automatically
2024-09-04 08:52.16 [info ] Creating Rook cluster definition
2024-09-04 08:52.16 [info ] Waiting for Rook cluster created
2024-09-04 08:52.16 [info ] Migrating ceph-mon daemon 'testbed-node-0'
2024-09-04 08:52.32 [info ] Disabled ceph-mon daemon 'testbed-node-0'
2024-09-04 08:53.45 [info ] Quorum of 3 ceph-mon daemons successful
2024-09-04 08:53.45 [info ] Migrating ceph-mon daemon 'testbed-node-1'
2024-09-04 08:54.07 [info ] Disabled ceph-mon daemon 'testbed-node-1'
2024-09-04 08:54.44 [info ] Quorum of 3 ceph-mon daemons successful
2024-09-04 08:54.44 [info ] Migrating ceph-mon daemon 'testbed-node-2'
2024-09-04 08:55.04 [info ] Disabled ceph-mon daemon 'testbed-node-2'
2024-09-04 08:55.52 [info ] Quorum of 3 ceph-mon daemons successful
2024-09-04 08:55.52 [info ] Migrating ceph-osd host 'testbed-node-0'
2024-09-04 08:55.55 [info ] Disabled ceph-osd daemon 'testbed-node-0@0'
2024-09-04 08:55.57 [info ] Disabled ceph-osd daemon 'testbed-node-0@4'
2024-09-04 08:55.57 [info ] Enabling Rook based ceph-osd node 'testbed-node-0'
2024-09-04 08:57.00 [info ] Rook based ceph-osd daemon 'testbed-node-0@0' available
2024-09-04 08:57.02 [info ] Rook based ceph-osd daemon 'testbed-node-0@4' available
2024-09-04 08:57.02 [info ] Migrating ceph-osd host 'testbed-node-1'
2024-09-04 08:57.05 [info ] Disabled ceph-osd daemon 'testbed-node-1@1'
2024-09-04 08:57.07 [info ] Disabled ceph-osd daemon 'testbed-node-1@3'
2024-09-04 08:57.07 [info ] Enabling Rook based ceph-osd node 'testbed-node-1'
2024-09-04 08:58.46 [info ] Rook based ceph-osd daemon 'testbed-node-1@1' available
2024-09-04 08:58.46 [info ] Rook based ceph-osd daemon 'testbed-node-1@3' available
2024-09-04 08:58.46 [info ] Migrating ceph-osd host 'testbed-node-2'
2024-09-04 08:58.48 [info ] Disabled ceph-osd daemon 'testbed-node-2@2'
2024-09-04 08:58.50 [info ] Disabled ceph-osd daemon 'testbed-node-2@5'
2024-09-04 08:58.50 [info ] Enabling Rook based ceph-osd node 'testbed-node-2'
2024-09-04 09:00.25 [info ] Rook based ceph-osd daemon 'testbed-node-2@2' available
2024-09-04 09:00.27 [info ] Rook based ceph-osd daemon 'testbed-node-2@5' available
2024-09-04 09:00.27 [info ] Migrating ceph-mds daemon at host 'testbed-node-0'
2024-09-04 09:00.27 [info ] Migrating ceph-mds daemon at host 'testbed-node-1'
2024-09-04 09:00.27 [info ] Migrating ceph-mds daemon at host 'testbed-node-2'
2024-09-04 09:00.27 [info ] Migrating ceph-mgr daemon at host'testbed-node-0'
2024-09-04 09:01.03 [info ] Disabled ceph-mgr daemon 'testbed-node-0' and enabling Rook based daemon
2024-09-04 09:01.20 [info ] 3 ceph-mgr daemons are available
2024-09-04 09:01.20 [info ] Migrating ceph-mgr daemon at host'testbed-node-1'
2024-09-04 09:01.51 [info ] Disabled ceph-mgr daemon 'testbed-node-1' and enabling Rook based daemon
2024-09-04 09:02.09 [info ] 3 ceph-mgr daemons are available
2024-09-04 09:02.09 [info ] Migrating ceph-mgr daemon at host'testbed-node-2'
2024-09-04 09:02.41 [info ] Disabled ceph-mgr daemon 'testbed-node-2' and enabling Rook based daemon
2024-09-04 09:03.00 [info ] 3 ceph-mgr daemons are available
2024-09-04 09:03.00 [info ] Migrating ceph-rgw zone 'default'
2024-09-04 09:03.00 [info ] Migrated ceph-rgw zone 'default'
2024-09-04 09:03.00 [info ] Migrating ceph-osd pool 'backups'
2024-09-04 09:03.01 [info ] Migrated ceph-osd pool 'backups'
2024-09-04 09:03.01 [info ] Migrating ceph-osd pool 'volumes'
2024-09-04 09:03.01 [info ] Migrated ceph-osd pool 'volumes'
2024-09-04 09:03.01 [info ] Migrating ceph-osd pool 'images'
2024-09-04 09:03.01 [info ] Migrated ceph-osd pool 'images'
2024-09-04 09:03.01 [info ] Migrating ceph-osd pool 'metrics'
2024-09-04 09:03.01 [info ] Migrated ceph-osd pool 'metrics'
2024-09-04 09:03.01 [info ] Migrating ceph-osd pool 'vms'
2024-09-04 09:03.01 [info ] Migrated ceph-osd pool 'vms'
2024-09-04 09:03.01 [info ] Migrating ceph-rgw daemon at host 'testbed-node-2'
2024-09-04 09:04.27 [info ] Disabled ceph-rgw host 'testbed-node-2'
2024-09-04 09:04.35 [info ] Rook based RGW daemon for node 'testbed-node-2' available
2024-09-04 09:04.35 [info ] Migrating ceph-rgw daemon at host 'testbed-node-1'
2024-09-04 09:04.41 [info ] Disabled ceph-rgw host 'testbed-node-1'
2024-09-04 09:05.09 [info ] Rook based RGW daemon for node 'testbed-node-1' available
2024-09-04 09:05.09 [info ] Migrating ceph-rgw daemon at host 'testbed-node-0'
2024-09-04 09:05.13 [info ] Disabled ceph-rgw host 'testbed-node-0'
2024-09-04 09:05.19 [info ] Rook based RGW daemon for node 'testbed-node-0' available
Wow! You migrated from Ceph-Ansible to Rook!
Now login into the testbed and check your ceph cluster's health with ceph -s. Also use kubectl to check if all Rook pods are running.