Avoiding downtime in migrations
When working with a database certain operations may require downtime. Since we cannot have downtime in migrations we need to use a set of steps to get the same end result without downtime. This guide describes various operations that may appear to need downtime, their impact, and how to perform them without requiring downtime.
Dropping Columns
Removing columns is tricky because running GitLab processes may still be using the columns. To work around this safely, you will need three steps in three releases:
- Ignoring the column (release M)
- Dropping the column (release M+1)
- Removing the ignore rule (release M+2)
The reason we spread this out across three releases is that dropping a column is a destructive operation that can't be rolled back easily.
Following this procedure helps us to make sure there are no deployments to GitLab.com and upgrade processes for self-managed installations that lump together any of these steps.
Step 1: Ignoring the column (release M)
The first step is to ignore the column in the application code. This is
necessary because Rails caches the columns and re-uses this cache in various
places. This can be done by defining the columns to ignore. For example, to ignore
updated_at
in the User model you'd use the following:
class User < ApplicationRecord
include IgnorableColumns
ignore_column :updated_at, remove_with: '12.7', remove_after: '2020-01-22'
end
Multiple columns can be ignored, too:
ignore_columns %i[updated_at created_at], remove_with: '12.7', remove_after: '2020-01-22'
If the model exists in CE and EE, the column has to be ignored in the CE model. If the model only exists in EE, then it has to be added there.
We require indication of when it is safe to remove the column ignore with:
-
remove_with
: set to a GitLab release typically two releases (M+2) after adding the column ignore. -
remove_after
: set to a date after which we consider it safe to remove the column ignore, typically after the M+1 release date, during the M+2 development cycle.
This information allows us to reason better about column ignores and makes sure we don't remove column ignores too early for both regular releases and deployments to GitLab.com. For example, this avoids a situation where we deploy a bulk of changes that include both changes to ignore the column and subsequently remove the column ignore (which would result in a downtime).
In this example, the change to ignore the column went into release 12.5.
Step 2: Dropping the column (release M+1)
Continuing our example, dropping the column goes into a post-deployment migration in release 12.6:
remove_column :user, :updated_at
Step 3: Removing the ignore rule (release M+2)
With the next release, in this example 12.7, we set up another merge request to remove the ignore rule.
This removes the ignore_column
line and - if not needed anymore - also the inclusion of IgnoreableColumns
.
This should only get merged with the release indicated with remove_with
and once
the remove_after
date has passed.
Renaming Columns
Renaming columns the normal way requires downtime as an application may continue using the old column name during/after a database migration. To rename a column without requiring downtime we need two migrations: a regular migration, and a post-deployment migration. Both these migration can go in the same release.
Step 1: Add The Regular Migration
First we need to create the regular migration. This migration should use
Gitlab::Database::MigrationHelpers#rename_column_concurrently
to perform the
renaming. For example
# A regular migration in db/migrate
class RenameUsersUpdatedAtToUpdatedAtTimestamp < Gitlab::Database::Migration[1.0]
disable_ddl_transaction!
def up
rename_column_concurrently :users, :updated_at, :updated_at_timestamp
end
def down
undo_rename_column_concurrently :users, :updated_at, :updated_at_timestamp
end
end
This will take care of renaming the column, ensuring data stays in sync, and copying over indexes and foreign keys.
If a column contains one or more indexes that don't contain the name of the original column, the previously described procedure will fail. In that case, you'll first need to rename these indexes.
Step 2: Add A Post-Deployment Migration
The renaming procedure requires some cleaning up in a post-deployment migration.
We can perform this cleanup using
Gitlab::Database::MigrationHelpers#cleanup_concurrent_column_rename
:
# A post-deployment migration in db/post_migrate
class CleanupUsersUpdatedAtRename < Gitlab::Database::Migration[1.0]
disable_ddl_transaction!
def up
cleanup_concurrent_column_rename :users, :updated_at, :updated_at_timestamp
end
def down
undo_cleanup_concurrent_column_rename :users, :updated_at, :updated_at_timestamp
end
end
If you're renaming a large table, please carefully consider the state when the first migration has run but the second cleanup migration hasn't been run yet. With Canary it is possible that the system runs in this state for a significant amount of time.
Changing Column Constraints
Adding or removing a NOT NULL
clause (or another constraint) can typically be
done without requiring downtime. However, this does require that any application
changes are deployed first. Thus, changing the constraints of a column should
happen in a post-deployment migration.
Avoid using change_column
as it produces an inefficient query because it re-defines
the whole column type.
You can check the following guides for each specific use case:
Changing Column Types
Changing the type of a column can be done using
Gitlab::Database::MigrationHelpers#change_column_type_concurrently
. This
method works similarly to rename_column_concurrently
. For example, let's say
we want to change the type of users.username
from string
to text
.
Step 1: Create A Regular Migration
A regular migration is used to create a new column with a temporary name along with setting up some triggers to keep data in sync. Such a migration would look as follows:
# A regular migration in db/migrate
class ChangeUsersUsernameStringToText < Gitlab::Database::Migration[1.0]
disable_ddl_transaction!
def up
change_column_type_concurrently :users, :username, :text
end
def down
undo_change_column_type_concurrently :users, :username
end
end
Step 2: Create A Post Deployment Migration
Next we need to clean up our changes using a post-deployment migration:
# A post-deployment migration in db/post_migrate
class ChangeUsersUsernameStringToTextCleanup < Gitlab::Database::Migration[1.0]
disable_ddl_transaction!
def up
cleanup_concurrent_column_type_change :users, :username
end
def down
undo_cleanup_concurrent_column_type_change :users, :username, :string
end
end
And that's it, we're done!
Casting data to a new type
Some type changes require casting data to a new type. For example when changing from text
to jsonb
.
In this case, use the type_cast_function
option.
Make sure there is no bad data and the cast will always succeed. You can also provide a custom function that handles
casting errors.
Example migration:
def up
change_column_type_concurrently :users, :settings, :jsonb, type_cast_function: 'jsonb'
end
Changing The Schema For Large Tables
While change_column_type_concurrently
and rename_column_concurrently
can be
used for changing the schema of a table without downtime, it doesn't work very
well for large tables. Because all of the work happens in sequence the migration
can take a very long time to complete, preventing a deployment from proceeding.
They can also produce a lot of pressure on the database due to it rapidly
updating many rows in sequence.
To reduce database pressure you should instead use a background migration
when migrating a column in a large table (for example, issues
). This will
spread the work / load over a longer time period, without slowing down deployments.
For more information, see the documentation on cleaning up background migrations.
Adding Indexes
Adding indexes does not require downtime when add_concurrent_index
is used.
See also Migration Style Guide for more information.
Dropping Indexes
Dropping an index does not require downtime.
Adding Tables
This operation is safe as there's no code using the table just yet.
Dropping Tables
Dropping tables can be done safely using a post-deployment migration, but only if the application no longer uses the table.
Renaming Tables
Renaming tables requires downtime as an application may continue using the old table name during/after a database migration.
If the table and the ActiveRecord model is not in use yet, removing the old table and creating a new one is the preferred way to "rename" the table.
Renaming a table is possible without downtime by following our multi-release rename table process.
Adding Foreign Keys
Adding foreign keys usually works in 3 steps:
- Start a transaction
- Run
ALTER TABLE
to add the constraint(s) - Check all existing data
Because ALTER TABLE
typically acquires an exclusive lock until the end of a
transaction this means this approach would require downtime.
GitLab allows you to work around this by using
Gitlab::Database::MigrationHelpers#add_concurrent_foreign_key
. This method
ensures that no downtime is needed.
Removing Foreign Keys
This operation does not require downtime.
integer
primary keys to bigint
Migrating To prevent the overflow risk for some tables
with integer
primary key (PK), we have to migrate their PK to bigint
. The process to do this
without downtime and causing too much load on the database is described below.
Initialize the conversion and start migrating existing data (release N)
To start the process, add a regular migration to create the new bigint
columns. Use the provided
initialize_conversion_of_integer_to_bigint
helper. The helper also creates a database trigger
to keep in sync both columns for any new records (see an example):
class InitializeConversionOfCiStagesToBigint < ActiveRecord::Migration[6.1]
include Gitlab::Database::MigrationHelpers
TABLE = :ci_stages
COLUMNS = %i(id)
def up
initialize_conversion_of_integer_to_bigint(TABLE, COLUMNS)
end
def down
revert_initialize_conversion_of_integer_to_bigint(TABLE, COLUMNS)
end
end
Ignore the new bigint
columns:
module Ci
class Stage < Ci::ApplicationRecord
include IgnorableColumns
ignore_column :id_convert_to_bigint, remove_with: '14.2', remove_after: '2021-08-22'
end
To migrate existing data, we introduced new type of batched background migrations.
Unlike the classic background migrations, built on top of Sidekiq, batched background migrations
don't have to enqueue and schedule all the background jobs at the beginning.
They also have other advantages, like automatic tuning of the batch size, better progress visibility,
and collecting metrics. To start the process, use the provided backfill_conversion_of_integer_to_bigint
helper (example):
class BackfillCiStagesForBigintConversion < ActiveRecord::Migration[6.1]
include Gitlab::Database::MigrationHelpers
TABLE = :ci_stages
COLUMNS = %i(id)
def up
backfill_conversion_of_integer_to_bigint(TABLE, COLUMNS)
end
def down
revert_backfill_conversion_of_integer_to_bigint(TABLE, COLUMNS)
end
end
Monitor the background migration
Check how the migration is performing while it's running. Multiple ways to do this are described below.
High-level status of batched background migrations
See how to check the status of batched background migrations.
Query the database
We can query the related database tables directly. Requires access to read-only replica. Example queries:
-- Get details for batched background migration for given table
SELECT * FROM batched_background_migrations WHERE table_name = 'namespaces'\gx
-- Get count of batched background migration jobs by status for given table
SELECT
batched_background_migrations.id, batched_background_migration_jobs.status, COUNT(*)
FROM
batched_background_migrations
JOIN batched_background_migration_jobs ON batched_background_migrations.id = batched_background_migration_jobs.batched_background_migration_id
WHERE
table_name = 'namespaces'
GROUP BY
batched_background_migrations.id, batched_background_migration_jobs.status;
-- Batched background migration progress for given table (based on estimated total number of tuples)
SELECT
m.table_name,
LEAST(100 * sum(j.batch_size) / pg_class.reltuples, 100) AS percentage_complete
FROM
batched_background_migrations m
JOIN batched_background_migration_jobs j ON j.batched_background_migration_id = m.id
JOIN pg_class ON pg_class.relname = m.table_name
WHERE
j.status = 3 AND m.table_name = 'namespaces'
GROUP BY m.id, pg_class.reltuples;
Sidekiq logs
We can also use the Sidekiq logs to monitor the worker that executes the batched background migrations:
- Sign in to Kibana with a
@gitlab.com
email address. - Change the index pattern to
pubsub-sidekiq-inf-gprd*
. - Add filter for
json.queue: cronjob:database_batched_background_migration
.
PostgreSQL slow queries log
Slow queries log keeps track of low queries that took above 1 second to execute. To see them for batched background migration:
- Sign in to Kibana with a
@gitlab.com
email address. - Change the index pattern to
pubsub-postgres-inf-gprd*
. - Add filter for
json.endpoint_id.keyword: Database::BatchedBackgroundMigrationWorker
. - Optional. To see only updates, add a filter for
json.command_tag.keyword: UPDATE
. - Optional. To see only failed statements, add a filter for
json.error_severity.keyword: ERROR
. - Optional. Add a filter by table name.
Grafana dashboards
To monitor the health of the database, use these additional metrics:
- PostgreSQL Tuple Statistics: if you see high rate of updates for the tables being actively converted, or increasing percentage of dead tuples for this table, it might mean that autovacuum cannot keep up.
- PostgreSQL Overview: if you see high system usage or transactions per second (TPS) on the primary database server, it might mean that the migration is causing problems.
Prometheus metrics
Number of metrics for each batched background migration are published to Prometheus. These metrics can be searched for and visualized in Thanos (see an example).
Swap the columns (release N + 1)
After the background is completed and the new bigint
columns are populated for all records, we can
swap the columns. Swapping is done with post-deployment migration. The exact process depends on the
table being converted, but in general it's done in the following steps:
- Using the provided
ensure_batched_background_migration_is_finished
helper, make sure the batched migration has finished (see an example). If the migration has not completed, the subsequent steps fail anyway. By checking in advance we aim to have more helpful error message. - Create indexes using the
bigint
columns that match the existing indexes using theinteger
column (see an example). - Create foreign keys (FK) using the
bigint
columns that match the existing FKs using theinteger
column. Do this both for FK referencing other tables, and FKs that reference the table that is being migrated (see an example). - Inside a transaction, swap the columns:
- Lock the tables involved. To reduce the chance of hitting a deadlock, we recommended to do this in parent to child order (see an example).
- Rename the columns to swap names (see an example)
- Reset the trigger function (see an example).
- Swap the defaults (see an example).
- Swap the PK constraint (if any) (see an example).
- Remove old indexes and rename new ones (see an example).
- Remove old FKs (if still present) and rename new ones (see an example).
See example merge request, and migration.
integer
columns (release N + 2)
Remove the trigger and old Using post-deployment migration and the provided cleanup_conversion_of_integer_to_bigint
helper,
drop the database trigger and the old integer
columns (see an example).
Remove ignore rules (release N + 3)
In the next release after the columns were dropped, remove the ignore rules as we do not need them anymore (see an example).
Data migrations
Data migrations can be tricky. The usual approach to migrate data is to take a 3 step approach:
- Migrate the initial batch of data
- Deploy the application code
- Migrate any remaining data
Usually this works, but not always. For example, if a field's format is to be changed from JSON to something else we have a bit of a problem. If we were to change existing data before deploying application code we'll most likely run into errors. On the other hand, if we were to migrate after deploying the application code we could run into the same problems.
If you merely need to correct some invalid data, then a post-deployment migration is usually enough. If you need to change the format of data (for example, from JSON to something else) it's typically best to add a new column for the new data format, and have the application use that. In such a case the procedure would be:
- Add a new column in the new format
- Copy over existing data to this new column
- Deploy the application code
- In a post-deployment migration, copy over any remaining data
In general there is no one-size-fits-all solution, therefore it's best to discuss these kind of migrations in a merge request to make sure they are implemented in the best way possible.