Archive A Large Database Table

Recently I was given the seemingly simple task of running an alter table to change the data type of a column. However, because the table had tens of millions of rows, I soon discovered the job would not be so easy. In this post, I'll explain the solution: creating an archive version of the table.
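For context, the change itself was nothing exotic. In Laravel migration terms it was along the lines of the sketch below, where the column and new type are placeholders rather than the real ones:

Schema::table('search', function (Blueprint $table) {
    // Hypothetical example: widen the keyword column to a larger type.
    $table->text('keyword')->change();
});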

To begin with, I tested how long the database would take to update every row to the new data type. At almost two hours, it was far too long to run in production, because the website would be unresponsive for the entire process.

Next, I evaluated the data within the table and discovered that even though tens of millions of rows were stored, we only needed to access data that was recently saved (within the last day). Thus, I decided to move the rest of the data to a separate table via a cron script and then run the migration on the much smaller table. 

In this scenario, I am using Laravel, but the code samples can easily be adapted to your needs. To start with, I created the new archive table.

php artisan make:migration create_search_archive_table

In the generated migration, the up method defines the archive schema:

public function up()
{
    Schema::create('search_archive', function (Blueprint $table) {
        $table->increments('id');
        $table->integer('original_id')->unsigned()->index();
        $table->string('keyword');
        $table->timestamp('original_created_at')->nullable();
        $table->timestamp('original_updated_at')->nullable();
        $table->timestamps();
    });
}

Notice that in the create call, I preserve the original id and timestamps from the search table (as original_id, original_created_at, and original_updated_at) while copying the rest of the columns; the archive table also gets its own id and timestamps. Now let's run the migration.

php artisan migrate

The second step of this process will be to slowly fill the archive table using a cron job. 

php artisan make:command FillSearchArchive

In the handle function of this new Artisan command, I query the existing search table for records older than one day, save them to search_archive, and then delete them from search. But first, we need a SearchArchive model for the new table.

php artisan make:model SearchArchive
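
Because the table is named search_archive rather than the search_archives Eloquent would infer from the model name, the model should point at the table explicitly. A minimal sketch (the App\Models namespace is an assumption; older Laravel apps use App):

namespace App\Models;

use Illuminate\Database\Eloquent\Model;

class SearchArchive extends Model
{
    // The migration above creates "search_archive", not the pluralized
    // name Eloquent would guess from the class name.
    protected $table = 'search_archive';
}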
With the model in place, the command's handle method copies each old row into search_archive and then removes it from search:

public function handle()
{
    // Only archive rows older than one day; anything newer stays in the live table.
    $searches = Search::where('created_at', '<', Carbon::now()->subDay())
        ->limit($this->argument('rows'))
        ->get();

    foreach ($searches as $search) {
        $archivedSearch = new SearchArchive();
        $archivedSearch->original_id = $search->id;
        $archivedSearch->keyword = $search->keyword;
        $archivedSearch->original_created_at = $search->created_at;
        $archivedSearch->original_updated_at = $search->updated_at;
        $archivedSearch->save();

        // Remove the row from the live table once it has been copied.
        $search->delete();
    }
}

This command receives a parameter called rows, which lets us vary how aggressively the script works. During the night, when traffic is lower, we can amp up the script to handle more rows, whereas during the day it handles fewer. This prevents the website from slowing down while still finishing the archive as quickly as possible.
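
For reference, the rows argument is declared in the command's signature. A minimal sketch, assuming the command is named search:archive to match the scheduler entries below:

protected $signature = 'search:archive {rows}';

protected $description = 'Move searches older than one day into the search_archive table';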

We can run the command manually, and at this point you should see rows from the search table moving to search_archive:

php artisan search:archive 1000
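
A quick way to sanity-check progress is to compare row counts, for example from php artisan tinker (the App\Models namespace is an assumption; adjust to your app):

// Rows still in the live table versus rows already archived.
App\Models\Search::count();
App\Models\SearchArchive::count();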

However, to complete the solution, we need to schedule this command to run in the background until the original search table is small enough to run the original alter table without disrupting service to the website. In app/Console/Kernel.php I added the following schedule entries.

$schedule->command('search:archive 7500')->everyMinute()->between('09:00', '16:30');
$schedule->command('search:archive 5000')->everyMinute()->between('16:31', '23:30');
$schedule->command('search:archive 10000')->everyMinute()->between('23:30', '08:59');
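
Depending on how long each batch takes, you may also want to stop runs from stacking up on top of each other; the scheduler's withoutOverlapping() method handles this, for example:

$schedule->command('search:archive 10000')->everyMinute()->between('23:30', '08:59')->withoutOverlapping();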

As with any solution, this has some limitations, namely:

  1. Resetting the primary key of the search table is now inadvisable, because we would create duplicate ids for different searches.
  2. Any data analysis covering both historical and real-time search data now has to query two different tables (see the sketch after this list).
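
For the second limitation, one workaround is to stitch the two tables together with a union whenever an analysis needs the full history. A rough sketch using the query builder (the live table is assumed here to be named search; the archive columns follow the migration above):

use Illuminate\Support\Facades\DB;

// Archived rows, re-aliased to match the live table's columns.
$archived = DB::table('search_archive')
    ->select('original_id as id', 'keyword', 'original_created_at as created_at');

// Live rows and archived rows as a single result set.
$allSearches = DB::table('search')
    ->select('id', 'keyword', 'created_at')
    ->unionAll($archived)
    ->get();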

Do you have any questions or comments? Be sure to let me know below!