Extracting Tenants from a Monolith

Introduction

Multi-tenant applications often start as monoliths, but as they grow there are compelling reasons to extract specific tenants into isolated environments: performance isolation, security requirements, or premium dedicated hosting.

In this post, I'll share how I built a system to efficiently extract tenants from a large Rails monolith with a complex data model into their own dedicated environments, while significantly reducing extraction size and improving performance.

The Challenge

The application is a multi-tenant platform with the following characteristics:
  • Multiple tenants with parent-child relationships
  • Complex cross-tenant associations and polymorphic relationships
  • Several huge tables (10+ million rows each)
  • Critical data that must be preserved during tenant extraction
  • Extraction and import operations that were becoming prohibitively slow

Some tables had grown to contain years of historical data, most of which wasn't critical for a functional tenant environment. This led to unnecessarily large extraction files (several GB per tenant) and import times exceeding an hour, which was problematic for quickly creating new isolated environments.

System Architecture Overview

Before diving into the optimizations, it's worth understanding the tenant extraction system's architecture:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  CLI Interface  │────▶│  Extract Runner │────▶│     Exporter    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                        │
                                                        ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  SQL Generator  │◀────│  Database Util  │◀────│  Query Config   │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Key components:
  • CLI Interface: Provides command-line options and argument parsing
  • Extract Runner: Orchestrates the overall extraction process
  • Exporter: Handles the core data export logic
  • Database Util: Provides database connectivity and utility functions
  • Query Config: Stores SQL queries in YAML for maintainability
  • SQL Generator: Creates SQL scripts for extraction and import

This modular design allows for clean separation of concerns and easier maintenance.
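
To make the Query Config component concrete: the per-table SQL lives in a YAML file that gets loaded once and handed to the exporter as a plain hash. A minimal sketch (the file path here is illustrative):

require "yaml"

# Load per-table SQL overrides from queries.yml; the result is a plain
# hash keyed by table name that gets handed to the exporter.
queries = YAML.load_file(File.join(__dir__, "config", "queries.yml"))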

My Approach: Smart Data Filtering for Tenant Extraction

Rather than extracting every row in every table for a tenant, I identified opportunities to be more selective while maintaining data integrity. Here's how I optimized three of the largest tables when extracting tenant data:

1. Audit Logs Table

The audit logs table records every action performed on various entity types. This table had grown to over 20 million rows, but most historical logs weren't critical for a functioning tenant environment.

My solution: use Common Table Expressions (CTEs) to selectively limit logs by entity type and recency.

# In queries.yml - Using CTEs to filter logs by entity type
activity_logs: |
  WITH type_a_logs AS (
    SELECT logs.* FROM activity_logs logs 
    WHERE logs.entity_type = 'TypeA' 
    AND logs.entity_id IN (
      SELECT id FROM entity_a WHERE tenant_id IN (
        SELECT id FROM tenants WHERE parent_id IN (SELECT id FROM extraction_tenant_ids)
      )
    )
    ORDER BY logs.id DESC LIMIT 100000
  ),
  type_b_logs AS (
    SELECT logs.* FROM activity_logs logs 
    WHERE logs.entity_type = 'TypeB' 
    AND logs.entity_id IN (
      SELECT id FROM entity_b WHERE tenant_id IN (
        SELECT id FROM tenants WHERE parent_id IN (SELECT id FROM extraction_tenant_ids)
      )
    )
    ORDER BY logs.id DESC LIMIT 100000
  ),
  type_c_logs AS (
    SELECT logs.* FROM activity_logs logs 
    WHERE logs.entity_type = 'TypeC' 
    AND logs.entity_id IN (
      SELECT id FROM entity_c WHERE owner_id IN (
        SELECT owner_id FROM tenants WHERE id IN (SELECT id FROM extraction_tenant_ids)
      )
    )
    ORDER BY logs.id DESC LIMIT 100000
  )
  SELECT * FROM type_a_logs
  UNION ALL
  SELECT * FROM type_b_logs
  UNION ALL
  SELECT * FROM type_c_logs

This query:
  • Creates a separate CTE for each entity type I care about (TypeA, TypeB, TypeC)
  • Finds, within each entity type, only the entities related to the tenant being extracted
  • Limits the result to the 100,000 most recent logs per entity type
  • Combines everything with UNION ALL

The beauty of this approach is that it respects the appropriate join paths for each entity type, which can differ depending on the entity's relationship to the tenant. I used different join conditions for TypeC than for TypeA and TypeB because they have different relationship structures.

2. Historical Transaction Table

The transaction history table tracked every change in a critical system and had grown to millions of rows per tenant. For tenant extraction purposes, I only needed recent history.

# In queries.yml - Filtering to only recent transactions
transaction_history: "SELECT * FROM transaction_history WHERE tenant_id IN (SELECT id FROM extraction_tenant_ids) AND created_at > (NOW() - INTERVAL '1 month')"

This simple date filter reduced the extraction size by over 90% while preserving all recent transaction data. After analyzing usage patterns and determining that older transactions were rarely referenced but could always be accessed in the primary database if needed, I chose one month as the cutoff point.

3. Worker Task Records

For background worker task records, I only needed the most recent entries:

# In queries.yml - Limiting to the most recent task records
worker_tasks: "SELECT * FROM worker_tasks WHERE tenant_id IN (SELECT id FROM extraction_tenant_ids) ORDER BY created_at DESC LIMIT 10000"

For this table, I chose a limit-based approach rather than a time-based one, as the frequency of task creation varies significantly between tenants. Some active tenants might generate 10,000 tasks weekly, while less active ones might take months to reach that number.

Technical Implementation Details

Table Selection Logic

The extraction process uses three mechanisms to determine which tables to include:

# First, get tables with direct tenant relationships
tenant_tables = db.tables_with_column("tenant_id")

# Next, get tables with secondary relationships
secondary_tables = db.tables_with_column("secondary_id")

# Process standard tables with simple filtering
(tenant_tables + secondary_tables).uniq.each do |table_name|
  next if table_name == "tenants"             # Already processed
  next if SPECIAL_TABLES.include?(table_name) # Handled below with custom queries
  process_table(f, table_name)
end

# Process special case tables with custom queries
SPECIAL_TABLES.each do |table_name|
  process_table(f, table_name)
end

I maintain a constant called SPECIAL_TABLES, which lists all tables requiring custom queries. When the exporter encounters these tables, it looks up the corresponding query in the YAML configuration rather than constructing a generic one.
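
In rough terms, that lookup works like this (the constant's contents below are just the three tables discussed in this post, and write_copy_statement stands in for the actual export step):

# Tables whose extraction SQL comes from queries.yml rather than being generated
SPECIAL_TABLES = %w[activity_logs transaction_history worker_tasks].freeze

def process_table(f, table_name)
  query =
    if SPECIAL_TABLES.include?(table_name)
      @queries.fetch(table_name) # custom SQL defined in the YAML config
    else
      "SELECT * FROM #{table_name} WHERE tenant_id IN (SELECT id FROM extraction_tenant_ids)"
    end

  write_copy_statement(f, table_name, query) # hypothetical helper for the actual export
end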

Batched Processing for Large Tables

For tables with millions of rows, I implemented batched processing to prevent memory issues:

def export_large_table(table_name, query)
  batch_size = 10_000
  offset = 0
  total_rows = 0
  
  loop do
    batch_query = "#{query} LIMIT #{batch_size} OFFSET #{offset}"
    rows = db.connection.exec(batch_query)
    break if rows.count == 0
    
    # Process this batch
    export_rows_to_csv(table_name, rows)
    
    total_rows += rows.count
    offset += batch_size
    puts "  Exported #{total_rows} rows from #{table_name}"
  end
end

This approach allows processing tables of any size without running out of memory and provides better progress reporting.

Parallelization Strategy for Performance

One of the key performance enhancements I implemented was parallelizing both the export and import processes. With large tenants having hundreds of tables, processing them sequentially was a major bottleneck.

Parallel Export Process

I leveraged the Ruby Parallel gem to process multiple tables simultaneously:

def process_tables_in_parallel(tables)
  puts "Processing #{tables.size} tables in parallel with #{@parallel_workers} workers..."
  
  Parallel.each(tables, in_processes: @parallel_workers) do |table_name|
    begin
      # Create a separate exporter instance for this table
      table_exporter = TableExporter.new(@db, @config, table_name, @output_dir, @progress_file, @queries)
      table_exporter.process
    rescue => e
      puts "Error processing table #{table_name}: #{e.message}"
      # Log the error but continue with other tables
      next
    end
  end
end

The implementation includes several important features:
  1. Dynamic Worker Count: The number of parallel workers is configurable based on the available CPU cores and memory.
  2. Independent Table Exporters: Each table gets its own exporter instance to avoid sharing state.
  3. Isolation of Failures: Errors in one table's export won't affect other tables.
  4. Progress Tracking: Each exporter updates a shared progress file to enable resuming interrupted operations (see the sketch after this list).
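
The progress tracking itself is deliberately simple; a rough sketch, assuming the shared progress file holds one completed table name per line (method names are illustrative):

# Record a completed table so an interrupted run can skip it next time.
def mark_table_complete(table_name)
  File.open(@progress_file, "a") { |f| f.puts(table_name) }
end

# Check whether a table was already exported by a previous (interrupted) run.
def already_exported?(table_name)
  File.exist?(@progress_file) &&
    File.readlines(@progress_file, chomp: true).include?(table_name)
end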

For large tenants, I found that optimal performance usually comes with worker counts between 4 and 8, depending on the server's hardware.

Parallel Import Strategy

The import process also benefits from parallelization, although it requires a different approach due to database constraints:

def import_data_in_parallel
  # First, import tables that other tables depend on
  import_priority_tables

  # Get remaining tables to import
  remaining_tables = @tables - @processed_tables
  
  # Group tables for parallel import
  table_groups = create_table_groups(remaining_tables)
  
  table_groups.each_with_index do |group, index|
    puts "Importing table group #{index + 1}/#{table_groups.size} (#{group.size} tables)..."
    
    Parallel.each(group, in_processes: @parallel_workers) do |table|
      import_table(table)
    end
  end
end

def create_table_groups(tables)
  # Use foreign key relationships to group tables that can be safely imported in parallel
  groups = []
  dependency_graph = build_dependency_graph(tables)
  
  until dependency_graph.empty?
    # Find tables with no dependencies
    independent_tables = dependency_graph.keys.select { |t| dependency_graph[t].empty? }

    # Guard against circular foreign keys, which would otherwise loop forever
    raise "Circular dependency among tables: #{dependency_graph.keys.join(', ')}" if independent_tables.empty?

    groups << independent_tables
    
    # Remove these tables from the graph
    independent_tables.each { |t| dependency_graph.delete(t) }
    
    # Remove these tables from dependencies
    dependency_graph.each { |_, deps| deps.subtract(independent_tables) }
  end
  
  groups
end

This implementation includes several critical enhancements:

  1. Dependency-Aware Parallelization: Tables are grouped based on their foreign key relationships to avoid constraint violations.
  2. Priority Tables First: Core tables that other tables depend on are imported first, sequentially.
  3. Group-Based Processing: Tables are imported in waves, each containing tables that can be safely imported in parallel.
  4. Transaction Isolation: Each table import runs in its own transaction to prevent deadlocks (see the sketch after this list).
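
Here's a rough sketch of what import_table looks like conceptually; the CSV-based COPY loading and file paths are assumptions for illustration, but the per-table transaction is the important part:

def import_table(table)
  csv_path = File.join(@input_dir, "#{table}.csv")

  db.connection.exec("BEGIN;")
  # Stream the table's CSV into PostgreSQL via COPY
  db.connection.copy_data("COPY #{table} FROM STDIN WITH CSV HEADER") do
    File.foreach(csv_path) { |line| db.connection.put_copy_data(line) }
  end
  db.connection.exec("COMMIT;")
rescue => e
  db.connection.exec("ROLLBACK;")
  puts "Error importing #{table}: #{e.message}"
end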

The results were impressive: import times for large tenants dropped from over an hour to under 15 minutes. The graph-based approach to managing dependencies was essential for maintaining data integrity while maximizing parallelism.

A key learning was that blindly parallelizing database operations could lead to deadlocks and constraint violations. Instead, a carefully orchestrated approach that respects database dependencies provides the best balance of performance and reliability.
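
For completeness, build_dependency_graph isn't shown above. A minimal sketch of one way to build it, querying PostgreSQL's pg_constraint catalog for foreign keys and keeping only the tables involved in this import:

require "set"

# Map each table to the set of other tables it references via foreign keys,
# restricted to the tables taking part in this import.
def build_dependency_graph(tables)
  fk_rows = db.connection.exec(<<~SQL)
    SELECT conrelid::regclass::text  AS table_name,
           confrelid::regclass::text AS referenced_table
    FROM pg_constraint
    WHERE contype = 'f'
  SQL

  graph = tables.each_with_object({}) { |t, h| h[t] = Set.new }
  fk_rows.each do |row|
    table, referenced = row["table_name"], row["referenced_table"]
    next unless graph.key?(table) && graph.key?(referenced)
    graph[table] << referenced unless table == referenced # ignore self-references
  end
  graph
end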

Improving Error Resilience for Tenant Imports

Besides optimizing extraction size, I also improved the reliability of the import process when setting up the isolated tenant environment. One key improvement was in updating database sequences after import:

# Breaking sequence updates into batches with error handling
def update_sequences
  puts "Updating sequences..."
  batch_size = 50

  sequences.each_slice(batch_size) do |batch|
    # Create a transaction for each batch
    db.connection.exec("BEGIN;")
    
    batch.each do |sequence|
      begin
        # Use a savepoint so one failed setval doesn't abort the whole batch
        db.connection.exec("SAVEPOINT seq_update;")

        # Update each sequence to match the imported data
        sql = "SELECT setval('#{sequence[:name]}', COALESCE((SELECT MAX(#{sequence[:column]}) FROM #{sequence[:table]}), 1), true)"
        db.connection.exec(sql)
        db.connection.exec("RELEASE SAVEPOINT seq_update;")
        puts "  Updated sequence #{sequence[:name]}"
      rescue => e
        # Roll back to the savepoint and continue with the other sequences
        db.connection.exec("ROLLBACK TO SAVEPOINT seq_update;")
        puts "  Error updating sequence #{sequence[:name]}: #{e.message}"
      end
    end
    
    # Commit the batch
    db.connection.exec("COMMIT;")
  end
  
  puts "Sequences updated."
end

This approach:
  • Processes sequences in batches to avoid overwhelming the database
  • Uses per-sequence error handling (via savepoints) so a single failure doesn't stop the entire process
  • Wraps each batch in a transaction for better performance
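
The sequences list itself isn't shown above; here's a minimal sketch of one way to derive it, assuming every table's primary key column is id and reusing the tables_with_column helper:

# Build the list of sequences to update, assuming integer `id` primary keys;
# pg_get_serial_sequence returns NULL for tables without a serial/identity id.
def sequences
  db.tables_with_column("id").filter_map do |table|
    seq_name = db.connection.exec(
      "SELECT pg_get_serial_sequence('#{table}', 'id') AS seq"
    ).first["seq"]
    { name: seq_name, table: table, column: "id" } if seq_name
  end
end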

I also added a finalization phase to tenant imports:

def run_post_import_finalization
  begin
    # Re-enable constraints in a separate transaction
    db.connection.exec("SET session_replication_role = 'origin';")
    puts "Re-enabled database constraints"
    
    # Update sequences to match imported data
    update_sequences
    
    # Refresh materialized views if they exist
    if table_exists?("materialized_views_meta")
      puts "Refreshing materialized views..."
      db.connection.exec("SELECT refresh_all_materialized_views()")
    end
    
    puts "Post-import finalization completed successfully"
    true
  rescue => e
    puts "Error during post-import finalization: #{e.message}"
    puts "Attempting to continue despite error..."
    # Still try to enable constraints even if other steps fail
    begin
      db.connection.exec("SET session_replication_role = 'origin';")
    rescue => inner_e
      puts "Fatal error: Failed to re-enable constraints: #{inner_e.message}"
    end
    false
  end
end

This includes multiple levels of error handling to ensure that even if parts of the finalization fail, critical steps like re-enabling constraints are still executed.
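
The counterpart to that 're-enable' step runs before any data is imported; roughly like this (setting session_replication_role to 'replica' suppresses triggers and foreign key enforcement for the session, and typically requires elevated privileges):

# Before importing, relax constraint enforcement so tables can load in any
# order; run_post_import_finalization switches the role back to 'origin'.
def prepare_for_import
  db.connection.exec("SET session_replication_role = 'replica';")
  puts "Disabled database constraints for import"
end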

Architecture Lessons

This project taught me several valuable lessons about designing tenant extraction systems for large monolithic databases:

  1. Be selective about data: Not all historical data is necessary for a functioning tenant environment.
  2. Use database features effectively: CTEs, window functions, and other PostgreSQL features can help you filter data efficiently.
  3. Prioritize error resilience: Assume failures will happen and design your system to continue despite them.
  4. Batch operations for large datasets: Breaking work into smaller transactions reduces memory usage and improves reliability.
  5. Configuration-driven design: Storing queries in a YAML file rather than hardcoding them made the system more maintainable and adaptable.
  6. Progress tracking and resumability: Storing progress information allowed the system to resume interrupted extractions, a crucial feature for large tenants.

Conclusion

By being more selective about what data I included when extracting tenants from a monolith and improving error handling during imports, I was able to dramatically improve the performance and reliability of the tenant extraction system.

The key insight was recognizing that tenant extraction doesn't need to include every historical record to be useful; it just needs the data necessary for a functional isolated environment. By applying targeted filters to the largest tables, I achieved dramatic improvements with minimal risk.

These techniques can be applied to almost any Rails monolith with a growing PostgreSQL database to make tenant extraction more manageable and reliable.
