Introduction
Multi-tenant applications often start as monoliths, but as they grow, there are compelling reasons to extract specific tenants into isolated environments.
This could be for performance isolation, security requirements, or to offer premium dedicated hosting.
In this post, I'll share how I built a system to efficiently extract tenants from a large Rails monolith with a complex data model into dedicated environments, while significantly reducing extraction size and improving performance.
The Challenge
The application is a multi-tenant platform with the following characteristics:
- Multiple tenants with parent-child relationships
- Complex cross-tenant associations and polymorphic relationships
- Several huge tables (10+ million rows each)
- Critical data that must be preserved during tenant extraction
- Extraction and import operations that were becoming prohibitively slow
Some tables had grown to contain years of historical data, most of which wasn't critical for a functional tenant environment. This led to unnecessarily large extraction files (several GB per tenant) and import times exceeding an hour, which was problematic for quickly creating new isolated environments.
System Architecture Overview
Before diving into the optimizations, it's worth understanding the tenant extraction system's architecture:
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  CLI Interface  │────▶│ Extract Runner  │────▶│    Exporter     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  SQL Generator  │◀────│  Database Util  │◀────│  Query Config   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```
Key components:
- CLI Interface: Provides command-line options and argument parsing
- Extract Runner: Orchestrates the overall extraction process
- Exporter: Handles the core data export logic
- Database Util: Provides database connectivity and utility functions
- Query Config: Stores SQL queries in YAML for maintainability
- SQL Generator: Creates SQL scripts for extraction and import
This modular design allows for clean separation of concerns and easier maintenance.
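To illustrate the Query Config component, here is a minimal sketch of how custom SQL can be kept in YAML and looked up by table name, so changing a filter never requires touching Ruby code. The class name `QueryConfig` and the inline YAML are illustrative assumptions, not the exact implementation.

```ruby
require "yaml"

# Sketch of the Query Config component: custom queries live in a YAML
# file keyed by table name; tables without an entry fall back to a
# generic tenant_id filter.
class QueryConfig
  def initialize(yaml_source)
    @queries = YAML.safe_load(yaml_source)
  end

  # Returns the custom SQL for a table, or nil if the table
  # should use the generic filter instead.
  def query_for(table_name)
    @queries[table_name]
  end
end

config = QueryConfig.new(<<~YAML)
  worker_tasks: "SELECT * FROM worker_tasks WHERE tenant_id IN (SELECT id FROM extraction_tenant_ids) ORDER BY created_at DESC LIMIT 10000"
YAML

config.query_for("worker_tasks") # custom SQL string
config.query_for("users")        # nil, so the exporter builds a generic query
```

In the real system the YAML would be loaded from `queries.yml` on disk; the lookup logic is the same.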
My Approach: Smart Data Filtering for Tenant Extraction
Rather than extracting every row in every table for a tenant, I identified opportunities to be more selective while maintaining data integrity. Here's how I optimized three of the largest tables when extracting tenant data:
1. Audit Logs Table
The audit logs table records every action performed on various entity types. This table had grown to over 20 million rows, but most historical logs weren't critical for a functioning tenant environment.
My solution: Use Common Table Expressions (CTEs) to selectively limit logs by entity type and recency.
```yaml
# In queries.yml - Using CTEs to filter logs by entity type
activity_logs: |
  WITH type_a_logs AS (
    SELECT logs.*
    FROM activity_logs logs
    WHERE logs.entity_type = 'TypeA'
      AND logs.entity_id IN (
        SELECT id FROM entity_a
        WHERE tenant_id IN (
          SELECT id FROM tenants
          WHERE parent_id IN (SELECT id FROM extraction_tenant_ids)
        )
      )
    ORDER BY logs.id DESC
    LIMIT 100000
  ),
  type_b_logs AS (
    SELECT logs.*
    FROM activity_logs logs
    WHERE logs.entity_type = 'TypeB'
      AND logs.entity_id IN (
        SELECT id FROM entity_b
        WHERE tenant_id IN (
          SELECT id FROM tenants
          WHERE parent_id IN (SELECT id FROM extraction_tenant_ids)
        )
      )
    ORDER BY logs.id DESC
    LIMIT 100000
  ),
  type_c_logs AS (
    SELECT logs.*
    FROM activity_logs logs
    WHERE logs.entity_type = 'TypeC'
      AND logs.entity_id IN (
        SELECT id FROM entity_c
        WHERE owner_id IN (
          SELECT owner_id FROM tenants
          WHERE id IN (SELECT id FROM extraction_tenant_ids)
        )
      )
    ORDER BY logs.id DESC
    LIMIT 100000
  )
  SELECT * FROM type_a_logs
  UNION ALL
  SELECT * FROM type_b_logs
  UNION ALL
  SELECT * FROM type_c_logs
```
This query:
- Creates a separate CTE for each entity type I care about (TypeA, TypeB, TypeC)
- Finds, for each entity type, only the entities related to the tenant being extracted
- Limits results to the 100,000 most recent logs per entity type
- Combines the results with UNION ALL
The beauty of this approach is that it respects the appropriate join path for each entity type. TypeC uses different join conditions than TypeA and TypeB because it relates to the tenant through an owner rather than a direct tenant relationship.
2. Historical Transaction Table
My transaction history table tracked every change in a critical system, growing to millions of rows per tenant. For tenant extraction purposes, I only needed recent history.
```yaml
# In queries.yml - Filtering to only recent transactions
transaction_history: >
  SELECT * FROM transaction_history
  WHERE tenant_id IN (SELECT id FROM extraction_tenant_ids)
  AND created_at > (NOW() - INTERVAL '1 month')
```
This simple date filter reduced the extraction size by over 90% while preserving all recent transaction data. After analyzing usage patterns and determining that older transactions were rarely referenced but could always be accessed in the primary database if needed, I chose one month as the cutoff point.
3. Worker Task Records
For background worker task records, I only needed the most recent entries:
```yaml
# In queries.yml - Limiting to the most recent task records
worker_tasks: >
  SELECT * FROM worker_tasks
  WHERE tenant_id IN (SELECT id FROM extraction_tenant_ids)
  ORDER BY created_at DESC
  LIMIT 10000
```
For this table, I chose a limit-based approach rather than a time-based one, since the frequency of task creation varies significantly between tenants. Some active tenants might generate 10,000 tasks in a week, while less active ones might take months to reach that number.
Technical Implementation Details
Table Selection Logic
The extraction process uses three mechanisms to determine which tables to include:
```ruby
# First, get tables with direct tenant relationships
tenant_tables = db.tables_with_column("tenant_id")

# Next, get tables with secondary relationships
secondary_tables = db.tables_with_column("secondary_id")

# Process standard tables with simple filtering
(tenant_tables + secondary_tables).uniq.each do |table_name|
  next if table_name == "tenants"             # Already processed
  next if SPECIAL_TABLES.include?(table_name) # Handled below with custom queries
  process_table(f, table_name)
end

# Process special case tables with custom queries
SPECIAL_TABLES.each do |table_name|
  process_table(f, table_name)
end
```
I maintain a constant called SPECIAL_TABLES, which lists all tables requiring custom queries. When the exporter encounters these tables, it looks up the corresponding query in the YAML configuration rather than constructing a generic one.
Batched Processing for Large Tables
For tables with millions of rows, I implemented batched processing to prevent memory issues:
```ruby
def export_large_table(table_name, query)
  batch_size = 10_000
  offset = 0
  total_rows = 0

  loop do
    batch_query = "#{query} LIMIT #{batch_size} OFFSET #{offset}"
    rows = db.connection.exec(batch_query)
    break if rows.count == 0

    # Process this batch
    export_rows_to_csv(table_name, rows)

    total_rows += rows.count
    offset += batch_size
    puts "  Exported #{total_rows} rows from #{table_name}"
  end
end
```
This approach allows processing tables of any size without running out of memory and provides better progress reporting.
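One caveat with OFFSET-based batching is that PostgreSQL still scans and discards all skipped rows on every page, so later batches get progressively slower. A keyset variant advances by the last id seen instead. The sketch below is a hypothetical alternative, not the system's actual code; it assumes the base query selects a monotonically increasing `id` column and has no LIMIT of its own.

```ruby
# Build one page of a keyset-paginated export query. Wrapping the base
# query in a subquery lets us filter on id without parsing its WHERE clause.
def keyset_batch_query(base_query, last_id, batch_size)
  "SELECT * FROM (#{base_query}) AS batch_source " \
  "WHERE id > #{last_id.to_i} ORDER BY id LIMIT #{batch_size.to_i}"
end

# The export loop would run each page, then feed the max id of the
# returned rows back in as last_id for the next page.
keyset_batch_query("SELECT * FROM worker_tasks", 0, 10_000)
```

Because each page seeks directly to `id > last_id` via the primary key index, page cost stays roughly constant regardless of how deep into the table the export is.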
Parallelization Strategy for Performance
One of the key performance enhancements I implemented was parallelizing both the export and import processes. With large tenants having hundreds of tables, processing them sequentially was a major bottleneck.
Parallel Export Process
I leveraged the Ruby Parallel gem to process multiple tables simultaneously:
```ruby
def process_tables_in_parallel(tables)
  puts "Processing #{tables.size} tables in parallel with #{@parallel_workers} workers..."

  Parallel.each(tables, in_processes: @parallel_workers) do |table_name|
    begin
      # Create a separate exporter instance for this table
      table_exporter = TableExporter.new(@db, @config, table_name,
                                         @output_dir, @progress_file, @queries)
      table_exporter.process
    rescue => e
      puts "Error processing table #{table_name}: #{e.message}"
      # Log the error but continue with other tables
      next
    end
  end
end
```
The implementation includes several important features:
- Dynamic Worker Count: The number of parallel workers is configurable based on the available CPU cores and memory.
- Independent Table Exporters: Each table gets its own exporter instance to avoid sharing state.
- Isolation of Failures: Errors in one table's export won't affect other tables.
- Progress Tracking: Each exporter updates a shared progress file to enable resuming interrupted operations.
For large tenants, I found that optimal performance usually occurs with worker counts between 4 and 8, depending on the server's hardware.
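Choosing the default worker count can be sketched in a few lines. This is an illustrative helper, not the system's actual code: it caps the count at the core count reported by the OS, with an upper bound of 8 matching the sweet spot observed above.

```ruby
require "etc"

# Pick a default parallel worker count: no more workers than CPU cores,
# and never more than the observed practical ceiling.
def default_worker_count(max: 8)
  [Etc.nprocessors, max].min
end

default_worker_count          # e.g. 8 on a 16-core box
default_worker_count(max: 4)  # explicit cap for memory-constrained hosts
```

In practice memory matters as much as cores, since each worker process holds its own database connection and batch buffers, so a CLI flag to override the default is worth having.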
Parallel Import Strategy
The import process also benefits from parallelization, although it requires a different approach due to database constraints:
```ruby
def import_data_in_parallel
  # First, import tables that other tables depend on
  import_priority_tables

  # Get remaining tables to import
  remaining_tables = @tables - @processed_tables

  # Group tables for parallel import
  table_groups = create_table_groups(remaining_tables)

  table_groups.each_with_index do |group, index|
    puts "Importing table group #{index + 1}/#{table_groups.size} (#{group.size} tables)..."
    Parallel.each(group, in_processes: @parallel_workers) do |table|
      import_table(table)
    end
  end
end

def create_table_groups(tables)
  # Use foreign key relationships to group tables that can be
  # safely imported in parallel
  groups = []
  dependency_graph = build_dependency_graph(tables)

  until dependency_graph.empty?
    # Find tables with no dependencies
    independent_tables = dependency_graph.keys.select { |t| dependency_graph[t].empty? }
    groups << independent_tables

    # Remove these tables from the graph
    independent_tables.each { |t| dependency_graph.delete(t) }

    # Remove these tables from dependencies
    dependency_graph.each { |_, deps| deps.subtract(independent_tables) }
  end

  groups
end
```
This implementation includes several critical enhancements:
- Dependency-Aware Parallelization: Tables are grouped based on their foreign key relationships to avoid constraint violations.
- Priority Tables First: Core tables that other tables depend on are imported first, sequentially.
- Group-Based Processing: Tables are imported in waves, each containing tables that can be safely imported in parallel.
- Transaction Isolation: Each table import runs in its own transaction to prevent deadlocks.
The results were impressive: import times for large tenants dropped from over an hour to under 15 minutes. The graph-based approach to managing dependencies was essential for maintaining data integrity while maximizing parallelism.
A key learning was that blindly parallelizing database operations could lead to deadlocks and constraint violations. Instead, a carefully orchestrated approach that respects database dependencies provides the best balance of performance and reliability.
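The `build_dependency_graph` helper referenced above isn't shown in full, so here is a self-contained sketch of the idea under simplified assumptions: foreign keys are passed in as `[child, parent]` pairs (the real system would introspect them from `pg_constraint`), and a cycle, if one ever appeared, would fall back to importing the remaining tables as one sequential group.

```ruby
require "set"

# Map each table to the set of tables it depends on via foreign keys.
def build_dependency_graph(tables, fk_pairs)
  graph = tables.to_h { |t| [t, Set.new] }
  fk_pairs.each do |child, parent|
    graph[child] << parent if graph.key?(child) && graph.key?(parent) && child != parent
  end
  graph
end

# Peel off waves of tables with no unmet dependencies; each wave can be
# imported in parallel, and waves run in order.
def group_tables(tables, fk_pairs)
  groups = []
  graph = build_dependency_graph(tables, fk_pairs)
  until graph.empty?
    independent = graph.keys.select { |t| graph[t].empty? }
    # A dependency cycle would leave nothing independent; bail out and
    # treat the remainder as one sequential group
    break groups << graph.keys if independent.empty?
    groups << independent
    independent.each { |t| graph.delete(t) }
    graph.each_value { |deps| deps.subtract(independent) }
  end
  groups
end

tables = %w[users orders order_items]
fks    = [["orders", "users"], ["order_items", "orders"]]
group_tables(tables, fks) # => [["users"], ["orders"], ["order_items"]]
```

This is a standard topological sort by levels: `users` has no dependencies and imports first, then `orders`, then `order_items`, each wave safe to parallelize internally.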
Improving Error Resilience for Tenant Imports
Besides optimizing extraction size, I also improved the reliability of the import process when setting up the isolated tenant environment. One key improvement was in updating database sequences after import:
```ruby
# Breaking sequence updates into batches with error handling
def update_sequences
  puts "Updating sequences..."
  batch_size = 50

  sequences.each_slice(batch_size) do |batch|
    # Create a transaction for each batch
    db.connection.exec("BEGIN;")

    batch.each do |sequence|
      begin
        # A savepoint lets us recover from a failed statement without
        # aborting the rest of the batch transaction
        db.connection.exec("SAVEPOINT seq_update;")

        # Update each sequence to match the imported data
        sql = "SELECT setval('#{sequence[:name]}', " \
              "COALESCE((SELECT MAX(#{sequence[:column]}) FROM #{sequence[:table]}), 1), true)"
        db.connection.exec(sql)
        db.connection.exec("RELEASE SAVEPOINT seq_update;")
        puts "  Updated sequence #{sequence[:name]}"
      rescue => e
        # Roll back to the savepoint so later statements still succeed;
        # without this, PostgreSQL rejects everything after the first
        # error until the transaction ends
        db.connection.exec("ROLLBACK TO SAVEPOINT seq_update;")
        puts "  Error updating sequence #{sequence[:name]}: #{e.message}"
      end
    end

    # Commit the batch
    db.connection.exec("COMMIT;")
  end

  puts "Sequences updated."
end
```
This approach:
- Processes sequences in batches to avoid overwhelming the database
- Uses individual error handling so a single failure doesn't stop the entire process
- Wraps each batch in a transaction for better performance
I also added a finalization phase to the tenant import process:
```ruby
def run_post_import_finalization
  begin
    # Re-enable constraints in a separate transaction
    db.connection.exec("SET session_replication_role = 'origin';")
    puts "Re-enabled database constraints"

    # Update sequences to match imported data
    update_sequences

    # Refresh materialized views if they exist
    if table_exists?("materialized_views_meta")
      puts "Refreshing materialized views..."
      db.connection.exec("SELECT refresh_all_materialized_views()")
    end

    puts "Post-import finalization completed successfully"
    true
  rescue => e
    puts "Error during post-import finalization: #{e.message}"
    puts "Attempting to continue despite error..."

    # Still try to enable constraints even if other steps fail
    begin
      db.connection.exec("SET session_replication_role = 'origin';")
    rescue => inner_e
      puts "Fatal error: Failed to re-enable constraints: #{inner_e.message}"
    end

    false
  end
end
```
This includes multiple levels of error handling to ensure that even if parts of the finalization fail, critical steps like re-enabling constraints are still executed.
Architecture Lessons
This project taught me several valuable lessons about designing tenant extraction systems for large monolithic databases:
- Be selective about data: Not all historical data is necessary for a functioning tenant environment.
- Use database features effectively: CTEs, window functions, and other PostgreSQL features can help you filter data efficiently.
- Prioritize error resilience: Assume failures will happen and design your system to continue despite them.
- Batch operations for large datasets: Breaking work into smaller transactions reduces memory usage and improves reliability.
- Configuration-driven design: Storing queries in a YAML file rather than hardcoding them made the system more maintainable and adaptable.
- Progress tracking and resumability: Storing progress information allowed the system to resume interrupted extractions, a crucial feature for large tenants.
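The progress-tracking lesson above can be sketched as a tiny JSON file recording which tables have finished, so a re-run skips them. This is an illustrative stand-in for the `@progress_file` mentioned earlier; the class and file format are assumptions, not the actual implementation.

```ruby
require "json"
require "tmpdir"

# Minimal resumable-progress sketch: persist completed table names so an
# interrupted extraction can pick up where it left off.
class ProgressTracker
  def initialize(path)
    @path = path
  end

  def completed
    File.exist?(@path) ? JSON.parse(File.read(@path)) : []
  end

  def mark_completed(table)
    File.write(@path, JSON.generate(completed | [table]))
  end

  def completed?(table)
    completed.include?(table)
  end
end

Dir.mktmpdir do |dir|
  tracker = ProgressTracker.new(File.join(dir, "progress.json"))
  tracker.mark_completed("users")
  tracker.completed?("users")  # true
  tracker.completed?("orders") # false, so a resumed run would export it
end
```

In the parallel exporter the writes would need a file lock or an append-only log, since multiple worker processes update the same file.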
Conclusion
By being more selective about what data I included when extracting tenants from a monolith and improving error handling during imports, I was able to dramatically improve the performance and reliability of the tenant extraction system.
The key insight was recognizing that tenant extraction doesn't need to include every historical record to be useful - it just needs to include the data necessary for a functional isolated environment.
I achieved dramatic improvements with minimal risk by applying targeted filters to the largest tables.
These techniques can be applied to almost any Rails monolith with a growing PostgreSQL database to make tenant extraction more manageable and reliable.