Configuration Guide - Data Phantom Platform

Overview

Data Phantom uses a YAML-based configuration system that allows you to customize all aspects of the application without code changes. The configuration file supports environment variable substitution for sensitive data.

Configuration File: Located at src/main/resources/config-dev.yml
Environment Support: Use ${VAR_NAME:default_value} syntax for environment variables
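
The placeholder syntax above can be illustrated with a minimal sketch. It is not Data Phantom's actual resolver, which this guide does not show. It assumes the text after the first colon is the default, so defaults that themselves contain colons (such as JDBC URLs) stay intact. The function name `resolve_placeholders` is hypothetical.

```python
import os
import re

# Matches ${VAR_NAME} or ${VAR_NAME:default_value}.
# Everything after the first colon, up to the closing brace, is the default.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def resolve_placeholders(text, env=None):
    """Replace each placeholder with the environment value or its default."""
    env = os.environ if env is None else env

    def _replace(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise KeyError(f"{name} is not set and has no default")

    return _PLACEHOLDER.sub(_replace, text)


# With MYSQL_USER unset, the default after the colon is used.
print(resolve_placeholders("user: ${MYSQL_USER:root}", env={}))
```

Under this reading, a variable set in the environment always overrides the default written in the file.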

Configuration File Structure

The configuration file is organized into logical sections:

config-dev.yml
├── server                   # Server and port configuration
├── meta_store               # Database connection settings
├── jwt                      # JWT authentication settings
├── concurrency_config       # Thread pool and performance settings
├── reconciliation_settings  # Data comparison settings
└── connector                # External service connectors
    ├── aws_emr              # AWS EMR configuration
    └── mysql                # MySQL connector settings

Server Configuration

Configure the main application server and administrative interface:

Basic Server Configuration

server:
  applicationConnectors:
    - type: http
      port: 9092
      bindHost: 0.0.0.0

  adminConnectors:
    - type: http
      port: 9093
      bindHost: 0.0.0.0

Server Parameters

Parameter	Default	Description
`applicationConnectors.port`	9092	Main API server port
`adminConnectors.port`	9093	Administrative interface port
`bindHost`	0.0.0.0	Network interface to bind to

Security Note: In production, consider binding to specific interfaces rather than 0.0.0.0, and use HTTPS with proper SSL certificates.

Database Configuration

Configure the metadata database connection and connection pool settings:

Database Configuration

meta_store:
  driverClass: org.mariadb.jdbc.Driver
  url: ${MYSQL_URL:jdbc:mariadb://localhost:3306/data_phantom}
  user: ${MYSQL_USER:root}
  password: ${MYSQL_PASSWORD:your-password}

  # Connection Pool Settings
  maxSize: 50
  minSize: 10
  maxWaitForConnection: 30s
  maxConnectionAge: 30m
  minIdleTime: 10m

  # Health Check Settings
  validationQuery: "SELECT 1"
  validationQueryTimeout: 3s
  checkConnectionOnBorrow: true
  checkConnectionOnReturn: true

Database Parameters

Parameter	Recommended	Description
`maxSize`	50	Maximum number of database connections
`minSize`	10	Minimum number of connections to maintain
`maxWaitForConnection`	30s	Maximum time to wait for a connection
`maxConnectionAge`	30m	Maximum lifetime of a connection

JWT Authentication

Configure JSON Web Token authentication settings:

JWT Configuration

jwt:
  secretKey: "${JWT_SECRET_KEY:your-super-secret-jwt-key-change-this-in-production}"

  tokenExpirationMinutes: 60
  refreshTokenExpirationDays: 7

Security Requirements

Secret Key: Must be at least 256 bits (32 bytes, which is 32 ASCII characters) for HS256
Production: Always use environment variables for the secret key
Rotation: Consider implementing key rotation for enhanced security
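
The length requirement can be checked with a short sketch. It assumes the configured string's raw bytes are used as the HMAC key; whether Data Phantom decodes the value first is not stated in this guide. The function names are hypothetical.

```python
import base64
import secrets

MIN_HS256_KEY_BYTES = 32  # 256 bits


def generate_secret_key(num_bytes=MIN_HS256_KEY_BYTES):
    """Return random bytes as base64 text, similar to `openssl rand -base64 32`."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def is_long_enough(secret_key):
    """Check that the key string holds at least 32 bytes when UTF-8 encoded."""
    return len(secret_key.encode("utf-8")) >= MIN_HS256_KEY_BYTES


key = generate_secret_key()
print(len(key), is_long_enough(key))
```

A length check says nothing about randomness: the placeholder default in the configuration is long enough to pass, but it is publicly known and must still be replaced.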

Generating a Secure Secret Key

# Generate a secure random key
openssl rand -base64 32

# Or use Java
java -cp target/classes com.annihilator.data.playground.utility.KeyGenerator

Concurrency Configuration

Configure thread pools and execution settings for optimal performance:

Concurrency Settings

concurrency_config:
  # Thread Pool Sizes
  adhoc_threadpool_size: 200
  scheduled_threadpool_size: 200

  # Scheduler Settings (values in milliseconds)
  scheduler_sleep_time: 300000                 # 5 minutes
  playground_execution_grace_period: 300000    # 5 minutes
  playground_max_execution_frequency: 360000   # 6 minutes

Performance Guidelines

Thread Pool Sizing

adhoc_threadpool_size: Based on expected concurrent ad-hoc runs
scheduled_threadpool_size: Based on number of scheduled playgrounds
Recommendation: Start with 200, monitor CPU and memory usage

Scheduler Timing

scheduler_sleep_time: Priority queue scan interval (5 minutes default)
playground_execution_grace_period: Time before considering a playground stuck
playground_max_execution_frequency: Prevents playground execution overlap
Auto-discovery: Picks up playground cron changes every 5 minutes

AWS Integration

Configure AWS EMR and S3 integration for cloud-based data processing:

AWS EMR Configuration

connector:
  aws_emr:
    # Authentication
    access_key: ${AWS_ACCESS_KEY_ID:your-access-key}
    secret_key: ${AWS_SECRET_ACCESS_KEY:your-secret-key}

    # S3 Configuration
    s3_bucket: ${AWS_S3_BUCKET:your-s3-bucket}
    s3_path_prefix: ${AWS_S3_PATH_PREFIX:data-phantom}

    # Regional Settings
    region: ${AWS_REGION:us-east-1}

    # EMR Cluster Settings
    stack_name: ${AWS_STACK_NAME:DataPhantomClusterStack}
    cluster_logical_id: ${AWS_CLUSTER_LOGICAL_ID:DataPhantomCluster}

    # Performance Settings
    step_polling_interval: 30000         # 30 seconds
    stack_update_polling_interval: 30000 # 30 seconds
    stack_update_check_max_attempt: 60

    # S3 Listing and Retry Settings
    s3_output_preview_line_count: 100
    s3_max_keys_per_request: 20
    max_step_retries: 3

AWS Optimization Tips

Cost Optimization

Use spot instances for EMR clusters when possible
Configure auto-termination for idle clusters
Use S3 lifecycle policies for data archival

Performance Optimization

step_polling_interval: Lower values = faster updates, more API calls
s3_max_keys_per_request: Higher values = fewer API calls, more memory
max_step_retries: Higher values = more resilience, longer recovery time
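
To see how the polling settings interact, the sketch below estimates the longest time spent waiting on a stack update. It assumes one check per polling interval and no extra backoff, which this guide does not confirm. The helper name is hypothetical.

```python
def max_wait_minutes(polling_interval_ms, max_attempts):
    """Upper bound on total polling time, in minutes."""
    return polling_interval_ms * max_attempts / 60_000


# stack_update_polling_interval = 30000 ms, stack_update_check_max_attempt = 60
print(max_wait_minutes(30000, 60))  # 30.0
```

Under that assumption, the values shown above allow up to 30 minutes for a stack update before the attempts run out. Halving the interval while keeping the attempt count halves that window.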

Data Reconciliation

Configure settings for data comparison and validation:

Reconciliation Settings

reconciliation_settings:
  exact_match_threshold: 1048576      # 1MB - files smaller use exact matching

  false_positive_rate: 0.1            # 10% false positive rate for bloom filter
  estimated_rows: 1000000             # Estimated rows for bloom filter sizing

Algorithm Selection

Exact Matching

Used for files smaller than exact_match_threshold

Pros: 100% accurate, no false positives
Cons: Memory intensive for large files
Best for: Small to medium datasets

Bloom Filter

Used for files larger than exact_match_threshold

Pros: Memory efficient, fast processing
Cons: Can report false positives (the rate is set by false_positive_rate)
Best for: Large datasets (millions of records)
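
The selection rule and the effect of the two Bloom filter settings can be sketched as follows. The sizing uses the standard textbook formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. Data Phantom's own sizing may differ, and treating a file exactly at the threshold as "large" is an assumption. Both function names are hypothetical.

```python
import math


def choose_algorithm(file_size_bytes, exact_match_threshold=1_048_576):
    """Pick exact matching below the threshold, a Bloom filter otherwise."""
    return "exact" if file_size_bytes < exact_match_threshold else "bloom"


def bloom_filter_size(estimated_rows, false_positive_rate):
    """Return (bits, hash_count) from the standard Bloom filter formulas."""
    bits = math.ceil(
        -estimated_rows * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    hash_count = max(1, round(bits / estimated_rows * math.log(2)))
    return bits, hash_count


print(choose_algorithm(500_000))          # exact
print(bloom_filter_size(1_000_000, 0.1))  # about 4.8 million bits, 3 hashes
```

With the values shown above (1,000,000 rows at a 10% false positive rate), the filter needs roughly 4.8 million bits, or about 0.6 MB. Lowering false_positive_rate to 0.01 roughly doubles that, which is the memory and accuracy trade-off to weigh when tuning these settings.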

Configuration Management

All configurations and secrets should be managed in config-dev.yml:

Database Configuration

# Add to src/main/resources/config-dev.yml

# ============================================
# Database Configuration
# ============================================
meta_store:
  driverClass: org.mariadb.jdbc.Driver
  url: jdbc:mariadb://localhost:3306/data_phantom
  user: root
  password: your_secure_password
  maxSize: 50
  minSize: 10
  maxWaitForConnection: 30s
  maxConnectionAge: 30m
  minIdleTime: 10m
  validationQuery: "SELECT 1"
  validationQueryTimeout: 3s

AWS Configuration

# Add to src/main/resources/config-dev.yml

# ============================================
# AWS EMR Configuration
# ============================================
connector:
  aws_emr:
    access_key: your-access-key
    secret_key: your-secret-key
    s3_bucket: your-s3-bucket
    s3_path_prefix: data-phantom
    region: us-east-1
    stack_name: DataPhantomClusterStack
    cluster_logical_id: DataPhantomCluster
    step_polling_interval: 30000
    max_step_retries: 3

Security Configuration

# Add to src/main/resources/config-dev.yml

# ============================================
# JWT Security Configuration
# ============================================
jwt:
  secretKey: "your-super-secret-jwt-key-change-this-in-production"
  tokenExpirationMinutes: 60
  refreshTokenExpirationDays: 7

# ============================================
# SQL Connector Configuration
# ============================================
# If the file already has a top-level connector: key (for example for
# aws_emr), add mysql under that key instead of repeating connector:.
connector:
  mysql:
    output_directory: /tmp/sql-output

Best Practices for Configuration

Never commit secrets: Add config-dev.yml to .gitignore if it contains real credentials
Use different configs per environment: Create config-dev.yml, config-staging.yml, config-prod.yml
Keep a template: Maintain config-template.yml with placeholder values in version control
Environment-specific values: Use different credential sets for development, staging, and production

# Example: Create a template file for version control
cp config-dev.yml config-template.yml

# Replace real values with placeholders in template
# (GNU sed syntax; on macOS/BSD sed use: sed -i '' ...)
sed -i 's/password: .*/password: YOUR_PASSWORD_HERE/g' config-template.yml
sed -i 's/access_key: .*/access_key: YOUR_AWS_ACCESS_KEY/g' config-template.yml
sed -i 's/secret_key: .*/secret_key: YOUR_AWS_SECRET_KEY/g' config-template.yml
# The JWT key is spelled secretKey, so it needs its own pattern
sed -i 's/secretKey: .*/secretKey: YOUR_JWT_SECRET_KEY/g' config-template.yml

# Add template to git, ignore actual config
git add config-template.yml
echo "config-dev.yml" >> .gitignore

Performance Tuning

Optimize Data Phantom for your specific workload and infrastructure:

Server Performance

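# JVM Tuning (add to startup script)
export JAVA_OPTS="-Xmx4g -Xms2g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"

# For production workloads (use instead of the line above; the second export overrides the first)
# The experimental JVMCI flags are JDK-specific: confirm your JVM accepts them before relying on them
export JAVA_OPTS="-Xmx8g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI"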

Database Performance

# High-performance database configuration
meta_store:
  maxSize: 100              # More connections for high concurrency
  minSize: 20               # Keep more connections warm
  maxWaitForConnection: 10s # Fail fast under load
  maxConnectionAge: 15m     # Shorter connection lifetime
  minIdleTime: 5m           # More aggressive cleanup

Concurrency Tuning

# High-throughput configuration
concurrency_config:
  adhoc_threadpool_size: 500       # More concurrent executions
  scheduled_threadpool_size: 200   # Adequate for scheduled tasks
  scheduler_sleep_time: 60000      # More frequent checks (1 minute)

Performance Monitoring

Monitor these key metrics to optimize performance:

CPU Usage: Should typically be below 80%
Memory Usage: Monitor heap usage and GC frequency
Database Connections: Monitor pool utilization
Thread Pool Usage: Monitor queue sizes and active threads
API Response Times: Track endpoint performance

Security Best Practices

Secure your Data Phantom deployment:

Authentication & Authorization

Use strong, randomly generated JWT secret keys
Implement short token expiration times (15-60 minutes)
Use refresh tokens for long-lived sessions
Consider implementing role-based access control (RBAC)

Network Security

Use HTTPS in production with valid SSL certificates
Implement firewall rules to restrict access
Use VPC and security groups in AWS
Consider using a reverse proxy (nginx, Apache)

Database Security

Use strong database passwords
Enable SSL/TLS for database connections
Create dedicated database users with minimal permissions
Schedule regular database backups and enable encryption at rest

AWS Security

Use IAM roles instead of access keys when possible
Follow the principle of least privilege for AWS permissions
Enable S3 bucket encryption
Use VPC endpoints for S3 access

Production Deployment Checklist

Change the default JWT secret key
Use environment variables for all sensitive data
Enable HTTPS with valid SSL certificates
Configure firewall and network access rules
Set up monitoring and logging
Configure database backup strategy
Review and test disaster recovery procedures