Overview

Data Phantom uses a YAML-based configuration system that allows you to customize all aspects of the application without code changes. The configuration file supports environment variable substitution for sensitive data.

Configuration File: Located at src/main/resources/config-dev.yml
Environment Support: Use ${VAR_NAME:default_value} syntax for environment variables
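
The placeholder syntax above can be illustrated with a short sketch. This is not Data Phantom's actual substitution code; the function name substitute and the choice to raise an error for an unset variable without a default are assumptions made for illustration only.

```python
import os
import re

# Matches ${NAME} or ${NAME:default}. Only the first colon separates the name
# from the default, so defaults such as JDBC URLs may contain colons.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def substitute(text, env=None):
    """Replace ${NAME:default} placeholders using env (os.environ if omitted)."""
    env = os.environ if env is None else env

    def resolve(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Assumption: a missing variable with no default is treated as an error.
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _PLACEHOLDER.sub(resolve, text)


line = "url: ${MYSQL_URL:jdbc:mariadb://localhost:3306/data_phantom}"
# Variable unset: the default after the first colon is used.
print(substitute(line, env={}))
# Variable set: the environment value wins over the default.
print(substitute(line, env={"MYSQL_URL": "jdbc:mariadb://db.internal:3306/data_phantom"}))
```

The host name db.internal is a made-up value. In practice the variable would be exported in the shell before the application starts.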

Configuration File Structure

The configuration file is organized into logical sections:

    config-dev.yml
    ├── server                    # Server and port configuration
    ├── meta_store                # Database connection settings
    ├── jwt                       # JWT authentication settings
    ├── concurrency_config        # Thread pool and performance settings
    ├── reconciliation_settings   # Data comparison settings
    └── connector                 # External service connectors
        ├── aws_emr               # AWS EMR configuration
        └── mysql                 # MySQL connector settings

Server Configuration

Configure the main application server and administrative interface:

Basic Server Configuration

    server:
      applicationConnectors:
        - type: http
          port: 9092
          bindHost: 0.0.0.0
      adminConnectors:
        - type: http
          port: 9093
          bindHost: 0.0.0.0

Server Parameters

    Parameter                     Default    Description
    applicationConnectors.port    9092       Main API server port
    adminConnectors.port          9093       Administrative interface port
    bindHost                      0.0.0.0    Network interface to bind to

Security Note: In production, consider binding to specific interfaces rather than 0.0.0.0, and use HTTPS with proper SSL certificates.

Database Configuration

Configure the metadata database connection and connection pool settings:

Database Configuration

    meta_store:
      driverClass: org.mariadb.jdbc.Driver
      url: ${MYSQL_URL:jdbc:mariadb://localhost:3306/data_phantom}
      user: ${MYSQL_USER:root}
      password: ${MYSQL_PASSWORD:your-password}

      # Connection Pool Settings
      maxSize: 50
      minSize: 10
      maxWaitForConnection: 30s
      maxConnectionAge: 30m
      minIdleTime: 10m

      # Health Check Settings
      validationQuery: "SELECT 1"
      validationQueryTimeout: 3s
      checkConnectionOnBorrow: true
      checkConnectionOnReturn: true

Database Parameters

    Parameter               Recommended    Description
    maxSize                 50             Maximum number of database connections
    minSize                 10             Minimum number of connections to maintain
    maxWaitForConnection    30s            Maximum time to wait for a connection
    maxConnectionAge        30m            Maximum lifetime of a connection

JWT Authentication

Configure JSON Web Token authentication settings:

JWT Configuration

    jwt:
      secretKey: "${JWT_SECRET_KEY:your-super-secret-jwt-key-change-this-in-production}"
      tokenExpirationMinutes: 60
      refreshTokenExpirationDays: 7

Security Requirements

  • Secret Key: Must be at least 256 bits (32 bytes) for HS256
  • Production: Always use environment variables for the secret key
  • Rotation: Consider implementing key rotation for enhanced security

Generating a Secure Secret Key

    # Generate a secure random key
    openssl rand -base64 32

    # Or use Java
    java -cp target/classes com.annihilator.data.playground.utility.KeyGenerator
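
For environments without openssl, a comparable key can be generated and checked with the Python standard library. This is a minimal sketch: the helper names are hypothetical, and the length check assumes the configured string is used as the key material as-is (whether Data Phantom Base64-decodes the value first is not stated here).

```python
import base64
import secrets

MIN_KEY_BYTES = 32  # 256 bits, the minimum stated above for HS256


def generate_secret_key(num_bytes=MIN_KEY_BYTES):
    """Return num_bytes of cryptographically secure randomness, Base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


def is_long_enough(key):
    """Check the key length in bytes, assuming the string itself is the key."""
    return len(key.encode("utf-8")) >= MIN_KEY_BYTES


key = generate_secret_key()
print(len(key))                  # 32 random bytes encode to 44 Base64 characters
print(is_long_enough(key))       # True
print(is_long_enough("secret"))  # False: far too short for HS256
```

A length check only catches keys that are too short. It says nothing about randomness, so a long but guessable placeholder value would still pass.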

Concurrency Configuration

Configure thread pools and execution settings for optimal performance:

Concurrency Settings

    concurrency_config:
      # Thread Pool Sizes
      adhoc_threadpool_size: 200
      scheduled_threadpool_size: 200

      # Scheduler Settings (values in milliseconds)
      scheduler_sleep_time: 300000                  # 5 minutes
      playground_execution_grace_period: 300000     # 5 minutes
      playground_max_execution_frequency: 360000    # 6 minutes

Performance Guidelines

Thread Pool Sizing

  • adhoc_threadpool_size: Based on expected concurrent ad-hoc runs
  • scheduled_threadpool_size: Based on number of scheduled playgrounds
  • Recommendation: Start with 200, monitor CPU and memory usage

Scheduler Timing

  • scheduler_sleep_time: Priority queue scan interval (5 minutes default)
  • playground_execution_grace_period: Time before considering a playground stuck
  • playground_max_execution_frequency: Prevents playground execution overlap
  • Auto-discovery: Picks up playground cron changes every 5 minutes

AWS Integration

Configure AWS EMR and S3 integration for cloud-based data processing:

AWS EMR Configuration

    connector:
      aws_emr:
        # Authentication
        access_key: ${AWS_ACCESS_KEY_ID:your-access-key}
        secret_key: ${AWS_SECRET_ACCESS_KEY:your-secret-key}

        # S3 Configuration
        s3_bucket: ${AWS_S3_BUCKET:your-s3-bucket}
        s3_path_prefix: ${AWS_S3_PATH_PREFIX:data-phantom}

        # Regional Settings
        region: ${AWS_REGION:us-east-1}

        # EMR Cluster Settings
        stack_name: ${AWS_STACK_NAME:DataPhantomClusterStack}
        cluster_logical_id: ${AWS_CLUSTER_LOGICAL_ID:DataPhantomCluster}

        # Performance Settings
        step_polling_interval: 30000            # 30 seconds
        stack_update_polling_interval: 30000    # 30 seconds
        stack_update_check_max_attempt: 60

        # S3 Settings
        s3_output_preview_line_count: 100
        s3_max_keys_per_request: 20
        max_step_retries: 3

AWS Optimization Tips

Cost Optimization

  • Use spot instances for EMR clusters when possible
  • Configure auto-termination for idle clusters
  • Use S3 lifecycle policies for data archival

Performance Optimization

  • step_polling_interval: Lower values = faster updates, more API calls
  • s3_max_keys_per_request: Higher values = fewer API calls, more memory
  • max_step_retries: Higher values = more resilience, longer recovery time
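
The polling trade-off can be made concrete with some quick arithmetic. The sketch below assumes, based only on the parameter names, that a stack update check gives up after stack_update_check_max_attempt polls spaced stack_update_polling_interval milliseconds apart. The helper names are hypothetical.

```python
MS_PER_MINUTE = 60_000
MS_PER_HOUR = 3_600_000


def max_wait_minutes(polling_interval_ms, max_attempts):
    """Longest time spent polling before the attempts run out (assumed behavior)."""
    return polling_interval_ms * max_attempts / MS_PER_MINUTE


def polls_per_hour(polling_interval_ms):
    """Number of status calls issued per hour for one polled resource."""
    return MS_PER_HOUR / polling_interval_ms


# Values from the configuration above: 30000 ms interval, 60 attempts.
print(max_wait_minutes(30_000, 60))  # 30.0 minutes
print(polls_per_hour(30_000))        # 120.0 calls per hour
# Lowering the interval to 10 seconds triples the call volume.
print(polls_per_hour(10_000))        # 360.0 calls per hour
```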

Data Reconciliation

Configure settings for data comparison and validation:

Reconciliation Settings

    reconciliation_settings:
      exact_match_threshold: 1048576    # 1MB - files smaller use exact matching
      false_positive_rate: 0.1          # 10% false positive rate for bloom filter
      estimated_rows: 1000000           # Estimated rows for bloom filter sizing
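
To get a feel for how false_positive_rate and estimated_rows interact, the standard Bloom filter sizing formulas can be applied to the values above. This is a sketch under the assumption that Data Phantom sizes its filter in the textbook way; the actual implementation may differ, and the function name is hypothetical.

```python
import math


def bloom_filter_size(estimated_rows, false_positive_rate):
    """Return (bits, hash_functions) from the textbook Bloom filter formulas.

    bits   m = -n * ln(p) / (ln 2)^2
    hashes k = (m / n) * ln 2
    """
    bits = math.ceil(
        -estimated_rows * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    hashes = max(1, round(bits / estimated_rows * math.log(2)))
    return bits, hashes


# Defaults from the configuration: 1,000,000 rows at a 10% false positive rate.
bits, hashes = bloom_filter_size(1_000_000, 0.1)
print(bits, hashes)            # roughly 4.8 million bits, 3 hash functions
print(round(bits / 8 / 1024))  # filter size in KiB

# Tightening the rate to 1% roughly doubles the memory requirement.
print(bloom_filter_size(1_000_000, 0.01))
```

If estimated_rows is set well below the real row count, the effective false positive rate rises above the configured value, so it is safer to overestimate.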

Algorithm Selection

Exact Matching

Used for files smaller than exact_match_threshold

  • Pros: 100% accurate, no false positives
  • Cons: Memory intensive for large files
  • Best for: Small to medium datasets

Bloom Filter

Used for files larger than exact_match_threshold

  • Pros: Memory efficient, fast processing
  • Cons: Can report false positives (rate set by false_positive_rate)
  • Best for: Large datasets (millions of records)
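
The selection rule described above amounts to a single size comparison. The sketch below is illustrative only: the function name and return values are hypothetical, and sending a file of exactly exact_match_threshold bytes to the Bloom filter path is an assumption, since the text only specifies "smaller" and "larger".

```python
EXACT_MATCH_THRESHOLD = 1_048_576  # bytes (1MB), from reconciliation_settings


def choose_strategy(file_size_bytes, threshold=EXACT_MATCH_THRESHOLD):
    """Pick a comparison strategy from the file size."""
    if file_size_bytes < threshold:
        return "exact"
    # Assumption: files at or above the threshold use the Bloom filter.
    return "bloom_filter"


print(choose_strategy(200 * 1024))         # exact
print(choose_strategy(50 * 1024 * 1024))   # bloom_filter
```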

Configuration Management

All configurations and secrets should be managed in config-dev.yml:

Database Configuration

    # Add to src/main/resources/config-dev.yml
    # ============================================
    # Database Configuration
    # ============================================
    meta_store:
      driverClass: org.mariadb.jdbc.Driver
      url: jdbc:mariadb://localhost:3306/data_phantom
      user: root
      password: your_secure_password
      maxSize: 50
      minSize: 10
      maxWaitForConnection: 30s
      maxConnectionAge: 30m
      minIdleTime: 10m
      validationQuery: "SELECT 1"
      validationQueryTimeout: 3s

AWS Configuration

    # Add to src/main/resources/config-dev.yml
    # ============================================
    # AWS EMR Configuration
    # ============================================
    connector:
      aws_emr:
        access_key: your-access-key
        secret_key: your-secret-key
        s3_bucket: your-s3-bucket
        s3_path_prefix: data-phantom
        region: us-east-1
        stack_name: DataPhantomClusterStack
        cluster_logical_id: DataPhantomCluster
        step_polling_interval: 30000
        max_step_retries: 3

Security Configuration

    # Add to src/main/resources/config-dev.yml
    # ============================================
    # JWT Security Configuration
    # ============================================
    jwt:
      secretKey: "your-super-secret-jwt-key-change-this-in-production"
      tokenExpirationMinutes: 60
      refreshTokenExpirationDays: 7

    # ============================================
    # SQL Connector Configuration
    # ============================================
    # If the file already has a connector: section (for example aws_emr),
    # add mysql under it rather than repeating the key; duplicate keys
    # are not valid YAML.
    connector:
      mysql:
        output_directory: /tmp/sql-output

Best Practices for Configuration

  • Never commit secrets: Add config-dev.yml to .gitignore if it contains real credentials
  • Use different configs per environment: Create config-dev.yml, config-staging.yml, config-prod.yml
  • Keep a template: Maintain config-template.yml with placeholder values in version control
  • Environment-specific values: Use different credential sets for development, staging, and production

    # Example: Create a template file for version control
    cp config-dev.yml config-template.yml

    # Replace real values with placeholders in template
    # (GNU sed syntax; on macOS/BSD use: sed -i '' ...)
    sed -i 's/password: .*/password: YOUR_PASSWORD_HERE/g' config-template.yml
    sed -i 's/access_key: .*/access_key: YOUR_AWS_ACCESS_KEY/g' config-template.yml
    sed -i 's/secret_key: .*/secret_key: YOUR_AWS_SECRET_KEY/g' config-template.yml
    # The JWT key is spelled secretKey, so it needs its own expression
    sed -i 's/secretKey: .*/secretKey: YOUR_JWT_SECRET_KEY/g' config-template.yml

    # Add template to git, ignore actual config
    git add config-template.yml
    echo "config-dev.yml" >> .gitignore
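
Because sed -i behaves differently between GNU and BSD/macOS, the same redaction can be sketched in Python as a portable alternative. The function name redact is hypothetical, and including secretKey (the JWT key) in the list is an assumption that it should be scrubbed as well. Unlike the sed expressions, this version only matches keys at the start of a line, after indentation.

```python
import re

# Key name -> placeholder written into the template.
PLACEHOLDERS = {
    "password": "YOUR_PASSWORD_HERE",
    "access_key": "YOUR_AWS_ACCESS_KEY",
    "secret_key": "YOUR_AWS_SECRET_KEY",
    "secretKey": "YOUR_JWT_SECRET_KEY",  # assumption: JWT key is redacted too
}


def redact(config_text):
    """Return config_text with the values of sensitive keys replaced."""
    for key, placeholder in PLACEHOLDERS.items():
        # Keep the indentation and "key: " prefix, replace the rest of the line.
        pattern = re.compile(rf"^([ \t]*{re.escape(key)}:[ \t]*).*$", re.MULTILINE)
        config_text = pattern.sub(
            lambda m, p=placeholder: m.group(1) + p, config_text
        )
    return config_text


sample = (
    "meta_store:\n"
    "  user: root\n"
    "  password: hunter2\n"
    "connector:\n"
    "  aws_emr:\n"
    "    access_key: AKIA-EXAMPLE\n"
    "    secret_key: not-a-real-secret\n"
)
print(redact(sample))
```

To produce the template file, read config-dev.yml, pass its contents through redact, and write the result to config-template.yml.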

Performance Tuning

Optimize Data Phantom for your specific workload and infrastructure:

Server Performance

    # JVM Tuning (add to startup script)
    export JAVA_OPTS="-Xmx4g -Xms2g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"

    # For production workloads
    export JAVA_OPTS="-Xmx8g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI"

Database Performance

    # High-performance database configuration
    meta_store:
      maxSize: 100                 # More connections for high concurrency
      minSize: 20                  # Keep more connections warm
      maxWaitForConnection: 10s    # Fail fast under load
      maxConnectionAge: 15m        # Shorter connection lifetime
      minIdleTime: 5m              # More aggressive cleanup

Concurrency Tuning

    # High-throughput configuration
    concurrency_config:
      adhoc_threadpool_size: 500        # More concurrent executions
      scheduled_threadpool_size: 200    # Adequate for scheduled tasks
      scheduler_sleep_time: 60000       # More frequent checks (1 minute)

Performance Monitoring

Monitor these key metrics to optimize performance:

  • CPU Usage: Should typically be below 80%
  • Memory Usage: Monitor heap usage and GC frequency
  • Database Connections: Monitor pool utilization
  • Thread Pool Usage: Monitor queue sizes and active threads
  • API Response Times: Track endpoint performance

Security Best Practices

Secure your Data Phantom deployment:

Authentication & Authorization

  • Use strong, randomly generated JWT secret keys
  • Implement short token expiration times (15-60 minutes)
  • Use refresh tokens for long-lived sessions
  • Consider implementing role-based access control (RBAC)

Network Security

  • Use HTTPS in production with valid SSL certificates
  • Implement firewall rules to restrict access
  • Use VPC and security groups in AWS
  • Consider using a reverse proxy (nginx, Apache)

Database Security

  • Use strong database passwords
  • Enable SSL/TLS for database connections
  • Create dedicated database users with minimal permissions
  • Schedule regular database backups and enable encryption at rest

AWS Security

  • Use IAM roles instead of access keys when possible
  • Follow principle of least privilege for AWS permissions
  • Enable S3 bucket encryption
  • Use VPC endpoints for S3 access

Production Deployment Checklist