Configuration Guide
Comprehensive guide to configuring Data Phantom Platform for your environment
Overview
Data Phantom uses a YAML-based configuration system that allows you to customize all aspects of the application without code changes. The configuration file supports environment variable substitution for sensitive data.
The default configuration file is `src/main/resources/config-dev.yml`.

Environment Support: use the `${VAR_NAME:default_value}` syntax for environment variables; the value after the colon is used as a fallback default when the variable is unset.
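To illustrate how `${VAR_NAME:default_value}` placeholders behave, here is a minimal standalone sketch of such a resolver (illustrative only — the platform's actual substitution grammar may differ in edge cases):

```python
import os
import re

# Matches ${VAR_NAME} or ${VAR_NAME:default_value}
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")

def resolve_placeholders(text: str) -> str:
    """Replace ${VAR:default} placeholders with the environment value,
    falling back to the default (or empty string) when the variable is unset."""
    def _sub(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        return os.environ.get(name, default if default is not None else "")
    return _PLACEHOLDER.sub(_sub, text)

# With MYSQL_URL unset, the default after the first ':' is used,
# including any colons inside the default value itself.
line = "url: ${MYSQL_URL:jdbc:mariadb://localhost:3306/data_phantom}"
print(resolve_placeholders(line))
```

Note that the default value may itself contain colons (as in a JDBC URL), so only the first colon after the variable name acts as a separator.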
Configuration File Structure
The configuration file is organized into logical sections:
```
config-dev.yml
├── server                    # Server and port configuration
├── meta_store                # Database connection settings
├── jwt                       # JWT authentication settings
├── concurrency_config        # Thread pool and performance settings
├── reconciliation_settings   # Data comparison settings
└── connector                 # External service connectors
    ├── aws_emr               # AWS EMR configuration
    └── mysql                 # MySQL connector settings
```
Server Configuration
Configure the main application server and administrative interface:
Basic Server Configuration
```yaml
server:
  applicationConnectors:
    - type: http
      port: 9092
      bindHost: 0.0.0.0
  adminConnectors:
    - type: http
      port: 9093
      bindHost: 0.0.0.0
```
Server Parameters
| Parameter | Default | Description |
|---|---|---|
| `applicationConnectors.port` | 9092 | Main API server port |
| `adminConnectors.port` | 9093 | Administrative interface port |
| `bindHost` | 0.0.0.0 | Network interface to bind to |
Database Configuration
Configure the metadata database connection and connection pool settings:
Database Configuration
```yaml
meta_store:
  driverClass: org.mariadb.jdbc.Driver
  url: ${MYSQL_URL:jdbc:mariadb://localhost:3306/data_phantom}
  user: ${MYSQL_USER:root}
  password: ${MYSQL_PASSWORD:your-password}

  # Connection Pool Settings
  maxSize: 50
  minSize: 10
  maxWaitForConnection: 30s
  maxConnectionAge: 30m
  minIdleTime: 10m

  # Health Check Settings
  validationQuery: "SELECT 1"
  validationQueryTimeout: 3s
  checkConnectionOnBorrow: true
  checkConnectionOnReturn: true
```
Database Parameters
| Parameter | Recommended | Description |
|---|---|---|
| `maxSize` | 50 | Maximum number of database connections |
| `minSize` | 10 | Minimum number of connections to maintain |
| `maxWaitForConnection` | 30s | Maximum time to wait for a connection |
| `maxConnectionAge` | 30m | Maximum lifetime of a connection |
JWT Authentication
Configure JSON Web Token authentication settings:
JWT Configuration
```yaml
jwt:
  secretKey: "${JWT_SECRET_KEY:your-super-secret-jwt-key-change-this-in-production}"
  tokenExpirationMinutes: 60
  refreshTokenExpirationDays: 7
```
Security Requirements
- Secret Key: Must be at least 256 bits (32 characters) for HS256
- Production: Always use environment variables for the secret key
- Rotation: Consider implementing key rotation for enhanced security
Generating a Secure Secret Key
```bash
# Generate a secure random key
openssl rand -base64 32

# Or use Java
java -cp target/classes com.annihilator.data.playground.utility.KeyGenerator
```
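For illustration, the following Python sketch does the equivalent of the `openssl rand -base64 32` command above and checks the 256-bit minimum stated in the security requirements (the helper names here are illustrative, not part of the platform):

```python
import base64
import secrets

def generate_hs256_key() -> str:
    """Generate a base64-encoded 256-bit key,
    equivalent to `openssl rand -base64 32`."""
    return base64.b64encode(secrets.token_bytes(32)).decode("ascii")

def is_strong_enough(key_b64: str) -> bool:
    """HS256 requires at least 256 bits (32 bytes) of key material."""
    return len(base64.b64decode(key_b64)) >= 32

key = generate_hs256_key()
print(key, is_strong_enough(key))
```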
Concurrency Configuration
Configure thread pools and execution settings for optimal performance:
Concurrency Settings
```yaml
concurrency_config:
  # Thread Pool Sizes
  adhoc_threadpool_size: 200
  scheduled_threadpool_size: 200

  # Scheduler Settings
  scheduler_sleep_time: 300000               # 5 minutes
  playground_execution_grace_period: 300000  # 5 minutes
  playground_max_execution_frequency: 360000 # 6 minutes
```
Performance Guidelines
Thread Pool Sizing
- adhoc_threadpool_size: Based on expected concurrent ad-hoc runs
- scheduled_threadpool_size: Based on number of scheduled playgrounds
- Recommendation: Start with 200, monitor CPU and memory usage
Scheduler Timing
- scheduler_sleep_time: Priority queue scan interval (5 minutes default)
- grace_period: Time before considering a playground stuck
- max_frequency: Prevents playground execution overlap
- Auto-discovery: Picks up playground cron changes every 5 minutes
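The interaction between the scan interval and the maximum execution frequency can be sketched as follows. This is an illustrative model, not the platform's scheduler code; the `is_due` helper is hypothetical:

```python
# Values mirror the concurrency_config above (milliseconds)
SCHEDULER_SLEEP_TIME = 300_000     # scan interval: 5 minutes
MAX_EXECUTION_FREQUENCY = 360_000  # minimum gap between runs: 6 minutes

def is_due(last_run_ms: int, now_ms: int) -> bool:
    """A playground is eligible only once at least max_frequency has
    elapsed since its last run; this prevents overlapping executions."""
    return now_ms - last_run_ms >= MAX_EXECUTION_FREQUENCY

# With a 5-minute scan and a 6-minute minimum gap, a playground that
# just ran is skipped on the next scan and picked up on the one after.
print(is_due(last_run_ms=0, now_ms=300_000))  # False: only 5 minutes elapsed
print(is_due(last_run_ms=0, now_ms=600_000))  # True: 10 minutes elapsed
```

Because the 6-minute minimum gap exceeds the 5-minute scan interval, the same playground can never be dispatched on two consecutive scans.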
AWS Integration
Configure AWS EMR and S3 integration for cloud-based data processing:
AWS EMR Configuration
```yaml
connector:
  aws_emr:
    # Authentication
    access_key: ${AWS_ACCESS_KEY_ID:your-access-key}
    secret_key: ${AWS_SECRET_ACCESS_KEY:your-secret-key}

    # S3 Configuration
    s3_bucket: ${AWS_S3_BUCKET:your-s3-bucket}
    s3_path_prefix: ${AWS_S3_PATH_PREFIX:data-phantom}

    # Regional Settings
    region: ${AWS_REGION:us-east-1}

    # EMR Cluster Settings
    stack_name: ${AWS_STACK_NAME:DataPhantomClusterStack}
    cluster_logical_id: ${AWS_CLUSTER_LOGICAL_ID:DataPhantomCluster}

    # Performance Settings
    step_polling_interval: 30000         # 30 seconds
    stack_update_polling_interval: 30000 # 30 seconds
    stack_update_check_max_attempt: 60

    # S3 Settings
    s3_output_preview_line_count: 100
    s3_max_keys_per_request: 20
    max_step_retries: 3
```
AWS Optimization Tips
Cost Optimization
- Use spot instances for EMR clusters when possible
- Configure auto-termination for idle clusters
- Use S3 lifecycle policies for data archival
Performance Optimization
- step_polling_interval: Lower values = faster updates, more API calls
- s3_max_keys_per_request: Higher values = fewer API calls, more memory
- max_step_retries: Higher values = more resilience, longer recovery time
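The trade-off between `step_polling_interval` and `max_step_retries` can be sketched as a generic polling loop. This is an assumption-laden illustration, not the platform's EMR client: it assumes retries apply to transient status-check failures, and the function and status names are hypothetical:

```python
import time
from typing import Callable

def poll_step(get_status: Callable[[], str],
              interval_ms: int = 30_000,
              max_retries: int = 3,
              sleep=time.sleep) -> str:
    """Poll a step until it reaches a terminal state, retrying
    transient status-check failures up to max_retries times."""
    failures = 0
    while True:
        try:
            status = get_status()
        except Exception:
            failures += 1
            if failures > max_retries:
                raise  # give up after exhausting retries
            sleep(interval_ms / 1000)
            continue
        if status in ("COMPLETED", "FAILED", "CANCELLED"):
            return status
        sleep(interval_ms / 1000)

# Simulated status sequence for illustration (sleep stubbed out)
_statuses = iter(["PENDING", "RUNNING", "COMPLETED"])
print(poll_step(lambda: next(_statuses), sleep=lambda s: None))
```

A shorter interval surfaces state changes sooner but multiplies API calls; more retries ride out transient errors at the cost of slower failure detection.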
Data Reconciliation
Configure settings for data comparison and validation:
Reconciliation Settings
```yaml
reconciliation_settings:
  exact_match_threshold: 1048576 # 1 MB - files smaller than this use exact matching
  false_positive_rate: 0.1       # 10% false positive rate for the Bloom filter
  estimated_rows: 1000000        # Estimated rows for Bloom filter sizing
```
Algorithm Selection
Exact Matching
Used for files smaller than exact_match_threshold
- Pros: 100% accurate, no false positives
- Cons: Memory intensive for large files
- Best for: Small to medium datasets
Bloom Filter
Used for files larger than exact_match_threshold
- Pros: Memory efficient, fast processing
- Cons: Configurable false positive rate
- Best for: Large datasets (millions of records)
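To see what `false_positive_rate` and `estimated_rows` imply in practice, the standard Bloom filter sizing formulas can be applied to the defaults above. This is a back-of-the-envelope sketch using the textbook formulas, not the platform's internal sizing code:

```python
import math

def bloom_sizing(estimated_rows: int, false_positive_rate: float):
    """Standard Bloom filter sizing: m bits and k hash functions
    for n expected items at false-positive rate p."""
    n, p = estimated_rows, false_positive_rate
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # total bits
    k = max(1, round((m / n) * math.log(2)))              # hash functions
    return m, k

# Defaults from reconciliation_settings above: 1M rows at 10% FPR
bits, hashes = bloom_sizing(estimated_rows=1_000_000, false_positive_rate=0.1)
print(bits // 8 // 1024, "KiB,", hashes, "hash functions")
```

With the defaults, the filter needs roughly 0.6 MB of memory regardless of how wide each row is, which is why the Bloom path scales to datasets far beyond `exact_match_threshold`.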
Configuration Management
All configuration values and secrets should be managed in `config-dev.yml`:
Database Configuration
```yaml
# Add to src/main/resources/config-dev.yml
# ============================================
# Database Configuration
# ============================================
meta_store:
  driverClass: org.mariadb.jdbc.Driver
  url: jdbc:mariadb://localhost:3306/data_phantom
  user: root
  password: your_secure_password
  maxSize: 50
  minSize: 10
  maxWaitForConnection: 30s
  maxConnectionAge: 30m
  minIdleTime: 10m
  validationQuery: "SELECT 1"
  validationQueryTimeout: 3s
```
AWS Configuration
```yaml
# Add to src/main/resources/config-dev.yml
# ============================================
# AWS EMR Configuration
# ============================================
connector:
  aws_emr:
    access_key: your-access-key
    secret_key: your-secret-key
    s3_bucket: your-s3-bucket
    s3_path_prefix: data-phantom
    region: us-east-1
    stack_name: DataPhantomClusterStack
    cluster_logical_id: DataPhantomCluster
    step_polling_interval: 30000
    max_step_retries: 3
```
Security Configuration
```yaml
# Add to src/main/resources/config-dev.yml
# ============================================
# JWT Security Configuration
# ============================================
jwt:
  secretKey: "your-super-secret-jwt-key-change-this-in-production"
  tokenExpirationMinutes: 60
  refreshTokenExpirationDays: 7

# ============================================
# SQL Connector Configuration
# ============================================
connector:
  mysql:
    output_directory: /tmp/sql-output
```
Best Practices for Configuration
- Never commit secrets: add `config-dev.yml` to `.gitignore` if it contains real credentials
- Use different configs per environment: create `config-dev.yml`, `config-staging.yml`, and `config-prod.yml`
- Keep a template: maintain `config-template.yml` with placeholder values in version control
- Environment-specific values: use different credential sets for development, staging, and production
```bash
# Example: Create a template file for version control
cp config-dev.yml config-template.yml

# Replace real values with placeholders in the template
sed -i 's/password: .*/password: YOUR_PASSWORD_HERE/g' config-template.yml
sed -i 's/access_key: .*/access_key: YOUR_AWS_ACCESS_KEY/g' config-template.yml
sed -i 's/secret_key: .*/secret_key: YOUR_AWS_SECRET_KEY/g' config-template.yml

# Add the template to git, ignore the actual config
git add config-template.yml
echo "config-dev.yml" >> .gitignore
```
Performance Tuning
Optimize Data Phantom for your specific workload and infrastructure:
Server Performance
```bash
# JVM Tuning (add to startup script)
export JAVA_OPTS="-Xmx4g -Xms2g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"

# For production workloads
export JAVA_OPTS="-Xmx8g -Xms4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI"
```
Database Performance
```yaml
# High-performance database configuration
meta_store:
  maxSize: 100              # More connections for high concurrency
  minSize: 20               # Keep more connections warm
  maxWaitForConnection: 10s # Fail fast under load
  maxConnectionAge: 15m     # Shorter connection lifetime
  minIdleTime: 5m           # More aggressive idle cleanup
```
Concurrency Tuning
```yaml
# High-throughput configuration
concurrency_config:
  adhoc_threadpool_size: 500     # More concurrent executions
  scheduled_threadpool_size: 200 # Adequate for scheduled tasks
  scheduler_sleep_time: 60000    # More frequent checks (1 minute)
```
Performance Monitoring
Monitor these key metrics to optimize performance:
- CPU Usage: Should typically be below 80%
- Memory Usage: Monitor heap usage and GC frequency
- Database Connections: Monitor pool utilization
- Thread Pool Usage: Monitor queue sizes and active threads
- API Response Times: Track endpoint performance
Security Best Practices
Secure your Data Phantom deployment:
Authentication & Authorization
- Use strong, randomly generated JWT secret keys
- Implement short token expiration times (15-60 minutes)
- Use refresh tokens for long-lived sessions
- Consider implementing role-based access control (RBAC)
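To make the short-expiration recommendation concrete, here is a minimal HS256 sign/verify sketch using only the standard library. It is illustrative only — in production, use a vetted JWT library rather than hand-rolling token handling:

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def sign_token(payload: dict, secret: str, expires_in_minutes: int = 60) -> str:
    """Sign an HS256 JWT with an exp claim (illustrative sketch)."""
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {**payload, "exp": int(time.time()) + expires_in_minutes * 60}
    signing_input = (_b64url(json.dumps(header).encode()) + "." +
                     _b64url(json.dumps(payload).encode()))
    sig = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + _b64url(sig)

def verify_token(token: str, secret: str) -> bool:
    """Check the HMAC signature and reject expired tokens."""
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(secret.encode(), signing_input.encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(_b64url(expected), sig):
        return False
    payload_b64 = signing_input.split(".")[1]
    pad = "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64 + pad))
    return payload["exp"] > time.time()
```

A token signed with a weak or leaked key verifies just like a legitimate one, which is why the 256-bit key requirement and environment-variable storage above matter as much as the expiration window.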
Network Security
- Use HTTPS in production with valid SSL certificates
- Implement firewall rules to restrict access
- Use VPC and security groups in AWS
- Consider using a reverse proxy (nginx, Apache)
Database Security
- Use strong database passwords
- Enable SSL/TLS for database connections
- Create dedicated database users with minimal permissions
- Regular database backups and encryption at rest
AWS Security
- Use IAM roles instead of access keys when possible
- Follow principle of least privilege for AWS permissions
- Enable S3 bucket encryption
- Use VPC endpoints for S3 access