Getting Started with Data Phantom
This guide will help you set up and run Data Phantom Platform on your local machine or server environment.
Overview
Data Phantom is a comprehensive data processing platform that enables you to create, schedule, and manage SQL workflows across multiple engines. This guide covers the complete setup process from installation to creating your first data pipeline.
Difficulty: Beginner to Intermediate
System Architecture
Data Phantom Platform operates through three core execution flows, each designed for specific use cases and fault tolerance.
1. Scheduled Flow
Automated execution based on cron expressions with intelligent scheduling:
- Priority Queue: Manages all playgrounds with non-empty cron expressions
- Auto-Discovery: Scans the meta store for playground updates every 5 minutes (interval is configurable)
- Conflict Prevention: Uses configurable grace periods to prevent overlapping executions
- Load Balancing: Distributes scheduled tasks across available resources
2. Adhoc Flow
On-demand execution for immediate data processing needs:
- Manual Triggers: Execute entire playgrounds instantly via API or dashboard
- Limited Runs: Execute only selected tasks within a playground
- Real-time Monitoring: Track execution progress and status updates
- Immediate Feedback: Get instant results without waiting for schedules
3. Recovery Flow
Fault-tolerant execution with intelligent checkpoint recovery:
- State Persistence: Continuously saves execution state to the meta store
- Smart Resume: Automatically resumes from the last successful checkpoint after a system restart
- Task Skipping: Avoids re-running tasks that were already completed or failed
- Data Integrity: Ensures no data loss during system failures
Data Reconciliation Engine
An automated data validation system that compares task outputs, switching comparison method based on file size:
Exact Match (< 1MB)
For smaller files, uses precise byte-by-byte comparison:
- 100% accuracy for small datasets
- Memory efficient for files under 1MB (configurable)
- Ideal for configuration files and small reports
Bloom Filter (> 1MB)
For larger files, uses probabilistic matching for performance:
- Memory efficient for files over 1MB
- Configurable false positive rate
- Ideal for large datasets with millions of records
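The size-based choice described above can be sketched in a few lines of shell. This is only an illustration: pick_strategy is a hypothetical helper, the threshold mirrors the 1MB default, and sending a file of exactly 1MB to the bloom filter path is an assumption, since the guide only defines the cases below and above the threshold.

```shell
#!/bin/sh
# Default threshold in bytes (1 MB); the guide says this is configurable.
EXACT_MATCH_THRESHOLD=1048576

# pick_strategy SIZE_IN_BYTES -> prints "exact" or "bloom".
# Hypothetical helper, not part of Data Phantom. Treating a file of
# exactly 1 MB as "bloom" is an assumption.
pick_strategy() {
    if [ "$1" -lt "$EXACT_MATCH_THRESHOLD" ]; then
        echo "exact"
    else
        echo "bloom"
    fi
}

pick_strategy 2048       # a 2 KB config file
pick_strategy 52428800   # a 50 MB export
```

The two calls print exact and bloom respectively.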
Prerequisites
Before you begin, ensure you have the following installed on your system:
Java 11+
OpenJDK or Oracle JDK 11 or higher
java -version
Maven 3.6+
For building the application
mvn -version
MariaDB/MySQL
Database server for metadata storage
mysql --version
AWS Account (Optional)
For EMR and S3 integration
aws --version
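To check all of the prerequisites in one pass, a small script along these lines may help. It is a sketch, not part of Data Phantom: check_tool is a hypothetical helper, and it only confirms that each command is on your PATH, so you still need the version commands above to confirm Java 11+ and Maven 3.6+.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper, not part of Data Phantom.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
        return 0
    fi
    echo "missing: $1"
    return 1
}

missing=0
# Required tools from the list above.
for tool in java mvn mysql; do
    check_tool "$tool" || missing=$((missing + 1))
done
# The AWS CLI is only needed for EMR and S3 integration.
check_tool aws || echo "note: aws is optional"

echo "required tools missing: $missing"
```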
Installation
Clone the Repository
First, clone the Data Phantom repository to your local machine:
git clone https://github.com/arcticOak2/annihilator-data-playground.git
cd annihilator-data-playground
Install MariaDB
Install MariaDB based on your operating system:
macOS (Homebrew):
# Install MariaDB using Homebrew
brew install mariadb
# Start MariaDB service
brew services start mariadb
# Secure installation (optional but recommended)
mysql_secure_installation
Ubuntu/Debian:
# Update package index
sudo apt update
# Install MariaDB
sudo apt install mariadb-server
# Start MariaDB service
sudo systemctl start mariadb
sudo systemctl enable mariadb
# Secure installation
sudo mysql_secure_installation
CentOS/RHEL:
# Install MariaDB
sudo yum install mariadb-server
# Start MariaDB service
sudo systemctl start mariadb
sudo systemctl enable mariadb
# Secure installation
sudo mysql_secure_installation
Windows:
# Download MariaDB from: https://mariadb.org/download/
# Run the installer and follow the setup wizard
# Or use Chocolatey:
choco install mariadb
# Start MariaDB service
net start mysql
Setup Database
Create the database and initialize the schema:
# Connect to MariaDB
mysql -u root -p
# Create database
CREATE DATABASE data_phantom;
EXIT;
# Run DDL script to create tables
mysql -u root -p data_phantom < src/main/resources/database.ddl
Configure Application Settings
Add your configuration values and secrets to config-dev.yml:
# Edit src/main/resources/config-dev.yml
# Update the following sections with your values:
# ============================================
# Database Configuration
# ============================================
meta_store:
  url: jdbc:mariadb://localhost:3306/data_phantom
  user: root
  password: your_root_password
# ============================================
# JWT Configuration
# ============================================
jwt:
  secretKey: "your-super-secret-jwt-key-change-this-in-production"
  tokenExpirationMinutes: 60
  refreshTokenExpirationDays: 7
# ============================================
# AWS Configuration (Optional - for EMR integration)
# ============================================
connector:
  aws_emr:
    access_key: your_access_key
    secret_key: your_secret_key
    s3_bucket: your_s3_bucket
    s3_path_prefix: data-phantom
    region: us-east-1
    stack_name: DataPhantomClusterStack
    cluster_logical_id: DataPhantomCluster
Never commit config-dev.yml with real credentials to version control. Use environment-specific config files or a secrets management system for production.
Build the Application
Build the application using Maven:
# Clean and build the project
mvn clean install
# This will:
# - Compile the source code
# - Run unit tests
# - Package the application into a JAR file
# - Create target/annihilator-data-phantom-1.0-SNAPSHOT.jar
If the build is successful, you should see output similar to:
[INFO] BUILD SUCCESS
[INFO] Total time: 45.123 s
[INFO] Finished at: 2024-01-15T10:30:45Z
Start the Application
Run the Data Phantom server:
java -jar target/annihilator-data-phantom-1.0-SNAPSHOT.jar server src/main/resources/config-dev.yml
The application will start and you should see output indicating the server is running:
INFO [2024-01-15 10:35:00,123] io.dropwizard.server.ServerFactory: Starting DataPhantomApplication
INFO [2024-01-15 10:35:01,456] org.eclipse.jetty.server.Server: Started @2345ms
Configuration
Data Phantom uses a YAML configuration file located at src/main/resources/config-dev.yml. Here are the key configuration sections:
Server Configuration
server:
  applicationConnectors:
    - type: http
      port: 9092
  adminConnectors:
    - type: http
      port: 9093
Database Configuration
meta_store:
  driverClass: org.mariadb.jdbc.Driver
  url: ${MYSQL_URL:jdbc:mariadb://localhost:3306/data_phantom}
  user: ${MYSQL_USER:root}
  password: ${MYSQL_PASSWORD:your-password}
  maxSize: 50
JWT Authentication
jwt:
  secretKey: "${JWT_SECRET_KEY:your-secret-key}"
  tokenExpirationMinutes: 60
  refreshTokenExpirationDays: 7
For detailed configuration options, see the Configuration Guide.
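The ${VAR:default} placeholders above suggest that values can be supplied through environment variables instead of being written into the file. A minimal sketch, assuming your copy of config-dev.yml uses those placeholders and that the application resolves them at startup; the values shown are placeholders, and the secret generation step is just one possible approach:

```shell
#!/bin/sh
# Placeholder values: replace them, or load them from a secrets manager.
export MYSQL_URL="jdbc:mariadb://localhost:3306/data_phantom"
export MYSQL_USER="root"
export MYSQL_PASSWORD="change-me"

# Generate a random JWT secret: 48 random bytes, hex-encoded (96 characters).
JWT_SECRET_KEY="$(head -c 48 /dev/urandom | od -An -tx1 | tr -d ' \n')"
export JWT_SECRET_KEY

# Then start the server from the same shell, as in "Start the Application":
# java -jar target/annihilator-data-phantom-1.0-SNAPSHOT.jar server src/main/resources/config-dev.yml
```

Generate the JWT secret once and store it somewhere safe rather than on every start. A new secret would most likely invalidate tokens that were issued under the old one.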
Complete Development Setup
Set up the complete Data Phantom Platform, with both backend and frontend, for a full development experience:
- Backend Setup: Java-based API server with Dropwizard
- Frontend Setup: React-based dashboard interface
- Database Setup: MariaDB with DDL schema
Clone Required Repositories
Clone both the backend and frontend repositories:
# Clone backend repository
git clone https://github.com/arcticOak2/annihilator-data-playground.git
cd annihilator-data-playground
# Clone frontend repository (in separate terminal/directory)
git clone https://github.com/arcticOak2/data-phantom-dashboard.git
cd data-phantom-dashboard
Setup and Initialize Database
Install MariaDB and create the schema using the DDL file:
# Install MariaDB (macOS)
brew install mariadb
brew services start mariadb
# Create database
mysql -u root -p
CREATE DATABASE data_phantom;
EXIT;
# Run DDL file to create all meta tables
cd annihilator-data-playground
mysql -u root -p data_phantom < src/main/resources/database.ddl
The DDL file is located at src/main/resources/database.ddl in the backend repository.
Build and Start Backend
Build the backend JAR and start the server:
# Navigate to backend directory
cd annihilator-data-playground
# Update configuration file with your database credentials
# Edit src/main/resources/config-dev.yml and add your secrets:
# - Database password
# - JWT secret key
# - AWS credentials (if using EMR)
# Build the project
mvn clean install
# Start the backend server
java -jar target/annihilator-data-phantom-1.0-SNAPSHOT.jar server src/main/resources/config-dev.yml
Update src/main/resources/config-dev.yml with your database password, JWT secret key, and other credentials before starting the server.
Configuration Parameters Explained
Here are the key configuration sections you need to customize:
Server Configuration
server:
  applicationConnectors:
    - type: http
      port: 9092 # Main API server port
  adminConnectors:
    - type: http
      port: 9093 # Admin/health check port
Purpose: Defines which ports the application listens on for API requests and admin operations.
Database Configuration
meta_store:
  driverClass: org.mariadb.jdbc.Driver
  url: jdbc:mariadb://localhost:3306/data_phantom
  user: root
  password: your_root_password # UPDATE THIS
  maxSize: 50 # Max database connections
  minSize: 10 # Min database connections
  validationQuery: "SELECT 1" # Health check query
Purpose: Connects to MariaDB/MySQL database where all playground metadata, task definitions, and execution history are stored.
JWT Authentication
jwt:
  secretKey: "your-super-secret-jwt-key" # UPDATE THIS
  tokenExpirationMinutes: 60 # Access token lifetime
  refreshTokenExpirationDays: 7 # Refresh token lifetime
Purpose: Secures API endpoints with JWT tokens. Users must authenticate to access the platform.
Concurrency & Performance
concurrency_config:
  adhoc_threadpool_size: 200 # Concurrent adhoc executions
  scheduled_threadpool_size: 200 # Concurrent scheduled executions
  scheduler_sleep_time: 300000 # 5 minutes - priority queue scan interval
  playground_execution_grace_period: 300000 # 5 minutes - execution timeout
  playground_max_execution_frequency: 360000 # 6 minutes - min time between runs
Purpose: Controls how many playgrounds can run simultaneously and how the scheduler operates.
Reconciliation Settings
reconciliation_settings:
  exact_match_threshold: 1048576 # 1MB - files smaller use exact matching
  false_positive_rate: 0.1 # 10% - bloom filter error rate
  estimated_rows: 1000000 # Expected rows for bloom filter sizing
Purpose: Determines when to use exact matching vs bloom filter for data reconciliation based on file size.
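To get a feel for how false_positive_rate and estimated_rows interact, the sketch below applies the textbook bloom filter sizing formulas. This is an illustration only: bloom_size is a hypothetical helper, and Data Phantom's own sizing logic may differ.

```shell
#!/bin/sh
# bloom_size ROWS FALSE_POSITIVE_RATE
# Prints "<bits> <hash_count>" using the textbook formulas
#   bits   = ceil(-n * ln(p) / (ln 2)^2)
#   hashes = round((bits / n) * ln 2), at least 1
# Hypothetical helper; Data Phantom's own sizing may differ.
bloom_size() {
    awk -v n="$1" -v p="$2" 'BEGIN {
        ln2 = log(2)
        m = -n * log(p) / (ln2 * ln2)
        bits = int(m)
        if (bits < m) bits++             # round up to a whole bit
        k = int((bits / n) * ln2 + 0.5)  # nearest whole hash count
        if (k < 1) k = 1
        printf "%d %d\n", bits, k
    }'
}

# Defaults from reconciliation_settings: 1,000,000 rows at a 10% rate.
bloom_size 1000000 0.1
# A stricter 1% rate for comparison.
bloom_size 1000000 0.01
```

Under these formulas the defaults come to roughly 4.8 million bits (about 0.6 MB) with 3 hash functions. Tightening the rate to 1% roughly doubles the filter, to about 9.6 million bits with 7 hash functions.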
AWS EMR Configuration (Optional)
connector:
  aws_emr:
    access_key: your-access-key # UPDATE THIS
    secret_key: your-secret-key # UPDATE THIS
    s3_bucket: your-s3-bucket # UPDATE THIS
    s3_path_prefix: data-phantom # S3 folder prefix
    region: us-east-1 # AWS region
    stack_name: DataPhantomClusterStack # CloudFormation stack
    cluster_logical_id: DataPhantomCluster
    step_polling_interval: 30000 # 30 seconds - EMR status check
    max_step_retries: 3 # Retry failed EMR steps
Purpose: Enables execution of Hive, Presto, and Spark SQL tasks on AWS EMR clusters. All task outputs are stored in S3.
Your CloudFormation template must include a "random" parameter. Data Phantom will automatically update this parameter with a random integer value to spin up EMR clusters dynamically.
CloudFormation Setup Requirements:
- Random Parameter: Your CloudFormation template must include a parameter named "random"
- EMR Lifecycle: The CloudFormation script should handle EMR cluster lifecycle management (creation, scaling, termination)
- Stack Updates: Data Phantom updates the stack by changing the random parameter value
- Cluster Management: Auto-scaling, spot instances, and termination policies should be defined in the CloudFormation template
Parameters:
  random:
    Type: Number
    Description: Random value to trigger cluster updates
    Default: 12345
MySQL Connector (Optional)
connector:
  mysql:
    driverClass: org.mariadb.jdbc.Driver
    url: jdbc:mariadb://localhost:3306/your_data_db
    user: your_username # UPDATE THIS
    password: your_password # UPDATE THIS
    outputDirectory: /tmp/sql-output # Local output directory
Purpose: Enables MySQL tasks to connect to external MySQL/MariaDB databases for data extraction and processing.
The backend will start on:
- API Server: http://localhost:9092
- Admin Interface: http://localhost:9093
Setup and Start Frontend
Install dependencies and start the React dashboard:
# Navigate to frontend directory (in new terminal)
cd data-phantom-dashboard
# Install dependencies
npm install
# Start the development server
npm start
The frontend dashboard will be available at:
- React Dashboard: http://localhost:3000
Verify Complete Setup
Test that all components are working together:
Backend Health Check
Verify API is running:
curl http://localhost:9093/health
Database Connection
Test database connectivity:
curl http://localhost:9092/data-phantom/ping
Frontend Dashboard
Access the React dashboard at http://localhost:3000 and verify it can communicate with the backend API.
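The backend can take a few seconds to start, so the checks above may fail if you run them straight away. One way to handle that is a small polling loop around curl. This is a sketch: wait_for_url and check_all are hypothetical helpers, and it assumes each of the three URLs answers with a success status once its component is ready.

```shell
#!/bin/sh
# wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL with curl until it answers with a success status.
# Hypothetical helper, not part of Data Phantom.
wait_for_url() {
    url=$1
    attempts=${2:-10}
    delay=${3:-3}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -f: treat HTTP errors as failure, -s: quiet, --max-time: cap each try
        if curl -fs --max-time 5 -o /dev/null "$url"; then
            echo "up:   $url"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "down: $url"
    return 1
}

# Check the admin health endpoint, the API ping, and the dashboard in order.
check_all() {
    wait_for_url http://localhost:9093/health &&
    wait_for_url http://localhost:9092/data-phantom/ping &&
    wait_for_url http://localhost:3000
}
```

After starting both servers, run check_all. It stops at the first component that does not come up within the allotted attempts.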
Creating Your First Playground
With both backend and frontend running, you can now create your first data processing playground:
Access the Application
Open your web browser and navigate to:
http://localhost:9092
You should see the Data Phantom API endpoints. For the web interface, clone and run the React dashboard.
Register a User
Create a user account using the API:
curl -X POST http://localhost:9092/auth/register \
  -H "Content-Type: application/json" \
  -d '{
    "username": "your_username",
    "email": "your_email@example.com",
    "password": "your_secure_password"
  }'
Create a Playground
Create your first playground for organizing your data processing tasks:
curl -X POST http://localhost:9092/data-phantom/playground \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -d '{
    "name": "My First Playground",
    "userId": "your_user_id",
    "cronExpression": "0 9 * * MON-FRI"
  }'
Verification
Verify that your Data Phantom installation is working correctly:
Health Check
Test the health endpoint:
curl http://localhost:9093/health
Database Connection
Verify database connectivity:
curl http://localhost:9092/data-phantom/ping
API Endpoints
Test API accessibility:
curl http://localhost:9092/data-phantom/playground/test-user
Next Steps
Congratulations! You now have Data Phantom running. Here's what you can do next: