Overview

Data Phantom is a comprehensive data processing platform that enables you to create, schedule, and manage SQL workflows across multiple engines. This guide covers the complete setup process from installation to creating your first data pipeline.

Time to Complete: Approximately 15-30 minutes
Difficulty: Beginner to Intermediate

System Architecture

Data Phantom Platform operates through three core execution flows, each designed for specific use cases and fault tolerance.

1. Scheduled Flow

Automated execution based on cron expressions with intelligent scheduling:

  • Priority Queue: Manages all playgrounds with non-empty cron expressions
  • Auto-Discovery: Scans meta store every 5 minutes for playground updates (configurable)
  • Conflict Prevention: Uses configurable grace periods to prevent overlapping executions
  • Load Balancing: Distributes scheduled tasks across available resources
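
The priority-queue behaviour described above can be pictured with a short Python sketch. It is not Data Phantom's implementation: the `ScheduledPlayground` class, the `due_playgrounds` helper, and the fixed `interval_s` (standing in for a parsed cron expression) are hypothetical names chosen for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledPlayground:
    next_run: float                           # epoch seconds of the next planned run
    name: str = field(compare=False)
    interval_s: float = field(compare=False)  # assumption: stand-in for a cron expression


def due_playgrounds(queue, now):
    """Pop every playground whose next_run is <= now and reschedule it."""
    due = []
    while queue and queue[0].next_run <= now:
        item = heapq.heappop(queue)
        due.append(item.name)
        # Re-insert with the following run time so the heap stays ordered.
        heapq.heappush(
            queue,
            ScheduledPlayground(item.next_run + item.interval_s, item.name, item.interval_s),
        )
    return due


queue = []
heapq.heappush(queue, ScheduledPlayground(100.0, "daily_report", 86400.0))
heapq.heappush(queue, ScheduledPlayground(50.0, "hourly_sync", 3600.0))
print(due_playgrounds(queue, now=60.0))  # only hourly_sync is due at t=60
```

A heap keeps the playground with the earliest next run at the front, so each scheduler pass only has to look at the head of the queue.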

2. Adhoc Flow

On-demand execution for immediate data processing needs:

  • Manual Triggers: Execute entire playgrounds instantly via API or dashboard
  • Limited Runs: Execute only selected tasks within a playground
  • Real-time Monitoring: Track execution progress and status updates
  • Immediate Feedback: Get instant results without waiting for schedules

3. Recovery Flow

Fault-tolerant execution with intelligent checkpoint recovery:

  • State Persistence: Continuously saves execution state to meta store
  • Smart Resume: Automatically resumes from last successful checkpoint after system restart
  • Task Skipping: Avoids re-running tasks that were already completed or failed
  • Data Integrity: Ensures no data loss during system failures
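
As a rough illustration of the resume-and-skip behaviour listed above, the sketch below filters a task list against saved state. The `TaskState` values and the `tasks_to_resume` helper are hypothetical; the guide does not describe how Data Phantom actually stores or names task states.

```python
from enum import Enum


class TaskState(Enum):
    # Assumed state names, for illustration only.
    PENDING = "PENDING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


def tasks_to_resume(ordered_tasks, saved_states):
    """Return the tasks that still need to run after a restart.

    Tasks recorded as SUCCESS or FAILED are skipped; anything else,
    including tasks with no saved state, is treated as not yet run.
    """
    finished = {TaskState.SUCCESS, TaskState.FAILED}
    return [
        task
        for task in ordered_tasks
        if saved_states.get(task, TaskState.PENDING) not in finished
    ]


saved = {"extract": TaskState.SUCCESS, "transform": TaskState.FAILED}
print(tasks_to_resume(["extract", "transform", "load", "report"], saved))
```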

Data Reconciliation Engine

Automated data validation system that compares task outputs, choosing the comparison method by file size:

Exact Match (< 1MB)

For smaller files, uses precise byte-by-byte comparison:

  • 100% accuracy for small datasets
  • Memory efficient for files under 1MB (configurable)
  • Ideal for configuration files and small reports
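
A byte-by-byte comparison of two local files can be sketched in a few lines of Python. This is only an illustration of the technique: the `files_identical` helper is hypothetical, and it reads local files, whereas Data Phantom reads task outputs from S3.

```python
import os
import tempfile


def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Compare two files byte for byte, reading fixed-size chunks."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True


with tempfile.TemporaryDirectory() as tmp:
    left = os.path.join(tmp, "left.csv")
    right = os.path.join(tmp, "right.csv")
    for path in (left, right):
        with open(path, "wb") as handle:
            handle.write(b"id,name\n1,alice\n")
    print(files_identical(left, right))
```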

Bloom Filter (> 1MB)

For larger files, uses probabilistic matching for performance:

  • Memory efficient for files over 1MB
  • Configurable false positive rate
  • Ideal for large datasets with millions of records

S3 Integration: All task outputs are written to S3, and reconciliation reads directly from S3 for scalable comparison across distributed data.
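
To make the probabilistic matching concrete, here is a minimal Bloom filter in Python. It is a sketch under stated assumptions, not the engine's code: the `BloomFilter` class, the SHA-256 double-hashing scheme, and the row-as-string format are all illustrative choices.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive the bit positions from two halves of one SHA-256 digest
        # (double hashing); an assumption made for this sketch.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


expected = BloomFilter(num_bits=8192, num_hashes=3)
for row in ["1,alice,30", "2,bob,25"]:
    expected.add(row)

actual_rows = ["1,alice,30", "2,bob,26"]
# Rows the filter reports as absent are certainly not in the expected output.
print([row for row in actual_rows if not expected.might_contain(row)])
```

A Bloom filter never produces false negatives, so any row it reports as absent is a real mismatch; rows it reports as present may occasionally be false positives, at the configured rate.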

Prerequisites

Before you begin, ensure you have the following installed on your system:

Java 11+

OpenJDK or Oracle JDK 11 or higher

java -version

Maven 3.6+

For building the application

mvn -version

MariaDB/MySQL

Database server for metadata storage

mysql --version

AWS Account (Optional)

For EMR and S3 integration

aws --version
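
If you prefer a single check over running each command by hand, a small Python script can report which tools are missing from your PATH. The `missing_tools` helper and the `REQUIRED`/`OPTIONAL` lists are hypothetical names, and the script only checks that each executable exists, not its version.

```python
import shutil

REQUIRED = ["java", "mvn", "mysql"]
OPTIONAL = ["aws"]


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


print("Missing required tools:", missing_tools(REQUIRED))
print("Missing optional tools:", missing_tools(OPTIONAL))
```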

Installation

1. Clone the Repository

First, clone the Data Phantom repository to your local machine:

    git clone https://github.com/arcticOak2/annihilator-data-playground.git
    cd annihilator-data-playground

2. Install MariaDB

Install MariaDB based on your operating system:

macOS (Homebrew):

    # Install MariaDB using Homebrew
    brew install mariadb
    # Start MariaDB service
    brew services start mariadb
    # Secure installation (optional but recommended)
    mysql_secure_installation

Debian/Ubuntu (apt):

    # Update package index
    sudo apt update
    # Install MariaDB
    sudo apt install mariadb-server
    # Start MariaDB service
    sudo systemctl start mariadb
    sudo systemctl enable mariadb
    # Secure installation
    sudo mysql_secure_installation

RHEL/CentOS (yum):

    # Install MariaDB
    sudo yum install mariadb-server
    # Start MariaDB service
    sudo systemctl start mariadb
    sudo systemctl enable mariadb
    # Secure installation
    sudo mysql_secure_installation

Windows:

    # Download MariaDB from: https://mariadb.org/download/
    # Run the installer and follow the setup wizard
    # Or use Chocolatey:
    choco install mariadb
    # Start MariaDB service
    net start mysql

3. Setup Database

Create the database and initialize the schema:

    # Connect to MariaDB
    mysql -u root -p

    # Create database (run these statements at the MariaDB prompt)
    CREATE DATABASE data_phantom;
    EXIT;

    # Run DDL script to create tables (back in your shell)
    mysql -u root -p data_phantom < src/main/resources/database.ddl

Note: Make sure to set a strong password for your database root user during the secure installation process.

4. Configure Application Settings

Add all your configurations and secrets to config-dev.yml:

    # Edit src/main/resources/config-dev.yml
    # Update the following sections with your values:

    # ============================================
    # Database Configuration
    # ============================================
    meta_store:
      url: jdbc:mariadb://localhost:3306/data_phantom
      user: root
      password: your_root_password

    # ============================================
    # JWT Configuration
    # ============================================
    jwt:
      secretKey: "your-super-secret-jwt-key-change-this-in-production"
      tokenExpirationMinutes: 60
      refreshTokenExpirationDays: 7

    # ============================================
    # AWS Configuration (Optional - for EMR integration)
    # ============================================
    connector:
      aws_emr:
        access_key: your_access_key
        secret_key: your_secret_key
        s3_bucket: your_s3_bucket
        s3_path_prefix: data-phantom
        region: us-east-1
        stack_name: DataPhantomClusterStack
        cluster_logical_id: DataPhantomCluster

Security Note: Never commit config-dev.yml with real credentials to version control. Use environment-specific config files or a secrets management system for production.
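
One way to avoid a guessable JWT secret is to generate a random value, for example with Python's standard `secrets` module. The `generate_jwt_secret` helper below is a hypothetical convenience; this guide does not state whether Data Phantom requires a particular key length or format, so treat the 48-byte size as an assumption.

```python
import secrets


def generate_jwt_secret(num_bytes=48):
    """Return a URL-safe random string built from num_bytes of randomness."""
    return secrets.token_urlsafe(num_bytes)


# Paste the printed value into jwt.secretKey (or a secrets manager).
print(generate_jwt_secret())
```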

5. Build the Application

Build the application using Maven:

    # Clean and build the project
    mvn clean install

    # This will:
    # - Compile the source code
    # - Run unit tests
    # - Package the application into a JAR file
    # - Create target/annihilator-data-phantom-1.0-SNAPSHOT.jar

If the build is successful, you should see output similar to:

    [INFO] BUILD SUCCESS
    [INFO] Total time: 45.123 s
    [INFO] Finished at: 2024-01-15T10:30:45Z

6. Start the Application

Run the Data Phantom server:

    java -jar target/annihilator-data-phantom-1.0-SNAPSHOT.jar server src/main/resources/config-dev.yml

The application will start and you should see output indicating the server is running:

    INFO [2024-01-15 10:35:00,123] io.dropwizard.server.ServerFactory: Starting DataPhantomApplication
    INFO [2024-01-15 10:35:01,456] org.eclipse.jetty.server.Server: Started @2345ms

Configuration

Data Phantom uses a YAML configuration file located at src/main/resources/config-dev.yml. Here are the key configuration sections:

Server Configuration

    server:
      applicationConnectors:
        - type: http
          port: 9092
      adminConnectors:
        - type: http
          port: 9093

Database Configuration

    meta_store:
      driverClass: org.mariadb.jdbc.Driver
      url: ${MYSQL_URL:jdbc:mariadb://localhost:3306/data_phantom}
      user: ${MYSQL_USER:root}
      password: ${MYSQL_PASSWORD:your-password}
      maxSize: 50

JWT Authentication

    jwt:
      secretKey: "${JWT_SECRET_KEY:your-secret-key}"
      tokenExpirationMinutes: 60
      refreshTokenExpirationDays: 7

For detailed configuration options, see the Configuration Guide.
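
The `${NAME:default}` placeholders in the snippets above read as "use the environment variable if it is set, otherwise fall back to the default". The Python sketch below mimics that reading only; how the server itself parses these placeholders is not described in this guide, and the `resolve` helper is hypothetical.

```python
import re

# Matches ${NAME:default}; the default may itself contain colons.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):([^}]*)\}")


def resolve(value, env):
    """Replace ${NAME:default} with env[NAME] if present, else the default."""
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(2)), value)


template = "${MYSQL_URL:jdbc:mariadb://localhost:3306/data_phantom}"
print(resolve(template, {}))
print(resolve(template, {"MYSQL_URL": "jdbc:mariadb://db.internal:3306/data_phantom"}))
```

In practice this means you can leave config-dev.yml untouched and supply secrets such as MYSQL_PASSWORD and JWT_SECRET_KEY through the environment.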

Complete Development Setup

Set up the complete Data Phantom Platform, with both backend and frontend, for a full development experience:

  • Backend Setup: Java-based API server with Dropwizard
  • Frontend Setup: React-based dashboard interface
  • Database Setup: MariaDB with DDL schema

1. Clone Required Repositories

Clone both the backend and frontend repositories:

    # Clone backend repository
    git clone https://github.com/arcticOak2/annihilator-data-playground.git
    cd annihilator-data-playground

    # Clone frontend repository (in separate terminal/directory)
    git clone https://github.com/arcticOak2/data-phantom-dashboard.git
    cd data-phantom-dashboard

2. Setup and Initialize Database

Install MariaDB and create the schema using the DDL file:

    # Install MariaDB (macOS)
    brew install mariadb
    brew services start mariadb

    # Create database (run the SQL statements at the MariaDB prompt)
    mysql -u root -p
    CREATE DATABASE data_phantom;
    EXIT;

    # Run DDL file to create all meta tables
    cd annihilator-data-playground
    mysql -u root -p data_phantom < src/main/resources/database.ddl

DDL Location: The database schema file is located at src/main/resources/database.ddl in the backend repository.

3. Build and Start Backend

Build the backend JAR and start the server:

    # Navigate to backend directory
    cd annihilator-data-playground

    # Update configuration file with your database credentials
    # Edit src/main/resources/config-dev.yml and add your secrets:
    # - Database password
    # - JWT secret key
    # - AWS credentials (if using EMR)

    # Build the project
    mvn clean install

    # Start the backend server
    java -jar target/annihilator-data-phantom-1.0-SNAPSHOT.jar server src/main/resources/config-dev.yml

Configuration: Update src/main/resources/config-dev.yml with your database password, JWT secret key, and other credentials before starting the server.

Configuration Parameters Explained

Here are the key configuration sections you need to customize:

Server Configuration

    server:
      applicationConnectors:
        - type: http
          port: 9092   # Main API server port
      adminConnectors:
        - type: http
          port: 9093   # Admin/health check port

Purpose: Defines which ports the application listens on for API requests and admin operations.

Database Configuration

    meta_store:
      driverClass: org.mariadb.jdbc.Driver
      url: jdbc:mariadb://localhost:3306/data_phantom
      user: root
      password: your_root_password   # UPDATE THIS
      maxSize: 50                    # Max database connections
      minSize: 10                    # Min database connections
      validationQuery: "SELECT 1"    # Health check query

Purpose: Connects to MariaDB/MySQL database where all playground metadata, task definitions, and execution history are stored.

JWT Authentication

    jwt:
      secretKey: "your-super-secret-jwt-key"   # UPDATE THIS
      tokenExpirationMinutes: 60               # Access token lifetime
      refreshTokenExpirationDays: 7            # Refresh token lifetime

Purpose: Secures API endpoints with JWT tokens. Users must authenticate to access the platform.

Concurrency & Performance

    concurrency_config:
      adhoc_threadpool_size: 200                    # Concurrent adhoc executions
      scheduled_threadpool_size: 200                # Concurrent scheduled executions
      scheduler_sleep_time: 300000                  # 5 minutes - priority queue scan interval
      playground_execution_grace_period: 300000     # 5 minutes - execution timeout
      playground_max_execution_frequency: 360000    # 6 minutes - min time between runs

Purpose: Controls how many playgrounds can run simultaneously and how the scheduler operates.
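
All timing values in this section are in milliseconds. The sketch below converts them to minutes and shows one plausible reading of `playground_max_execution_frequency` as a minimum gap between run starts. That reading is taken from the "min time between runs" comment; the `can_start` helper is hypothetical and not the scheduler's actual logic.

```python
def ms_to_minutes(ms):
    """Convert a millisecond setting to minutes."""
    return ms / 60_000


def can_start(now_ms, last_start_ms, min_gap_ms=360_000):
    """Allow a new run only if min_gap_ms has passed since the last start."""
    return last_start_ms is None or now_ms - last_start_ms >= min_gap_ms


print(ms_to_minutes(300_000))                             # scheduler_sleep_time
print(ms_to_minutes(360_000))                             # playground_max_execution_frequency
print(can_start(now_ms=400_000, last_start_ms=100_000))   # only 5 minutes apart
print(can_start(now_ms=460_000, last_start_ms=100_000))   # exactly 6 minutes apart
```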

Reconciliation Settings

    reconciliation_settings:
      exact_match_threshold: 1048576   # 1MB - files smaller use exact matching
      false_positive_rate: 0.1         # 10% - bloom filter error rate
      estimated_rows: 1000000          # Expected rows for bloom filter sizing

Purpose: Determines when to use exact matching vs bloom filter for data reconciliation based on file size.
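
To get a feel for what `false_positive_rate` and `estimated_rows` imply, the standard Bloom filter sizing formulas can be evaluated directly: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. Whether Data Phantom sizes its filter exactly this way is an assumption; `bloom_parameters` is a hypothetical helper.

```python
import math


def bloom_parameters(estimated_rows, false_positive_rate):
    """Return (bits, hash_count) from the standard Bloom filter formulas."""
    bits = math.ceil(
        -estimated_rows * math.log(false_positive_rate) / (math.log(2) ** 2)
    )
    hashes = max(1, round(bits / estimated_rows * math.log(2)))
    return bits, hashes


bits, hashes = bloom_parameters(1_000_000, 0.1)
print(bits, "bits,", hashes, "hash functions,", round(bits / 8 / 1024), "KiB")
```

With the defaults shown above, this works out to roughly 4.8 million bits (under 600 KiB) and 3 hash functions. Lowering `false_positive_rate` increases both numbers.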

AWS EMR Configuration (Optional)

    connector:
      aws_emr:
        access_key: your-access-key           # UPDATE THIS
        secret_key: your-secret-key           # UPDATE THIS
        s3_bucket: your-s3-bucket             # UPDATE THIS
        s3_path_prefix: data-phantom          # S3 folder prefix
        region: us-east-1                     # AWS region
        stack_name: DataPhantomClusterStack   # CloudFormation stack
        cluster_logical_id: DataPhantomCluster
        step_polling_interval: 30000          # 30 seconds - EMR status check
        max_step_retries: 3                   # Retry failed EMR steps

Purpose: Enables execution of Hive, Presto, and Spark SQL tasks on AWS EMR clusters. All task outputs are stored in S3.

CloudFormation Requirement: The system expects a deployed CloudFormation stack whose template includes a "random" parameter. Data Phantom automatically updates this parameter with a random integer value to spin up EMR clusters dynamically.

CloudFormation Setup Requirements:

  • Random Parameter: Your CloudFormation template must include a parameter named "random"
  • EMR Lifecycle: The CloudFormation template should handle EMR cluster lifecycle management (creation, scaling, termination)
  • Stack Updates: Data Phantom updates the stack by changing the random parameter value
  • Cluster Management: Auto-scaling, spot instances, and termination policies should be defined in the CloudFormation template

Example CloudFormation Parameter:

    Parameters:
      random:
        Type: Number
        Description: Random value to trigger cluster updates
        Default: 12345
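
For orientation, a stack update that changes only the "random" parameter could be assembled as below. This is a sketch of the general CloudFormation pattern, not Data Phantom's own code: `build_stack_update` is a hypothetical helper, and it only builds the request arguments so it can run without AWS credentials.

```python
import random


def build_stack_update(stack_name, parameter_name="random"):
    """Build keyword arguments for a CloudFormation UpdateStack call that
    keeps the current template and changes a single parameter."""
    return {
        "StackName": stack_name,
        "UsePreviousTemplate": True,
        "Parameters": [
            {
                "ParameterKey": parameter_name,
                "ParameterValue": str(random.randint(1, 1_000_000)),
            }
        ],
    }


request = build_stack_update("DataPhantomClusterStack")
print(request)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("cloudformation", region_name="us-east-1").update_stack(**request)
```

If your template declares other parameters or creates IAM resources, a real update would likely also need `UsePreviousValue` entries for those parameters and a `Capabilities` argument.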

MySQL Connector (Optional)

    connector:
      mysql:
        driverClass: org.mariadb.jdbc.Driver
        url: jdbc:mariadb://localhost:3306/your_data_db
        user: your_username                 # UPDATE THIS
        password: your_password             # UPDATE THIS
        outputDirectory: /tmp/sql-output    # Local output directory

Purpose: Enables MySQL tasks to connect to external MySQL/MariaDB databases for data extraction and processing.

The backend will start on:

  • API Server: http://localhost:9092
  • Admin Interface: http://localhost:9093

4. Setup and Start Frontend

Install dependencies and start the React dashboard:

    # Navigate to frontend directory (in new terminal)
    cd data-phantom-dashboard

    # Install dependencies
    npm install

    # Start the development server
    npm start

The frontend dashboard will be available at:

  • React Dashboard: http://localhost:3000

Frontend Repository: The React dashboard is maintained separately at data-phantom-dashboard.

5. Verify Complete Setup

Test that all components are working together:

Backend Health Check

Verify API is running:

curl http://localhost:9093/health

Database Connection

Test database connectivity:

curl http://localhost:9092/data-phantom/ping

Frontend Dashboard

Access the React dashboard at http://localhost:3000 and verify it can communicate with the backend API.

Creating Your First Playground

With both backend and frontend running, you can now create your first data processing playground:

1. Access the Application

Open your web browser and navigate to:

http://localhost:9092

You should see the Data Phantom API endpoints. For the web interface, clone and run the React dashboard.

2. Register a User

Create a user account using the API:

    curl -X POST http://localhost:9092/auth/register \
      -H "Content-Type: application/json" \
      -d '{
        "username": "your_username",
        "email": "your_email@example.com",
        "password": "your_secure_password"
      }'

3. Create a Playground

Create your first playground for organizing your data processing tasks:

    curl -X POST http://localhost:9092/data-phantom/playground \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer YOUR_JWT_TOKEN" \
      -d '{
        "name": "My First Playground",
        "userId": "your_user_id",
        "cronExpression": "0 9 * * MON-FRI"
      }'
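
The same playground request can be built from Python with the standard library, which is handy when scripting several playgrounds. The endpoint, headers, and JSON fields are taken from the curl command above; the `build_create_playground_request` helper is a hypothetical name, and the snippet only constructs the request so it runs without a live server.

```python
import json
import urllib.request


def build_create_playground_request(base_url, token, name, user_id, cron_expression):
    """Build (but do not send) the POST request that creates a playground."""
    body = json.dumps(
        {"name": name, "userId": user_id, "cronExpression": cron_expression}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/data-phantom/playground",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


req = build_create_playground_request(
    "http://localhost:9092",
    "YOUR_JWT_TOKEN",
    "My First Playground",
    "your_user_id",
    "0 9 * * MON-FRI",  # 09:00 on weekdays
)
print(req.get_method(), req.full_url)
# To send it against a running server: urllib.request.urlopen(req)
```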

Verification

Verify that your Data Phantom installation is working correctly:

Health Check

Test the health endpoint:

curl http://localhost:9093/health

Database Connection

Verify database connectivity:

curl http://localhost:9092/data-phantom/ping

API Endpoints

Test API accessibility:

curl http://localhost:9092/data-phantom/playground/test-user
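
These checks can also be bundled into one script. The sketch below probes the health and ping URLs listed above and reports which ones respond with a 2xx status; `is_up` and `check_all` are hypothetical helpers, and the response bodies are not inspected.

```python
import urllib.error
import urllib.request

ENDPOINTS = {
    "health": "http://localhost:9093/health",
    "ping": "http://localhost:9092/data-phantom/ping",
}


def is_up(url, timeout=3.0):
    """Return True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error codes all count as down.
        return False


def check_all(endpoints, probe=is_up):
    """Map each endpoint name to the result of probing its URL."""
    return {name: probe(url) for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, ok in check_all(ENDPOINTS).items():
        print(f"{name}: {'OK' if ok else 'FAILED'}")
```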

Next Steps

Congratulations! You now have Data Phantom running. Here's what you can do next:

  • Quick Start Guide: Get started quickly with a step-by-step visual guide (see Quick Start)
  • Advanced Configuration: Customize Data Phantom for your specific needs (see Configuration)
  • Install Dashboard: Set up the React-based web interface (see Dashboard)