Getting Started - Data Phantom Platform

Overview

Data Phantom is a comprehensive data processing platform that enables you to create, schedule, and manage SQL workflows across multiple engines. This guide covers the complete setup process from installation to creating your first data pipeline.

Time to Complete: Approximately 15-30 minutes
Difficulty: Beginner to Intermediate

System Architecture

Data Phantom Platform operates through three core execution flows, each designed for specific use cases and fault tolerance.

1. Scheduled Flow

Automated execution based on cron expressions with intelligent scheduling:

Priority Queue: Manages all playgrounds with non-empty cron expressions
Auto-Discovery: Scans meta store every 5 minutes for playground updates (configurable)
Conflict Prevention: Uses configurable grace periods to prevent overlapping executions
Load Balancing: Distributes scheduled tasks across available resources
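
As a rough illustration of the priority-queue idea above, the sketch below keeps playgrounds in a min-heap ordered by next run time and skips a playground whose previous run started within the grace period. The class and field names are hypothetical, next run times are passed in directly rather than computed from cron expressions, and the grace-period check is only one plausible interpretation; this is not Data Phantom's actual scheduler code.

```python
import heapq
from dataclasses import dataclass, field

GRACE_PERIOD_MS = 300_000  # assumed to mirror playground_execution_grace_period (5 minutes)


@dataclass(order=True)
class ScheduledRun:
    next_run_ms: int                           # the heap orders entries by this field only
    playground_id: str = field(compare=False)


class PlaygroundQueue:
    """Min-heap of playgrounds keyed by their next scheduled run time."""

    def __init__(self):
        self._heap = []
        self._last_started_ms = {}             # playground_id -> last start time

    def add(self, playground_id, next_run_ms):
        heapq.heappush(self._heap, ScheduledRun(next_run_ms, playground_id))

    def pop_due(self, now_ms):
        """Return ids that are due at now_ms and not inside the grace period."""
        due = []
        while self._heap and self._heap[0].next_run_ms <= now_ms:
            run = heapq.heappop(self._heap)
            last = self._last_started_ms.get(run.playground_id)
            if last is not None and now_ms - last < GRACE_PERIOD_MS:
                continue                       # previous run started too recently; skip
            self._last_started_ms[run.playground_id] = now_ms
            due.append(run.playground_id)
        return due


queue = PlaygroundQueue()
queue.add("daily-report", next_run_ms=1_000)
queue.add("hourly-sync", next_run_ms=500)
queue.add("weekly-audit", next_run_ms=9_000_000)
print(queue.pop_due(now_ms=2_000))             # ['hourly-sync', 'daily-report']
```

A heap keeps the next playground to run at the front, so each scheduler pass only inspects entries that are actually due.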

2. Adhoc Flow

On-demand execution for immediate data processing needs:

Manual Triggers: Execute entire playgrounds instantly via API or dashboard
Limited Runs: Execute only selected tasks within a playground
Real-time Monitoring: Track execution progress and status updates
Immediate Feedback: Get instant results without waiting for schedules

3. Recovery Flow

Fault-tolerant execution with intelligent checkpoint recovery:

State Persistence: Continuously saves execution state to meta store
Smart Resume: Automatically resumes from last successful checkpoint after system restart
Task Skipping: Avoids re-running tasks that were already completed or failed
Data Integrity: Ensures no data loss during system failures
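
The resume behavior described above can be pictured with a small sketch: given the task statuses last saved to the meta store, only tasks that are neither completed nor failed are run again. The status names and the `tasks_to_resume` helper are hypothetical and only illustrate the idea; the real persistence format is not documented here.

```python
from enum import Enum


class TaskStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Statuses treated as final: these tasks are not re-run on recovery.
FINAL = {TaskStatus.COMPLETED, TaskStatus.FAILED}


def tasks_to_resume(ordered_task_ids, persisted_status):
    """Return the tasks that still need to run, preserving execution order.

    persisted_status maps task id -> TaskStatus as last saved to the meta store;
    tasks with no saved status are treated as PENDING.
    """
    return [
        task_id
        for task_id in ordered_task_ids
        if persisted_status.get(task_id, TaskStatus.PENDING) not in FINAL
    ]


saved = {
    "extract": TaskStatus.COMPLETED,
    "transform": TaskStatus.FAILED,
    "load": TaskStatus.RUNNING,        # interrupted by the restart
}
print(tasks_to_resume(["extract", "transform", "load", "report"], saved))
# ['load', 'report']
```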

Data Reconciliation Engine

An automated data validation system that compares task outputs, choosing the comparison algorithm based on file size:

Exact Match (< 1MB)

For smaller files, uses precise byte-by-byte comparison:

100% accuracy for small datasets
Memory efficient for files under 1MB (configurable)
Ideal for configuration files and small reports

Bloom Filter (>= 1MB)

For larger files, uses probabilistic matching for performance:

Memory efficient for files of 1MB or more
Configurable false positive rate
Ideal for large datasets with millions of records

S3 Integration: All task outputs are written to S3, and reconciliation reads directly from S3 for scalable comparison across distributed data.
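
To make the size-based choice above concrete, here is a minimal sketch that compares two outputs byte-for-byte below the threshold and falls back to a small Bloom filter otherwise. Everything in it is an assumption made for illustration: the row-per-line granularity, the SHA-256 based hashing, the fixed filter size, and the function names. It reads from in-memory bytes rather than S3, and it is not the engine's actual implementation.

```python
import hashlib

EXACT_MATCH_THRESHOLD = 1_048_576  # bytes; assumed to mirror exact_match_threshold (1MB)


class BloomFilter:
    """Tiny Bloom filter: num_hashes bit positions per item in a num_bits-wide bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive one position per hash index by salting SHA-256 with that index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def choose_strategy(size_bytes):
    """Pick the comparison strategy from the file size."""
    return "exact" if size_bytes < EXACT_MATCH_THRESHOLD else "bloom"


def reconcile(left: bytes, right: bytes) -> bool:
    """Return True when every row of right is (probably) present in left."""
    if choose_strategy(max(len(left), len(right))) == "exact":
        return left == right                       # byte-by-byte comparison
    bloom = BloomFilter(num_bits=4_792_530, num_hashes=3)   # illustrative sizing
    for row in left.splitlines():
        bloom.add(row)
    # Probabilistic: a row missing from left can slip through as a false positive.
    return all(bloom.might_contain(row) for row in right.splitlines())


print(choose_strategy(10_000), choose_strategy(5_000_000))   # exact bloom
```

A Bloom filter never reports a present row as missing, so identical outputs always reconcile; the configurable false positive rate bounds how often a genuinely missing row goes unnoticed.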

Prerequisites

Before you begin, ensure you have the following installed on your system:

Java 11+

OpenJDK or Oracle JDK 11 or higher

java -version

Maven 3.6+

For building the application

mvn -version

MariaDB/MySQL

Database server for metadata storage

mysql --version

AWS Account (Optional)

For EMR and S3 integration

aws --version
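
If you prefer to check all prerequisites in one go, a small script along these lines can report which of the commands above are missing from your PATH. It is only a convenience sketch: it checks that each command exists, not that its version meets the minimum, and the `missing_tools` helper is a name made up for this example.

```python
import shutil

# Commands named in the prerequisites above; "aws" is optional.
REQUIRED = ["java", "mvn", "mysql"]
OPTIONAL = ["aws"]


def missing_tools(commands, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


print("Missing required tools:", ", ".join(missing_tools(REQUIRED)) or "none")
print("Missing optional tools:", ", ".join(missing_tools(OPTIONAL)) or "none")
```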

Installation

1. Clone the Repository

First, clone the Data Phantom repository to your local machine:

git clone https://github.com/arcticOak2/annihilator-data-playground.git
cd annihilator-data-playground

2. Install MariaDB

Install MariaDB based on your operating system:

macOS (Homebrew)

# Install MariaDB using Homebrew
brew install mariadb

# Start MariaDB service
brew services start mariadb

# Secure installation (optional but recommended)
mysql_secure_installation

Ubuntu/Debian (apt)

# Update package index
sudo apt update

# Install MariaDB
sudo apt install mariadb-server

# Start MariaDB service
sudo systemctl start mariadb
sudo systemctl enable mariadb

# Secure installation
sudo mysql_secure_installation

CentOS/RHEL (yum)

# Install MariaDB
sudo yum install mariadb-server

# Start MariaDB service
sudo systemctl start mariadb
sudo systemctl enable mariadb

# Secure installation
sudo mysql_secure_installation

Windows

# Download MariaDB from: https://mariadb.org/download/
# Run the installer and follow the setup wizard
# Or use Chocolatey:
choco install mariadb

# Start MariaDB service
net start mysql

3. Set Up the Database

Create the database and initialize the schema:

# Connect to MariaDB
mysql -u root -p

# Create database
CREATE DATABASE data_phantom;
EXIT;

# Run DDL script to create tables
mysql -u root -p data_phantom < src/main/resources/database.ddl

Note: Make sure to set a strong password for your database root user during the secure installation process.

4. Configure Application Settings

Add all your configurations and secrets to config-dev.yml:

# Edit src/main/resources/config-dev.yml
# Update the following sections with your values:

# ============================================
# Database Configuration
# ============================================
meta_store:
  url: jdbc:mariadb://localhost:3306/data_phantom
  user: root
  password: your_root_password


# ============================================
# JWT Configuration
# ============================================
jwt:
  secretKey: "your-super-secret-jwt-key-change-this-in-production"
  tokenExpirationMinutes: 60
  refreshTokenExpirationDays: 7


# ============================================
# AWS Configuration (Optional - for EMR integration)
# ============================================
connector:
  aws_emr:
    access_key: your_access_key
    secret_key: your_secret_key
    s3_bucket: your_s3_bucket
    s3_path_prefix: data-phantom
    region: us-east-1
    stack_name: DataPhantomClusterStack
    cluster_logical_id: DataPhantomCluster

Security Note: Never commit config-dev.yml with real credentials to version control. Use environment-specific config files or a secrets management system for production.
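
One way to replace the placeholder JWT secret with something unpredictable is to generate it from the operating system's random source. The sketch below uses Python's standard `secrets` module; the 48-byte length is an arbitrary choice for this example, and whether Data Phantom enforces any minimum key length is not stated in this guide.

```python
import secrets


def generate_jwt_secret(num_bytes=48):
    """Return a URL-safe random string built from num_bytes of OS-level randomness."""
    return secrets.token_urlsafe(num_bytes)


print(generate_jwt_secret())
```

Paste the printed value into `jwt.secretKey`, or supply it through your secrets management system, instead of committing it to the repository.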

5. Build the Application

Build the application using Maven:

# Clean and build the project
mvn clean install

# This will:
# - Compile the source code
# - Run unit tests
# - Package the application into a JAR file
# - Create target/annihilator-data-phantom-1.0-SNAPSHOT.jar

If the build is successful, you should see output similar to:

[INFO] BUILD SUCCESS
[INFO] Total time: 45.123 s
[INFO] Finished at: 2024-01-15T10:30:45Z

6. Start the Application

Run the Data Phantom server:

java -jar target/annihilator-data-phantom-1.0-SNAPSHOT.jar server src/main/resources/config-dev.yml

The application will start and you should see output indicating the server is running:

INFO  [2024-01-15 10:35:00,123] io.dropwizard.server.ServerFactory: Starting DataPhantomApplication
INFO  [2024-01-15 10:35:01,456] org.eclipse.jetty.server.Server: Started @2345ms

Configuration

Data Phantom uses a YAML configuration file located at src/main/resources/config-dev.yml. Here are the key configuration sections:

Server Configuration

server:
  applicationConnectors:
    - type: http
      port: 9092
  adminConnectors:
    - type: http
      port: 9093

Database Configuration

meta_store:
  driverClass: org.mariadb.jdbc.Driver
  url: ${MYSQL_URL:jdbc:mariadb://localhost:3306/data_phantom}
  user: ${MYSQL_USER:root}
  password: ${MYSQL_PASSWORD:your-password}
  maxSize: 50

JWT Authentication

jwt:
  secretKey: "${JWT_SECRET_KEY:your-secret-key}"
  tokenExpirationMinutes: 60
  refreshTokenExpirationDays: 7

For detailed configuration options, see the Configuration Guide.

Complete Development Setup

Set up the complete Data Phantom Platform, with both backend and frontend, for a full development experience:

Backend Setup: Java-based API server with Dropwizard
Frontend Setup: React-based dashboard interface
Database Setup: MariaDB with DDL schema

1. Clone Required Repositories

Clone both the backend and frontend repositories:

# Clone backend repository
git clone https://github.com/arcticOak2/annihilator-data-playground.git
cd annihilator-data-playground

# Clone frontend repository (in separate terminal/directory)
git clone https://github.com/arcticOak2/data-phantom-dashboard.git
cd data-phantom-dashboard

2. Set Up and Initialize the Database

Install MariaDB and create the schema using the DDL file:

# Install MariaDB (macOS)
brew install mariadb
brew services start mariadb

# Create database
mysql -u root -p
CREATE DATABASE data_phantom;
EXIT;

# Run DDL file to create all meta tables
cd annihilator-data-playground
mysql -u root -p data_phantom < src/main/resources/database.ddl

DDL Location: The database schema file is located at src/main/resources/database.ddl in the backend repository.

3. Build and Start Backend

Build the backend JAR and start the server:

# Navigate to backend directory
cd annihilator-data-playground

# Update configuration file with your database credentials
# Edit src/main/resources/config-dev.yml and add your secrets:
# - Database password
# - JWT secret key
# - AWS credentials (if using EMR)

# Build the project
mvn clean install

# Start the backend server
java -jar target/annihilator-data-phantom-1.0-SNAPSHOT.jar server src/main/resources/config-dev.yml

Configuration: Update src/main/resources/config-dev.yml with your database password, JWT secret key, and other credentials before starting the server.

Configuration Parameters Explained

Here are the key configuration sections you need to customize:

Server Configuration

server:
  applicationConnectors:
    - type: http
      port: 9092              # Main API server port
  adminConnectors:
    - type: http
      port: 9093              # Admin/health check port

Purpose: Defines which ports the application listens on for API requests and admin operations.

Database Configuration

meta_store:
  driverClass: org.mariadb.jdbc.Driver
  url: jdbc:mariadb://localhost:3306/data_phantom
  user: root
  password: your_root_password    # UPDATE THIS
  maxSize: 50                     # Max database connections
  minSize: 10                     # Min database connections
  validationQuery: "SELECT 1"     # Health check query

Purpose: Connects to MariaDB/MySQL database where all playground metadata, task definitions, and execution history are stored.

JWT Authentication

jwt:
  secretKey: "your-super-secret-jwt-key"    # UPDATE THIS
  tokenExpirationMinutes: 60                # Access token lifetime
  refreshTokenExpirationDays: 7             # Refresh token lifetime

Purpose: Secures API endpoints with JWT tokens. Users must authenticate to access the platform.

Concurrency & Performance

concurrency_config:
  adhoc_threadpool_size: 200                   # Concurrent adhoc executions
  scheduled_threadpool_size: 200               # Concurrent scheduled executions
  scheduler_sleep_time: 300000                 # 5 minutes - priority queue scan interval
  playground_execution_grace_period: 300000    # 5 minutes - execution timeout
  playground_max_execution_frequency: 360000   # 6 minutes - min time between runs

Purpose: Controls how many playgrounds can run simultaneously and how the scheduler operates.

Reconciliation Settings

reconciliation_settings:
  exact_match_threshold: 1048576    # 1MB - files smaller use exact matching
  false_positive_rate: 0.1          # 10% - bloom filter error rate
  estimated_rows: 1000000           # Expected rows for bloom filter sizing

Purpose: Determines when to use exact matching vs bloom filter for data reconciliation based on file size.
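
To get a feel for what `false_positive_rate` and `estimated_rows` imply, the textbook Bloom filter sizing formulas can be applied to the values above: the bit count is m = -n ln(p) / (ln 2)^2 and the number of hash functions is k = (m / n) ln 2. Whether Data Phantom sizes its filter exactly this way is an assumption; the sketch only shows the standard calculation, and `bloom_parameters` is a name made up for this example.

```python
import math


def bloom_parameters(estimated_rows, false_positive_rate):
    """Standard Bloom filter sizing: bits m = -n*ln(p)/(ln 2)^2, hashes k = (m/n)*ln 2."""
    bits = math.ceil(-estimated_rows * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / estimated_rows * math.log(2)))
    return bits, hashes


bits, hashes = bloom_parameters(estimated_rows=1_000_000, false_positive_rate=0.1)
print(f"{bits} bits (~{bits / 8 / 1024:.0f} KiB), {hashes} hash functions")
```

Lowering `false_positive_rate` increases both the memory used by the filter and the number of hash computations per row, so the setting trades accuracy against cost.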

AWS EMR Configuration (Optional)

connector:
  aws_emr:
    access_key: your-access-key           # UPDATE THIS
    secret_key: your-secret-key           # UPDATE THIS
    s3_bucket: your-s3-bucket             # UPDATE THIS
    s3_path_prefix: data-phantom          # S3 folder prefix
    region: us-east-1                     # AWS region
    stack_name: DataPhantomClusterStack   # CloudFormation stack
    cluster_logical_id: DataPhantomCluster
    step_polling_interval: 30000          # 30 seconds - EMR status check
    max_step_retries: 3                   # Retry failed EMR steps

Purpose: Enables execution of Hive, Presto, and Spark SQL tasks on AWS EMR clusters. All task outputs are stored in S3.

CloudFormation Requirement: Data Phantom expects a deployed CloudFormation stack whose template includes a "random" parameter. To spin up EMR clusters dynamically, Data Phantom updates this parameter with a random integer value.

CloudFormation Setup Requirements:

Random Parameter: Your CloudFormation template must include a parameter named "random"
EMR Lifecycle: The CloudFormation script should handle EMR cluster lifecycle management (creation, scaling, termination)
Stack Updates: Data Phantom updates the stack by changing the random parameter value
Cluster Management: Auto-scaling, spot instances, and termination policies should be defined in the CloudFormation template
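
For orientation, a stack update that changes only the random parameter could look roughly like the sketch below, written against boto3's CloudFormation client. This is an illustration of the mechanism, not Data Phantom's own code: the helper names are hypothetical, the value range is arbitrary, and a real template with additional parameters or IAM resources would also need `UsePreviousValue` entries or `Capabilities`.

```python
import random


def build_stack_update(stack_name, rng=random):
    """Build keyword arguments for an UpdateStack call that changes only `random`."""
    return {
        "StackName": stack_name,
        "UsePreviousTemplate": True,           # keep the deployed template as-is
        "Parameters": [
            {"ParameterKey": "random", "ParameterValue": str(rng.randint(1, 1_000_000))},
        ],
    }


def trigger_cluster_update(stack_name, region):
    """Send the update with boto3 (needs the boto3 package and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    client = boto3.client("cloudformation", region_name=region)
    return client.update_stack(**build_stack_update(stack_name))


print(build_stack_update("DataPhantomClusterStack"))
```

Because the template itself is unchanged, the new parameter value is what makes CloudFormation treat the request as an update.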

Example CloudFormation Parameter:

Parameters:
  random:
    Type: Number
    Description: Random value to trigger cluster updates
    Default: 12345

MySQL Connector (Optional)

connector:
  mysql:
    driverClass: org.mariadb.jdbc.Driver
    url: jdbc:mariadb://localhost:3306/your_data_db
    user: your_username                   # UPDATE THIS
    password: your_password               # UPDATE THIS
    outputDirectory: /tmp/sql-output      # Local output directory

Purpose: Enables MySQL tasks to connect to external MySQL/MariaDB databases for data extraction and processing.

The backend will start on:

API Server: http://localhost:9092
Admin Interface: http://localhost:9093

4. Set Up and Start the Frontend

Install dependencies and start the React dashboard:

# Navigate to frontend directory (in new terminal)
cd data-phantom-dashboard

# Install dependencies
npm install

# Start the development server
npm start

The frontend dashboard will be available at:

React Dashboard: http://localhost:3000

Frontend Repository: The React dashboard is maintained in a separate repository, data-phantom-dashboard (https://github.com/arcticOak2/data-phantom-dashboard).

5. Verify Complete Setup

Test that all components are working together:

Backend Health Check

Verify API is running:

curl http://localhost:9093/health

Database Connection

Test database connectivity:

curl http://localhost:9092/data-phantom/ping

Frontend Dashboard

Access the React dashboard at http://localhost:3000 and verify it can communicate with the backend API.

Creating Your First Playground

With both backend and frontend running, you can now create your first data processing playground:

1. Access the Application

Open your web browser and navigate to:

http://localhost:9092

You should see the Data Phantom API endpoints. For the web interface, open the React dashboard at http://localhost:3000.

2. Register a User

Create a user account using the API:

curl -X POST http://localhost:9092/auth/register \
  -H "Content-Type: application/json" \
  -d '{
    "username": "your_username",
    "email": "your_email@example.com",
    "password": "your_secure_password"
  }'

3. Create a Playground

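Create your first playground for organizing your data processing tasks. Replace YOUR_JWT_TOKEN with a valid access token for your account and your_user_id with your user ID: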

curl -X POST http://localhost:9092/data-phantom/playground \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -d '{
    "name": "My First Playground",
    "userId": "your_user_id",
    "cronExpression": "0 9 * * MON-FRI"
  }'
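
The same requests can be issued from a script. The sketch below builds the playground request with Python's standard library, using the endpoint, headers, and payload shown in the curl example; the helper names are hypothetical, and the assumption that responses are JSON is not confirmed by this guide.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9092"


def build_post(path, payload, token=None):
    """Build (but do not send) a JSON POST request for the Data Phantom API."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request):
    """Send a request and decode the response body, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


create_playground = build_post(
    "/data-phantom/playground",
    {"name": "My First Playground", "userId": "your_user_id",
     "cronExpression": "0 9 * * MON-FRI"},
    token="YOUR_JWT_TOKEN",
)
print(create_playground.get_method(), create_playground.full_url)
# Call send(create_playground) once the backend is running and the token is real.
```

The cron expression in the payload has five fields (minute, hour, day of month, month, day of week), so "0 9 * * MON-FRI" means 09:00 on weekdays.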
                                    

Verification

Verify that your Data Phantom installation is working correctly:

Health Check

Test the health endpoint:

curl http://localhost:9093/health

Database Connection

Verify database connectivity:

curl http://localhost:9092/data-phantom/ping

API Endpoints

Test API accessibility:

curl http://localhost:9092/data-phantom/playground/test-user
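
To run the unauthenticated checks in one step, a short script can request each endpoint and report whether it answered. The sketch below uses only the health and ping URLs listed above and treats HTTP 200 as success, which is an assumption about how these endpoints respond; the function names are made up for this example.

```python
import urllib.error
import urllib.request

ENDPOINTS = {
    "health": "http://localhost:9093/health",
    "ping": "http://localhost:9092/data-phantom/ping",
}


def http_status(url):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code                    # server answered with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None                        # connection refused, timeout, DNS failure


def verify(endpoints, fetch=http_status):
    """Map each endpoint name to True when it answers with HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

With the backend running, `print(verify(ENDPOINTS))` prints one True or False entry per endpoint.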

Next Steps

Congratulations! You now have Data Phantom running. Here's what you can do next:

Quick Start Guide: Get started quickly with a step-by-step visual guide
Advanced Configuration: Customize Data Phantom for your specific needs
Install Dashboard: Set up the React-based web interface