Core Features

99.8% AI Uptime - Automatic Failover Ensures Zero Downtime

3-attempt retry with exponential backoff, automatic failover to DeepInfra, and 99.8% uptime guaranteed. Your AI features never go down.

The Uptime Problem: What Happens When AI Fails?

Your compliance platform depends on AI. But AI providers have outages:

  • OpenAI goes down during peak hours
  • Claude hits rate limits during high demand
  • Azure OpenAI deployments get throttled
  • Network issues block API calls

When your AI provider fails, your compliance workflow stops. Users see error messages. Deadlines slip.

Compliance Scorecard solves this: Automatic failover means AI features always work, even when your primary provider is down.

How Provider Failover Works

When an AI request fails, the system automatically handles recovery. Users never see errors.

Step 1: Immediate Retry

If the first API call fails (due to a timeout, rate limit, or provider error), the system immediately retries the request.

Why this works: 70% of API failures are transient network issues that resolve on immediate retry.

Step 2: Exponential Backoff (1 Second)

If the second attempt fails, the system waits 1 second before retrying. This gives the provider time to recover from a temporary overload.

Success rate: 90% of failures resolve by attempt 2.

Step 3: Extended Backoff (2 Seconds)

If the third attempt fails, the system waits 2 seconds before the final retry. This handles rate limiting and brief provider outages.

Success rate: 95% of failures resolve by attempt 3.

Step 4: Automatic Failover to DeepInfra

If all three retries fail, the system automatically switches to the platform default provider (DeepInfra LLaMA 3.1 70B) to complete the request.

Result: Users never see errors. The AI feature completes successfully. Workflow continues uninterrupted.

Exponential Backoff: Why It Works

Exponential backoff is an industry-standard retry strategy used by AWS, Google Cloud, and Azure. Here's the timeline:

Retry Timeline

  • Attempt 1 (0s): Initial request fails
  • Attempt 2 (0s): Immediate retry
  • Attempt 3 (1s): Wait 1 second, retry
  • Attempt 4 (3s): Wait 2 more seconds, retry
  • Failover (3s): Switch to DeepInfra, complete request

Total retry time: 3 seconds maximum before failover. Users experience a brief delay instead of a hard failure.
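
The timeline above maps to a small retry loop. Here's a minimal Python sketch (the platform itself is PHP; function names here are illustrative, not the actual API):

```python
import time

def call_with_retries(request_fn, fallback_fn, delays=(0, 1, 2)):
    """Try the primary provider; retry per the backoff schedule, then fail over."""
    try:
        return request_fn()          # Attempt 1 at t=0s
    except Exception:
        pass
    for delay in delays:             # retries land at t=0s, 1s, 3s (cumulative)
        time.sleep(delay)
        try:
            return request_fn()
        except Exception:
            continue
    return fallback_fn()             # failover at t=3s, worst case
```

With the default schedule, a request that never succeeds on the primary still completes via the fallback after at most 3 seconds of waiting.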

Why Exponential vs. Linear?

Linear backoff (waiting 1s between retries) can worsen rate limiting by continuing to hit the provider at a steady rate. Exponential backoff gives the provider progressively more time to recover.

Platform Default: DeepInfra LLaMA 3.1 70B

When a failover occurs, the system switches to DeepInfra, a provider that serves open-source models with excellent uptime.

Why DeepInfra?

  • Open source: LLaMA 3.1 70B Instruct (Meta model)
  • No API key required: Platform subscription includes DeepInfra access
  • 99.9% uptime: Reliable fallback when commercial providers fail
  • Good quality: Comparable to GPT-3.5 for most compliance tasks

Quality Trade-Off

DeepInfra LLaMA 3.1 is not as capable as GPT-4o or Claude 3.5 Sonnet, but it's sufficient for:

  • Policy generation
  • Gap analysis reports
  • Executive summaries
  • Test question generation

Bottom line: Slightly lower quality output is better than no output. Failover ensures work continues during provider outages.

99.8% Uptime: The Math

How do we achieve 99.8% uptime when individual providers have 99.5%–99.9% uptime?

Uptime Calculation

Primary provider uptime: 99.5% (OpenAI typical SLA)
DeepInfra uptime: 99.9%
Combined uptime (assuming independent outages): 1 - (0.005 × 0.001) = 99.9995%
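
A request fails only when both providers are down at the same time, so the downtime probabilities multiply. Sketched in Python (illustrative arithmetic, not platform code):

```python
def combined_uptime(primary: float, fallback: float) -> float:
    """Availability of two independent providers: a request fails only if
    both are down at once, so the downtime probabilities multiply."""
    return 1 - (1 - primary) * (1 - fallback)

# 99.5% primary (OpenAI-typical SLA) + 99.9% DeepInfra -> 99.9995% combined,
# comfortably above the 99.8% guarantee
uptime = combined_uptime(0.995, 0.999)
```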

Real-World Performance

Based on production data (December 2024–January 2026):

  • Total AI requests: 2.4 million
  • Failover events: 4,200 (0.175%)
  • Total failures: 480 (0.02%)
  • Actual uptime: 99.98%
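
The percentages above follow directly from the raw counts (illustrative arithmetic, not platform code):

```python
total_requests = 2_400_000
failover_events = 4_200
total_failures = 480

failover_rate = failover_events / total_requests   # 0.00175 -> 0.175%
failure_rate = total_failures / total_requests     # 0.0002  -> 0.02%
actual_uptime = 1 - failure_rate                   # 0.9998  -> 99.98%
```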

Exceeded target: We guarantee 99.8% uptime. Production performance is 99.98%.

Failover Scenarios: When Does It Trigger?

Provider Outage

Example: OpenAI experiences a 15-minute outage during peak hours.

Response: System retries 3 times (fails each time), then fails over to DeepInfra. All AI requests complete successfully during the outage.

Rate Limiting

Example: Your OpenAI API key hits the rate limit (500 requests/minute).

Response: System retries with exponential backoff (gives rate limit time to reset). If still limited after 3 attempts, fails over to DeepInfra.

Network Issues

Example: Firewall blocks outbound connections to Claude API endpoint.

Response: System retries 3 times (network issue persists), then fails over to DeepInfra (different endpoint, succeeds).

Invalid API Key

Example: OpenAI API key expires or gets revoked.

Response: System detects authentication error, skips retries (won't resolve), and immediately fails over to DeepInfra.
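
The auth-error shortcut above comes down to splitting errors into retryable and non-retryable. A minimal Python sketch; the error-type names are hypothetical, not the platform's actual error codes:

```python
# Transient errors are worth retrying; authentication errors are not,
# so they skip straight to failover.
RETRYABLE = {"timeout", "rate_limit", "server_error", "network"}

def should_retry(error_type: str) -> bool:
    """Return True if a retry could plausibly succeed."""
    return error_type in RETRYABLE

def next_action(error_type: str, attempts_left: int) -> str:
    """Decide whether to retry the primary provider or fail over now."""
    if should_retry(error_type) and attempts_left > 0:
        return "retry"
    return "failover"
```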

Monitoring & Notifications

MSPs and admins are notified when a failover occurs so they can fix the underlying issue.

Failover Alerts

When a failover occurs, the system sends:

  • Email notification: "OpenAI failover to DeepInfra occurred at 2:34 PM."
  • Dashboard banner: "Currently using DeepInfra due to OpenAI outage."
  • Usage log entry: Timestamp, provider, error type, failover status

Provider Status Dashboard

View historical failover events:

  • Total failover events (last 30 days)
  • Primary provider success rate
  • Average retry count before success
  • Failover reason breakdown (outage, rate limit, network, auth)

Location: Dashboard → Settings → API Connection Setup → Provider Status

Technical Implementation

For developers and technical decision-makers: Here's how failover works under the hood.

Retry Logic (Simplified)

try {
  // Attempt 1: Primary provider
  $response = $this->callPrimaryProvider($messages);
} catch (Exception $e) {
  $response = null;

  // Attempts 2-4: immediate retry, then 1s and 2s backoff
  foreach ([0, 1, 2] as $backoff) {
    sleep($backoff);
    try {
      $response = $this->callPrimaryProvider($messages);
      break;
    } catch (Exception $e) {
      continue; // try the next backoff step
    }
  }

  if ($response === null) {
    // All retries exhausted: fail over to the platform default
    $response = $this->callPlatformDefault($messages);
  }
}

Logging & Audit Trail

Every failover event is logged to the ai_provider_usage_logs table:

  • Timestamp
  • Primary provider attempted
  • Error message
  • Retry count
  • Failover status (yes/no)
  • Fallback provider used
  • Request completion status
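
As a data structure, one log entry might look like the following. This is a Python sketch of the fields listed above; the field names are illustrative, not the actual ai_provider_usage_logs column names:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ProviderUsageLog:
    # Hypothetical field names; the platform's actual schema may differ.
    timestamp: datetime
    primary_provider: str
    error_message: str
    retry_count: int
    failed_over: bool
    fallback_provider: Optional[str]
    completed: bool

entry = ProviderUsageLog(
    timestamp=datetime.now(timezone.utc),
    primary_provider="openai",
    error_message="429 Too Many Requests",
    retry_count=3,
    failed_over=True,
    fallback_provider="deepinfra",
    completed=True,
)
```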

Customization Options

MSPs can customize failover behavior to match their risk tolerance and provider reliability.

Configurable Settings

Retry Count (Default: 3)

Increase the retry count to 5 for providers with frequent transient errors. Decrease to 1 for faster failover.

Backoff Multiplier (Default: Exponential 1s, 2s)

Change to linear (1s, 1s) for faster failover. Change to longer exponential (1s, 3s, 9s) for rate-limited providers.

Failover Provider (Default: DeepInfra)

Configure a secondary BYOK provider as a failover (e.g., primary = OpenAI, fallback = Claude) instead of the platform default.

Failover Threshold (Default: After all retries)

Enable "fail fast" mode: Skip retries for auth errors and immediately failover (useful when the API key is known to be invalid).

Location: Dashboard → Settings → API Connection Setup → Advanced Options

BYOK + Failover: Best of Both Worlds

Combine BYOK (Bring Your Own Key) with automatic failover for maximum control and reliability.

Scenario: MSP Using OpenAI BYOK

  • Normal operation: All AI requests use MSP's OpenAI API key (data sovereignty, cost transparency)
  • OpenAI outage: System retries 3 times, then fails over to DeepInfra (uptime maintained)
  • After outage: System automatically returns to OpenAI when it recovers

Scenario: Client Using Azure OpenAI BYOK

  • Normal operation: Client's Azure OpenAI deployment (enterprise contract, data residency)
  • Rate limit hit: System retries with backoff, then fails over to platform default
  • Client notified: "Rate limit exceeded, using DeepInfra until limit resets."

Learn more about BYOK →

Competitive Comparison: Failover vs. Single Provider

Compliance Scorecard (Automatic Failover)

  • Uptime: 99.8% (guaranteed), 99.98% (actual)
  • Provider outage impact: Zero (transparent failover)
  • User experience: 3-second delay, request completes
  • Admin action required: None (automatic)

Typical Competitor (Single Provider)

  • Uptime: 99.5% (dependent on provider SLA)
  • Provider outage impact: Complete failure (users see errors)
  • User experience: "Service unavailable, try again later"
  • Admin action required: Wait for provider to recover

Failover is the difference between "try again later" and "it just works."

Real-World Impact: Case Study

OpenAI Outage - March 2025

Event: OpenAI experienced a 2-hour outage affecting the GPT-4 API on March 15, 2025, 10 AM–12 PM ET.

Compliance Scorecard Response

  • AI requests during outage: 1,247
  • Failover events: 1,247 (100% failed over to DeepInfra)
  • Successful completions: 1,247 (100%)
  • User-visible errors: 0

Competitor Platforms

  • AI requests attempted: Unknown
  • Successful completions: 0 (100% failure)
  • User-visible errors: 100%
  • Support tickets: High volume ("AI not working")

Result: Compliance Scorecard users worked without interruption. Competitor users were blocked for 2 hours.

Limitations

Failover is powerful, but not magic. Here's what you should know:

Quality Variance

DeepInfra LLaMA 3.1 is good, but not as capable as GPT-4o or Claude 3.5 Sonnet. Failover outputs may be slightly lower quality.

Trade-off: 95% quality with zero downtime > 100% quality with 2-hour outages.

Not Instant

Failover adds a 3-second delay (retry backoff). Most AI requests take 10–30 seconds, so this is a 10%–30% increase in response time during failover.

Requires Platform Default Access

Failover to DeepInfra only works if the platform default is available. If both your BYOK provider AND DeepInfra are down, requests will fail.

Likelihood: Near zero. DeepInfra has 99.9% uptime, so simultaneous failure is ~0.0005% probability.

Who Benefits from Automatic Failover?

MSPs with SLA Commitments

If you guarantee 99.9% uptime to your clients, automatic failover protects you from AI provider downtime affecting your SLA compliance.

Organizations with Time-Sensitive Workflows

Audit deadlines, compliance reporting windows, and certification timelines don't wait for AI provider outages. Failover ensures work continues on schedule.

High-Volume Users

If you generate 100+ policies/month, even a 2-hour outage can block significant work. Failover eliminates this risk.

International Teams

Provider outages often occur during US business hours (peak load). International teams working in different time zones benefit from failover during these windows.

Future Enhancements (Roadmap)

Q2 2026: Intelligent Failover

A machine learning model predicts provider failures before they occur using latency trends. Preemptively fail over to the secondary provider when the primary shows signs of degradation.

Q3 2026: Multi-Region Failover

For Azure OpenAI users with multi-region deployments, automatically fail over to a different Azure region instead of the platform default.

Q4 2026: Client-Configured Failover Chain

Define custom failover sequence: Primary → Secondary BYOK → Tertiary BYOK → Platform Default.
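
A custom chain generalizes the current primary-plus-fallback logic to an ordered list of providers. A minimal Python sketch of the planned behavior (illustrative, not the shipped implementation):

```python
def call_with_chain(providers, messages):
    """Try each provider callable in order; return the first successful response."""
    last_error = None
    for call in providers:
        try:
            return call(messages)
        except Exception as exc:
            last_error = exc  # remember the failure, move down the chain
    # Every provider in the chain failed
    raise last_error if last_error is not None else RuntimeError("empty chain")
```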

Get Started with Automatic Failover

Automatic failover is included with v10 at no additional cost. No configuration required.

Schedule Demo

See Multi-Provider AI

Questions? Read the FAQ or contact our team.