Belto Infrastructure v1.0

Intelligence at the Edge.
Infrastructure for Local AI.

SlyOS provides the complete toolchain to deploy Small Language Models (SLMs) to your users' devices. Zero network latency. 100% Privacy. No per-token fees.

Start Integration
$ npm install @belto/slyos-sdk
WebGPU   ● READY
Ollama   ● READY
iOS      COMING SOON
Android  COMING SOON

See SlyOS in Action

Running `quantum-360m` entirely in the browser via WebGPU.

[Demo screenshot: SlyOS Demo Interface at localhost:3000/demo, showing "Inference Active: 45ms"]

It's actually this simple.

No complex Python environments. No server management. Just a 3-step pipeline.

1. Select Model

Choose from our curated Zoo of optimized SLMs (Quantum, VoiceCore) based on your use case.

Selected: Quantum-360M (Text Generation • 200MB)

2. Configure

Set quantization levels (q4, q8) and memory limits. We handle the device compatibility check.

Quantization: INT4
Cache Strategy: Persistent
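
For illustration, a configured init call might look like the sketch below. The `quantization`, `maxMemoryMB`, and `cacheStrategy` option names are assumptions for this sketch, not confirmed SDK API:

// Hypothetical option names; check the SDK reference for the real ones.
import SlyOS from '@belto/slyos-sdk';

await SlyOS.init({
  model: 'quantum-360m',
  quantization: 'q4',          // assumed: 'q4' | 'q8'
  maxMemoryMB: 512,            // assumed ceiling for weights + KV cache
  cacheStrategy: 'persistent'  // assumed: keep weights in IndexedDB across sessions
});
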
3. Fetch SDK

Drop the code into your app. The SDK handles downloading, caching, and inference.

import SlyOS from '@belto/slyos-sdk';
await SlyOS.init({
  model: 'quantum-360m'
});
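
Once `init` resolves, inference runs fully on-device. A minimal usage sketch, assuming a `generate` method and a response shape that may differ from the shipped SDK:

// Hypothetical inference call for illustration.
const reply = await SlyOS.generate({
  prompt: 'Summarize: on-device inference keeps data local.',
  maxTokens: 128               // assumed option name
});
console.log(reply.text);       // assumed response shape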

Why move to the Edge?

Cloud AI is expensive, slow, and a privacy nightmare. SlyOS changes the paradigm.

Zero-Knowledge Privacy

With SlyOS, the prompt never leaves the device. Inference happens 100% locally on the user's NPU, GPU, or CPU.

  • GDPR & HIPAA Compliant by default
  • No data training on user inputs
  • Immune to server-side leaks

Near-Zero Latency

Eliminate the network hop. No queuing, no server downtime. Responses are generated instantly on-device.

  • Works Offline (Airplane Mode)
  • Zero Server Costs per token
  • Consistent performance

Built for the next generation of apps.

When latency is zero and privacy is absolute, new product categories emerge.

Private RAG & Search

Build apps that search through PDFs, notes, or emails locally. The user's personal data never needs to be uploaded to a vector database in the cloud.
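
As a sketch of the pattern (assuming a hypothetical `SlyOS.embed` call; the SDK's real embedding API, if any, may differ):

// Hypothetical local RAG loop: embed chunks, rank by cosine similarity, all on-device.
import SlyOS from '@belto/slyos-sdk';

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function searchLocally(query: string, chunks: string[]) {
  const q = await SlyOS.embed(query);                            // assumed API
  const vectors = await Promise.all(chunks.map(c => SlyOS.embed(c)));
  return chunks
    .map((text, i) => ({ text, score: cosine(q, vectors[i]) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3);                                                // top matches never leave the device
}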

NPC Intelligence

Give game characters dynamic conversations without server costs. Run `quantum-135m` directly in the browser game loop with negligible FPS impact.
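
A sketch of how that might wire up, using the same assumed `generate` call as above (the NPC object is illustrative):

// Illustrative: dialogue generation runs async, off the render path.
import SlyOS from '@belto/slyos-sdk';

await SlyOS.init({ model: 'quantum-135m' });

function onPlayerTalksTo(npc: { say(line: string): void }, playerLine: string) {
  // The game loop keeps rendering while the model works.
  SlyOS.generate({ prompt: `Guard NPC replies to: "${playerLine}"`, maxTokens: 48 })
    .then(reply => npc.say(reply.text));                         // assumed response shape
}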

Offline Translation

Travel apps that work in airplane mode. Translate voice and text instantly without needing a roaming data connection.

The Model Zoo

Optimized weights for every use case. Click to deploy.

Complete Fleet Observability

Track latency, memory pressure, and model performance across every device in your fleet. Real-time telemetry without seeing user data.

14,203 Active Devices
~42ms Avg Latency
Device UUID     Environment      Model Loaded    VRAM Usage    Inference Speed  Status
dev-8a92-f3...  Chrome / WebGPU  quantum-360m    412MB / 8GB   52 t/s           ONLINE
dev-b211-a9...  iOS 17.2         voicecore-base  256MB / 6GB   0.3x RT          ONLINE
dev-c440-e1...  Android 14       quantum-135m    180MB / 12GB  18 t/s           IDLE
dev-f999-z0...  Mac / Metal      quantum-8b      5.2GB / 32GB  88 t/s           ONLINE
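
For illustration, wiring a device into this dashboard might look like the sketch below; the `telemetry` option and its field names are assumptions, not confirmed SDK API:

// Hypothetical telemetry config: metrics only, never prompts or outputs.
await SlyOS.init({
  model: 'quantum-360m',
  telemetry: {
    endpoint: 'https://fleet.example.com/ingest',                    // your collector (placeholder URL)
    fields: ['latencyMs', 'vramBytes', 'tokensPerSecond', 'status']  // assumed field names
  }
});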

Dual Deployment Architectures

SlyOS isn't just for phones. It's an infrastructure layer that adapts to the available compute environment.

Architecture A

Direct Edge Inference

The standard SlyOS flow. A user's single device downloads an optimized SLM (e.g., 360M params) and runs it completely independently.

  • Ultra-low latency
  • Works offline
  • 100% Privacy

[Diagram: optimized SLM cached on the user's device]
Architecture B

Ad-Hoc Local Grid

Leverage idle office compute. For latency-tolerant tasks, SlyOS splits a large model (e.g., 8B) across available workstations over the local 1GbE LAN.

  • Utilizes idle VRAM/CPU
  • Runs larger models
  • Data stays on LAN

[Diagram: a large model (70B) sharded across workstations; job results recombined into the final output]
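
A sketch of what opting into grid mode might look like; `mode`, `discovery`, and `maxPeers` are illustrative names, not confirmed SDK options:

// Hypothetical grid configuration; option names are illustrative.
import SlyOS from '@belto/slyos-sdk';

await SlyOS.init({
  model: 'quantum-8b',
  mode: 'grid',        // assumed: shard layers across LAN peers instead of one device
  discovery: 'mdns',   // assumed: locate idle workstations on the local network
  maxPeers: 8          // assumed: cap on workstations recruited per job
});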

Technical FAQ

Q: Does the model re-download on every page load?
No. SlyOS uses the Cache API and IndexedDB to store model weights persistently on the device. The download (~200MB for base models) happens only once on the first load. Subsequent loads are instant.

Q: What happens on devices that can't run the model?
The SDK runs a hardware check, `SlyOS.checkCompatibility()`, before initializing. If the device lacks WebGPU support or sufficient RAM, you can configure a fallback to our cloud API or a graceful degradation message.
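
A sketch of that fallback pattern, assuming `checkCompatibility()` resolves to an object with a boolean flag (the exact return shape is an assumption):

// Illustrative fallback flow around the documented compatibility check.
const compat = await SlyOS.checkCompatibility();
if (compat.ok) {                                             // assumed return shape
  await SlyOS.init({ model: 'quantum-360m' });
} else {
  // Fall back to a cloud endpoint or degrade gracefully.
  showBanner('On-device AI unavailable on this hardware.');  // hypothetical app helper
}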

Q: Won't on-device inference drain the battery?
Inference is energy-intensive, but we use optimized WebGPU shaders that are significantly more efficient than CPU inference. For short tasks (chat, command classification), the impact is negligible. For long-running tasks, we recommend checking battery status via the SDK.
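
In the browser, one way to gate long jobs is the standard Battery Status API (the SDK may expose its own wrapper):

// Battery Status API is available in Chromium-based browsers.
const battery = await (navigator as any).getBattery();
if (battery.charging || battery.level > 0.5) {
  runLongSummarizationJob();                       // hypothetical app function
}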

Q: Are the models aligned or content-filtered?
Models in the Zoo retain their original base licensing and alignment (e.g., Llama 3 safeguards). However, we offer "Quantum-Code" variants that are fine-tuned for strict instruction following with reduced refusal rates for developer use cases.