Cantele

Active Development

Exploring low-VRAM AI inference via custom Vulkan compute.

The Problem

Running SD 1.5 with ControlNets requires more VRAM than low-end hardware provides. Existing tools load the full UNet into GPU memory at once. OnnxStream demonstrated that block-by-block streaming works on CPU — the question is whether the same approach is viable on GPU via Vulkan compute, and whether the overhead is acceptable.

The Approach

Hand-written Vulkan compute shaders (GLSL → SPIR-V) process each SD 1.5 UNet block in sequence, releasing VRAM between blocks. No Python, no CUDA: Vulkan runs cross-vendor on NVIDIA, AMD, and Intel. LLM inference runs through llama-server, with llguidance grammars for constrained generation. MIDI generation uses a Rust reimplementation of SkyTNT, a transformer that composes new MIDI from seed sequences.

The Result

The Vulkan SD pipeline has a complete custom shader set with a 4-phase GPU optimization strategy, but end-to-end validation hasn't been completed. The concept is genuinely unexplored territory: no existing tool does block-by-block Vulkan streaming for SD inference on GPU. The Rust MIDI generation model works as a standalone library. LLM orchestration via llama-server is functional.

Architecture

Orchestration: Central coordinator → Vulkan SD (custom compute shaders, block-streaming) + LLM (llama-server) + MIDI (Rust reimplementation of SkyTNT)

Demo & Screenshots

Pipeline components: Vulkan shader set, orchestration layer, MIDI generation output

Tech Stack

Rust

Zero-overhead orchestration. When coordinating real-time inference workloads, the orchestrator cannot be the bottleneck.

Vulkan

Direct GPU compute via hand-written GLSL shaders. Cross-vendor (NVIDIA, AMD, Intel) via SPIR-V. No CUDA dependency, no Python runtime.

LLM

Manages llama-server: model loading, prompt routing, and constrained generation with llguidance grammars.

Stable Diffusion

Custom block-streaming architecture for SD 1.5. The research question: can block-by-block Vulkan compute make SD viable on 2GB VRAM?

MIDI

Rust reimplementation of SkyTNT's MIDI generation model — a transformer that composes new MIDI from seed sequences. Same architecture, native speed, embeddable as a library.

Current Status

Early development, Vulkan pipeline on hold. The shader set is written and the block-streaming architecture is designed, but end-to-end SD generation hasn't been validated yet. MIDI generation and LLM orchestration are functional. The Vulkan work resumes after Uncanny Realm's content pipeline stabilizes.

What's Next

Validate the Vulkan SD pipeline end-to-end, add ControlNet support, and determine an integration path into the game engine.