<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>AI &#8211; Deep Core Labs</title>
	<atom:link href="https://deepcorelabs.com/category/ai/feed/" rel="self" type="application/rss+xml" />
	<link>https://deepcorelabs.com</link>
	<description>Building Extraordinary Brands</description>
	<lastBuildDate>Mon, 30 Mar 2026 19:38:48 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://deepcorelabs.com/wp-content/uploads/2015/09/deep-core-labs-logo-small-50x50.png</url>
	<title>AI &#8211; Deep Core Labs</title>
	<link>https://deepcorelabs.com</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Green Difference Studio — Free Online Green Screen &#038; Chroma Key Tool in Your Browser</title>
		<link>https://deepcorelabs.com/green-difference-studio-free-online-green-screen-chroma-key-tool-in-your-browser/</link>
					<comments>https://deepcorelabs.com/green-difference-studio-free-online-green-screen-chroma-key-tool-in-your-browser/#respond</comments>
		
		<dc:creator><![CDATA[Miro Hristov]]></dc:creator>
		<pubDate>Mon, 16 Mar 2026 01:37:40 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Javascript]]></category>
		<category><![CDATA[three.js]]></category>
		<category><![CDATA[Video]]></category>
		<guid isPermaLink="false">https://deepcorelabs.com/?p=5239</guid>

					<description><![CDATA[]]></description>
										<content:encoded><![CDATA[
		<div id="fws_69d41b8c5337d"  data-column-margin="default" data-midnight="dark"  class="wpb_row vc_row-fluid vc_row top-level"  style="padding-top: 0px; padding-bottom: 0px; "><div class="row-bg-wrap" data-bg-animation="none" data-bg-animation-delay="" data-bg-overlay="false"><div class="inner-wrap row-bg-layer" ><div class="row-bg viewport-desktop"  style=""></div></div></div><div class="row_col_wrap_12 col span_12 dark left">
	<div  class="vc_col-sm-12 wpb_column column_container vc_column_container col no-extra-padding inherit_tablet inherit_phone "  data-padding-pos="all" data-has-bg-color="false" data-bg-color="" data-bg-opacity="1" data-animation="" data-delay="0" >
		<div class="vc_column-inner" >
			<div class="wpb_wrapper">
				
	<div class="wpb_video_widget wpb_content_element vc_clearfix   vc_video-aspect-ratio-169 vc_video-el-width-100 vc_video-align-left" >
		<div class="wpb_wrapper">
			
			<div class="wpb_video_wrapper"><iframe title="Green Difference Studio — Free Open-Source Chroma Key Tool in the Browser" width="1080" height="608" src="https://www.youtube.com/embed/FnODFxK4WuE?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe></div>
		</div>
	</div>

			</div> 
		</div>
	</div> 
</div></div>
		<div id="fws_69d41b8c54a6b"  data-column-margin="default" data-midnight="dark"  class="wpb_row vc_row-fluid vc_row"  style="padding-top: 0px; padding-bottom: 0px; "><div class="row-bg-wrap" data-bg-animation="none" data-bg-animation-delay="" data-bg-overlay="false"><div class="inner-wrap row-bg-layer" ><div class="row-bg viewport-desktop"  style=""></div></div></div><div class="row_col_wrap_12 col span_12 dark left">
	<div  class="vc_col-sm-12 wpb_column column_container vc_column_container col no-extra-padding inherit_tablet inherit_phone "  data-padding-pos="all" data-has-bg-color="false" data-bg-color="" data-bg-opacity="1" data-animation="" data-delay="0" >
		<div class="vc_column-inner" >
			<div class="wpb_wrapper">
				<a class="nectar-button jumbo regular extra-color-gradient-2"  style="color: #ffffff; background-color: #00d195;" target="_blank" href="https://deepcorelabs.com/tools/green-difference-studio/" data-color-override="#00d195" data-hover-color-override="false" data-hover-text-color-override="#fff"><span class="start loading">Open Green Difference Studio - Remove Green Screen Online</span><span class="hover">Open Green Difference Studio - Remove Green Screen Online</span></a>
			</div> 
		</div>
	</div> 
</div></div>
		<div id="fws_69d41b8c559ed"  data-column-margin="default" data-midnight="dark"  class="wpb_row vc_row-fluid vc_row"  style="padding-top: 0px; padding-bottom: 0px; "><div class="row-bg-wrap" data-bg-animation="none" data-bg-animation-delay="" data-bg-overlay="false"><div class="inner-wrap row-bg-layer" ><div class="row-bg viewport-desktop"  style=""></div></div></div><div class="row_col_wrap_12 col span_12 dark left">
	<div  class="vc_col-sm-12 wpb_column column_container vc_column_container col no-extra-padding inherit_tablet inherit_phone "  data-padding-pos="all" data-has-bg-color="false" data-bg-color="" data-bg-opacity="1" data-animation="" data-delay="0" >
		<div class="vc_column-inner" >
			<div class="wpb_wrapper">
				
<div class="wpb_text_column wpb_content_element " >
	<div class="wpb_wrapper">
		<p><a href="https://deepcorelabs.com/tools/green-difference-studio/"><strong>Open Demo Online</strong></a> | <a href="https://github.com/deepcorelabs/green-difference-studio"><strong>Source on GitHub</strong></a></p>
<hr />
<p>So a couple weeks ago Corridor Crew dropped their video about CorridorKey — an open-source, AI-powered chroma keyer that uses a transformer network to solve the green screen &#8220;unmixing problem.&#8221; The thing is genuinely impressive. It takes a raw green screen frame and a rough alpha hint, then predicts true foreground color and a clean linear alpha for every pixel, including all the nightmare stuff like motion blur, hair, and out-of-focus edges. They trained it on procedurally generated 3D renders with mathematically perfect alpha data. It outputs 16-bit and 32-bit EXR files for Nuke and DaVinci. Serious tool for serious work.</p>
<p>I actually installed a quantized build, <a href="https://github.com/edenaion/EZ-CorridorKey">EZ-CorridorKey</a>, on my 4080 Super workstation, and it does work. The keying results are legitimately great — better than anything traditional keying can do on difficult footage. But the premask step is a pain. You need to feed it a decent black-and-white outline of your subject, and getting that right is its own little project. For clean studio footage it&#8217;s fine, but the workflow isn&#8217;t exactly &#8220;drop a file and go&#8221;. The included premask options (GVM AUTO, SAM2, MatAnyone2, VideoMaMa, etc.) produced very underwhelming results for me, and the tool crashed quite a bit.</p>
<p>That got me thinking. What if you just want to pull a quick key on a talking-head video and export it with alpha? What if you don&#8217;t want to install anything at all? What if you&#8217;re on a laptop with no dedicated GPU?</p>
<p>That&#8217;s how Green Difference Studio happened.</p>
<h2 id="standing-on-shoulders">Standing on shoulders</h2>
<p>I should give credit where it&#8217;s due. The original chroma key shader that got this project started came from <a href="https://www.urbanpixellab.com/realtime-greenscreen-keyer/">Urban Pixel Lab&#8217;s Realtime Greenscreen Keyer</a> (<a href="https://github.com/urbanpixellab/greenscreen-shader">GitHub</a>). Their WebGL shader was the foundation — the hue-based keying approach, the basic spill suppression logic, the general structure of doing chroma math in a fragment shader. From there it got extended pretty heavily with sampled key colors, curve-based threshold falloff, despill depth, choke/feather morphology, and all the other controls, but it wouldn&#8217;t exist without that starting point.</p>
<h2 id="the-whole-thing-was-vibe-coded">The whole thing was vibe-coded</h2>
<p>I&#8217;m not going to pretend this was some carefully architected project with a Jira board and sprint planning. I opened Claude Code, described what I wanted, and started iterating. Every feature in the app was built through conversation — me describing what I needed, sometimes yelling at the screen when the mute button wouldn&#8217;t toggle (SVG <code>hidden</code> attribute, never again), and watching the code take shape in real time.</p>
<p>The shader pipeline, the tracker system, the background frame cache, the export pipeline — all of it came from back-and-forth with an AI pair programmer. Some sessions were smooth. Others involved me typing in all caps because the video was blasting audio during frame extraction for the third time. That&#8217;s vibe coding. You ride the wave and sometimes the wave rides you.</p>
<p>No Figma mockups. No PRD. No architecture diagram. Just &#8220;I want this thing to exist&#8221; and then making it exist, one conversation at a time.</p>
<h2 id="what-it-actually-does">What it actually does</h2>
<p>Green Difference Studio runs entirely in your browser. You drop a video in, and it keys out the green screen in real time using a WebGL fragment shader powered by Three.js. No server, no upload, no waiting for a cloud GPU. Everything stays on your machine.</p>
<p>The keying controls are what you&#8217;d expect from a decent compositor — hue range, saturation floor, light range, edge feather. There&#8217;s spill suppression with despill lift to recover natural skin tones. You can preview the alpha channel to check your matte quality, and use choke/feather to clean up edges.</p>
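<p>Under the hood, the per-pixel decision is simple to state. Here&#8217;s a rough CPU-side JavaScript illustration of the hue-based test the shader performs; the real version runs in GLSL with more controls, and these parameter names are mine, not the app&#8217;s:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="js">// Simplified, CPU-side illustration of the hue-based key test.
// Parameter names are illustrative, not Green Difference Studio's actual API.
function keyAlpha(px, key, hueRange, satFloor) {
  // px and key are {h, s, v} colors (hue 0-360, sat/val 0-1)
  const hueDist = Math.min(Math.abs(px.h - key.h), 360 - Math.abs(px.h - key.h));
  if (px.s &lt; satFloor) return 1.0; // too desaturated to be screen color: keep opaque
  // Inside the hue range: transparent. Outside: ramp back up to opaque (feather).
  return Math.min(1.0, Math.max(0.0, (hueDist - hueRange) / hueRange));
}</pre>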
<p>But the part I&#8217;m most happy with is the tracker system. You can place tween trackers (static points you drag per frame) or mouse trackers (hold mouse on the subject while the video plays, release to stop tracking). Each tracker can be set to Keep or Discard mode with flood-fill-based alpha masking. There&#8217;s an &#8220;auto invert remaining&#8221; toggle that makes everything outside the tracked region transparent (or opaque, depending on mode). It&#8217;s not automatic motion tracking — that&#8217;s on the roadmap — but it&#8217;s surprisingly usable for isolating subjects in tricky shots.</p>
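<p>The masking step is the classic region-growing idea. A minimal sketch of how a tracker point could seed a Keep/Discard mask from the keyer&#8217;s alpha output (simplified relative to the app&#8217;s actual implementation):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="js">// Grow a region from the tracker point over pixels whose keyed alpha clears
// a threshold; the resulting mask is kept or discarded depending on tracker mode.
function floodFillMask(alpha, width, height, startX, startY, threshold) {
  const mask = new Uint8Array(width * height);
  const stack = [startY * width + startX];
  while (stack.length) {
    const idx = stack.pop();
    if (mask[idx] || alpha[idx] &lt; threshold) continue;
    mask[idx] = 1;
    const x = idx % width;
    if (x &gt; 0) stack.push(idx - 1);
    if (x &lt; width - 1) stack.push(idx + 1);
    if (idx &gt;= width) stack.push(idx - width);
    if (idx &lt; width * (height - 1)) stack.push(idx + width);
  }
  return mask; // 1 = connected to the tracker point
}</pre>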
<p>Export gives you WebM with embedded alpha channel, a standalone grayscale matte, or PNG for single frames. The frame cache builds progressively in the background after upload, so you&#8217;re never staring at a loading bar. You see the first frame immediately and start working while thumbnails populate the timeline behind the scenes.</p>
<h2 id="why-browser-based-matters">Why browser-based matters</h2>
<p>CorridorKey requires a minimum 24GB VRAM GPU. That&#8217;s a $1,500+ graphics card. It outputs EXR sequences meant for professional compositing software that costs hundreds or thousands of dollars a year. Even with EZ-CorridorKey making the install easier, you&#8217;re still dealing with Python environments, model downloads, and the premask workflow.</p>
<p>Green Difference Studio requires Chrome. That&#8217;s it.</p>
<p>It won&#8217;t give you the same quality on difficult shots — ML-based unmixing is fundamentally more capable than traditional threshold-based keying for things like hair detail and translucent materials. But for the vast majority of green screen footage — talking heads, product shots, simple VFX work — a well-tuned traditional keyer running at GPU speed in a browser tab gets the job done. And it gets it done right now, on whatever laptop you happen to have.</p>
<h2 id="the-tech-under-the-hood">The tech under the hood</h2>
<p>The rendering pipeline is a Three.js fragment shader that does all the chroma math on the GPU. Spill suppression, edge feathering, alpha generation — it&#8217;s all happening in GLSL. The tracker flood fill runs on the CPU (Web Worker offloading is on the TODO list), and export uses the WebCodecs API for hardware-accelerated encoding with a MediaRecorder fallback for alpha-channel WebM.</p>
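<p>The fallback decision is roughly this shape. A hedged sketch of the feature detection, not the app&#8217;s exact export code (the codec string and options are illustrative):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="js">// Prefer WebCodecs when the browser can encode the requested config.
async function supportsWebCodecsVp9(width, height) {
  if (!('VideoEncoder' in window)) return false;
  const { supported } = await VideoEncoder.isConfigSupported({
    codec: 'vp09.00.10.08', // VP9 profile 0, level 1.0, 8-bit
    width,
    height,
  });
  return !!supported;
}

// Otherwise fall back to recording the preview canvas stream, the path
// used for alpha-channel WebM when WebCodecs isn't available.
function makeFallbackRecorder(canvas, fps) {
  return new MediaRecorder(canvas.captureStream(fps), {
    mimeType: 'video/webm; codecs=vp9',
  });
}</pre>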
<p>Timeline thumbnails are built from a canvas-based frame cache that populates asynchronously using a separate hidden video element — this was one of the trickier problems to solve, because you can&#8217;t seek two different positions on the same video element simultaneously without the browser fighting you.</p>
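<p>The two-element trick, in sketch form (simplified, not the app&#8217;s actual code):</p>
<pre class="EnlighterJSRAW" data-enlighter-language="js">// A second, hidden video element does all the seeking, so the visible
// player never jumps around. Muted, so no surprise audio during extraction.
async function buildThumbnails(src, times, w, h, onThumb) {
  const video = document.createElement('video');
  video.src = src;
  video.muted = true;
  video.preload = 'auto';
  await new Promise((res) =&gt; video.addEventListener('loadeddata', res, { once: true }));

  const canvas = document.createElement('canvas');
  canvas.width = w;
  canvas.height = h;
  const ctx = canvas.getContext('2d');

  for (const t of times) {
    video.currentTime = t; // seek the hidden element only
    await new Promise((res) =&gt; video.addEventListener('seeked', res, { once: true }));
    ctx.drawImage(video, 0, 0, w, h);
    onThumb(t, canvas.toDataURL('image/jpeg', 0.7)); // hand a thumbnail to the timeline
  }
}</pre>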
<p>Other dependencies: GSAP for smooth tracker animations, iro.js for the color picker, noUiSlider for the range controls, and webm-muxer for standalone matte export.</p>
<h2 id="what-s-next">What&#8217;s next</h2>
<p>The README has a proper roadmap, but the highlights:</p>
<ul>
<li><strong>Mask tools</strong> — brush, shape, polygon, and lasso masks for garbage mattes</li>
<li><strong>Automatic motion tracking</strong> — Lucas-Kanade or correlation-based point tracking</li>
<li><strong>Image sequence export</strong> — PNG+Alpha and JPG matte sequences</li>
<li><strong>Better despill</strong> — edge-aware spill suppression for hair and translucent materials</li>
<li><strong>Undo/redo</strong> — full history stack</li>
<li><strong>Web Worker flood fill</strong> — the CPU-heavy tracker math should be off the main thread</li>
</ul>
<p>And the dream entry at the bottom of the list: CorridorKey in the browser. A transformer model running in WebGPU doing neural green screen removal with no install. Probably not happening tomorrow. But WebGPU is maturing fast, and ONNX Runtime Web is getting better every month. One day, maybe.</p>
	</div>
</div>




	<div class="wpb_raw_code wpb_content_element wpb_raw_html" >
		<div class="wpb_wrapper">
			<div id="gds-demo">
  <style>
    #gds-demo {
      width: 100%;
      max-width: 900px;
      margin: 2em auto;
      font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif;
    }
    #gds-demo .gds-video-wrap {
      position: relative;
      width: 100%;
      aspect-ratio: 1 / 1;
      overflow: hidden;
      border-radius: 8px;
      border: 1px solid #333;
    }
    #gds-demo .gds-video-wrap video {
      display: block;
      width: 100%;
      height: 100%;
      object-fit: contain;
    }
    #gds-demo .gds-checker-dark {
      background-image:
        linear-gradient(45deg, #1a1a1a 25%, transparent 25%),
        linear-gradient(-45deg, #1a1a1a 25%, transparent 25%),
        linear-gradient(45deg, transparent 75%, #1a1a1a 75%),
        linear-gradient(-45deg, transparent 75%, #1a1a1a 75%);
      background-size: 20px 20px;
      background-position: 0 0, 0 10px, 10px -10px, -10px 0;
      background-color: #111;
    }
    #gds-demo .gds-checker-light {
      background-image:
        linear-gradient(45deg, #ccc 25%, transparent 25%),
        linear-gradient(-45deg, #ccc 25%, transparent 25%),
        linear-gradient(45deg, transparent 75%, #ccc 75%),
        linear-gradient(-45deg, transparent 75%, #ccc 75%);
      background-size: 20px 20px;
      background-position: 0 0, 0 10px, 10px -10px, -10px 0;
      background-color: #e8e8e8;
    }
    #gds-demo .gds-solid {
      background-image: none;
    }
    #gds-demo .gds-controls {
      display: flex;
      align-items: center;
      gap: 8px;
      margin-top: 10px;
      flex-wrap: wrap;
    }
    #gds-demo .gds-label {
      font-size: 13px;
      color: #555;
      margin-right: 2px;
      font-weight: 600;
    }
    #gds-demo .gds-swatch {
      width: 28px;
      height: 28px;
      border-radius: 5px;
      border: 2px solid #ddd;
      cursor: pointer;
      transition: border-color 0.15s, box-shadow 0.15s;
      flex-shrink: 0;
    }
    #gds-demo .gds-swatch:hover {
      border-color: #888;
    }
    #gds-demo .gds-swatch.gds-active {
      border-color: #333;
      box-shadow: 0 0 0 2px rgba(0,0,0,0.2);
    }
    #gds-demo .gds-swatch-checker-dark {
      background-image:
        linear-gradient(45deg, #1a1a1a 25%, transparent 25%),
        linear-gradient(-45deg, #1a1a1a 25%, transparent 25%),
        linear-gradient(45deg, transparent 75%, #1a1a1a 75%),
        linear-gradient(-45deg, transparent 75%, #1a1a1a 75%);
      background-size: 10px 10px;
      background-position: 0 0, 0 5px, 5px -5px, -5px 0;
      background-color: #111;
    }
    #gds-demo .gds-swatch-checker-light {
      background-image:
        linear-gradient(45deg, #ccc 25%, transparent 25%),
        linear-gradient(-45deg, #ccc 25%, transparent 25%),
        linear-gradient(45deg, transparent 75%, #ccc 75%),
        linear-gradient(-45deg, transparent 75%, #ccc 75%);
      background-size: 10px 10px;
      background-position: 0 0, 0 5px, 5px -5px, -5px 0;
      background-color: #e8e8e8;
    }
    #gds-demo .gds-tip {
      margin-top: 16px;
      font-size: 15px;
      line-height: 1.7;
      color: #333;
    }
    #gds-demo .gds-tip strong {
      color: #111;
    }
  </style>

  <div class="gds-video-wrap gds-checker-light" id="gds-video-bg">
    <video autoplay loop muted playsinline controls src="https://deepcorelabs.com/tools/green-difference-studio/grok-video-0c16267b-ecf0-4cb2-846b-ba012b2b2713.webm?v=0.1"></video>
  </div>

  <div class="gds-controls">
    <span class="gds-label">Background:</span>
    <div class="gds-swatch gds-swatch-checker-light gds-active" onclick="gdsBg(this,'checker-light')" title="Light checkerboard"></div>
    <div class="gds-swatch gds-swatch-checker-dark" onclick="gdsBg(this,'checker-dark')" title="Dark checkerboard"></div>
    <div class="gds-swatch gds-solid" style="background:#1a1a1a;" onclick="gdsBg(this,'#1a1a1a')" title="Dark gray"></div>
    <div class="gds-swatch gds-solid" style="background:#ccc;" onclick="gdsBg(this,'#ccc')" title="Light gray"></div>
    <div class="gds-swatch gds-solid" style="background:#3d2b2b;" onclick="gdsBg(this,'#3d2b2b')" title="Muted brown"></div>
    <div class="gds-swatch gds-solid" style="background:#2e2640;" onclick="gdsBg(this,'#2e2640')" title="Muted purple"></div>
  </div>

  <p class="gds-tip">
    This transparent WebM was generated with an AI text-to-video model prompted to render on a green screen, then keyed with <strong>Green Difference Studio</strong> — completely free, right in the browser. Works great with <strong>any text-to-video or image-to-video tool</strong> if you prompt it to shoot on a green screen. Free transparent videos, no need to wait for the big guys to support alpha channel. <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png" alt="😉" class="wp-smiley" style="height: 1em; max-height: 1em;" />
  </p>

  <script>
    function gdsBg(el, val) {
      var wrap = document.getElementById('gds-video-bg');
      var swatches = document.querySelectorAll('#gds-demo .gds-swatch');

      for (var i = 0; i < swatches.length; i++) {
        swatches[i].classList.remove('gds-active');
      }

      el.classList.add('gds-active');

      if (val === 'checker-light') {
        wrap.className = 'gds-video-wrap gds-checker-light';
        wrap.style.backgroundColor = '';
      } else if (val === 'checker-dark') {
        wrap.className = 'gds-video-wrap gds-checker-dark';
        wrap.style.backgroundColor = '';
      } else {
        wrap.className = 'gds-video-wrap gds-solid';
        wrap.style.backgroundColor = val;
      }
    }
  </script>
</div>
		</div>
	</div>

<div class="wpb_text_column wpb_content_element " >
	<div class="wpb_wrapper">
<p>And if Corridor Crew ever reads this — thanks for the inspiration. CorridorKey is genuinely amazing work and I hope it keeps pushing the industry forward. This little browser tool exists because you made me want to build something.</p>
	</div>
</div>




			</div> 
		</div>
	</div> 
</div></div>
]]></content:encoded>
					
					<wfw:commentRss>https://deepcorelabs.com/green-difference-studio-free-online-green-screen-chroma-key-tool-in-your-browser/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>Open Wake Word on the Web</title>
		<link>https://deepcorelabs.com/open-wake-word-on-the-web/</link>
					<comments>https://deepcorelabs.com/open-wake-word-on-the-web/#comments</comments>
		
		<dc:creator><![CDATA[Miro Hristov]]></dc:creator>
		<pubDate>Sat, 12 Jul 2025 03:47:47 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Javascript]]></category>
		<category><![CDATA[sound]]></category>
		<category><![CDATA[audio]]></category>
		<guid isPermaLink="false">https://deepcorelabs.com/?p=4455</guid>

					<description><![CDATA[How I Ported a Python Wake Word System to the Browser When the LLMs Gave Up I started this project with a goal that seemed simple on paper: take openWakeWord,...]]></description>
										<content:encoded><![CDATA[<a class="nectar-button n-sc-button jumbo accent-color regular-button"  href="https://deepcorelabs.com/projects/openwakeword" data-color-override="false" data-hover-color-override="false" data-hover-text-color-override="#fff"><span>OpenWakeWord - Web Demo</span></a>
<h2 id="how-i-ported-a-python-wake-word-system-to-the-browser-when-the-llms-gave-up">How I Ported a Python Wake Word System to the Browser When the LLMs Gave Up</h2>
<p>I started this project with a goal that seemed simple on paper: take <a href="https://github.com/dscripka/openWakeWord/" target="_blank" rel="noopener">openWakeWord</a>, a powerful open-source library for wake word detection, and make it run entirely in a web browser. And when I say &#8220;in the browser,&#8221; I mean it. No tricks. No websockets streaming audio to a Python server. I wanted the models, the audio processing, and the detection logic running completely on the client.<br />
My initial approach was to &#8220;vibe-code&#8221; it with the new generation of LLMs. I fed my high-level goal to <strong>Gemini 2.5 Pro, o4-mini-high, and Grok 4</strong>. They gave me a fantastic head start, building out the initial HTML, CSS, and JavaScript structure with impressive speed. But after dozens of messages just refining the vibe, we hit a hard wall. The models would run, but the output score was just a flat line at zero. No errors, no crashes, just… nothing.<br />
This is where the real story begins. The vibe was off. Vibe coding had failed. I had to pivot from being a creative director to a deep-dive detective. It&#8217;s a tale of how I used a novel cross-examination technique with these same LLMs to solve a problem that each one, individually, had given up on.</p>
<h3 id="tl-dr-the-openwakeword-javascript-architecture-that-actually-works">TL;DR: The <code>openWakeWord</code> JavaScript Architecture That Actually Works</h3>
<p>For the engineers who just want the final schematics, here is the stateful, multi-buffer pipeline required to make this work (a condensed code sketch follows the list).</p>
<ul>
<li><strong>Pipeline:</strong> <code>[Audio Chunk]</code> -&gt; <code>Melspectrogram Model</code> -&gt; <code>Melspectrogram Buffer</code> -&gt; <code>Embedding Model</code> -&gt; <code>Wake Word Model</code> -&gt; <code>Score</code></li>
<li><strong>Stage 1: Audio to Image (Melspectrogram):</strong>
<ul>
<li><strong>Audio Source:</strong> 16kHz, 16-bit, Mono PCM audio.</li>
<li><strong>Chunking:</strong> The pipeline operates on <strong>1280 sample</strong> chunks (80ms). This is non-negotiable.</li>
<li><strong>Model Input:</strong> The chunk is fed into <code>melspectrogram.onnx</code> as a <code>[1, 1280]</code> <strong>float32</strong> tensor.</li>
<li><strong>Mandatory Transformation:</strong> The output from the melspectrogram model <strong>must</strong> be transformed with the formula <code>output = (value / 10.0) + 2.0</code>.</li>
</ul>
</li>
<li><strong>Stage 2: Image Analysis (Feature Embedding):</strong>
<ul>
<li><strong>Melspectrogram Buffer:</strong> The 5 transformed spectrogram frames from Stage 1 are pushed into a buffer.</li>
<li><strong>Sliding Window:</strong> This stage only executes when the <code>mel_buffer</code> contains at least <strong>76 frames</strong>. A <strong>76-frame</strong> window is sliced from the <em>start</em> of the buffer.</li>
<li><strong>Model Input:</strong> This window is fed into <code>embedding_model.onnx</code> as a <code>[1, 76, 32, 1]</code> tensor.</li>
<li><strong>Window Step:</strong> After processing, the buffer is slid forward by <strong>8 frames</strong> (<code>splice(0, 8)</code>).</li>
</ul>
</li>
<li><strong>Stage 3: Prediction:</strong>
<ul>
<li><strong>Embedding Buffer:</strong> The 96-value feature vector from Stage 2 is pushed into a second, fixed-size buffer that holds the last <strong>16</strong> embeddings.</li>
<li><strong>Model Input:</strong> Once full, the 16 embeddings are flattened and fed into the final wake word model as a <code>[1, 16, 96]</code> tensor. This <code>[batch, sequence, features]</code> shape is the critical insight that resolved a key error.</li>
</ul>
</li>
</ul>
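<p>Condensed into code, the whole loop looks roughly like this. A sketch under stated assumptions: <code>runMel</code>, <code>runEmbedding</code>, <code>runWake</code>, and <code>handleScore</code> are placeholder wrappers around the corresponding ONNX <code>session.run()</code> calls, not the demo&#8217;s actual function names.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="js">const mel_buffer = [];       // transformed spectrogram frames (Stage 1 output)
const embedding_buffer = []; // last 16 feature vectors (Stage 2 output)

async function processChunk(chunk) { // chunk: Float32Array of 1280 samples
  // Stage 1: audio -&gt; spectrogram frames, with the mandatory transform
  const melFrames = await runMel(chunk); // 5 frames per 80ms chunk
  for (const frame of melFrames) {
    mel_buffer.push(frame.map((v) =&gt; v / 10.0 + 2.0));
  }

  // Stage 2: slide a 76-frame window over the buffer, stepping by 8
  while (mel_buffer.length &gt;= 76) {
    const melWindow = mel_buffer.slice(0, 76);
    embedding_buffer.push(await runEmbedding(melWindow)); // 96-value vector
    if (embedding_buffer.length &gt; 16) embedding_buffer.shift();
    mel_buffer.splice(0, 8);

    // Stage 3: once 16 embeddings are buffered, predict a score
    if (embedding_buffer.length === 16) {
      const flat = embedding_buffer.flatMap((e) =&gt; Array.from(e)); // [1, 16, 96]
      handleScore(await runWake(flat));
    }
  }
}</pre>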
<hr />
<h3 id="the-unvarnished-truth-my-journey-into-debugging-hell">The Unvarnished Truth: My Journey into Debugging Hell</h3>
<p>After the initial burst of productivity, all three LLMs hit the same wall and gave up. They settled on the same, demoralizing conclusion: the problem was <strong>floating-point precision differences</strong> between Python and the browser&#8217;s ONNX Runtime. They suggested the complex math in <code>openWakeWord</code> was too sensitive and that a 100% client-side implementation was likely <strong>impossible</strong>.<br />
Something about that felt fishy. The separate VAD (Voice Activity Detection) model was working perfectly fine. This felt like a logic problem, not a fundamental platform limitation.<br />
This is where the breakthrough happened. I realized &#8220;vibe coding&#8221; wasn&#8217;t enough. I had to get specific. I decided to change my approach and use the LLMs as specialized, focused tools rather than general-purpose partners:</p>
<ol>
<li><strong>The Analyst:</strong> I tasked one LLM with a single, focused job: analyze the <code>openwakeword</code> Python source code and describe, in painstaking detail, exactly what it was doing at every step.</li>
<li><strong>The Coder:</strong> I took the detailed blueprint from the &#8220;Analyst&#8221; and fed it to a <em>different</em> LLM. Its job was to take that blueprint and write the JavaScript implementation.</li>
</ol>
<p>This cross-examination process was like a magic trick. It bypassed the ruts the models had gotten into and started revealing the hidden architectural assumptions that had been causing all the problems.</p>
<h4 id="the-first-wall-the-sound-to-image-pipeline">The First Wall: The Sound-to-Image Pipeline</h4>
<p>The &#8220;Analyst&#8221; LLM immediately revealed my most basic misunderstanding. I thought I was feeding a sound model, but that&#8217;s not how it works. These models don&#8217;t &#8220;hear&#8221; sound; they &#8220;see&#8221; it.<br />
<strong>Aha! Moment #1: It&#8217;s an Image Recognition Problem.</strong> The first model in the chain, <code>melspectrogram.onnx</code>, doesn&#8217;t process audio waves. Its entire job is to convert a raw 80ms audio chunk into a <strong>melspectrogram</strong>—a 2D array of numbers that is essentially an image representing the intensity of different frequencies in that sound. The subsequent models are doing pattern recognition on these sound-images, not on the audio itself. This also explained the second part of the puzzle: the models were trained on specifically processed images, which is why this transformation was mandatory:</p>
<pre class="EnlighterJSRAW" data-enlighter-language="js">// This isn't just a normalization; it's part of the "image processing" pipeline
// that the model was trained on. It fails silently without it.

for (let j = 0; j &lt; new_mel_data.length; j++) {
  new_mel_data[j] = (new_mel_data[j] / 10.0) + 2.0;
}</pre>
<h4 id="the-second-wall-the-audio-history-tax">The Second Wall: The Audio History Tax</h4>
<p>With the formula in place, my test WAV file still failed. The &#8220;Analyst&#8221; LLM&#8217;s breakdown of the Python code&#8217;s looping was the key. I realized the pipeline&#8217;s second stage needs a history of <strong>76 spectrogram frames</strong> to even begin its work. Each 80ms audio chunk only produces <strong>5 frames</strong>, meaning the system has to process <strong>16 chunks</strong> (1.28 seconds) of audio before it can even think about generating the first feature vector. My test file was too short.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="js">// This logic checks if the audio is long enough and pads it with silence if not.

const minRequiredSamples = 16 * frameSize; // 16 chunks * 1280 samples/chunk = 20480

if (audioData.length &lt; minRequiredSamples) {
  const padding = new Float32Array(minRequiredSamples - audioData.length);
  const newAudioData = new Float32Array(minRequiredSamples);
  newAudioData.set(audioData, 0);
  newAudioData.set(padding, audioData.length);
  audioData = newAudioData; // Use the new, padded buffer
}</pre>
<h4>The Third Wall: The Treachery of Optimization</h4>
<p>The system came to life, but it was unstable, crashing with a bizarre <code>offset is out of bounds</code> error. This wasn&#8217;t a floating-point issue; it was a memory management problem. I discovered that for performance, ONNX Runtime Web <strong>reuses its memory buffers</strong>. The variable I was saving wasn&#8217;t the data, but a temporary <em>reference</em> to a memory location that was being overwritten.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="js">// AHA Moment: ONNX Runtime reuses its output buffers. We MUST create a *copy*
// of the data instead of just pushing a reference to the buffer.

const new_embedding_data_view = embeddingOut[embeddingModel.outputNames[0]].data;
const stable_copy_of_embedding = new Float32Array(new_embedding_data_view);
embedding_buffer.push(stable_copy_of_embedding); // Push the stable copy, not the temporary view.</pre>
<h4 id="the-final-wall-the-purpose-of-the-vad">The Final Wall: The Purpose of the VAD</h4>
<p>The system was finally stable, and I could see the chart spike to 1.0 when I spoke the wake word. But the success sound wouldn&#8217;t play reliably. This was due to my most fundamental misconception. I had assumed the VAD&#8217;s purpose was to save resources. My thinking was: &#8220;VAD is cheap, the wake word model is expensive. So, I should only run the expensive model when the VAD detects speech.&#8221;<br />
This is completely wrong.<br />
<strong>Aha! Moment #4: The VAD is a Confirmation, Not a Trigger.</strong> The wake word pipeline must run <em>continuously</em> to maintain its history buffers. The VAD&#8217;s true purpose is to act as a <strong>confirmation signal</strong>. A detection is only valid if two conditions are met simultaneously: the wake word model reports a high score, AND the VAD confirms that human speech is currently happening. It’s a two-factor authentication system for your voice. This led to the final race condition: the VAD is fast, but the wake word pipeline is slow. The solution was a <strong>VAD Hangover</strong>—what I call &#8220;Redemption Frames&#8221;—to keep the detection window open just a little longer.</p>
<pre class="EnlighterJSRAW" data-enlighter-language="js">// These constants define the VAD Hangover logic

const VAD_HANGOVER_FRAMES = 12; // Keep speech active for ~1 second after VAD stops

let vadHangoverCounter = 0;
let isSpeechActive = false;

// Later, the final check uses this managed state:
if (score &gt; 0.5 &amp;&amp; isSpeechActive) {
 // Detection is valid!
}</pre>
<h3 id="the-backend-betrayal-a-final-hurdle">The Backend Betrayal: A Final Hurdle</h3>
<p>With the core logic finally perfected, I implemented a feature to switch between the WASM, WebGL, and WebGPU backends. WASM and WebGPU worked, but WebGL crashed instantly with the error: <code>Error: no available backend found. ERR: [wasm] backend not found</code>.<br />
The issue was that the <code>melspectrogram.onnx</code> model uses specialized audio operators that the WebGL backend in ONNX Runtime simply does not support. My code was trying to force all models onto the selected backend, which is impossible when one is incompatible. The solution was a hybrid backend approach: force the incompatible pre-processing models (melspectrogram and VAD) to run on the universally supported WASM backend, while allowing the heavy-duty neural network models to run on the user&#8217;s selected GPU backend for a performance boost. I&#8217;ve left the WebGL option in the demo as a reference for this interesting limitation.</p>
<h3 id="the-final-product">The Final Product</h3>
<p>This journey was a powerful lesson in the limitations of &#8220;vibe coding&#8221; for complex technical problems. While LLMs are incredible for scaffolding, they can&#8217;t replace rigorous, first-principles debugging. By pivoting my strategy—using one LLM to deconstruct the source of truth and another to implement that truth—I was able to solve a problem that a single LLM, or even a committee of them, declared impossible. The result is a working, robust web demo that proves this complex audio pipeline can indeed be tamed, running <strong>100% on the client, in the browser</strong>, no Python backend required.</p>
<a class="nectar-button n-sc-button jumbo accent-color regular-button"  href="https://deepcorelabs.com/projects/openwakeword" data-color-override="false" data-hover-color-override="false" data-hover-text-color-override="#fff"><span>OpenWakeWord - Web Demo</span></a>
]]></content:encoded>
					
					<wfw:commentRss>https://deepcorelabs.com/open-wake-word-on-the-web/feed/</wfw:commentRss>
			<slash:comments>5</slash:comments>
		
		
			</item>
		<item>
		<title>Stable Diffusion PNG Prompt Text Extractor</title>
		<link>https://deepcorelabs.com/stable-diffusion-png-prompt-text-extractor/</link>
					<comments>https://deepcorelabs.com/stable-diffusion-png-prompt-text-extractor/#respond</comments>
		
		<dc:creator><![CDATA[Miro Hristov]]></dc:creator>
		<pubDate>Wed, 26 Mar 2025 06:18:23 +0000</pubDate>
				<category><![CDATA[AI]]></category>
		<category><![CDATA[Stable Diffusion]]></category>
		<guid isPermaLink="false">https://deepcorelabs.com/?p=4432</guid>

					<description><![CDATA[A simple tool that extracts hidden prompt text from Stable Diffusion-generated PNG files &#8212; online (in the browser). What it does Upload a PNG file to extract the embedded generation...]]></description>
										<content:encoded><![CDATA[<p class="whitespace-pre-wrap break-words">A simple tool that extracts hidden prompt text from Stable Diffusion-generated PNG files &#8212; online (in the browser).</p>
<h2 class="text-xl font-bold text-text-200 mt-1 -mb-0.5">What it does</h2>
<ul class="[&amp;:not(:last-child)_ul]:pb-1 [&amp;:not(:last-child)_ol]:pb-1 list-disc space-y-1.5 pl-7">
<li class="whitespace-normal break-words">Upload a PNG file to extract the embedded generation prompts (stored in iTXt chunks)</li>
<li class="whitespace-normal break-words">Works entirely in your browser &#8211; no server uploads needed, completely client-side</li>
<li class="whitespace-normal break-words">Super fast &#8212; instantly reveals the exact prompt used to create the SD image</li>
</ul>
<p class="whitespace-pre-wrap break-words">Perfect for artists studying prompt techniques, content verification, or understanding how specific AI images were created.</p>
<p class="whitespace-pre-wrap break-words">Try it now to decode the text behind your Stable Diffusion images.</p>
<a class="nectar-button n-sc-button medium accent-color regular-button" target="_blank" href="https://deepcorelabs.com/tools/prompt-extractor/" data-color-override="false" data-hover-color-override="false" data-hover-text-color-override="#fff"><span>Stable Diffusion PNG Prompt Extractor</span></a>
<p><a href="https://deepcorelabs.com/tools/prompt-extractor/"><img fetchpriority="high" decoding="async" class="alignnone wp-image-4434 size-full" src="https://deepcorelabs.com/wp-content/uploads/2025/03/2025-03-26_021316.jpg" alt="Stable Diffusion PNG Prompt Text Extractor Online Tool" width="739" height="1177" srcset="https://deepcorelabs.com/wp-content/uploads/2025/03/2025-03-26_021316.jpg 739w, https://deepcorelabs.com/wp-content/uploads/2025/03/2025-03-26_021316-188x300.jpg 188w, https://deepcorelabs.com/wp-content/uploads/2025/03/2025-03-26_021316-643x1024.jpg 643w" sizes="(max-width: 739px) 100vw, 739px" /></a></p>
]]></content:encoded>
					
					<wfw:commentRss>https://deepcorelabs.com/stable-diffusion-png-prompt-text-extractor/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
	</channel>
</rss>
