I’m a producer first, and an audio engineer second.
This is the story of how I took Cryo Mix from a flatlined MVP to a scaling AI platform by prioritizing the artist's workflow over raw technical capability.
In v1, Cryo Mix was a "black box." You uploaded a file, waited 5 minutes, and got a result. While the backend DSP (Digital Signal Processing) was impressive, the user experience was alienating.
Producers want control. I hypothesized that by exposing the controls and visualizing the data, we could build trust with professional users.
The original linear, "fire-and-forget" uploader that drove high churn.
I led the design of the V2 "Workspace." By shifting to a non-linear timeline, we allowed users to iterate on their sound, effectively closing the gap between AI automation and manual DAW control.
Genre isn't a label. It dictates the mixing algorithm. I designed a new modal to capture this metadata upfront.
By forcing this categorization early, we drastically improved the relevance of the initial AI processing, reducing the "edit time" for users by 40%.
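The idea can be sketched in a few lines. This is a hypothetical illustration, not Cryo Mix's actual code; the genre names and parameter values are invented for the example.

```python
# Hypothetical sketch: genre metadata captured in the upload modal
# selects the initial processing preset before any AI refinement.
# Genres and values below are illustrative, not the real presets.
GENRE_PRESETS = {
    "hip-hop":    {"low_shelf_db": 2.0,  "comp_ratio": 4.0, "saturation": 0.3},
    "acoustic":   {"low_shelf_db": -1.0, "comp_ratio": 2.0, "saturation": 0.1},
    "electronic": {"low_shelf_db": 3.0,  "comp_ratio": 6.0, "saturation": 0.4},
}

def initial_chain(genre: str) -> dict:
    """Fall back to a neutral preset when the genre is unknown."""
    return GENRE_PRESETS.get(
        genre, {"low_shelf_db": 0.0, "comp_ratio": 2.0, "saturation": 0.0}
    )
```

Capturing the genre before processing means this dispatch happens once, up front, instead of the engine guessing from the audio alone.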
A major friction point was vocabulary. A user says "make it warm," but the engine understands "low-mid boost." I designed semantic questionnaires to map these abstract creative goals onto concrete DSP parameters, giving users of all skill levels a tool that understands them.
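The mapping behind the questionnaire can be sketched as a lookup from creative adjectives to parameter bundles. The descriptors and numbers below are assumptions for illustration; only "warm" → low-mid boost comes from the article.

```python
# Hypothetical sketch: questionnaire answers (creative adjectives) are
# translated into concrete DSP parameters. Values are illustrative.
SEMANTIC_MAP = {
    "warm":   {"eq": [{"band": "low-mid", "freq_hz": 250,  "gain_db": 2.5}]},
    "bright": {"eq": [{"band": "high",    "freq_hz": 8000, "gain_db": 3.0}]},
    "punchy": {"compressor": {"attack_ms": 10, "ratio": 4.0}},
}

def goals_to_params(goals: list) -> dict:
    """Merge the parameter bundles for every goal the user selected."""
    params = {}
    for goal in goals:
        for unit, settings in SEMANTIC_MAP.get(goal, {}).items():
            if isinstance(settings, list):
                params.setdefault(unit, []).extend(settings)
            else:
                params[unit] = settings
    return params
```

The point of the table is that it is inspectable: when a descriptor maps badly, you patch one entry instead of retraining anything.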
Most "AI" audio tools are just ChatGPT wrappers. We built something different. I architected a specialized RAG (Retrieval-Augmented Generation) system using CLAP embeddings to bridge the gap between subjective language ("Make it punchy") and objective DSP execution.
Managing state in an infinite chat context is difficult. I engineered a strict protocol where Nova, our mixing assistant, recognizes [APPLIED] markers to "garbage collect" previous context, ensuring the AI focuses only on the current mixing task without getting confused by old requests.
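The pruning rule is simple to sketch: everything up to and including the last confirmed change is dropped before the next model call. This is a minimal illustration of the protocol, not the production implementation.

```python
# Hypothetical sketch of the context protocol: once a mix change is
# confirmed with an [APPLIED] marker, all earlier messages are dropped
# so the model only sees the live request.
def prune_context(messages: list) -> list:
    last_applied = -1
    for i, msg in enumerate(messages):
        if "[APPLIED]" in msg:
            last_applied = i
    return messages[last_applied + 1:]
```

Because the applied settings live in the backend's effect state rather than the chat transcript, discarding the old messages loses nothing the model still needs.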
The LLM doesn't touch the audio directly. It outputs a JSON intent which my backend maps to a chain of 14+ discrete audio effects (Denoisers, EQ, Compressors).
Standard text embeddings fail at audio. I implemented CLAP (Contrastive Language-Audio Pretraining) to create a shared vector space where "Warm Guitar" (text) and an actual .wav file of a warm guitar (audio) are mathematically close.
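The property that makes this work is that cosine similarity between a text vector and an audio vector is meaningful, because both encoders map into the same space. The three-dimensional vectors below are toy placeholders standing in for real CLAP encoder outputs.

```python
import numpy as np

# Sketch of the shared-space property CLAP provides: a text embedding and
# an audio embedding are comparable with cosine similarity. These vectors
# are illustrative placeholders, not real CLAP outputs.
def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

text_warm_guitar  = np.array([0.9, 0.1, 0.0])    # ~ encode_text("warm guitar")
audio_warm_guitar = np.array([0.85, 0.15, 0.1])  # ~ encode_audio("warm_guitar.wav")
audio_harsh_synth = np.array([0.0, 0.2, 0.95])   # ~ encode_audio of a mismatch

# The matching pair scores far higher than the mismatched one, which is
# exactly what lets a text query retrieve the right audio reference.
```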
Data-Driven Tuning: Tracking 4,793 total embeddings to identify cluster gaps in our mixing knowledge base.
Iterative Logic: Analyzing failed queries (e.g., "maximize impact") to patch the knowledge base and improve match rates.
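The gap-analysis loop above can be sketched as a threshold check on logged retrieval scores: any query whose best match scores too low is a coverage gap to patch. The scores and threshold below are invented for illustration; only the "maximize impact" failure comes from the article.

```python
# Hypothetical sketch of the iterative tuning loop: log each query's best
# retrieval score, then flag low scorers as knowledge-base gaps. The
# threshold and scores are illustrative.
def find_gaps(query_scores: dict, threshold: float = 0.6) -> list:
    return [query for query, score in query_scores.items() if score < threshold]

logged = {
    "make it warm": 0.87,
    "maximize impact": 0.31,  # the failed query mentioned in the article
    "more punch": 0.74,
}
```

Each flagged query becomes a candidate entry for the knowledge base, which is how failed requests like "maximize impact" get patched instead of silently recurring.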
Good product design drives retention, but technical blockers kill acquisition. My role involved auditing our entire funnel to find where demand was leaking.
I noticed a massive anomaly in our analytics: India was our #2 traffic source, but had near-zero conversions.
The Discovery: It wasn't a pricing issue. It was regulatory. The RBI's e-mandate on recurring subscriptions was blocking Stripe payments.
The Fix: I pivoted our stack to use Paddle, enabling local UPI payments. This seemingly "boring" backend change unlocked 80% of our serviceable market in Asia.
Transaction Success Rate (Pre vs Post UPI)
We realized we were competing with distributors who wanted to offer AI mastering themselves. I led the initiative to package our engine as an API product, allowing partners to white-label our tech.
To prevent future misalignment, I established a private Discord beta group. This created a continuous feedback loop, ensuring features like "Advanced Settings" were validated by power users before hitting the roadmap.