Acoustic Artistry

An AI-powered application that turns spoken descriptions into images: speech recognition transcribes the recording, and Stable Diffusion renders the resulting prompt.

Timeline: 3 weeks
Role: ML Engineer
Team: Solo
Status: Completed

Technology Stack

Python
Gradio
TensorFlow

Key Challenges

  • Audio feature extraction
  • Prompt engineering for Stable Diffusion
  • Mapping audio characteristics to visual elements
  • GPU memory management
  • Model inference optimization
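
Two of the challenges above, prompt engineering and GPU memory management, can be illustrated with a minimal sketch. It is not the project's actual code: the style suffix, the negative prompt, the size limits, and the model id are assumptions chosen for illustration, and the loader assumes the PyTorch backend of Hugging Face Diffusers. Stable Diffusion requires image dimensions divisible by 8, which is what `snap_size` enforces.

```python
from typing import Tuple

# Hypothetical wording; the project's real style keywords are not documented.
STYLE_SUFFIX = "highly detailed, vivid colors"
NEGATIVE_PROMPT = "blurry, low quality, distorted"


def build_prompt(transcript: str, style: str = STYLE_SUFFIX) -> str:
    """Collapse whitespace in the spoken description and append style keywords."""
    subject = " ".join(transcript.split()).rstrip(".")
    return f"{subject}, {style}" if subject else style


def snap_size(width: int, height: int, multiple: int = 8,
              lo: int = 256, hi: int = 768) -> Tuple[int, int]:
    """Clamp each dimension to [lo, hi], then round it down to a multiple of 8."""
    def snap(value: int) -> int:
        value = max(lo, min(hi, value))
        return value - (value % multiple)
    return snap(width), snap(height)


def load_pipeline(model_id: str = "stable-diffusion-v1-5/stable-diffusion-v1-5"):
    """Load Stable Diffusion, using half precision when a CUDA GPU is available."""
    # Lazy imports so the helpers above work without torch or diffusers installed.
    import torch
    from diffusers import StableDiffusionPipeline

    use_gpu = torch.cuda.is_available()
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id,  # assumed model id, not confirmed by the project
        torch_dtype=torch.float16 if use_gpu else torch.float32,
    )
    pipe = pipe.to("cuda" if use_gpu else "cpu")
    pipe.enable_attention_slicing()  # trades some speed for lower peak memory
    return pipe
```

With a loaded pipeline, a generation call might look like `pipe(build_prompt(text), negative_prompt=NEGATIVE_PROMPT, width=w, height=h).images[0]`, where `w, h = snap_size(requested_w, requested_h)`.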

Key Learnings

  • Stable Diffusion pipeline
  • Voice-to-text processing using SpeechRecognition and the Google Speech API
  • Gradio interface building
  • Using Hugging Face Diffusers
  • End-to-end integration of multimodal AI systems
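
The voice-to-text step can be sketched with the SpeechRecognition library roughly as follows. This is an illustrative outline rather than the project's code: the function names `clean_transcript` and `transcribe_file` are hypothetical, and the sketch assumes the recording is saved as a WAV, AIFF, or FLAC file before transcription.

```python
import re


def clean_transcript(text: str) -> str:
    """Trim the transcript and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def transcribe_file(path: str, language: str = "en-US") -> str:
    """Transcribe an audio file with the Google Web Speech API.

    Returns an empty string when the speech cannot be understood.
    """
    # Lazy import so clean_transcript works without the library installed
    # (pip install SpeechRecognition).
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    try:
        return clean_transcript(recognizer.recognize_google(audio, language=language))
    except sr.UnknownValueError:
        return ""  # speech was unintelligible
    except sr.RequestError as exc:
        raise RuntimeError(f"Speech service unavailable: {exc}") from exc
```

Returning an empty string for unintelligible audio lets the caller fall back to a typed prompt, while a network failure is raised so the interface can report it.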

Acoustic Artistry: Audio to Album Art

Overview

Acoustic Artistry is an AI-powered voice-to-image generator that converts spoken descriptions into images using Stable Diffusion. It combines speech recognition with generative AI: a recording is transcribed into a text prompt, and the prompt is rendered as artwork through an interactive Gradio web interface.

Key Features

  • Voice-to-Image Conversion: Record voice input, which speech recognition transcribes to text and passes on as the image prompt.
  • Text Prompt Support: Type a prompt directly as an alternative to recording.
  • Stable Diffusion v1.5 Integration: Generates images with Stable Diffusion v1.5 via Hugging Face Diffusers.
  • Customizable Image Settings: Adjustable image dimensions and generation parameters for finer control over the output.
  • Interactive Gradio Interface: Web UI for recording, prompt editing, and image preview.
  • Real-Time Processing: Fast speech-to-text conversion and an optimized image inference pipeline.
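
One way the features above could be wired together in Gradio is sketched below. It is a hedged outline, not the project's implementation: `resolve_prompt`, `make_handler`, and `build_ui` are hypothetical names, `transcribe` and `render` stand in for the speech recognition and Stable Diffusion steps, the slider ranges are illustrative, and the `sources=` argument assumes Gradio 4 or later.

```python
from typing import Callable, Optional


def resolve_prompt(typed: Optional[str], spoken: Optional[str]) -> str:
    """Prefer the typed prompt; fall back to the transcribed speech."""
    for candidate in (typed, spoken):
        if candidate and candidate.strip():
            return candidate.strip()
    raise ValueError("Provide a recording or a text prompt.")


def make_handler(transcribe: Callable[[str], str],
                 render: Callable[[str, int, int], object]):
    """Build the click handler from a transcriber and an image renderer."""
    def handler(audio_path, typed, width, height):
        spoken = transcribe(audio_path) if audio_path else ""
        prompt = resolve_prompt(typed, spoken)
        return prompt, render(prompt, int(width), int(height))
    return handler


def build_ui(handler):
    """Lay out the recorder, prompt box, size sliders, and image preview."""
    import gradio as gr  # lazy import; assumes Gradio 4 or later

    with gr.Blocks(title="Acoustic Artistry") as demo:
        audio = gr.Audio(sources=["microphone"], type="filepath",
                         label="Describe your image")
        typed = gr.Textbox(label="Or type a prompt")
        width = gr.Slider(minimum=256, maximum=768, value=512, step=8, label="Width")
        height = gr.Slider(minimum=256, maximum=768, value=512, step=8, label="Height")
        button = gr.Button("Generate")
        used = gr.Textbox(label="Prompt used", interactive=False)
        image = gr.Image(label="Result")
        button.click(handler, inputs=[audio, typed, width, height],
                     outputs=[used, image])
    return demo
```

Passing the transcriber and renderer in as callables keeps the handler testable without a microphone or a GPU; `build_ui(make_handler(transcribe, render)).launch()` would start the app once real implementations are supplied.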

Why I Built This

I wanted to explore the interaction between voice interfaces and generative AI. The idea was to reduce friction between imagination and creation by letting users describe an idea aloud and see it rendered as an AI-generated image.

Future Plans

  • Real-time streaming voice input instead of single recordings
  • Advanced prompt enhancement using LLM-based refinement
  • Style presets and reference image conditioning
  • Cloud deployment with GPU acceleration
  • Mobile-optimized interface for broader accessibility

Designed & Developed by Ujwal
© 2026. All rights reserved.