Gen AI: Audio Generation from Imported Photos

less than 1 minute read

An end-to-end pipeline that takes an uploaded photo, generates a meaningful narrative using LangChain and ChatGPT, and converts it to spoken audio — all served through a Streamlit interface.


How It Works

The pipeline runs in three stages:

1. Image → Text — A HuggingFace model (salesforce/blip-image-captioning-base) analyses the uploaded image and produces a short caption describing its content.

2. Text → Story — LangChain and ChatGPT expand the raw caption into a richer, more meaningful narrative.

3. Story → Audio — A second HuggingFace model (kan-bayashi/ljspeech_vits) converts the generated text into spoken audio.

App screenshot


Prerequisites

Install the required dependencies:

pip install -r requirements.txt

Running the App

streamlit run app.py

Sample Output

Listen to a generated audio sample: gen_audio_from_photo.flac


Repository

github.com/uday160386/image-audio-hf-openai