Gen AI: Audio Generation from Imported Photos
An end-to-end pipeline that takes an uploaded photo, captions it, expands the caption into a short narrative using LangChain and ChatGPT, and converts that narrative to spoken audio. Everything is served through a Streamlit interface.
How It Works
The pipeline runs in three stages:
1. Image → Text: A Hugging Face model (Salesforce/blip-image-captioning-base) analyses the uploaded image and produces a short caption describing its content.
2. Text → Story: LangChain and ChatGPT expand the raw caption into a longer narrative built around the scene the caption describes.
3. Story → Audio: A second Hugging Face model (kan-bayashi/ljspeech_vits, an ESPnet VITS voice trained on LJSpeech) converts the generated text into spoken audio.
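
The three stages can be sketched as plain functions chained together. This is a minimal illustration and not necessarily how app.py is written. The function names, the prompt wording, the gpt-3.5-turbo model choice, and the use of espnet2 with espnet_model_zoo for local speech synthesis are all assumptions. The story stage also assumes an OPENAI_API_KEY environment variable is set.

```python
from typing import Callable


def caption_image(image_path: str) -> str:
    """Stage 1: caption the photo with BLIP via the transformers pipeline."""
    from transformers import pipeline  # heavy import, loaded only when needed

    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    return captioner(image_path)[0]["generated_text"]


def write_story(caption: str) -> str:
    """Stage 2: expand the caption into a short story (assumed prompt and model)."""
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_openai import ChatOpenAI

    prompt = ChatPromptTemplate.from_template(
        "Write a short story of at most 50 words based on this scene: {caption}"
    )
    chain = prompt | ChatOpenAI(model="gpt-3.5-turbo", temperature=1)
    return chain.invoke({"caption": caption}).content


def speak(story: str, out_path: str) -> str:
    """Stage 3: synthesise speech locally with ESPnet (assumed setup)."""
    import soundfile as sf
    from espnet2.bin.tts_inference import Text2Speech

    tts = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")
    wav = tts(story)["wav"]
    # soundfile picks the container (FLAC, WAV, ...) from the file extension.
    sf.write(out_path, wav.view(-1).cpu().numpy(), tts.fs)
    return out_path


def run_pipeline(
    image_path: str,
    out_path: str = "story.flac",
    captioner: Callable[[str], str] = caption_image,
    storyteller: Callable[[str], str] = write_story,
    speaker: Callable[[str, str], str] = speak,
) -> tuple[str, str, str]:
    """Run the stages in order and return (caption, story, audio_path)."""
    caption = captioner(image_path)
    story = storyteller(caption)
    audio_path = speaker(story, out_path)
    return caption, story, audio_path
```

Passing the stages in as parameters keeps each one replaceable. A hosted inference endpoint could stand in for the local speech model, for example, without touching the other two stages.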

Prerequisites
Install the required dependencies:
pip install -r requirements.txt
Running the App
streamlit run app.py
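
For orientation, a Streamlit front end for this kind of pipeline might look roughly like the sketch below. It is an assumption about the shape of app.py, not a copy of it. The helper names save_upload and generate_story_audio are hypothetical, and generate_story_audio is left as a hook where the caption, story, and speech stages would be connected.

```python
import sys
import tempfile
from pathlib import Path


def save_upload(data: bytes, filename: str) -> str:
    """Write uploaded bytes to a temporary file, keeping the extension."""
    suffix = Path(filename).suffix or ".jpg"
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
        return tmp.name


def generate_story_audio(image_path: str) -> tuple[str, str]:
    """Hypothetical hook: run the three stages and return (story, audio_path)."""
    raise NotImplementedError("Connect the caption, story, and speech stages here.")


def main() -> None:
    import streamlit as st  # imported here so save_upload works without Streamlit

    st.title("Audio Generation from Photos")
    uploaded = st.file_uploader("Choose a photo", type=["jpg", "jpeg", "png"])
    if uploaded is None:
        return  # nothing to do until the user picks a file

    image_path = save_upload(uploaded.getvalue(), uploaded.name)
    st.image(image_path, caption="Uploaded photo")

    try:
        with st.spinner("Generating story and audio..."):
            story, audio_path = generate_story_audio(image_path)
    except NotImplementedError as exc:
        st.warning(str(exc))
        return

    with st.expander("Story"):
        st.write(story)
    st.audio(audio_path)


# `streamlit run` loads the streamlit package before executing this script,
# so the UI is only built when the file is launched that way.
if "streamlit" in sys.modules:
    main()
```

Saving the upload to disk first means every later stage can work with an ordinary file path, which is what most image and audio libraries expect.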
Sample Output
Listen to a generated audio sample: gen_audio_from_photo.flac