Gen AI: Audio Generation from Imported Photos

less than 1 minute read

An end-to-end pipeline that captions an uploaded photo, expands the caption into a short narrative with LangChain and ChatGPT, and converts that narrative to spoken audio, all served through a Streamlit interface.

How It Works

The pipeline runs in three stages:

1. Image → Text: A Hugging Face model (Salesforce/blip-image-captioning-base) analyses the uploaded image and produces a short caption describing its content.

2. Text → Story: LangChain and ChatGPT expand the raw caption into a fuller narrative.

3. Story → Audio: A second Hugging Face model (espnet/kan-bayashi_ljspeech_vits) converts the generated story into spoken audio.
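
The three stages can be sketched in Python as follows. This is a minimal sketch, not the repository's code: the function names, the prompt wording, the `gpt-3.5-turbo` model choice, and the use of `transformers.pipeline`, `langchain_openai.ChatOpenAI`, and `huggingface_hub.InferenceClient` are assumptions made for illustration. `espnet/kan-bayashi_ljspeech_vits` is assumed to be the Hub id of the text-to-speech model named in stage 3, and it may not be served by every inference provider.

```python
from typing import NamedTuple

CAPTION_MODEL = "Salesforce/blip-image-captioning-base"
TTS_MODEL = "espnet/kan-bayashi_ljspeech_vits"  # assumed Hub id for the stage 3 model


class PipelineResult(NamedTuple):
    caption: str
    story: str
    audio: bytes


def caption_image(image_path: str) -> str:
    """Stage 1: caption the image with the transformers image-to-text pipeline."""
    from transformers import pipeline  # heavy import, loaded only when needed

    captioner = pipeline("image-to-text", model=CAPTION_MODEL)
    return captioner(image_path)[0]["generated_text"]


def build_story_prompt(caption: str, max_words: int = 50) -> str:
    """Turn the raw caption into an instruction for the chat model (illustrative wording)."""
    return (
        f"Write a short story of at most {max_words} words "
        f"based on this image description: {caption.strip()}"
    )


def write_story(prompt: str) -> str:
    """Stage 2: ask ChatGPT for the story through LangChain (reads OPENAI_API_KEY)."""
    from langchain_openai import ChatOpenAI

    return ChatOpenAI(model="gpt-3.5-turbo").invoke(prompt).content


def synthesize_speech(text: str) -> bytes:
    """Stage 3: text to speech through the Hugging Face Inference API; returns audio bytes."""
    from huggingface_hub import InferenceClient

    return InferenceClient().text_to_speech(text, model=TTS_MODEL)


def run_pipeline(image_path, captioner=caption_image,
                 storyteller=write_story, synthesizer=synthesize_speech):
    """Chain the three stages; each stage is a plain callable and can be swapped out."""
    caption = captioner(image_path)
    story = storyteller(build_story_prompt(caption))
    audio = synthesizer(story)
    return PipelineResult(caption, story, audio)
```

Because each stage is a plain callable, the flow can be exercised offline with stand-ins, for example `run_pipeline("photo.jpg", captioner=lambda p: "a dog on a beach", storyteller=str.upper, synthesizer=str.encode)`.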

App screenshot

Prerequisites

Install the required dependencies:

pip install -r requirements.txt

Running the App

streamlit run app.py
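
As a rough idea of what `app.py` might contain, the sketch below wires an upload widget to three stage functions and plays the result. It is an illustration under assumptions rather than the repository's file: `render_app` and the `describe`, `narrate`, and `speak` callables are hypothetical names, and the Streamlit module is passed in as `st` so the function can be exercised without a running server.

```python
def render_app(st, describe, narrate, speak):
    """Render the page with the Streamlit module `st`.

    describe(image_bytes) -> caption, narrate(caption) -> story and
    speak(story) -> FLAC bytes are hypothetical stage functions.
    Returns the story, or None while no photo has been uploaded.
    """
    st.title("Photo to audio story")
    uploaded = st.file_uploader("Choose a photo", type=["jpg", "jpeg", "png"])
    if uploaded is None:
        return None

    # The uploaded file behaves like an in-memory buffer.
    image_bytes = uploaded.getvalue()
    st.image(image_bytes, caption="Uploaded photo")

    # Run the three stages in order.
    caption = describe(image_bytes)
    story = narrate(caption)
    audio = speak(story)

    # Show the story and an audio player for the generated speech.
    st.write(story)
    st.audio(audio, format="audio/flac")
    return story


# In a real app.py (assumed layout), the entry point would look like:
#   import streamlit as st
#   render_app(st, describe=..., narrate=..., speak=...)
```

Streamlit reruns the whole script on every interaction, so the early return simply leaves the page showing the uploader until a photo arrives.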

Sample Output

Listen to a generated audio sample: gen_audio_from_photo.flac

Repository

github.com/uday160386/image-audio-hf-openai

