
Visurai — Visual Learning Copilot
AI-powered visual storytelling tool for dyslexic and visual learners — converts text into storyboards with narration in real time.
2025-10-17
Turn text into a narrated visual story: scenes, images, and audio — in seconds.
🏆 Built at the Good Vibes Only AI/ML Buildathon @ USC (2025)
Service Link
https://visurai-story-maker.lovable.app/
Overview
Project demo: https://drive.google.com/file/d/16_YFVfVJoDPQqLkXXaRXSv_Dyr98bxey/view?usp=sharing
Visurai helps dyslexic and visual learners comprehend material by converting text into a sequence of AI-generated images with optional narration.
Paste any text and get:
- A title and segmented scenes that preserve key facts and names
- High-quality images per scene (Flux via Replicate or OpenAI gpt-image-1)
- Per‑scene TTS audio and a single merged audio track with a timeline
- Optional OCR to start from an image instead of text


Features
- Context‑aware scene segmentation and detail‑preserving visual prompts (GPT‑4o)
- Image generation providers: Replicate (Flux) or OpenAI (gpt-image-1), selected via IMAGE_PROVIDER
- Narration: per‑scene TTS plus a single merged audio track with a timeline
- Live progress via SSE (/generate_visuals_events)
- OCR routes: generate from image URL or upload
- Absolute asset URLs using PUBLIC_BASE_URL (e.g., ngrok) for frontend access
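The SSE endpoint streams progress as text/event-stream messages. A minimal stdlib-only sketch of parsing such a stream; the exact event payloads shown are assumptions, not the backend's actual schema:

```python
import json
from typing import Iterator

def parse_sse(lines: Iterator[str]) -> Iterator[dict]:
    """Yield one dict per SSE message, assuming each `data:` line carries JSON."""
    buffer = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            buffer.append(line[len("data:"):].strip())
        elif line == "" and buffer:
            # A blank line terminates an SSE message
            yield json.loads("".join(buffer))
            buffer = []

# Replay a captured stream like the one /generate_visuals_events might emit
stream = [
    'data: {"stage": "segmentation", "progress": 0.2}\n',
    "\n",
    'data: {"stage": "images", "progress": 0.6}\n',
    "\n",
]
events = list(parse_sse(iter(stream)))
```

In the browser, the same stream is consumed with `EventSource`; the sketch above is handy for testing or scripting against the endpoint.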
Architecture
Text / Image → OCR (optional)
↓
Scene segmentation (GPT‑4o)
↓
Detail‑preserving visual prompts
↓
Image generation (Replicate Flux or OpenAI gpt‑image‑1)
↓
TTS per scene → ffmpeg concat → single audio + timeline
↓
Frontend (React) consumes JSON, images, audio, and SSE
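The merge step at the end of the pipeline produces a timeline mapping each scene to its offset in the single audio track. A sketch of how such a timeline can be derived from per-scene clip durations; field names here are illustrative, not the backend's actual response schema:

```python
from dataclasses import dataclass

@dataclass
class SceneAudio:
    scene_index: int
    duration_s: float  # duration of this scene's TTS clip

def build_timeline(clips: list[SceneAudio]) -> list[dict]:
    """Cumulative start/end offsets for each clip in the concatenated track."""
    timeline, cursor = [], 0.0
    for clip in clips:
        timeline.append({
            "scene": clip.scene_index,
            "start_s": round(cursor, 3),
            "end_s": round(cursor + clip.duration_s, 3),
        })
        cursor += clip.duration_s
    return timeline

timeline = build_timeline([SceneAudio(0, 4.2), SceneAudio(1, 6.5), SceneAudio(2, 3.0)])
```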
Repository Structure
good-vibes-only/
├── backend/
│ ├── main.py # FastAPI app (SSE, OCR, TTS, visuals)
│ ├── image_gen.py # Image provider adapters (Replicate/OpenAI)
│ ├── tts.py # OpenAI TTS + ffmpeg merge
│ ├── settings.py # Pydantic settings + .env loader
│ ├── pyproject.toml # Backend deps (use uv/pip)
│ └── uv.lock
└── frontend/ # React app that calls the backend
Prerequisites
- Python 3.10+ (tested up to 3.13)
- ffmpeg installed (required for merged audio)
- Provider keys as needed: OPENAI_API_KEY (LLM, TTS, OpenAI Images) and/or REPLICATE_API_TOKEN (Flux images)
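ffmpeg is what merges the per-scene clips into one track. A stdlib-only sketch of building a concat-demuxer invocation; paths and flags are placeholders, and the actual tts.py may differ:

```python
import tempfile
from pathlib import Path

def build_concat_cmd(clips: list[Path], out: Path, list_file: Path) -> list[str]:
    """Write an ffmpeg concat-demuxer list file and return the merge command."""
    list_file.write_text("".join(f"file '{c}'\n" for c in clips))
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0",
        "-i", str(list_file),
        "-c", "copy",  # no re-encode when all clips share a codec
        str(out),
    ]

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    cmd = build_concat_cmd(
        [tmp_dir / "scene_0.mp3", tmp_dir / "scene_1.mp3"],
        tmp_dir / "merged.mp3",
        tmp_dir / "clips.txt",
    )
    # subprocess.run(cmd, check=True)  # requires ffmpeg on PATH
```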
Backend — Quick Start (run from repo root)
From the repo root:
# 1) Install deps (using uv)
cd backend && uv sync && cd ..
# 2) Create backend/.env with your keys and config (see below)
# 3) Run the API from the repo root
uv run uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload
backend/.env (example)
# LLM
OPENAI_API_KEY=sk-...
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
# Image provider (replicate | openai)
IMAGE_PROVIDER=replicate
REPLICATE_API_TOKEN=r8_...
REPLICATE_MODEL=black-forest-labs/flux-1.1-pro
REPLICATE_ASPECT_RATIO=16:9
# OpenAI Images (if IMAGE_PROVIDER=openai)
OPENAI_IMAGE_MODEL=gpt-image-1
OPENAI_IMAGE_SIZE=1536x1024 # allowed: 1024x1024, 1024x1536, 1536x1024, auto
# TTS
TTS_PROVIDER=openai
TTS_MODEL=gpt-4o-mini-tts
TTS_VOICE=alloy
TTS_OUTPUT_DIR=/tmp/seequence_audio
# Absolute URLs for frontend (ngrok/domain)
PUBLIC_BASE_URL=https://<your-ngrok-subdomain>.ngrok-free.dev
# CORS (optional – include your frontend origin when using credentials)
CORS_ORIGINS=https://<your-ngrok-subdomain>.ngrok-free.dev
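backend/settings.py loads these values via Pydantic settings; a stdlib-only approximation of the same idea, with names mirroring the .env keys above and defaults that are assumptions:

```python
import os
from dataclasses import dataclass, field

@dataclass
class Settings:
    """Minimal stand-in for the Pydantic settings class: env vars with fallbacks."""
    image_provider: str = field(
        default_factory=lambda: os.getenv("IMAGE_PROVIDER", "replicate"))
    tts_voice: str = field(
        default_factory=lambda: os.getenv("TTS_VOICE", "alloy"))
    public_base_url: str = field(
        default_factory=lambda: os.getenv("PUBLIC_BASE_URL", "http://127.0.0.1:8000"))

os.environ["IMAGE_PROVIDER"] = "openai"  # simulate a .env override
settings = Settings()
```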
Verify
# Health
curl -sS http://127.0.0.1:8000/health
# One image (provider-dependent)
curl -sS http://127.0.0.1:8000/generate_image \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Clean educational infographic showing 1 AU ≈ 1.496e8 km. Label Earth and Sun. High contrast."
  }'
# Visuals + merged audio
curl -sS http://127.0.0.1:8000/generate_visuals_single_audio \
  -H "Content-Type: application/json" \
  -d '{ "text": "The Sun is a G-type star...", "max_scenes": 5 }'
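The same call from Python with only the standard library. The endpoint and payload mirror the curl example; actually sending requires the server to be running, so the send itself is commented out:

```python
import json
import urllib.request

payload = {"text": "The Sun is a G-type star...", "max_scenes": 5}
req = urllib.request.Request(
    "http://127.0.0.1:8000/generate_visuals_single_audio",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# with urllib.request.urlopen(req) as resp:
#     story = json.load(resp)  # audio_url, total duration, timeline, scenes
```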
Frontend — Quick Start (pnpm)
Configure your frontend to call the backend base URL (e.g., PUBLIC_BASE_URL).
Typical React workflow:
cd frontend
pnpm install
pnpm dev
Ensure your frontend uses absolute URLs from the backend responses (e.g., image_url, audio_url), which already include the PUBLIC_BASE_URL when set.
If your frontend needs an explicit base URL, set it (e.g., Vite):
# .env.local in frontend (example)
VITE_API_BASE=https://<your-ngrok-subdomain>.ngrok-free.dev
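On the backend side, PUBLIC_BASE_URL is what turns relative asset paths into the absolute image_url/audio_url values the frontend consumes. Roughly, with an illustrative helper name:

```python
from urllib.parse import urljoin

PUBLIC_BASE_URL = "https://example.ngrok-free.dev"  # placeholder, from backend/.env

def absolutize(path: str) -> str:
    """Prefix a relative asset path with the public base URL."""
    return urljoin(PUBLIC_BASE_URL + "/", path.lstrip("/"))

audio_url = absolutize("/audio/scene_0.mp3")
```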
Engine Switch: LangGraph vs Imperative
The backend can run either:
- Imperative flow (default): sequential segmentation → prompts → images
- LangGraph flow: graph-based orchestration
Enable LangGraph by setting an env var and restarting the server:
export PIPELINE_ENGINE=langgraph
uv run uvicorn backend.main:app --host 0.0.0.0 --port 8000 --reload
Endpoints are the same (e.g., POST /generate_visuals), but execution uses the graph.
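Inside the backend, that switch can be as simple as an env-var dispatch; this is a sketch, and the actual main.py wiring may differ:

```python
import os

def pick_pipeline() -> str:
    """Select the execution engine; defaults to the imperative flow."""
    engine = os.getenv("PIPELINE_ENGINE", "imperative").lower()
    if engine not in {"imperative", "langgraph"}:
        raise ValueError(f"Unknown PIPELINE_ENGINE: {engine}")
    return engine

os.environ.pop("PIPELINE_ENGINE", None)  # unset → default engine
engine = pick_pipeline()
```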
API Highlights
- POST /generate_visuals → scenes with image URLs and a title
- POST /generate_visuals_with_audio → scenes + per‑scene audio URLs + durations
- POST /generate_visuals_single_audio → merged audio_url, total duration, timeline, scenes
- GET /generate_visuals_events → Server‑Sent Events stream for progress
- POST /visuals_from_image_url and /visuals_from_image_upload → OCR then visuals
Troubleshooting
- Audio fails to load after revisiting a story: generated files live in TTS_OUTPUT_DIR (default under /tmp), which may be cleared between runs; regenerate the story
- OpenAI Images error: invalid size: set OPENAI_IMAGE_SIZE to one of 1024x1024, 1024x1536, 1536x1024, or auto
- Replicate credit errors: check your Replicate account balance, or switch IMAGE_PROVIDER=openai
- Mixed content blocked: make sure PUBLIC_BASE_URL is an https:// URL so assets load on an https page
- CORS: add your frontend origin to CORS_ORIGINS in backend/.env
License
MIT License © 2025 Visurai Team
Made with care for learners who think in pictures.