I built a real AI video processing SaaS from Senegal: no GPT wrappers, just HuggingFace + OpenCV + YOLO + Detectron2 + MediaPipe + Celery
## The problem I was solving

Every creator I know spends 3-4 hours manually cutting one video into clips for TikTok and Instagram. The algorithm rewards volume — not perfection. Post 20 clips and maybe 2 go viral; post 1 perfectly edited video and maybe 0 do.

So I built ClipFarmer.

## Not a GPT wrapper — real computer vision

This is the part I want to be clear about. Most "AI tools" people encounter — especially in West Africa — are scams: someone charges you to access ChatGPT through a Telegram bot and calls it "AI formation."

ClipFarmer runs actual machine learning models in its processing pipeline:

- **Whisper (HuggingFace)** — automatic speech recognition for subtitle generation. Runs locally on the worker: no API calls, no per-minute billing.
- **YOLO + OpenCV (cv2)** — scene detection and object tracking, used to find the best cut points in a video — not splitting at fixed intervals, but finding where scenes actually change.
- **Detectron2** — instance segmentation. Powers background removal and masking effects directly on video frames.
- **MediaPipe** — pose and face landmark detection, used for smart reframing: keeping the subject centered when converting 16:9 video to the 9:16 vertical format TikTok expects.
- **OpenCV (cv2)** — the backbone of all frame-level processing. Every effect, every transition, every crop runs through cv2 pipelines.

These aren't API calls to someone else's model. They run on our workers.

## The effects and transitions pipeline

This was the hardest part to build. Each effect is a cv2 pipeline that processes frames individually and reassembles them into a video. Things like:

- Color grading (dark moody, vintage grain, RGB split)
- CRT scanline overlay
- Motion blur
- Skeleton overlay (MediaPipe pose)
- Background removal (Detectron2 masks)

Transitions between clips use frame blending and optical flow — not simple cuts or crossfades.

The whole thing runs as a Celery chord:

```python
# Split the source video first, then run the per-clip work
# (effects, subtitles, transitions) as the chord's callback.
workflow = chord(
    spliter_clip.s(job.job_id, input_path),
    workflow_tasks_parallel.s()
)
task_result = workflow()
```

Split first → then effects, subtitles, and transitions run in parallel on the clips → reassemble.

## The stack

- **Backend:** FastAPI + Celery + RabbitMQ + Redis
- **AI/CV:** Whisper + YOLO + Detectron2 + MediaPipe + OpenCV
- **Storage:** MinIO (self-hosted, S3-compatible, presigned uploads)
- **Frontend:** React + Vite + TailwindCSS
- **Database:** PostgreSQL + SQLAlchemy async
- **Deployment:** Docker Compose on a VPS

Each AI model runs in its own conda environment inside the worker container — Whisper, Detectron2, and MediaPipe have conflicting dependencies, so isolating them was non-negotiable.

## The African creator angle

In Senegal and West Africa:

- Mobile money (Wave, Orange Money) is how people pay
- Credit cards are rare
- Most AI tools people see are scams or inaccessible

ClipFarmer accepts Wave and Orange Money natively. And it runs real models — not a chat interface pretending to be a video tool.

## What I learned

**Conflicting ML dependencies are brutal.** Whisper, Detectron2, and MediaPipe cannot share a Python environment cleanly. The solution was separate conda envs, with the main worker calling into them via subprocess (sketched below).

**Presigned uploads are mandatory for video.** Having the client upload directly to MinIO instead of streaming through FastAPI was the difference between a server that crashes on large files and one that handles them fine.

**cv2 frame processing is slow without batching.** Processing frames one by one destroyed performance. Batching frame reads and writes cut processing time significantly.
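To make that last point concrete, here is a minimal sketch of batched frame I/O with cv2. It batches reads, applies one vectorized NumPy pass over the whole batch instead of per-frame Python work, and writes in order. The "effect" is a toy color grade standing in for a real pass; names and the batch size are illustrative, not ClipFarmer's actual code:

```python
import cv2
import numpy as np

def process_video(in_path: str, out_path: str, batch_size: int = 32) -> None:
    """Read frames in batches, apply an effect per batch, write in order."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

    batch = []
    while True:
        ok, frame = cap.read()
        if ok:
            batch.append(frame)
        if batch and (len(batch) == batch_size or not ok):
            # One vectorized pass over the whole batch (toy color grade).
            stacked = np.stack(batch).astype(np.float32)
            graded = np.clip(stacked * 0.9 + 10.0, 0, 255).astype(np.uint8)
            for f in graded:
                out.write(f)
            batch.clear()
        if not ok:
            break

    cap.release()
    out.release()
```

The batch size also caps peak memory, which matters on long videos: only `batch_size` decoded frames are resident at once.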
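The presigned-upload flow from the second lesson looks roughly like this with the MinIO Python SDK. The endpoint, bucket name, and route are illustrative assumptions:

```python
from datetime import timedelta

from fastapi import FastAPI
from minio import Minio

app = FastAPI()
minio = Minio("minio:9000", access_key="...", secret_key="...", secure=False)

@app.post("/uploads")
def create_upload(filename: str) -> dict:
    """Hand the browser a presigned URL; the video never flows through FastAPI."""
    object_name = f"raw/{filename}"
    url = minio.presigned_put_object(
        "videos", object_name, expires=timedelta(hours=1)
    )
    # The client PUTs the file straight to MinIO, then notifies the API
    # that the upload finished so a processing job can be enqueued.
    return {"upload_url": url, "object_name": object_name}
```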
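And the cross-environment isolation from the first lesson: the worker shells out to each model's own interpreter instead of importing conflicting packages into one process. The env path and CLI module below are hypothetical; only the pattern is the point:

```python
import json
import subprocess

# Each model lives in its own conda env; we invoke that env's Python
# binary directly rather than activating the env in the worker process.
WHISPER_PY = "/opt/conda/envs/whisper/bin/python"  # illustrative path

def transcribe(video_path: str) -> dict:
    """Run the Whisper env as a subprocess and parse JSON from stdout."""
    result = subprocess.run(
        [WHISPER_PY, "-m", "transcribe_cli", video_path],  # hypothetical module
        capture_output=True,
        text=True,
        check=True,  # raise if the subprocess exits non-zero
    )
    return json.loads(result.stdout)
```

The cost is serialization overhead at each boundary, but it buys completely independent dependency trees per model.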
**Docker networking will humble you.** My Celery worker couldn't reach RabbitMQ because the FastAPI container was missing `RABBITMQ_URL` — it cost me an afternoon of traceback reading.

## Where it is now

Live at clipfarmer.site. Free credits to try it out, and mobile payment for West African creators.

I'm curious — has anyone else built cv2 processing pipelines at scale? Frame batching and memory management on long videos are still things I'm optimizing. What would make you switch from manual editing?