Multimedia Building Blocks

Disclosure: Some links in this article are affiliate links. AI Maestro may earn a commission if you make a purchase, at no…

By AI Maestro, June 9, 2026

For creators, the 3D generation workflow is finally dead code

Makers and artists no longer need to wrestle with complex SDKs or manage GPU clusters to build multimedia experiences. The era of assembling polished, monolithic software is over. Instead, the most effective path forward is a “building block economy” where small, well-documented components are stitched together by intelligent agents. As Mitchell Hashimoto noted, while AI is capable of building everything from scratch, it excels at gluing proven pieces together.

This shift is transforming how multimedia software is constructed. The difficulty was never the underlying models for image generation, video creation, or 3D reconstruction. The hurdle was always the integration layer: handling different input formats, polling for results, and managing weights. Now, every state-of-the-art model on the Hugging Face Hub acts as a documented, callable block. Agents can assemble these primitives exactly as developers once glued npm packages together.

How a coding agent built a Paris gallery without human intervention

I tasked a coding agent with creating a website showcasing Parisian monuments as 3D Gaussian splats. I did not open an image generator. I did not touch a 3D reconstruction tool. The agent generated every asset by calling two Hugging Face Spaces directly and wiring them into a cinematic viewer.

This live static Space demonstrates the capability. It serves as a preview of how future multimedia software will be built.

The mechanics of the pipeline

The Hub hosts thousands of open-weights models, most deployed as interactive Spaces. Crucially, every Gradio Space now exposes a plain-text agents.md file. This file tells an agent exactly how to call the service, providing the schema URL, call and poll templates, file upload instructions, and authentication hints.

Retrieving this file via a simple curl command returns everything needed in one shot:

API schema: GET …/gradio_api/info
Call endpoint: POST …/gradio_api/call/v2/{endpoint} {"param_name": value, …}
Poll result: GET …/gradio_api/call/{endpoint}/{event_id}
File inputs: POST …/gradio_api/upload -F "files=@file.ext"
Auth: Bearer $HF_TOKEN

There is no client library required and no hardcoded integration needed. An agent reads this metadata and can drive the Space end to end. Once an HF_TOKEN is set, the process begins.
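As a rough illustration of the call-and-poll pattern above, here is a minimal Python sketch that uses only the standard library. The `base_url` argument stands in for the elided `…` prefix and has to come from the Space's own agents.md. The helper names are hypothetical, and two details are assumptions rather than facts from this article: that the call response is JSON carrying an `event_id` field, and that the poll response is a stream of `event:` / `data:` lines ending in a `complete` event.

```python
import json
import os
import urllib.request


def build_call_request(base_url, endpoint, params, token=None):
    """Build the POST request described by the 'Call endpoint' template."""
    url = f"{base_url.rstrip('/')}/gradio_api/call/v2/{endpoint}"
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps(params).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def build_poll_url(base_url, endpoint, event_id):
    """Build the URL described by the 'Poll result' template."""
    return f"{base_url.rstrip('/')}/gradio_api/call/{endpoint}/{event_id}"


def parse_sse(text):
    """Parse 'event:' / 'data:' line pairs into (event, data) tuples.

    Assumption: the poll response uses this server-sent-event layout.
    """
    events = []
    current = None
    for line in text.splitlines():
        if line.startswith("event:"):
            current = line[len("event:"):].strip()
        elif line.startswith("data:") and current is not None:
            raw = line[len("data:"):].strip()
            try:
                data = json.loads(raw)
            except json.JSONDecodeError:
                data = raw  # keep non-JSON payloads as plain text
            events.append((current, data))
            current = None
    return events


def call_space(base_url, endpoint, params):
    """Call a Space endpoint, then poll once for the finished result."""
    token = os.environ.get("HF_TOKEN")
    request = build_call_request(base_url, endpoint, params, token)
    with urllib.request.urlopen(request) as resp:
        event_id = json.load(resp)["event_id"]  # assumed response field
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    poll = urllib.request.Request(
        build_poll_url(base_url, endpoint, event_id), headers=headers
    )
    with urllib.request.urlopen(poll) as resp:
        events = parse_sse(resp.read().decode("utf-8"))
    for name, data in events:
        if name == "complete":
            return data
    raise RuntimeError(f"no 'complete' event received: {events!r}")
```

Keeping the request builders separate from the network call makes the URL and header logic easy to check without contacting a Space.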

The real unlock comes from chaining: the output of one Space becomes the input to the next. The pipeline behind the Paris gallery works as follows:

  • Image generation: An image-generation Space turned each monument into a clean, dark-background “specimen” shot. It even created a little diorama on a plinth for the Eiffel Tower. Prompt in, image out.
  • 3D reconstruction: The Space VAST-AI/TripoSplat reconstructed a 3D Gaussian splat (.ply) from each single image. Image in, 3D out.

The six source images generated by the agent were all isolated on black, ready for single-image 3D reconstruction.
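The chaining idea can be sketched in a few lines of Python. This is only an illustration of "output of one step becomes input to the next", not the gallery's actual scripts: `run_chain`, `fake_image_space`, and `fake_splat_space` are hypothetical names, and the two fake steps stand in for real calls to the image-generation and TripoSplat Spaces.

```python
from typing import Any, Callable, List, Sequence

Step = Callable[[Any], Any]


def run_chain(steps: Sequence[Step], initial: Any) -> List[Any]:
    """Feed each step's output into the next and keep every intermediate value."""
    results = [initial]
    for step in steps:
        results.append(step(results[-1]))
    return results


# Hypothetical stand-ins for the two Spaces: prompt -> image, image -> splat.
def fake_image_space(prompt: str) -> str:
    return f"{prompt}.png"


def fake_splat_space(image_path: str) -> str:
    return image_path.rsplit(".", 1)[0] + ".ply"


if __name__ == "__main__":
    print(run_chain([fake_image_space, fake_splat_space], "eiffel"))
    # prints ['eiffel', 'eiffel.png', 'eiffel.ply']
```

Returning the intermediate values as well as the final one keeps each generated asset available for inspection, which matters when a step produces a poor result and has to be rerun.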

From there, the agent performed the “glue” work. It noticed TripoSplat outputs were Y-down and flipped them upright. It auto-framed each monument, compressed the .ply files to .ksplat (approximately 3× smaller for faster loading), and built a Three.js viewer with a scroll-to-switch and drag-to-rotate UI. Finally, it deployed the whole thing as a static Space.
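Two of those glue steps, flipping a Y-down model upright and auto-framing it, come down to a little geometry. The Python sketch below is one possible approach under stated assumptions, not the code the agent wrote: it treats the splat as a plain list of (x, y, z) centers, flips by rotating 180 degrees about the X axis, and frames by fitting a bounding sphere into the camera's field of view. The function names and the 60-degree default are hypothetical. A real Gaussian splat also stores a per-point orientation that would need the same rotation, and a viewer could instead apply the rotation to the whole scene object.

```python
import math


def flip_upright(points):
    """Rotate 180 degrees about the X axis: (x, y, z) -> (x, -y, -z).

    Negating both Y and Z is a proper rotation, so the model is turned
    over rather than mirrored.
    """
    return [(x, -y, -z) for x, y, z in points]


def frame_distance(points, fov_degrees=60.0):
    """Return (center, distance) so the bounding sphere fits the field of view.

    center is the midpoint of the axis-aligned bounding box; distance is how
    far back the camera must sit for a sphere of that radius to fit.
    """
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    radius = max(math.dist(p, center) for p in points)
    distance = radius / math.sin(math.radians(fov_degrees) / 2)
    return center, distance
```

With a 60-degree field of view the camera ends up two radii from the center, since sin(30°) = 0.5.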

The only human inputs were taste-level adjustments: “make it zoomed out,” “replace the obelisk with something better for splatting,” and “the transition lingers too long.”

Several of those steps involved the agent reacting to reality. A wide glass pyramid splats poorly. A thin obelisk makes a dull subject. A single-view reconstruction has to guess at the back of the object, which the source image never shows. This is exactly the “outsourced R&D, fast iteration” loop Hashimoto predicted, except the R&D was a conversation.

Why this matters for the industry

Models are becoming composable. A state-of-the-art splat model and a state-of-the-art image model from different organisations can be chained with zero integration code. The Hub’s open-weights catalog effectively becomes a library of callable multimedia primitives.

Agents prefer what is documented and reachable. The agents.md file makes a Space trivially reachable, so an agent will pick it over a model it has to set up by hand. This mirrors the dynamic Hashimoto flags for open-source libraries.

The barrier was integration, and it is largely gone. “Turn a prompt into a rotating 3D monument” used to be a project. Here, it is merely a step in a pipeline.

To replicate this, point your own agent at a Space’s agents.md and let it cook:

# image generation
curl https://huggingface.co/spaces/ideogram-ai/ideogram4/agents.md
# single-image to 3D gaussian splat
curl https://huggingface.co/spaces/VAST-AI/TripoSplat/agents.md
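For scripted use, the same fetch can be written in Python with the standard library alone. This is a minimal sketch: the URL pattern is taken from the two curl commands above, while the function names and the 30-second timeout are illustrative choices.

```python
import urllib.request


def agents_md_url(space_id: str) -> str:
    """Build the agents.md URL for a Space id such as 'VAST-AI/TripoSplat'."""
    parts = space_id.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'owner/name', got {space_id!r}")
    owner, name = parts
    return f"https://huggingface.co/spaces/{owner}/{name}/agents.md"


def fetch_agents_md(space_id: str, timeout: float = 30.0) -> str:
    """Download the plain-text agents.md for a Space (needs network access)."""
    with urllib.request.urlopen(agents_md_url(space_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_agents_md("VAST-AI/TripoSplat")` should return the same text as the second curl command.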

Paste either link into your coding agent (Claude Code, etc.), set your HF_TOKEN, and ask it to build something. The full, reproducible pipeline for this gallery, including the scripts that hit those two agents.md endpoints, lives in the Space repo. The building blocks are sitting right there on the Hub. The agents already know how to glue.

Key takeaways

  • Integration complexity is collapsing. Agents can now chain disparate state-of-the-art models, from image generation to 3D reconstruction, without anyone hand-writing integration code or managing environments.

  • The agents.md standard is the critical enabler. By exposing a plain-text API schema, Hugging Face Spaces have become directly addressable by AI agents, removing the need for manual SDK setup.

  • Human roles are shifting from implementation to curation. The workflow now focuses on taste-level feedback and prompt engineering, while the agent handles the technical execution and error correction.
