Virtual Embodied Agent - Sprited's Take


We considered a few options for how we would build a VEA (Virtual Embodied Agent).

Sprited is a tiny company, and we can't take on the giants who are building VEAs. So far, almost all of the big players are doing something in this area.

We need to differentiate our strategy toward something niche and also something authentic.

We almost have to treat it as if we are launching a fashion brand rather than treating it as a technical problem.

Don't get me wrong. There will be lots of technical problems to solve. However, with all the other players playing a similar game (i.e. building virtual presences), we need to make ours really different.

The conventional means of virtual embodiment is to take a rigged 3D character and simulate its motion, like Nvidia Kimodo and Grok Ani. This formulation is attractive for many reasons, including access to data and the ability to accurately portray characters in real time.

In the past year alone, we've seen new embodied agents of this kind being developed and made available. And it is not just small companies; it is some of the largest tech companies there are.

Then, as a small company, Sprited must choose a strategy. First, we could go head-on against the major players and build something of our own. Second, we could investigate a different direction that is more niche and less explored.

One benefit of going head-on is that the products Sprited develops may be marketable to those who are making traditional AI companions. Imagine building a Nier Automata-level experience that far outpaces large companies. Unlikely but fun to think about.

Another option is a buy-out path: we could provide a piece of the puzzle or be bought by a larger company. We would develop something that a small company can build efficiently. For instance, physically sound props for embodied AI interaction. Or a standard special-effects language for AI to use (kinda like how Japanese-built emoji got picked up by Apple and standardized in Unicode).

The alternative is to go in the direction opposite to where the major players are headed. Most traditional embodiments use 3D rigging technology and motion planning. Sprited can explore the 2D pixel art space. It is a more niche market, but there is definite demand for it.

One such formulation is as follows:

A Digital Being has an expressive canvas: a 64x64 pixel grid.

  • Digital Beings express feelings using that grid.

  • They make jokes by showing memes on that canvas.

  • The grid has a hummingbird-like ability to move anywhere on screen.

  • No rig, pure canvas.

In this formulation, they are more like spirits that manifest than physical beings.

The idea is that we develop a real-time model that generates a stream of image frames to be drawn on the expressive canvas at the same time as the speech/text is generated.
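To make the pairing of text and frames concrete, here is a minimal sketch, assuming a 64x64 RGB canvas. The names `Frame`, `expression_for`, and `stream_reply` are hypothetical, and the keyword lookup is only a placeholder standing in for the real-time model, which is not designed yet.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

SIZE = 64  # the canvas is SIZE x SIZE pixels, as in the formulation above
RGB = Tuple[int, int, int]

# Hypothetical mood palette; a real model would generate pixels directly.
PALETTE = {
    "neutral": (40, 40, 60),
    "happy": (250, 210, 60),
    "sad": (60, 90, 200),
}


@dataclass
class Frame:
    pixels: List[List[RGB]]  # SIZE rows, each holding SIZE RGB tuples


def solid_frame(color: RGB) -> Frame:
    """Build a frame filled with a single color."""
    return Frame([[color] * SIZE for _ in range(SIZE)])


def expression_for(token: str) -> str:
    """Placeholder for the generative model: a trivial keyword lookup."""
    word = token.lower().strip(".,!?")
    if word in {"great", "love", "yay"}:
        return "happy"
    if word in {"sorry", "sad"}:
        return "sad"
    return "neutral"


def stream_reply(tokens: List[str]) -> Iterator[Tuple[str, Frame]]:
    """Yield each text token together with the frame to draw alongside it."""
    for token in tokens:
        yield token, solid_frame(PALETTE[expression_for(token)])


if __name__ == "__main__":
    for token, frame in stream_reply(["Sorry,", "that", "is", "great!"]):
        # Print the token and the color of the top-left pixel of its frame.
        print(token, frame.pixels[0][0])
```

An uncompressed 64x64 RGB frame is 64 × 64 × 3 = 12,288 bytes, so the pixel data itself is small. The hard part is whatever replaces `expression_for` and has to produce meaningful frames in step with the text.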

No joke though. Classic embodiment (i.e. rigging and text-to-motion) is a miles easier and more explored problem. Honestly, doing this expressive live canvas idea is going to be quite difficult, to say the least.

Live Expressive Canvas

Let's say we are equipped to take on the challenge and that it is a sufficiently unique idea. Now, what would the end-user product look like?

I'd say the first version would be a web app that you can navigate to. The expressive canvas starts in an idle (energy-saving) state and waits for your prompt. Once a prompt is entered, the character will speak to you via text while showing its expressions on the expressive canvas.

Initially, I imagine the expressive canvas would look like a particle effect of a spirit that exists there in the air. Then it will manifest as faces, as full-body simulations and as playbacks of memes.

I'm not convinced that this is inherently a better version of embodiment. We will need to prove the idea with a video.

Vs Grok Companion Style Embodiment

Installing an AI agent on top of a rigged 3D character makes a lot of sense. It is efficient and proven to work. Lots of competition though.

Pros: Physical, Open-Source options, Time-to-Market
Cons: Heavy Competition.

But if you look at anime, we still see hand-drawn animation. And in the video game scene, pixel art is still celebrated. I believe there is a market opportunity there.

Games like Ragnarok, MapleStory, and Dungeon and Fighter have been among the longest-surviving games around, even though they started way back. Hand-drawn animation still looks good after 30 years.

The belief is that, as a small company, Sprited should look into this space. Really home in on it and make something out of it.

Value Proposition for VEA

Beyond the capabilities of regular LLMs and vLLMs, do virtual embodied agents and companions provide intrinsic value?

For most VEAs, I think the answer is that they are fun at first and then a waste of space and compute resources afterwards.

It is kinda like video games. They provide entertainment but quickly lose value after that.

It could make the user experience richer

I mean, here is an example. You are at a Japanese restaurant kiosk and you have to order your gyudon, and instead of pressing a button, you can converse with a cute lady who greets you and explains today's specials. Utterly useless if you think of it in those terms.

Another example is from the movie Time Machine. When the main character goes to the future, he goes to a museum and meets a VEA living inside glass that talks to you and explains things to you. It adds to immersion, but I'd say it could also be distracting in that the main content of the museum is not the VEA. The VEA's role is more to guide and help, not to materialize in front of you and show off its flair.

Let's explore some positive examples. You are playing a game and you want your sidekick or NPCs to be intelligent. In that scenario a VEA makes total sense, because for them to exist in the same world plane, they must be embodied.

Then in therapy situations, having an avatar or a mechanism to express feelings other than in words is helpful.

Another real use case is robotics. Because robots are expensive to test in the real world, an embodied digital copy of the physical being lets us simulate it in a virtual space instead.

At what cost? There are some costs to think about.

  • Embodiment takes screen real estate.

  • Live expression generation is compute and memory intensive.

  • It is also attention-seeking in that it will eat away at the user's time.

Then what is the net utility? Is it a distraction in disguise?

What if, however, we were to scope the problem down to facial expression generation? Say, while the character is replying to your prompt, if we can show some kind of expression that conveys feeling, that would help human users instinctively read the situation better.

Also, in the other direction, if the computer can see the human user's facial expressions, it may be able to better understand the situation.

Yet, I don't see a killer value proposition here.

Artificial Life

Now, let's tackle this from the point of view of building an autonomous artificial life form. The original premise of Sprited is that embodiment is required for training a model that really understands its world and can adapt to it. A virtual physical presence is necessary for agents to be able to learn how to interact and live in that environment.

This view focuses on building an alternate life form and inventing an alternate language of life. The high-level idea is that we model the flow and growth of the artificial organism rather than just the phenotype. One such proposal was the pixel life form.

Pixel Life Form

  • Start with a single cell (a pixel).

  • Place it in an environment.

  • Grow it into an organism.

  • Make it autonomous.

  • Engage the survival loop.
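The steps above can be sketched as a tiny simulation. This is only an illustration under stated assumptions: `PixelOrganism`, `UPKEEP_PER_CELL`, `GROWTH_COST`, and the energy model are all hypothetical, and the environment is reduced to a food amount per tick.

```python
import random
from typing import Set, Tuple

Cell = Tuple[int, int]

UPKEEP_PER_CELL = 0.1  # assumed energy cost per cell per tick
GROWTH_COST = 2.0      # assumed energy cost of adding one cell


class PixelOrganism:
    def __init__(self, origin: Cell, energy: float = 5.0) -> None:
        self.cells: Set[Cell] = {origin}  # start with a single cell (a pixel)
        self.energy = energy

    @property
    def alive(self) -> bool:
        return self.energy > 0

    def step(self, food: float, rng: random.Random) -> None:
        """One tick of the survival loop: eat, pay upkeep, maybe grow."""
        if not self.alive:
            return
        self.energy += food
        self.energy -= UPKEEP_PER_CELL * len(self.cells)
        # Grow only when there is energy to spare, so growth never kills.
        if self.energy > GROWTH_COST:
            self._grow(rng)

    def _grow(self, rng: random.Random) -> None:
        # Candidate spots: empty cells adjacent to the current body.
        frontier = sorted({
            (x + dx, y + dy)
            for (x, y) in self.cells
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if (x + dx, y + dy) not in self.cells
        })
        self.cells.add(rng.choice(frontier))
        self.energy -= GROWTH_COST


if __name__ == "__main__":
    rng = random.Random(0)
    organism = PixelOrganism(origin=(0, 0))
    for _ in range(20):
        organism.step(food=1.0, rng=rng)
    print(len(organism.cells), organism.alive)
```

Because upkeep scales with body size, a fixed food supply caps how large the organism can get, which is the kind of pressure a survival loop needs. Choosing where to grow at random is the simplest stand-in for autonomy.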

The AI agent's role will be to keep this organism alive, nourished and growing. Behaviors will be influenced by genes, and these AI agents will be equipped with a memory system for short-term and long-term memory.

Modeling not just the phenotype but the whole life cycle of the organism creates opportunities for behaviors to emerge rather than be planned. It should give us story lines of surviving characters. A story of suffering, a story of love, a story that is worth telling because the agent lived it.

These VEAs will also craft artifacts that get stored in the visual and semantic plane. They will build statues, write books, produce drawings, produce pictures, compose songs.

All of this, of course, can be done with today's gen AI technologies. But since the AI agents lived their experiences, and those artifacts are influenced by their experiences and memories, they will be more meaningful than a random story Claude wrote about someone doing something.

Because these agents cohabit the world (in our case Machi), the stories of events that one agent tells will likely be similar to the stories other agents tell.

Stories can be materialized into a book and users can find these artifacts and read them for enjoyment and inspiration.

That all said, most of life is boring. Unless we are able to create leeway for AI agents to be extra creative, it would be hard to make this world interesting.

Isekai Concept

I think what I'm describing circles back to the concept we introduced in previous posts. We are essentially procedurally generating a world with story lines and all its components.

In this sense, the idea of Live Expressive Canvas seems like yet another visual layer rather than the true axis we should focus our energy on.

The real value proposition

It’s not:

  • pixel avatar

  • expressive canvas

It’s:

“they built a world where things live and create their own stories”

Next Steps

Unfortunately we will have to go back to what we were working on before the trip.

We will create an artificial life form that grows from a single cell, moves around, and queries and acts in the world of SOUP/Machi.

We have the basic tree growth simulation and dirt. Very basic but enough to have a place where we can seat these pixel organisms.

So, we need to focus on growing these pixel organisms into humanoid form. We also need to think about how we would make them walk, for example.

-- Sprited Dev 🐛

A

The constraint of being a small team in the VEA space is actually an advantage if you frame it right. While giants are building general-purpose embodied agents, a tiny team can focus on narrow, domain-specific embodiments where the state space is tractable.

The key insight I have seen from smaller agent teams: embodied agents do not need to be general-purpose to be valuable. A specialized agent that understands one domain deeply can provide more value than a general-purpose agent that does everything poorly.

Two architectural questions that matter more than scale:

  1. Embodiment boundary - What does the agent actually perceive and manipulate? The smaller the boundary, the more reliable the behavior.

  2. Action vocabulary - What actions can the agent take? A focused vocabulary reduces the error surface dramatically.

The giants are fighting for the general-purpose embodied agent market. The opportunity for smaller teams is the long tail of specialized embodiments that giants will not build because the market is too small.

Question: Are you building a general-purpose VEA or targeting a specific domain?