Machi To SpriteDX Bridge - Natural Language Tiles

Starting a document to bridge the gap between SPRITEDX project and MACHI.
At a high level:
- MACHI is a machine-driven virtual world where embodied AI agents live.
- SPRITEDX is a tool that generates sprite-animation assets for such a world.
Let's imagine how they could form a symbiotic relationship. From now on, we will use SDX as shorthand for SPRITEDX.
- MACHI world characters can be generated by SDX.
- Assets in MACHI can be generated using SDX.
- Levels in MACHI can be generated using SDX.
A question arises:
SDX uses ComfyUI under the hood, so you could technically do the same in ComfyUI directly. What is the added utility of SDX?
That's a fair question. SDX's purpose is to solidify the process into a one-click (or one-line) command that generates characters/assets consistent with the settings of a world.
If ComfyUI is a tool that broadens and expands the search domain, SDX does the opposite: it contracts and focuses the search space.
In a coffee analogy, SDX is a distilling machine that puts ingredients through a process that yields a particular flavor with consistency and accuracy.
VISION: The vision of SDX is to be able to generate ANYTHING with world-style-coherency.
World-style-coherency in this context means that the generated assets can be juxtaposed next to each other without feeling disconnected.
The current reality is that SDX can generate characters with coherency, and we believe we can do the same for props and other assets. However, each asset and prop class requires a separately designed template, and template quality often controls the quality and coherency of the output.
Imagine you have a character template with a cartoony pastel tone and a bike template with a vibrant metallic tone. The two juxtaposed will produce an incoherent composition.
The process of making these templates, although very doable, is rather labor-intensive at this time because of trial and error.
WHAT DOES MACHI NEED?
The MACHI project is a 2D block-world simulation. Generated characters will need to be able to move around and do things.
Moving around was the first piece we got from SDX, which is quite efficient at producing such animated sequences. But what about doing things?
In traditional interactive media, "doing things" usually means a set of animation states: picking up an object, attacking, etc.
We can carefully pick these animation states and generate them as part of the SpriteDX pipeline. The ability to do this at scale has often been a company's moat (Ragnarok Online, MapleStory, Dungeon & Fighter, Metal Slug). It is an expensive process, but it builds an effort-driven fortress that gives those companies a hard-earned competitive advantage in the long run. SDX facilitates this process.
In this way, SDX can help MACHI: it will populate the world with living and non-living objects.
Living objects will include intelligent moving beings: player characters, AI characters, pets, etc.
Non-living objects will include the rest: ridables, breakables, throwables, crates, etc.
FOCUS: Because SDX's first, and possibly only, customer for some time is going to be MACHI, we believe the focus should go not into supporting multiple different WORLDS but into supporting ONE SINGLE COHERENT WORLD.
Let's assume that SDX's sole purpose is to facilitate the creation of MACHI. Then the focus shifts from feature completeness to world building.
The question then reduces to what MACHI needs:
In the MACHI universe, we need:
- characters
- shrubs - trees, bushes, grass, etc.
- tiles
- throwables
- ridables
- ability to control a character's poses and re-render (imagine riding a bike)
- 4-directional character movement generator
- 8-directional character movement generator
- crawl animation
- climb animation
- pick-up animation
- world-consistent emoji generator
- switching to a fantasy world (from the Japanese machi vibe)
- tile generator that uses a latent map (mapping natural language to a latent describing that tile) and neighboring tiles to generate believable tile renderings
Out of these, I think the first priority may be the ability to generate tiles.
Without the ground, the characters have nowhere to stand.
Let's make this our first goal: give characters somewhere to stand.
NATURAL LANGUAGE TILE RENDERER
The goal of the NLT renderer is to let us describe in words what a tile should be; it then generates a tile image.
GIVEN:
- Natural Language Tokens
- Neighboring tiles' latent maps.
WE NEED:
- RGBA values of each pixel in the tile.
FIX:
- We fix the tile's dimensions to 64x64. We believe this dimension gives enough expressive power without sacrificing training speed.
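To make the contract concrete, here is a minimal NumPy sketch (names are illustrative, not from the SDX codebase) of what a single fixed-size tile occupies:

```python
import numpy as np

TILE_SIZE = 64  # fixed tile dimension, per the spec above

# One tile: RGBA values for every pixel, the output we need.
tile = np.zeros((TILE_SIZE, TILE_SIZE, 4), dtype=np.uint8)

print(tile.shape)   # (64, 64, 4)
print(tile.nbytes)  # 16384 bytes, i.e. 16 KiB per uncompressed tile
```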
LIMITATION 1 - PART OF A TILE GROUP
Say we want a set of neighboring tiles to represent "Hollywood." Just saying "Hollywood sign" for each tile won't work, because the model will try to make each individual tile its own Hollywood tile rather than a portion of one big HOLLYWOOD sign. This limitation signals the need for a better model architecture.
One potential solution is to generate tiles in larger patches that can take in the whole HOLLYWOOD sign. I think this is a novel idea to study in future rounds, but for now let's fix the scope to local prediction.
We know there are limitations to this design, but let's still work on this LOCALLY SOUND version.
QUESTION: I get that we can generate tiles, but how would you make them consistent with the overall world's style and vibe?
Borrowing from the bike example, if the character uses a pastel tone and the tiles are at full vibrancy, we have an incoherent system.
APPROACH: Even for tile making, we need a way to control the style of the tiles. For characters, we've done FILL-IN-THE-BLANK template inference. We will need something equivalent here. We may not use FITB templates, but we can provide style references as input.
ADDITIONAL INPUT: So the additional input is going to be "style references." Perhaps this could be a template that shows various tile examples.
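Putting the inputs together, the renderer's interface might look like the following sketch. Everything here (`TileQuery`, `render_tile`, the field names) is a hypothetical shape for the contract, not existing SDX code, and the body is a stub standing in for model inference:

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class TileQuery:
    """Inputs to the natural-language tile renderer (illustrative names)."""
    prompt: str                                            # natural-language tokens
    neighbor_latents: dict = field(default_factory=dict)   # "N"/"E"/... -> latent
    style_refs: list = field(default_factory=list)         # example tiles setting the vibe


def render_tile(query: TileQuery) -> np.ndarray:
    """Stub: a real renderer returns a (64, 64, 4) uint8 RGBA tile."""
    # ... model inference conditioned on prompt, neighbors, and style refs ...
    return np.zeros((64, 64, 4), dtype=np.uint8)
```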
The artistic goal is for these generated tiles to be (1) in-context while still remaining (2) unique.
For DATA, we start simple. Manual dataset preparation is as follows:
1. Draw in Photoshop (or outsource, or GENERATE) a 2D map (side-scrolling map).
2. Slice it into uniformly sized blocks.
3. Annotate each tile with natural language.
4. Repeat from step 1 until we have enough.
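Steps 2 and 3 above can be sketched as follows, assuming the map is already loaded as an RGBA array (`slice_map` and `annotate` are illustrative names, not existing tooling):

```python
import numpy as np

TILE = 64  # fixed tile dimension


def slice_map(map_rgba: np.ndarray) -> list:
    """Slice an (H, W, 4) map into uniformly sized 64x64 tiles.

    Returns (row, col, tile) triples; any ragged edge that does not
    divide evenly into 64 pixels is simply dropped.
    """
    h, w, _ = map_rgba.shape
    tiles = []
    for r in range(h // TILE):
        for c in range(w // TILE):
            tiles.append((r, c, map_rgba[r * TILE:(r + 1) * TILE,
                                         c * TILE:(c + 1) * TILE]))
    return tiles


def annotate(tiles: list, captions: list) -> list:
    """Pair each sliced tile with a hand-written natural-language caption."""
    return [{"row": r, "col": c, "tile": t, "caption": cap}
            for (r, c, t), cap in zip(tiles, captions)]
```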
For MODEL, we have a few choices. Largely there are two options.
- Option 1: Use a pre-trained model. We can frame this as an in-painting problem and have the model fill in the blank given a prompt. This is the easiest to implement but also the most compute-heavy. In this case, we would not even need to prepare any data, which means we could also use this method to generate plausible training data.
- Option 2: Build our own custom in-painting model. The model will see the neighboring tiles and a style reference, then draw the tile conditioned on the prompt. This model will probably be significantly faster and lighter, paving the way for real-time applications. Imagine an AI agent updating the natural-language prompt and seeing the changes in real time.
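One way to frame Option 1 is as a masked 3x3 canvas: paste the known neighbors around a blank center, mask the center, and hand canvas, mask, and prompt to any off-the-shelf in-painting model. A sketch of the canvas construction (illustrative, not the actual pipeline):

```python
import numpy as np

TILE = 64  # fixed tile dimension


def build_inpaint_inputs(neighbors: dict):
    """Build the (canvas, mask) pair for an off-the-shelf in-painting model.

    `neighbors` maps compass keys ("N", "NE", "E", ...) to (64, 64, 4)
    RGBA tiles; missing neighbors stay black, and the center cell is
    left blank and marked white in the mask for the model to paint.
    """
    canvas = np.zeros((3 * TILE, 3 * TILE, 4), dtype=np.uint8)
    offsets = {"NW": (0, 0), "N": (0, 1), "NE": (0, 2),
               "W": (1, 0),               "E": (1, 2),
               "SW": (2, 0), "S": (2, 1), "SE": (2, 2)}
    for key, tile in neighbors.items():
        r, c = offsets[key]
        canvas[r * TILE:(r + 1) * TILE, c * TILE:(c + 1) * TILE] = tile
    mask = np.zeros((3 * TILE, 3 * TILE), dtype=np.uint8)
    mask[TILE:2 * TILE, TILE:2 * TILE] = 255  # white = region to in-paint
    return canvas, mask
```

The resulting `canvas` and `mask`, plus the natural-language prompt, would go to whichever pre-trained in-painting model we pick; the 64x64 center crop of its output becomes the new tile.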
Next Steps:
- [ ] We will experiment with OPTION 1 above and see what kind of results we get at what performance.
Appendix:
What does this mean for animation generation for characters?
THIS IS BIG. BUT V2 IDEA.
Instead of generating preset animations (or in addition to generating preset animation states), we generate animations on the fly as intent comes up.
That is, if a character wants to sit down on a chair in the world, instead of generating that motion as a preset, we generate it on the fly when the animation is needed, then keep it in a library of things the character can do.
Another example is smoking. Not every character will smoke, but say this character wants to smoke and has no motion for it.
Depending on the desire level, if it goes above some threshold, an animation-generation event happens and produces the motion. The result gets mapped to an [action:SMOKE] token. Then, whenever the character wants to smoke, it can run that animation sequence.
Also, as a character evolves and gets attention, it will be able to add more animation states to its repository. That is, if the character has an [action:RUN] move, it can learn variants of that move and remember them, perhaps as [action:RUN2] or [action:RUN:alt2] or something.
That way, the character will behave more naturally.
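The threshold-gated generation and token cache described above could be sketched like this (all names and the threshold value are illustrative, and `generate_fn` stands in for the expensive SDX call):

```python
DESIRE_THRESHOLD = 0.7  # illustrative cutoff for triggering generation


class AnimationLibrary:
    """On-the-fly animation cache keyed by [action:*] tokens (sketch)."""

    def __init__(self, generate_fn):
        self._generate = generate_fn  # expensive generator, e.g. an SDX call
        self._cache = {}

    def request(self, token: str, desire: float,
                threshold: float = DESIRE_THRESHOLD):
        """Return a cached animation, generating it only when desire
        crosses the threshold; otherwise return None."""
        if token in self._cache:
            return self._cache[token]
        if desire >= threshold:
            self._cache[token] = self._generate(token)
            return self._cache[token]
        return None  # not motivated enough to pay for generation

    def add_variant(self, base_token: str, variant_token: str):
        """Learn a variant of an existing move, e.g. [action:RUN] -> [action:RUN2]."""
        if base_token in self._cache:
            self._cache[variant_token] = self._generate(variant_token)
```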
- [ ] Let's feed this into Pixel and get her opinion on this.
-- Sprited Dev 🐛


![[WIP] Digital Being - Texture v1](/_next/image?url=https%3A%2F%2Fcdn.hashnode.com%2Fuploads%2Fcovers%2F682665f051e3d254b7cd5062%2F0a0b4f8e-d369-4de0-8d46-ee0d7cc55db2.webp&w=3840&q=75)

