Stage 1 - Prompt Engineering Experiment

We recently had promising results when we used an XML-formatted scene graph to define the multi-shot animation prompt for Seedance 1 Pro (blog post).

Today, I just want to see if a similar thing can be done for Stage 1, character reference generation.

Stage 1 generates characters using the FLUX.1 Fill [pro] model, which takes in a masked image and a prompt that describes what goes into the masked region.

Since the non-masked parts already provide the style, color scheme, and overall look and feel, there is no need to put much extra into this prompt when it comes to style and scale.

However, it would be a nice experiment to see what we can do inside this prompt box to allow for more controlled generation. In the past, we had some issues when trying to fill more than one mask, so we have been using just one mask box to keep things controllable. If we figure out how to make the prompt more effective, we may be able to generate more than one sample in a single inference.
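For reference, here is a minimal sketch of what "one mask box" means in practice. It assumes the common convention that 255 marks the region to fill and 0 marks pixels to keep. The function name `make_box_mask` and the tiny 8x4 size are made up for illustration and are not taken from our pipeline.

```python
def make_box_mask(width, height, box):
    """Return a height x width grid: 255 inside the box (to be filled), 0 elsewhere.

    `box` is (left, top, right, bottom) in pixels, with right/bottom exclusive.
    The 255-means-fill convention is an assumption, not something FLUX mandates.
    """
    left, top, right, bottom = box
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("box must lie inside the image")
    return [
        [255 if left <= x < right and top <= y < bottom else 0 for x in range(width)]
        for y in range(height)
    ]


# A tiny 8x4 mask with a single 4x2 box, printed as text for inspection.
mask = make_box_mask(8, 4, (2, 1, 6, 3))
for row in mask:
    print("".join("#" if value else "." for value in row))
```

Filling more than one region would mean more than one box in the same grid, which is the case that gave us trouble before.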

(Let’s time-box this research to 1 hour. I’ve got midterms to prepare for.)

Goal

The goal is to make the generation more controllable and consistent.

We also want an indirect side effect of higher-quality (i.e. more visually aesthetic) generations.

Baseline

The current prompt is rudimentary. For all these experiments, I’m using a fixed seed of 42.

Prompt: Girl character holding a bat

It produces, yeah, a character holding a bat, but not really what I want. I want a girl character holding a bat over her shoulder, wearing a baseball cap and a uniform with some number on it.

Trial #1

Prompt: girl character that is holding a bat over her shoulder and wearing a baseball cap and uniform with some number sign on

That’s better, but I want the uniform to be white and the number to be red. And the character is not really holding the bat over her shoulder.

Trial #2

Prompt: girl character that is holding a baseball bat over her shoulder and wearing a baseball cap and a white baseball uniform with some number sign.

Yup, that’s closer to what I want. But something is missing. It feels like I am asking a character to put on a costume. But the character looks very unhappy. Perhaps she didn’t have breakfast in the morning.

Trial #3

Prompt: cheerful girl character that is holding a baseball bat over her shoulder and wearing a baseball cap and a white baseball uniform with some number sign.

Okay, now she has a little bit of a smirk there. Yup. I’m fixing the seed, so the generations tend to be very similar, with only minor changes from one prompt to the next.

Problem

The main problem is that unless we tell the model to do something, it won’t figure out what we want on its own.

That is, if I just type in “a girl holding a bat,” while imagining “a cheerful girl wearing a white baseball uniform with a number sign holding a bat,” the model won’t really get that latter part.

Is this a problem? Yes and no. It is, in a way, a feature: if the user does not provide information, the model fills in the gaps by sampling from whatever distribution it has learned.

However, when that extra information is not provided, much of what gets drawn may end up being controlled largely by the seed. Let’s vary the seed and see what happens.

Varying Seed Numbers

Let’s go back to the baseline prompt.

Prompt: Girl character holding a bat
Seed: 1

Prompt: Girl character holding a bat
Seed: 2

Notice that incrementing the seed has an interesting effect: the difference is significant, but some characteristics carry over, like the bat placement. In theory, the results should be totally different because the initial latent noise generated from Seed 1 should be independent of the noise from Seed 2. Perhaps this is just a coincidence.

Prompt: Girl character holding a bat
Seed: 3

Again, it is very interesting that the color pink is kept between Seed 2 and Seed 3, but yeah, probably another coincidence.

Prompt: Girl character holding a bat
Seed: 4

Prompt: Girl character holding a bat
Seed: 5

Prompt: Girl character holding a bat
Seed: 49159231
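To sanity-check the "adjacent seeds should be independent" intuition, here is a rough sketch that compares the starting noise produced by two seeds. It uses Python's `random` module as a stand-in for the sampler's real noise generator, which is an assumption: the actual pipeline draws its latents elsewhere, so this only illustrates the idea that seed 1 and seed 2 are no closer to each other than any other pair of seeds.

```python
import random


def initial_noise(seed, n=10_000):
    """Draw n Gaussian samples from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def correlation(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


same = correlation(initial_noise(1), initial_noise(1))
adjacent = correlation(initial_noise(1), initial_noise(2))
print(f"seed 1 vs seed 1: {same:.3f}")
print(f"seed 1 vs seed 2: {adjacent:.3f}")
```

If the noise really is uncorrelated, the shared bat placement and the shared pink are more likely coming from the prompt and the unmasked context than from the seeds being neighbors.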

Analysis

In almost all of the cases, I wasn’t able to get a girl in a uniform with a baseball cap. Why is that?

If I recall correctly, FLUX models do not support negative prompts. Does that have anything to do with this? I think it does. I think if something is not mentioned in the prompt, the model assumes it is not there.

If that is the case, we need to describe every element that goes into the picture.

BFL Recommendations

Let’s first study what BFL recommends. BFL recommends the following recipe.

Subject + Action + Style + Context

Subject: The main focus (person, object, character)

Action: What the subject is doing or their pose

Style: Artistic approach, medium, or aesthetic

Context: Setting, lighting, time, mood, or atmospheric conditions

Structured descriptions beat keyword lists

FLUX responds best to structured descriptions that mix natural relationships with direct specifications.

Disconnected keywords (weak): “Woman, red dress, beach, sunset, happy, smiling, waves, golden light”

Overwritten prose (bloated): “A joyful woman … warm sunset light illuminating her smile”

Structured (best): “A joyful woman in a flowing red dress walks along a sandy beach, golden hour, gentle waves, warm lighting”

Takeaways:

Word order: put the information in the order Subject, Action, Style, Context.
Length: 30-80 words is the sweet spot.
Front-load: put what matters first.
Focus: describe what you want rather than what you don’t want (no negative prompts like “no …”).
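The takeaways above can be sketched as a tiny prompt builder. This is only an illustration of the ordering and length advice. The helper names `build_prompt` and `word_count_ok` are made up here, and the comma-joined output is my own choice rather than anything BFL prescribes.

```python
def build_prompt(subject, action, style="", context=""):
    """Join the parts in Subject, Action, Style, Context order, skipping empty ones."""
    parts = [subject, action, style, context]
    return ", ".join(part.strip() for part in parts if part and part.strip())


def word_count_ok(prompt, low=30, high=80):
    """Check the prompt against the 30-80 word sweet spot."""
    return low <= len(prompt.split()) <= high


subject = "A cheerful girl character in a baseball cap and a white baseball uniform with a red number"
action = "holding a baseball bat over her shoulder"

# Style and context are left empty because the unmasked region already carries them.
prompt = build_prompt(subject, action)
print(prompt)
print("words:", len(prompt.split()), "| within 30-80:", word_count_ok(prompt))
```

Since Stage 1 leans on the unmasked image for style, our prompts will tend to land on the short side of that range.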

What does it mean for us?

Looks like BFL really loves natural language with a good ordering. So, XML-style prompting may not work very well in this case.
Using things like “no wing” should be discouraged since they can have an adverse impact.

XML Prompting?

Let’s try the XML-style prompting we did for Stage 2 and see if it has any positive impact. According to BFL, this shouldn’t really work because BFL recommends natural-language prompts.

Iteration #1

<Scene>
  <Character
    name="Eliana"
    action="stand"
    costume="white-baseball-uniform-with-cap"
    emotion="cheerful"
    alt="character is standing"
    style="shadow: none;"
  >
    <bat anchor="right-hand">
  </character>
</Scene>

For a first trial, I’d say it works surprisingly well.

In a way, this goes against what BFL recommends on their website. They recommend using natural-language prompts that describe the essence rather than a structured scene graph like this.

Why do you think it works? Well, modern diffusion models first run the prompt through text encoders to get a text embedding. These text encoders, like CLIP and T5, are trained on massive amounts of web data, and a lot of web data is structured text like XML, HTML and JSON. So, even though FLUX was not trained on the XML scene-graph notation we are using here, the encoders may still be able to encode the scene graph effectively.

Let’s iterate and refine the XML scene graph.

Iteration #2

<scene>
  <character
    name="Eliana"
    action="standing, holding bat over shoulder"
    costume="whate baseball uniform with red number 6 and white cap"
    emotion="cheerful"
    description="character is standing holding her bat on the right shoulder"
    style="shadow: none;"
  >
    <item type="bat" anchor="right shoulder" material="wood" />
  </character>
</Scene>
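Hand-typed scene graphs are easy to get slightly wrong (mismatched tag case, unclosed elements), so here is a sketch of building the same kind of prompt with Python's standard `xml.etree.ElementTree`, which always emits balanced tags. The function name `build_scene_prompt` and the attribute values are illustrative. Also note that ElementTree puts all attributes on one line, unlike the hand-written layout above, and `ET.indent` needs Python 3.9 or newer.

```python
import xml.etree.ElementTree as ET


def build_scene_prompt(character_attrs, items):
    """Build a <scene><character ...><item .../></character></scene> prompt string."""
    scene = ET.Element("scene")
    character = ET.SubElement(scene, "character", character_attrs)
    for item_attrs in items:
        ET.SubElement(character, "item", item_attrs)
    ET.indent(scene, space="  ")  # pretty-print; available from Python 3.9
    return ET.tostring(scene, encoding="unicode")


# Illustrative values modeled on the iteration above.
scene_prompt = build_scene_prompt(
    {
        "name": "Eliana",
        "action": "standing, holding bat over shoulder",
        "costume": "white baseball uniform with red number 6 and white cap",
        "emotion": "cheerful",
        "style": "shadow: none;",
    },
    [{"type": "bat", "anchor": "right shoulder", "material": "wood"}],
)
print(scene_prompt)
```

Whether the model cares about well-formed XML at all is an open question. The point is only to remove accidental variation between iterations.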

I’m trying this in ComfyUI with the FLUX.1 Fill [dev] model to reduce spending.

Parameters Used:

Seed: 42
Guidance: 60
Steps: 20
Sampler: Euler
Scheduler: Normal
Denoise: 1.00

This didn’t work very well. We are seeing lots of artifacts. The low number of steps seems to be a problem. Let’s adjust the settings a bit.

Seed: 42
Guidance: 60
Steps: 50
Sampler: Euler
Scheduler: Normal
Denoise: 1.00

It gives a crisper image, but still the same malformed composition.

It’s almost like the inpainting is fighting with the prompt description, and the two just end up not agreeing with each other.

Let’s reduce the guidance since it seems to be set too high, keeping the steps at 20 for performance.

Seed: 42
Guidance: 20
Steps: 20
Sampler: Euler
Scheduler: Normal
Denoise: 1.00

That’s much better. Let’s try different guidance values to see their impact.

Guidance: 0

A guidance of 0 actually gives malformed output. This is quite intriguing, since I thought a guidance of zero meant the prompt has no impact, so I expected to see a fill that simply ignores the prompt. I guess that may not be the case.

Guidance: 100

A guidance of 100 produces something very similar to a guidance of 20. Not quite sure what is going on.
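To keep track of which knob changed between runs, the settings tried so far can be written down as a small record, with every run derived from the first one. This is just a bookkeeping sketch. `SamplerSettings` and `guidance_sweep` are hypothetical names, not part of ComfyUI, and the values are only the ones mentioned in this post.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SamplerSettings:
    """One run's sampler parameters, defaulting to the first attempt above."""
    seed: int = 42
    guidance: float = 60.0
    steps: int = 20
    sampler: str = "euler"
    scheduler: str = "normal"
    denoise: float = 1.0


def guidance_sweep(base, values):
    """Return copies of `base` that differ only in guidance."""
    return [replace(base, guidance=float(value)) for value in values]


first_run = SamplerSettings()
more_steps = replace(first_run, steps=50)           # crisper, same composition
lower_guidance = replace(first_run, guidance=20.0)  # the run that looked much better

sweep = guidance_sweep(lower_guidance, [0, 20, 60, 100])
for settings in sweep:
    print(settings)
```

Deriving each run with `replace` makes it obvious that only one parameter moved, which matters when the effect of guidance is as non-obvious as it is here.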

Here are some more generations varying the guidance value.

The impact of guidance seems rather peculiar. It wraps around, almost like a cosine wave…

ChatGPT says FLUX predicts two fields, unconditional and text-conditioned, and the sampler mixes them.

With a low guidance (~0), the prompt barely influences the masked region.
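For reference, the textbook classifier-free guidance (CFG) mixing rule looks like the sketch below. Whether FLUX.1 Fill actually applies guidance in this form is an assumption I have not confirmed. The two-element lists stand in for the model's unconditional and text-conditioned predictions, and `cfg_mix` is a made-up name.

```python
def cfg_mix(uncond, cond, scale):
    """Textbook CFG: start at the unconditional prediction and move toward
    (or past) the conditioned one by `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]  # toy prediction without the prompt
cond = [1.0, 3.0]    # toy prediction with the prompt

for scale in (0.0, 1.0, 2.0):
    print(scale, cfg_mix(uncond, cond, scale))
# 0.0 [0.0, 1.0]  -> prompt ignored
# 1.0 [1.0, 3.0]  -> plain conditioned prediction
# 2.0 [2.0, 5.0]  -> pushed further in the prompt's direction
```

Under this rule the effect of the scale is linear, so by itself it would not explain the wrap-around behavior seen above. That is one more reason to check how guidance is really wired into this model.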

Let’s pause here since I’m way over my time budget on this research. Here are a few takeaways:

XML scene-graph prompting seems to work. We should continue this investigation.
Guidance somehow shows periodic behavior. We need to figure out the best value for our use cases. We can probably look into what CFG mixing is and how it works under the hood.

—Sprited Dev 🌱