SpriteDX - Shot Detection 2

We generate multi-shot animation using Seedance 1 Pro, and it results in a video file that concatenates multiple shots. Since there is no information about the shot boundaries (i.e. where the shot ends and starts), we need to detect the shots ourselves.
So far, we have used PySceneDetect to split the shots:
- content-aware scene detection (detect-content): uses differences in HSL colorspace combined with filtering to detect shot changes (fast cut)
With reasonable parameters this usually yields the expected k=3 shots, but sometimes it yields 2 or 4 shots because of either:
- Generation Error: the Seedance 1 Pro model produced more or fewer than 3 shots.
- Detection Error: PySceneDetect under- or over-detected shots.
When we run into these situations, we end up with erroneous outputs.
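As a rough sketch of the current pipeline, the snippet below runs PySceneDetect's content detector and compares the number of detected shots with the expected count. The helper names `detect_shots` and `check_shot_count`, and the threshold values, are assumptions for illustration and not the project's actual settings.

```python
def detect_shots(video_path, threshold=27.0, min_scene_len=15):
    """Return a list of (start_frame, end_frame) pairs found by PySceneDetect."""
    # Imported lazily so the helper below works without PySceneDetect installed.
    from scenedetect import detect, ContentDetector

    detector = ContentDetector(threshold=threshold, min_scene_len=min_scene_len)
    scenes = detect(video_path, detector)
    return [(start.get_frames(), end.get_frames()) for start, end in scenes]


def check_shot_count(shots, expected=3):
    """Classify the detection result against the expected number of shots."""
    if len(shots) == expected:
        return "ok"
    return "too_few" if len(shots) < expected else "too_many"
```

Note that a mismatch only says something went wrong. The count alone cannot tell a Generation Error from a Detection Error.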
Problem Definition
Given:
- A set of shot labels S = {s_1, s_2, …, s_k} (e.g. {Greet, Idle, Run}, k=3)
- A video represented as a sequence of frames F = (f_1, f_2, …, f_n)
- A boundary score function b(t): the score for a cut between frames t and t+1.
- A semantic score matrix M \in R^{n \times k}, where M_{t,i} = p(s_i | f_t, desc(s_i)). That is, given the frame and a descriptor/prompt desc(s_i) for shot i, how well does f_t “belong” to shot s_i.
Constraints / Assumptions:
- Piecewise constant labeling: The frame labels y_t \in S change only at shot boundaries. That is, between boundaries, all frames share the same label.
- Exactly k-1 transitions: There are precisely k-1 boundary points dividing the frames into k contiguous segments. This assumes that there is no Generation Error as discussed previously.
- Fixed label ordering: The labels must appear in a predetermined order (e.g. s_1 → s_2 → … → s_k).
- Minimum / maximum lengths: Each segment must have at least l_min and at most l_max frames (to avoid degenerate splits).
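The constraints above can be checked mechanically for any candidate labeling. The following is a minimal sketch, assuming labels are encoded as integer indices 0..k-1 in their required order. The function name `is_valid_labeling` is hypothetical.

```python
def is_valid_labeling(y, k, l_min=1, l_max=None):
    """Check a frame labeling y (label indices 0..k-1) against the constraints."""
    if not y:
        return False
    # Run-length encode: a piecewise-constant labeling becomes a list of runs.
    runs = []
    for label in y:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1
        else:
            runs.append([label, 1])
    # Exactly k-1 transitions, in the fixed order 0, 1, ..., k-1.
    if [label for label, _ in runs] != list(range(k)):
        return False
    # Every segment length must lie in [l_min, l_max].
    upper = len(y) if l_max is None else l_max
    return all(l_min <= length <= upper for _, length in runs)
```

For example, `is_valid_labeling([0, 0, 1, 1, 2, 2], k=3, l_min=2)` holds, while a labeling that skips or reorders a label does not.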
Objective / Score to Optimize:
Define a global score/objective for any valid labeling y = (y_1, …, y_n):
$$Score(y) = \sum_{t=1}^{n} Emission(t, y_t) + \lambda \sum_{t=1}^{n-1} \mathbf{1}\left[y_t \neq y_{t+1}\right] b(t) - Penalty(y)$$
- Emission(t,i) is a (log) likelihood or score that frame f_t has label s_i. We can choose Emission(t,i) = log M_{t,i} or a more complex score combining embedding distances, etc.
- The boundary bonus term rewards placing a transition at t if b(t) is high, i.e. if it aligns with a strong visual cut.
- λ (lambda) is a weight balancing how much you trust the boundary signal vs. the semantic signal.
- Penalty(y) encodes constraints / regularization:
  - Transition constraints (e.g. disallow going backwards, skipping labels, or extra transitions).
  - Length penalties (if segments are too short or too long).
  - Possibly a cost or penalty for “fuzzy” transitions (if transitions occur in low b(t) zones).
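To make the objective concrete, here is a small sketch that evaluates Score(y) for a given labeling. It uses 0-based indices, and the penalty form (a fixed cost per segment whose length falls outside [l_min, l_max]) is an assumption chosen for illustration. Transition-order constraints are not covered by this penalty.

```python
def score(y, emission, b, lam=1.0, l_min=1, l_max=None, length_penalty=10.0):
    """Score(y) = sum of emissions + lam * boundary bonus - Penalty(y).

    y        : list of label indices, one per frame (length n)
    emission : emission[t][i], score of frame t under label i (e.g. log M[t][i])
    b        : b[t], boundary score for a cut between frames t and t+1 (length n-1)
    """
    n = len(y)
    total = sum(emission[t][y[t]] for t in range(n))
    # Boundary bonus: collect b(t) wherever the label changes.
    cuts = [t for t in range(n - 1) if y[t] != y[t + 1]]
    total += lam * sum(b[t] for t in cuts)
    # Penalty (hypothetical form): fixed cost per segment outside [l_min, l_max].
    bounds = [0] + [t + 1 for t in cuts] + [n]
    lengths = [bounds[j + 1] - bounds[j] for j in range(len(bounds) - 1)]
    upper = n if l_max is None else l_max
    violations = sum(1 for length in lengths if not l_min <= length <= upper)
    return total - length_penalty * violations
```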
The goal is:
$$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} Score(\mathbf{y})$$
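If the ordering and length constraints are treated as hard constraints rather than penalties, this argmax can be found exactly with dynamic programming over segment end points. The sketch below is one possible solver under that assumption, not necessarily the approach this series will end up using. The name `best_segmentation` is hypothetical, indices are 0-based, and l_min is assumed to be at least 1.

```python
def best_segmentation(emission, b, k, lam=1.0, l_min=1, l_max=None):
    """Return (score, cuts) maximizing emissions + lam * boundary bonus.

    cuts[j] is the index of the first frame of segment j+1, so the labeling is
    label 0 on frames [0, cuts[0]), label 1 on [cuts[0], cuts[1]), and so on.
    """
    n = len(emission)
    l_max = n if l_max is None else l_max
    NEG = float("-inf")
    # pre[i][t] = sum of emission[u][i] for u < t, so segment sums are O(1).
    pre = [[0.0] * (n + 1) for _ in range(k)]
    for i in range(k):
        for t in range(n):
            pre[i][t + 1] = pre[i][t] + emission[t][i]
    # best[j][t]: best score for frames [0, t) split into segments 0..j.
    best = [[NEG] * (n + 1) for _ in range(k)]
    back = [[-1] * (n + 1) for _ in range(k)]
    for t in range(l_min, min(l_max, n) + 1):
        best[0][t] = pre[0][t]
    for j in range(1, k):
        for t in range(1, n + 1):
            # Segment j covers frames [s, t); the cut sits between s-1 and s.
            for s in range(max(1, t - l_max), t - l_min + 1):
                if best[j - 1][s] == NEG:
                    continue
                cand = best[j - 1][s] + lam * b[s - 1] + pre[j][t] - pre[j][s]
                if cand > best[j][t]:
                    best[j][t], back[j][t] = cand, s
    if best[k - 1][n] == NEG:
        return NEG, None  # no labeling satisfies the length constraints
    cuts, t = [], n
    for j in range(k - 1, 0, -1):
        t = back[j][t]
        cuts.append(t)
    return best[k - 1][n], cuts[::-1]
```

The running time is O(k · n · l_max), which is small for a clip of a few hundred frames and k=3.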
Interestingly, this looks very much like a Hidden Markov Model (HMM), which I just studied in an NLP course. Let’s formulate it.
Problem Formulation as an HMM
Observations and States
Frames: F = (f_1, …, f_n) — per-frame visual features or logits.
Shot labels (hidden states): S = {s_1, …, s_k, s_STOP} (e.g. Greet, Idle, Run, STOP)
At time t, the hidden state is y_t \in S.
Our goal is to infer the most likely state sequence that segments the video into contiguous shots.
HMM Components
An HMM is defined by:
Initial distribution: pi_i = P(y_1 = s_i)
Transition matrix: A_ij = P(y_t = s_j | y_{t-1} = s_i)
Emission model: B_t(i) = P(f_t | y_t = s_i).
What are we maximizing?
We want the most likely state path that ends in STOP, given the frames. Equivalently, we maximize the joint log-likelihood of the observations and a STOP-terminated path.
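For a standard HMM, this maximization is done with the Viterbi algorithm. Below is a log-space sketch that also adds the final transition into STOP. The argument `log_stop` (log probability of moving to STOP from each state) and the function name are assumptions, since the transition probabilities have not been defined yet in this post.

```python
import math

NEG = -math.inf  # log of probability zero


def viterbi_with_stop(log_pi, log_A, log_B, log_stop):
    """Most likely state path that ends by transitioning into STOP.

    log_pi[i]   : log P(y_1 = s_i)
    log_A[i][j] : log P(y_t = s_j | y_{t-1} = s_i) over the k real states
    log_B[t][i] : log P(f_t | y_t = s_i)
    log_stop[i] : log P(STOP | y_n = s_i)
    """
    n, k = len(log_B), len(log_pi)
    delta = [[log_pi[i] + log_B[0][i] for i in range(k)]]
    back = []
    for t in range(1, n):
        row, ptr = [], []
        for j in range(k):
            # Best predecessor for state j at time t.
            i_best = max(range(k), key=lambda i: delta[-1][i] + log_A[i][j])
            row.append(delta[-1][i_best] + log_A[i_best][j] + log_B[t][j])
            ptr.append(i_best)
        delta.append(row)
        back.append(ptr)
    # Close the path with the transition into STOP.
    last = max(range(k), key=lambda i: delta[-1][i] + log_stop[i])
    best = delta[-1][last] + log_stop[last]
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return best, path[::-1]
```

Setting `log_A[i][j] = NEG` for every backward or skipping transition gives a left-to-right model, which enforces the fixed label ordering.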
Feature Extraction
We will use CLIP embedding.
f_i = CLIP_encode_image(frame_i)
Emission
For each label, we need log emission probabilities.
log P(f_i | Greet)
Let’s just formulate P(f_i | Greet):
It is the likelihood that the observed frame f_i would be generated if the hidden state at that time were Greet.
f_i = normalized CLIP image embedding for frame i
d(s_j) = normalized CLIP text embedding for description of state s_j
Then the similarity score is
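sim(f_i, d(s_j)) = (f_i dot d(s_j)) / (||f_i|| ||d(s_j)||)
Since both embeddings are already normalized, the denominator is 1 and the similarity reduces to the dot product.
Then log P(f_i | s_j) = (1 / tau) * sim(f_i, d(s_j)) + const, where tau is a temperature that controls how sharply the scores act as probabilities.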
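A minimal sketch of this emission score, using plain Python lists in place of real CLIP embeddings. The function names and the default `tau=0.07` are assumptions for illustration. The additive constant is dropped because it contributes the same amount to every candidate path and so does not change the argmax.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_emissions(frame_embs, text_embs, tau=0.07):
    """Return scores[t][j] = sim(f_t, d(s_j)) / tau (log P(f_t | s_j) up to a constant)."""
    return [[cosine(f, d) / tau for d in text_embs] for f in frame_embs]
```

A smaller tau widens the gap between the best-matching label and the rest, so the semantic signal weighs more heavily against the transition terms.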
This is as far as I’ve gotten in formulating the problem. So far we have defined the emission probabilities. Next time, I will define the transition probabilities and probably hand-tune them.
— Sprited Dev 🌱




