SpriteDX - Reading BiRefNet Paper

Paper: https://arxiv.org/pdf/2401.03407
Terminologies:
DIS: Dichotomous Image Segmentation
HRSOD: High-Resolution Salient Object Detection, also a dataset.
SOD: Salient Object Detection
COD: Concealed Object Detection
UHRSD: Dataset
HRS10K: Dataset
DIS5K: Dataset
ASPP: Atrous Spatial Pyramid Pooling
Deformable Convolution: Deformable Conv adds learned offsets to each sampling point.
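A minimal 1-D sketch of that sampling idea, assuming linear interpolation for fractional positions (numpy only; real deformable convs are 2-D, use bilinear interpolation, and predict per-location offsets):

```python
import numpy as np

def deform_sample_1d(x, center, offset):
    """Sample x at fractional position center + offset via linear interpolation."""
    pos = center + offset
    lo = int(np.floor(pos))
    frac = pos - lo
    lo = np.clip(lo, 0, len(x) - 1)
    hi = np.clip(lo + 1, 0, len(x) - 1)
    return (1 - frac) * x[lo] + frac * x[hi]

def deform_conv_1d(x, weight, offsets):
    """1-D 'deformable' conv: each tap of the 3-wide kernel reads from
    its nominal position plus a (here fixed, normally learned) offset."""
    out = np.zeros(len(x) - 2)
    for i in range(len(out)):
        taps = [deform_sample_1d(x, i + k, offsets[k]) for k in range(3)]
        out[i] = np.dot(weight, taps)
    return out
```

With all offsets at zero this reduces to an ordinary sliding-window correlation; nonzero offsets let each tap drift off the regular grid.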
Abstract presents BiRefNet for DIS (dichotomous image segmentation) task.
It is composed of an LM (localization module) and an RM (reconstruction module).
They utilize “BiRef”, where hierarchical patches of the image provide the source reference and gradient maps provide the target reference. → Not sure what they are talking about.
Introduction talks about prior works and their observations.
“Non-salient features in image object can be well reflected by obtaining gradient features through derivative operations on the original image.”
In addition, they introduce what they call “GT features” for side supervision.
They name the “incorporation of the image reference and the introduction of both the gradient and GT references as bilateral reference” → Why?
BiRefNet handles DIS task with separate localization and recon.
For LM, they extract hierarchical features from vision transformer backbone, which are combined and squeezed to obtain coarse predictions in low res in deep layers.
For RM, “inward and outward ref” is used…→ “source image” and the “gradient map” are fed into the decoder at different stages.
Instead of resizing the original image to lower res, they keep the original resolution for intact detail features in the inward reference and adaptively crop them into patches for compatibility with the decoding features.
Related Works highlight related works:
HRSOD is basically equivalent to background matting.
Interestingly HRSOD Dataset uses pixelated masks which may be beneficial for us.

Zeng et al. employed a global-local fusion of the multiscale input in their network → A Photoshop analogy: like the user zooming into a small part of the image while the navigator shows a low-res overview with a red box indicating where the user is viewing.
Pyramid Blending is a technique used to lower computational cost.
For challenging problems like COD, image priors like frequency, boundary, gradients, etc., are used as auxiliary guidance to train COD models. → Can’t models just learn these by themselves? It seems so easy for the model to learn, right? Or we could design a focal loss that encourages the model to build those priors early on. But who am I to talk.
For higher resolution, there is some benefit to detecting targets. → not sure what this means…
There was a work (Yin et al.) to do progressive refinement with masked separable attention. → I wish I understood things just by reading these lines but yeah…
High-resolution DIS is a “newly proposed task”, they say, and it focuses more on the complex slender structures of target objects in high-res images.
DIS5K dataset is created to help with those challenges.
Zhou et al. embedded a frequency prior to their DIS network to capture more details. → This is interesting. I guess this is helpful when looking at hair strains or bike pokes or more regular fine structures with patterns.
Pei et al. applied a label decoupling strategy to the DIS task and achieved competitive segmentation performance in the boundary areas of objects. → Wish it said what this decoupling is. There are too many papers to read and I’m only a mere mortal.
Yu et al. used patches of HR images to accelerate the training in a more memory efficient way.
Unlike above, BiRefNet does not compress or resize images.
Yu et al. used Low-Res alpha → High-Res alpha.
BASNet progressively refines its predictions.
CRM continuously aligns the feature maps with the refinement target to aggregate detail features.
ICNet downscales original images and feeds them into the decoder.
Tang et al. cropped patches on the boundary to further refine them.
In LapSRN, Laplacian pyramids are generated to help with image reconstruction at high res.
These models are not guided to focus on certain areas. The BiRefNet folks introduce “gradient supervision” in their outward reference to guide features sensitive to areas with richer fine details. → IMO, this guidance idea sounds very human. It seems to be doing what a human would do. But from the machine’s eye, I don’t think this is really necessary. I do think the progressive-enhancement idea of an overall pass plus a detail pass is great, but do we really need this concept of supervision? I mean, deep learning is great because we don’t have to design that workflow ourselves.
My 2 Cents: BiRefNet creates spatial guidance. I think having the model have temporal room like flow models (ODE) would be better. That is, at small t, they have the broad strokes, then at larger t, they have the finer brush. Brush sizes don’t need to be defined or visible outside the model; the idea is that the model will learn to operate with whatever brush it sees fit at that stage. At small time step t, finer details won’t be penalized much; then at large time step t, finer details will be penalized heavily. Spatial guidance, gradients, and image priors can all be learned by the model; we just need to give it more room to do so effectively.
Localization Module
Transformer Encoder (SWIN) extracts features at different stages. → SWIN comes up very frequently. I probably should read up on it.
I didn’t really understand the jargon here.
ASPP modules? Atrous Spatial Pyramid Pooling.
ASPP runs several convolutions in parallel on same feature map
1×1 conv (local)
3×3 conv with dilation = 6
3×3 conv with dilation = 12
3×3 conv with dilation = 18
Then it concatenates all of them and fuses with a 1×1 conv.
Seems like a technique to give the block a larger, multi-scale receptive field without downsampling the feature map.
Why is this different from using a 5×5 kernel with dilation?
From what “I” can tell, the types of maps a single dilated conv can learn and what the parallel dilated convs can learn are different.
Let’s leave it at that.
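One concrete way to see the difference: the effective receptive field of a dilated k×k conv is k + (k − 1)(d − 1), so each ASPP branch keeps the same 3×3 weight count while covering very different context sizes. A quick sketch:

```python
# Effective receptive field of a dilated (atrous) k×k conv:
# rf = k + (k - 1) * (d - 1), where d is the dilation rate.
def receptive_field(k, d):
    return k + (k - 1) * (d - 1)

# The four parallel ASPP branches described above:
branches = [(1, 1), (3, 6), (3, 12), (3, 18)]
rfs = [receptive_field(k, d) for k, d in branches]
# → [1, 13, 25, 37]: same weight count per 3×3 branch, very different context.
```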
Reconstruction Module
Receptive Field (RF) has been a challenge
To achieve balance, they use reconstruction block (RB) in BiRef block as a replacement for vanilla residual blocks.
In RB, they use deformable convolutions with hierarchical receptive fields
and an adaptive average pooling layer to extract features with RFs of various scales. Then these features extracted with different RFs are fused.
My 2 cent: These SOTA models are cool, and to really test the “capability of our custom model”, I think we need to benchmark it against these models. Only then we will have comparable metrics.
Bilateral Reference
High Res reference is important.
Most segmentation methods use encoder-decoder structure with down-sampling and up-sampling. High res info is lost they argue.
BiRef consists of InRef and OutRef.
InRef adaptively crops.
Images at the original high res are cropped into patches P of a size consistent with the output features of the corresponding decoder stage. These patches are stacked with the original features to be fed into the RM. Existing methods with similar techniques either add I only at the last decoding stage or resize I to make it compatible with the original features in low res.
InRef supplies necessary HR information at every stage.
My Takeaway: My take away is that Cropped HR information is passed as a feature. This gives full picture to deep stages I presume. This is nice. So, the deep stages or deeper layers can utilize it. This seems rather powerful.
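A shape-only sketch of the adaptive cropping idea, assuming a hypothetical 1024×1024 image and a 256×256 decoder stage (the exact patching in the paper may differ):

```python
import numpy as np

# Hypothetical shapes: a 1024×1024 single-channel image and a 256×256
# decoder feature map, so the image splits into a 4×4 grid of 256×256 patches.
img = np.random.rand(1024, 1024)
feat_h = feat_w = 256

gh, gw = img.shape[0] // feat_h, img.shape[1] // feat_w   # grid: 4×4
patches = (img.reshape(gh, feat_h, gw, feat_w)
              .transpose(0, 2, 1, 3)
              .reshape(gh * gw, feat_h, feat_w))
# Each 256×256 patch now matches the decoder feature resolution and can be
# stacked with the decoder features before entering the RM.
```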
Then goes to recon block (RB)
DFConv 1×1
DFConv 3×3
DFConv 7×7
AvgPool
Concat above → Conv 1×1
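A shape-only sketch of that concat-then-fuse step, with the deformable-conv and pooling branches stubbed out as random feature maps (assumed shapes, not the paper's exact ones):

```python
import numpy as np

# Four branch outputs (3 DFConvs + AvgPool), stubbed with random features.
C, H, W = 8, 16, 16
branch_outputs = [np.random.rand(C, H, W) for _ in range(4)]

fused_in = np.concatenate(branch_outputs, axis=0)        # (4C, H, W)
w_1x1 = np.random.rand(C, 4 * C)                         # 1×1 conv weights
fused = np.tensordot(w_1x1, fused_in, axes=([1], [0]))   # back to (C, H, W)
```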
OutRef uses gradient labels.
- Gradient Label in this context is edge maps computed directly from the ground truth segmentation (or image), not learned annotations.
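A minimal sketch of deriving such a gradient label from a toy GT mask, using numpy's finite-difference gradient (the paper may use a different operator, e.g. Sobel):

```python
import numpy as np

# A toy ground-truth mask: a filled square on a blank canvas.
gt = np.zeros((8, 8))
gt[2:6, 2:6] = 1.0

# Gradient label = per-pixel gradient magnitude of the GT mask;
# it is nonzero only along the object boundary.
gy, gx = np.gradient(gt)
grad_label = np.sqrt(gx ** 2 + gy ** 2)
```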
Objective Function
In HR seg, using BCE loss alone usually results in a loss of fine details.
They use a hybrid loss: BCE, IoU, SSIM, CE. → SSIM. What is this actually?
SSIM measures:
Luminance
Contrast
Structure (correlation, pattern)
Takeaway: I should use SSIM.
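A single-window sketch of the SSIM formula (real SSIM averages it over local Gaussian windows rather than computing one global score):

```python
import numpy as np

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM over whole arrays: luminance, contrast, and structure terms
    folded into the standard two-factor form."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical images score 1; the score drops as structure decorrelates.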
L = L_pixel + L_region + L_boundary + L_semantic
L = 30 x L_BCE + 0.5 x L_IOU + 10 x L_SSIM + 5 x L_CE
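A partial sketch of that weighted sum, implementing only the BCE and soft-IoU terms with the weights above (the SSIM and CE terms are omitted; this is an illustrative stand-in, not the paper's code):

```python
import numpy as np

def bce(p, g, eps=1e-7):
    """Binary cross-entropy between prediction p and ground truth g."""
    p = np.clip(p, eps, 1 - eps)
    return -(g * np.log(p) + (1 - g) * np.log(1 - p)).mean()

def soft_iou_loss(p, g, eps=1e-7):
    """1 - soft IoU, differentiable for continuous predictions."""
    inter = (p * g).sum()
    union = (p + g - p * g).sum()
    return 1 - (inter + eps) / (union + eps)

def hybrid_loss(p, g):
    # Weights follow the equation above; SSIM and CE terms omitted here.
    return 30 * bce(p, g) + 0.5 * soft_iou_loss(p, g)
```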
My 2 Cents: They use a semantic loss (L_CE). What does this even mean in the context of DIS? And why do they call it semantic loss?
Do they perhaps categorize whether something is fine-grain vs coarse-grain?
Feature Idea: Given a picture, we can repeat these processes.
detect closest-item-to-camera segmentation then separate as layers.
on the bottom layer, inpaint.
then detect closest-item-to-camera segmentation. If there is nothing worthwhile, stop. If detected, then separate layers.
then on new bottom layer inpaint.
then repeat the process.
→ guess it has a name: “layer-peel”
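The steps above, sketched as Python pseudocode; `segment_closest` and `inpaint` are hypothetical stand-ins for real models (e.g. a DIS model and an inpainter):

```python
# Hypothetical layer-peel loop; the two callables are assumptions, not real APIs.
def layer_peel(image, segment_closest, inpaint, max_layers=8):
    layers = []
    canvas = image
    for _ in range(max_layers):
        mask = segment_closest(canvas)
        if mask is None:                    # nothing worthwhile detected → stop
            break
        layers.append((canvas, mask))       # peel the closest object off as a layer
        canvas = inpaint(canvas, mask)      # fill the hole on the new bottom layer
    return layers, canvas
```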
I think we should train a sample model. This is probably a big research area.
Training
LM converges quickly, they say (200 epochs).
400 epochs for fine parts, but too much compute.
multi-stage supervision can accelerate training (70% fewer epochs)
Fine tuning with only region-level losses can easily improve the practical use. → ?
My 2 cents: In my mind, I think they are providing too much information to the model. Does the model really need all that information? Feels like too much extra redundancy, making it a little heavy-handed.
Test Sets:
DIS5K
HRSOD
DUTS-TE
DUT-OMRON
Eval Metrics
S-measure: structure measure (region- and object-aware structural similarity; not the same as SSIM)
F-measure: harmonic mean of precision and recall
E-measure: enhanced-alignment measure
MAE: mean abs error
HCE: human correction efforts → how much manual editing a prediction would need to fix. Oh, this one is fun. Let’s use it.
Implementation Details
All images are resized to 1024×1024 (bilinear)
Horizontal flip is the only data augmentation used in training process → Ooooooooops. Poop alert!!!
Number of categories C is set to 219 as given in DIS-TR → hmm, perhaps DIS-TR has category info?
training DIS/HRSOD/COD task for 600/150/150 epochs. → so trains more stuff.
the model is finetuned over IoU loss for the last 20 epochs. → why only at the end. hmm, why don’t they just adjust the weights.
It was trained on 8 A100 GPUs with batch size 4… → seems expensive for what it does. (320 GB)
Ablation Study
An ablation study is a controlled experiment where you remove or disable parts of a model/system one at a time to see which components actually matter.
Hey, ran out of time. I will continue this later.
— Sprited Dev 🐛



