Object Co-segmentation

Example video frames and their object co-segmentation annotations (ground truth) in the Noisy-ViDiSeg[1] dataset. Object segments are outlined in red.

In computer vision, object co-segmentation is a special case of image segmentation, defined as the joint segmentation of semantically similar objects in multiple images or video frames.[2][3]


It is often challenging to extract segmentation masks of a target object from a noisy collection of images or video frames, because the task couples object discovery with segmentation. A noisy collection is one in which the target object appears only sporadically across a set of images, or disappears intermittently throughout the video of interest.
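The coupling of discovery and segmentation can be illustrated with a deliberately simplified sketch. It is not any published method: it assumes each image has already been quantized into integer color labels, treats the most frequent label in an image as background, and takes the non-background label shared by the most images as the common object. The function names `discover_common_label` and `cosegment` are hypothetical.

```python
from collections import Counter


def discover_common_label(images):
    """Return the non-background label that occurs in the most images.

    Assumption: the most frequent label in an image is its background.
    Ties are broken by choosing the smallest label.
    """
    votes = Counter()
    for image in images:
        counts = Counter(label for row in image for label in row)
        background = counts.most_common(1)[0][0]
        for label in counts:
            if label != background:
                votes[label] += 1  # one vote per image containing the label
    if not votes:
        return None
    best = max(votes.values())
    return min(label for label, n in votes.items() if n == best)


def cosegment(images):
    """Jointly segment the discovered label in every image.

    Returns the common label, one boolean mask per image, and a flag per
    image telling whether the object is present at all (noisy images get
    an empty mask and a False flag).
    """
    common = discover_common_label(images)
    masks = [[[label == common for label in row] for row in image]
             for image in images]
    present = [any(any(row) for row in mask) for mask in masks]
    return common, masks, present


images = [
    [[0, 0, 0], [0, 7, 7], [0, 7, 0]],  # object (label 7) present
    [[1, 1, 7], [1, 1, 7], [1, 1, 1]],  # object present, other background
    [[2, 2, 2], [2, 3, 2], [2, 2, 2]],  # noisy image: object absent
    [[7, 0, 0], [0, 0, 0], [0, 0, 0]],  # object present
]
common, masks, present = cosegment(images)
print(common, present)
```

In this toy collection the third image does not contain the shared object, so it receives an empty mask, which mirrors the sporadic presence described above.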

Dynamic Markov Networks-based Methods

The inference process of the two coupled dynamic Markov networks to obtain the joint video object discovery and segmentation[1]
A joint object discovery and co-segmentation framework based on coupled dynamic Markov networks[1]

A joint object discovery and co-segmentation method based on coupled dynamic Markov networks was proposed in 2018[1]; its authors claim significant improvements in robustness against irrelevant or noisy video frames.

CNN and LSTM-based Methods

Overview of the coarse-to-fine temporal action localization in [4]. (a) Coarse localization. Given an untrimmed video, saliency-aware video clips are first generated via variable-length sliding windows. The proposal network decides whether a video clip contains any actions (in which case the clip is added to the candidate set) or pure background (in which case the clip is discarded). The subsequent classification network predicts the specific action class for each candidate clip and outputs the classification scores and action labels. (b) Fine localization. With the classification scores and action labels from the coarse localization, further prediction of the video category is carried out and its starting and ending frames are obtained.
Flowchart of the spatio-temporal action localization detector Segment-tube[4]. The input is an untrimmed video containing frames of multiple actions (e.g., all actions in a pair figure skating video), with only a portion of these frames belonging to a relevant category (e.g., the DeathSpirals). There are usually irrelevant preceding and subsequent actions (background). The Segment-tube detector alternates the optimization of temporal localization and spatial segmentation iteratively. The final output is a sequence of per-frame segmentation masks with precise starting/ending frames, denoted by the red chunk at the bottom, while the background is marked with green chunks at the bottom.
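The clip generation step of the coarse localization stage can be sketched as follows. This is only an illustration under stated assumptions: the window lengths, the overlap ratio, and the scoring function passed to `filter_clips` are hypothetical and are not taken from [4], where a proposal network makes the action/background decision.

```python
def sliding_window_clips(num_frames, window_lengths=(16, 32, 64), overlap=0.75):
    """Enumerate candidate clips as half-open (start, end) frame ranges.

    window_lengths and overlap are illustrative values, not the settings
    of any particular paper.
    """
    clips = []
    for length in window_lengths:
        if length > num_frames:
            continue  # window does not fit in the video
        stride = max(1, int(length * (1 - overlap)))
        for start in range(0, num_frames - length + 1, stride):
            clips.append((start, start + length))
    return clips


def filter_clips(clips, score_fn, threshold=0.5):
    """Keep clips whose score reaches the threshold.

    score_fn is a stand-in for a proposal network: it maps a (start, end)
    clip to a score in [0, 1].
    """
    return [clip for clip in clips if score_fn(clip) >= threshold]


clips = sliding_window_clips(10, window_lengths=(4, 8), overlap=0.5)
print(clips)

# Hypothetical scorer: pretend the action occupies frames 4..9.
candidates = filter_clips(clips, lambda clip: 1.0 if clip[0] >= 4 else 0.0)
print(candidates)
```

Clips rejected by the scorer correspond to the "pure background" clips that the caption describes as discarded.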

In action localization applications, object co-segmentation is also realized by the Segment-tube spatio-temporal detector[4]. Inspired by earlier spatio-temporal action localization work based on tubelets (sequences of bounding boxes), Wang et al. present a spatio-temporal action localization detector, Segment-tube, which consists of sequences of per-frame segmentation masks. The Segment-tube detector can temporally pinpoint the starting and ending frames of each action category in untrimmed videos, even in the presence of preceding or subsequent interfering actions. At the same time, it produces per-frame segmentation masks instead of bounding boxes, which offers better spatial accuracy than tubelets. This is achieved by alternating iteratively between temporal action localization and spatial action segmentation.
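Why a mask can be spatially tighter than a bounding box is easy to see on a small example. The sketch below is purely illustrative (the helper names `bounding_box` and `box_fill_ratio` are hypothetical): it computes the tightest axis-aligned box around a boolean mask and the fraction of that box actually covered by the object.

```python
def bounding_box(mask):
    """Tightest box around the True pixels as (top, left, bottom, right),
    with inclusive coordinates, or None for an empty mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)


def box_fill_ratio(mask):
    """Fraction of the bounding box that belongs to the object."""
    box = bounding_box(mask)
    if box is None:
        return 0.0
    top, left, bottom, right = box
    box_area = (bottom - top + 1) * (right - left + 1)
    mask_area = sum(1 for row in mask for value in row if value)
    return mask_area / box_area


# A thin diagonal object: its box spans the whole 4x4 frame.
diagonal = [[r == c for c in range(4)] for r in range(4)]
print(bounding_box(diagonal), box_fill_ratio(diagonal))
```

For elongated or diagonal shapes, most of the box is background, which is the loss of spatial precision that per-frame masks avoid.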

The Segment-tube detector is illustrated in the flowchart on the right. The sample input is an untrimmed video containing all frames of a pair figure skating video, with only a portion of these frames belonging to a relevant category (e.g., the DeathSpirals). Initialized with saliency-based image segmentation on individual frames, the method first performs a temporal action localization step with a cascaded 3D CNN and LSTM, and pinpoints the starting and ending frames of a target action with a coarse-to-fine strategy. Subsequently, the detector refines the per-frame spatial segmentation with graph cut, focusing on the relevant frames identified by the temporal action localization step. The optimization alternates between temporal action localization and spatial action segmentation in an iterative manner. Upon practical convergence, the final spatio-temporal action localization results are obtained as a sequence of per-frame segmentation masks (bottom row in the flowchart) with precise starting and ending frames.
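The alternating structure described above can be sketched with simple stand-ins. This is a toy under loud assumptions, not the method of [4]: frames are small grids of saliency-like scores, the temporal step (a cascaded 3D CNN and LSTM in the source) is replaced by picking the longest run of frames with enough foreground, and the spatial step (graph cut in the source) is replaced by re-thresholding at the mean score of the relevant frames. All function names and parameters are hypothetical.

```python
def threshold_mask(frame, threshold):
    return [[value > threshold for value in row] for row in frame]


def mask_area_fraction(mask):
    total = sum(len(row) for row in mask)
    return sum(value for row in mask for value in row) / total


def localize(masks, min_area):
    """Temporal stand-in: longest contiguous run of frames whose mask
    covers at least min_area of the frame, as a half-open (start, end)."""
    best, run_start = (0, 0), None
    for i, mask in enumerate(masks + [None]):  # None closes a trailing run
        relevant = mask is not None and mask_area_fraction(mask) >= min_area
        if relevant and run_start is None:
            run_start = i
        elif not relevant and run_start is not None:
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best


def segment(frames, span):
    """Spatial stand-in: re-threshold relevant frames at their mean score;
    frames outside the span get empty masks."""
    start, end = span
    values = [v for frame in frames[start:end] for row in frame for v in row]
    threshold = sum(values) / len(values) if values else float("inf")
    return [threshold_mask(frame, threshold) if start <= i < end
            else [[False] * len(row) for row in frame]
            for i, frame in enumerate(frames)]


def alternate(frames, init_threshold=0.5, min_area=0.25, max_iters=10):
    """Alternate the two steps until span and masks stop changing."""
    masks = [threshold_mask(f, init_threshold) for f in frames]  # initialization
    span = None
    for _ in range(max_iters):
        new_span = localize(masks, min_area)
        new_masks = segment(frames, new_span)
        if new_span == span and new_masks == masks:
            break
        span, masks = new_span, new_masks
    return span, masks


frames = [
    [[0.1, 0.1], [0.1, 0.1]],  # background
    [[0.9, 0.8], [0.1, 0.1]],  # action
    [[0.9, 0.9], [0.2, 0.1]],  # action
    [[0.1, 0.2], [0.1, 0.1]],  # background
]
span, masks = alternate(frames)
print(span)
```

The output has the same shape as the result described above: a start/end frame pair together with one mask per frame, where frames outside the localized span carry empty masks.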

References


  1. Liu, Ziyi; Wang, Le; Hua, Gang; Zhang, Qilin; Niu, Zhenxing; Wu, Ying; Zheng, Nanning (2018). "Joint Video Object Discovery and Segmentation by Coupled Dynamic Markov Networks" (PDF). IEEE Transactions on Image Processing. 27 (12): 5840–5853. doi:10.1109/tip.2018.2859622. ISSN 1057-7149.
  2. Vicente, Sara; Rother, Carsten; Kolmogorov, Vladimir (2011). "Object cosegmentation". IEEE. doi:10.1109/cvpr.2011.5995530. ISBN 978-1-4577-0394-2.
  3. Chen, Ding-Jie; Chen, Hwann-Tzong; Chang, Long-Wen (2012). "Video object cosegmentation". New York, New York, USA: ACM Press. doi:10.1145/2393347.2396317. ISBN 978-1-4503-1089-5.
  4. Wang, Le; Duan, Xuhuan; Zhang, Qilin; Niu, Zhenxing; Hua, Gang; Zheng, Nanning (2018-05-22). "Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation" (PDF). Sensors. MDPI AG. 18 (5): 1657. doi:10.3390/s18051657. ISSN 1424-8220. Material was copied from this source, which is available under a Creative Commons Attribution 4.0 International License.