Is Temporal Continuity Sufficient for Associations
Video Segmentation
A. Murat Tekalp , in The Essential Guide to Video Processing, 2009
6.3.2 Temporal Integration
An important consideration is to add memory to the motion detection process to ensure both spatial and temporal continuity of the changed regions at each frame. This can be achieved in a number of different ways, including temporal filtering (integration) of the intensity values across multiple frames before thresholding and postprocessing of labels after thresholding.
A variation of the successive FD and normalized FD is the FD with memory, FDM_k(x), which is defined as the difference between the present frame s(x, k) and a weighted average of past frames, given by [24]

(6.4) FDM_k(x) = s(x, k) − s̄(x, k − 1)

where

(6.5) s̄(x, k) = α s(x, k) + (1 − α) s̄(x, k − 1)

and s̄(x, 0) = s(x, 0). Here, 0 < α < 1 is a constant. After processing a few frames, the unchanged regions in s̄(x, k) maintain their sharpness with a reduced level of noise, whereas the changed regions are blurred. The function FDM_k(x) is thresholded either by a global or a spatially adaptive threshold, as in the case of two-frame methods. The temporal integration increases the likelihood of eliminating spurious labels, thus resulting in spatially contiguous regions.
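A minimal sketch of this scheme in Python/NumPy, assuming grayscale frames; the values of α and of the global threshold are illustrative assumptions, not from the chapter:

import numpy as np

def fdm_change_masks(frames, alpha=0.8, threshold=25.0):
    """frames: iterable of 2D grayscale arrays s(x, k); 0 < alpha < 1."""
    frames = iter(frames)
    s_bar = next(frames).astype(np.float64)        # s_bar(x, 0) = s(x, 0)
    masks = []
    for frame in frames:
        s = frame.astype(np.float64)
        fdm = s - s_bar                            # Eq. (6.4): FD with memory
        masks.append(np.abs(fdm) > threshold)      # global thresholding of |FDM_k(x)|
        s_bar = alpha * s + (1.0 - alpha) * s_bar  # Eq. (6.5): recursive weighted average
    return masks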
Accumulative differences can be employed when detecting changes between a sequence of images and a fixed reference image (as opposed to successive frame differences). Let s(x, k), s(x, k − 1), …, s(x, k − N) be a sequence of N + 1 frames, and let s(x, r) be a reference image. An accumulative difference image is formed by comparing every frame in the sequence with this reference image. For every pixel location, the accumulative image is incremented if the difference between the reference image and the current image in the sequence at that pixel location exceeds a threshold. Thus, pixels with higher counter values are more likely to correspond to changed regions.
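A sketch of the accumulative difference image in the same style (the threshold value is again an illustrative assumption):

import numpy as np

def accumulative_difference(frames, reference, threshold=20.0):
    """Per-pixel count of frames that differ from the reference image s(x, r)."""
    acc = np.zeros(reference.shape, dtype=np.int32)
    ref = reference.astype(np.float64)
    for frame in frames:
        diff = np.abs(frame.astype(np.float64) - ref)
        acc += diff > threshold          # increment counter where difference is large
    return acc                           # high counts -> likely changed pixels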
An alternative procedure that was adopted by MPEG-4 as a non-normative tool considers postprocessing of labels [25]. First, scene changes are detected. Within each scene (shot), an initial change detection mask is estimated between successive pairs of frames by global thresholding of the FD function. Next, the boundary of the changed regions is smoothed by a relaxation method using local adaptive thresholds [22]. Then, memory is incorporated by relabeling unchanged pixels which correspond to changed locations in one of the last L frames. This step ensures temporal continuity of changed regions from frame to frame. The depth of the memory L may be adapted to scene content to limit error propagation. Finally, postprocessing to obtain the final changed and unchanged masks eliminates small regions.
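The memory step of this post-processing chain can be sketched as follows: a pixel is labeled changed if it was changed in any of the last L masks (L = 4 is an arbitrary choice here, and this isolates just one step of the non-normative tool):

import numpy as np
from collections import deque

def relabel_with_memory(masks, L=4):
    """masks: per-frame boolean change masks; returns masks with L-frame memory."""
    history = deque(maxlen=L)            # sliding window over the last L masks
    out = []
    for mask in masks:
        history.append(mask)
        out.append(np.logical_or.reduce(list(history)))
    return out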
Read full chapter
URL:
https://www.sciencedirect.com/science/article/pii/B9780123744562000074
Face Recognition from Video
Shaohua Kevin Zhou , ... Gaurav Aggarwal , in The Essential Guide to Video Processing, 2009
20.2.2 Temporal Continuity/Dynamics
Property P1 strips away the temporal dimension available in a video sequence. Property P2 brings the temporal dimension back; hence, property P2 holds only for video sequences.
Successive frames in a video sequence are continuous in the temporal dimension. The continuity arising from dense temporal sampling is two-fold: the face movement is continuous and the change in appearance is continuous.
Temporal continuity provides an additional constraint for modeling face appearance. For example, smoothness of face movement is used in face tracking. As mentioned earlier, it is implicitly assumed that all face images are normalized before property P1 (set of observations) is exploited. For the purpose of normalization, face detection is applied to each image independently. When temporal continuity is available, tracking can be applied instead of detection to normalize each video frame.
Temporal continuity also plays an important role in recognition. Recent psychophysical evidence [19] reveals that moving faces are more recognizable. In addition to temporal continuity, face movement and face appearance often follow certain kinematics; in other words, changes in movement and appearance are not random. Understanding these kinematics is also important for FR.
Zhou et al. [20] proposed simultaneous tracking and recognition, an approach that systematically studies how to incorporate temporal continuity in video-based recognition. They modeled the two tasks involved, namely tracking and recognition, in a probabilistic time-series framework. This approach is elaborated in Section 20.4.
Lee et al. [21] performed video-based FR using probabilistic appearance manifolds. The main motivation is to model appearance under pose variation: a generic appearance manifold consists of several pose manifolds. Since each pose manifold is represented by a linear subspace, the overall appearance manifold is approximated by piecewise linear subspaces. The learning procedure is based on face exemplars extracted from a video sequence: K-means clustering is first applied, and principal component analysis is then used to characterize each cluster by a subspace.
In addition, the transition probabilities between pose manifolds are also learned. The temporal continuity is directly captured by the transition probabilities. In general, the transition probabilities between neighboring poses (such as frontal pose to left pose) are higher than those between far-apart poses (such as left pose to right pose). Recognition also reduces to computing a posterior distribution.
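A minimal sketch of such a recognition step, assuming P pose manifolds each summarized by a PCA subspace; the Gaussian reconstruction-error likelihood and all parameter names are illustrative assumptions, not the exact formulation of [21]:

import numpy as np

def subspace_likelihood(frame_vec, mean, basis, sigma=1.0):
    """Likelihood of a face image under one pose manifold's PCA subspace,
    via reconstruction error (illustrative Gaussian model)."""
    centered = frame_vec - mean
    recon = basis @ (basis.T @ centered)      # project onto the pose subspace
    err = np.sum((centered - recon) ** 2)
    return np.exp(-err / (2.0 * sigma ** 2))

def pose_posterior_update(prior, transition, likelihoods):
    """One Bayes-filter step over pose manifolds.
    transition[i, j] = P(pose_j at k | pose_i at k-1); replacing it with a
    uniform matrix would discard the temporal continuity it encodes."""
    predicted = transition.T @ prior          # temporal continuity via transitions
    posterior = likelihoods * predicted
    return posterior / posterior.sum()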
Lee et al. compared three methods that use temporal information differently: the proposed method with learned transition matrix, the proposed method with uniform transition matrix (meaning that temporal continuity is lost), and majority voting. The proposed method with learned transition matrix achieved a significantly better performance than the other two methods.
Liu and Chen [22] used adaptive hidden Markov models (HMMs) to capture the dynamics. The HMM is a statistical tool for modeling time series and is represented by λ = (A, B, π), where A is the state transition probability matrix, B is the observation PDF, and π is the initial state distribution. Given a probe video sequence Y, its identity is determined as

(20.23) n* = arg max_n P(Y | λ_n)

where P(Y | λ_n) is the likelihood of observing the video sequence Y given the model λ_n of person n. In addition, when certain conditions hold, the HMM λ_n was adapted to accommodate the appearance changes in the probe video sequence, resulting in improved modeling over time. Experimental results on various data sets demonstrated the advantages of using adaptive HMMs.
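As a concrete reading of Eq. (20.23), the sketch below scores a probe sequence against each person's HMM with the scaled forward algorithm and picks the maximum-likelihood identity; a discrete observation model is assumed here for brevity:

import numpy as np

def log_likelihood(obs, A, B, pi):
    """Scaled forward algorithm for a discrete HMM lambda = (A, B, pi).
    obs: observation symbol indices; A: (S, S); B: (S, M); pi: (S,)."""
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # predict with A, weight by observation PDF
        scale = alpha.sum()
        log_p += np.log(scale)           # accumulate log-likelihood
        alpha /= scale                   # rescale to avoid underflow
    return log_p

def identify(obs, models):
    """Eq. (20.23): pick the gallery model maximizing P(Y | lambda_n)."""
    scores = [log_likelihood(obs, *lam) for lam in models]
    return int(np.argmax(scores))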
Aggarwal et al. [23] proposed a system identification approach for video-based FR. The face sequence is treated as a first-order autoregressive moving average (ARMA) random process. Once the system is identified, i.e., each video sequence is associated with its ARMA parameters, video-to-video recognition uses various distance metrics constructed from those parameters. Section 20.5 details this approach.
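A sketch of one common closed-form identification procedure for such a model (x_{t+1} = A x_t + v_t, y_t = C x_t + w_t), in the SVD-based style used for dynamic textures; this illustrates the general recipe under stated assumptions, not the exact algorithm of [23]:

import numpy as np

def identify_arma(Y, n_states=10):
    """Y: (d, T) matrix whose columns are vectorized face frames, n_states <= T.
    Returns estimates of the transition matrix A and observation matrix C."""
    Y = Y - Y.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n_states]                          # observation matrix
    X = np.diag(s[:n_states]) @ Vt[:n_states]    # state trajectory, (n, T)
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])     # least-squares transition matrix
    return A, C

Two sequences can then be compared through distances between their (A, C) pairs, e.g., subspace-angle-based metrics.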
In [24], Turaga et al. revisited the system identification formulation of [23] from an analytic Grassmann manifold perspective. Using rigorous tools from the Grassmann manifold (e.g., the Procrustes distance and kernel density estimation on the manifold), Turaga et al. improved FR performance by a large margin.
Facial expression analysis is also related to temporal continuity/dynamics but not directly related to FR. Examples of expression analysis include [25, 26]. A review of face expression analysis is beyond the scope of this chapter.
Read full chapter
URL:
https://www.sciencedirect.com/science/article/pii/B9780123744562000232
Artefacts in Formal Ontology
Stefano Borgo , Laure Vieu , in Philosophy of Technology and Engineering Sciences, 2009
6.5 Identity criteria
If we are to grant an ontological status to artefacts, a delicate point now needs to be addressed. We need to examine their identity criteria. We have seen that artefacts are distinct from the physical objects (or amounts of matter, in the case of artefactual matter) that constitute them. They should therefore have distinct identity criteria. Indeed, artefacts can be repaired and have some parts substituted, thus changing the entity that constitutes them for another without losing their identity. Such change comes at the cost of the former constituting entity disappearing simultaneously with the newer constituting entity coming into existence, though maintaining a certain degree of spatio-temporal continuity between the two. In fact, no artefact can "jump" from one material entity to a separate preexisting one at will. If Theseus's ship [Rea, 1997, Introduction], an artefact, does not disappear when a plank is substituted, the physical object that constitutes it, the planks-and-nails assembly, changes so that the former assembly ceases to exist and a new assembly comes into existence.
By pointing out the property that an artefact cannot jump from one physical object to another, we can shed some light on the important distinction between artefacts and artefact roles. Roles, in general, can be played by different entities (e.g. different persons at different times can play the role of president of the US) [Masolo et al., 2004], and the change between players can be seen as a "jump", as the previous player usually survives the change and the successor often already exists. Physical artefacts are more stable. They are not roles. This distinction is evident, for instance, in the house/home contrast. A house is an artefact which can play the role of being someone's home. One's home changes, with a jump from one house to another, when one moves house, so "home" is not a type of artefact subsumed by "house", but rather a role.
The gradual change in the constituting material entity may only occur with artefacts selected from physical objects and not with those selected from amounts of matter. It is reasonable to assume that amounts of plastic or of glass cannot switch over, just as quantities of matter cannot interchange. Indeed, amounts of matter in DOLCE have purely mereological identity criteria. Non-agentive physical objects have more complex identity criteria, which vary from sortal to sortal. It is not the purpose of this paper to establish those criteria, but as a general guideline, we will take shape and internal structure to be part of these criteria. We assume, though, that minor changes in shape and in the constituting amount of matter, like those induced by a scratch, are allowed. Granularity is certainly an issue here.
With artefacts, an obvious characteristic for determining their identity criteria is their intentional aspect, that is, their attributed capacity. The identity criteria should, among other things, determine when an artefact disappears altogether. Ordinary malfunctioning does not make an artefact disappear, so its identity criteria cannot be based simply on a match between attributed capacity and capacity. Nor is the artefact's disappearance simply based on its constituting entity's disappearance, since that entity can be substituted, as we have just seen. So, the loss of much of the attributed capacity must be involved. We do not intend to solve here the infamous ship-of-Theseus puzzle [Rea, 1997, Introduction], but we believe that we can nevertheless safely assume that the identity criteria of artefacts are based on a combination of a significant degree of spatio-temporal continuity of the constituting entities, the existence of all specific essential parts if any (e.g. for a car, its frame), and the actuality of a significant amount of attributed capacity, i.e. a significant overlap between one region member of the quale of the attributed capacity and the region quale of the capacity. Note that since the attributed capacity is not restricted to the overall or main function of the artefact and since it covers structural specifications, a malfunctioning artefact does possess most of its attributed capacity. Even a badly designed artefact, like a medieval flying machine, possesses most of its attributed capacity.
Read full chapter
URL:
https://www.sciencedirect.com/science/article/pii/B978044451667150015X
Vulnerability of Water Resources to Climate
M. Rodell , in Climate Vulnerability, 2013
5.10.6 Drought Monitoring
Drought has devastating impacts on society and costs the US economy 6–8 billion dollars per year on average (WGA 2004). Drought affects the availability of water for irrigation, industry, and municipal use; it can ravage crops, forests, and other vegetation; and, where rivers support hydropower and power plant cooling, it affects electricity generation. Most current drought products rely heavily on precipitation indices and are limited by the scarcity of reliable, objective information on subsurface water stores. Groundwater levels, which integrate meteorological conditions over time scales of weeks to years, would be particularly well suited to drought monitoring, if only such data were available with some semblance of spatial and temporal continuity and a reasonably long background climatology (Rodell 2010).
Droughts cause declines in all types of terrestrial water storage, and as a result they stand out in the GRACE data. For example, the drought in the southeastern United States (2007–2008) imparted a negative trend in that area in Figure 2. GRACE has been applied directly to investigate a decade-long drought in southeastern Australia (Awange et al. 2009; Leblanc et al. 2009). Yirdaw et al. (2008) characterized TWS changes associated with a recent drought in the Canadian prairie. Chen et al. (2009) used GRACE to study a major drought event that occurred in the Amazon in 2005, and Chen et al. (2010) examined a recent drought in the La Plata Basin.
Houborg et al. (2012) applied the GRACE data assimilation approach to enhance the value of GRACE for drought monitoring, developing drought indicators for surface soil moisture, root zone soil moisture, and groundwater over the continental United States. Because a long-term record is needed as background to quantify drought severity, whereas GRACE data are only available from mid-2002, it was necessary to rely on the Catchment LSM alone for most of the record. Therefore, Houborg et al. (2012) executed an open loop model simulation for the period 1948 to near present, using as input a meteorological forcing dataset developed at Princeton University (Sheffield et al. 2006). Monthly GRACE TWS anomaly fields (Swenson and Wahr 2006b) were converted to absolute TWS fields by adding the time-mean total water storage field from the open loop Catchment model simulation. This ensured that the assimilated TWS output would be nearly identical to that of the open loop, which is to say that there was no discontinuity between the open loop and assimilation portions of the run. However, the GRACE data, and therefore the assimilation results, could still have a larger or smaller range of variability than the open loop land surface model (LSM) results at any given location. This is significant because drought monitoring concerns the extremes. Therefore, to correct for differences in the range of variability between the assimilation and the open loop model output, Houborg et al. (2012) computed and mapped between the cumulative distribution functions of wetness at each model pixel for the open loop and assimilation results during the overlapping period (2002 onward). Drought indicator fields for surface (top several centimeters) soil moisture, root zone soil moisture, and groundwater were then generated based on the probability of occurrence in the output record since 1948. To mimic other drought indicators that contribute to the US Drought Monitor product, dry conditions were classified from D0 (abnormally dry) to D4 (exceptional), corresponding to decreasing cumulative probability percentiles of 20–30%, 10–20%, 5–10%, 2–5%, and 0–2%.

Following this process, new GRACE-data-assimilation-based drought indicators (Figure 5) are now being produced on a weekly basis by NASA and disseminated from the University of Nebraska's National Drought Mitigation Center web portal. They are also being delivered to the principals of the US Drought Monitor, who are currently assessing them as new inputs. The US Drought Monitor is the premier decision support tool for drought in the United States; however, it lacked spatially continuous groundwater and soil moisture inputs before the development of the new GRACE-based drought indicators. Currently, very few drought assessment products have global coverage, and those that do exist lack the sort of information on subsurface water stores that GRACE provides. It is likely that the GRACE-based drought indicators just described will be extended to the global scale in the near future to help alleviate this knowledge gap.
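The two statistical steps described above, matching the assimilation output to the open loop climatology and converting values to D0–D4 classes, can be sketched per pixel as follows (a generic quantile-mapping illustration, not the operational code):

import numpy as np

def cdf_match(assim, open_loop):
    """Quantile mapping: re-express assimilation values on the open-loop
    CDF so both records share the same range of variability."""
    assim = np.asarray(assim, dtype=np.float64)
    ranks = np.searchsorted(np.sort(assim), assim, side='right') / len(assim)
    return np.quantile(np.asarray(open_loop, dtype=np.float64), ranks)

def drought_category(value, climatology):
    """Percentile-based dryness classes mirroring the US Drought Monitor."""
    pct = 100.0 * np.mean(np.asarray(climatology) < value)  # percentile since 1948
    for cutoff, label in [(2, 'D4'), (5, 'D3'), (10, 'D2'), (20, 'D1'), (30, 'D0')]:
        if pct <= cutoff:
            return label
    return 'no drought'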
Read full chapter
URL:
https://www.sciencedirect.com/science/article/pii/B9780123847034005219
Hybrid soft computing approaches to content based video retrieval: A brief review
Hrishikesh Bhaumik , ... Susanta Chakraborty , in Applied Soft Computing, 2016
3.2.2 Soft computing approaches
Soft computing has played an important role in the detection and tracking of objects present in a video. Researchers have tested and relied upon different soft computing paradigms for obtaining accurate results. In [113] Doulamis et al. presented an adaptive neural network classifier architecture consisting of two modules: the first tracks video objects (VOs), while the second provides the initial VO estimation. The network is adapted through a cost-effective weight updating algorithm. The method is demonstrated on two applications: in the first, humans are extracted in video conferencing settings, while in the other, generic VOs are detected in stereoscopic video sequences. The algorithm is efficient in complex scenarios such as object blending and occlusion. Hu et al. [114] applied a fuzzy self-organizing neural network to learn the activity patterns of objects. The work focuses on anomaly detection and prediction of activity patterns. The entire trajectory is fed as input to the neural network, and patterns are learned based on fuzzy set theory and batch learning. The fuzzy self-organizing neural network so designed was found to outperform the Kohonen self-organizing feature map (SOFM) and vector quantization in terms of both speed and accuracy. In another work, Perlovsky et al. [115] developed a neural-network-based object tracker incorporating models for ground moving target indicator tracks; a 20 dB improvement in the signal-to-clutter ratio was achieved. Culibrk et al. [116] presented a neural network architecture in the form of an unsupervised Bayesian classifier. Video objects were segmented using a background modeling and subtraction approach. The classifier was tested on sequences with complex background motion and changes in illumination. The algorithm was parallelized at the sub-pixel level and designed for efficient hardware implementation.
The object tracking process is a function of both spatial and temporal information; as such, it can be viewed as a dynamic optimization problem. Consequently, a sequential particle swarm optimization was proposed by Zhang et al. [117], in which information related to temporal continuity is integrated into the traditional PSO algorithm. The authors show theoretically that this modified PSO framework is essentially a multilayer particle filter based on importance sampling. It is also demonstrated that the proposed algorithm handles changes in object appearance and unpredicted object motion better than state-of-the-art particle filters and their variants. Subsequently, Zhang et al. [118] used particle swarm optimization to handle the problem of tracking multiple objects in an occluded environment. The problem takes on a different dimension when the objects have similar appearance. An analogy is drawn between the behavior of bird flocks of multiple species and the multiple-object tracking problem: the global swarm is divided into groups, one per object, and each species keeps track of its object. Thus the joint tracker used in other approaches is decentralized into a set of individual trackers. The method is demonstrated to work effectively and efficiently. The use of support vector machines has also been demonstrated for multi-object tracking by Zhang and van der Maaten [119], who proposed a structure-preserving approach. The authors devised a model-free tracker for the purpose, in which spatial constraints are learned by an SVM. The tracker also provided a significant performance improvement for single-object tracking by capturing motion in different parts of the object.
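A compact sketch of PSO-based tracking with temporal continuity: the swarm for frame k is seeded around the estimate from frame k − 1, and fitness is a simple template-matching score. All parameter values are illustrative, and this simplification omits the multilayer importance-sampling machinery of [117]:

import numpy as np

def pso_track(frame, template, prev_pos, n_particles=30, iters=20,
              inertia=0.7, c1=1.5, c2=1.5, spread=15.0, rng=None):
    """Estimate the template's top-left corner in `frame`, starting near prev_pos."""
    rng = np.random.default_rng() if rng is None else rng
    th, tw = template.shape
    H, W = frame.shape

    def fitness(p):
        y, x = int(round(p[0])), int(round(p[1]))
        if y < 0 or x < 0 or y + th > H or x + tw > W:
            return -np.inf                        # candidate falls outside the frame
        patch = frame[y:y + th, x:x + tw].astype(np.float64)
        return -np.mean((patch - template) ** 2)  # negative SSD as fitness

    pos = prev_pos + rng.normal(0.0, spread, size=(n_particles, 2))  # temporal seeding
    vel = np.zeros_like(pos)
    pbest, pbest_f = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmax(pbest_f)].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, 1))
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        f = np.array([fitness(p) for p in pos])
        improved = f > pbest_f
        pbest[improved], pbest_f[improved] = pos[improved], f[improved]
        gbest = pbest[np.argmax(pbest_f)].copy()
    return gbest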
A real-time object tracking approach based on PSO was proposed by Kobayashi et al. [120]. The algorithm was designed on the premise that the optimal solution can be searched for efficiently by the particles of the swarm. PSO is applied to track an object over the wide search range of a video sequence. The practical usefulness of the method is demonstrated through simulations.
Hwang et al. [121] applied genetic algorithms to the automatic extraction and tracking of objects in a video sequence. Producing video object planes is a difficult problem. Using distributed genetic algorithms, the frames are spatially decomposed by the chromosomes. The video object planes are generated using a change detection mask combined with the results of the frame decomposition. To maintain the temporal continuity of video objects across consecutive frames, the chromosomes are initialized from the spatial decomposition results of the previous frame, which also helps to eliminate redundant computation. This yields accurate video object planes and good extraction results. Genetic algorithms have been used in other works that take up the tracking and extraction of objects in video. Extracting moving objects from a video requires segmenting it in both the spatial and temporal domains: spatial segmentation yields the object boundaries, while temporal segmentation enables detection of foreground and background. Building upon this concept, Kim and Park [122] used a genetic algorithm for automatic video segmentation. The proposed method has two advantages: first, no a priori knowledge is required to segment the video; second, its architecture includes an algorithm for tracking objects efficiently.
Han et al. [123] used a kernel-based Bayesian filter for tracking objects in a video sequence. The method encompasses an analytic approach for approximating and propagating the density functions required for real-time object tracking, and it performs sampling more efficiently in high-dimensional spaces.
Video segmentation forms the underlying basis for content-based video retrieval. As such, advances in techniques of video segmentation determine the effectiveness of the approaches developed for content-based video retrieval. In the next section, techniques related to CBVR are enumerated.
Read full article
URL:
https://www.sciencedirect.com/science/article/pii/S1568494616301314
Perception, information processing and modeling: Critical stages for autonomous driving applications
Dominique Gruyer , ... Andry Rakotonirainy , in Annual Reviews in Control, 2017
6 Conclusion
We note that in recent years, research on automated driving has grown exponentially. At the sensor level, a consensus seems to have been reached, in particular on implementing perception belts around the vehicle by combining a set of sensors (LiDAR, RADAR, and camera). This type of architecture is, for the moment, not installed on mass-produced vehicles (mainly because of the cost of the sensors), which keeps the benefits of fully autonomous driving out of reach. Nevertheless, some manufacturers are already able to offer partially automated driving services using significantly fewer sensors.
To collect the most comprehensive information about the environment, it seems necessary to use different types of sensors in order to exploit the complementarity of their operating ranges. Algorithms processing camera images, for example, are very effective for object classification. RADARs remain the preferred sensors for detecting obstacles at long distances and in harsh weather conditions. Finally, LiDARs make it possible to obtain an accurate representation of the shapes of the objects constituting the environment. Adapted algorithms can then recover the state vectors of the properties characterizing the moving objects in the scene.
In order to process the information coming from these different sensors, multi-sensor data fusion algorithms constitute a mandatory step. Fusion strategies are often based on a tracking algorithm that ensures temporal continuity in tracking objects. Several types of fusion strategies are possible: centralized, decentralized, and even cooperative (where one sensor is used to help a second sensor detect). These multi-sensor fusion strategies and environmental representation models demand significant amounts of computation to ensure real-time operation of applications.
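As a minimal illustration of the temporal-continuity role that tracking plays in such fusion chains, the sketch below maintains a constant-velocity Kalman track that any sensor's position detections can update; the motion model and noise levels are illustrative assumptions:

import numpy as np

class ConstantVelocityTrack:
    """Minimal Kalman track over state [x, y, vx, vy] updated by 2D detections."""
    def __init__(self, xy, dt=0.1):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])
        self.P = np.eye(4) * 10.0
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt   # constant velocity
        self.H = np.eye(2, 4)                                  # observe position only
        self.Q = np.eye(4) * 0.01                              # process noise (assumed)
        self.R = np.eye(2) * 0.5                               # sensor noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                                      # predicted position

    def update(self, z):
        y = z - self.H @ self.x                                # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)               # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P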
To manage these demands, computer architectures are evolving beyond the ECUs usually used in the automotive environment. Platforms like the NVIDIA PX-2, based on a GPU, are now alternatives, certainly unconventional but nonetheless relevant, for processing the real-time data provided by the sensors.
Finally, the driver should not be forgotten in the design of automated driving systems, as suggested by the SAE classification. To include the driver, a systemic development around the driver is needed. To this end, we have also discussed some elements that could be used in future research to monitor, model, and predict the driver's supervision state in order to integrate the driver into the co-pilot application: for example, retrieving information about the driver's perception via adapted sensors could give important information about the driver's awareness and health status. In addition, IoT technologies can contribute to monitoring information from distant sources, so that the information about the environment is enhanced with background information. These kinds of information from distant sources can also provide specifications on the driver's current activity and driving expectations. Going further, we have discussed current research on driver modeling, mainly based on artificial intelligence tools. Indeed, the last step in the hand-over between the driver and the machine consists in making advances in research on driver modeling so as to offer an interpretation of the actions of the system in line with the driver's behavior.
Although purely autonomous vehicles are currently not yet available to the general public, captive fleets have already been launched by some companies (including via automated shuttles). Given the speed of technological evolution and also the challenges involved, it is reasonable to assume that the first vehicles offering purely automated driving services will not be available before 2025.
Read full article
URL:
https://www.sciencedirect.com/science/article/pii/S136757881730113X
A survey on deep learning techniques for image and video semantic segmentation
Alberto Garcia-Garcia , ... Jose Garcia-Rodriguez , in Applied Soft Computing, 2018
3.2.6 Video sequences
As we have observed, there has been significant progress in single-image segmentation. However, when dealing with image sequences, many systems rely on the naïve application of the very same algorithms in a frame-by-frame manner. This approach works, often producing remarkable results. Nevertheless, applying those methods frame by frame is usually non-viable due to computational cost. In addition, those methods completely ignore temporal continuity and coherence cues, which might help increase the accuracy of the system while reducing its execution time.
Arguably, the most remarkable work in this regard is the clockwork FCN by Shelhamer et al. [95]. This network is an adaptation of an FCN that makes use of temporal cues in video to decrease inference time while preserving accuracy. The clockwork approach relies on the following insight: feature velocity, the temporal rate of change of features in the network, varies from layer to layer, so that features from shallow layers change faster than those from deep ones. Under that assumption, layers can be grouped into stages and processed at different update rates depending on their depth. By doing this, deep features can be persisted over frames thanks to their semantic stability, thus saving inference time. Fig. 21 shows the network architecture of the clockwork FCN.
It is important to remark that the authors propose two kinds of update rates: fixed and adaptive. The fixed schedule simply sets a constant time frame for recomputing the features of each stage of the network. The adaptive schedule fires each clock in a data-driven manner, e.g., depending on the amount of motion or semantic change. Fig. 22 shows an example of this adaptive scheduling.
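The scheduling idea can be sketched as follows; shallow_stage, deep_stage, and head are hypothetical stand-ins for groups of layers of a segmentation network, and the change statistic is a simplification of the paper's data-driven clock:

import numpy as np

def clockwork_segment(frames, shallow_stage, deep_stage, head, change_thresh=0.1):
    deep_feat, prev_shallow = None, None
    for frame in frames:
        shallow = shallow_stage(frame)             # cheap stage, updated every frame
        if deep_feat is None:
            fire = True                            # first frame: compute everything
        else:
            delta = np.mean(np.abs(shallow - prev_shallow))
            fire = delta > change_thresh           # adaptive clock fires on change
        if fire:
            deep_feat = deep_stage(shallow)        # expensive stage, otherwise persisted
        prev_shallow = shallow
        yield head(shallow, deep_feat)             # fuse fresh shallow + cached deep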
Zhang et al. [115] took a different approach, making use of a 3DCNN, originally created for learning features from volumes, to learn hierarchical spatio-temporal features from multi-channel inputs such as video clips. In parallel, they over-segmented the input clip into supervoxels and embedded the learned features into the resulting supervoxel graph. The final segmentation is obtained by applying graph-cut [116] to the supervoxel graph.
Another remarkable method, which builds on the idea of using 3D convolutions, is the deep end-to-end voxel-to-voxel prediction system by Tran et al. [96]. In that work, they make use of the Convolutional 3D (C3D) network introduced in their previous work [117] and extend it for semantic segmentation by adding deconvolutional layers at the end. Their system works by splitting the input into clips of 16 frames and performing predictions for each clip separately. Its main contribution is the use of 3D convolutions, whose three-dimensional filters are suitable for spatio-temporal feature learning across multiple channels, in this case frames. Fig. 23 shows the difference between 2D and 3D convolutions applied to multi-channel inputs, illustrating the usefulness of the 3D ones for video segmentation.
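A shape-level illustration of that difference in PyTorch (untrained layers, purely to show that 3D filters span frames as well as space):

import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)        # (batch, channels, frames, H, W)

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
per_frame = torch.stack([conv2d(clip[:, :, t]) for t in range(16)], dim=2)
print(per_frame.shape)                         # frames processed independently

conv3d = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)
spatiotemporal = conv3d(clip)                  # filters span space *and* time
print(spatiotemporal.shape)                    # torch.Size([1, 64, 16, 112, 112])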
Novel approaches such as the SegmPred model proposed by Luc et al. [97] are able to predict semantic segmentation maps of video frames not yet observed. This model consists of a two-scale architecture trained in both adversarial and non-adversarial ways in order to deal with blurred predictions. The model inputs are per-frame annotations consisting of the softmax output layer pre-activations. Model performance drops when predicting more than a few frames into the future. Nevertheless, this approach is able to model object dynamics on the semantic segmentation maps, which remains an open challenge for current computer vision systems.
Read full article
URL:
https://www.sciencedirect.com/science/article/pii/S1568494618302813
A review of segmentation methods in short axis cardiac MR images
Caroline Petitjean , Jean-Nicolas Dacher , in Medical Image Analysis, 2011
5.2 Tracking the ventricle borders and motion information
Tracking the ventricle boundaries over time allows for full cycle segmentation and for further investigation of the cardiac deformations and strain. Only a few methods really exploit the information provided by cardiac motion, partly because of the complexity and variability of the motion model, but also because ED and ES image segmentations are sufficient for estimating the cardiac contractile function in a clinical context. Making use of the temporal dimension can help the segmentation process and yields temporally consistent solutions. Temporal continuity is also of interest to the clinician during manual segmentation, who often examines the images that follow or precede the image to be segmented in the sequence, as well as the corresponding images in the slices below or above. This expert behavior can thus be emulated by considering that the neighborhood of each voxel includes its six spatial nearest neighbors and the voxels in the neighboring time frames of the sequence, as in MRF-based approaches (Lorenzo-Valdés et al., 2004) or in (Cousty et al., 2010), where a 4D graph corresponding to the 3D+t sequence is considered.
Tracking the ventricle contours over time may be performed with or without external knowledge. In the latter case, tracking relies on intrinsic properties of the segmentation methods. The variational approach of deformable models has been shown to be a very powerful and versatile framework for tracking the ventricle borders, the temporal resolution of cardiac MR allowing the segmentation result of the previous frame to be propagated to the next one (Pham et al., 2001; Gotardo et al., 2006; Hautvast et al., 2006; Ben Ayed et al., 2009). To improve the robustness of the propagation, solutions have been proposed such as tracking the contours both backwards and forwards in time (Gotardo et al., 2006) and constraining the contour to respect the user's preference by maintaining it at a constant position through the matching of gray level profiles (Hautvast et al., 2006). Non-rigid registration may also be used to propagate manually initialized contours or a heart atlas (Lorenzo-Valdés et al., 2002; Noble et al., 2003; Zhuang et al., 2008). In this case, the segmentation boils down to a registration problem: the transformation matching one (segmented) image to the other (unsegmented) is applied to the contour of the segmented image, thus providing the new deformed contour. Segmentation and registration can also be coupled by jointly searching for the epicardial and endocardial contours and for an alignment transformation to a shape reference (Paragios et al., 2002).
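A generic sketch of this frame-to-frame propagation using an off-the-shelf active contour (scikit-image): the converged contour of frame k warm-starts the snake on frame k + 1. It stands in for the deformable-model methods cited above rather than reproducing any specific one, and the snake parameters are arbitrary:

import numpy as np
from skimage.filters import gaussian
from skimage.segmentation import active_contour

def propagate_contours(frames, init_contour):
    """frames: iterable of 2D images; init_contour: (N, 2) (row, col) points,
    e.g. a circle drawn around the ventricle at ED."""
    contour = init_contour
    results = []
    for frame in frames:
        smoothed = gaussian(frame, sigma=2)        # denoise before snake evolution
        contour = active_contour(smoothed, contour,
                                 alpha=0.015, beta=10.0, gamma=0.001)
        results.append(contour)                    # result seeds the next frame
    return results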
Deformable models are also efficient for introducing knowledge regarding heart motion. Prior knowledge about cardiac motion can be encoded in a weak manner, by temporally averaging the trajectories of the contour or surface points (Heiberg et al., 2005; Montagnat and Delingette, 2005), or in a strong manner, by using an external prior. In (Lynch et al., 2008), the authors observed that the change of the blood volume of the ventricle is intrinsically linked to the boundary motion and that the volume decreases and increases during one cardiac cycle. The movement of the contour points is then modeled by an inverted Gaussian function, used to constrain the deformation of the level set. The dynamics of the heart can be taken into account via a biomechanical model (see Section 4.2.3), used not to predict motion but to regularize the deformations of the volumetric model. Another possibility is to consider the moving boundaries as constituting a dynamic system, whose state must be estimated thanks to observations (the images) and a model learned from training data (Sénégas et al., 2004; Sun et al., 2005). This explicit combination of shape and motion information, together with the appearance of the heart, can also be embedded in an ASM/AAM framework as a single shape and intensity sample (Lelieveldt et al., 2001).
Tracking the LV and RV borders over time is feasible, mainly in a deformable model framework. Thanks to full cycle image segmentation, the volume variation throughout the cardiac cycle can be assessed, along with other quantitative parameters (such as strain) that have proven useful for assessing the cardiac contractile function (Papademetris et al., 2002).
Read full article
URL:
https://www.sciencedirect.com/science/article/pii/S1361841510001349
Source: https://www.sciencedirect.com/topics/engineering/temporal-continuity