AVS 3D Video Coding Technology and System

Abstract

Following the success of the Audio Video Standard (AVS) for 2D video coding, the China AVS workgroup started developing 3D video (3DV) coding techniques in 2008. In this paper, we discuss the background, technical features, and applications of AVS 3DV coding technology. We introduce the two core techniques used in AVS 3DV coding: inter-view prediction and enhanced stereo-packing coding. We elaborate on these techniques, which are used in the AVS real-time 3DV encoder. An application of the AVS 3DV coding system is presented to show the great practical value of this system. Simulation results show that the advanced techniques used in AVS 3DV coding provide remarkable coding gain compared with the simulcast scheme.

Keywords

AVS; 3D video coding; inter-view prediction; stereo packing

1 Introduction

The China Audio Video Standard (AVS) is developed by the AVS workgroup, whose role is to establish general technical standards for the compression, decoding, processing, and representation of digital audio and video [1]. Over the past ten years, the AVS workgroup has developed a series of standards for different applications, and these standards have attracted the attention of both industry and academia. In 2007, AVS was accepted by the ITU-T IPTV focus group as one of four video formats. With the fast development of display technologies and rapidly growing demand for 3D video (3DV) applications, high-efficiency 3DV compression is needed. The most straightforward 3DV coding scheme is simulcast, in which compression and transmission are performed separately for each view. However, simulcast ignores inter-view correlation and produces double the amount of data of traditional video; it is therefore not the optimal solution for 3DV coding. In 2008, the AVS workgroup launched the 3DV coding project to satisfy the demand for higher resolution and better quality that had arisen as a result of widespread 3DV usage [2].

Currently, the AVS workgroup is focused on stereoscopic video coding because of the rapidly growing 3DV market and number of applications. Two advanced stereoscopic video coding schemes have been adopted: inter-view prediction and enhanced stereo-packing coding [3]. In inter-view prediction, the redundancy between the two channels is greatly reduced by allowing disparity compensation from the inter-view frame. Enhanced direct-mode prediction and enhanced motion-vector prediction further improve coding performance [4]. In enhanced stereo packing, the image of each view is down-sampled by half, and the two down-sampled views are merged into a single frame. Prediction between the packed views is allowed within the frame to improve coding efficiency, and the technique remains backward compatible with the existing 2DV coding infrastructure. To flexibly support these two 3DV coding schemes, AVS defines high-level syntax at both the system layer and the video layer.

Driven by the fast development of microelectronic techniques, there is an urgent need for a dedicated AVS 3DV real-time encoder chip capable of the huge throughput and massive computation required by consumer applications. Although AVS is designed for optimized coding and low complexity, compressing high-definition (HD) stereoscopic video in real time remains a significant challenge. Several key techniques have therefore been proposed, including parallel pipelined video coding, advanced rate control, and inter-view synchronization. In this paper, we review these key techniques used in encoder chip design and propose an AVS 3D system that incorporates a real-time encoder for TV broadcasting. Our proposal is the first end-to-end AVS 3DV system, and it has already been used to broadcast the 2010 Guangzhou Asian Games on 3DTV.

In section 2, we introduce the inter-view prediction and stereo-packing schemes used in AVS. In section 3, we present the AVS 3D coding system, including the core techniques for designing an AVS 3D real-time encoder chip and the broadcasting system. In section 4, we give the results of experiments conducted with these technologies. Section 5 concludes the paper.

2 3D Video Coding in the Audio Video Standard

2.1 Inter-View Prediction

Fig. 1 shows the basic concept of the AVS inter-view stereoscopic coding system. The input signal comprises left and right views that are captured by a stereo camera. These views are coded using an AVS 3D encoder, and the resulting bitstreams are multiplexed to form the final bitstream packet. At the receiver, the bitstream packet is decoded with the AVS 3D decoder for stereo display. To ensure compatibility with AVS 2D, the sub-bitstream, which represents the independent view, can also be decoded using an AVS 2D decoder and displayed on a conventional 2D display system.

Inter-view prediction uses the already coded data in the other view to efficiently represent the current view [3]. One of the two views, referred to as the base view or independent view, is coded independently using an unmodified AVS P2 video coder. Fig. 2 shows the coding structure of inter-view prediction. To ensure compatibility with monoview AVS, the number of reference frames for both the base view and dependent view is restricted to two. The base view can be decoded independently for 2D display.

To exploit inter-view correlation, the first frame of the dependent view is predicted from the reconstructed I frame in the base view. Other P frames in the dependent view can reference either the previous P frames in the same view or the simultaneously displayed P frame in the base view. Inter-view prediction for B frames brings little improvement in coding performance; therefore, references for a B frame can only be reconstructed frames from the forward and backward directions in the same view.
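
As a concrete illustration of these reference rules, the following Python sketch builds the allowed reference list for one frame. The function and variable names are ours, not from the standard; the two-frame restriction mentioned above applies throughout.

def get_reference_list(view, frame_type, frame_idx, same_view_recon, base_view_recon):
    """Sketch of the Fig. 2 reference rules; all names are illustrative."""
    if view == "base":
        # The base view is coded as plain monoview AVS: at most two
        # previously reconstructed frames from the same view.
        return same_view_recon[-2:]
    if frame_idx == 0:
        # First frame of the dependent view: predicted from the
        # reconstructed I frame of the base view.
        return [base_view_recon[0]]
    if frame_type == "P":
        # Previous P frame in the same view, or the simultaneously
        # displayed P frame in the base view (two references in total).
        return [same_view_recon[-1], base_view_recon[frame_idx]]
    # B frames: forward and backward reconstructed frames from the same
    # view only (here simply the last two in coding order); no inter-view
    # reference is used.
    return same_view_recon[-2:]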

Because the AVS inter-view coding structure changes the reference-frame mechanism of the dependent view, the related prediction techniques must also be adapted. Recently, two advanced techniques were adopted by the AVS workgroup: enhanced motion-vector prediction and enhanced direct mode [4]. These techniques exploit the correlation between the base view and the dependent view to improve coding performance.

Conventional AVS motion-vector prediction for monoview coding uses scaled motion vectors from four neighboring blocks. However, for a P frame of the dependent view, it is not desirable to use the motion vectors of neighboring blocks if they refer to a different channel. Enhanced motion-vector prediction resolves this problem. Suppose the current block is A. If A is temporally predicted, any inter-view predicted block among the neighboring blocks is marked unavailable. Similarly, if A is inter-view predicted, any temporally predicted block among the neighboring blocks is marked unavailable. This ensures that only predictors of the appropriate type are used.

In monoview AVS coding, the motion vectors of direct mode for a B frame are derived from the motion vector of the co-located block in the backward reference [5]. However, in inter-view prediction, the farthest reference frame of the backward reference is substituted by the inter-view frame. To obtain accurate motion vectors when the co-located block in the backward reference is inter-view predicted, the motion vectors of the neighboring blocks are used instead of its disparity vector.
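
The two rules can be summarized in a few lines of Python. This is a hypothetical sketch: Block, its fields, and the median fallback are our own simplifications, and real AVS additionally scales motion vectors by temporal distance.

from dataclasses import dataclass
from statistics import median

@dataclass
class Block:
    mv: tuple            # motion or disparity vector (x, y)
    is_interview: bool   # True if the block is inter-view predicted

def predictor_candidates(current_is_interview, neighbors):
    # Enhanced MV prediction: a neighboring block is available only if its
    # prediction type (temporal vs. inter-view) matches the current block.
    return [n.mv for n in neighbors if n.is_interview == current_is_interview]

def direct_mode_mv(colocated, neighbors):
    # Enhanced direct mode: if the co-located block in the backward
    # reference carries a disparity vector, fall back to neighboring
    # temporal motion vectors instead of using that disparity vector.
    if colocated.is_interview:
        cands = predictor_candidates(False, neighbors)
        if not cands:
            return (0, 0)
        return (median(v[0] for v in cands), median(v[1] for v in cands))
    return colocated.mv  # real AVS scales this by temporal distance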

2.2 Enhanced Stereo-Packing Coding

The stereo-packing mode provides backward compatibility with the 2DTV infrastructure and improves coding performance. Fig. 3 shows how stereo-packing mode is used in AVS 3DV coding. At the encoder, each view is first down-sampled by half, and the two down-sampled frames are merged into one frame that is the input of a conventional AVS 2D encoder. At the decoder, the bitstream can be decoded using an AVS 2D decoder and then separated into the two views. Each view is up-sampled to support 3D display. The two key techniques in stereo-packing mode are the down-sampling and up-sampling algorithms and view merging [6]. Because the sampling algorithms are non-normative for the video coding standard, various algorithms can be supported depending on the application scenario. For more details on the sampling algorithms, see [6]. Currently, AVS supports two merging approaches: side-by-side and top-to-bottom (Fig. 4). These two approaches make the down-sampling and up-sampling algorithms more flexible.
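
A minimal NumPy sketch of the packing path is shown below. Simple decimation and pixel repetition stand in for the non-normative sampling filters of [6]; a real system would use bilinear, cubic, or AVC-based filters.

import numpy as np

def pack_stereo(left, right, mode="side_by_side"):
    # Down-sample each view by half in one direction, then merge the two
    # half-resolution frames into a single full-size frame.
    if mode == "side_by_side":
        return np.hstack([left[:, ::2], right[:, ::2]])
    return np.vstack([left[::2, :], right[::2, :]])   # top-to-bottom

def unpack_stereo(frame, mode="side_by_side"):
    # Split the packed frame back into two views and up-sample each one.
    # Nearest-neighbor repetition is used here purely for illustration.
    if mode == "side_by_side":
        half = frame.shape[1] // 2
        left, right = frame[:, :half], frame[:, half:]
        return np.repeat(left, 2, axis=1), np.repeat(right, 2, axis=1)
    half = frame.shape[0] // 2
    left, right = frame[:half, :], frame[half:, :]
    return np.repeat(left, 2, axis=0), np.repeat(right, 2, axis=0)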

Coding efficiency in the stereo-packing scheme can be further improved by exploiting inter-view redundancies. Similar to inter-view prediction, AVS allows inter-prediction between the two decimated views (Fig. 5). This technique is limited in that encoding of the dependent view’s decimated frame cannot begin until the base view has been encoded.

2.3 High-Level Syntax

To support the two kinds of AVS 3D coding, high-level syntax is defined at both the system layer and the video layer. Three AVS 3D coding schemes are signaled through the descriptor: simulcast compression, inter-view prediction, and stereo packing. We define a syntax element, view_orgnizing_type, to describe the coding scheme. If this element is zero, both simulcast compression and inter-view prediction are supported; if it is one, only stereo packing is supported.

Syntax in the video layer indicates the different merging approaches for stereo-packing mode (Table 1). Stereo-packing mode is fully compatible with monoview coding when the mode is set to zero. Moreover, a reserved value is defined for future extension.
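
The following sketch shows how a decoder might branch on this syntax. The element view_orgnizing_type is named in the text above; stereo_packing_mode and the specific value-to-mode mapping are our assumptions for illustration, since Table 1 defines the normative values.

def select_coding_scheme(view_orgnizing_type, stereo_packing_mode=None):
    # System-layer descriptor: 0 -> simulcast or inter-view prediction,
    # 1 -> stereo packing (merging approach signaled in the video layer).
    if view_orgnizing_type == 0:
        return "simulcast_or_interview_prediction"
    # Hypothetical video-layer mapping: 0 is monoview-compatible, and any
    # unassigned value is treated as reserved for future extension.
    modes = {0: "monoview_compatible", 1: "side_by_side", 2: "top_to_bottom"}
    return modes.get(stereo_packing_mode, "reserved")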

3 AVS 3D Video Coding System

From production to broadcasting, a 3DTV program usually goes through acquisition, encoding, multiplexing, modulation, demodulation, demultiplexing, decoding, and display. Among these, the most important is real-time encoding of the HD stereoscopic video. In this section, we discuss AVS 3DV encoder chip design techniques. We also present a 3DTV broadcasting system and discuss potential applications of the AVS 3DV coding standard.

3.1 AVS 3D Real-Time Encoder

Fig. 6 shows the AVS 3D real-time encoding system for HD stereoscopic video. In the encoder, the left and right views are down-sampled and merged into a single frame. The down-sampling direction can be horizontal, to support the side-by-side merging approach, or vertical, to support the top-to-bottom merging approach. The syntax for these approaches is defined in section 2.3. The packed frame is fed into the AVS HD encoder, which generates the AVS bitstream. Finally, the AVS bitstream is packaged into transport-stream format for storage or transmission.

The computing power required by an SD/HD encoder is far beyond the capacity of a single central processing unit (CPU). Fortunately, multicore processors make real-time encoding feasible. To fully exploit multicore processors, parallel encoding algorithms are highly desirable in the encoder design.

The motion estimation (ME) module generally takes more than 60% of the total encoding time and is therefore the bottleneck for real-time compression. We isolate the ME module for parallel processing. Because the ME module frequently needs to exchange data with other modules, macroblock-level or finer-grained parallelism is not appropriate for ME. We propose a frame-level parallel ME algorithm that does not exchange ME information with other modules until the ME of the whole frame is finished. Fig. 7 shows the architecture of the proposed dual-pipeline parallel scheme. The ME process is completely isolated as the first-level encoding process. The output of the ME module is used by the other modules in second-level encoding.

The main obstacle in the proposed dual-pipeline parallel video coding scheme is the generation of the reference frame. Conventional video coding uses the reconstructed frame as the reference in the ME process, which means frame-level ME cannot start until the reconstructed frame has been obtained. This is problematic for frame-level parallel ME because the ME of the next frame could only begin after the current frame has been fully encoded. Fortunately, the original frame can be used as the ME reference in the encoder. Because the output of the ME module is only motion vector information, no error drift is incurred; the reconstructed frames are still used in residual calculation. Although this approach does not guarantee optimal motion vectors, it is a practical route to frame-level parallel ME and strikes a good balance between computational complexity and coding performance.
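
The sketch below models the dual pipeline with two threads and a bounded queue. motion_estimate and encode_frame are placeholders for the real hardware modules; the point is that the ME stage references original frames and therefore never blocks on reconstruction.

import queue
import threading

def motion_estimate(cur, ref):
    # Placeholder for block-based ME. The reference is the ORIGINAL
    # previous frame, so this stage never waits for reconstruction.
    return {"cur": cur, "ref": ref}

def encode_frame(frame, mvs, recon_prev):
    # Placeholder for mode decision, transform, quantization, and
    # reconstruction; residuals still use reconstructed frames, so the
    # original-frame ME above introduces no drift.
    return frame  # stands in for the reconstructed frame

def me_stage(frames, out_q):
    prev = None
    for f in frames:
        out_q.put((f, motion_estimate(f, prev) if prev is not None else None))
        prev = f  # keep the original frame, not the reconstruction
    out_q.put(None)  # end-of-stream marker

def run_dual_pipeline(frames):
    q = queue.Queue(maxsize=2)  # keeps ME roughly one frame ahead
    threading.Thread(target=me_stage, args=(frames, q), daemon=True).start()
    recon = None
    while (item := q.get()) is not None:
        frame, mvs = item
        recon = encode_frame(frame, mvs, recon)  # second-level encoding

The bounded queue mirrors the frame-level pipelining in Fig. 7: the ME of frame n+1 can proceed while frame n is still being encoded.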

Rate control is important in practical encoder design. Without rate control, a mismatch between the source bit rate and channel capacity would cause buffer underflow or overflow. To accurately control the rate of AVS 3DV coding, we propose a window-based scheme that tackles the interference between rate control and rate-distortion optimization (RDO). In this scheme, rate-Qstep (R-Q) and distortion-Qstep (D-Q) models are used to allocate the appropriate number of bits to each coding unit and to adjust the quantization parameter so that each unit is encoded within its allocated bits. With the D-Q model, distortion can be estimated, and the optimized coding mode can be obtained by comparing the rate-distortion costs of the candidate modes. Scene switching is also handled in the window-based scheme because it may cause large bit-rate fluctuations.
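
As a rough illustration of the window-based idea, the sketch below allocates a bit budget over a sliding window and maps it to a quantization step with a first-order R-Q model, R = X * C / Qstep. The model form, the constants, and all names are our assumptions; the paper's exact R-Q and D-Q models are not reproduced here.

class WindowRateControl:
    def __init__(self, bitrate, fps, window=8):
        # Give the window a bit budget proportional to its duration.
        self.budget = bitrate / fps * window
        self.units_left = window
        self.X = 1.0  # R-Q model parameter, refined online

    def qstep_for_unit(self, complexity):
        # Allocate bits to the next unit, then invert R = X * C / Qstep.
        target = self.budget / max(self.units_left, 1)
        return max(self.X * complexity / max(target, 1.0), 0.5)

    def update(self, used_bits, complexity, qstep):
        # Charge the spent bits and refit X from the observed (R, Q) pair;
        # a scene switch would reset the window to absorb the rate jump.
        self.budget -= used_bits
        self.units_left -= 1
        self.X = 0.9 * self.X + 0.1 * (used_bits * qstep / max(complexity, 1e-6))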

Besides these encoder control techniques, synchronization between the two views is also very important in 3DV coding. To synchronize the two views, we use a clock-control mechanism and design a scheme based on the AVS system-layer specification. A transport-stream program map table (PMT) defines the relationship between the program and its elementary streams. Using this map, we attach a timestamp to each frame and align the timestamps of the two views for synchronized 3DV display.
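
In code, the pairing amounts to giving the two elementary streams of one program identical timestamps, as in the sketch below. The 90 kHz clock is the standard MPEG-2 transport-stream timebase; the generator itself is a simplification of the PMT-based scheme.

def stamp_views(left_frames, right_frames, fps=25, clock=90_000):
    # Attach the same presentation timestamp to each left/right pair so
    # that the decoder can display the two views simultaneously.
    step = clock // fps
    for i, (left, right) in enumerate(zip(left_frames, right_frames)):
        pts = i * step
        yield (left, pts), (right, pts)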

3.2 AVS 3D Live Broadcasting System

We have incorporated the AVS 3D real-time encoder into a live 3D broadcasting system. This was the first end-to-end AVS 3DV broadcasting system and a practical application of AVS 3DV coding. The system was used to broadcast 3D TV programs from the 2010 Guangzhou Asian Games and successfully delivered an immersive entertainment experience.

Fig. 8 shows the architecture of the broadcasting system, including the content acquisition, encoding, and display modules. Two-channel high-definition serial digital interface (HD-SDI) audio and video signals are the input. These signals are first transmitted to the switching station for editing, where other program content, such as captions, can be integrated. Then, the uncompressed audio and video signals are fed into the 3DV processor, in which the left and right views are adjusted so that the two views match exactly. The signals are then transmitted to the Quantel 3D broadcasting system, where the programs can be edited and reviewed. Finally, the signals are fed into the real-time AVS 3D encoder for compression. In the display system, the 3D program stream is input into a set-top box for decoding, and the decoded signal can be displayed by various 3D display systems, such as a 3D TV or projector.

This system is a low-cost solution for migrating smoothly from monoview to 3D TV broadcasting. It also highlights the practical value of the AVS 3DV coding standard in applications such as 3D mobile TV, remote interviewing, video surveillance, and remote learning. The whole 3DV industry chain, from acquisition to display, will benefit from the development of AVS 3DV coding technology.

4 Performance Comparisons

4.1 Inter-View Prediction Versus Simulcast

Inter-view prediction, enhanced motion-vector prediction, and enhanced direct mode are integrated into the AVS reference software RM52k_r2. The coding parameters are set according to the general test conditions [7]. The RD performance comparison from [8] is shown in Table 2. As Table 2 shows, inter-view prediction reduces the bit rate by up to 30% for the same peak signal-to-noise ratio (PSNR). The superior coding performance arises because the correlations between the two channels are exploited to reduce inter-view redundancy.

4.2 Stereo Packing Scheme

Side-by-side and top-to-bottom stereo packing make the coding process very flexible because various down-sampling and up-sampling algorithms can be used. The down-sampling and up-sampling methods greatly affect coding efficiency. Figs. 9 to 11 show the horizontal, vertical, and diamond down-sampling algorithms, respectively.

For each down-sampling algorithm, several corresponding up-sampling algorithms are used: bilinear, cubic, and AVC-based interpolation. In Fig. 12, Down0, Down1, and Down2 denote horizontal, vertical, and diamond down-sampling, respectively, and Up0, Up1, and Up2 denote bilinear, cubic, and AVC-based up-sampling, respectively. Combinations of these algorithms are integrated into the AVS 3D stereo-packing scheme, which is implemented in RM52k_r2. The RD performance is shown in Fig. 12. For the sequence Cafe, horizontal down-sampling with AVC-based up-sampling performs best. For the sequence PoznanHall, vertical down-sampling performs better than horizontal down-sampling. This suggests that the performance of a down-sampling method depends greatly on the properties of the video sequence. At low bit rates, the stereo-packing scheme achieves large coding gains over the simulcast scheme: in this regime, quantization causes most of the distortion, and the stereo-packing scheme provides the best RD trade-off.

5 Conclusion

In this paper, we have discussed the background, technical features, and applications of the AVS 3DV coding standard. AVS 3DV coding greatly improves coding efficiency while maintaining backward compatibility with standard video coding technology. We have also introduced the two main features of AVS 3DV coding: inter-view prediction and stereo packing. The AVS 3D TV live broadcasting system shows that the adopted schemes provide great flexibility for effective use across broad application domains. In the future, more new applications will be developed on top of existing and future AVS 3DV coding technology.

References

[1] L. Yu, S. Chen, and J. Wang, “Overview of AVS-video coding standards,” Signal Processing: Image Communication, vol. 24, no. 4, pp. 247-262, 2009.

[2] AVS Requirement Group, “Technical requirement of 3D video applications,” AVS Doc. AVS N1566, 2008.

[3] X. Ji, Y. Zhang, L. Yu, and G. Lee, “Stereoscopic video coding in AVS,” in Proc. Visual Communications and Image Processing (VCIP), Nov. 2011, pp. 1-4.

[4] D. Li, Y. Zhang, Q. Liu, X. Ji, and Q. Dai, “Enhanced block prediction in stereoscopic video coding,” in Proc. 3DTV Conference: The True Vision - Capture, Transmission and Display of 3D Video (3DTV-CON), May 2011, pp. 1-4.

[5] X. Ji, D. Zhao, F. Wu, Y. Lu, and W. Gao, “B-picture coding in AVS video compression standard,” Signal Processing: Image Communication, vol. 23, no. 1, pp. 31-41, 2008.

[6] X. Zhao, X. Zhang, L. Zhang, S. Ma, and W. Gao, “Low-complexity and sampling-aided multi-view video coding at low bitrate,” in Proc. Pacific-Rim Conference on Multimedia (PCM), Shanghai, China, Sep. 2010.

[7] AVS Video Group, “General test conditions of stereoscopic video coding,” AVS Doc. AVS N1760, 2010.

[8] D. Li, X. Ji, and Q. Liu, “AVS BM reference software maintenance report,” AVS Doc. AVS M2712, 2010.

Manuscript received: April 16, 2012

Biographies

Siwei Ma received his BSc degree from Shandong Normal University, Jinan, China, in 1999, and his PhD degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, in 2005. From 2005 to 2007, he was a postdoctoral researcher at the University of Southern California. He then joined the Institute of Digital Media, Peking University, where he is currently an associate professor. He has published more than 100 technical articles in refereed journals and proceedings in the areas of image and video coding, video processing, video streaming, and transmission.

Shiqi Wang received his BSc degree from Harbin Institute of Technology, China, in 2008. He is currently pursuing his PhD degree in computer science at Peking University.

Wen Gao received his PhD degree in electronics engineering from the University of Tokyo in 1991. He is a professor of computer science at Peking University. From 1991 to 1995, he was a professor of computer science at Harbin Institute of Technology and a professor at the Institute of Computing Technology, Chinese Academy of Sciences. He has published five books and more than 600 technical articles in refereed journals and conference proceedings in the areas of image processing, video coding and communication, pattern recognition, multimedia information retrieval, multimodal interfaces, and bioinformatics. Dr. Gao has been on the editorial boards of several journals, including IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Multimedia, IEEE Transactions on Autonomous Mental Development, EURASIP Journal of Image Communications, and Journal of Visual Communication and Image Representation. He has chaired a number of prestigious international conferences on multimedia and video signal processing, including IEEE ICME and ACM Multimedia, and has served on the advisory and technical committees of numerous professional organizations.