VENGME H.264/AVC – A Low Power 130nm H.264/AVC Compatible Video Encoder for Next Generation Multimedia Equipment

ASEAN IVO Forum 2015
26 November 2015, Kuala Lumpur, Malaysia

Xuan-Tu Tran, Duy-Hieu Bui, Ngoc-Mai Nguyen,
Nam-Khanh Dang, Van-Huan Tran, Viet-Thang Nguyen
Key Laboratory for Smart Integrated Systems (SISLAB)
Outline

• Context and Motivation
• VENGME H.264/AVC: architecture, design, implementation
• FIFO-based low power method
• Conclusion & Perspective
Video coding

First specification version in May 2003, latest version in April 2013

- H.261 (1988)
- MPEG-1 (1993)
- H.262/MPEG-2 (1994)
- H.263 (1995)
- MPEG-4 Visual (1998)
- H.264/AVC (2003)
- H.265/HEVC (2013)


Example: 60 seconds of 1080HD video @ 25 fps
- Raw RGB, 3x8 bit: 9.33 GB
- Encoded: 420 MB
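
The raw-size figure can be checked with a few lines of arithmetic. The sketch below assumes that "1080HD" means 1920x1080 pixels and that GB and MB are decimal units; the function name is illustrative only.

```python
def raw_video_bytes(width, height, bytes_per_pixel, fps, seconds):
    """Size in bytes of an uncompressed video clip."""
    return width * height * bytes_per_pixel * fps * seconds


# Assumption: 1080HD = 1920x1080, RGB with 3 x 8 bits = 3 bytes per pixel.
raw = raw_video_bytes(1920, 1080, 3, 25, 60)

print(f"raw size: {raw / 1e9:.2f} GB")
print(f"ratio to the 420 MB figure: {raw / 420e6:.1f}x")
```

The ratio of roughly 22 between the two figures falls inside the 2-70 compression range quoted later in the VENGME specifications.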
H.264/AVC challenges

- Adopt a wide set of video coding tools
- Provide significantly higher coding efficiency
  - Achieves 39%, 49% and 64% bit-rate reduction in comparison to MPEG-4, H.263 and MPEG-2, respectively
- Enable «network friendliness»
- Support a variety of applications

- High computational complexity & data dependency
- High power consumption
  - Hard to implement!
  - Hardware designers need to reduce power consumption!
H.264/AVC encoding HW classical implementation

[Figure: classical four-stage macroblock pipeline. The video input is a video sequence, split into video pictures and then into 16×16-pixel macroblocks. Blocks shown across Stages 1 to 4: inter prediction (IME, FME), intra prediction (IntraP), FTQ, ITQ, Rec., EC, DF and the stream output. Inputs: current macroblock, reference frame(s).]

Legend:
IME: Integer Motion Estimation
FME: Fractional Motion Estimation
IntraP: Intra prediction
FTQ: Forward Transform & Quantization
ITQ: De-Quantization & Inverse transform
Rec.: Reconstruction
EC: Entropy Coder
DF: De-blocking filter
H.264/AVC encoding HW implementations

**Scalability-oriented** [Chen, 08] additional feature | high area cost

**Speed-oriented** [Iwata, 09] [Chen, 08] pipeline architecture for more parallelization | high area & power consumption cost

**Power-oriented**
Dynamic Clock Supply Stop [Mochizuki, 07], fine-grained clock-gating [Chen, 09]
Memory access reduction techniques applied at design phase [Zuo, 12][Kim, 11][Lin, 08][Chen, 09]
Quality-scalability by parameterization [Kim, 11]
VENGME: 4-stage pipelining architecture

- 4-stage pipeline structure with low power design features
- **Main profile:** maximum speed 11,880 macroblocks/s (MB/s), frame size 396 macroblocks (MBs), bit rate 2 Mbps
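
Reading MB as a 16×16-pixel macroblock, the two figures above are consistent with CIF video at 30 frames/s. A minimal sketch, assuming a 352×288 CIF frame and 30 fps (neither value is stated on this slide):

```python
def macroblocks_per_frame(width, height, mb_size=16):
    """Number of mb_size x mb_size macroblocks needed to cover a frame."""
    # Ceiling division, so partially covered macroblocks at the edges count.
    mbs_x = -(-width // mb_size)
    mbs_y = -(-height // mb_size)
    return mbs_x * mbs_y


cif_mbs = macroblocks_per_frame(352, 288)  # assumed CIF frame: 22 x 18
mb_rate = cif_mbs * 30                     # assumed 30 frames/s

print(cif_mbs, "macroblocks per frame")
print(mb_rate, "macroblocks per second")
```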
VENGME: System HW architecture (simplified)
VENGME: Specifications

MAIN PROFILE

• Image format: QCIF/CIF (up to HD 1280x720p@30fps)
• Color sub-sampling: YUV 4:2:0 (progressive) with 8 bits/pixel
• Supporting I/P/B slices with configurable Group of Picture (GOP)
• System throughput (CIF sequences): 76 frames/s@100MHz
• Compression ratio: 2-70 times (depending on QP and Group of Pictures)
• Full-search inter prediction with variable block size partitioning
• Quarter-pixel accuracy and 48x48 search window size
• Full-search intra prediction
• CAVLC entropy coding with NAL bitstream formatting
• In-loop deblocking filter support
• Single transformation module with 4x4 FDCT/IDCT and 4x4, 2x2 Hadamard transform
• Objective quality: PSNR: 24-64 dB (depending on QP)
• Power target: < 100mW
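
As an illustration of the 4x4 transform listed above, the sketch below applies the standard H.264/AVC 4x4 forward core transform, Y = Cf · X · Cfᵀ, to a residual block. It is plain software written for this note, not the VENGME hardware module. The scaling that the standard folds into quantization is left out, and the helper names are illustrative.

```python
# Forward core transform matrix of the H.264/AVC 4x4 integer transform.
CF = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_core_transform(block):
    """Return CF * block * CF^T for a 4x4 residual block (no scaling)."""
    return matmul(matmul(CF, block), transpose(CF))


# A flat residual block puts all its energy in the DC coefficient.
flat = [[1] * 4 for _ in range(4)]
coeffs = forward_core_transform(flat)
print(coeffs[0])  # first row of coefficients, DC term first
```

Because the matrix holds only ±1 and ±2, the transform needs only additions and shifts, which is one reason it suits a compact hardware transform module.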
VENGME: Key contributions

- An efficient HW architecture with **unbalanced 4-stage pipeline processing** to enable power-aware operation;
- Memory access reduction by the proposed **data reuse mechanism**;
- **On-the-fly calculation** techniques;
- **Low power techniques** (power gating, clock gating…);
- Optimization & pipelining techniques at block level.
VENGME: test chip

Features:
- Technology: GlobalFoundries CMOS 130nm
- Area overhead: 16mm²
- Complexity: ~2 Mgates (with memory)
- Power consumption: 53mW
- Operating frequency: 100MHz
- Voltage: 1.2 V
- Package: QFP256

Development:
- 18 man-years total investment for the chip design *(completed and sent to the foundry in 2014)*
- Publications: IEEE Trans. on VLSI (S), JEC’14, IEEE APCCAS’14, IEEE IECON’14, IEEE NEWCAS’14, IEEE DDECS’14, IEEE SOCC’13, REV’13, IEICE ICDV’13, IEEE ATC’13, ICGHIT’13, IEEE ATC’12, IEICE ICDV’12, JEC’11, IEEE DDECS’11...
- Training: 1 PhD student, 6 MSc students, 15 BSc students...
- 2 “Best Paper Awards”
## Comparison of VLSI H.264/AVC encoders

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th>VENGME</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Target</strong></td>
<td>Real-time</td>
<td>SVC, high profile</td>
<td>Performance, Low power, Video size scalable</td>
<td>HW design for H.264 video codec</td>
<td>Low power, Low-aware, Portable devices</td>
<td>Low power real-time high picture quality</td>
<td>High profile, Low area, High throughput</td>
<td>Dynamic Scalable, Power aware</td>
<td>Low power, Real-time, portable devices</td>
<td></td>
</tr>
<tr>
<td><strong>Profile</strong></td>
<td>Baseline, level 4</td>
<td>High profile, SVC</td>
<td>High, level 4.1</td>
<td>Baseline, level up to 3.1</td>
<td>Baseline</td>
<td>Baseline, level 3.2</td>
<td>Baseline/High level 4</td>
<td>Baseline</td>
<td>N/A</td>
<td>Main Profile</td>
</tr>
<tr>
<td><strong>Resolution</strong></td>
<td>1080p30</td>
<td>HDTV 1080p</td>
<td>1080p30</td>
<td>720p SD/HD</td>
<td>QCIF, 720SDTV</td>
<td>720p SD/HD</td>
<td>CIF to 1080p</td>
<td>CIF to HD720</td>
<td>N/A</td>
<td>CIF, HD 720p (1280x720)</td>
</tr>
<tr>
<td><strong>Technology (nm)</strong></td>
<td>UMC 180, 1P6M CMOS</td>
<td>UMC 90 1P9M</td>
<td>MOS 65</td>
<td>UMC 180, 1P6M CMOS</td>
<td>TSMC 180, 1P6M CMOS</td>
<td>Renesas 90, 1POLY-7Cu-AlP</td>
<td>UMC 130</td>
<td>CMOS 130</td>
<td>N/A</td>
<td>GlobalFoundries CMOS 130</td>
</tr>
<tr>
<td><strong>Frequency (MHz)</strong></td>
<td>200</td>
<td>120 (high profile); 166 (SVC)</td>
<td>162</td>
<td>81 (SD); 180 (HD)</td>
<td>N/A</td>
<td>54 (SD); 144 (HD)</td>
<td>7.2 (CIF); 145 (1080p)</td>
<td>10-12-18-28 (CIF); 72-108 (HD720)</td>
<td>N/A</td>
<td>100 (HD720p)</td>
</tr>
<tr>
<td><strong>Gate count (K Gates)</strong></td>
<td>1140</td>
<td>2079</td>
<td>3745</td>
<td>922.8</td>
<td>452.8</td>
<td>1300</td>
<td>593</td>
<td>470</td>
<td>N/A</td>
<td>1900 (including memory)</td>
</tr>
<tr>
<td><strong>Memory (KBytes)</strong></td>
<td>108.3</td>
<td>81.7</td>
<td>230</td>
<td>34.72</td>
<td>16.95</td>
<td>56</td>
<td>22</td>
<td>13.3</td>
<td>N/A</td>
<td>N/A</td>
</tr>
<tr>
<td><strong>Power consumption (mW)</strong></td>
<td>1410</td>
<td>360 (high profile), 411 (SVC)</td>
<td>256</td>
<td>581 (SD), 785 (HD)</td>
<td>40.3 (CIF, 2 references)</td>
<td>4.9 - 15.9 (CIF 1 reference)</td>
<td>64.2 (720SDTV)</td>
<td>6.74 (CIF baseline), 242 (1080p high profile)</td>
<td>238.38 to 259.89 depends on PW level</td>
<td>53 (100 MHz; HD720p)</td>
</tr>
</tbody>
</table>
Demo (different encoded videos with different QPs)

PSNR > 30 dB
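
For reference, PSNR compares an encoded picture against the original through the mean squared error. A minimal sketch for 8-bit samples, assuming the pictures are given as flat sequences of pixel values (the function name and the sample data are illustrative, not taken from the demo):

```python
import math


def psnr(reference, encoded, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(reference) != len(encoded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, encoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical pictures
    return 10 * math.log10(peak ** 2 / mse)


# Two samples, each off by 10 grey levels: MSE = 100.
value = psnr([100, 100], [110, 90])
print(f"{value:.2f} dB")
```

A larger quantization parameter (QP) generally increases the error and therefore lowers the PSNR.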
Demonstration
Expected applications

VENGME H.264/AVC

Transport

- Monitoring cameras
- Intelligent transportation systems (ITS)

Security

- Building management systems
- Public security systems
- Bank security systems
- School monitoring systems
- Government/military video conferencing

Internet of Things (IoT)

Multimedia devices

- Cameras, video recorders
- Mobile devices

Home use

- Security systems
- IP cameras
Further power reduction techniques

- In VENGME, the workload is not identical
- There is still room to reduce power consumption
Power reduction method proposal

- Scale down (up) frequency (& voltage)
  - → power decreases (increases)
    \[ P_{total} = K C_L V_{dd}^2 f + q_{sc} f V_{dd} + I_{leakage} V_{dd} \]
  - → Smooth functioning of the FIFO

- Solution: Manage power consumption by scaling frequency \( f \) (and voltage \( V \)) according to FIFO link status
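
The idea can be sketched in software as follows. The power function evaluates the three terms of the equation above, and a simple threshold rule picks the frequency level of the module that reads from a FIFO according to how full that FIFO is. All constants, thresholds and frequency levels are hypothetical values chosen for illustration, and the threshold rule is an assumed stand-in, not the control law of the authors.

```python
def total_power(f_hz, vdd, k=0.1, c_load=1e-9, q_sc=1e-12, i_leak=1e-3):
    """P_total = K*C_L*Vdd^2*f + q_sc*f*Vdd + I_leakage*Vdd, in watts.

    The default constants are hypothetical, not VENGME measurements.
    """
    dynamic = k * c_load * vdd ** 2 * f_hz
    short_circuit = q_sc * f_hz * vdd
    leakage = i_leak * vdd
    return dynamic + short_circuit + leakage


def next_level(fill_ratio, level, n_levels, low=0.25, high=0.75):
    """Choose the next frequency level for the module reading the FIFO.

    A FIFO that fills up means the reader is too slow, so speed it up.
    A FIFO that runs empty means the reader can slow down and save power.
    """
    if fill_ratio > high and level < n_levels - 1:
        return level + 1
    if fill_ratio < low and level > 0:
        return level - 1
    return level


FREQS_HZ = [25e6, 50e6, 100e6]  # hypothetical frequency levels
VDD = 1.2                       # fixed supply voltage in this sketch

level = 1
for fill in [0.9, 0.9, 0.5, 0.1, 0.1, 0.1]:  # made-up FIFO fill ratios
    level = next_level(fill, level, len(FREQS_HZ))
    f = FREQS_HZ[level]
    print(f"fill={fill:.1f} -> {f / 1e6:.0f} MHz, "
          f"{total_power(f, VDD) * 1e3:.2f} mW")
```

Scaling the voltage together with the frequency would reduce the dynamic term further, since it depends on the square of the supply voltage. This sketch keeps the voltage fixed for simplicity.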

Split into many frequency domains

- Choosing the consuming module to apply the control
- Global power optimization among controlled FIFOs
Split into **many power** domains

- Control law re-design
- Delay due to power supply switching
Publications

- Nam-Khanh Dang, Alain Merigot, Xuan-Tu Tran. An Efficient Hardware Architecture for Inter-Prediction in H.264/AVC Encoders. IEEE Transactions on Very Large Scale Integration Systems.


- N.-M. Nguyen, et al. FIFO-level based Power Management and its Application to a H.264 Encoder. In Proceedings of the IEEE Industrial Electronics Conference, IECON’14, pp. 158-163, Dallas, TX, USA, October 2014. (Best Presentation Award)


- Viet-Thang Nguyen, Xuan-Tu Tran, Ha Vu Le. An Efficient Algorithm of Inter-Prediction Coding for H.264/AVC Encoders. International Conference on Green and Human Information Technology (ICGHIT 2013), Hanoi, Vietnam, February 27 – March 1, 2013.


Thank you for your attention!