Hi! I am a PhD student in the Department of Computer Science at Duke University and a member of the Computational Evolutionary Intelligence (CEI) Lab, supervised by Prof. Yiran Chen. Before joining Duke University, I received my M.S. degree from Peking University (北京大学) in 2024 under the supervision of Prof. Hailong Jiao, and my B.Eng. degree from Southern University of Science and Technology (南方科技大学) in 2021, advised by Prof. Fengwei An. My current research focuses on algorithm-hardware co-design.
E-mail: yuzhe.fu@duke.edu

📖 Education

  • 2024.08 - Present, PhD student, Department of Computer Science, Duke University.
  • 2021.09 - 2024.07, Master of Science, Peking University.
  • 2017.09 - 2021.06, Bachelor of Engineering, Southern University of Science and Technology.
  • 2019.08 - 2019.12, Global Access Program, University of California, Berkeley.

🔥 News

  • 2024.10:  🎉🎉 Our paper “Nebula: A 28-nm 109.8 TOPS/W 3D PNN Accelerator Featuring Adaptive Partition, Multi-Skipping, and Block-Wise Aggregation” has been accepted by IEEE ISSCC!
  • 2024.08: Starting my PhD journey. Wish me luck 🍀!
  • 2024.04:  🎉🎉 Our paper “SoftAct: A High-Precision Softmax Architecture for Transformers Supporting Nonlinear Functions” has been accepted by IEEE TCSVT!
  • 2024.02:  🎉🎉 Our paper “Adjustable Multi-Stream Block-Wise Farthest Point Sampling Acceleration in Point Cloud Analysis” has been accepted by IEEE TCAS-II!
  • 2023.10:  🎉🎉 Yuzhe Fu presented our work on a 3D point cloud neural network accelerator at IEEE/ACM ICCAD 2023!
  • 2023.08:  🎉🎉 Our paper “Sagitta: An Energy-Efficient Sparse 3D-CNN Accelerator for Real-Time 3D Understanding” has been accepted by IEEE IOTJ!
  • 2023.07:  🎉🎉 Our paper “An Energy-Efficient 3D Point Cloud Neural Network Accelerator with Efficient Filter Pruning, MLP Fusion, and Dual-Stream Sampling” has been accepted by IEEE/ACM ICCAD 2023!

📝 Publications

ISSCC

Nebula: A 28-nm 109.8 TOPS/W 3D PNN Accelerator Featuring Adaptive Partition, Multi-Skipping, and Block-Wise Aggregation

C. Zhou, T. Huang, Y. Ma, Yuzhe Fu, S. Qiu, X. Song, J. Sun, M. Liu, Y. Yang, G. Li, Y. He, H. Jiao.

2025, IEEE International Solid-State Circuits Conference (ISSCC) (Accepted)

TCSVT

SoftAct: A High-Precision Softmax Architecture for Transformers Supporting Nonlinear Functions

Yuzhe Fu, Changchun Zhou, Tianling Huang, Eryi Han, Yifan He, Hailong Jiao.

2024, IEEE Transactions on Circuits and Systems for Video Technology [pdf]

Abstract: Transformer-based deep learning networks are revolutionizing our society. Convolution and attention co-designed (CAC) Transformers have demonstrated superior performance compared to conventional Transformer-based networks. However, CAC Transformer networks contain various nonlinear functions, such as softmax and complex activation functions, which require high-precision hardware that typically comes at a significant cost in area and power consumption. To address these challenges, SoftAct, a compact and high-precision algorithm-hardware co-designed architecture, is proposed to implement both softmax and nonlinear activation functions in CAC Transformer accelerators. An improved softmax algorithm with penalties is proposed to maintain precision in hardware. A stage-wise full zero detection method is developed to skip redundant computation in softmax. A compact and reconfigurable architecture with a symmetrically designed linear fitting module is proposed to implement nonlinear functions. The SoftAct architecture is designed in an industrial 28-nm CMOS technology with the MobileViT-xxs network as the benchmark. Compared with the state of the art, SoftAct improves network accuracy by up to 5.87%, area efficiency by up to 153.2×, and overall efficiency by up to 1435×.

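The abstract mentions a linear fitting module that exploits symmetry. As a rough software illustration only (not the SoftAct hardware, whose segment layout and fixed-point precision are not described here), the sketch below approximates SiLU (Swish), the activation used in MobileViT, with uniform piecewise-linear segments stored only for x ≥ 0. Negative inputs reuse the same table through the exact identity SiLU(x) = SiLU(−x) + x, which is one possible reading of "symmetric" fitting. The names `build_table` and `silu_pwl`, the range [0, 8], and the 32 segments are hypothetical choices.

```python
import math

def silu(x: float) -> float:
    """Reference SiLU (Swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def build_table(f, x_max: float, segments: int):
    """Fit one line per uniform segment of [0, x_max] through the segment endpoints.
    Returns the segment width and a list of (slope, intercept) pairs."""
    step = x_max / segments
    table = []
    for i in range(segments):
        x0, x1 = i * step, (i + 1) * step
        slope = (f(x1) - f(x0)) / step
        table.append((slope, f(x0) - slope * x0))
    return step, table

def silu_pwl(x: float, step: float, table) -> float:
    """Approximate SiLU with a table that only covers x >= 0.
    Negative inputs reuse it via the identity SiLU(x) = SiLU(-x) + x."""
    ax = abs(x)
    if ax >= step * len(table):
        y_pos = ax  # SiLU(t) is very close to t for large t
    else:
        idx = min(int(ax / step), len(table) - 1)
        slope, intercept = table[idx]
        y_pos = slope * ax + intercept
    # For x < 0: SiLU(x) = SiLU(|x|) - |x|
    return y_pos if x >= 0 else y_pos - ax

# Usage: measure the worst-case error on a grid over [-10, 10].
step, table = build_table(silu, x_max=8.0, segments=32)
err = max(abs(silu_pwl(i / 100, step, table) - silu(i / 100)) for i in range(-1000, 1001))
print(f"max abs error on [-10, 10]: {err:.4f}")
```

Halving the stored range is the point of the design choice here: the table only spans x ≥ 0, and the negative half costs one subtraction.
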
TCAS-II

Adjustable Multi-Stream Block-Wise Farthest Point Sampling Acceleration in Point Cloud Analysis

Changchun Zhou#, Yuzhe Fu#, Yanzhe Ma, Eryi Han, Yifan He, Hailong Jiao.

2024, IEEE Transactions on Circuits and Systems II: Express Briefs (# with equal contribution) [pdf]

Abstract: Point clouds are increasingly used in a variety of applications. In point cloud analysis, Farthest Point Sampling (FPS) is typically employed for down-sampling, reducing the size of the point cloud while enhancing its representational capability by preserving contour points. However, the low parallelism and high computational complexity of FPS lead to high energy consumption and long latency, making it a bottleneck for hardware acceleration. In this brief, we propose an adjustable multi-stream block-wise FPS algorithm, tuned by four configurable parameters according to hardware and accuracy requirements. A unified hardware architecture with one additional parameter is designed to implement the adjustable multi-stream block-wise FPS algorithm. Furthermore, we present a rapid searching algorithm to select the optimal configuration of all five parameters (four algorithmic and one hardware). Designed in an industrial 28-nm CMOS technology, the proposed hardware architecture achieves a latency of 0.005 (1.401) ms and an energy consumption of 0.09 (27.265) µJ/frame for 1k (24k) input points at 200 MHz and 0.9 V supply voltage. Compared to the state of the art, the proposed hardware architecture reduces latency by up to 99.9% and energy consumption by up to 99.5%, and improves network accuracy by up to 9.34%.

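For readers unfamiliar with FPS, the sketch below shows the classic sequential algorithm, whose loop-carried dependency (each new sample depends on distances updated by the previous one) explains the low parallelism mentioned above, next to a deliberately simplified block-wise variant. This is a generic illustration under stated assumptions, not the paper's multi-stream algorithm: the x-axis partitioning, the proportional per-block quota, and the names `fps`, `blockwise_fps`, and `num_blocks` are hypothetical, and none of the paper's four configurable parameters are modeled.

```python
import math
import random

Point = tuple[float, float, float]

def fps(points: list[Point], k: int, start: int = 0) -> list[int]:
    """Classic farthest point sampling, O(N * k): repeatedly pick the point
    farthest from everything sampled so far. Returns indices into `points`."""
    n = len(points)
    k = min(k, n)
    if k <= 0:
        return []
    min_dist = [math.inf] * n  # squared distance to the nearest sampled point
    selected = [start]
    for _ in range(k - 1):
        lx, ly, lz = points[selected[-1]]
        # Update every point's distance using only the newest sample.
        for i, (x, y, z) in enumerate(points):
            d = (x - lx) ** 2 + (y - ly) ** 2 + (z - lz) ** 2
            if d < min_dist[i]:
                min_dist[i] = d
        selected.append(max(range(n), key=min_dist.__getitem__))
    return selected

def blockwise_fps(points: list[Point], k: int, num_blocks: int) -> list[int]:
    """Hypothetical simplified block-wise FPS: sort by x, cut into num_blocks
    near-equal blocks, and run FPS independently in each block with a quota
    proportional to its size. Blocks share no state, so they could be
    processed in parallel; here they simply run one after another."""
    n = len(points)
    k = min(k, n)
    if n == 0 or k <= 0:
        return []
    order = sorted(range(n), key=lambda i: points[i][0])
    result: list[int] = []
    for b in range(num_blocks):
        lo, hi = b * n // num_blocks, (b + 1) * n // num_blocks
        quota = k * hi // n - k * lo // n  # quotas telescope to exactly k in total
        block = order[lo:hi]
        local = fps([points[i] for i in block], quota)
        result.extend(block[j] for j in local)
    return result

# Usage on random points (seeded for repeatability).
rng = random.Random(0)
cloud = [(rng.random(), rng.random(), rng.random()) for _ in range(1000)]
print(len(fps(cloud, 64)), len(blockwise_fps(cloud, 64, num_blocks=4)))
```

In this sketch each block only compares points against its own samples, so the per-iteration distance update shrinks from N points to roughly N / num_blocks, at the cost of possibly choosing points near block boundaries that global FPS would skip.
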
ICCAD

An Energy-Efficient 3D Point Cloud Neural Network Accelerator with Efficient Filter Pruning, MLP Fusion, and Dual-Stream Sampling

Changchun Zhou, Yuzhe Fu, Min Liu, Siyuan Qiu, Ge Li, Yifan He, Hailong Jiao.

2023, IEEE/ACM International Conference on Computer-Aided Design [pdf][YouTube]

Abstract: Three-dimensional (3D) point clouds have recently been employed in a wide range of applications. As a powerful tool for point cloud analysis, point-based point cloud neural networks (PNNs) have demonstrated superior performance with lower computational complexity and fewer parameters than sparse 3D convolution-based networks and graph-based convolutional neural networks. However, point-based PNNs still suffer from high computational redundancy, large off-chip memory access, and low parallelism in hardware implementation, which hinders their deployment on edge devices. In this paper, to address these challenges, an energy-efficient 3D point cloud neural network accelerator is proposed for on-chip edge computing. An efficient filter pruning scheme is used to skip the redundant convolution of pruned filters and zero-value feature channels. A block-wise multi-layer perceptron (MLP) fusion method is proposed to increase the on-chip reuse of features, thereby reducing off-chip memory access. A dual-stream blocking technique is proposed for higher parallelism while maintaining inference accuracy. Implemented in an industrial 28-nm CMOS technology, the proposed accelerator achieves an effective energy efficiency of 12.65 TOPS/W and an energy consumption of 0.13 mJ/frame for PointNeXt-S at 100 MHz, 0.9 V supply voltage, and 8-bit data width. Compared to state-of-the-art point cloud neural network accelerators, the proposed accelerator improves energy efficiency by up to 66.6× and reduces energy consumption per frame by up to 70.2×.

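Block-wise MLP fusion, as named in the abstract, relies on the fact that a shared point-wise MLP treats each point independently, so consecutive layers can be evaluated block by block without changing the result. The sketch below illustrates only that scheduling idea, assuming plain dense layers with ReLU; `linear_relu`, `layer_by_layer`, `block_fused`, and `block_size` are hypothetical names, and the accelerator's actual dataflow, pruning, and 8-bit quantization are not modeled.

```python
import random

Vector = list[float]
Matrix = list[list[float]]  # one row per output channel

def linear_relu(x: Vector, w: Matrix) -> Vector:
    """One shared point-wise layer: ReLU(W @ x), applied to a single point."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x))) for row in w]

def layer_by_layer(points: list[Vector], weights: list[Matrix]) -> list[Vector]:
    """Unfused schedule: each layer finishes for all N points before the next
    starts, so an N-point intermediate feature map must be held between layers."""
    feats = points
    for w in weights:
        feats = [linear_relu(p, w) for p in feats]
    return feats

def block_fused(points: list[Vector], weights: list[Matrix], block_size: int) -> list[Vector]:
    """Fused schedule: push one block of points through every layer before
    loading the next block, so only block_size intermediate vectors are live."""
    out: list[Vector] = []
    for start in range(0, len(points), block_size):
        block = points[start:start + block_size]
        for w in weights:
            block = [linear_relu(p, w) for p in block]
        out.extend(block)
    return out

# Usage: both schedules give identical results because each point is processed independently.
rng = random.Random(1)
pts = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(10)]
ws = [
    [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(8)],  # 3 -> 8 channels
    [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)],  # 8 -> 4 channels
]
print(block_fused(pts, ws, block_size=4) == layer_by_layer(pts, ws))
```

The trade-off this sketch exposes is buffer size versus weight reloading: a smaller `block_size` shrinks the live intermediate features, while every layer's weights are revisited once per block.
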
IOTJ

Sagitta: An Energy-Efficient Sparse 3D-CNN Accelerator for Real-Time 3D Understanding

Changchun Zhou, Min Liu, Siyuan Qiu, Xugang Cao, Yuzhe Fu, Yifan He, Hailong Jiao.

2023, IEEE Internet of Things Journal [pdf]

TCAS-I

A 4.29 nJ/pixel Stereo Depth Coprocessor with Pixel Level Pipeline and Region Optimized Semi-Global Matching for IoT Application

Pingcheng Dong, Zhuoyu Chen, Zhuoao Li, Yuzhe Fu, Lei Chen, Fengwei An.

2021, IEEE Transactions on Circuits and Systems I: Regular Papers [pdf]

📃 Patent

🍀 Tape Out

  • An energy-efficient pipelined and configurable 3D point cloud-based neural network accelerator was designed in TSMC 28-nm HPC technology with an area of 2.0 mm × 1.5 mm and taped out in July 2023.
  • A 4.5 TOPS/W sparse 3D-CNN accelerator for real-time 3D understanding was fabricated in August 2020 in UMC 55-nm low-power CMOS technology, with an area of 4.2 mm × 3.6 mm.

🎖 Honors and Awards

  • 2021 Excellent Graduate Award, Southern University of Science and Technology
  • 2021 Best Presentation Award, IEEE CASS Shanghai and Shenzhen Joint Workshop
  • 2020 National Scholarship, Ministry of Education of the PRC (the highest scholarship for Chinese undergraduates)
  • 2018, 2019 First Prize of Outstanding Students, SUSTech (top 5% in SUSTech)

💻 Reviewer for

  • TCSVT: IEEE Transactions on Circuits and Systems for Video Technology
  • ELL: Electronics Letters
  • AICAS: IEEE International Conference on Artificial Intelligence Circuits and Systems

About Me

  • I am an easy-going and spirited person with a passion for life. My enthusiasm not only drives my own life but also positively influences those around me.
  • Interests and Hobbies: fitness, jogging, swimming, photography, traveling. (Here is a short 📸 video about my graduation trip in 2021.)