Publications

xKV: Cross-Layer SVD for KV-Cache Compression

Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed Abdelfattah

arxiv: 2503.18893

PDF · Code

Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed Abdelfattah, Diana Marculescu

International Conference on Machine Learning (ICML), 2025

PDF · Code

Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta

arxiv: 2505.21467

PDF

SplitReason: Learning To Offload Reasoning

Yash Akhauri, Anthony Fei, Chi-Chih Chang, Ahmed Fathy, Yueying Li, Mohamed Abdelfattah

arxiv: 2504.16379

PDF · Code

Palu: Compressing KV-Cache with Low-Rank Projection

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed Abdelfattah, Kai-Chiang Wu

International Conference on Learning Representations (ICLR), 2025

PDF · Code

FlashDepth: Real-time Streaming Depth Estimation at 2K Resolution

Gene Chou, Wenqi Xian, Guandao Yang, Mohamed Abdelfattah, Bharath Hariharan, Noah Snavely, Ning Yu, Paul Debevec

arxiv: 2504.07093

PDF

TokenButler: Token Importance is Predictable

Yash Akhauri, Safeen Huda, Mohamed Abdelfattah

arxiv: 2503.07518

PDF · Code

BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

Yuzong Chen, Ahmed Fathy, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed Abdelfattah

IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2025

PDF · Code

SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs

Ahmed Fathy, Jordan Dotzel, Yash Akhauri, Chi-Chih Chang, Sameh Gobriel, J. Pablo Muñoz, Vui Seng Chua, Nilesh Jain, Mohamed Abdelfattah

arxiv: 2502.12444

PDF

The Power of Negative Zero: Datatype Customization for Quantized Large Language Models

Yuzong Chen, Xilai Dai, Chi-Chih Chang, Yash Akhauri, Mohamed Abdelfattah

arxiv: 2501.04052

PDF · Code

ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models

Yash Akhauri, Ahmed Fathy, Jordan Dotzel, Zhiru Zhang, Alexander M. Rush, Safeen Huda, Mohamed Abdelfattah

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

PDF

BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration

Yuzong Chen, Jian Meng, Jae-sun Seo, Mohamed Abdelfattah

IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024

PDF

Attamba: Attending To Multi-Token States

Yash Akhauri, Ahmed Fathy, Yifei Gao, Chi-Chih Chang, Nilesh Jain, Mohamed Abdelfattah

arxiv: 2411.17685

PDF · Code

Kratos: An FPGA Benchmark for Unrolled DNNs with Fine-Grained Sparsity and Mixed Precision

Xilai Dai, Yuzong Chen, Mohamed Abdelfattah

International Conference on Field-Programmable Logic and Applications (FPL), 2024

PDF · Code

FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search

Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi, Quoc V. Le, Sheng Li

International Conference on Automated Machine Learning (AutoML), 2024

PDF

Towards Neural Architecture Search through Hierarchical Generative Modeling

Lichuan Xiang, Łukasz Dudziak, Mohamed Abdelfattah, Abhinav Mehrotra, Nicholas D. Lane, Hongkai Wen

International Conference on Machine Learning (ICML), 2024

PDF

Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs

Jordan Dotzel, Yuzong Chen, Bahaa Kotb, Sushma Prasad, Gang Wu, Sheng Li, Mohamed Abdelfattah, Zhiru Zhang

International Conference on Machine Learning (ICML), 2024

PDF

Encodings for Prediction-based Neural Architecture Search

Yash Akhauri, Mohamed Abdelfattah

International Conference on Machine Learning (ICML), 2024

PDF · Code

Beyond Inference: Performance Analysis of DNN Server Overheads for Computer Vision

Ahmed Fathy, Susanne Balle, Deshanand Singh, Mohamed Abdelfattah

Design Automation Conference (DAC), 2024

PDF

PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration

Ahmed Fathy, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed Abdelfattah

ACM Transactions on Reconfigurable Technology and Systems (TRETS), 2024

PDF · Code

On Latency Predictors for Neural Architecture Search

Yash Akhauri, Mohamed Abdelfattah

International Conference on Machine Learning and Systems (MLSYS), 2024

PDF · Code

M4BRAM: Mixed-Precision Matrix-Matrix Multiplication in FPGA Block RAMs

Yuzong Chen, Jordan Dotzel, Mohamed Abdelfattah

International Conference on Field-Programmable Technology (FPT), 2023

PDF · Code

DiviML: A Module-based Heuristic for Mapping Neural Networks onto Heterogeneous Platforms

Yassine Ghannane, Mohamed Abdelfattah

International Conference on Computer-Aided Design (ICCAD), 2023

PDF

Multi-Predict: Few Shot Predictors for Efficient Neural Architecture Search

Yash Akhauri, Mohamed Abdelfattah

International Conference on Automated Machine Learning (AutoML), 2023

PDF

BRAMAC: Compute-in-BRAM Architectures for Multiply-Accumulate on FPGAs

Yuzong Chen, Mohamed Abdelfattah

International Symposium On Field-Programmable Custom Computing Machines (FCCM), 2023

PDF · Code

Learned Connectivity Sparsification for LUT-based Neural Networks

Erwei Wang, Georgios Stavrou, Peter Cheung, George Constantinides, Mohamed Abdelfattah, James Davis

ACM Transactions on Reconfigurable Technology and Systems (TRETS), 2023

PDF

BLOX: Macro Neural Architecture Search Benchmark and Algorithms

Thomas Chau, Lukasz Dudziak, Hongkai Wen, Nicholas D. Lane, Mohamed Abdelfattah

Conference on Neural Information Processing Systems (NeurIPS), 2022

PDF

Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design

Hongxiang Fan, Thomas Chau, Stylianos Venieris, Royson Lee, Alexandros Kouris, Wayne Luk, Nicholas D. Lane, Mohamed Abdelfattah

IEEE/ACM International Symposium on Microarchitecture (MICRO), 2022

PDF

Zero-Cost Operation Scoring in Differentiable Architecture Search

Lichuan Xiang, Lukasz Dudziak, Mohamed Abdelfattah, Thomas Chau, Nicholas D. Lane, Hongkai Wen

arxiv: 2106.06799

PDF

Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural Network Inference

Erwei Wang, James Davis, Georgios Stavrou, Peter Cheung, George Constantinides, Mohamed Abdelfattah

International Symposium on Field-Programmable Gate Arrays (FPGA), 2022

PDF · Code

Abdelfattah Research Group

Publications

2025

xKV: Cross-Layer SVD for KV-Cache Compression

Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models

Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

SplitReason: Learning To Offload Reasoning

Palu: Compressing KV-Cache with Low-Rank Projection

FlashDepth: Real-time Streaming Depth Estimation at 2K Resolution

TokenButler: Token Importance is Predictable

BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

SparAMX: Accelerating Compressed LLMs Token Generation on AMX-powered CPUs

The Power of Negative Zero: Datatype Customization for Quantized Large Language Models

2024

ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models

BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration

Attamba: Attending To Multi-Token States

Kratos: An FPGA Benchmark for Unrolled DNNs with Fine-Grained Sparsity and Mixed Precision

FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search

Towards Neural Architecture Search through Hierarchical Generative Modeling

Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs

Encodings for Prediction-based Neural Architecture Search

Beyond Inference: Performance Analysis of DNN Server Overheads for Computer Vision

PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration

On Latency Predictors for Neural Architecture Search

2023

M4BRAM: Mixed-Precision Matrix-Matrix Multiplication in FPGA Block RAMs

DiviML: A Module-based Heuristic for Mapping Neural Networks onto Heterogeneous Platforms

Multi-Predict: Few Shot Predictors for Efficient Neural Architecture Search

BRAMAC: Compute-in-BRAM Architectures for Multiply-Accumulate on FPGAs

Learned Connectivity Sparsification for LUT-based Neural Networks

2022

BLOX: Macro Neural Architecture Search Benchmark and Algorithms

Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design

Zero-Cost Operation Scoring in Differentiable Architecture Search

Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural Network Inference
