Towards Efficient Edge Inference: Quantization Alignment Optimization for Multimodal Large Language Models

Project: HDR Project (Masters by Research)

Project Details

Description

With the rise of large-scale deep learning models, particularly Multimodal Large Language Models (MLLMs), efficient deployment on edge
devices has become a pressing challenge due to limited hardware resources and the growing demand for real-time performance. This
work explores the use of graph transformation and advanced scheduling techniques within deep learning compilers to optimize memory
usage and reduce latency on edge devices. By leveraging graph transformations, the compiler restructures computational graphs to
minimize redundant operations, strategically share resources, and reduce memory overhead. Scheduling algorithms are then applied to
manage task execution order and parallelism, balancing memory allocation and minimizing delay without sacrificing model accuracy.
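To make these two passes concrete, the sketch below shows how they might look on a toy intermediate representation: a graph transformation that fuses chains of elementwise operators so their intermediate tensors are never materialised, followed by a greedy, memory-aware list scheduler that always runs the ready operation freeing the most memory. The Node IR and the function names (fuse_elementwise, schedule) are invented for this illustration and are not taken from any particular deep learning compiler.

# A minimal, hypothetical sketch of the two compiler passes described
# above: (1) a graph transformation fusing chains of elementwise
# operators, and (2) a greedy, memory-aware list scheduler.
# The Node IR and all names here are illustrative only.

from dataclasses import dataclass, field

@dataclass(eq=False)          # identity hashing so Nodes can key dicts
class Node:
    name: str
    op: str                   # e.g. "matmul", "relu", "sigmoid"
    inputs: list = field(default_factory=list)  # upstream Node objects
    out_bytes: int = 0        # size of this node's output tensor

ELEMENTWISE = {"relu", "sigmoid", "add", "mul"}

def fuse_elementwise(nodes):
    """Fold each elementwise node into its producer when the producer
    is also elementwise and has no other consumer, so the intermediate
    tensor is removed from the graph entirely."""
    uses = {n: 0 for n in nodes}
    for n in nodes:
        for i in n.inputs:
            uses[i] += 1
    replacement, kept = {}, {}, [][:2] if False else ({}, [])
    replacement, kept = {}, []          # fused-away node -> survivor
    for n in nodes:                     # assumed in topological order
        n.inputs = [replacement.get(i, i) for i in n.inputs]
        p = n.inputs[0] if len(n.inputs) == 1 else None
        if (p is not None and n.op in ELEMENTWISE
                and p.op.split("+")[-1] in ELEMENTWISE
                and uses[p] == 1):      # producer feeds only this node
            p.op += "+" + n.op          # one fused kernel computes both
            p.out_bytes = n.out_bytes   # only the final output is kept
            replacement[n] = p
        else:
            kept.append(n)
    return kept

def schedule(nodes):
    """Greedy list scheduling: among ready nodes, run the one whose
    execution frees the most memory relative to what it allocates,
    keeping peak live memory low without violating dependencies."""
    remaining = {n: 0 for n in nodes}   # outstanding consumers per node
    for n in nodes:
        for i in n.inputs:
            remaining[i] += 1
    done, order, live, peak = set(), [], 0, 0
    while len(order) < len(nodes):
        ready = [n for n in nodes if n not in done
                 and all(i in done for i in n.inputs)]
        def gain(n):                    # bytes freed minus bytes allocated
            return sum(i.out_bytes for i in n.inputs
                       if remaining[i] == 1) - n.out_bytes
        n = max(ready, key=gain)
        done.add(n); order.append(n)
        live += n.out_bytes
        peak = max(peak, live)
        for i in n.inputs:
            remaining[i] -= 1
            if remaining[i] == 0:       # last consumer ran; free the tensor
                live -= i.out_bytes
    return order, peak

# Toy graph: matmul -> relu -> sigmoid. Fusion merges the two
# elementwise ops into one kernel, so only two tensors are ever live.
a = Node("a", "matmul", [], out_bytes=4096)
b = Node("b", "relu", [a], out_bytes=4096)
c = Node("c", "sigmoid", [b], out_bytes=4096)
graph = fuse_elementwise([a, b, c])
order, peak = schedule(graph)
print([n.op for n in order], peak)      # ['matmul', 'relu+sigmoid'] 8192

On this toy graph, fusion eliminates one intermediate tensor and the scheduler reports the resulting peak of two live tensors; production compilers apply the same ideas at the scale of thousands of operators.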
Memory optimization is critical for deploying MLLMs and similar resource-intensive models on edge devices, as these models demand significant memory to store parameters and intermediate data. Without optimization, such models often exceed the memory limits of edge hardware, leading to failures or severely degraded performance. Reducing latency is equally essential, as edge deployments frequently require real-time responses in applications such as speech processing, autonomous vehicles, and augmented reality. By addressing these constraints, our approach enhances the feasibility and efficiency of deploying sophisticated deep learning models on edge devices, expanding the potential for AI applications to operate in distributed, resource-constrained environments.
Status: Not started
