Title: Neighborhood Attention: Reducing the O(n^2) Complexity of Attention at the Threadblock Level

 

 

Ali Hassani

Ph.D. Student in Computer Science

School of Interactive Computing

Georgia Institute of Technology

alihassanijr.com

 

Date & Time: Thursday, 7/31/2025, 12:00 PM - 2:00 PM Eastern Time

Location: Coda C1103 Lindberg + Zoom Meeting

 

Committee:

Dr. Humphrey Shi (Advisor) - School of Interactive Computing, Georgia Institute of Technology

Dr. Kartik Goyal - School of Interactive Computing, Georgia Institute of Technology

Dr. Judy Hoffman - School of Interactive Computing, Georgia Institute of Technology

Dr. Wen-mei Hwu - Electrical & Computer Engineering, University of Illinois at Urbana-Champaign

 

 

Abstract:

Attention is at the heart of most foundational AI models, across tasks and modalities. In many of those models it accounts for a significant amount of computation, since its cost grows quadratically with input size, and this is often cited as one of its greatest limitations. As a result, many approaches have been proposed to alleviate this issue, one of the most common being masked or reduced attention spans.
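
For a rough sense of the savings at stake, here is a back-of-the-envelope FLOP comparison between full self-attention and a reduced-span variant; the cost model and the numbers below are illustrative, not figures from the dissertation:

    # Approximate FLOPs for the two matrix multiplies (QK^T and PV) in one
    # attention layer; softmax and linear projections are ignored.
    def full_attention_flops(n: int, d: int) -> int:
        return 2 * 2 * n * n * d  # every query scores all n keys: O(n^2 d)

    def reduced_span_flops(n: int, k: int, d: int) -> int:
        return 2 * 2 * n * k * d  # every query scores only k keys: O(n k d)

    n, k, d = 16384, 512, 64  # sequence length, attention span, head dim
    print(full_attention_flops(n, d) // reduced_span_flops(n, k, d))  # 32x fewer FLOPs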

 

In this work, we revisit sliding window approaches, which were commonly believed to be inherently inefficient, and propose a new framework called Neighborhood Attention (NA). Through it, we resolve design flaws in the original sliding window attention works, implement the approach efficiently for modern hardware accelerators, specifically GPUs, and conduct experiments that highlight the strengths and weaknesses of these approaches.
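
As a concrete, if naive, illustration of the idea, here is a minimal pure-PyTorch sketch of 1D neighborhood attention. This is not the NATTEN implementation; the function name and shapes are our own:

    import torch

    def neighborhood_attention_1d(q, k, v, window: int):
        # q, k, v: (n, d) single-head tensors; window must be odd and <= n.
        n, d = q.shape
        # NA's fix to sliding windows: near the boundaries the window is
        # shifted inward, so every query attends to exactly `window` keys.
        start = (torch.arange(n) - window // 2).clamp(0, n - window)
        idx = start[:, None] + torch.arange(window)[None, :]  # (n, window)
        scores = torch.einsum("nd,nwd->nw", q, k[idx]) / d ** 0.5
        return torch.einsum("nw,nwd->nd", scores.softmax(dim=-1), v[idx])

    out = neighborhood_attention_1d(
        torch.randn(128, 64), torch.randn(128, 64), torch.randn(128, 64), window=13
    )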

 

At the same time, we bridge the parameterization and properties of convolution and attention by showing that NA exhibits inductive biases and receptive fields similar to those of convolutions, while remaining capable of capturing both short- and long-range interdependencies, as attention does.
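
The receptive-field parallel can be made precise: with kernel size k, stacked NA layers grow the receptive field linearly, exactly as stacked stride-1 convolutions do. A small sketch of that arithmetic, as our own illustration:

    def receptive_field(layers: int, kernel_size: int) -> int:
        # Each additional NA (or stride-1 convolution) layer extends the
        # receptive field by kernel_size - 1 tokens beyond the first layer's.
        return layers * (kernel_size - 1) + 1

    print(receptive_field(12, 7))  # 73 tokens after 12 layers of kernel size 7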

 

We then show the necessity of, and the challenges that arise from, infrastructure, especially in the context of modern implementations such as Flash Attention, and develop even more efficient, performance-optimized implementations of NA. Through these implementations, we achieve orders-of-magnitude improvements over naive implementations, with up to a 2X improvement in inference time and a 1.4X improvement in training time.
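
The kind of overhead that fused kernels eliminate can be seen even in a toy PyTorch comparison. This is only a sketch of the measurement, not the dissertation's benchmark, and it uses PyTorch's built-in fused attention rather than NATTEN:

    import time
    import torch

    def naive_attention(q, k, v):
        # Materializes the full (n, n) attention matrix in memory; this
        # intermediate is what fused, Flash Attention-style kernels avoid.
        p = (q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5).softmax(dim=-1)
        return p @ v

    q = k = v = torch.randn(1, 4, 2048, 64)  # batch, heads, sequence, head dim
    t0 = time.perf_counter(); naive_attention(q, k, v); t1 = time.perf_counter()
    t2 = time.perf_counter()
    torch.nn.functional.scaled_dot_product_attention(q, k, v)  # fused (PyTorch >= 2.0)
    t3 = time.perf_counter()
    print(f"naive: {t1 - t0:.3f}s  fused: {t3 - t2:.3f}s")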

 

Finally, we show the limitations of the existing methodology and outline research topics that can address them. All of our work is open-sourced through the NATTEN project.

 

 

Meeting Details:

Join Zoom Meeting

https://gatech.zoom.us/j/99422563124?pwd=kSII1Cab0ooku6rpPtf2hR5Uoylb9O.1

Meeting ID: 994 2256 3124

Passcode: 435466