It is well known that, due to the uneven distribution of cones in the human retina, we have sharp vision only in the central (foveal) region. The angular span of this region is tiny, only about 1 degree^2. In comparison, the angular span of a TV set watched from 4x screen heights is over 250 degrees^2. This observation implies that eye tracking offers enormous potential for video compression: if the encoder could know instantaneously which spot (a roughly 1-degree^2 patch) is being viewed, only the information in that spot would need to be encoded and transmitted. Up to two orders of magnitude of bandwidth savings may be attainable!
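To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch of how the quoted figures relate (my own check, not from the talk; the 4:3 aspect ratio is an assumption):

```python
import math

# A 4:3 screen of height H watched from a distance of 4H (assumption).
H = 1.0
W = 4.0 / 3.0 * H
D = 4.0 * H

# Full horizontal and vertical visual angles, in degrees.
width_deg = 2 * math.degrees(math.atan((W / 2) / D))
height_deg = 2 * math.degrees(math.atan((H / 2) / D))

tv_span = width_deg * height_deg  # ~270 deg^2, consistent with "over 250"
fovea_span = 1.0                  # ~1 deg^2 foveal patch

print(f"TV span: {tv_span:.0f} deg^2")
print(f"Ratio:   {tv_span / fovea_span:.0f}x  (~2 orders of magnitude)")
```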
This idea has been known at least since Bernd Girod’s paper “Eye movements and coding of video sequences,” published in 1988. Many works have followed, proposing various implementations of gaze-based video coding systems; entire classes of compression techniques, known as foveated video coding and region-of-interest (ROI)-based video coding, have emerged from this application. However, most early attempts to build complete systems around the idea were unsuccessful. The key reason was the long network delays typical of the 1990s and 2000s, the years when the idea was studied most extensively. But things have changed since then.
In this talk, I will first briefly survey some basic principles (retinal eccentricity, eye movement types, and related statistics) and key previous studies and results. I will then derive an equation relating network delay to the bandwidth savings achievable with gaze tracking. Finally, I will turn to modern mobile wireless networks, 5G and Wi-Fi 6 (802.11ax), and discuss the delays currently achievable over direct links to user devices, in device-to-device communication within the same cell (or over the same Wi-Fi access network), and in data transmissions involving the 5G core network.
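As a rough illustration of the kind of relationship the derivation concerns (this sketch and its symbols are my own assumptions, not the equation from the talk): if the end-to-end delay is $\Delta$ and the gaze can move at angular speed $v$ in the meantime, the encoder must transmit a region whose radius grows from the foveal radius $r_0$ to about $r_0 + v\Delta$, so the attainable savings ratio shrinks roughly as

$$ S(\Delta) \;\approx\; \frac{A_{\text{display}}}{\pi\,(r_0 + v\,\Delta)^2}, $$

where $A_{\text{display}}$ is the angular area of the display. This recovers the two-orders-of-magnitude figure only as $\Delta \to 0$, and collapses toward no savings once the delay is long enough that the protected region covers the whole display.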