VoiceOver Recognition

description: "VoiceOver makes the theoretical possible, again!"


At the launch of the iPhone 3GS, Apple unveiled VoiceOver on the iPhone. Blind users and accessibility experts were used to screen readers on computers, and even rudimentary screen readers for smartphones that used a keyboard, trackball, or quadrants of a touch screen for navigation. But here was a screen reader that not only came preinstalled on a modern, off-the-shelf device that was relatively inexpensive compared to the competition, but also allowed the user to use the touch screen as what it is: a touch device.

This year, VoiceOver added a feature called "VoiceOver Recognition." It allows VoiceOver to use the machine learning coprocessor in newer iPhone models to describe images with near-human quality, make apps more accessible using ML models, and read the text in images.

This article will explore these new features, discuss their benefits, compare VoiceOver Recognition to other options, and cover the history of these features and what's next.

VoiceOver Recognition, the features

VoiceOver Recognition, as discussed before, contains three separate features: Image Recognition, Screen Recognition, and Text Recognition. All three work together to bring the best experience. In accessible apps and sites, though, Image and Text Recognition do the job fine. All three features must be downloaded and turned on in VoiceOver settings. Image Recognition acts on images automatically, employing Text Recognition when text is found in an image.

Screen Recognition makes inaccessible apps as usable as currently possible with a machine learning (ML) model. It is still great, though: it allows me to play Final Fantasy Record Keeper quite easily. It is not perfect, but it is only the beginning!

Benefits of VoiceOver Recognition

Imagine, if you are sighted, that you have never seen a picture before, or if you have, that you've never seen a picture you've taken yourself. Imagine that all the pictures you have viewed on social media have been blurry and vague. Sure, you can see some movies, but they are few and far between. And apps? You can only access a few, relative to the total number of apps. And games are laughably simple and forgettable.

That is how digital life is for blind people. Now, however, we have a tool that helps with that immensely. VoiceOver Recognition gives amazing descriptions of photos. They are not perfect; sometimes when playing a game, I just get "A photo of a video game" as a description, but again, this is the first version. And photos in news articles, on websites, and in apps are described with amazing accuracy. If I didn't know better, I would think someone at Apple was busy describing all the images I come across. While Screen Recognition can fail spectacularly sometimes, especially with apps that do not look native to iOS, it has allowed me to get out of sticky situations in some apps, and to press the occasional button that VoiceOver cannot press due to poor app coding. And I can play a few text-heavy games with it, like Game of Thrones: A Tale of Crows.

Even my ability to take pictures is greatly enhanced with Image Recognition. With this feature, I can open the Camera app, put VoiceOver focus on the viewfinder, and it will describe what is in the camera view! When the scene changes, I must move focus away and back to the viewfinder, but that's a small price to pay for a "talking camera" that is actually accurate.

Comparing VO Recognition to Other Options

Blind people may then say, "Okay, what about Narrator on Windows? It does the same thing, right?" No. With Narrator, the photo is sent to a server owned by Microsoft. On iOS, the photo is captioned on the device, using the ML coprocessor. What Microsoft needs an Internet connection and a remote server to do, Apple does far better with the chip on your device!

You may then say, "Well, how does it give better results?" First, it's automatic. Land on an image, and it works! Second, it is not shy about what it thinks it sees. If it is confident in its description, it will simply describe it. Narrator and Seeing AI always say "Image may contain:" before giving a guess, and with more complex images, both fail. I have read that this is set to improve, but I've not seen the improvements yet. Only when VoiceOver Recognition isn't confident in what it sees does it say "Photo contains," followed by a list of objects that it is surer of. This happens far less frequently than with Narrator or Seeing AI, though.
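The confidence-gated behavior described above can be sketched in a few lines. To be clear, this is my own illustration of the pattern, not Apple's implementation: the function names, the threshold value, and the scoring inputs are all assumptions.

```python
def speak_image(caption: str, caption_score: float,
                objects: list[tuple[str, float]],
                threshold: float = 0.8) -> str:
    """Illustrative sketch of confidence-gated captioning.

    If the model is confident in its full-sentence caption, speak it
    outright. Otherwise, fall back to "Photo contains:" followed by
    the individual objects the model is surer of, mirroring what
    VoiceOver Recognition appears to do.
    """
    if caption_score >= threshold:
        # Confident: state the description without hedging.
        return caption
    # Not confident: list only the objects above the threshold.
    sure_objects = [name for name, score in objects if score >= threshold]
    return "Photo contains: " + ", ".join(sure_objects)
```

So a high-confidence result reads like "A dog running on a beach," while a low-confidence one degrades gracefully to "Photo contains: dog, sand" rather than a wrong full sentence.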

You may also say, "Okay, so how is this better than NVDA's OCR? You can use that to click on items in an app." Yes, and that is great, it really is, and I thank the NVDA developers every time I use VMware with Linux, because there always seems to be something going on with it. But with VoiceOver Recognition, you get an actual, natively "accessible" app. You don't have to click on anything, and you know what VoiceOver thinks the item type of something is: a button, a text field, etc., and you can interact with the item accordingly. With NVDA, you have a sort of mouse. With VoiceOver Recognition, you have an entire app experience.

The history of these features

Using AI to bolster the accessibility of user interfaces is not a new idea. It has been floating around the blind community for a while now. I remember discussing it on an APH (American Printing House for the Blind) mailing list around a decade ago. Back then, however, it was just a toy idea. No one thought it could be done with the hardware or software of the time, the Android 2.3 era. It continued to be brought up by blind people who dreamed bigger than I did, but it never really went anywhere.

Starting with the iPhone XR, Apple began shipping a machine learning coprocessor in their iPhones. Then, in iOS 13, VoiceOver gained the ability to describe images. That feature did not use the ML chip, however, since even older phones could take advantage of it. I thought Apple might improve it, but I had no idea they would do as great a job as they are doing with iOS 14.

What's Next?

As I've said a few times now, this is only version one. I suspect Apple will continue building on their huge success this year: fleshing out Screen Recognition, perhaps having VoiceOver automatically speak what's in the camera view when preparing to take a picture, and perhaps adding even more that I cannot imagine now. I suspect, however, that this is leading to an even larger reveal for accessibility in the next few years: augmented and virtual reality. Apple Glasses, after all, would be very useful if they could describe what's around a blind person.