February 27, 2017

Accessibility Research: Developing automatic alt-text for Facebook screen reader users

By: Julie Schiller, Omid Farivar

The release of the Automatic Alt-Text service allows people who are blind (or might otherwise use screen readers) to better understand what exists in photos in their News Feeds. User research helped develop the tool through interviews, usability testing, and surveys. In this note, we’ll briefly touch on some of the highlights of this work.

If you’d like to discuss this work in person, please catch us at CSCW 2017 in Portland this week, where our lead author Shaomei Wu, Data Scientist, will present our paper detailing the feature and the research that went into it!


As you may be aware, your Facebook News Feed is often filled to the brim with images and videos. The more ubiquitous high-quality cameras become on phones, the more images and videos people share. Being able to see and discuss what’s happening in visual media is a key part of being on Facebook. In fact, people share more than 2 billion photos across Facebook, Instagram, Messenger, and WhatsApp every day. Sounds great, right? Not for everyone. For those with visual impairments, such as blindness, it’s naturally difficult to follow the conversation around a picture.

Facebook’s mission is to create a more open and connected world, and to give people the power to share. Worldwide, more than 39 million people are blind, and more than 246 million have a severe visual impairment. They have reported feeling frustrated, excluded, or isolated because they can’t fully participate in conversations centered on photos and videos. In an effort to allow more people to participate in the social aspect of photo viewing, Facebook launched Automatic Alt-Text (AAT) to give screen reader users the ability to understand the content of most images (hopefully all images soon!) in News Feed.

What was and what is

Where do you start in tackling this challenge? For a technical and detailed discussion of the creation of AAT and the Lumos technology that underlies the computer vision model, please refer to Facebook Data Scientist Shaomei Wu’s previous technical post. Here we’ll focus on how we worked with blind users to build an amazing user experience for them.

We knew from previous research that some people get photos captioned through a bespoke service (or a good friend), where the user makes a request for a single picture at a time.

Unfortunately, this presents several issues:

  • It’s much too slow
  • It requires a human to be present and willing to perform the task
  • It interrupts the flow of using News Feed
  • And perhaps most importantly, it’s extremely hard to scale

However, on the positive side, having a friend or a representative translate photos for you can be highly accurate. Friends can also give you extra context based on your relationship (e.g., adding color to the description or an inside joke). But how do you scale this solution while keeping the good and removing the bad? We aimed to build a new Facebook feature that could act as the next evolution of this kind of idea.

The AAT project seeks to algorithmically generate useful and accurate descriptions of photos in a way that works on a much larger scale without latency in the user experience. We provide these descriptions as image alt-text, an HTML attribute designed for content managers to provide text alternatives for images. Since alt-text is part of the W3C accessibility standards, any screen reader software can pick it up and read it out to people when they move the screen reader’s reading cursor to an image.
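
As a rough illustration of what this looks like in markup, the sketch below builds an `<img>` element whose `alt` attribute carries a generated description. The helper name `img_with_alt`, the placeholder URL, and the example description are assumptions made for this note, not Facebook's actual code.

```python
from html import escape


def img_with_alt(src: str, description: str) -> str:
    """Return an <img> tag whose alt attribute carries the description.

    Hypothetical helper for illustration only. escape(..., quote=True)
    keeps quotes or angle brackets in the text from breaking the attribute.
    """
    return (
        f'<img src="{escape(src, quote=True)}" '
        f'alt="{escape(description, quote=True)}">'
    )


tag = img_with_alt(
    "https://example.com/photo.jpg",  # placeholder URL
    "Image may contain: two people, smiling, outdoor",  # example description
)
print(tag)
```

A screen reader that lands on this element would read the `alt` value aloud in place of the picture.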

The studies

We ran two types of studies after 10 months of building a scalable artificial intelligence system. First, we ran qualitative interviews and usability testing with prototypes that our teammate Shaomei Wu had designed. These qualitative sessions helped identify key issues with the proposed system and allowed us to make changes that led to people being pleasantly surprised and grateful post-launch rather than frustrated or confused. Second, to triangulate our findings, we launched an experiment: we informed people about an experimental feature and got their consent to turn it on for them (the experimental group), then ran an identical survey with a group without the feature turned on (our control group). Both groups were drawn from the subset of Facebook iOS users who use VoiceOver.

Interviews & usability testing

The biggest challenge, as we learned during this process, is balancing people’s desire for more information about the images in their Feeds with the quality and social intelligence of such information. Interpretation of visual content can be very subjective and context-dependent. For instance, though people mostly care about who is in the photo and what they are doing, sometimes the background of the photo is what makes it interesting or significant. This was a key finding in how we ultimately ended up structuring sentences to be read out to people.

In addition, it’s a pretty trivial task for a person to pick out the most interesting part of a photo, whereas it can be quite difficult for even the most intelligent AI. The social context and the right amount of feedback are what would give this service a magical experience, and we hope to get to that point eventually! From our interviews, we saw that it can often be worse to provide incorrect information about a picture than to leave out items we aren’t sure about. For example, suppose the service said a friend’s photo contained a child after misidentifying a small woman. If the user knows the friend doesn’t have children, they may comment in a way that causes the user embarrassment or social awkwardness. We also took lessons from other companies’ AI systems getting things really wrong, such as mischaracterizing humans as animals, which can lead to undesirable circumstances for all parties. Keeping this in mind, we worked with the development team to create a system that:

  • Could identify content at a massive scale
  • Pick interesting concepts, or things, from a photo
  • Give meaningful feedback to the user
  • Feel like a seamless interaction

The last big lesson we learned from our qualitative sessions was how important it was not to talk about how sure the AI was in determining what concepts existed in the photos. We heard from our participants that reading out this information made the system feel ominous or robotic. It also instilled a bit more disbelief in the system. The fix we made after the sessions was twofold: we only mention a concept when the AI is extremely sure of it (above a certain confidence threshold), and we removed the robotic-sounding readout of the sureness ratings our AI uses to determine concepts in each photo. Despite setting the accuracy bar higher, our initial launch can identify at least one concept in more than 50% of all photos uploaded to Facebook. This number will improve with better technology over time.
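
To make the idea concrete, here is a minimal sketch of that kind of filtering, not the production system. It assumes the vision model returns a confidence score per concept. The threshold value, the function name `build_alt_text`, the example scores, and the exact sentence format after "Image may contain" are all assumptions for illustration.

```python
from typing import Dict, Optional

# Assumed cutoff for illustration; the real value is not stated in this note.
CONFIDENCE_THRESHOLD = 0.8


def build_alt_text(scores: Dict[str, float],
                   threshold: float = CONFIDENCE_THRESHOLD) -> Optional[str]:
    """Build an alt-text sentence from concept -> confidence scores.

    Concepts below the threshold are dropped, and the confidence values
    themselves never appear in the sentence. Returns None when no concept
    clears the bar, so the caller can fall back to a default.
    """
    confident = [concept for concept, score in scores.items()
                 if score >= threshold]
    if not confident:
        return None
    # List the most confident concepts first.
    confident.sort(key=lambda concept: scores[concept], reverse=True)
    return "Image may contain: " + ", ".join(confident)


example = {"tree": 0.97, "outdoor": 0.92, "child": 0.55}  # made-up scores
print(build_alt_text(example))
```

In this sketch the low-confidence "child" tag is simply left out rather than read back with a caveat, which mirrors the lesson that omitting an uncertain item is often better than risking a wrong one.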

Overall, working with our extremely helpful participants, we learned a lot about interviewing people who are blind and wanted to share some practical tips for qualitative research with this population.

One simple learning is to have blind participants bring in their own devices. This made them much more comfortable and naturalistic in the study (a good tip for any participant) but also allowed them to come with their own accessibility setup preconfigured to their needs.

Another tip is to request that the users of screen readers turn the speech rate down just a bit, so you can follow along for think-aloud. In a lot of ways, think-aloud is participants interpreting, out loud, what the screen reader is saying. If you can’t follow along with both conversations (i.e., the participant and the reader’s voice), you’re missing out on half of the data. Try using a screen reader before you start your sessions and you’ll be a far more effective moderator.

Finally, some researchers say that just recruiting screen reader users can be challenging, as many UX recruiters are unfamiliar with this population. We found it effective to partner with advocacy groups (e.g., Lighthouse, thank you for your support) or contact specialized recruiters to find participants.


With the deep qualitative understanding behind us, we shifted to a survey in order to paint a fuller, more generalizable picture of how it felt to use AAT. We surveyed around 550 participants who identified themselves as blind or as having one or more visual impairments. As mentioned above, we received a mix of responses from our control group (Facebook as usual) and our experimental group (an updated version of Facebook with AAT), with a total sample of around 9,000. Participants filled out nearly identical surveys with a wide range of questions, with the only difference being a few questions specific to AAT if participants were in the experimental group. Participants could also opt to enter a sweepstakes for one of ten $100 gift cards to Amazon.

As with any survey writing, it’s important to create the most concise and easy-to-understand survey for the intended respondents. We developed some practical tips for creating surveys for blind users:

  • Avoid using horizontal radio buttons and drag/drop questions. The former is harder to tab through compared to vertical options while the latter is just impossible with a screen reader.
  • Avoid using matrix and star rating questions. The former will not always be properly tagged on the HTML side, making it impossible to discern where in the matrix the respondent is, while the latter should be swapped for a non-graphical HTML element to be more universally accessible for different screen readers.
  • Offer back functionality for screen reader users as unintentional errors may occur more often.
  • Taking a survey with a screen reader takes a bit longer than it does for a sighted user with a mouse. If responses from screen reader users are important to you, have people with screen readers pilot the survey first and reflect the expected completion time in the intro message.
  • As with traditional good survey design, try to include as few questions per page as possible to avoid cognitive complexity and navigation problems.
  • Use spacing to ensure radio buttons and checkboxes are clearly associated with their labels, preventing ambiguity and confusion.
  • Acronyms and abbreviations are common in surveys. However, not all respondents may be familiar with or remember them, and screen readers can struggle with pronouncing them. The “acronym” and “abbr” tags can be used to alleviate this, and the “title” attribute can be used to provide further information where needed.
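
A few of the tips above can be sketched in markup. The example below is an illustration under stated assumptions, not the survey tool we used: it renders one single-choice question as a vertically stacked radio group with explicit labels and wraps an abbreviation in `abbr` with a `title`. The helper names `abbr` and `radio_question` and the question wording are hypothetical. Note that the `acronym` element is obsolete in HTML5, so the sketch uses `abbr` only.

```python
from html import escape
from typing import List


def abbr(short: str, expansion: str) -> str:
    """Wrap an abbreviation in <abbr>, with its expansion in the title."""
    return (
        f'<abbr title="{escape(expansion, quote=True)}">'
        f'{escape(short)}</abbr>'
    )


def radio_question(name: str, legend_html: str, options: List[str]) -> str:
    """Render one single-choice question as a vertical radio group.

    Each option sits in its own <div> so the options stack vertically,
    and each input has an explicit <label for=...> so the pairing is
    unambiguous. legend_html is inserted as-is so it can contain <abbr>.
    """
    lines = ["<fieldset>", f"  <legend>{legend_html}</legend>"]
    for i, option in enumerate(options):
        input_id = f"{name}-{i}"
        lines.append(
            f'  <div><input type="radio" name="{name}" id="{input_id}" '
            f'value="{escape(option, quote=True)}">'
            f'<label for="{input_id}">{escape(option)}</label></div>'
        )
    lines.append("</fieldset>")
    return "\n".join(lines)


question = radio_question(
    "ease",
    # Illustrative wording, not an actual survey question.
    f"How easy or hard was it to use {abbr('AAT', 'Automatic Alt-Text')}?",
    ["Extremely easy", "Somewhat easy", "Somewhat hard", "Extremely hard"],
)
print(question)
```

How a given screen reader announces the `title` of an `abbr` varies, so spelling out the term on first use in the question text remains a sensible fallback.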

Survey/experiment findings: highlights

People in the test group valued the AAT feature. Their answers reflected this when compared to our control group who didn’t have the feature turned on. At a high level, participants in the test group were more likely to:

  • Like (or react to) photos in their News Feeds
  • Think Facebook cared more about accessibility vs. non-AAT users
  • Think Facebook was more useful, overall, vs. non-AAT users
  • And most importantly, have an easier time figuring out what was in photos

Sample questions from the survey:

We asked AAT users to confirm if they heard a sentence upon tabbing to a photo in their News Feed. If they self-reported that they did indeed hear this line of text beginning with “Image may contain…”, we asked them a few questions!

Question: [if in test group] How did you feel after hearing this alt-text? (check all that apply)

Respondents in the test group were shown a randomly sorted set of words that they could use to describe how they felt after hearing the alt-text for the photo. We also included an “other” category for people to write whatever they wanted. Based on our results, we found that there was a heavy emphasis on more positive words, with happy (29%), surprised (26%), and impressed (25%) leading the way.

Question: [Think back to the last few photos you remember seeing in your News Feed for the questions on this page.] For these photos, how easy or hard was it to tell what the photos were about?

There’s quite a difference between the test and control groups in “somewhat easy” responses (23% vs. 2%) and “extremely hard” responses (42% vs. 73%). This shows the added value AAT provides.

What’s next?

Acknowledging that this feature is only in its infancy, almost all respondents offered suggestions on how AAT can be improved in their write-in feedback. These suggestions concentrated on the following two categories:

  • Extract and recognize text from the images (29% recommended this)
  • Provide more details about the people in the images (26% recommended this)

Other requests included expanding the vocabulary of the algorithm, increasing the recall for existing tags, and making AAT available in more languages and platforms.

Final thoughts

We are excited at the prospect of being able to include more of the world in our rapidly growing and increasingly visual social network. For Omid, this was his first big foray into the accessibility landscape, and he fell in love with the ability to connect to a completely different type of demographic than he was used to working with. For Julie, this connected her previous work making services more accessible with Facebook’s incredible engineering capabilities.

Facebook takes its commitment to making the world more open and connected very seriously. The Facebook Accessibility team is continuing to identify the ways it can deliver great user experiences to everyone. Building upon the success of the work that went into this feature, we’ve hired our first full-time Accessibility researcher and, as a team, we are excited for what is to come.