AI learns to decipher images based on spoken words—almost like a toddler

Photograph of a passenger jet with a sloppy computer-generated blob on the fuselage.

Given this picture and audio of the word "airliner," a neural network identifies the portions of the image where there's an airplane (indicated by the red lines). The software learned to do this entirely by looking at 400,000 pictures, each paired with a brief, free-form spoken description of the scene. (credit: David Harwath et al.)

Babies learn words by matching images to sounds. A mother says "dog" and points to a dog. She says "tree" and points to a tree. After repeating this process thousands of times, babies learn to recognize both common objects and the words associated with them.

Researchers at MIT have developed software with a similar ability: it learns to recognize objects in the world using nothing but raw images and spoken audio. The software examined about 400,000 images, each paired with a brief audio clip describing the scene. By studying these image-audio pairs, the software learned to correctly identify which portions of a picture contain each object mentioned in the audio description.

For example, the image above comes with the caption "a white and blue jet airliner near trees at the base of a low mountain."
