AI learns to decipher images based on spoken words—almost like a toddler
Babies learn words by matching images to sounds. A mother says "dog" and points to a dog. She says "tree" and points to a tree. After repeating this process thousands of times, babies learn to recognize both common objects and the words associated with them.
Researchers at MIT have developed software with a similar ability: it learns to recognize objects in the world using nothing but raw images and spoken audio. The software examined about 400,000 images, each paired with a brief audio clip describing the scene. By studying these image-audio pairs, the software was able to correctly identify which portions of the picture contained each object mentioned in the audio description.
For example, this image comes with the caption "a white and blue jet airliner near trees at the base of a low mountain."