How’s this for a dystopian future: You finally receive your personal robot assistant, delivered to your door by Amazon drone. You unpack the shiny new machine, dust off the Styrofoam peanuts, and charge up the batteries. Then you switch it on and lead it to the kitchen so it can cook you dinner. The robot points its camera at you, waiting. Suddenly you realize in horror that your assistant doesn’t know how to cook, either—you’re supposed to teach it.
To prevent this nightmare dinnertime scenario, computer scientists are working on a robot that can teach itself to cook. It learns by watching YouTube videos.
This is much harder for a robot than it is for you, no matter how inept a cook you are. Imagine a mind that’s stumped by CAPTCHAs (“Letters with a squiggle through them? I’m out!”) trying to follow a video host who’s chatting and chopping at the same time. To tackle the task, University of Maryland graduate student Yezhou Yang and his coauthors broke it down into a few simpler pieces.
First, their robot would look at the person’s hands. For each hand, it would decide what type of grip the person was using. Was it a powerful grasp, as when holding a knife or a jar lid? Or was it a more delicate, precise grasp, maybe to lift a slice of bread from the counter? How wide was the object? The scientists taught the robot to recognize six grasps in all.
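The grasp decision above can be sketched as a tiny classifier. The article names only the power/precision distinction and the width of the object, so the six labels and the width thresholds below are illustrative assumptions, not the paper's actual taxonomy:

```python
def classify_grasp(forceful: bool, object_width_cm: float) -> str:
    """Map grip force and object width to a coarse grasp label.

    Hypothetical taxonomy: two grip families (power vs. precision) crossed
    with three width bands gives six classes, matching the article's count.
    The thresholds are made up for illustration.
    """
    family = "power" if forceful else "precision"
    if object_width_cm < 2.0:
        size = "small"        # e.g. a slice of bread picked up by its edge
    elif object_width_cm < 8.0:
        size = "medium"       # e.g. a knife handle
    else:
        size = "large"        # e.g. a jar lid gripped from above
    return f"{family}-{size}"

print(classify_grasp(True, 9.0))   # -> "power-large", like twisting a jar lid
print(classify_grasp(False, 1.0))  # -> "precision-small", like lifting bread
```

In the real system the grip family and width would come from a vision model rather than hand-supplied booleans; the point is only that each hand gets one of a small, fixed set of grasp labels.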
Next, the robot would try to identify the objects in the video. The researchers taught it 48 objects, including tools (such as spatula, bowl, and brush) and foods (meat, lettuce, yogurt, and so on).
Once the robot had identified the grasp and the object in each of the video host's hands came the crucial step: working out what the person was actually doing with them.
“Due to the huge variation in human actions,” Yang says, it’s not yet possible for the robot to deduce what someone’s doing just by watching. So the researchers taught their robot to guess instead. Given the objects in its hands, the robot picked the most likely verb from 10 options: cut, pour, transfer, spread, grip, stir, sprinkle, chop, peel, or mix.
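The guessing step can be sketched as a simple frequency model: count which verb most often co-occurs with each pair of held objects in the training videos, then predict the most common one. This is a minimal illustration of the idea, not the paper's actual model, and the training examples below are hypothetical:

```python
from collections import Counter, defaultdict

class VerbGuesser:
    """Guess the most likely action verb from the objects held in each hand."""

    def __init__(self):
        # Maps (left_object, right_object) -> Counter of verbs seen with it.
        self.counts = defaultdict(Counter)

    def train(self, examples):
        """examples: iterable of ((left_object, right_object), verb) pairs."""
        for held, verb in examples:
            self.counts[held][verb] += 1

    def predict(self, held):
        """Return the most frequent verb for this object pair, or None if unseen."""
        if held not in self.counts:
            return None  # unseen objects: no basis for a guess
        return self.counts[held].most_common(1)[0][0]

guesser = VerbGuesser()
guesser.train([
    (("knife", "tofu"), "cut"),
    (("knife", "tofu"), "chop"),
    (("knife", "tofu"), "cut"),
    (("spatula", "bowl"), "stir"),
])
print(guesser.predict(("knife", "tofu")))  # -> "cut" (2 votes vs. 1 for "chop")
```

Note the failure mode this sketch shares with the real robot: an object pair it never saw in training leaves it with nothing to go on, which is exactly how a knife slicing tofu turned into a guess about slicing a bowl.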
The authors chose 88 cooking videos from YouTube and used most of them to train their robot. The last dozen video clips—each showing just one cooking action—were the robot’s final exam.
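The split described above is a standard held-out evaluation: train on most of the data, test on clips the robot has never seen. A minimal sketch, with placeholder video IDs standing in for the real clips:

```python
# 88 cooking videos; the last dozen are held out as the "final exam".
videos = [f"video_{i:02d}" for i in range(88)]  # placeholder IDs

train_set, test_set = videos[:-12], videos[-12:]

print(len(train_set), len(test_set))  # -> 76 12
```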
The aspiring robot chef performed pretty well. After watching the test videos, it chose the right kind of grasp about 90% of the time. It correctly identified the objects about 80% of the time, and did equally well at guessing the action. Some of its mistakes happened when the videos included objects it hadn’t been trained on. When it saw a person using a knife to slice tofu, for example, the robot guessed that it was supposed to slice up a bowl.
Even getting all the components of the plan right didn’t guarantee success. “When it comes to execution, there are also many difficulties,” Yang says. Objects may be in different places on a counter; tools may have unexpected shapes and sizes; a human video host might pause mid-action. Before they can actually cook for us, Yang says, robots must be able to handle this sort of variation. In other words, they’ll need to think for themselves.
“Our robots are not cooking yet in a real environment where all kinds of things are happening,” Yang says. “We can pour water, we can stir things. We’re working towards getting robots to make a meal, but we’re not there yet.” You’ll have to wait just a little longer for that personal robot assistant to make you dinner. Especially if you want tofu.
Image: Yang et al
This research will be presented later this month at the Association for the Advancement of Artificial Intelligence Conference in Austin, Texas.