The Robot Report recently spoke with Ph.D. student Cheng Chi about his research at Stanford University and his recent publications on using diffusion AI models for robotics applications. He also discussed the recent universal manipulation interface, or UMI, gripper project, which demonstrates the capabilities of diffusion-based robot policies.
The UMI gripper was part of his Ph.D. thesis work, and he has open-sourced the gripper design and all of the code so that others can continue to help evolve the AI diffusion policy work.
AI innovation accelerates
How did you get your start in robotics?
I worked in the robotics industry for a while, starting at the autonomous vehicle company Nuro, where I was doing localization and mapping.
And then I applied for my Ph.D. program and ended up with my advisor Shuran Song. We were both at Columbia University when I started my Ph.D., and then last year, she moved to Stanford to become full-time faculty, and I moved [to Stanford] with her.
For my Ph.D. research, I started as a classical robotics researcher and then began working with machine learning, specifically for perception. Then in early 2022, diffusion models started to work for image generation; that’s when DALL-E 2 came out, and that’s also when Stable Diffusion came out.
I realized there were specific ways in which diffusion models could be formulated to solve a couple of really big problems for robotics, in terms of end-to-end learning and the action representation for robots.
So, I wrote one of the first papers that brought the diffusion model into robotics, which is called diffusion policy. That’s the paper from my previous project, before the UMI project, and I think it’s the foundation of why the UMI gripper works. There’s a paradigm shift happening; my project was one part of it, but there are other robotics research projects that are also starting to work.
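For readers who want a concrete picture of what a diffusion policy does at inference time, the sketch below is a minimal illustration, not Chi’s published code: it assumes a hypothetical noise-prediction network (NoisePredNet), illustrative action and feature dimensions, and the DDPM scheduler from the open-source diffusers library. The policy starts from a random action sequence and iteratively denoises it, conditioned on features encoded from the robot’s observations.

```python
# Minimal illustrative sketch (not the published diffusion policy code).
import torch
import torch.nn as nn
from diffusers import DDPMScheduler   # open-source DDPM scheduler

ACTION_DIM = 7            # illustrative: 6-DoF end-effector motion + gripper
PRED_HORIZON = 16         # length of the predicted action sequence
OBS_FEAT_DIM = 512        # illustrative dimension of encoded observation features
NUM_DIFFUSION_STEPS = 100

# Hypothetical conditional noise-prediction network: given a noisy action
# sequence, the diffusion timestep, and observation features, predict the noise.
class NoisePredNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PRED_HORIZON * ACTION_DIM + OBS_FEAT_DIM + 1, 1024),
            nn.ReLU(),
            nn.Linear(1024, PRED_HORIZON * ACTION_DIM),
        )

    def forward(self, noisy_actions, timestep, obs_feat):
        b = noisy_actions.shape[0]
        t = torch.full((b, 1), float(timestep)) / NUM_DIFFUSION_STEPS  # crude timestep embedding
        x = torch.cat([noisy_actions.reshape(b, -1), obs_feat, t], dim=-1)
        return self.net(x).reshape(b, PRED_HORIZON, ACTION_DIM)

noise_pred_net = NoisePredNet()
scheduler = DDPMScheduler(num_train_timesteps=NUM_DIFFUSION_STEPS)

@torch.no_grad()
def sample_actions(obs_feat):
    """Denoise a random action sequence into an executable one, conditioned on observations."""
    b = obs_feat.shape[0]
    actions = torch.randn(b, PRED_HORIZON, ACTION_DIM)   # start from pure Gaussian noise
    scheduler.set_timesteps(NUM_DIFFUSION_STEPS)
    for t in scheduler.timesteps:                        # iteratively remove the noise
        noise_pred = noise_pred_net(actions, t, obs_feat)
        actions = scheduler.step(noise_pred, t, actions).prev_sample
    return actions                                       # shape: (batch, horizon, action_dim)

# Usage: encode the current camera images and robot state into obs_feat, then sample.
obs_feat = torch.randn(1, OBS_FEAT_DIM)
print(sample_actions(obs_feat).shape)                    # torch.Size([1, 16, 7])
```

In the published work, the noise-prediction network is a much larger CNN- or transformer-based model conditioned on camera images, but the denoising loop above is the core idea.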
A lot has changed in the past few years. Is artificial intelligence innovation accelerating?
Yes, exactly. I experienced it firsthand in academia. Imitation learning was the dumbest possible thing you could do for machine learning with robotics. It’s like, you teleoperate the robot to collect data, and the data pairs images with the corresponding actions.
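As a rough illustration of that setup, the sketch below shows behavior cloning at its simplest, with placeholder data and a made-up network; the names, shapes, and architecture are assumptions, not code from the research. Teleoperated demonstrations provide (image, action) pairs, and a policy network is trained to regress the demonstrated action from the image.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

ACTION_DIM = 7   # illustrative: end-effector pose delta + gripper command

# Placeholder demonstration data: (image, action) pairs from teleoperation.
images = torch.randn(300, 3, 96, 96)      # 300 demo frames (RGB, 96x96)
actions = torch.randn(300, ACTION_DIM)    # the actions the operator commanded
loader = DataLoader(TensorDataset(images, actions), batch_size=32, shuffle=True)

# A tiny vision-to-action policy network (stand-in for a real image encoder).
policy = nn.Sequential(
    nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, ACTION_DIM),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Behavior cloning: supervised regression of the demonstrated action
# from the corresponding image observation.
for epoch in range(10):
    for img, act in loader:
        loss = nn.functional.mse_loss(policy(img), act)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```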
In class, we’re taught that people proved this paradigm of imitation learning, or behavior cloning, doesn’t work. People proved that errors grow exponentially. And that’s why you need reinforcement learning and all the other methods that can address these limitations.
But fortunately, I wasn’t paying too much attention in class. So I just went to the lab and tried it, and it worked surprisingly well. I wrote the code and applied the diffusion model to this, and for my first task, it just worked. I said, “That’s too easy. That’s not worth a paper.”
I kept adding more tasks, like online benchmarks, trying to break the algorithm so that I could find a smart angle to improve on this dumb idea and get a paper out of it. But I just kept adding more and more things, and it refused to break.
So there are simulation benchmarks online. I used four different benchmarks and tried to find an angle to break it so that I could write a better paper, but it just didn’t break. Our baseline performance was 50% to 60%, and after applying the diffusion model, it was like 95%. So it was a big jump in success rate. That’s the moment I realized maybe there’s something big happening here.
How did those findings lead to published research?
That summer, I interned at Toyota Research Institute, and that’s where I started doing real-world experiments using a UR5 [cobot] to push a block into a location. It turned out that this worked really well on the first try.
Normally, you need a lot of tuning to get something to work. But this was different. When I tried to perturb the system, the robot just kept pushing the block back into place.
And so that paper got published, and I think it’s my proudest work. I made the paper open-source, and I open-sourced all the code, because the results were so good that I was worried people were not going to believe them. As it turned out, it’s not a coincidence; other people can reproduce my results and also get very good performance.
I realized that now there’s a paradigm shift. Before [this UMI Gripper research], I needed to engineer a separate perception system, planning system, and then a control system. But now I can combine all of them with a single neural network.
The most important thing is that it’s agnostic to tasks. With the same robot, I can just collect a different data set and train a model on it, and the robot will just do the different task.
Obviously, collecting the data set is painful, as I need to do it 100 to 300 times for one environment to get it to work. But in actuality, it’s maybe one afternoon’s worth of work. Compared to tuning a sim-to-real transfer algorithm, which takes me a few months, this is a big improvement.
UMI Gripper training ‘all about the data’
When you’re training the system for the UMI Gripper, you’re just using the vision feedback and nothing else?
Just the cameras and the end effector pose of the robot — that’s it. We had two cameras: one side camera that was mounted onto the table, and the other one on the wrist.
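As a hypothetical sketch, the observation interface he describes is that small: two RGB streams plus the end-effector pose, fused into one feature vector that conditions the policy. The encoder, field names, and image sizes below are illustrative assumptions, not the actual training code.

```python
import torch
import torch.nn as nn

# Hypothetical observation encoder: two cameras plus the end-effector pose.
class ObsEncoder(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.cnn = nn.Sequential(                     # shared image encoder
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # two 32-dim image features + a 6-DoF end-effector pose
        self.head = nn.Linear(32 * 2 + 6, feat_dim)

    def forward(self, side_img, wrist_img, ee_pose):
        feat = torch.cat(
            [self.cnn(side_img), self.cnn(wrist_img), ee_pose], dim=-1
        )
        return self.head(feat)

# One observation: two RGB frames and the end-effector pose (position + rotation).
obs_feat = ObsEncoder()(
    torch.randn(1, 3, 96, 96),   # side camera mounted on the table
    torch.randn(1, 3, 96, 96),   # wrist camera
    torch.randn(1, 6),           # end-effector pose
)
print(obs_feat.shape)            # torch.Size([1, 512])
```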
That was the original algorithm at the time, and I could change to another task and use the same algorithm, and it would just work. This was a big, big difference. Previously, we could only afford one or two tasks per paper because it was so time-consuming to set up a new task.
But with this paradigm, I can pump out a new task in a few days. It’s a really big difference. That’s also the moment I realized that the key trend is that it’s all about data now. After training more tasks, I realized that my code hadn’t changed in a few months.
The only thing that changed was the data, and whenever the robot doesn’t work, it’s not the code; it’s the data. So when I just add more data, it works better.
And that prompted me to think that we’re entering the same paradigm as other AI fields. For example, large language models and vision models started in a small-data regime around 2015, but now, with a huge amount of internet data, they work like magic.
The algorithm doesn’t change that much. The only things that changed are the scale of the training data and maybe the size of the models, and that makes me feel like robotics is about to enter that regime soon.
Can these different AI models be stacked like Lego building blocks to build more sophisticated systems?
I believe in big models, but I think they might not be what you imagine, like Lego blocks. I suspect that the way you build AI for robotics will be that you take whatever task you want to do, collect a whole bunch of data for the task, run that through a model, and then you get something you can use.
If you have a whole bunch of these different types of data sets, you can combine them to train an even bigger model. You can call that a foundation model, and you can adapt it to whatever use case. You’re using data, not building blocks, and not code. That’s my expectation of how this will evolve.
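A hypothetical sketch of that “data, not building blocks” idea, with made-up task names: per-task demonstration sets are simply pooled, and one bigger model is trained on the combined data.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Placeholder per-task demonstration data sets: images paired with actions.
pour_cup   = TensorDataset(torch.randn(300, 3, 96, 96), torch.randn(300, 7))
fold_cloth = TensorDataset(torch.randn(300, 3, 96, 96), torch.randn(300, 7))
wipe_table = TensorDataset(torch.randn(300, 3, 96, 96), torch.randn(300, 7))

# One combined data set feeds one bigger model; adapting to a new use case
# means adding another task's data, not writing new task-specific code.
combined = ConcatDataset([pour_cup, fold_cloth, wipe_table])
loader = DataLoader(combined, batch_size=64, shuffle=True)
```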
But simultaneously, there’s a problem here. I think the robotics industry was tailored toward the assumption that robots are precise, repeatable, and predictable, but not adaptable. So the entire robotics industry is geared towards vertical end-use cases optimized for these properties.
Whereas robots powered by AI will have a different set of properties: they won’t be good at being precise, they won’t be good at being reliable, and they won’t be good at being repeatable. But they will be good at generalizing to unseen environments. So you need to find specific use cases where it’s okay if you fail maybe 0.1% of the time.
Safety versus generalization
Robots in industry must be safe 100% of the time. What do you think the solution is to this requirement?
I think if you want to deploy robots in use cases where safety is critical, you either need a classical system, a shell that protects the AI system, which guarantees a safe worst-case outcome so that when the AI does something bad, nothing harmful actually happens.
Or you design the hardware such that it is [inherently] safe. Hardware is simple. Industrial robots, for example, don’t rely that much on perception. They have expensive motors, gearboxes, and harmonic drives to make a really precise and very stiff mechanism.
When you have a robot with a camera, it is very easy to implement visual servoing and make adjustments for imprecise robots. So robots don’t have to be precise anymore. Compliance can be built into the robot mechanism itself, and this can make it safer. But all of this depends on finding the verticals and use cases where these properties are acceptable.
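As one illustration of the “classical shell” idea, the sketch below shows a hypothetical rule-based filter that sits between a learned policy and the robot: it limits per-step motion and keeps commands inside a known-safe workspace, so the worst case is a slow, bounded motion rather than whatever the AI proposed. The limits, names, and interfaces are assumptions for illustration only.

```python
import numpy as np

# Assumed safe workspace box (meters) and per-step motion limit for the arm.
WORKSPACE_MIN = np.array([0.2, -0.4, 0.02])
WORKSPACE_MAX = np.array([0.8,  0.4, 0.50])
MAX_STEP = 0.02   # maximum end-effector motion per control step (m)

def safety_shell(current_pos: np.ndarray, proposed_pos: np.ndarray) -> np.ndarray:
    """Clamp an end-effector target proposed by the learned policy."""
    step = proposed_pos - current_pos
    norm = np.linalg.norm(step)
    if norm > MAX_STEP:                       # cap how far the arm can move per step
        step = step * (MAX_STEP / norm)
    target = current_pos + step
    # keep the command inside the known-safe workspace, no matter what the policy says
    return np.clip(target, WORKSPACE_MIN, WORKSPACE_MAX)

# Usage: the AI policy proposes a target; the shell bounds what reaches the robot.
current = np.array([0.5, 0.0, 0.10])
proposed = np.array([0.5, 0.0, -0.30])        # policy error: would drive into the table
print(safety_shell(current, proposed))        # [0.5  0.   0.08]; repeated steps stop at z = 0.02
```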