On other tasks, however, the model showed much more variable results. When asked to generate a video highlighting a specific written character on a grid, for instance, the model failed in nine of 12 trials. When asked to model a Bunsen burner turning on and burning a piece of paper, it likewise failed nine of 12 times. When asked to solve a simple maze, it failed in 10 of 12 trials. When asked to sort numbers by popping labeled bubbles in order, it failed a whopping 11 of 12 times.
For the researchers, though, none of the above examples is evidence of failure; each is instead a sign of the model's capabilities. To be listed among the paper's "failure cases," Veo 3 had to fail a tested task in all 12 trials, which happened in 16 of the 62 tasks tested. For the rest, the researchers write that "a success rate greater than 0 suggests that the model possesses the ability to solve the task."
Thus, failing 11 of the 12 trials for a particular task is treated in the paper as evidence for the model's capabilities. That evidence of the model "possess[ing] the ability to solve the task" includes 18 tasks where the model failed in more than half of its 12 trials and another 14 tasks where it failed in 25 to 50 percent of trials.
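To make the arithmetic concrete, here is a minimal Python sketch that applies the quoted criterion to the four failure counts cited above. The task labels, the function names, and the 50 percent "majority" bar are illustrative assumptions used for contrast; they do not come from the paper.

```python
TRIALS = 12  # trials per task, as reported in the article

# Failure counts quoted above (out of 12 trials each); labels are shorthand.
failures = {
    "highlight character on grid": 9,
    "Bunsen burner burning paper": 9,
    "solve simple maze": 10,
    "pop numbered bubbles in order": 11,
}


def success_rate(failed, trials=TRIALS):
    """Fraction of trials in which the model succeeded."""
    return (trials - failed) / trials


def passes_quoted_criterion(failed, trials=TRIALS):
    """Criterion as quoted: any success rate greater than zero counts."""
    return success_rate(failed, trials) > 0


def passes_majority_bar(failed, trials=TRIALS, threshold=0.5):
    """Hypothetical stricter bar (an assumption, not from the paper)."""
    return success_rate(failed, trials) > threshold


for task, failed in failures.items():
    rate = success_rate(failed)
    print(
        f"{task}: {rate:.0%} success, "
        f"quoted criterion={passes_quoted_criterion(failed)}, "
        f"majority bar={passes_majority_bar(failed)}"
    )
```

Under these assumptions, all four tasks satisfy the quoted criterion while none would clear the stricter majority bar, which is the gap the rest of this article is concerned with.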
Past results, future performance
Yes, in all of these cases, the model technically demonstrated the capability being tested at some point. But the model's inability to perform those tasks reliably means that, in practice, it will not perform well enough for most use cases. Any future models that could become "unified, generalist vision foundation models" will need to succeed much more consistently on these kinds of tests.
