
Meta's new world model lets robots handle objects in environments they have never encountered before




While LLMs have mastered text (and other modalities to some extent), they lack the physical "common sense" needed to operate in dynamic, real-world environments. This has limited the deployment of AI in areas such as manufacturing and logistics, where understanding cause and effect is critical.

Meta's latest model, V-JEPA 2, takes a step toward bridging this gap by learning a world model from video and physical interactions.

V-JEPA 2 can enable AI applications that require predicting outcomes and planning actions in unpredictable environments with many edge cases. This approach can offer a clear path toward more capable robots and automation in physical settings.

How a "world model" learns to plan

Humans develop physical intuition early in life by observing their surroundings. If you see a ball thrown, you instinctively know its trajectory and can predict where it will land. V-JEPA 2 learns a similar "world model": an AI system's internal simulation of how the physical world behaves.

The model is built around three core capabilities essential for enterprise applications: understanding what is happening in a scene, predicting how the scene will change based on an action, and planning a sequence of actions to achieve a specific goal. As Meta states in its blog, its "long-term vision is that world models will enable AI agents to plan and reason in the physical world."

The model's architecture, called the Video Joint Embedding Predictive Architecture (V-JEPA), consists of two key parts. An "encoder" watches a video clip and condenses it into a compact numerical summary, known as an embedding. This embedding captures the essential information about the objects and their relationships in the scene. A second component, the "predictor," then takes this summary and imagines how the scene will evolve, generating a prediction of what the next summary will look like.

V-JEPA consists of an encoder and a predictor (Source: Meta blog)
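To make the encoder/predictor split concrete, here is a minimal, hypothetical sketch in PyTorch. The module sizes, pooling, and layer choices are illustrative assumptions, not Meta's actual V-JEPA 2 implementation (which uses a much larger vision transformer).

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Condenses a clip of video frames into one compact embedding."""
    def __init__(self, frame_dim: int, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim, 512),
            nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, n_frames, frame_dim); mean-pool over time
        return self.net(frames).mean(dim=1)

class Predictor(nn.Module):
    """Imagines how the scene evolves: maps the current embedding
    to a prediction of the next clip's embedding."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 512),
            nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)
```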

This architecture is the latest evolution of the JEPA framework, which was first applied to images with I-JEPA and now advances to video, demonstrating a consistent approach to building world models.

Unlike generative AI models that try to predict the exact color of every pixel in a future frame, a computationally intensive task, V-JEPA 2 operates in an abstract space. It focuses on predicting high-level features of a scene, such as an object's position and trajectory, rather than its texture or background details, making it far more efficient than other large models at just 1.2 billion parameters.

This translates into lower compute costs and makes the model more suitable for deployment in real-world settings.
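The efficiency argument shows up directly in the training objective. The sketch below reuses the hypothetical Encoder and Predictor from above, with made-up dimensions: the loss compares a small embedding per clip rather than regressing every pixel of every future frame.

```python
import torch
import torch.nn.functional as F

encoder = Encoder(frame_dim=1024, embed_dim=256)
predictor = Predictor(embed_dim=256)

context = torch.randn(8, 16, 1024)  # 8 clips of 16 observed frames
future = torch.randn(8, 16, 1024)   # the 16 frames that follow each clip

z_pred = predictor(encoder(context))  # predicted future embedding
z_target = encoder(future).detach()   # actual future embedding (no grad)

# Loss over 256 numbers per clip, not 16 * 1024 pixel values as a
# pixel-space generative objective would require.
latent_loss = F.l1_loss(z_pred, z_target)
```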

Learning from observation and action

V-JEPA 2 is trained in two stages. First, it builds its foundational understanding of physics through self-supervised learning, watching more than one million hours of unlabeled internet videos. By observing how objects move and interact, it develops a general-purpose world model without any human guidance.

In the second stage, this pre-trained model is fine-tuned on a small, specialized dataset. By processing just 62 hours of video showing a robot performing tasks, along with the corresponding control commands, V-JEPA 2 learns to connect specific actions with their physical outcomes. The result is a model that can plan and control actions in the real world.

Two-stage training pipeline (Source: Meta)
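A hedged sketch of how the two stages might fit together in code, continuing the toy modules above: stage one learns to predict future embeddings from unlabeled video, and stage two adds an action-conditioned predictor fine-tuned on the small robot dataset. The ActionPredictor module, shapes, and training details are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_step(encoder, predictor, optimizer, context, future):
    """Stage 1: self-supervised prediction on unlabeled video."""
    z_pred = predictor(encoder(context))
    z_target = encoder(future).detach()
    loss = F.l1_loss(z_pred, z_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

class ActionPredictor(nn.Module):
    """Stage 2: predict the next embedding given the current
    embedding AND a control command, learned from robot data."""
    def __init__(self, embed_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + action_dim, 512),
            nn.ReLU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, z: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, action], dim=-1))
```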

This two-stage training enables a critical capability for real-world automation: zero-shot robot planning. A robot running V-JEPA 2 can be deployed in a new environment and successfully manipulate objects it has never encountered before, without needing to be retrained for that specific setting.

This is a significant advance over previous models, which required training data from the exact robot and environment in which they would operate. The model was trained on an open-source dataset and then successfully deployed on different robots in Meta's labs.

For example, to complete a task such as picking up an object, the robot is given a goal image of the desired outcome. It then uses the V-JEPA 2 predictor to internally simulate a range of possible next moves. It scores each imagined action by how close it brings the robot to the goal, executes the highest-rated action, and repeats the process until the task is complete.
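The loop just described is essentially model-predictive control in embedding space. Below is a simplified, hypothetical sketch of one planning step, building on the toy modules above: sample candidate actions, imagine each outcome with the action-conditioned predictor, score by distance to the goal embedding, and return the best action. Meta's actual planner differs in its sampling and scoring details.

```python
import torch

def plan_step(encoder, action_predictor, obs_frames, goal_frames,
              action_dim=7, n_candidates=256):
    """Pick the action whose imagined outcome lands closest to the goal."""
    z_now = encoder(obs_frames)    # current state, shape (1, embed_dim)
    z_goal = encoder(goal_frames)  # goal image encoded as a one-frame clip
    # Sample candidate actions (random shooting; a real planner might
    # use a smarter search such as the cross-entropy method).
    candidates = torch.randn(n_candidates, action_dim)
    # Imagine each candidate's outcome in embedding space.
    z_next = action_predictor(z_now.expand(n_candidates, -1), candidates)
    scores = (z_next - z_goal).norm(dim=-1)  # distance to the goal
    return candidates[scores.argmin()]       # best-scoring action

# In a closed loop, the robot executes the returned action, observes
# the new state, and replans until it reaches the goal.
```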

Using this method, the model achieved success rates between 65% and 80% on pick-and-place tasks with unfamiliar objects in new settings.

Real-world impact of physical reasoning

This ability to plan and act in novel situations has direct implications for business operations. In logistics and manufacturing, more adaptable robots could handle variations in products and warehouse layouts without extensive reprogramming. This could be especially useful as companies explore deploying humanoid robots in factories and on assembly lines.

The same world model could power highly realistic digital twins, allowing companies to simulate new processes or train other AIs in a physically accurate virtual environment. In industrial settings, the model could monitor video feeds of machinery and, based on its understanding of physics, predict safety issues and failures before they happen.

This research is a key step toward what Meta calls "advanced machine intelligence (AMI)," where AI systems can "learn about the world as humans do, plan how to execute unfamiliar tasks, and efficiently adapt to the ever-changing world around us."

Meta has released the model and its training code, and hopes to "build a broad community around this research, driving progress toward our ultimate goal of developing world models that can transform the way AI interacts with the physical world."

What this means for technical decision-makers

V-JEPA 2 moves robotics closer to a model that cloud teams already recognize from software: pre-train once, deploy anywhere. Because the model learns general physics from public video and needs only a few dozen hours of task-specific footage, enterprises can shrink the data-collection cycle that typically drags down pilot projects. In practical terms, you could prototype a pick-and-place robot on an affordable desktop arm, then deploy the same policy on an industrial rig on the factory floor without gathering thousands of fresh samples or writing custom motion scripts.

Lower training overhead also reshapes the cost equation. At 1.2 billion parameters, V-JEPA 2 fits comfortably on a single high-end GPU, and its abstract prediction targets further reduce inference load. That lets teams run closed-loop control on-premise or at the edge, avoiding the cloud latency and compliance headaches that come with streaming video outside the factory. Budgets that once went to massive compute clusters can instead fund extra sensors, redundancy, or faster iteration cycles.

