Lightricks open sources AI video model LTX-2 to take on Sora and Veo

Israeli company Lightricks has open sourced LTX-2, its 19 billion parameter model. The system generates synchronized audio-video content from text descriptions, and the company says it is faster than competing models.

According to the technical report, the model generates up to 20 seconds of video with synchronized stereo audio from a single text prompt. This includes lip-synced speech, background sounds, Foley effects, and music tailored to each scene. According to Lightricks, the full version of LTX-2 will reach 4K resolution at up to 50 frames per second.

The researchers argue that existing approaches to audiovisual generation are fundamentally flawed. Many systems generate video first and then add audio, or vice versa. Such decoupled pipelines cannot capture the joint distribution of both modalities. The dependencies run in both directions: lip-sync is driven primarily by the audio, while the acoustic environment is shaped by the visual context. Only a unified model can handle these bidirectional dependencies.

Why asymmetric architectures are important for audio-video generation

LTX-2 runs on an asymmetric dual-stream transformer with a total of 19 billion parameters. The video stream gets 14 billion parameters, significantly more capacity than the audio stream's 5 billion. According to the researchers, this split reflects the different information density of each modality.

Each stream uses its own variational autoencoder (VAE). This separation allows for modality-specific positional encoding: 3D rotary position embeddings (RoPE) for the spatiotemporal structure of video, and 1D embeddings for the purely temporal dimension of audio. Bidirectional cross-attention layers connect the two streams and precisely link visual events, such as an object hitting the ground, with the corresponding sound.
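The difference between 1D and 3D rotary embeddings can be illustrated with a minimal sketch. It is not taken from the LTX-2 code: the function names, the frequency base of 10000, and the choice to split the vector into three equal chunks for time, height, and width are all assumptions made for illustration.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def rope_3d(vec, t, h, w, base=10000.0):
    """Assumed layout: one third of the vector per axis (time, height, width)."""
    d = len(vec)
    assert d % 6 == 0, "dimension must split into three even-sized chunks"
    k = d // 3
    return (rope_1d(vec[:k], t, base)
            + rope_1d(vec[k:2 * k], h, base)
            + rope_1d(vec[2 * k:], w, base))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Audio token: a single time index. Video token: frame, row, and column.
audio_q = rope_1d([1.0, 0.0, 0.5, 0.5], pos=7)
video_q = rope_3d([1.0, 0.0, 0.5, 0.5, 0.2, 0.8], t=3, h=1, w=4)
```

The useful property of this scheme is that the dot product between a rotated query and a rotated key depends only on the offset between their positions, so attention scores encode relative rather than absolute position along each axis.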

Figure: LTX-2 architecture overview. Video and audio are converted into latent tokens by separate VAEs, while the text prompt passes through its own embedding pipeline. The dual-stream diffusion transformer lets the audio and video latents interact while both are conditioned on the text.

Figure: Cross-attention maps show how LTX-2 links visual and audio elements.
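The idea of bidirectional cross-attention between the two streams can be sketched as a toy example. This is a simplification under stated assumptions, not the LTX-2 implementation: it uses a single head, no learned projections, and the same feature dimension for both streams, and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def attend(queries, keys, values):
    """Scaled dot-product attention: each query reads from all key/value pairs."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def bidirectional_cross_attention(video, audio):
    """Video tokens read from audio and audio tokens read from video."""
    video_update = attend(video, audio, audio)
    audio_update = attend(audio, video, video)
    # Residual connection: add what each stream learned from the other.
    new_video = [[x + u for x, u in zip(tok, upd)]
                 for tok, upd in zip(video, video_update)]
    new_audio = [[x + u for x, u in zip(tok, upd)]
                 for tok, upd in zip(audio, audio_update)]
    return new_video, new_audio

video_tokens = [[1.0, 0.0], [0.0, 1.0]]  # two toy video latents
audio_tokens = [[2.0, 2.0]]              # one toy audio latent
new_video, new_audio = bidirectional_cross_attention(video_tokens, audio_tokens)
```

Both updates are computed from the original inputs, so neither stream is privileged. A real model would add learned query, key, and value projections and handle the different widths of the two streams.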

To understand the prompt, LTX-2 uses Gemma3-12B as a multilingual text encoder. Rather than querying only the final layer of the language model, the system taps all decoder layers and combines their information. The model also uses “thought tokens”: additional placeholders in the input sequence that give it room to process complex prompts before generation begins.
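A rough sketch of the two ideas follows. The article does not say how the layers are combined or where the placeholders are inserted, so the weighted average across layers, the appending of placeholders at the end of the sequence, and all names here are assumptions for illustration only.

```python
def combine_layers(layer_states, weights=None):
    """layer_states[l][t] is the feature vector of token t at decoder layer l.

    Assumption: layers are merged by a weighted average (uniform by default).
    """
    n_layers = len(layer_states)
    if weights is None:
        weights = [1.0 / n_layers] * n_layers
    n_tokens = len(layer_states[0])
    dim = len(layer_states[0][0])
    return [[sum(weights[l] * layer_states[l][t][j] for l in range(n_layers))
             for j in range(dim)]
            for t in range(n_tokens)]

def add_placeholder_tokens(token_ids, placeholder_id, count):
    """Assumption: placeholder tokens are appended after the prompt tokens."""
    return list(token_ids) + [placeholder_id] * count

# Two layers, one token, two features per token.
states = [[[1.0, 2.0]], [[3.0, 4.0]]]
features = combine_layers(states)                       # [[2.0, 3.0]]
prompt_ids = add_placeholder_tokens([11, 42], placeholder_id=0, count=3)
```

Using every layer rather than only the last one gives the generator access to both low-level lexical features and higher-level semantics from the encoder.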

Increased speed puts LTX-2 ahead of the competition

Benchmarks show a significant advantage in inference speed. On an Nvidia H100 GPU, the model needs 1.22 seconds per step for 121 frames at 720p resolution. The comparable Wan2.2-14B, which produces video only and no audio, takes 22.30 seconds. According to Lightricks, this makes LTX-2 18 times faster.
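The 18x figure follows directly from the two per-step timings. The short calculation below reproduces it; the variable names are illustrative, and the step count passed to `total_seconds` is a hypothetical value, since the article does not state how many steps a full generation takes.

```python
ltx2_seconds_per_step = 1.22   # LTX-2, 121 frames at 720p on an H100
wan_seconds_per_step = 22.30   # Wan2.2-14B, video only

speedup = wan_seconds_per_step / ltx2_seconds_per_step
print(f"Speedup per step: {speedup:.1f}x")  # prints "Speedup per step: 18.3x"

def total_seconds(seconds_per_step, steps):
    """Total sampling time for a given number of steps (steps is hypothetical)."""
    return seconds_per_step * steps

# Example with an assumed 10 steps; the real step count is not given in the article.
ltx2_total = total_seconds(ltx2_seconds_per_step, 10)
wan_total = total_seconds(wan_seconds_per_step, 10)
```

Because both models are compared per step, the ratio holds for any equal step count, but real end-to-end times also depend on how many steps each model needs.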

The maximum video length of 20 seconds also exceeds the competition. Google's Veo 3 tops out at 12 seconds, OpenAI's Sora 2 at 16 seconds, and Character.AI's open source model Ovi at 10 seconds. In human preference studies, LTX-2 “significantly outperformed” open source alternatives such as Ovi and achieved results comparable to proprietary models such as Veo 3 and Sora 2.

However, the researchers acknowledge some limitations. Quality varies by language. Speech synthesis may be less accurate for underrepresented languages and dialects. In scenes with multiple speakers, the model may assign what was said to the wrong character. Sequences longer than 20 seconds may experience temporal drift and loss of synchronization.

Open source release challenges closed API approaches

Lightricks describes the decision to open source the model as a critique of the current market. “I just don't understand how you can do that with a closed API,” Lightricks founder Zeev Farbman says in the announcement video, referring to the promise of current video generation models. According to the company, the industry is stuck in a gap: the models can produce great results, but they fall far short of the level of control that professionals require.

The company also takes a clear ethical stance. “Artificial intelligence can augment human creativity and human intelligence. My concern is that someone will own my augmentation,” Farbman continues. The goal is to run AI on your own hardware, on your own terms, and make ethical decisions with a broader community of creators, rather than outsourcing AI to a select group with their own interests.

In addition to the model weights, the release includes a distilled version, several LoRA adapters, and a modular training framework with multi-GPU support. The model is optimized for Nvidia's RTX ecosystem and runs on consumer GPUs like the RTX 5090 as well as enterprise systems. Model weights and code are available on GitHub and Hugging Face, and a demo is available on the company's content platform after free registration.
