Enterprise-ready multimodal AI with Step 3.7 Flash on NVIDIA GPUs



AI applications are moving beyond text generation to multimodal systems that can recognize, search, and reason about images, documents, videos, and language in real time, turning fragmented information into actionable insights.

StepFun’s latest release, Step 3.7 Flash, brings these capabilities to production at enterprise scale and is available on NVIDIA accelerated infrastructure. It is a 198B-parameter mixture-of-experts (MoE) vision-language model with approximately 11B active parameters per forward pass, optimized for agentic workflows that combine perception, search, and multi-step reasoning.

With native image and video inputs, three configurable reasoning levels (low, medium, and high), and a 256K context window, it is designed for enterprise use cases such as financial analysis, concurrent coding agents, and other high-throughput multimodal workloads. Developers can use StepFun’s NVFP4-quantized checkpoints, available on Hugging Face, to speed up inference while reducing memory bandwidth and storage requirements.

Model                      Step 3.7 Flash
Total parameters           198B
Visual encoder parameters  1.8B
Active parameters          11B
Context length             256K
Experts                    288 (8 active)

Table 1. Summary of key Step 3.7 Flash specifications (parameter counts, context length, MoE configuration, etc.)
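To put the figures in Table 1 in perspective, the short sketch below works out what fraction of the parameters and experts is active, plus a rough weights-only memory estimate. The 4.5 bits-per-weight figure is an assumption (a 4-bit NVFP4 value plus one 8-bit scale per 16-weight block, applied to every weight). Real checkpoints usually keep some layers in higher precision, so treat the NVFP4 number as an optimistic back-of-the-envelope estimate, not a measured footprint.

```python
# Back-of-the-envelope arithmetic from Table 1 (weights only; no KV cache or activations).
TOTAL_PARAMS = 198e9    # total parameters (Table 1)
ACTIVE_PARAMS = 11e9    # active parameters per forward pass (Table 1)
NUM_EXPERTS = 288       # experts (Table 1)
ACTIVE_EXPERTS = 8      # active experts (Table 1)

# Assumed storage cost per weight, in bits.
BITS_BF16 = 16.0
BITS_NVFP4 = 4.0 + 8.0 / 16.0  # assumption: 4-bit value + one 8-bit scale per 16 weights


def weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Return the weights-only size in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8.0 / 1e9


active_param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
active_expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS
bf16_gb = weight_gigabytes(TOTAL_PARAMS, BITS_BF16)
nvfp4_gb = weight_gigabytes(TOTAL_PARAMS, BITS_NVFP4)

print(f"Active parameters per pass: {active_param_fraction:.1%}")
print(f"Active experts:             {active_expert_fraction:.1%}")
print(f"BF16 weights:  ~{bf16_gb:.0f} GB")
print(f"NVFP4 weights: ~{nvfp4_gb:.0f} GB (assumed 4.5 bits/weight)")
```

Because only a small share of the parameters is active per forward pass, compute per token scales with the 11B active parameters, while memory capacity still has to hold all 198B.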

Step 3.7 Flash can be deployed with open source frameworks such as SGLang, NVIDIA TensorRT-LLM, and vLLM to take advantage of kernels optimized for NVIDIA hardware.

Build with NVIDIA endpoints

Developers can prototype and evaluate Step 3.7 Flash using GPU-accelerated endpoints available at build.nvidia.com. Try it in a demo notebook that uses Step 3.7 Flash and NVIDIA Nemotron Parse. The multi-step document intelligence pipeline extracts structured insights, with bounding boxes, from large, complex PDF documents such as financial reports, slide decks, and scientific papers, and organizes the output.
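As a rough illustration of the "organize the output" stage of such a pipeline, the sketch below groups parsed layout elements by page and orders them by bounding box before flattening them into prompt context. Everything here is hypothetical: `ParsedElement`, its fields, the sample data, and the naive top-to-bottom reading order are illustrative assumptions, not the Nemotron Parse output schema or the notebook's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedElement:
    """Hypothetical parser output: one layout element with its bounding box."""
    page: int
    kind: str  # e.g. "title", "text", "table"
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left
    text: str


def organize(elements: list[ParsedElement]) -> dict[int, list[ParsedElement]]:
    """Group elements by page, then sort each page top-to-bottom, left-to-right."""
    pages: dict[int, list[ParsedElement]] = defaultdict(list)
    for element in elements:
        pages[element.page].append(element)
    return {
        page: sorted(items, key=lambda e: (e.bbox[1], e.bbox[0]))
        for page, items in sorted(pages.items())
    }


def build_context(pages: dict[int, list[ParsedElement]]) -> str:
    """Flatten the organized elements into text that a model prompt could include."""
    lines = []
    for page, items in pages.items():
        lines.append(f"[page {page}]")
        for e in items:
            lines.append(f"{e.kind} @ {e.bbox}: {e.text}")
    return "\n".join(lines)


# Made-up sample elements, deliberately out of order.
sample = [
    ParsedElement(1, "text", (50.0, 200.0, 500.0, 260.0), "Body paragraph."),
    ParsedElement(1, "title", (50.0, 40.0, 500.0, 80.0), "Report title"),
    ParsedElement(2, "table", (50.0, 100.0, 500.0, 400.0), "Column A | Column B"),
]
organized = organize(sample)
print(build_context(organized))
```

Keeping the bounding box next to each text snippet lets a downstream model, or a reviewer, trace an extracted value back to its location on the page.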

Video 1. See how the document intelligence pipeline extracts usable data, following the workflow in the JupyterLab notebook.

Production-ready deployment with NVIDIA NIM

NVIDIA NIM makes it easy to move Step 3.7 Flash from development to production. NIM is an optimized, containerized inference microservice that packages the model with the performance tuning, standardized APIs, and deployment flexibility that enterprises require. You can download it and run it on-premises, in the cloud, or across hybrid environments. NIM exposes a standard OpenAI-compatible API for sending inference requests to NIM servers.

  1. Download the NIM container from the NVIDIA Container Registry (enterprise license required).
  2. Start the server.
  3. Send either text or image input to the endpoint using the OpenAI client.
      from openai import OpenAI

      # Point the OpenAI client at the local NIM server.
      client = OpenAI(
          base_url="http://0.0.0.0:8000/v1",
          api_key="no-key-required",  # placeholder value from the example
      )

      # Request a streamed chat completion.
      completion = client.chat.completions.create(
          model="stepfun/step-3.7-flash",
          messages=[{"role": "user", "content": "Explain particle physics."}],
          temperature=0.5,
          top_p=1,
          max_tokens=1024,
          stream=True,
      )

      # Print text as it arrives; skip chunks that carry no content.
      for chunk in completion:
          if chunk.choices and chunk.choices[0].delta.content is not None:
              print(chunk.choices[0].delta.content, end="")
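The example above sends text only. For image input, a minimal sketch follows, assuming the endpoint accepts OpenAI-style `image_url` content parts with a base64 data URL; that request shape is an assumption about this deployment, not something the steps above confirm. The helper names `build_image_message` and `ask_about_image` are hypothetical.

```python
import base64


def build_image_message(image_bytes: bytes, question: str, mime_type: str = "image/png") -> dict:
    """Build one OpenAI-style user message carrying a question and an inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


def ask_about_image(client, image_path: str, question: str) -> str:
    """Send an image and a question; `client` is an openai.OpenAI instance."""
    with open(image_path, "rb") as f:
        message = build_image_message(f.read(), question)
    completion = client.chat.completions.create(
        model="stepfun/step-3.7-flash",  # model name reused from the text example
        messages=[message],
        max_tokens=1024,
    )
    return completion.choices[0].message.content
```

With the client from the text example, a call might look like `ask_about_image(client, "chart.png", "Summarize this chart.")`, where `chart.png` stands in for any local image file.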
      

Day 0 fine-tuning with the NVIDIA NeMo framework

You can customize Step 3.7 Flash with domain-specific data using the open libraries of the NVIDIA NeMo framework. The NVIDIA NeMo Automodel library combines native PyTorch n-D parallelism with optimized performance and supports Day 0 fine-tuning of Hugging Face models directly from their checkpoints, with no checkpoint conversion. The Automodel fine-tuning recipe for Step 3.7 Flash supports techniques such as supervised fine-tuning (SFT) and memory-efficient LoRA at 600 tokens/second on Hopper GPUs.
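To see why LoRA is memory-efficient, the sketch below compares trainable parameter counts for one weight matrix. LoRA freezes the original d_out x d_in matrix and trains two low-rank factors of shapes d_out x r and r x d_in. The layer size and rank used here are hypothetical round numbers, not Step 3.7 Flash dimensions or values from the Automodel recipe.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
D_IN, D_OUT, RANK = 4096, 4096, 16

full = full_finetune_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, RANK)

print(f"Full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {RANK}): {lora:,} trainable parameters")
print(f"LoRA trains {lora / full:.2%} of the layer's parameters")
```

Fewer trainable parameters means smaller gradients and optimizer states, which is where most of the memory savings come from; the frozen base weights still have to fit in memory.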

For advanced large-scale training, teams can also use the NeMo Megatron-Bridge fine-tuning recipe, which provides additional performance optimizations.

From data center deployments on NVIDIA Blackwell, to managed NIM microservices and Day 0 fine-tuning workflows, to deskside development with NVIDIA DGX Station, NVIDIA offers a wide range of options for integrating Step 3.7 Flash at every stage of development and deployment. With 748 GB of coherent memory, DGX Station is well suited to running Step 3.7 Flash, with more headroom for the full 256K context length and faster iteration for local developers.

NVIDIA is an active contributor to the open source ecosystem and has released hundreds of projects under open source licenses. NVIDIA is committed to open models, such as Step 3.7 Flash, that promote transparency in AI and let users broadly share work in AI safety and resilience.

To get started, check out Step 3.7 Flash on Hugging Face, and test it with your own data at build.nvidia.com or locally on your DGX Station using the vLLM playbook.


