Shopify's open-source ML platform has already saved the company more than a year of computing time

Shopify has open-sourced Tangle, an internal machine learning experimentation platform designed to shorten iteration times, improve reproducibility, and accelerate development cycles.

The system grew out of challenges faced by Shopify's in-house search and discovery teams, which routinely train and evaluate models against millions of products and billions of queries.

Before Tangle, engineers often struggled to reproduce past results, and ended up rebuilding identical datasets and rerunning lengthy preprocessing steps.

Shopify reports that the platform has already saved more than a year of internal computing time by eliminating redundant work. “The CPU time savings alone are ridiculous,” says Shopify CTO Mikhail Parakhin.

Tangle addresses six common failure modes in ML development: distributed queries, unstructured notebooks, repetitive data preparation, non-reproducible results, slow deployment, and limited collaboration.

As Shopify puts it, “Machine learning development shouldn’t work this way, but it does. 80% of development time is spent on data engineering, not algorithms.”

Its core mechanism is a visual pipeline editor backed by content-based caching. Developers assemble pipelines as directed acyclic graphs of “components”: YAML-defined, language-independent units that wrap arbitrary CLI programs.
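To make the graph idea concrete, here is a minimal Python sketch of a pipeline as a directed acyclic graph of CLI-wrapping steps. It is not Tangle's component schema or API: the `Component` class, its fields, the step names, and the commands are all hypothetical, chosen only to show how dependencies determine a valid run order.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Component:
    """Hypothetical stand-in for a component: a named CLI command plus its upstream steps."""
    name: str
    command: tuple[str, ...]
    upstream: tuple[str, ...] = ()


def execution_order(components):
    """Return step names in an order that respects every dependency.

    graphlib raises CycleError if the steps do not form a DAG.
    """
    graph = {c.name: set(c.upstream) for c in components}
    return list(TopologicalSorter(graph).static_order())


# Illustrative pipeline; the scripts named here are assumptions, not real files.
PIPELINE = [
    Component("extract", ("python", "extract.py")),
    Component("featurize", ("node", "featurize.js"), upstream=("extract",)),
    Component("train", ("./train",), upstream=("featurize",)),
    Component("evaluate", ("python", "evaluate.py"), upstream=("train", "extract")),
]

if __name__ == "__main__":
    for step in execution_order(PIPELINE):
        print(step)
```

Because each step is just a command line, the steps in one graph can mix languages, which is the property the article attributes to Tangle's components.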

“Think of Tangle as the glue that connects everything in your workflow, no matter how mismatched.”

Each task runs in isolation within a container, ensuring deterministic behavior and enabling automatic reuse of artifacts.

As Shopify explains, “Components are designed as pure functions, deterministic and without side effects.”

Because caching is based on output content rather than lineage, Tangle can reuse intermediate results when only part of a pipeline changes or when another user has already run an equivalent step.
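The difference between content-based and lineage-based caching is easier to see in code. The sketch below is an assumption-laden illustration, not Tangle's implementation: `ContentCache`, `cache_key`, and the toy steps are hypothetical, and SHA-256 over a JSON payload is simply one reasonable way to build a key from a component's definition and the content of its inputs.

```python
import hashlib
import json


def content_hash(data: bytes) -> str:
    """Hash the bytes themselves, not the history of how they were produced."""
    return hashlib.sha256(data).hexdigest()


def cache_key(component_spec: dict, input_hashes: dict) -> str:
    """Key = component definition + content hashes of its inputs (hypothetical scheme)."""
    payload = json.dumps({"spec": component_spec, "inputs": input_hashes}, sort_keys=True)
    return content_hash(payload.encode("utf-8"))


class ContentCache:
    """Toy in-memory cache shared by every caller that holds a reference to it."""

    def __init__(self):
        self._outputs = {}   # cache key -> output bytes
        self.executions = 0  # how many times a step actually ran

    def run(self, component_spec, inputs, fn):
        hashes = {name: content_hash(value) for name, value in inputs.items()}
        key = cache_key(component_spec, hashes)
        if key not in self._outputs:
            self.executions += 1
            self._outputs[key] = fn(**inputs)
        return self._outputs[key]


def clean_v1(raw: bytes) -> bytes:
    return raw.strip()


def clean_v2(raw: bytes) -> bytes:
    # A rewritten step that happens to produce the same bytes for this input.
    return raw.strip(b" \n")


def count_words(text: bytes) -> bytes:
    return str(len(text.split())).encode()


if __name__ == "__main__":
    cache = ContentCache()
    raw = b"  red shoes size 10\n"

    cleaned = cache.run({"name": "clean", "version": 1}, {"raw": raw}, clean_v1)
    cache.run({"name": "count", "version": 1}, {"text": cleaned}, count_words)
    print(cache.executions)  # 2: both steps ran

    cleaned = cache.run({"name": "clean", "version": 2}, {"raw": raw}, clean_v2)
    cache.run({"name": "count", "version": 1}, {"text": cleaned}, count_words)
    print(cache.executions)  # 3: only the edited step ran again
```

In the second run the edited `clean` step executes again, but `count` is served from the cache because the bytes it receives are identical. A cache keyed on lineage would have rerun it, since its upstream step changed.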

According to Shopify, this translates into large real-world gains: “A 10-hour pipeline can be completed in 20 minutes if just one component changes.” The reuse also extends across users: “Tangle’s cache operates globally across all users; all three pipelines share artifacts, even between executions.”
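Shopify does not break the 10-hour figure down by step, so the following is only a worked illustration with invented numbers. It assumes a hypothetical four-step linear pipeline totaling 600 minutes and computes how much work must be redone when one step changes, using the conservative rule that the changed step and everything downstream of it reruns.

```python
def downstream_of(changed, upstream):
    """Return `changed` plus every step that transitively depends on it."""
    affected = {changed}
    grew = True
    while grew:
        grew = False
        for step, deps in upstream.items():
            if step not in affected and affected & set(deps):
                affected.add(step)
                grew = True
    return affected


def rerun_minutes(changed, upstream, minutes):
    """Minutes of work that cannot be served from cache after `changed` is edited."""
    return sum(minutes[step] for step in downstream_of(changed, upstream))


# Invented step names and durations (in minutes); they sum to 600, i.e. 10 hours.
UPSTREAM = {
    "extract": [],
    "featurize": ["extract"],
    "train": ["featurize"],
    "evaluate": ["train"],
}
MINUTES = {"extract": 240, "featurize": 160, "train": 180, "evaluate": 20}

if __name__ == "__main__":
    print(rerun_minutes("evaluate", UPSTREAM, MINUTES))  # 20: everything upstream is cached
    print(rerun_minutes("extract", UPSTREAM, MINUTES))   # 600: nothing can be reused
```

Under these assumptions the 20-minute case arises when the edited step sits at the end of the graph. With content-based caching the saving can be larger than this bound suggests, because a downstream step is also skipped whenever the edited step produces the same output as before.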

The platform is designed to work with any language and to run on any cloud provider or on-premises environment. Components can be written in Python, JavaScript, Rust, or anything that can read and write files. This neutrality allows teams to integrate existing code without refactoring.
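As a rough picture of what "anything that can read and write files" means in practice, the sketch below shows a small file-in, file-out command-line program of the kind a component could wrap. The `--input` and `--output` flag names and the query-normalization task are assumptions made for illustration; Tangle does not prescribe them here.

```python
import argparse
import tempfile
from pathlib import Path


def normalize_queries(text: str) -> str:
    """Lowercase and trim each line, dropping blank lines.

    The result depends only on the input text, so repeated runs give identical output.
    """
    lines = (line.strip().lower() for line in text.splitlines())
    return "\n".join(line for line in lines if line) + "\n"


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(description="Normalize a file of search queries.")
    parser.add_argument("--input", type=Path, required=True)   # hypothetical flag name
    parser.add_argument("--output", type=Path, required=True)  # hypothetical flag name
    args = parser.parse_args(argv)
    text = args.input.read_text(encoding="utf-8")
    args.output.write_text(normalize_queries(text), encoding="utf-8")


if __name__ == "__main__":
    # Demo: pass the arguments explicitly so the sketch runs without a shell.
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = Path(tmp, "queries.txt"), Path(tmp, "normalized.txt")
        src.write_text("  Red Shoes \n\nBLUE hat\n", encoding="utf-8")
        main(["--input", str(src), "--output", str(dst)])
        print(dst.read_text(encoding="utf-8"), end="")
```

The program touches nothing except the files named on its command line, which is what lets a caching layer treat it as a function from input content to output content regardless of the language it is written in.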

The visual editor provides real-time visibility into run status, cached steps, logs, and performance bottlenecks, and all runs are saved with full lineage to ensure reproducibility.

“Tangle is a key part of our Shopify data and ML system,” said Tobi Lutke, CEO of Shopify.

“Complex tasks are simplified, multiple tasks are automatically avoided, and huge amounts of waste are saved.”
