This post was co-written by Vlad Lebedev and DJ Charles of Mixbook.
Mixbook is an award-winning design platform that gives users unparalleled creative freedom to design and share unique stories, changing the lives of over 6 million people. Today, Mixbook is the #1 ranked photo book service in the US with 26,000 5-star reviews.
Mixbook empowers users to share their stories with creativity and confidence. The company's mission is to help users celebrate life's beautiful moments. Mixbook aims to foster deeper connections between users and their loved ones by sharing stories through both physical and digital mediums.
Several years ago, Mixbook embarked on a strategic initiative to migrate its operational workloads to Amazon Web Services (AWS), an initiative that has continued to deliver significant benefits. This pivotal decision has enabled the company to operate systems characterized by reliability, superior performance, and operational efficiency, playing a key role in moving the company forward toward achieving its mission.
In this post, we share how Mixbook used generative artificial intelligence (AI) capabilities from AWS to personalize the photo book experience—one step toward achieving their mission.
Business Challenge
In today's digital world, we take a lot of photos to share with friends and family. Let's say you have hundreds of photos taken during a recent family trip and you want to create a coffee table photo book to make it memorable. But selecting the best photos from among them and describing them with captions takes a lot of time and effort. As we all know, a picture is worth a thousand words. That's why it's so hard to sum up a moment in a 6-10 word caption. Mixbook truly understands this problem and is here to solve it.
Solution
Mixbook Smart Captions is the magical solution to your captioning conundrum, helping you not only interpret your photos but also add a dash of creativity to make your story stand out.
Most importantly, Smart Captions doesn't fully automate the creative process. Instead, it gives you a creative partner that enables your own storytelling and infuses your book with a personal flourish. Our goal is to make it easy for your photos to say more, whether they're selfies or landscapes.
Architecture overview
The implementation of the system involves three main components:
- Data Ingestion
- Information Inference
- Creative Integration
Caption generation depends heavily on the inference step, because the quality and meaning of what is inferred from each photo directly affect how specific and personalized the captions can be. Below is a data flow diagram of the caption generation process, which is explained in the subsequent text.
Data Ingestion
Users upload photos to Mixbook, and the raw photos are stored in Amazon Simple Storage Service (Amazon S3).
The data ingestion process involves three main components: Amazon Aurora MySQL-Compatible Edition, Amazon S3, and AWS Fargate for Amazon ECS. Aurora MySQL serves as the primary relational data store for tracking and recording media file upload sessions and their accompanying metadata. It offers flexible capacity options, ranging from serverless to reserved provisioned instances for predictable long-term use. Amazon S3, in turn, provides efficient, scalable, and secure storage for the media file objects themselves. Its storage classes allow recent uploads to be kept warm for low-latency access, while older objects can be moved to Amazon S3 Glacier tiers to minimize storage costs over time. Amazon Elastic Container Service (Amazon ECS), used with the low-maintenance compute environment of AWS Fargate, forms a convenient orchestrator for containerized workloads and ties all the components together.
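As a rough illustration of the warm-then-archive pattern described above, the following sketch builds an S3 lifecycle configuration that moves older objects to a Glacier storage class. The prefix, the 90-day threshold, the rule ID, and the function names are assumptions made for this example; the post does not state the values Mixbook uses.

```python
from typing import Any, Dict


def build_lifecycle_config(prefix: str = "uploads/",
                           days_until_archive: int = 90) -> Dict[str, Any]:
    """Return an S3 lifecycle configuration that archives older objects.

    The prefix and day count are illustrative defaults, not Mixbook's settings.
    """
    return {
        "Rules": [
            {
                "ID": "archive-older-uploads",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_until_archive, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


def apply_lifecycle_config(bucket: str, config: Dict[str, Any]) -> None:
    """Attach the configuration to a bucket (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without AWS access

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    print(build_lifecycle_config())
```

Keeping the configuration in a plain function makes it easy to review and test before it is applied to a real bucket.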
Information Inference
The inference phase extracts important contextual and semantic elements from the input, including image descriptions, temporal and spatial data, facial recognition, emotions, and labels. Of these, image descriptions generated by computer vision models provide the most basic understanding of the moment captured. Amazon Rekognition accurately detects facial bounding boxes and emotional expressions. The facial bounding boxes are primarily used for optimal automatic photo placement and cropping, while emotions help select a better tone for the story, for example, making it funnier or more nostalgic. Additionally, Amazon Rekognition enhances safety by identifying potentially offensive content.
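To make the face and emotion signals more concrete, here is a minimal sketch of how a response from the Amazon Rekognition `DetectFaces` API could be reduced to pixel bounding boxes and a dominant emotion per face. The helper names and the output shape are assumptions for illustration; only the `detect_faces` call and its `FaceDetails`, `BoundingBox`, and `Emotions` fields come from the Rekognition API.

```python
from typing import Any, Dict, List


def summarize_faces(response: Dict[str, Any],
                    image_width: int,
                    image_height: int) -> List[Dict[str, Any]]:
    """Convert DetectFaces output into pixel boxes plus the top emotion."""
    faces = []
    for detail in response.get("FaceDetails", []):
        box = detail["BoundingBox"]  # values are ratios of the image size
        emotions = detail.get("Emotions", [])
        top = max(emotions, key=lambda e: e["Confidence"], default=None)
        faces.append({
            "left": round(box["Left"] * image_width),
            "top": round(box["Top"] * image_height),
            "width": round(box["Width"] * image_width),
            "height": round(box["Height"] * image_height),
            "emotion": top["Type"] if top else None,
        })
    return faces


def detect_faces_in_s3(bucket: str, key: str) -> Dict[str, Any]:
    """Call Rekognition for an image in S3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so summarize_faces works offline

    rekognition = boto3.client("rekognition")
    return rekognition.detect_faces(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        Attributes=["ALL"],  # needed to get emotions, not just boxes
    )
```

The pixel boxes could feed a cropping step, and the emotion label could be passed along as a tone hint for caption generation.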
The inference pipeline uses an AWS Lambda-based multi-step architecture that runs independent image analysis steps in parallel for maximum cost efficiency and resiliency, with AWS Step Functions synchronizing and sequencing the interdependent steps.
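One way to picture this arrangement is a Step Functions state machine with a `Parallel` state for the independent analysis steps, followed by a task that depends on all of them. The sketch below builds such a definition in Amazon States Language; the state names, branch names, and Lambda ARNs are hypothetical placeholders, and the actual Mixbook workflow is not published in this post.

```python
import json
from typing import Any, Dict


def build_state_machine(analysis_lambdas: Dict[str, str],
                        caption_lambda_arn: str) -> Dict[str, Any]:
    """Build an Amazon States Language definition.

    analysis_lambdas maps a branch name to a Lambda ARN; every branch runs
    in parallel, and the caption step starts only after all have finished.
    """
    branches = [
        {
            "StartAt": name,
            "States": {name: {"Type": "Task", "Resource": arn, "End": True}},
        }
        for name, arn in analysis_lambdas.items()
    ]
    return {
        "Comment": "Illustrative image analysis workflow",
        "StartAt": "AnalyzeImage",
        "States": {
            "AnalyzeImage": {
                "Type": "Parallel",
                "Branches": branches,
                "Next": "GenerateCaptions",
            },
            "GenerateCaptions": {
                "Type": "Task",
                "Resource": caption_lambda_arn,
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs; replace with real function ARNs before deploying.
    definition = build_state_machine(
        {
            "DetectFaces": "arn:aws:lambda:us-east-1:123456789012:function:detect-faces",
            "DetectLabels": "arn:aws:lambda:us-east-1:123456789012:function:detect-labels",
        },
        "arn:aws:lambda:us-east-1:123456789012:function:generate-captions",
    )
    print(json.dumps(definition, indent=2))
```

The `Parallel` state passes an array of branch outputs to the next state, which is what lets the dependent step see every analysis result at once.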
Image captions are generated by an Amazon SageMaker inference endpoint and augmented by a buffer backed by Amazon ElastiCache for Redis. The buffer was implemented after benchmarking the performance of the captioning model. Benchmarking results showed that the model performed optimally when processing batches of images, but underperformed when analyzing individual images.
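The batching idea can be sketched as a small buffer that collects image keys and hands them to the captioning model only once a full batch is available. This example keeps the pending items in memory purely for illustration; in the architecture described above, that state would live in ElastiCache for Redis. The class name, batch size, and `process_batch` callback are assumptions, not Mixbook's implementation.

```python
from typing import Callable, List, Optional


class CaptionBatchBuffer:
    """Collect image keys and process them in fixed-size batches."""

    def __init__(self, batch_size: int,
                 process_batch: Callable[[List[str]], List[str]]) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._batch_size = batch_size
        self._process_batch = process_batch
        self._pending: List[str] = []  # stand-in for a Redis-backed list

    def add(self, image_key: str) -> Optional[List[str]]:
        """Queue one image; return captions if this fills a batch, else None."""
        self._pending.append(image_key)
        if len(self._pending) >= self._batch_size:
            return self.flush()
        return None

    def flush(self) -> List[str]:
        """Process whatever is pending, even if the batch is not full."""
        batch, self._pending = self._pending, []
        return self._process_batch(batch) if batch else []


if __name__ == "__main__":
    # A fake captioner standing in for the SageMaker endpoint call.
    buffer = CaptionBatchBuffer(2, lambda keys: [f"caption for {k}" for k in keys])
    for key in ["a.jpg", "b.jpg", "c.jpg"]:
        print(key, buffer.add(key))
    print("flush", buffer.flush())
```

A production version would also need a time-based flush so that a partially filled batch does not wait indefinitely.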
Creative Integration
The caption generation mechanism behind the Writing Assistant feature turns Mixbook Studio into a natural language story-writing tool. Powered by the Llama language model, the assistant initially used carefully designed prompts created by AI experts. However, the Mixbook Storyarts team wanted more control over the style and tone of the captions, so a diverse team, including Emmy-nominated screenwriters, reviewed, adjusted, and added their own handcrafted examples. This led to a process of fine-tuning the model, moderating the revised responses, and rolling out approved models for experimentation and public release. After inference, three captions are created and stored in Amazon Relational Database Service (Amazon RDS).
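To give a feel for how inferred signals and handcrafted examples might come together, the sketch below assembles a few-shot prompt from an image description, detected emotions, a target tone, and example captions. The prompt wording, the function name, and the parameters are all assumptions for illustration; the post does not publish Mixbook's prompts or the input format of its fine-tuned model.

```python
from typing import List, Sequence, Tuple


def build_caption_prompt(description: str,
                         emotions: Sequence[str],
                         tone: str,
                         examples: Sequence[Tuple[str, str]],
                         num_captions: int = 3) -> str:
    """Assemble a few-shot prompt asking for several caption candidates.

    examples holds (photo description, handcrafted caption) pairs.
    """
    lines: List[str] = [
        f"Write {num_captions} short photo book captions in a {tone} tone.",
        "",
    ]
    for example_description, example_caption in examples:
        lines += [
            f"Photo: {example_description}",
            f"Caption: {example_caption}",
            "",
        ]
    lines.append(f"Photo: {description}")
    if emotions:
        lines.append("Emotions: " + ", ".join(emotions))
    lines.append("Captions:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_caption_prompt(
        "Two kids building a sandcastle",
        ["HAPPY"],
        "playful",
        [("A dog running on a beach", "Sandy paws, happy heart.")],
    ))
```

Requesting three candidates in one prompt mirrors the three stored captions mentioned above, although the real system may produce them differently.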
The following image shows the Mixbook Smart Captions feature in Mixbook Studio.
Benefits
Mixbook implemented this solution to provide new capabilities to its customers, resulting in an improved user experience and increased operational efficiency.
User Experience
- Enhanced storytelling: Captures users' emotions and experiences and expresses them beautifully in heartfelt captions.
- User delight: Adds an element of wow with captions that are not only accurate but also fun and imaginative. One satisfied user, Hanie U., says, “I hope they release more captioning experiences in the future.” Another user, Megan P., says, “Works great!” Users can also edit the generated captions.
- Time efficiency: Nobody has time to struggle with captions. This feature saves users valuable time while making their stories shine.
- Safety and accuracy: Captions are generated responsibly, following guidelines for content moderation and relevance.
System
- Lambda Elasticity and Scalability
- Easy-to-understand workflow orchestration with Step Functions
- SageMaker's diverse base models and tuning capabilities for maximum control
As user satisfaction has grown, Mixbook has also been named an official Webby Awards winner for 2024 in the Apps and Software category for Best Use of AI and Machine Learning.
“AWS enables us to scale the innovation that delights our customers most, and now, with AWS's new generative AI capabilities, we can wow our customers with creativity they never imagined. Innovation like this is why we've partnered with AWS since the beta in 2006.”
-Andrew Laffoon, CEO of Mixbook
Conclusion
Mixbook began experimenting with AWS generative AI solutions to enhance their existing applications in early 2023. They started with a simple proof of concept, and the results showed the potential. Continuous development, testing, and integration with the broad range of AWS services in compute, storage, analytics, and machine learning allowed them to iterate quickly. After releasing the Smart Captions feature in beta, they were able to quickly adjust it in response to real-world usage patterns and protect the value of their product.
Try Mixbook Studio and experience storytelling for yourself. To learn more about AWS generative AI solutions, see Transforming Your Business with Generative AI. To hear more from Mixbook leaders, listen to the AWS re:Think podcast, available on Art19, Apple Podcasts, and Spotify.
About the Authors
Vlad Lebedev is a Senior Technology Lead at Mixbook. He leads the product engineering team responsible for transforming Mixbook into a place for heartfelt storytelling. He brings over 10 years of experience in web development, systems design, and data engineering to create elegant solutions to complex problems. Vlad enjoys learning about modern and ancient cultures, their history, and languages.
DJ Charles is the CTO of Mixbook and has spent a 30-year career creating interactive and e-commerce designs for top brands. He innovated broadband technology for the cable industry in the 90s, revolutionized supply chain processes in the 2000s, and drove environmental technology at Perillon to enable a global real-time bidding platform for brands like Sotheby's and eBay. Outside of technology, DJ loves learning new instruments and composition techniques and is an avid music producer and engineer in his spare time.
Malini Chatterjee is a Senior Solutions Architect at AWS. She advises AWS customers on their workloads across various AWS technologies. She has broad expertise in data analytics and machine learning. Prior to joining AWS, she designed data solutions in the finance industry. She is a passionate semi-classical dancer and performs at community events. She loves traveling and spending time with her family.
Jessica Oliveira is an Account Manager at AWS, where she provides guidance and support to commercial sales in Northern California. She is passionate about building strategic collaborations to ensure customer success. Outside of work, she enjoys traveling, learning different languages and cultures, and spending time with her family.