Leveraging AI and Machine Learning for Speech-to-Text and Summarization | By Johan Maurits Soritua Sianipar

As a lifelong learner, I find that embarking on a new journey can be both challenging and exciting. To stay competitive in the technology market, developers need to meet industry expectations and keep improving their skills and knowledge of current state-of-the-art technologies. With a personal goal of becoming an AI and Machine Learning expert, I stepped into the world of speech-to-text and summarization using AI in Swift.

A spark of curiosity

As employees, we attend numerous meetings. From my observations, participants often struggle to understand or retain the information shared during discussions. Several factors influence this, but poor communication and limited attention spans are likely contributors. At first glance, this does not look like a problem, because a secretary writes down the minutes of meeting (MoM) as the meeting goes on. But what if someone still misses the main idea? They may record the meeting and listen to it again to create their own MoM, which means extra effort before the information can be conveyed to the stakeholders.

Now, imagine if you could convert the spoken words into a concise, easy-to-read, and actionable summary with just one click. You would save a ton of time without compromising on quality and accuracy. Every participant would share a common idea and understanding, leading to better decision-making within your organization. That would be great, wouldn't it?

So, I decided to dig deep into the world of AI and machine learning to address the problem.

Laying the foundations

Learning AI and Machine Learning is hard, and it gets harder when the goal is a functioning app rather than theory alone. But there is no choice but to move forward. Before writing any code, I studied the concepts of speech-to-text and summarization, chose a platform and technology as a foundation, and explored which architecture pattern would suit the project best: MVVM or MV (SwiftUI is Model-View itself!). We won't cover that comparison in this article, but if you want to learn more, see the link below:

https://azamsharp.com/2023/02/28/building-large-scale-apps-swiftui.html

Finally, for this project, I decided to build the app for macOS using Swift as the programming language and SwiftUI as the UI framework.

Development Stage

Practice makes perfect! The best way to learn how to swim is to swim. So in this project, we learned by building the app ourselves. We used Apple's App Dev Tutorials as a guide to create the project from scratch. For more information, see the following link:

https://developer.apple.com/tutorials/app-dev-training/getting-started-with-scrumdinger

First, we organized views and models into different folders. Views display all the information, while models define the data and the functionality bound to a specific context. For example, the speech recognition model contains the data structures and functionality to access the microphone, start listening, transcribe speech, and save data to local storage.

We've also standardized file names to help contributors understand the purpose of each file at a glance. We use the word “View” in the name of every file that displays information to the user. To keep the project simple, we're only using Models and Views.

We then added the ability to organize meetings into relevant groups. For example, an “Application” folder would contain all meetings on implementation, discussions, progress sharing, and any other topic related to the application. This makes it convenient to find and review the data later.

Additionally, we've enabled users to set up meetings: they can set the title, the meeting duration, and the attendees, and customize the folder color.
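
The folders and meeting settings described above could be modeled roughly as follows. This is a minimal sketch, not the app's actual code: the type names (FolderColor, Attendee, Meeting, MeetingFolder), their fields, and the color choices are my assumptions for illustration. Making the types Codable is one simple way to save them to local storage as JSON.

```swift
import Foundation

/// Hypothetical color choices; the app's real palette is not shown in the article.
enum FolderColor: String, Codable, CaseIterable {
    case blue, green, orange, purple
}

/// A person invited to a meeting.
struct Attendee: Identifiable, Codable, Equatable {
    var id = UUID()
    var name: String
}

/// One meeting with the settings the user can configure.
struct Meeting: Identifiable, Codable, Equatable {
    var id = UUID()
    var title: String
    var lengthInMinutes: Int
    var attendees: [Attendee]
    var transcript: String = ""
}

/// A group of related meetings, shown with a user-chosen color.
struct MeetingFolder: Identifiable, Codable, Equatable {
    var id = UUID()
    var name: String
    var color: FolderColor
    var meetings: [Meeting] = []
}

// Create a folder, add a meeting, and round-trip it through JSON.
var application = MeetingFolder(name: "Application", color: .blue)
application.meetings.append(
    Meeting(title: "Progress sharing",
            lengthInMinutes: 30,
            attendees: [Attendee(name: "Speaker A"), Attendee(name: "Speaker B")])
)

let data = try JSONEncoder().encode(application)
let restored = try JSONDecoder().decode(MeetingFolder.self, from: data)
print(restored.meetings.count)
```

Keeping the folder as the owner of its meetings keeps the sketch small; a larger app might store meetings separately and reference them by id.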

Now here comes the fun part. I created a speech recognition model. I took a model from Apple's App Dev Tutorial and modified it to work on macOS. The goal of this model is to listen to an individual's speech and transcribe it into a string of text.
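
To give a feel for what such a model involves, here is a minimal sketch of live transcription with Apple's Speech framework on macOS. It is not the project's actual model: the class name LiveTranscriber and the en-US locale are assumptions for illustration. One likely difference from the iOS tutorial code is AVAudioSession, which is not available to native macOS apps, so this sketch configures only AVAudioEngine. A real app also needs the NSMicrophoneUsageDescription and NSSpeechRecognitionUsageDescription keys in Info.plist and should call SFSpeechRecognizer.requestAuthorization before starting.

```swift
#if canImport(Speech)
import AVFoundation
import Speech

/// Hypothetical helper that streams microphone audio into the Speech framework.
final class LiveTranscriber {
    // Assumed locale; the initializer returns nil for unsupported locales.
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private let audioEngine = AVAudioEngine()
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    /// Starts listening and calls `onUpdate` with the best transcription so far.
    func start(onUpdate: @escaping (String) -> Void) throws {
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.shouldReportPartialResults = true
        self.request = request

        // Forward every captured audio buffer to the recognition request.
        let input = audioEngine.inputNode
        let format = input.outputFormat(forBus: 0)
        input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
            request.append(buffer)
        }
        audioEngine.prepare()
        try audioEngine.start()

        task = recognizer?.recognitionTask(with: request) { result, _ in
            if let result = result {
                onUpdate(result.bestTranscription.formattedString)
            }
        }
    }

    /// Stops the microphone and discards the current recognition task.
    func stop() {
        audioEngine.stop()
        audioEngine.inputNode.removeTap(onBus: 0)
        request?.endAudio()
        task?.cancel()
        request = nil
        task = nil
    }
}
#endif
```

A view model could call start and append each update to the meeting's transcript, then call stop when the meeting ends.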

Finally, we created another model whose job is to generate summaries, using carefully designed prompts to make the summaries useful. For this purpose, we called OpenAI's GPT-4 Turbo model through its API; it was among the most advanced models available at the time of writing.
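
As an illustration of this step, the sketch below builds (but does not send) a summarization request. It assumes the OpenAI Chat Completions endpoint and the model identifier "gpt-4-turbo"; the prompt wording and the names ChatMessage, ChatRequest, makeSummaryMessages, and makeSummaryRequest are hypothetical, since the article does not show the prompts or code the app actually uses.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on Linux
#endif

/// One message in a chat-style request.
struct ChatMessage: Codable, Equatable {
    let role: String
    let content: String
}

/// JSON body for a chat completion request.
struct ChatRequest: Codable, Equatable {
    let model: String
    let messages: [ChatMessage]
}

/// Hypothetical prompt: a system instruction plus the transcript to summarize.
func makeSummaryMessages(transcript: String) -> [ChatMessage] {
    return [
        ChatMessage(role: "system",
                    content: "You write concise minutes of meetings. List key decisions and action items."),
        ChatMessage(role: "user",
                    content: "Summarize this transcript:\n\(transcript)")
    ]
}

/// Builds the HTTP request. The endpoint and model name are assumptions.
func makeSummaryRequest(transcript: String,
                        apiKey: String,
                        model: String = "gpt-4-turbo") throws -> URLRequest {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    let body = ChatRequest(model: model,
                           messages: makeSummaryMessages(transcript: transcript))
    request.httpBody = try JSONEncoder().encode(body)
    return request
}
```

The request can then be sent with URLSession. Keeping the prompt in its own function makes it easy to experiment with different wording without touching the networking code.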

Result

Now that we have added all the “important ingredients”, it's time to test the app. First, I created a folder, did the necessary configuration, and tried to start a meeting.

Each participant could take turns speaking during the meeting according to the time given to them. After all the speakers had spoken, the app would summarize the meeting with a prompt. The result is shown in the screenshot below.
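
The turn-taking can be sketched as a simple schedule. The version below assumes the meeting time is split equally among attendees, which may differ from how the app assigns time; SpeakerSlot, makeSpeakerSlots, and activeSpeaker are hypothetical names for illustration.

```swift
import Foundation

/// The time window, in seconds from the start of the meeting, given to one speaker.
struct SpeakerSlot: Equatable {
    let name: String
    let startSecond: Int
    let endSecond: Int
}

/// Splits the meeting equally among attendees (an assumption).
/// The last speaker absorbs any leftover seconds from integer division.
func makeSpeakerSlots(attendees: [String], lengthInMinutes: Int) -> [SpeakerSlot] {
    guard !attendees.isEmpty, lengthInMinutes > 0 else { return [] }
    let totalSeconds = lengthInMinutes * 60
    let secondsPerSpeaker = totalSeconds / attendees.count
    return attendees.enumerated().map { (index, name) -> SpeakerSlot in
        let start = index * secondsPerSpeaker
        let isLast = index == attendees.count - 1
        let end = isLast ? totalSeconds : start + secondsPerSpeaker
        return SpeakerSlot(name: name, startSecond: start, endSecond: end)
    }
}

/// Returns who should be speaking at a given elapsed second, if anyone.
func activeSpeaker(at second: Int, in slots: [SpeakerSlot]) -> String? {
    return slots.first { second >= $0.startSecond && second < $0.endSecond }?.name
}

let slots = makeSpeakerSlots(attendees: ["A", "B", "C"], lengthInMinutes: 10)
print(activeSpeaker(at: 250, in: slots) ?? "nobody")  // prints "B"
```

Once the last slot ends, activeSpeaker returns nil, which is a natural point to send the collected transcript for summarization.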

My Reflection

Every great journey has its bumps. Although the app works, there is room for improvement, such as fixing bugs and improving usability. In addition, the recording function needs to be redesigned. In a real meeting, people do not naturally speak in a fixed order: they speak up as topics arise, and sometimes several people speak within the same minute.

Additionally, some meetings contain sensitive information. Because the app sends meeting content to a third-party model, it is not recommended for such meetings. Organizations may be able to build local models that keep the data on-premises. However, building models is labor intensive, and training them requires advanced infrastructure.

All in all, this is a good start for me to further explore and improve my expertise in AI and machine learning. I am looking forward to learning more and improving my skillset in this research field.

Additional resources:

You can clone my project and modify it as you need. Although I use English in the project, some text is in Indonesian because the app's main target audience is Indonesian. However, I expect it will not affect your learning process.

GitHub: https://bit.ly/NotulaAI-Dev

Happy coding!
