Practical Machine Learning Techniques by David Langer

David Langer explains four easy-to-learn, cutting-edge, and practical techniques that can help you extract value from machine learning.

Upside Staff
June 5, 2024

David Langer appeared again on TDWI's Speaking of Data podcast to discuss real-world machine learning techniques. Langer has been a technology professional for almost 28 years, half of them in hands-on analytics roles. He now works as an independent consultant and trainer for TDWI, focusing on teaching practical data science skills. [Editor’s note: Langer will be teaching a machine learning bootcamp for TDWI on June 24–26, 2024.]

His previous appearances on “Speaking of Data” focused on data literacy, and he says that remains a big part of his work, but given what's happened over the past year, he sees a shift in demand toward more advanced analytics, including the use of AI to improve productivity.

Langer points out that ChatGPT is a very sophisticated machine learning model. “It’s very powerful, but technically it’s a machine learning model.” Langer’s work focuses on a subset of machine learning techniques that are disproportionately useful to most organizations. Rather than focus on Microsoft’s Copilot (which the company describes as ChatGPT integrated into Excel), Langer believes it’s more important to understand how AI and machine learning work.

“If you tell Copilot, ‘Generate a predictive model based on the data in this particular table in an Excel worksheet,’ it will indeed do that and generate Python code for you. But if you don’t understand Python code, if you don’t understand the modeling process and what’s actually being generated, you’re in trouble.

“Of course, you can use Copilot or generative AI to accelerate these tasks, but you need to first learn the basics.” That's why he focuses on practical machine learning approaches: tools and techniques that are valuable to any professional, regardless of where they work.

The most common use case for machine learning/AI is to create a model that predicts a label: will this customer convert?

“Let's say I'm predicting medal winners at the Olympics. I'm predicting bronze, silver, gold, and no medal. So I'm trying to predict four different things. In machine learning terms, I'm classifying data. I'm trying to predict the 'class' or label, which is the most ROI-rich advanced analytics scenario for any organization. I train people to build predictive models that solve these classification problems.”

Four basic techniques

The best techniques are based on decision trees and random forests. “These are state-of-the-art techniques,” Langer asserts. “They're very easy to learn and very valuable.” His four-hour TDWI course teaches students “the basics of Python, everything you need to know from the basics onwards, even if you have no programming background at all – just the subset of Python you need to be productive.”

Langer explains that this mindset also applies to Excel users: “If you've ever written an Excel formula, you've written code, whether you think of it that way or not. It's all pretty much the same. If you think about using Python for analytics or data science, you ignore all that stuff, because you're not actually designing software from the ground up. You just focus on how to use the tools to get the job done, which is very similar to using Excel formulas.”

What should companies focus on to get the best ROI from traditional machine learning? “First of all, you don't need to learn a ton of stuff to be super productive. Most people look at a book and think, 'Oh my gosh, here are 50 different machine learning techniques. Do I have to learn them all?'”

Langer's answer is a resounding “no.”

“There are basically four techniques you need to learn first,” he says, starting with decision trees. A decision tree is simply a tree-shaped diagram made up of small nodes, each of which is a yes/no decision. “If you've ever seen an organizational chart, that's what a decision tree looks like. Using decision trees is intuitive, so we're going to focus on decision trees to learn the basics of machine learning.”
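
To make the "nodes of yes/no decisions" idea concrete, here is a minimal sketch in plain Python. The tree is hand-built rather than learned from data, and the feature names, thresholds, and labels are hypothetical, chosen only to echo the "will this customer convert?" example above.

```python
# A hypothetical, hand-built decision tree. Each internal node asks one
# yes/no question of the form "is row[feature] >= threshold?".
# Feature names, thresholds, and labels are invented for illustration.
tree = {
    "question": ("visits_last_month", 3),
    "yes": {
        "question": ("opened_last_email", 1),
        "yes": "will convert",
        "no": "might convert",
    },
    "no": "will not convert",
}


def predict(node, row):
    """Walk from the root to a leaf, answering one yes/no question per node."""
    while isinstance(node, dict):
        feature, threshold = node["question"]
        node = node["yes"] if row[feature] >= threshold else node["no"]
    return node  # a leaf is just the predicted label


customer = {"visits_last_month": 5, "opened_last_email": 0}
print(predict(tree, customer))  # prints "might convert"
```

In practice a library would learn the questions and thresholds from labeled data; the sketch only shows how a finished tree turns a row of data into a predicted label.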

“We can then use that knowledge to build something called a random forest, which is just a collection of individual decision trees. Random forest is a production-quality, state-of-the-art machine learning algorithm that's incredibly easy to learn to use. You don't need to be a math genius (or even a math novice) to understand how it works, but random forest gives you that power.”
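
The "collection of individual decision trees" idea can be sketched in plain Python as follows. This is a toy version under stated assumptions, not a production random forest: each tree is limited to a single yes/no question (a "stump"), each is trained on a random resample of the data using one randomly chosen feature, and the forest predicts by majority vote. The function names and the tiny data set are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(rows, labels, feature):
    """Fit a one-question tree: 'is row[feature] <= threshold?'."""
    majority = Counter(labels).most_common(1)[0][0]
    values = sorted(set(r[feature] for r in rows))
    best = None  # (weighted impurity, threshold, left label, right label)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between two observed values
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, t,
                    Counter(left).most_common(1)[0][0],
                    Counter(right).most_common(1)[0][0])
    if best is None:  # only one distinct value: always predict the majority
        return lambda row: majority
    _, t, left_label, right_label = best
    return lambda row: left_label if row[feature] <= t else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature choice
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest


def forest_predict(forest, row):
    """Every tree votes; the most common label wins."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]


# Hypothetical, clearly separable data: two numeric features per row.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
labels = ["no"] * 4 + ["yes"] * 4
forest = fit_forest(rows, labels)
print(forest_predict(forest, (0, 0)), forest_predict(forest, (10, 10)))
```

A real random forest grows much deeper trees and considers several candidate features at every split, but the two sources of randomness and the final vote are the same ingredients shown here.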

Decision trees and random forests are supervised learning techniques. “I can use those to predict the labels that I was talking about earlier. The next really useful area of machine learning is clustering. I have a pile of data. I have a pile of customers. I have a pile of patients or claims. It doesn't matter what that pile is. I want to extract hidden structure from that data. I don't want to have to go through 1,000 documents to understand what they are. Can I use a computer to group them together and say, 'All of those documents are similar to these documents here,' and help me with that? That's what clustering, or cluster analysis, is all about, and it's a type of machine learning known as unsupervised learning.”

Two of the most popular unsupervised clustering algorithms are k-means (which Langer says is very easy for anyone to learn) and DBSCAN.
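
As a rough illustration of how k-means groups a "pile" of unlabeled data, the following plain-Python sketch alternates between assigning each point to its nearest center and moving each center to the mean of its points. It assumes numeric points given as tuples and, for simplicity, uses the first k points as starting centers, which real libraries typically do not do. The function name and the sample points are hypothetical.

```python
import math


def kmeans(points, k, n_iter=10):
    """Basic k-means, using the first k points as the initial centers."""
    centers = [points[i] for i in range(k)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        centers = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centers[c]  # keep an empty cluster's old center
            for c, members in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
              for p in points]
    return centers, labels


# Hypothetical data: two loose groups, with no labels supplied.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (2.0, 1.0), (9.5, 8.0)]
centers, labels = kmeans(points, k=2)
print(labels)  # prints [0, 1, 0, 1, 0, 1]
```

The algorithm never sees a label; the cluster numbers it returns are the "hidden structure" recovered from the distances between points alone.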

“These four elements are enough to get started, and that's what we focus on in the TDWI Machine Learning Bootcamp. It's easy for a wide range of users to learn: cutting-edge technology that you can quickly pick up and use to create real value when you return to work.”
