Bridging the gap between research and readability with Marco Henning Talarico

Machine Learning


In our Author Spotlight series, TDS editors chat with community members about career paths, writing, and sources of inspiration in data science and AI. Today we are pleased to share our conversation with Marco Henning Talarico.

Marco is a graduate student at the University of Toronto and a researcher at Risklab, with a deep interest in applied statistics and machine learning. Born in Brazil and raised in Canada, Marco appreciates the universal language of mathematics.

What motivates you to turn dense academic concepts (such as stochastic differential equations) into accessible tutorials for the broader TDS community?

It’s natural to want to learn everything in its natural order: algebra, calculus, statistics, and so on. But if you want to improve quickly, you need to abandon that tendency. In a maze, dropping yourself into the middle is cheating, but in learning there are no rules. Start at the end and work backwards as needed. It’s far less hassle.

Your article on data leakage focuses on identifying leaks in your code, not just in theory. In your experience, what are the most common silent leaks still making it into production systems today?

Data leakage can slip in very easily during exploratory analysis, or when aggregations are used as model inputs, especially now that aggregates can be computed in near real time with little effort. I think it’s important to do the train-test split before you plot anything, before you even run .head(). Think carefully about how you want to segment your users: by user, by size, by chronology, by tier. There’s a lot to choose from, but it’s worth your time.
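The splitting advice above can be sketched in plain Python. This is a minimal, hypothetical example (the event log, user IDs, and per-user mean feature are all invented for illustration): the split happens at the user level before any peeking or aggregation, and features are computed from training rows only.

```python
import random
from collections import defaultdict

# Hypothetical event log: one row per user action, (user_id, value).
events = [(uid, random.random()) for uid in range(100) for _ in range(10)]

# Split at the *user* level first, before any .head()-style peeking or
# aggregation, so that no user appears on both sides of the split.
users = sorted({uid for uid, _ in events})
random.seed(0)
random.shuffle(users)
cut = int(0.8 * len(users))
train_users, test_users = set(users[:cut]), set(users[cut:])

train = [(u, v) for u, v in events if u in train_users]
test = [(u, v) for u, v in events if u in test_users]

# Aggregates used as model features (here, a per-user mean) are then
# computed from the training rows only, never from the full dataset.
sums, counts = defaultdict(float), defaultdict(int)
for u, v in train:
    sums[u] += v
    counts[u] += 1
user_mean = {u: sums[u] / counts[u] for u in sums}
```

The same idea applies to chronological or tiered splits; the key design choice is that the split boundary is drawn before any statistic touches the data.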

Also, if you use a metric such as average users per month, double-check that no aggregate is computed over the month you’re using as your test set. These leaks are harder to spot because they’re indirect. If you’re trying to predict which planes will crash, it isn’t always obvious that you shouldn’t use black-box data. If you already have the black box, it’s not a prediction: the plane has crashed.
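One way to make that double-check mechanical is to have the aggregation refuse any data at or after the test cutoff. This is a hedged sketch with invented numbers (the daily counts and dates are purely illustrative), not a prescribed implementation:

```python
from datetime import date

# Hypothetical daily user counts; the test set is the final month (March).
daily_users = {date(2024, m, d): 100 + m * d
               for m in (1, 2, 3) for d in (1, 15, 28)}
test_start = date(2024, 3, 1)

def monthly_avg_users(month, cutoff):
    """Average daily users for a month, refusing any day at/after the cutoff."""
    days = [d for d in daily_users if d.month == month]
    assert all(d < cutoff for d in days), "aggregate window overlaps the test period"
    return sum(daily_users[d] for d in days) / len(days)

jan_avg = monthly_avg_users(1, test_start)  # fine: January is before the cutoff
# monthly_avg_users(3, test_start) would raise: March is the test month.
```

Baking the cutoff into the aggregation function, rather than trusting yourself to remember it, is what turns a silent leak into a loud one.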

You mention that learning grammar from data alone is computationally expensive. Do you think a hybrid model (statistical + formal) is the only way to achieve sustainable long-term AI scaling?

Take LLMs as an example: they struggle with many simple tasks, such as adding a list of numbers or converting a page of text to uppercase. It would be logical to think that making the model larger would solve these problems, but that’s not a good solution. It’s much more reliable to have the model call .sum() or .upper() on the user’s behalf, using its linguistic reasoning only to select the inputs. This is probably what leading AI models are already doing through clever prompt engineering.
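The division of labor described above can be sketched as a tiny tool-dispatch loop. Everything here is hypothetical scaffolding (the tool names, the call format): the point is only that the model emits a structured call and a deterministic function does the exact computation.

```python
# Hypothetical tool registry: deterministic functions the model can invoke
# instead of attempting arithmetic or case conversion itself.
TOOLS = {
    "sum": lambda args: sum(args),
    "upper": lambda args: args[0].upper(),
}

def run_tool_call(call):
    """Execute a structured tool call like {'tool': 'sum', 'args': [...]}."""
    return TOOLS[call["tool"]](call["args"])

# The model's linguistic reasoning selects the tool and its inputs;
# the computation itself is exact and cheap.
print(run_tool_call({"tool": "sum", "args": [1, 2, 3, 4]}))        # 10
print(run_tool_call({"tool": "upper", "args": ["hello world"]}))   # HELLO WORLD
```

Scaling the model buys better tool *selection*; the tools themselves never need scaling at all.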

Using a formal grammar to strip out unwanted artifacts, like the infamous em-dash problem, is much easier than collecting another third of the internet’s data and performing further training.
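As a minimal illustration of that idea, here is a rule-based post-processing pass, assuming (hypothetically) that the unwanted artifacts can be described by simple patterns. The specific rules are invented; a real system might use a full grammar rather than regular expressions.

```python
import re

# Deterministic rewrite rules applied to model output, rather than
# retraining the model to avoid the artifacts. Rules are illustrative.
RULES = [
    (re.compile("\u2014"), ", "),   # em-dash -> comma and space
    (re.compile(r"  +"), " "),      # collapse runs of spaces
]

def postprocess(text):
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

print(postprocess("Scaling helps\u2014but rules are cheaper."))
```

A few lines of deterministic rules fix a surface-level habit that more pretraining data might never reliably remove.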

You contrast forward and inverse problems in PDE theory. Can you share a real-world scenario, other than temperature modeling, where an inverse-problem approach could be the solution?

Forward problems tend to be more comfortable for most people. Take the Black-Scholes model: the forward problem asks, given market assumptions, what should the option price be? But another question arises: given many observed option prices, what are the parameters of the model? That is an inverse problem. That is inference, and that is implied volatility.
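The two directions can be shown side by side in a short script. This is a sketch using only the standard library: the forward problem prices a European call under Black-Scholes, and the inverse problem recovers the volatility from an observed price by bisection (the parameter values are illustrative).

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, r, sigma):
    """Forward problem: option price from model parameters."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

def implied_vol(price, S, K, T, r, lo=1e-6, hi=5.0, tol=1e-8):
    """Inverse problem: recover sigma from an observed price by bisection.

    Works because the call price is monotonically increasing in sigma.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) < price:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

price = bs_call(100, 100, 1.0, 0.05, 0.2)       # forward: sigma -> price
sigma = implied_vol(price, 100, 100, 1.0, 0.05)  # inverse: price -> sigma
```

The forward direction is a single formula; the inverse direction has no closed form and must be solved numerically, which is typical of inverse problems.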

You can also think in terms of the Navier-Stokes equations, which model fluid dynamics. The forward problem: compute the velocity or pressure field given the wing shape, initial velocity, and air viscosity. But we can also ask what shape an airplane’s wing must be, given the velocity and pressure fields. This tends to be very difficult to solve. Once you know the cause, it’s easy to compute the effect. But given a large number of effects, it’s not always easy to compute their causes, because multiple causes can explain the same observation.

This is also one of the reasons PINNs (physics-informed neural networks) have become popular recently. They exploit how efficiently neural networks can learn from data, which opens up a whole toolbox (Adam, SGD, backpropagation, and so on) and brings it to bear on solving partial differential equations.

As a master’s student who is also a prolific technical writer, what advice would you give to other students who want to start sharing their research on platforms like Towards Data Science?

In technical writing, I think there are two competing choices you have to make actively. You can think of them as distillation and dilution. A research paper is a lot like a shot of vodka. Its introduction compresses a vast field of research into a few sentences. Vodka’s harsh taste comes from distillation; in writing, the main culprit is technical jargon. This language-compression algorithm lets you discuss abstract concepts such as the curse of dimensionality or data leakage in just a few words. It’s also a tool you can run in reverse.

The original deep learning paper is seven pages long. There are deep learning textbooks that run 800 pages (a piña colada, by comparison). Both are good for the same reason: they provide the right level of detail to the right audience. To find that level, you need to read books in the genre you want to publish in.

Of course, how you dilute the spirit matters. No one wants a monstrosity that’s part lukewarm water, part Tito’s. Recipes for making your writing more digestible include using memorable metaphors (the content sticks with you, like a piña colada on your table), focusing on a few key concepts, and elaborating with examples.

But technical writing also involves distillation, and that comes down to “omit[ting] needless words.” The old Strunk & White adage always rings true, and it reminds me to keep reading about the craft of writing. Roy Peter Clark is my favorite.

You also write research papers. How do you tailor your content when writing for a general data science audience versus a research-focused audience?

I avoid alcohol-related metaphors at all costs; in fact, any figurative language. Stick to the concrete. In a research paper, the main thing you need to convey is what progress has been made: where was the field before, and where is it now? It’s not about teaching; the audience is assumed to know the material. It’s about pitching ideas, advocating methodologies, and supporting hypotheses. You should show where the gap was and explain how your paper fills it. If you can do those two things, you will have a great research paper.

To learn more about Marco’s work and keep up with his latest articles, visit his website and follow him on TDS or LinkedIn.


