What a gift Google Trends is to society. Without it, how would we know that the increase in Disney movies released in the 2000s led to a decrease in divorces in the UK? Or that drinking Coca-Cola is the little-known cure for cat scratches?
Wait, are you confusing correlation and causation again?
If you’d rather watch than read, you can do so here.
Google Trends is one of the most widely used tools for analyzing human behavior at scale. Journalists use it. Data scientists use it. Entire papers are built on it. However, there are fundamental properties of Google Trends data that make it very easy to misuse, especially when working with time series or trying to build models, and most people don’t even realize they’re doing it.
All graphs and screenshots are created by the author unless otherwise noted.
The Google Trends data problem
Google doesn’t actually publish raw search volume numbers. That information is what prints them money, and there’s no way they’d publish it for others to monetize. What they do give us is a way to look at a time series and understand how searches for a particular term change over time: a normalized dataset.
This doesn’t seem like a problem until you try to do machine learning with it, because for a machine to learn anything, it needs to be fed a decent amount of data.
My first idea was to pull a five-year window, but a problem soon arose: the larger the time window, the less granular the data. I couldn’t get daily data for five years, so I thought, “I’ll just grab the longest period that still gives me daily data and slide that window along.” That was also a problem, because this is where I learned to truly fear normalization:
Whatever time period or search terms you request, the most-searched data point in that window is set to 100. In other words, the meaning of 100 changes depending on the window you use.
This entire post exists for this reason.
Google Trends Basics
Now, I don’t know if you’ve ever used Google Trends before, but in case you haven’t, I’ll walk you through it so we can get to the heart of the matter.
If you search for the word “motivation”, it defaults to the UK (because that’s where I’m searching from) and to the past day. We now have a nice graph showing how often people have searched for “motivation” over the last 24 hours.

I like this because it clearly shows that people mainly look for motivation during the working day. Hardly anyone searches for it when most of the country is asleep, and there are definitely some kids out there who need homework encouragement. I can’t explain the late-night spikes, but my guess is they’re people who aren’t ready to go back to work tomorrow.
That’s great, but at 8-minute increments over 24 hours you only get 180 data points, and most of them are actually zero. I also have no idea whether the past 24 hours were unusually unmotivated compared to the rest of the year, or whether today’s motivation represents the year’s biggest contribution to GDP. Let’s try some bigger windows.
The first thing you notice as soon as you move to a week is that the granularity drops. We have a week’s worth of data, but now only at hourly resolution, and we still have the same core problem: we don’t know how representative this week is.
You can keep zooming out: 30 days, 90 days. Granularity drops at each step, and you never end up with many more data points than you had for 24 hours. That’s not enough for building a real model. We have to go big.
And if you choose five years, you run into the problem that motivated (pun unintended, sorry) this entire post: you can no longer get daily data. Also, why isn’t today at 100?

Here’s the real problem with Google Trends data
As mentioned earlier, Google Trends data is normalized. Whatever time period or set of search terms you request, the data point with the most searches is set to 100, and every other point is scaled down relative to it. If the search volume on April 1st was half of the maximum, the Google Trends score on April 1st would be 50.
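To make that concrete, here is a tiny sketch of the normalization in Python. The raw volumes are entirely made up, since Google never exposes them:

```python
import numpy as np

# Hypothetical raw daily search volumes (Google never exposes these).
raw = np.array([12_000, 18_000, 36_000, 9_000, 27_000])

# Google Trends-style normalization: the busiest day becomes 100 and
# everything else is scaled relative to it, then rounded to integers.
scores = np.round(raw / raw.max() * 100).astype(int)

print(scores)  # [ 33  50 100  25  75]
```

Change the window (and therefore the maximum) and every score changes with it, which is exactly the trap described above.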
Let’s look at an example to illustrate the point: May and June 2025. Each is only 30 or 31 days, so you get daily data (remember, you lose that beyond 90 days). Looking at May on its own, it hits 100 on the 13th; looking at June on its own, it hits 100 on the 10th. Does that mean there were just as many searches for motivation on June 10th as there were on May 13th?


If we zoom out and show May and June on the same graph, we can quickly see that this is not the case. With both months included, June 10th has a Google Trends score of 81. So, as a share of UK searches, it was 81% of the level on May 13th. Without zooming out, we would never have known that.

All is not lost, though. This experiment taught us something useful: when two data points are in the same graph, we can see their relative difference. So if you load May and June separately, knowing that June 10th is 81% of May 13th means you can scale the June data down accordingly, and the two months become comparable.
That’s what I decided to do: pull Google Trends data in windows that overlap by a day. So January 1st to March 31st, then March 31st to July 31st. March 31st appears in both datasets, so you can use it to scale the second window to be comparable to the first.
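In code, the single-day overlap idea looks roughly like this. It’s a minimal sketch, assuming each window has already been pulled into a pandas Series of daily scores indexed by date; `stitch_on_overlap` is just an illustrative name:

```python
import pandas as pd

def stitch_on_overlap(first: pd.Series, second: pd.Series, overlap_day: str) -> pd.Series:
    """Rescale `second` so it is comparable to `first`, using one shared day."""
    day = pd.Timestamp(overlap_day)
    # How the two windows disagree about the same day tells us the scale factor.
    factor = first.loc[day] / second.loc[day]
    # Rescale the second window, drop its duplicate of the shared day, and join.
    rescaled = (second * factor).drop(index=day)
    return pd.concat([first, rescaled]).sort_index()

# e.g. combined = stitch_on_overlap(jan_to_mar, mar_to_jul, "2025-03-31")
```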
This is getting close to something we can use, but there’s another problem you need to be aware of.
Google Trends: A new layer of randomness
When it comes to Google Trends data, Google doesn’t actually count every search. That would be a computational nightmare. Instead, Google samples searches to construct a representation of search volume.
This means that even though the sample is presumably very well built (it is Google, after all), every day carries some natural random variation. If March 31st happened to come out unusually high or low in Google’s sample compared to the real world, the overlap technique would propagate that error through the entire dataset.
On top of this, you also have to consider rounding. Google Trends rounds everything to the nearest whole number: there is no 50.5, only 50 or 51. That may seem like a small thing, but it can actually be a big problem. Let me explain why.
On October 4th 2021, there was a massive spike in searches for Facebook. That spike gets scaled to 100, which pushes every other value in the window much closer to zero. When you then round to the nearest integer, an absolute error of 0.5 becomes a huge proportional error on a value that is only 1 or 2. So the solution needs to be robust to noise as well as to scaling.
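To see how nasty the rounding gets near zero, here’s a toy calculation (the numbers are invented, not taken from the Facebook data):

```python
# An ordinary day that "should" score 1.4 once the spike is pinned to 100.
true_value = 1.4
reported = round(true_value)                    # Google Trends reports 1
rel_error = abs(reported - true_value) / true_value
print(f"reported {reported}, relative error {rel_error:.0%}")  # ~29%
```

The same half-point error on a score of 90 would be negligible; on a score of 1 or 2 it is enormous, and that is exactly the regime our overlap days live in after a big spike.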
So how do we solve this? We know the sample is representative on average, so let’s take a larger sample: a longer overlap window is less susceptible to random fluctuation and rounding error.
So here is the final plan. We know we can get daily data for up to 90 days at a time, so I’m going to load a rolling series of 90-day windows, with each window overlapping the next by a full month. Our overlap is then no longer a single noisy day but a stable one-month anchor that can be used to scale the data much more precisely.
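Here’s roughly what that plan looks like in code. It’s a sketch rather than my exact pipeline: it assumes the third-party pytrends library for fetching (you could just as well export CSVs by hand), and `fetch_window` / `stitch` are illustrative names. Be gentle with request rates; as you’ll see shortly, Google will temporarily ban you if you aren’t.

```python
import pandas as pd
from pytrends.request import TrendReq  # unofficial Google Trends client

def fetch_window(term: str, start: str, end: str, geo: str = "GB") -> pd.Series:
    """Pull daily Google Trends scores for one window of <= ~90 days."""
    pytrends = TrendReq(hl="en-GB", tz=0)
    pytrends.build_payload([term], timeframe=f"{start} {end}", geo=geo)
    return pytrends.interest_over_time()[term].astype(float)

def stitch(windows: list[pd.Series]) -> pd.Series:
    """Chain 90-day windows that overlap by a month, scaling each new
    window onto the running series via the mean ratio over the overlap."""
    combined = windows[0]
    for nxt in windows[1:]:
        overlap = combined.index.intersection(nxt.index)
        factor = combined.loc[overlap].mean() / nxt.loc[overlap].mean()
        combined = pd.concat([combined, (nxt * factor).drop(index=overlap)]).sort_index()
    return combined
```

Using the mean over a full month of shared days is what makes the anchor stable: a single noisy or heavily rounded day barely moves the ratio.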
So it looks like we have a plan. I do have some concerns, the main one being that chaining a large number of windows will compound the errors, and compounding errors can absolutely explode. But to see how this behaves on real data, we just have to go and do it. So: here’s one I made earlier.
Writing code to understand Google Trends
After writing up everything we’ve discussed so far as code, and having the fun of getting temporarily banned from Google Trends for pulling so much data, I put together some graphs. My immediate reaction when I saw them was: “Oh, it exploded.”

The graph below shows the stitched five-year search series for Facebook. We see a fairly steady downward trend, but two spikes stand out, the first of which is the aforementioned massive spike on October 4th 2021.

The first thing I did was sanity-check the spikes. Somewhat ironically, I googled it and learned about the global Meta outage that day. Pulling data for Instagram and WhatsApp over the same period showed a similar spike. So even though I knew the spike was real, I still had my doubts: was it too big?
My heart sank when I put my time series next to Google Trends’ own graph: my spike was huge in comparison. I started thinking about how to deal with it. Cap the maximum spike value? That feels arbitrary and throws away information about the relative size of the spikes. Apply some scaling factor? That felt like a guess.

That was until I had an epiphany. Remember, Google Trends does provide weekly data for a five-year period; that limitation is the whole reason we’re doing this. What if I averaged my daily data over that week and compared it to Google’s weekly value?
At this point I breathed a huge sigh of relief. Google Trends sets that week to 100 because it was the biggest spike in the five-year window, and averaging my daily data over the same week gave 102.8. Surprisingly close. The series also ends in almost the same place as Google’s. That means the data isn’t exploding from compound errors in the scaling method; my data looks and behaves just like Google Trends data!
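For completeness, the sanity check itself is only a couple of lines. This sketch assumes the stitched daily series is a pandas Series called `daily`; Google’s exact week boundaries may differ slightly, so treat the resample rule as an approximation:

```python
import pandas as pd

def to_weekly_scale(daily: pd.Series) -> pd.Series:
    """Average the stitched daily series into weeks, then renormalize so the
    biggest week is 100, making it directly comparable to Google's weekly output."""
    weekly = daily.resample("W").mean()
    return weekly / weekly.max() * 100
```

If the week of the October 2021 outage comes out close to 100 on this scale, the stitching hasn’t blown the spike out of proportion.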
We now have a robust methodology for creating clean, comparable daily time series for any search term. That’s great, but what if you actually want to do something useful with it, like comparing search behaviour across countries?
Google Trends lets you compare multiple search terms, but it doesn’t let you compare multiple countries directly. So we can use the method described today to build a “motivation” dataset for each country, but how do we compare them? Facebook turns out to be part of the solution.
That solution, though, is for a later blog post, in which we’ll use a “product basket” approach to compare countries and see exactly how Facebook fits into all of this.
Today we started with the question of whether we could model people’s motivation, and as soon as we tried, we hit a wall: Google Trends daily data is misleading, not because of any bug, but by design. We’ve now found a way around that problem, but in the life of a data scientist, there’s always another problem lurking around the corner.
