Is your model time-blind? A case for cyclical feature encoding

Machine Learning

The midnight paradox

Imagine this: we are building a model to predict electricity demand or taxi pickups and drop-offs. So we feed in the time as, say, minutes since midnight. Clean and simple, right?

Now the model sees 23:59 as 1439 minutes into the day and 00:01 as 1 minute into the day. To you, those times are two minutes apart. To your model, they are about as far apart as two values can get. That's the midnight paradox. And yes, your model is probably time-blind.
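Here is a minimal sketch of that gap in plain Python (the variable names are mine):

# Minutes since midnight make 23:59 and 00:01 look maximally far apart
t_before_midnight = 23 * 60 + 59   # 1439 minutes into the day
t_after_midnight = 0 * 60 + 1      # 1 minute into the day

# Real clock distance: 2 minutes. Distance the model sees:
print(abs(t_before_midnight - t_after_midnight))  # 1438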

Why does this happen?

This happens because most machine learning models treat numbers as points on a line rather than points on a circle.

Linear regression, KNN, SVMs, and even neural networks treat numbers ordinally, assuming that higher numbers mean “more” than lower numbers. They don't know that time wraps around. Midnight is an edge case they never forgive.

If you've ever fed hourly features into a model and later wondered why it struggles near day boundaries, this may be the reason.

Standard encoding failure

Let's talk about the usual approaches. You've probably used at least one of them.

If we encode the hour as a number from 0 to 23, there's now an artificial cliff between 23:00 and 00:00, so the model sees midnight as the largest jump of the day. But is the step from 11pm to midnight really any bigger than the step from 9pm to 10pm?

Of course not. But your model doesn't know that.

Below is what the hours look like in “linear” mode.

# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate data: one row per hour of the current day
date_today = pd.to_datetime('today').normalize()
datetime_24_hours = pd.date_range(start=date_today, periods=24, freq='h')
df = pd.DataFrame({'dt': datetime_24_hours})
df['hour'] = df['dt'].dt.hour

# Calculate Sin and Cosine
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Plot the Hours in Linear mode
plt.figure(figsize=(15, 5))
plt.plot(df['hour'], [1]*24, linewidth=3)
plt.title('Hours in Linear Mode')
plt.xlabel('Hour')
plt.xticks(np.arange(0, 24, 1))
plt.ylabel('Value')
plt.show()
Hours in linear mode. Image by author.

What happens if we one-hot encode the hour instead? 24 binary columns. Problem solved, right? Well… partially. We fixed the artificial gap, but we lost proximity entirely: 2am is no closer to 3am than it is to 10pm.
The dimensionality also exploded from 1 column to 24. That's a nuisance for trees, and for linear models it's inefficient.
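As a quick sketch of that lost proximity (assuming pandas and NumPy; the variable names are mine):

import numpy as np
import pandas as pd

# One-hot encode the 24 hours: 24 columns instead of 1
hours = pd.Series(range(24))
one_hot = pd.get_dummies(hours, prefix='hour').astype(int)
print(one_hot.shape)  # (24, 24)

# Every pair of distinct hours is now equally far apart:
# 2am is no closer to 3am than it is to 10pm
print(np.linalg.norm(one_hot.iloc[2] - one_hot.iloc[3]))   # sqrt(2), approx. 1.414
print(np.linalg.norm(one_hot.iloc[2] - one_hot.iloc[22]))  # sqrt(2), approx. 1.414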

Now let's move on to viable alternatives.

Solution: trigonometric mapping

The change in thinking is as follows.

Stop thinking of time as a line. Start thinking of it as a circle.

The 24-hour day loops back on itself, so the encoding should loop as well. Each hour becomes an equally spaced point on a circle. And instead of representing a point on the circle with a single number, we use two coordinates: x and y.

This is where sine and cosine come into play.

The geometry behind it

You can use sine and cosine to map every angle on a circle to a unique point. This allows the model to represent smooth, continuous time.

# Plot the hours on the circle
plt.figure(figsize=(5, 5))
plt.scatter(df['hour_sin'], df['hour_cos'], linewidth=3)
plt.title('Hours in Cyclical Mode')
plt.xlabel('hour_sin')
plt.ylabel('hour_cos')
plt.show()
Hours in cyclical mode after the sine and cosine transformation. Image by author.

The recipe for encoding each hour of the day is:

  • First, 2 * π * hour / 24 converts each hour into an angle in radians. Midnight and 11pm land at nearly the same spot on the circle.
  • Then, sine and cosine project that angle onto two coordinates.
  • Together, these two values uniquely identify each hour. Now 23:00 and 00:00 are close in feature space. Exactly what we always wanted, as the worked example below shows.
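Here it is numerically, with a small helper of my own naming:

import numpy as np

def hour_to_xy(hour):
    # Convert the hour to an angle in radians, then project it
    # onto the unit circle as (sin, cos) coordinates
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# 23:00 and 00:00 are now close; 12:00 sits on the opposite side
print(np.linalg.norm(hour_to_xy(23) - hour_to_xy(0)))  # approx. 0.26
print(np.linalg.norm(hour_to_xy(12) - hour_to_xy(0)))  # 2.0, the circle's diameter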

The same idea works for minutes, days of the week, or months.
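As a sketch, the whole idea fits in one reusable helper (the function name is mine, not from the original code):

import numpy as np
import pandas as pd

def add_cyclical(df, col, period):
    # Add <col>_sin and <col>_cos for a feature that repeats every `period` units
    df[f'{col}_sin'] = np.sin(2 * np.pi * df[col] / period)
    df[f'{col}_cos'] = np.cos(2 * np.pi * df[col] / period)
    return df

dates = pd.DataFrame({'dt': pd.date_range('2024-01-01', periods=365, freq='D')})
dates['dayofweek'] = dates['dt'].dt.dayofweek   # 0-6, period 7
dates['month'] = dates['dt'].dt.month           # 1-12, period 12
dates = add_cyclical(dates, 'dayofweek', 7)
dates = add_cyclical(dates, 'month', 12)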

The code

Let's try this on the Appliances Energy Prediction dataset [4] and attempt to improve the predictions of a Random Forest Regressor (a tree-based model).

Candanedo, L. (2017). Appliances Energy Prediction [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5VC8G. Creative Commons 4.0 License.

# Imports
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from ucimlrepo import fetch_ucirepo

Get the data.

# fetch dataset 
appliances_energy_prediction = fetch_ucirepo(id=374) 
  
# data (as pandas dataframes) 
X = appliances_energy_prediction.data.features 
y = appliances_energy_prediction.data.targets 
  
# To pandas: combine features and target, then parse the date
df = pd.concat([X, y], axis=1)
df['date'] = df['date'].apply(lambda x: x[:10] + ' ' + x[11:])  # replace the 'T' separator with a space
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['hour'] = df['date'].dt.hour
df.head(3)

First, let's create a simple model using the linear time features and use it as a baseline for comparison.

# X and y
# X = df.drop(['Appliances', 'rv1', 'rv2', 'date'], axis=1)
X = df[['hour', 'day', 'T1', 'RH_1', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint']]
y = df['Appliances']

# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
rf = RandomForestRegressor().fit(X_train, y_train)

# Training score (R²)
print(f'Score: {rf.score(X_train, y_train)}')

# Test RMSE
y_pred = rf.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f'RMSE: {rmse}')

Here are the results.

Score: 0.9395797670166536
RMSE: 63.60964667197874

Next, let's encode the cyclical time components (day and hour) and retrain the model.

# Add cyclical hours sin and cosine
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['day_sin'] = np.sin(2 * np.pi * df['day'] / 31)
df['day_cos'] = np.cos(2 * np.pi * df['day'] / 31)

# X and y
X = df[['hour_sin', 'hour_cos', 'day_sin', 'day_cos','T1', 'RH_1', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint']]
y = df['Appliances']

# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
rf_cycle = RandomForestRegressor().fit(X_train, y_train)

# Training score (R²)
print(f'Score: {rf_cycle.score(X_train, y_train)}')

# Test RMSE
y_pred = rf_cycle.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f'RMSE: {rmse}')

And the result: a small improvement in score and almost one point off the RMSE.

Score: 0.9416365489096074
RMSE: 62.87008070927842

That may not seem like a big deal, but remember that this toy example uses an out-of-the-box model with no data processing or cleanup. The improvement comes mainly from the sine and cosine transformations.

What's actually happening is that, in reality, energy demand doesn't reset at midnight. And now your model finally recognizes that continuity.

Why we need both sine and cosine

Avoid the temptation to use only sine because it feels like enough. One column instead of two. Cleaner, right?

Unfortunately, that breaks the uniqueness. On a 24-hour clock, sine alone gives 3 a.m. and 9 a.m. exactly the same value (and cosine alone would confuse 6 a.m. with 6 p.m.), so the model mixes up different times that share an encoding. Not ideal, unless you enjoy confusing predictions.

Using both sine and cosine solves this. Together they give each hour a unique fingerprint on the circle. Think of it like latitude and longitude: you need both to know where you are.
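A quick sanity check of that collision in NumPy:

import numpy as np

# With sine alone, 3am and 9am collide on a 24-hour clock
sin_3am = np.sin(2 * np.pi * 3 / 24)   # approx. 0.707
sin_9am = np.sin(2 * np.pi * 9 / 24)   # approx. 0.707 (identical)

# Cosine breaks the tie
cos_3am = np.cos(2 * np.pi * 3 / 24)   # approx. 0.707
cos_9am = np.cos(2 * np.pi * 9 / 24)   # approx. -0.707 (different)

print(np.isclose(sin_3am, sin_9am))  # True
print(np.isclose(cos_3am, cos_9am))  # False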

Real-world impacts and consequences

So does this actually help the model? Yes, especially for certain model families.

Distance-based models

KNN and SVMs rely heavily on distance computations. Cyclical encoding prevents the spurious “long distance” at the boundary. Your neighbors actually become neighbors again.
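Here is a minimal sketch of that effect with scikit-learn's NearestNeighbors (a toy setup of my own, not from the original post):

import numpy as np
from sklearn.neighbors import NearestNeighbors

hours = np.arange(24)

# Linear encoding: midnight's nearest neighbors exclude 23:00
linear = hours.reshape(-1, 1)
_, idx = NearestNeighbors(n_neighbors=3).fit(linear).kneighbors([[0]])
print(idx)  # [[0 1 2]]: 23:00 looks far away

# Cyclical encoding: 23:00 is a neighbor of midnight again
cyclical = np.column_stack([np.sin(2 * np.pi * hours / 24),
                            np.cos(2 * np.pi * hours / 24)])
_, idx = NearestNeighbors(n_neighbors=3).fit(cyclical).kneighbors([cyclical[0]])
print(idx)  # [[0 1 23]]: 1 and 23 are tied, so their order may vary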

Neural networks

Neural networks learn faster from smooth feature spaces. Cyclical encoding removes the sharp discontinuity at midnight, which usually means faster convergence and better stability.

Tree-based models

Gradient-boosted trees such as XGBoost and LightGBM can eventually learn these patterns on their own, but cyclical encoding gives them a head start. It's worth it if you value performance and interpretability.

When should I use this?

Always ask yourself one question: does this feature repeat in a cycle? If yes, consider cyclical encoding.

Common examples are:

  • Time of day
  • Day of week
  • Month of year
  • Wind direction (degrees)

If it wraps around, try encoding it like it wraps around.
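For instance, wind direction wraps at 360 degrees just like hours wrap at 24. A minimal sketch (the column names are mine):

import numpy as np
import pandas as pd

wind = pd.DataFrame({'direction_deg': [0, 90, 180, 270, 359]})
wind['dir_sin'] = np.sin(2 * np.pi * wind['direction_deg'] / 360)
wind['dir_cos'] = np.cos(2 * np.pi * wind['direction_deg'] / 360)

# 359° and 0° are now nearly identical in feature space,
# instead of 359 units apart
print(wind.round(3))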

Before you go

Time is more than just a number. It's a coordinate on a circle.

Treat it like a straight line and your model will stumble at the boundaries, never understanding that the variable is a cycle, something that repeats in a pattern.

Cyclical encoding with sine and cosine fixes this elegantly, preserving proximity, removing artifacts, and often speeding up model training.

So the next time your predictions go awry around the turn of the day, try out this tool and help your model work as effectively as it should.

If you liked this content, find more of my work and contact information on my website.

https://gustavolsantos.me

GitHub repository

The complete code for this exercise is available here:

https://github.com/gurezende/Time-Series/tree/main/Sine%20Cosine%20Time%20Encode

References and further information

[1] Encoding cyclical features (minutes and hours), Stack Exchange: https://stats.stackexchange.com/questions/451295/encoding-cyclical-feature-minutes-and-hours

[2] NumPy trigonometric functions: https://numpy.org/doc/stable/reference/routines.math.html

[3] Practical discussion on cyclical features for deep learning, Kaggle: https://www.kaggle.com/code/avanwyk/encoding-cyclical-features-for-deep-learning

[4] Appliances Energy Prediction Dataset, UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/374/appliances+energy+prediction


