
EWMA weighted by time with adjust=True is flawed, and adjust=False is not supported #54328

Open
2 of 3 tasks
hleumas opened this issue Jul 31, 2023 · 15 comments · Fixed by #59142
Labels
Bug Window rolling, ewma, expanding

Comments

@hleumas

hleumas commented Jul 31, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import math
import pandas as pd

# Current EWM implementation in pandas
def ema(df, e_time):
    return df.ewm(
        halflife=pd.to_timedelta(e_time),
        times=df.index,
    ).mean()

# Correct EWM calculation for reference
def ema_manual(df, e_time):
    it = df.items()
    _list = []

    lt, s = next(it)
    _list.append(s)

    for t, v in it:
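        # decay factor over the gap since the previous sample: 0.5 ** ((t - lt) / halflife)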
        q = math.exp2((lt - t) / e_time)
        s = s * q + (1 - q) * v
        lt = t
        _list.append(s)

    return pd.DataFrame(_list, index=df.index)

zeroes = 1_000_000
ms = 1_000_000 # 1ms

# Let's create a list of 1,000,000 0s spaced 1ms apart with a single 1 exactly
# 1 second after the last 0.
val = zeroes * [0] + [1]
id = list(range(0, zeroes * ms, ms))
id.append(id[-1] + 1000 * ms)
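# pd.to_datetime interprets these integer values as nanoseconds since the Unix epoch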
id = pd.to_datetime(id)
df = pd.DataFrame(val, columns=['val'], index=id)


# We would expect the EWM of these series with halflife=1s to be exactly 0.5,
# which is confirmed by the manual calculation
print(ema_manual(df['val'], '1s'))

# Whereas pandas EWM returns number close to 0
print(ema(df['val'], '1s'))

Issue Description

Pandas seems to have an issue with time series with uneven intervals. Assume the following example:

1 million 0s spaced 1 millisecond apart, followed by a single 1 after a 1-second gap:

1970-01-01 00:00:00.000    0
1970-01-01 00:00:00.001    0
1970-01-01 00:00:00.002    0
1970-01-01 00:00:00.003    0
1970-01-01 00:00:00.004    0
...  999,991 more items  ...
1970-01-01 00:16:39.996    0
1970-01-01 00:16:39.997    0
1970-01-01 00:16:39.998    0
1970-01-01 00:16:39.999    0
-----   1 second gap   -----
1970-01-01 00:16:40.999    1

[1000001 rows x 1 columns]

One would naively expect the exponential moving average with a halflife of 1 second to equal exactly 0.5, assuming the equation:

w    = 0.5 ** (dt/halflife) = 0.5 ** (1s/1s)
y(t) = y(t - 1) * w + x * (1 - w)
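
With the example's single 1-second gap, the last pre-gap average is 0 (all earlier samples are 0), so the expected arithmetic works out to:

w = 0.5 ** (1s/1s) = 0.5
y = 0 * 0.5 + 1 * (1 - 0.5) = 0.5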

However, due to the way the adjustment factor is calculated, this is not true. Unfortunately, adjust=True works correctly only for evenly spaced time series, and in this situation it leads to an extremely small result of 0.001384.

Also, unfortunately, adjust=False is disabled for calculations where the times argument is set.

Expected Behavior

A result that is independent of sampling-frequency irregularities: increasing the sampling frequency in the early part of the series (where the 0s are measured) shouldn't increase their weight in the result.

A result of 0.5 instead of 0.001384.

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.11.4.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.0.3
numpy : 1.25.1
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@hleumas hleumas added the Bug and Needs Triage labels Jul 31, 2023
@lithomas1
Member

I can confirm this on main.

cc @mroeschke as the expert on the window code.

@lithomas1 lithomas1 added the Window label and removed the Needs Triage label Aug 2, 2023
@hleumas
Author

hleumas commented Aug 2, 2023

Just to add more context. I believe that the issue is caused by the way the formula works.

Currently, the result is essentially a weighted average of all datapoints with weights exp2(-t/H).

So, for measurements:

x(t0), x(t1), ..., x(tn)

the resulting average is

1/C * [x(t0) * exp2(-t0/H) + x(t1) * exp2(-t1/H) + ... + x(tn) * exp2(-tn/H)]

where C is a normalisation constant.

However, what we should be calculating is the following integral:

1/C * ∫ x(t) exp2(-t/H) dt

What is essentially missing in the summation formula is weighting each of the summands with dt. This is not an issue as long as dt is the same for each summand, which is why this works well for even time intervals. Unfortunately, this is not the case for uneven time intervals.
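
For illustration, a small numeric sketch of the two summations (illustrative helper functions, not pandas API): ema_unweighted mirrors the current weighting, while ema_dt_weighted adds the dt factor that turns the sum into a Riemann approximation of the integral:

import numpy as np

def ema_unweighted(x, t, halflife):
    # current adjust=True style: plain weighted average of the samples
    w = np.exp2(-(t[-1] - t) / halflife)
    return (x * w).sum() / w.sum()

def ema_dt_weighted(x, t, halflife):
    # weight each summand by the time interval it covers (Riemann sum of the integral)
    w = np.exp2(-(t[-1] - t) / halflife)
    dt = np.diff(t, prepend=t[0])  # first delta is 0, so the first sample drops out
    return (x * w * dt).sum() / (w * dt).sum()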

@hleumas
Author

hleumas commented Aug 2, 2023

Also note that the recursive formula doesn't have this particular issue, so a quick fix would be to allow adjust=False when times are provided. Currently pandas refuses to accept adjust=False whenever the times parameter is present.
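
(For reference, the recursion in question, written out for irregular times as in ema_manual above:

q(i) = 0.5 ** ((t(i) - t(i-1)) / halflife)
y(i) = q(i) * y(i-1) + (1 - q(i)) * x(i)

with y(0) = x(0).)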

@MarcoGorelli MarcoGorelli changed the title BUG: EWMA weighted by time with adjust=True is flawed, and adjust=False is not supported Mar 5, 2024
@MarcoGorelli
Member

Thanks for raising the issue @hleumas, your explanation makes sense to me. It does look like adjust=True just shouldn't be supported when times is passed

Supporting adjust=False shouldn't be too hard here

Should adjust=True be deprecated or is it salvageable?

@mroeschke
Member

I implemented this years ago, and I don't exactly recall how adjust fits into EWMA with times. cc @DiegoAlbertoTorres if you have any thoughts

@azmyrajab

Hello @hleumas @MarcoGorelli @mroeschke @DiegoAlbertoTorres - thank you for raising this issue / discussion on EWMA weighted by time.

I spent a little time thinking about adjust={True, False} for an EWMA with non-uniform times, and was hoping we could explore a small modification to adjust=True (for both polars and pandas) -

Should adjust=True be deprecated or is it salvageable?

I agree with @hleumas's statement (which I think points to the solution for making adjust=True and adjust=False more consistent):

What is essentially missing in the summation formula is weighting each of the summands with dt


Taking a step back, the pandas docs define the time-weighted exponential moving average (under finite history) as:

y(tn) = [sum_i x(ti) * 0.5**((tn - ti)/H)] / [sum_i 0.5**((tn - ti)/H)]

This is consistent with the current pandas adjust=True behaviour under time weights. If we instead start from a more general continuous-time definition:

y(T) = [∫ x(t) * 0.5**((T - t)/H) dt] / [∫ 0.5**((T - t)/H) dt]

And then discretising to observations that are not necessarily uniformly spaced, we get:

y(tn) = [sum_i x(ti) * 0.5**((tn - ti)/H) * dt(i)] / [sum_i 0.5**((tn - ti)/H) * dt(i)]

Under uniformly spaced observations, dt(i) is a constant, and the two obviously become the same. But in my opinion this is a better-suited model for the non-uniformly spaced weighted average, as the area for each observation is adjusted by its bar width.
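
(A quick check for the uniform case: if dt(i) = d for every i, the common factor d cancels between the numerator and the denominator, so

[sum_i x(ti) * 0.5**((tn - ti)/H) * d] / [sum_i 0.5**((tn - ti)/H) * d] = [sum_i x(ti) * 0.5**((tn - ti)/H)] / [sum_i 0.5**((tn - ti)/H)]

which is exactly the current adjust=True expression.)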


To show how these formulations differ and compare to adjust=False ("infinite history" ewma), here is some sample code based on @hleumas' example:

  1. ema_pd implements adjust=True (without delta_t weighting)
  2. ema_polars is used as a benchmark for adjust=False
  3. ema_convolve corresponds to adjust=True and is a direct implementation of the weighted summation formula above. It is included simply to confirm that the recursive implementations exactly match the summation formula above.
  4. ema_test is a proposed implementation which supports adjust=True/False and the proposed "delta_t" correction.

import pandas as pd
import polars as pl
import numpy as np
from numba import njit


# Current EWM implementation in pandas
def ema_pd(df, e_time, adjust: bool = True):
    return df.ewm(
        halflife=pd.to_timedelta(e_time),
        times=df.index,
        adjust=adjust,
    ).mean()


def ema_polars(df: pd.Series, half_life: pd.Timedelta | str, by: str):
    """I use this as a sanity check for adjust=False"""
    return (
        pl.from_pandas(df.to_frame(), include_index=True)
        .with_columns(pl.col(df.name).ewm_mean_by(by=by, half_life=half_life))
    )


@njit()
def ema_test(array: np.ndarray, deltas: np.ndarray, half_life: float,
             adjust: bool = False, weight_by_dt: bool = False):
    num = 0.
    denom = 0.
    result = np.zeros_like(array)
    cumulative_decay = np.cumsum(deltas / half_life)
    cumulative_decay = cumulative_decay[-1] - cumulative_decay

    for i, v in enumerate(array):
        if deltas[i] == 0:
            if i > 1:
                result[i] = result[i-1]
            continue
        # implements the formula with finite sample adjustment 
        if adjust:
            alpha = 0.5 ** cumulative_decay[i]
            if weight_by_dt:
                alpha *= deltas[i]
            num += alpha * v
            denom += alpha
            result[i] = num / denom
        else:
            # implements "infinite history" simple recursions
            alpha = 0.5 ** (deltas[i] / half_life)
            num = alpha * num + (1. - alpha) * v
            result[i] = num
    return result


def ema_convolve(array: np.ndarray, deltas: np.ndarray, half_life: float, weight_by_dt: bool = False):
    """Implements the weighted average summation formula directly"""
    cumulative_decay = np.cumsum(deltas / half_life)
    cumulative_decay = cumulative_decay[-1] - cumulative_decay
    kernel = 0.5 ** cumulative_decay
    # implements the dt bar-width correction so that each sample in the weighted sum accounts for the width of the time bar it applies to
    if weight_by_dt:
        kernel *= deltas
    numerator = (array * kernel).sum()
    denominator = kernel.sum()
    # alternatively can use convolution to get entire path
    # kernel = 0.5 ** np.cumsum(deltas / half_life)
    # numerator = np.convolve(array, kernel, mode="full")[:len(array)]
    # denominator = np.convolve(np.ones_like(array), kernel, mode="full")[:len(array)]
    return numerator / denominator

Then let's first confirm that the test implementation matches pandas (and the directly computed weighted average) when we don't apply the "dt" correction:

zeroes = 100_000
ms = 1_000_000  # 1ms

# Let's create a list of 100k 0s spaced 1ms apart with a single 1 exactly
# 1 second after the last 0.
val = zeroes * [0] + [1]
id = list(range(0, zeroes * ms, ms))
id.append(id[-1] + 1_000 * ms)
id = pd.to_datetime(id)
df = pd.DataFrame(val, columns=['val'], index=id)

half_life = pd.Timedelta("3s")
vals = np.asarray(df["val"]).astype("float64")

# convert to floats
half_life_seconds = half_life / pd.Timedelta("1s")
dt_seconds = np.asarray((df.index.diff() / pd.Timedelta("1s")).fillna(0.0)).astype("float64")

ema_adjust = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=True)[-1]
ema_adjust_pd = ema_pd(df['val'], half_life, adjust=True).to_numpy()[-1]
ema_weighted_avg = ema_convolve(vals, dt_seconds, half_life=half_life_seconds)

print(ema_adjust)
print(ema_adjust_pd)
print(ema_weighted_avg)

They all agree:

0.0002909852504376329
0.00029098525043919435
0.0002909852504376267

Then let's see how adjust=False compares:

ema_no_adjust = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=False)[-1]
ema_no_adjust_pl = ema_polars(df["val"].rename_axis(index="ts").rename("val"), half_life,
                              by="ts").select("val")[-1].item()
print(ema_no_adjust)
print(ema_no_adjust_pl)

These numbers are now a lot larger as more weight is placed on the last non-zero observation:

0.2062994740159002
0.2062994740159002

Finally let's see what turning on the "dt" correction does:

ema_corrected = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=True, weight_by_dt=True)[-1]
ema_weighted_avg_corrected = ema_convolve(vals, dt_seconds, half_life=half_life_seconds, weight_by_dt=True)

print(ema_corrected)
print(ema_weighted_avg_corrected)

Notice that the discrepancy with the adjust=False result is much smaller.

0.22544862736755866
0.22544862736755866

@hleumas if I take your example numbers verbatim (1s halflife, 1 million samples), I get:

# equivalent to pandas adjust=True
0.0013838961963124187
0.0013838961963452783
0.0013838961963124207
# equivalent to your adjust=False (and polars current default)
0.5
0.5
# equivalent to what a 'corrected' adjust=True would return 
0.5808558454220695
0.5808558454220693

Sorry for the long comment - the behaviour of the corrected adjustment, to me, makes more sense and sounds like a relatively easy fix (just multiply the term used to update the numerator and denominator by delta_t(i) and handle the corner case where it is zero). What are your thoughts? Could we implement this new behaviour of adjust=True for pandas and polars?

@hleumas
Author

hleumas commented Apr 29, 2024

Well, I would say that you have to figure out how to deal with the fact that for n values you will always have only n-1 deltas.

But yeah, other than that this is sensible.

@MarcoGorelli
Member

Thanks @azmyrajab for your good explanation. I think it makes sense; I'll experiment with it

Well, I would say that you have to figure out how to deal with the fact that for n values you will always have only n-1 deltas.

But yeah, other than that this is sensible.

I think the fix might be

        if deltas[i] == 0:
            if i > 1:
                result[i] = result[i-1]
            else:
                result[i] = array[i]
            continue

? As y_0 should equal x_0 (the first observation doesn't have a history to be weighted by)?

@hleumas
Author

hleumas commented May 8, 2024

I guess I am just confused by the deltas array. Where is it coming from? What is the formula for delta[i] given times t[i]?

@MarcoGorelli
Member

from here

dt_seconds = np.asarray((df.index.diff() / pd.Timedelta("1s")).fillna(0.0)).astype("float64")
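
Spelled out as a formula (with times t[i] in seconds): delta[0] = 0 (the leading NaT from .diff() filled by fillna(0.0)) and delta[i] = t[i] - t[i-1] for i >= 1, so there is one delta per timestamp, with the first one padded to zero.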

@hleumas
Author

hleumas commented May 8, 2024

Ok, I get it. So, given that we have delta[0] == 0, we essentially completely ignore the first measured value. Now, I would say that that's slightly strange behaviour, but I guess it doesn't matter as long as we have enough values.

@MarcoGorelli
Member

yeah the first result should be kept as-is, that's the modification I was suggesting in #54328 (comment)

@hleumas
Author

hleumas commented May 8, 2024

I get that, and that is sensible. What I am getting at is that the subsequent results are completely unaffected by it, e.g. the 2nd result just equals the 2nd value. All I am saying is that this is strange, but I don't know what the ideal solution would be.
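
A tiny numeric sketch of that point (illustrative values only): with the dt-weighting, the first sample carries weight delta[0] = 0, so the second result equals the second value no matter what the first value was:

import numpy as np

halflife = 1.0
t = np.array([0.0, 1.0])
x = np.array([100.0, 3.0])          # the 100.0 ends up ignored
dt = np.diff(t, prepend=t[0])       # [0.0, 1.0]
kernel = 0.5 ** ((t[-1] - t) / halflife)
w = kernel * dt                     # [0.0, 1.0]: zero weight on the first sample
print((x * w).sum() / w.sum())      # 3.0, i.e. just x[1]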

@tserrao
Contributor

tserrao commented Jun 28, 2024

Hi, I opened a PR (linked just above) to allow adjust=False with times and to handle irregular-interval deltas. Its scope is narrow, and it is still an open question what to do with adjust=True, but I wanted to keep moving forward with this issue :)

@MarcoGorelli
Member

thanks @tserrao !

great that adjust=False is now supported. adjust=True still needs adjusting though, so keeping this open until that happens
