
EWMA weighted by time with adjust=True is flawed, and adjust=False is not supported #54328

Open
2 of 3 tasks
hleumas opened this issue Jul 31, 2023 · 15 comments · Fixed by #59142
Labels
Bug Window rolling, ewma, expanding

Comments

@hleumas

hleumas commented Jul 31, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import math
import pandas as pd

# Current EWM implementation in pandas
def ema(df, e_time):
    return df.ewm(
        halflife=pd.to_timedelta(e_time),
        times=df.index,
    ).mean()

# Correct EWM calculation for reference
def ema_manual(df, e_time):
    it = df.items()
    _list = []

    lt, s = next(it)
    _list.append(s)

    for t, v in it:
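        # decay factor over the gap since the previous sample: 0.5 ** ((t - lt) / halflife)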
        q = math.exp2((lt - t) / e_time)
        s = s * q + (1 - q) * v
        lt = t
        _list.append(s)

    return pd.DataFrame(_list, index=df.index)

zeroes = 1_000_000
ms = 1_000_000 # 1ms

# Let's create a list of 1,000,000 0s spaced 1ms apart with a single 1 exactly
# 1 second after the last 0.
val = zeroes * [0] + [1]
id = list(range(0, zeroes * ms, ms))
id.append(id[-1] + 1000 * ms)
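# pd.to_datetime interprets these integer values as nanoseconds since the Unix epoch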
id = pd.to_datetime(id)
df = pd.DataFrame(val, columns=['val'], index=id)


# We would expect the EWM of these series with halflife=1s to be exactly 0.5,
# which is confirmed by the manual calculation
print(ema_manual(df['val'], '1s'))

# Whereas pandas EWM returns number close to 0
print(ema(df['val'], '1s'))

Issue Description

Pandas seems to have an issue with time series with uneven intervals. Assume the following example:

1 million 0s spaced 1 millisecond apart, followed by a single 1 after a 1-second gap:

1970-01-01 00:00:00.000    0
1970-01-01 00:00:00.001    0
1970-01-01 00:00:00.002    0
1970-01-01 00:00:00.003    0
1970-01-01 00:00:00.004    0
...  999,991 more items  ...
1970-01-01 00:16:39.996    0
1970-01-01 00:16:39.997    0
1970-01-01 00:16:39.998    0
1970-01-01 00:16:39.999    0
-----   1 second gap   -----
1970-01-01 00:16:40.999    1

[1000001 rows x 1 columns]

One would naively expect the exponential moving average with a halflife of 1 second to equal exactly 0.5, assuming the equation:

w    = 0.5 ** (dt/halflife) = 0.5 ** (1s/1s)
y(t) = y(t - 1) * w + x * (1 - w)
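
With the example's single 1-second gap, the last pre-gap average is 0 (all earlier samples are 0), so the expected arithmetic works out to:

w = 0.5 ** (1s/1s) = 0.5
y = 0 * 0.5 + 1 * (1 - 0.5) = 0.5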

However, due to the way the adjustment factor is calculated, this is not true. Unfortunately, adjust=True works correctly only for evenly spaced time series, and in this situation it leads to an extremely small result of 0.001384.

Also, unfortunately, adjust=False is disabled for calculations where the times argument is set.

Expected Behavior

A result that is independent of sampling-frequency irregularities: increasing the sampling frequency in the early part of the series (where the 0s are measured) shouldn't increase their weight in the result.

A result of 0.5 instead of 0.001384.

Installed Versions

INSTALLED VERSIONS

commit : 0f43794
python : 3.11.4.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Thu Jun 8 22:22:20 PDT 2023; root:xnu-8796.121.3~7/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.0.3
numpy : 1.25.1
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@hleumas hleumas added the Bug and Needs Triage labels Jul 31, 2023
@lithomas1
Member

I can confirm this on main.

cc @mroeschke as the expert on the window code.

@lithomas1 lithomas1 added the Window label and removed the Needs Triage label Aug 2, 2023
@hleumas
Author

hleumas commented Aug 2, 2023

Just to add more context. I believe that the issue is caused by the way the formula works.

Currently, the result is essentially a weighted average of all datapoints with weights exp2(-t/H).

So, for measurements:

x(t0), x(t1), ..., x(tn)

the resulting average is

1/C * [x(t0) * exp2(-t0/H) + x(t1) * exp2(-t1/H) + ... + x(tn) * exp2(-tn/H)]

where C is a normalisation constant.

However, what we should be calculating is the following integral:

1/C * ∫ x(t) exp2(-t/H) dt

What is essentially missing in the summation formula is weighting each of the summands with dt. This is not an issue as long as dt is the same for each summand, which is why this works well for even time intervals. Unfortunately, this is not the case for uneven time intervals.
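
For illustration, a small numeric sketch of the two summations (illustrative helper functions, not pandas API): ema_unweighted mirrors the current weighting, while ema_dt_weighted adds the dt factor that turns the sum into a Riemann approximation of the integral:

import numpy as np

def ema_unweighted(x, t, halflife):
    # current adjust=True style: plain weighted average of the samples
    w = np.exp2(-(t[-1] - t) / halflife)
    return (x * w).sum() / w.sum()

def ema_dt_weighted(x, t, halflife):
    # weight each summand by the time interval it covers (Riemann sum of the integral)
    w = np.exp2(-(t[-1] - t) / halflife)
    dt = np.diff(t, prepend=t[0])  # first delta is 0, so the first sample drops out
    return (x * w * dt).sum() / (w * dt).sum()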

@hleumas
Author

hleumas commented Aug 2, 2023

Also note that the recursive formula doesn't have this particular issue, so a quick fix would be to allow adjust=False when times are provided. Currently pandas refuses to accept adjust=False whenever the times parameter is present.
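
(For reference, the recursion in question, written out for irregular times as in ema_manual above:

q(i) = 0.5 ** ((t(i) - t(i-1)) / halflife)
y(i) = q(i) * y(i-1) + (1 - q(i)) * x(i)

with y(0) = x(0).)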

@MarcoGorelli MarcoGorelli changed the title BUG: EWMA weighted by time with adjust=True is flawed, and adjust=False is not supported Mar 5, 2024
@MarcoGorelli
Member

Thanks for raising the issue @hleumas, your explanation makes sense to me. It does look like adjust=True just shouldn't be supported when times is passed

Supporting adjust=False shouldn't be too hard here

Should adjust=True be deprecated or is it salvageable?

@mroeschke
Member

I implemented this years ago, and I don't exactly recall how adjust fits into EWMA with times. cc @DiegoAlbertoTorres if you have any thoughts

@azmyrajab

Hello @hleumas @MarcoGorelli @mroeschke @DiegoAlbertoTorres - thank you for raising this issue / discussion on EWMA weighted by time.

I spent a little time thinking about adjust={True, False} for an EWMA with non-uniform times, and was hoping we could explore a small modification to adjust=True (for both polars and pandas) -

Should adjust=True be deprecated or is it salvageable?

I agree with @hleumas's statement (which I think points to the solution for making adjust=True and adjust=False more consistent):

What is essentially missing in the summation formula is weighting each of the summands with dt


Taking a step back, the pandas docs define the time-weighted exponential moving average (under finite history) as:

y(tn) = [sum_i x(ti) * 0.5**((tn - ti)/H)] / [sum_i 0.5**((tn - ti)/H)]

This is consistent with the current pandas adjust=True behaviour under time weights. If we instead start from a more general continuous-time definition:

y(T) = [∫ x(t) * 0.5**((T - t)/H) dt] / [∫ 0.5**((T - t)/H) dt]

And then discretising to observations that are not necessarily uniformly spaced, we get:

y(tn) = [sum_i x(ti) * 0.5**((tn - ti)/H) * dt(i)] / [sum_i 0.5**((tn - ti)/H) * dt(i)]

Under uniformly spaced observations, dt(i) is a constant, and the two obviously become the same. But in my opinion this is a better-suited model for the non-uniformly spaced weighted average, as the area for each observation is adjusted by its bar width.
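
(A quick check for the uniform case: if dt(i) = d for every i, the common factor d cancels between the numerator and the denominator, so

[sum_i x(ti) * 0.5**((tn - ti)/H) * d] / [sum_i 0.5**((tn - ti)/H) * d] = [sum_i x(ti) * 0.5**((tn - ti)/H)] / [sum_i 0.5**((tn - ti)/H)]

which is exactly the current adjust=True expression.)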


To show how these formulations differ and compare to adjust=False ("infinite history" ewma), here is some sample code based on @hleumas' example:

  1. ema_pd implements adjust=True (without delta_t weighting)
  2. ema_polars is used as a benchmark for adjust=False
  3. ema_convolve corresponds to adjust=True and is a direct implementation of the weighted summation formula above. It is included simply to confirm that the recursive implementations exactly match the summation formula above.
  4. ema_test is a proposed implementation which supports adjust=True/False and the proposed "delta_t" correction.

import pandas as pd
import polars as pl
import numpy as np
from numba import njit


# Current EWM implementation in pandas
def ema_pd(df, e_time, adjust: bool = True):
    return df.ewm(
        halflife=pd.to_timedelta(e_time),
        times=df.index,
        adjust=adjust,
    ).mean()


def ema_polars(df: pd.Series, half_life: pd.Timedelta | str, by: str):
    """I use this as a sanity check for adjust=False"""
    return (
        pl.from_pandas(df.to_frame(), include_index=True)
        .with_columns(pl.col(df.name).ewm_mean_by(by=by, half_life=half_life))
    )


@njit()
def ema_test(array: np.ndarray, deltas: np.ndarray, half_life: float,
             adjust: bool = False, weight_by_dt: bool = False):
    num = 0.
    denom = 0.
    result = np.zeros_like(array)
    cumulative_decay = np.cumsum(deltas / half_life)
    cumulative_decay = cumulative_decay[-1] - cumulative_decay

    for i, v in enumerate(array):
        if deltas[i] == 0:
            if i > 1:
                result[i] = result[i-1]
            continue
        # implements the formula with finite sample adjustment 
        if adjust:
            alpha = 0.5 ** cumulative_decay[i]
            if weight_by_dt:
                alpha *= deltas[i]
            num += alpha * v
            denom += alpha
            result[i] = num / denom
        else:
            # implements "infinite history" simple recursions
            alpha = 0.5 ** (deltas[i] / half_life)
            num = alpha * num + (1. - alpha) * v
            result[i] = num
    return result


def ema_convolve(array: np.ndarray, deltas: np.ndarray, half_life: float, weight_by_dt: bool = False):
    """Implements the weighted average summation formula directly"""
    cumulative_decay = np.cumsum(deltas / half_life)
    cumulative_decay = cumulative_decay[-1] - cumulative_decay
    kernel = 0.5 ** cumulative_decay
    # implements the dt bar-width correction so that each sample in the weighted sum accounts for the width of the time bar it applies to
    if weight_by_dt:
        kernel *= deltas
    numerator = (array * kernel).sum()
    denominator = kernel.sum()
    # alternatively can use convolution to get entire path
    # kernel = 0.5 ** np.cumsum(deltas / half_life)
    # numerator = np.convolve(array, kernel, mode="full")[:len(array)]
    # denominator = np.convolve(np.ones_like(array), kernel, mode="full")[:len(array)]
    return numerator / denominator

Then let's first confirm that the test implementation matches pandas (and the directly computed weighted average) when we don't apply the "dt" correction:

zeroes = 100_000
ms = 1_000_000  # 1ms

# Let's create a list of 100k 0s spaced 1ms apart with a single 1 exactly
# 1 second after the last 0.
val = zeroes * [0] + [1]
id = list(range(0, zeroes * ms, ms))
id.append(id[-1] + 1_000 * ms)
id = pd.to_datetime(id)
df = pd.DataFrame(val, columns=['val'], index=id)

half_life = pd.Timedelta("3s")
vals = np.asarray(df["val"]).astype("float64")

# convert to floats
half_life_seconds = half_life / pd.Timedelta("1s")
dt_seconds = np.asarray((df.index.diff() / pd.Timedelta("1s")).fillna(0.0)).astype("float64")

ema_adjust = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=True)[-1]
ema_adjust_pd = ema_pd(df['val'], half_life, adjust=True).to_numpy()[-1]
ema_weighted_avg = ema_convolve(vals, dt_seconds, half_life=half_life_seconds)

print(ema_adjust)
print(ema_adjust_pd)
print(ema_weighted_avg)

They all agree:

0.0002909852504376329
0.00029098525043919435
0.0002909852504376267

Then let's see how adjust=False compares:

ema_no_adjust = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=False)[-1]
ema_no_adjust_pl = ema_polars(df["val"].rename_axis(index="ts").rename("val"), half_life,
                              by="ts").select("val")[-1].item()
print(ema_no_adjust)
print(ema_no_adjust_pl)

These numbers are now a lot larger as more weight is placed on the last non-zero observation:

0.2062994740159002
0.2062994740159002

Finally let's see what turning on the "dt" correction does:

ema_corrected = ema_test(vals, dt_seconds, half_life=half_life_seconds, adjust=True, weight_by_dt=True)[-1]
ema_weighted_avg_corrected = ema_convolve(vals, dt_seconds, half_life=half_life_seconds, weight_by_dt=True)

print(ema_corrected)
print(ema_weighted_avg_corrected)

Notice that the discrepancy with the adjust=False result is much smaller.

0.22544862736755866
0.22544862736755866

@hleumas if I take your example numbers verbatim (1s halflife, 1 million samples), I get:

# equivalent to pandas adjust=True
0.0013838961963124187
0.0013838961963452783
0.0013838961963124207
# equivalent to your adjust=False (and polars current default)
0.5
0.5
# equivalent to what a 'corrected' adjust=True would return 
0.5808558454220695
0.5808558454220693

Sorry for the long comment - the behaviour of the corrected adjustment, to me, makes more sense and sounds like a relatively easy fix (just multiply the term used to update the numerator and denominator by delta_t(i) and handle the corner case where it is zero). What are your thoughts? Could we implement this new behaviour of adjust=True for pandas and polars?

@hleumas
Author

hleumas commented Apr 29, 2024

Well, I would say that you have to figure out how to deal with the fact that for n values you will always have only n-1 deltas.

But yeah, other than that this is sensible.

@MarcoGorelli
Member

Thanks @azmyrajab for your good explanation. I think it makes sense; I'll experiment with it

Well, I would say that you have to figure out how to deal with the fact that for n values you will always have only n-1 deltas.

But yeah, other than that this is sensible.

I think the fix might be

        if deltas[i] == 0:
            if i > 1:
                result[i] = result[i-1]
            else:
                result[i] = array[i]
            continue

? As y_0 should equal x_0 (the first observation doesn't have a history to be weighted by)?

@hleumas
Author

hleumas commented May 8, 2024

I guess I am just confused by the deltas array. Where is it coming from? What is the formula for delta[i] given times t[i]?

@MarcoGorelli
Member

from here

dt_seconds = np.asarray((df.index.diff() / pd.Timedelta("1s")).fillna(0.0)).astype("float64")
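
Spelled out as a formula (with times t[i] in seconds): delta[0] = 0 (the leading NaT from .diff() filled by fillna(0.0)) and delta[i] = t[i] - t[i-1] for i >= 1, so there is one delta per timestamp, with the first one padded to zero.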

@hleumas
Author

hleumas commented May 8, 2024

Ok, I get it. So, given that we have delta[0] == 0, we essentially completely ignore the first measured value. Now, I would say that that's slightly strange behaviour, but I guess it doesn't matter as long as we have enough values.

@MarcoGorelli
Member

yeah the first result should be kept as-is, that's the modification I was suggesting in #54328 (comment)

@hleumas
Author

hleumas commented May 8, 2024

I get that, and that is sensible. What I am getting at is that the subsequent results are completely unaffected by it, e.g. the 2nd result just equals the 2nd value. All I am saying is that this is strange, but I don't know what the ideal solution would be.
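
A tiny numeric sketch of that point (illustrative values only): with the dt-weighting, the first sample carries weight delta[0] = 0, so the second result equals the second value no matter what the first value was:

import numpy as np

halflife = 1.0
t = np.array([0.0, 1.0])
x = np.array([100.0, 3.0])          # the 100.0 ends up ignored
dt = np.diff(t, prepend=t[0])       # [0.0, 1.0]
kernel = 0.5 ** ((t[-1] - t) / halflife)
w = kernel * dt                     # [0.0, 1.0]: zero weight on the first sample
print((x * w).sum() / w.sum())      # 3.0, i.e. just x[1]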

@tserrao
Contributor

tserrao commented Jun 28, 2024

Hi, I opened a PR (linked just above) to allow adjust=False with times and to handle irregular-interval deltas. Its scope is narrow, and it is still an open question what to do with adjust=True, but I wanted to keep moving forward with this issue :)

@MarcoGorelli
Member

thanks @tserrao !

great that adjust=False is now supported. adjust=True still needs adjusting though, so keeping this open until that happens
