Fill and other sanitizing methods

Real-world data can have gaps, bad names, or arrive at irregular intervals. The pypond toolkit contains methods to adjust or sanitize a series of less-than-optimal data. As with all other mutation operations in pypond, these methods return new Event objects, new Collections and new TimeSeries as appropriate.

Fill

Data might contain missing or otherwise invalid values. TimeSeries.fill() can perform a variety of fill operations to smooth the data or to make sure it can be used in math operations without blowing up.

In pypond, a value is considered “invalid” if it is python None, a NaN (not a number) value, or an empty string.

Usage

The method prototype looks like this:

    def fill(self, field_spec=None, method='zero', limit=None)
  • the field_spec argument is the same as it is in the rest of the code - a string or list of strings denoting “columns” in the data. It can point to deep values using the usual dot notation, e.g. direction.in.
  • the method arg denotes the fill method to use. Valid values are zero, pad and linear.
  • the limit arg places a limit on the number of events that will be filled and returned in the new TimeSeries. The default is to fill all the events with no limit.

Complete sample usage could look like this:

    ts = TimeSeries(simple_missing_data)

    new_ts = ts.fill(field_spec=['direction.in', 'direction.out'],
                     method='linear', limit=6)

Fill methods

There are three fill options:

  • zero - the default - will transform any invalid value to a zero.
  • pad - replaces an invalid value with the previous good value: [1, None, None, 3] becomes [1, 1, 1, 3].
  • linear - interpolate the gaps based on the surrounding good values: [1, None, None, None, 3] becomes [1, 1.5, 2, 2.5, 3].

Neither pad nor linear can fill the first value in a series if it is invalid, and they can’t start filling until a good value has been seen: [None, None, None, 1, 2, 3] would remain unchanged. Similarly, linear cannot fill the last value in a series.
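
As a minimal sketch, the three methods could be compared on the same gap like this (the sample_data series and the expected values in the comments are illustrative, assuming the usual pypond.series import):

    from pypond.series import TimeSeries

    # A hypothetical series with a two-event gap in the 'value' column.
    sample_data = dict(
        name="sample",
        columns=["time", "value"],
        points=[
            [1400425947000, 1],
            [1400425948000, None],
            [1400425949000, None],
            [1400425950000, 3],
        ]
    )

    ts = TimeSeries(sample_data)

    zeroed = ts.fill(field_spec='value')                   # values: 1, 0, 0, 3
    padded = ts.fill(field_spec='value', method='pad')     # values: 1, 1, 1, 3
    interp = ts.fill(field_spec='value', method='linear')  # values: 1, ~1.67, ~2.33, 3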

The fill_limit arg

The optional arg fill_limit controls how many values will be filled before the filler gives up and starts returning the invalid data until a valid value is seen again.

There might be a situation where it makes sense to fill in a couple of missing values, but no sense to pad out long spans of missing data. This arg limits the number of missing values that will be filled - or, in the case of linear, that the filler will attempt to fill - before it starts returning invalid data until the next valid value is seen.

So given fill_limit=2, the following values will be filled in these ways:

Original:
    [1, None, None, None, 5, 6, 7]

Zero:
    [1, 0, 0, None, 5, 6, 7]

Pad:
    [1, 1, 1, None, 5, 6, 7]

Linear:
    [1, None, None, None, 5, 6, 7]

Using the zero and pad methods, the first two missing values are filled and the third is skipped. When using the linear method, nothing gets filled because a valid value is not seen before the limit is reached, so it just gives up and returns the missing data.

When filling multiple columns, the count is maintained on a per-column basis. So given the following data:

    simple_missing_data = dict(
        name="traffic",
        columns=["time", "direction"],
        points=[
            [1400425947000, {'in': 1, 'out': None}],
            [1400425948000, {'in': None, 'out': None}],
            [1400425949000, {'in': None, 'out': None}],
            [1400425950000, {'in': 3, 'out': 8}],
            [1400425960000, {'in': None, 'out': None}],
            [1400425970000, {'in': None, 'out': 12}],
            [1400425980000, {'in': None, 'out': 13}],
            [1400425990000, {'in': 7, 'out': None}],
            [1400426000000, {'in': 8, 'out': None}],
            [1400426010000, {'in': 9, 'out': None}],
            [1400426020000, {'in': 10, 'out': None}],
        ]
    )

The in and out sub-columns will be counted and filled independently of each other.

If fill_limit is not set, no limits will be placed on the fill and all values will be filled as appropriate to the selected method.
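
Continuing with simple_missing_data above, a per-column limited fill might look like this (a sketch using the limit arg from the fill() prototype shown earlier):

    # Pad-fill both sub-columns, giving up after two consecutive
    # invalid values in any one column.
    ts = TimeSeries(simple_missing_data)

    new_ts = ts.fill(field_spec=['direction.in', 'direction.out'],
                     method='pad', limit=2)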

Constructing linear fill Pipeline chains

TimeSeries.fill() will be the common entry point for the Filler, but a Pipeline can be constructed as well. Even though the default behavior of TimeSeries.fill() applies to all fill methods, the linear fill logic is somewhat different from the zero and pad methods. Note the following points when creating your own method='linear' processing chain.

  • When constructing a Pipeline to do a linear fill on multiple columns, chain them together like this rather than passing in a field_spec that is a list of columns (a fuller sketch follows this list):
    Pipeline()
    .from_source(ts)
    .fill(field_spec='direction.in', method='linear')
    .fill(field_spec='direction.out', method='linear')
    .to_keyed_collections()
  • If a non-numeric value (as determined by isinstance(val, numbers.Number)) is encountered when doing a linear fill, a warning will be issued and that column will not be processed.
  • When using streaming input like Stream, it is a best practice to set a limit using the optional arg fill_limit. This will ensure events will continue being emitted if the data hits a long run of invalid values.
  • When using an unbounded source, make sure to shut it down “cleanly” using .stop(). This will ensure .flush() is called so any unfilled cached events are emitted.
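
Expanding on the first point above, a complete bounded chain might be assembled like this (a sketch; the 'all' key used to pull the resulting collection out of to_keyed_collections() is an assumption for an un-windowed pipeline):

    from pypond.pipeline import Pipeline

    # Multi-column linear fill as a Pipeline, one fill() per column.
    keyed = (
        Pipeline()
        .from_source(ts)
        .fill(field_spec='direction.in', method='linear')
        .fill(field_spec='direction.out', method='linear')
        .to_keyed_collections()
    )

    # Assumed: an un-windowed pipeline keys its output collection as 'all'.
    filled_collection = keyed.get('all')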

Rename

It might be necessary to rename the columns/data keys in the events in a TimeSeries. It is preferable to just give the columns/keys the desired names when the Event objects are being instantiated, because TimeSeries.rename_columns() will create all new Event objects and a new TimeSeries as well. But if renaming is necessary, use this method.

Usage

This method takes a python dict of strings mapping each existing key to its new name, in the format {'key': 'new_key'}. This example:

    ts = TimeSeries(TICKET_RANGE)

    renamed = ts.rename_columns({'title': 'event', 'esnet_ticket': 'ticket'})

will rename the existing column title to event, etc.

Limitations

Unlike other uses of a field_spec to point at a deep.nested.value in pypond, .rename_columns() only allows renaming a ‘top level’ column/key. If the data payload looks like this:

    {'direction': {'in': 5, 'out': 7}}

The top level key direction can be renamed but the nested keys in and out cannot.
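
A sketch of that limitation, reusing the simple_missing_data series from above:

    # Only the top-level 'direction' key can be renamed; the nested
    # 'in' and 'out' keys are left as they are.
    ts = TimeSeries(simple_missing_data)

    renamed = ts.rename_columns({'direction': 'traffic'})
    # the payload shape becomes {'traffic': {'in': ..., 'out': ...}}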

Align

The align processor takes a TimeSeries of events that might come in with timestamps at uneven intervals and produces a new series of those points aligned on precise time window boundaries. A series containing four events with the following timestamps:

0:40
1:05
1:45
2:10

Given a window of 1m (one minute), a new series with two events at the following times will be produced:

1:00
2:00

Only a series of Event objects can be aligned. IndexedEvent objects are by definition already aligned, and alignment makes no sense for TimeRangeEvent objects.

It should also be noted that the emitted/aligned events will only contain the fields that alignment was requested on. That is, if you have two columns, in and out, and only request to align the in column, the out value will not be contained in the emitted event.

Usage

The full argument usage of the align method:

    ts = TimeSeries(DATA_WITH_GAPS)
    aligned = ts.align(field_spec='value', window='1m', method='linear', limit=2)
  • field_spec - indicates which fields should be interpolated by the selected method. This is the usual field_spec arg; if not supplied, the default field value will be used.
  • window - an integer and the usual s/m/h/d notation like 1m, 30s, 6h, etc. The emitted events will be emitted on the indicated window boundaries. Due to the nature of the interpolation, one would want to use a window close to the frequency of the events. It would make little sense to set a window of 5h on hourly data, etc.
  • method - the interpolation method to be used: linear (the default) and hold.
  • limit - sets a limit on the number of boundary-interpolated events that will be produced. If limit=2, window='1m' and two events come in at the following times:
0:45
3:15

That would normally produce events on three window boundaries (1:00, 2:00 and 3:00), but that exceeds the limit, so those events will have None as a value instead of an interpolated value.

Fill methods

Linear

This is the default method. It interpolates differential values in Event objects on the window boundaries using a strategy like this:

[Figure: linear align]

The green points are the events that will be produced by the linear fill method by interpolating the raw points. It also shows why it makes little sense to use a window significantly larger than the frequency of the events. When the window is set too wide for the data, many of the points in the middle of the window will be disregarded since the generated points are interpolated from the last event in the previous window and the first one in the current window.

Hold

This is a much simpler method. It just fills the selected field(s) with the corresponding value from the previous event.
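
For example, a hold-style alignment of the series from the usage example above might look like this (a sketch):

    # Hold alignment carries the previous observed value forward to
    # each 1m window boundary instead of interpolating between events.
    held = TimeSeries(DATA_WITH_GAPS).align(field_spec='value',
                                            window='1m', method='hold')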

Rate (derivative)

This generates a new TimeSeries of TimeRangeEvent objects which contain the derivative between columns in two consecutive Event objects. The start and end time of the time range events correspond to the timestamps of the two events the calculation was derived from.

The primary use case for this was to generate rate data from monotonically increasing SNMP counter values like this:

    TimeSeries(RAW_COUNTERS).align(field_spec='in', window='30s').rate('in')

This would take the raw counter data, linearly align it on 30 second window boundaries, and then calculate the rates by taking the derivative between the aligned boundary values.

However, it is not necessary to align your events first; just calling .rate() will generate time range events with the derivative between consecutive events.
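
As a small worked sketch (hypothetical counter readings; the per-second rate and the output column naming noted in the comments are assumptions):

    # Two counter readings 30 seconds apart. Assuming the derivative is
    # expressed per second, the emitted TimeRangeEvent would carry
    # (130 - 100) / 30 = 1.0 (the exact output column name, e.g.
    # 'in_rate', is an assumption).
    counter_data = dict(
        name="counters",
        columns=["time", "in"],
        points=[
            [1400425947000, 100],
            [1400425977000, 130],
        ]
    )

    rates = TimeSeries(counter_data).rate('in')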

Usage

The method prototype:

    def rate(self, field_spec=None, allow_negative=True)
  • field_spec - indicates which fields the derivative will be calculated on. This is the usual field_spec arg; if not supplied, the default field value will be used.
  • allow_negative - if left defaulting to True, then if a negative derivative is calculated, it will be used as the value in the new event. If set to False, a negative derivative will be set to None instead. There are certain use cases - such as when a monotonically increasing counter gets reset - where this is the desired outcome (see the sketch below).
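
A sketch of that counter-reset case (hypothetical data):

    # The counter resets between the second and third readings, which
    # would yield a negative delta; with allow_negative=False the value
    # for that span becomes None rather than a negative rate.
    reset_data = dict(
        name="counters",
        columns=["time", "in"],
        points=[
            [1400425947000, 100],
            [1400425977000, 130],
            [1400426007000, 5],
            [1400426037000, 35],
        ]
    )

    rates = TimeSeries(reset_data).rate(field_spec='in',
                                        allow_negative=False)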