Re-sampling time series / spectrum data to reduce dimensions [Python: SciPy]

2019/3/26

This post shows how to thin out (downsample) time series / spectrum data stored in a DataFrame in Python.

"pandas.DataFrame.resample" requires a time-based frequency argument such as D (daily) or W (weekly). If you want to thin out time series data or spectrum data that has no date and time column, you can use the following method instead.
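To make that limitation concrete, here is a minimal sketch. The series `daily` and `plain` are made-up illustration data, not part of the original post. The same values resample fine with a date index, but raise a TypeError without one.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a DatetimeIndex: resample works
daily = pd.Series(
    np.arange(14.0),
    index=pd.date_range("2019-01-01", periods=14, freq="D"),
)
weekly = daily.resample("W").mean()
print(weekly)

# Same values without a date/time index: resample raises TypeError
plain = pd.Series(np.arange(14.0))
try:
    plain.resample("W").mean()
    resample_failed = False
except TypeError as exc:
    resample_failed = True
    print("resample failed:", exc)
```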

What is time series / spectrum data?

Time series / spectral data is a series of data observed at regular intervals along some axis. Examples include changes in stock prices and absorption spectra.

Examples of time series data and spectral data:
・ Transition of temperature and precipitation
・ Changes in traffic conditions
・ Daily sales
・ Changes in stock prices
・ Changes in Bitcoin prices
・ Voice data
・ Compound absorption spectra (IR, UV)
・ Spectra from celestial bodies

Although these fall into different categories, the data share the following characteristics.

  1. Points adjacent to a measurement point take close values
  2. Noise is present over long periods (wide ranges)
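The first characteristic can be checked numerically. The following is a rough sketch on a made-up noisy sine wave (the signal, noise level, and random seed are all assumptions for illustration). It measures how strongly each point correlates with its right-hand neighbour.

```python
import numpy as np

# Hypothetical smooth signal with a little measurement noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200, endpoint=False)
y = np.sin(x) + rng.normal(scale=0.05, size=x.size)

# Correlation between each point and the next one (lag-1 correlation)
lag1_corr = np.corrcoef(y[:-1], y[1:])[0, 1]
print(f"lag-1 correlation: {lag1_corr:.3f}")
```

A value close to 1 means neighbouring points carry largely redundant information, which is why thinning them out tends to cost little.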

Reference: https://kotobank.jp/word/%E6%99%82%E7%B3%BB%E5%88%97%E3%83%87%E3%83%BC%E3%82%BF-1329677 https://datachemeng.com/preprocessspectratimeseriesdata/

Re-sampling of time series / spectral data

When you build a prediction model from time series / spectrum data, using the data at every measurement point creates a huge number of features. This can lead to overfitting, so reducing the dimensions by downsampling is an effective way to improve generalization performance.
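As a baseline, the simplest way to cut the feature count is to keep every n-th measurement point by slicing. The sketch below uses a made-up feature table `X` (5 samples by 40 measurement points). Note that plain slicing applies no filtering, so it is only a reference point for the SciPy functions discussed next.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table: 5 samples (rows) x 40 measurement points (columns)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(5, 40)))

# Keep every second measurement point (no anti-aliasing filter is applied)
X_thin = X.iloc[:, ::2]
print(X.shape, "->", X_thin.shape)
```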

What to use: SciPy (scipy.signal)

SciPy is an open-source software ecosystem for mathematics, science, and engineering (a collection of advanced scientific computing libraries).

It offers more advanced numerical routines than NumPy and makes it easy to perform numerical integration, signal processing, optimization, statistics, and more, along with physical constants, sparse matrices, and probability distributions.
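For a quick feel of two of those areas, here is a small sketch using scipy.constants and scipy.integrate. The integrand is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import constants, integrate

# Physical constant: speed of light in vacuum [m/s]
print(constants.c)

# Numerical integration of sin(x) over [0, pi]; the exact value is 2
area, abs_err = integrate.quad(np.sin, 0, np.pi)
print(area, abs_err)
```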

scipy.signal is a module related to waveform processing in scipy.

  • scipy.signal.decimate

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

# Create the underlying 40-point waveform data
x = np.linspace(0, 10, 40, endpoint=False)
y = np.cos(-x ** 2 / 6)

# Downsample to 20 points based on the data
x_down = np.linspace(0, 10, 20, endpoint=False)
y_down = signal.decimate(y, 2)  # downsample to 1/2

# Plot the result on a graph
# (the next line is a Jupyter notebook magic; omit it in a plain script)
%matplotlib inline
plt.plot(x, y, '.-', label='data')
plt.plot(x_down, y_down, 'rs-', label='down-sampled', alpha=0.5)
plt.legend()
plt.show()
Downsampling with scipy.signal.decimate

scipy.signal.decimate applies anti-aliasing processing and then downsamples (anti-aliasing is processing that removes the distortion that arises when continuous data is sampled at regular intervals). The resampled result stays close to what you would get by simply thinning out the data points.
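The effect of the anti-aliasing step is easier to see on a signal that contains a frequency too high for the reduced sampling rate. The sketch below uses made-up values (a 400 Hz sampling rate, a 5 Hz component to keep, and a 180 Hz component to reject). It compares plain thinning with signal.decimate using its default filter settings.

```python
import numpy as np
from scipy import signal

fs = 400                                   # assumed sampling rate [Hz]
t = np.arange(2000) / fs                   # 5 seconds of samples
low = np.sin(2 * np.pi * 5 * t)            # component we want to keep
high = 0.5 * np.sin(2 * np.pi * 180 * t)   # above the new Nyquist frequency (50 Hz)
y = low + high

q = 4                                      # downsampling factor: 400 Hz -> 100 Hz
y_naive = y[::q]                           # plain thinning: 180 Hz aliases to 20 Hz
y_dec = signal.decimate(y, q)              # low-pass filter, then keep every q-th sample

# Compare both results against the clean low-frequency component
ref = low[::q]
rms_naive = np.sqrt(np.mean((y_naive - ref) ** 2))
rms_dec = np.sqrt(np.mean((y_dec - ref) ** 2))
print(f"RMS error, plain thinning: {rms_naive:.3f}")
print(f"RMS error, decimate:       {rms_dec:.3f}")
```

On a smooth waveform such as the one in the example above, there is little high-frequency content to remove, which is consistent with the decimated curve staying close to the thinned data.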

Reference: https://docs.scipy.org/doc/scipy-1.2.1/reference/generated/scipy.signal.decimate.html

  • scipy.signal.resample

import numpy as np
import matplotlib.pyplot as plt
from scipy import signal

# Create the 40-point base waveform data
x = np.linspace(0, 10, 40, endpoint=False)
y = np.cos(-x ** 2 / 6)

# Upsample by 2.5x to 100 points based on the data
x_up = np.linspace(0, 10, 100, endpoint=False)
y_up = signal.resample(y, 100)

# Downsample by half to 20 points based on the data
x_down = np.linspace(0, 10, 20, endpoint=False)
y_down = signal.resample(y, 20)

# Plot the results on graphs
# (the next line is a Jupyter notebook magic; omit it in a plain script)
%matplotlib inline
plt.plot(x, y, '.-', label='data')
plt.plot(x_up, y_up, 'go-', label='up-sampled', alpha=0.3)
plt.legend()
plt.show()

plt.plot(x, y, '.-', label='data')
plt.plot(x_down, y_down, 'rs-', label='down-sampled', alpha=0.5)
plt.legend()
plt.show()
Waveform data processed by scipy.signal.resample

scipy.signal.resample resamples using the Fourier transform, so it assumes that the signal is periodic. Where that assumption does not hold, as at the ends of the data above, the resampled values deviate significantly.
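The periodicity assumption can be illustrated with two made-up signals. One is a sine wave that fits exactly three periods into the window. The other is a linear ramp whose start and end values do not match. This is a minimal sketch, and the signals are assumptions chosen for illustration.

```python
import numpy as np
from scipy import signal

n, m = 40, 20
x = np.linspace(0, 1, n, endpoint=False)
x_down = np.linspace(0, 1, m, endpoint=False)

# Exactly 3 periods inside the window: the periodicity assumption holds
y_per = np.sin(2 * np.pi * 3 * x)
err_per = np.abs(signal.resample(y_per, m) - np.sin(2 * np.pi * 3 * x_down)).max()

# Linear ramp: the end of the window does not connect back to the start
y_ramp = x
err_ramp = np.abs(signal.resample(y_ramp, m) - x_down).max()

print(f"max error, periodic sine: {err_per:.2e}")
print(f"max error, linear ramp:   {err_ramp:.2e}")
```

The periodic signal is reproduced almost exactly, while the ramp shows a large error near the ends of the window, where the implied wrap-around creates a jump.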

Incidentally, scipy.signal.resample can also upsample.
In the plots, the blue line is the original data, green is the result after upsampling, and red is the result after downsampling.

 

There is also a usage example in a Kaggle voice analysis competition, which may be a helpful reference.

Reference: https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.signal.resample.html

Apply to DataFrame

Here the data frame has one row per measurement point and one column per sample.

import numpy as np
import pandas as pd
from scipy import signal

# Create the data to store in the data frame
x = np.linspace(0, 10, 40, endpoint=False)
y = np.cos(-x ** 2 / 6)
y1 = np.cos(-x ** 2 / 6) * 1 / 2
y2 = np.cos(x) * 1 / 3
df1 = pd.DataFrame({'y': y, 'y1': y1, 'y2': y2})

# Downsample along the rows (axis=0: one row per measurement point)
df1_down = signal.decimate(df1, 2, axis=0)
df1_down = pd.DataFrame(df1_down,
                        columns=['y_down', 'y1_down', 'y2_down'],
                        index=np.linspace(0, 10, 20, endpoint=False))

# Display and plot the data before and after downsampling
print(df1.head(10))
print(df1_down.head(10))
df1.plot(kind='line', marker='.')
df1_down.plot(kind='line', marker='.')
Output of df1.head(10) and df1_down.head(10)
Data points before and after downsampling (panels: before downsampling, after downsampling)

Even after downsampling from the original 40 points to 20 points (1/2), the waveform is preserved well, so little information appears to be lost when the result is used as features.

Example of a data frame in which each row is a sample and each column is a measurement point

If instead each row is a sample and each column is a measurement point (so-called tidy data), just pass the argument axis=1.
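A minimal sketch of that layout follows. The frame `df_wide` and its row labels are hypothetical, built from the same kind of waveforms used earlier. Each row is one sample, and decimation runs along the columns.

```python
import numpy as np
import pandas as pd
from scipy import signal

x = np.linspace(0, 10, 40, endpoint=False)

# Hypothetical wide layout: 3 samples (rows) x 40 measurement points (columns)
df_wide = pd.DataFrame(
    [np.cos(-x ** 2 / 6), np.cos(-x ** 2 / 6) / 2, np.cos(x) / 3],
    index=["y", "y1", "y2"],
    columns=x,
)

# Downsample each row along the column direction (axis=1)
down = signal.decimate(df_wide.to_numpy(), 2, axis=1)
df_wide_down = pd.DataFrame(down, index=df_wide.index, columns=x[::2])
print(df_wide.shape, "->", df_wide_down.shape)
```

Passing the underlying array with to_numpy() and rebuilding the DataFrame afterwards keeps the row labels and lets you assign the thinned measurement positions as the new column labels.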