Analysis

Analysis helper functions

Utils

mesmerize.analysis.utils.get_array_size(transmission: mesmerize.analysis.data_types.Transmission, data_column: str) → int

Returns the size of the 1D arrays in the specified data column. Raises an exception if the arrays are not all the same size.

Parameters
  • transmission (Transmission) – Desired Transmission

  • data_column (str) – Data column of the Transmission from which to retrieve the size

Returns

Size of the 1D arrays of the specified data column

Return type

int
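
The size check described above can be sketched without a Transmission. This is a minimal illustration, not the library's implementation; array_size and the sample column are hypothetical names used only here.

```python
import numpy as np

def array_size(arrays):
    """Return the common length of a collection of 1D arrays.

    Hypothetical stand-in for the check described above: raises if the
    arrays do not all have the same size.
    """
    sizes = {np.asarray(a).size for a in arrays}
    if len(sizes) != 1:
        raise ValueError(f"Arrays are not all the same size: {sorted(sizes)}")
    return sizes.pop()

# Three curves of equal length, as they might be stored in one data column.
column = [np.zeros(100), np.ones(100), np.arange(100)]
size = array_size(column)
print(size)
```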

mesmerize.analysis.utils.get_frequency_linspace(transmission: mesmerize.analysis.data_types.Transmission) → Tuple[numpy.ndarray, float]

Get the frequency linspace.

Raises an exception if the datablocks do not all have the same linspace and Nyquist frequency.

Parameters

transmission – Transmission containing data from which to get frequency linspace

Returns

tuple: (frequency linspace as a 1D numpy array, nyquist frequency)

Return type

Tuple[np.ndarray, float]
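
The relationship between the sampling rate, the Nyquist frequency and a frequency linspace can be sketched as follows. This assumes the linspace runs from 0 to the Nyquist frequency and that the number of points is supplied by the caller; how the library derives either value is not stated here, and frequency_linspace is a hypothetical name.

```python
import numpy as np

def frequency_linspace(sampling_rate, n_points):
    """Return (frequencies, nyquist) for a given sampling rate in Hertz.

    Assumption: frequencies are evenly spaced from 0 up to the Nyquist
    frequency, which is half the sampling rate.
    """
    nyquist = sampling_rate / 2.0
    return np.linspace(0.0, nyquist, n_points), nyquist

freqs, nyquist = frequency_linspace(sampling_rate=10.0, n_points=6)
print(freqs, nyquist)
```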

mesmerize.analysis.utils.get_proportions(xs: Union[pandas.core.series.Series, numpy.ndarray, list], ys: Union[pandas.core.series.Series, numpy.ndarray], xs_name: str = 'xs', ys_name: str = 'ys', swap: bool = False, percentages: bool = True) → pandas.core.frame.DataFrame

Get the proportions of xs vs ys.

xs & ys are categorical data.

Parameters
  • xs (Union[pd.Series, np.ndarray, list]) – data plotted on the x axis

  • ys (Union[pd.Series, np.ndarray]) – proportions of unique elements in ys are calculated per xs

  • xs_name (str) – name for the xs data, useful for labeling the axis in plots

  • ys_name (str) – name for the ys data, useful for labeling the axis in plots

  • swap (bool) – swap x and y

Returns

DataFrame that can be plotted in a proportions bar graph

Return type

pd.DataFrame
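
A proportions table of this kind can be sketched with pandas alone. The example below uses pd.crosstab rather than get_proportions itself, and the condition and cluster values are made up for illustration.

```python
import pandas as pd

# Hypothetical categorical data: one entry per sample.
xs = pd.Series(["ctrl", "ctrl", "drug", "drug", "drug", "drug"], name="condition")
ys = pd.Series(["A", "B", "A", "A", "A", "B"], name="cluster")

# Row-normalized contingency table, scaled to percentages:
# each row gives the share of every ys category within one xs category.
proportions = pd.crosstab(xs, ys, normalize="index") * 100

print(proportions)
```

A table shaped like this can be drawn as a stacked bar graph, for instance with the DataFrame's own plot.bar(stacked=True), provided a plotting backend is installed.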

mesmerize.analysis.utils.get_sampling_rate(transmission: mesmerize.analysis.data_types.Transmission, tolerance: Optional[float] = 0.1) → float

Returns the mean sampling rate of all data in a Transmission if the variation between samples is within the specified tolerance. Otherwise raises an exception.

Parameters
  • transmission (Transmission) – Transmission object of the data from which sampling rate is obtained.

  • tolerance (float) – Maximum tolerance (in Hertz) of sampling rate variation between different samples

Returns

The mean sampling rate of all data in the Transmission

Return type

float
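
The tolerance check can be sketched as below. It assumes that "variation" means the difference between the largest and smallest sampling rate; the library may measure it differently, and mean_sampling_rate is a hypothetical name.

```python
import numpy as np

def mean_sampling_rate(rates, tolerance=0.1):
    """Return the mean of the sampling rates (in Hertz).

    Assumption: the rates are acceptable when max - min <= tolerance.
    """
    rates = np.asarray(rates, dtype=float)
    spread = rates.max() - rates.min()
    if spread > tolerance:
        raise ValueError(
            f"Sampling rates vary by {spread:.3f} Hz, tolerance is {tolerance} Hz"
        )
    return float(rates.mean())

rate = mean_sampling_rate([10.0, 10.02, 9.98])
print(rate)
```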

mesmerize.analysis.utils.organize_dataframe_columns(columns: Iterable[str]) → Tuple[List[str], List[str], List[str]]

Organizes DataFrame columns into data columns, categorical label columns, and uuid columns.

Parameters

columns – All DataFrame columns

Returns

(data_columns, categorical_columns, uuid_columns)

Return type

Tuple[List[str], List[str], List[str]]
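
The three-way split can be sketched as follows. The classification rules used here (names containing "uuid" are uuid columns, names starting with an underscore are data columns, everything else is categorical) are assumptions made for illustration, as are the column names; the rules the library actually applies are not stated here.

```python
def organize_columns(columns):
    """Split column names into (data, categorical, uuid) lists.

    Assumed naming rules, for illustration only:
      - contains "uuid"      -> uuid column
      - starts with "_"      -> data column
      - anything else        -> categorical label column
    """
    data, categorical, uuid = [], [], []
    for name in columns:
        if "uuid" in name.lower():
            uuid.append(name)
        elif name.startswith("_"):
            data.append(name)
        else:
            categorical.append(name)
    return data, categorical, uuid

# Hypothetical column names.
dcols, ccols, ucols = organize_columns(
    ["_RAW_CURVE", "genotype", "uuid_curve", "_SPLICE_ARRAYS", "SampleID"]
)
print(dcols, ccols, ucols)
```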

mesmerize.analysis.utils.pad_arrays(a: numpy.ndarray, method: str = 'random', output_size: int = None, mode: str = 'minimum', constant: Any = None) → numpy.ndarray

Pad all the input arrays so that they are of the same length. The length is determined by the largest input array. By default (mode ‘minimum’), the padding value for each input array is the minimum value in that array.

Padding is either appended after each input array’s last index to fill up to the length of the largest input array (method ‘fill-size’), or split randomly between the start and end of the input array (method ‘random’) for easier visualization.

Parameters
  • a (np.ndarray) – 1D array where each element is a 1D array

  • method (str) – one of ‘fill-size’ or ‘random’, see the description above for details

  • output_size – not used

  • mode (str) – one of either ‘constant’ or ‘minimum’. If ‘minimum’, the minimum value of each array is used as its padding value. If ‘constant’, the value passed to the “constant” argument is used as the padding value.

  • constant (Any) – padding value if ‘mode’ is set to ‘constant’

Returns

Arrays padded according to the chosen method. 2D array of shape [n_arrays, size of largest input array]

Return type

np.ndarray
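
The two padding methods can be sketched with np.pad. This is an illustration of the behaviour described above, not the library's code; pad_to_longest and its rng argument are hypothetical, and only the ‘minimum’ padding value is shown.

```python
import numpy as np

def pad_to_longest(arrays, method="fill-size", rng=None):
    """Pad 1D arrays with their own minimum so all match the longest one.

    method="fill-size": all padding goes after the last index.
    method="random":    padding is split at random between both ends.
    """
    rng = np.random.default_rng() if rng is None else rng
    longest = max(len(a) for a in arrays)
    out = np.empty((len(arrays), longest), dtype=float)
    for i, a in enumerate(arrays):
        a = np.asarray(a, dtype=float)
        n_pad = longest - a.size
        if method == "fill-size":
            before = 0
        elif method == "random":
            before = int(rng.integers(0, n_pad + 1))
        else:
            raise ValueError("method must be 'fill-size' or 'random'")
        out[i] = np.pad(a, (before, n_pad - before),
                        mode="constant", constant_values=a.min())
    return out

curves = [np.array([3.0, 1.0, 2.0]), np.array([5.0, 4.0, 6.0, 7.0, 8.0])]
padded = pad_to_longest(curves, method="fill-size")
flanked = pad_to_longest(curves, method="random", rng=np.random.default_rng(0))
print(padded)
```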

Cross correlation

functions

Helper functions. Uses tslearn.cycc

mesmerize.analysis.math.cross_correlation.ncc_c(x: numpy.ndarray, y: numpy.ndarray) → numpy.ndarray

Both “x” and “y” must be 1D arrays

Parameters
  • x – Input array [x1, x2, x3, … xn]

  • y – Input array [y1, y2, y3, … yn]

Returns

The normalized cross-correlation function (as an array) of the two input vectors “x” and “y”

Return type

np.ndarray
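
A normalized cross-correlation of two 1D arrays can be sketched with NumPy alone. This uses np.correlate instead of tslearn and divides by the product of the Euclidean norms; whether ncc_c applies exactly this normalization is an assumption, and normalized_cc is a hypothetical name.

```python
import numpy as np

def normalized_cc(x, y):
    """Full cross-correlation of x and y, scaled by ||x|| * ||y||.

    The result has length len(x) + len(y) - 1. For equal-length inputs
    the zero-lag term sits at index len(x) - 1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        # An all-zero input has no defined normalization; return zeros.
        return np.zeros(x.size + y.size - 1)
    return np.correlate(x, y, mode="full") / denom

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
cc = normalized_cc(x, x)  # a curve correlated with itself peaks at zero lag
print(cc)
```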

mesmerize.analysis.math.cross_correlation.get_omega(x: numpy.ndarray = None, y: numpy.ndarray = None, cc: numpy.ndarray = None) → int

Must pass a 1D array to either both “x” and “y” or a cross-correlation function (as an array) to “cc”

Parameters
  • x – Input array [x1, x2, x3, … xn]

  • y – Input array [y1, y2, y3, … yn]

  • cc – cross-correlation function represented as an array [c1, c2, c3, … cn]

Returns

Index (x-axis position) of the global maximum of the cross-correlation function

Return type

int

mesmerize.analysis.math.cross_correlation.get_lag(x: numpy.ndarray = None, y: numpy.ndarray = None, cc: numpy.ndarray = None) → float

Must pass a 1D array to either both “x” and “y” or a cross-correlation function (as an array) to “cc”

Parameters
  • x – Input array [x1, x2, x3, … xn]

  • y – Input array [y1, y2, y3, … yn]

  • cc – cross-correlation function represented as an array [c1, c2, c3, … cn]

Returns

Position of the global maximum of the cross-correlation function relative to the middle point of the cross-correlation function

Return type

float

mesmerize.analysis.math.cross_correlation.get_epsilon(x: numpy.ndarray = None, y: numpy.ndarray = None, cc: numpy.ndarray = None) → float

Must pass a 1D vector to either both “x” and “y” or a cross-correlation function to “cc”

Parameters
  • x – Input array [x1, x2, x3, … xn]

  • y – Input array [y1, y2, y3, … yn]

  • cc – cross-correlation function represented as an array [c1, c2, c3, … cn]

Returns

Magnitude of the global maximum of the cross-correlation function

Return type

float
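
The three quantities described for get_omega, get_lag and get_epsilon can be read off a cross-correlation array directly. The sketch below uses a made-up cc array and assumes the lag is expressed in samples relative to the central index; any further scaling the library may apply is not shown.

```python
import numpy as np

# Hypothetical cross-correlation function with 9 points (middle index is 4).
cc = np.array([0.0, 0.1, 0.3, 0.2, 0.5, 0.9, 0.4, 0.1, 0.0])

omega = int(np.argmax(cc))      # index of the global maximum
middle = (cc.size - 1) / 2      # central index, i.e. zero lag
lag = omega - middle            # offset of the maximum from the middle
epsilon = float(cc[omega])      # magnitude of the global maximum

print(omega, lag, epsilon)
```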

mesmerize.analysis.math.cross_correlation.get_lag_matrix(curves: numpy.ndarray = None, ccs: numpy.ndarray = None) → numpy.ndarray

Get a 2D matrix of lags. Can pass either a 2D array of 1D curves or cross-correlations

Parameters
  • curves – 2D array of 1D curves

  • ccs – 2D array of 1D cross-correlation functions represented by arrays

Returns

2D matrix of lag values, shape is [n_curves, n_curves]

Return type

np.ndarray

mesmerize.analysis.math.cross_correlation.get_epsilon_matrix(curves: numpy.ndarray = None, ccs: numpy.ndarray = None) → numpy.ndarray

Get a 2D matrix of maxima. Can pass either a 2D array of 1D curves or cross-correlations

Parameters
  • curves – 2D array of 1D curves

  • ccs – 2D array of 1D cross-correlation functions represented by arrays

Returns

2D matrix of maxima values, shape is [n_curves, n_curves]

Return type

np.ndarray
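
Given precomputed cross-correlation functions, lag and maxima matrices of shape [n_curves, n_curves] can be sketched as below. The input is assumed to be a 3D array of shape [n_curves, n_curves, func_length], and the lag is assumed to be measured in samples from the central index; lag_and_epsilon_matrices is a hypothetical name.

```python
import numpy as np

def lag_and_epsilon_matrices(ccs):
    """Return (lag_matrix, epsilon_matrix) from a 3D array of cc functions."""
    ccs = np.asarray(ccs, dtype=float)
    middle = (ccs.shape[2] - 1) / 2          # zero-lag index
    lag_matrix = ccs.argmax(axis=2) - middle  # peak offset per curve pair
    epsilon_matrix = ccs.max(axis=2)          # peak height per curve pair
    return lag_matrix, epsilon_matrix

# Hand-made cc functions for two curves, func_length = 5.
ccs = np.zeros((2, 2, 5))
ccs[0, 0, 2] = 1.0   # each curve matches itself at zero lag
ccs[1, 1, 2] = 1.0
ccs[0, 1, 3] = 0.8   # pair (0, 1) peaks one sample after the middle
ccs[1, 0, 1] = 0.8   # pair (1, 0) peaks one sample before the middle

lag_matrix, epsilon_matrix = lag_and_epsilon_matrices(ccs)
print(lag_matrix)
print(epsilon_matrix)
```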

mesmerize.analysis.math.cross_correlation.compute_cc_data(curves: numpy.ndarray) → mesmerize.analysis.math.cross_correlation.CC_Data

Compute cross-correlation data (cc functions, lag and maxima matrices)

Parameters

curves – input curves as a 2D array, shape is [n_samples, curve_size]

Returns

cross correlation data for the input curves as a CC_Data instance

Return type

CC_Data

mesmerize.analysis.math.cross_correlation.compute_ccs(a: numpy.ndarray) → numpy.ndarray

Compute cross-correlations between all 1D curves in a 2D input array

Parameters

a – 2D input array of 1D curves, shape is [n_samples, curve_size]

Return type

np.ndarray
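
Computing cross-correlations between every pair of curves in a 2D array can be sketched with a double loop over np.correlate. The normalization by the product of the curve norms and the output shape [n_samples, n_samples, 2 * curve_size - 1] are assumptions for this illustration; pairwise_ccs is a hypothetical name.

```python
import numpy as np

def pairwise_ccs(curves):
    """Normalized full cross-correlation for every ordered pair of rows."""
    curves = np.asarray(curves, dtype=float)
    n, m = curves.shape
    norms = np.linalg.norm(curves, axis=1)
    ccs = np.zeros((n, n, 2 * m - 1))
    for i in range(n):
        for j in range(n):
            denom = norms[i] * norms[j]
            if denom > 0:  # all-zero curves are left as zeros
                ccs[i, j] = np.correlate(curves[i], curves[j], mode="full") / denom
    return ccs

# The second curve is the first one delayed by one sample.
curves = np.array([
    [0.0, 1.0, 2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0, 0.0],
])
ccs = pairwise_ccs(curves)
print(ccs.shape)
```

With these inputs the peak of each off-diagonal function sits one index away from the middle, in opposite directions for the two orderings of the pair.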

CC_Data

Data container

Warning

All arguments to CC_Data MUST be of type numpy.ndarray for the object to be saveable as an hdf5 file. Set numpy.unicode as the dtype for the curve_uuids and labels arrays. If the dtype is 'O' (object), the to_hdf5() method will fail.
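
Converting an object-dtype array of strings to a unicode dtype can be sketched as follows. The labels are made up. The example passes the built-in str as the dtype, which NumPy maps to a unicode ('U') dtype; this is useful in NumPy versions where the numpy.unicode alias is not available.

```python
import numpy as np

# Arrays built from mixed or generic Python objects often end up as dtype 'O'.
labels_obj = np.array(["control", "treated", "treated"], dtype=object)

# Cast to a fixed-width unicode dtype before storing the labels.
labels = labels_obj.astype(str)

print(labels_obj.dtype.kind, labels.dtype.kind)
```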

class mesmerize.analysis.cross_correlation.CC_Data(input_data: numpy.ndarray = None, ccs: numpy.ndarray = None, lag_matrix: numpy.ndarray = None, epsilon_matrix: numpy.ndarray = None, curve_uuids: numpy.ndarray = None, labels: numpy.ndarray = None)
__init__(input_data: numpy.ndarray = None, ccs: numpy.ndarray = None, lag_matrix: numpy.ndarray = None, epsilon_matrix: numpy.ndarray = None, curve_uuids: numpy.ndarray = None, labels: numpy.ndarray = None)

Object for organizing cross-correlation data

types must be numpy.ndarray to be compatible with hdf5

Parameters
  • ccs (np.ndarray) – array of cross-correlation functions, shape: [n_curves, n_curves, func_length]

  • lag_matrix (np.ndarray) – the lag matrix, shape: [n_curves, n_curves]

  • epsilon_matrix (np.ndarray) – the maxima matrix, shape: [n_curves, n_curves]

  • curve_uuids (np.ndarray) – uuids (str representation) for each of the curves, length: n_curves

  • labels (np.ndarray) – labels for each curve, length: n_curves

ccs

array of cross-correlation functions

lag_matrix

lag matrix

epsilon_matrix

maxima matrix

curve_uuids

uuids for each curve

labels

labels for each curve

get_threshold_matrix(matrix_type: str, lag_thr: float, max_thr: float, lag_thr_abs: bool = True) → numpy.ndarray

Get lag or maxima matrix with thresholds applied. Values outside the threshold are set to NaN

Parameters
  • matrix_type – one of ‘lag’ or ‘maxima’

  • lag_thr – lag threshold

  • max_thr – maxima threshold

  • lag_thr_abs – threshold with the absolute value of lag

Returns

the requested matrix with the thresholds applied to it.

Return type

np.ndarray
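
Thresholding with NaN can be sketched with np.where. The rule used here is an assumption made for illustration: an entry is outside the threshold when its lag exceeds lag_thr (by absolute value if lag_thr_abs is set) or when its maximum is below max_thr. The standalone threshold_matrix function is hypothetical and takes both matrices explicitly.

```python
import numpy as np

def threshold_matrix(lag_matrix, epsilon_matrix, matrix_type,
                     lag_thr, max_thr, lag_thr_abs=True):
    """Return the lag or maxima matrix with out-of-threshold entries as NaN."""
    lag = np.asarray(lag_matrix, dtype=float)
    eps = np.asarray(epsilon_matrix, dtype=float)
    lag_tested = np.abs(lag) if lag_thr_abs else lag
    # Assumed rule: reject large lags and weak maxima.
    outside = (lag_tested > lag_thr) | (eps < max_thr)
    if matrix_type == "lag":
        source = lag
    elif matrix_type == "maxima":
        source = eps
    else:
        raise ValueError("matrix_type must be 'lag' or 'maxima'")
    return np.where(outside, np.nan, source)

lag = np.array([[0, 1, 3], [-1, 0, 2], [-3, -2, 0]])
eps = np.array([[1.0, 0.9, 0.8], [0.9, 1.0, 0.3], [0.8, 0.3, 1.0]])

lag_thr_matrix = threshold_matrix(lag, eps, "lag", lag_thr=2, max_thr=0.5)
max_thr_matrix = threshold_matrix(lag, eps, "maxima", lag_thr=2, max_thr=0.5)
print(lag_thr_matrix)
```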

classmethod from_dict(d: dict)

Load data from a dict

to_hdf5(path: str)

Save as an HDF5 file

Parameters

path – path to save the hdf5 file to; the file must not already exist.

classmethod from_hdf5(path: str)

Load cross-correlation data from an hdf5 file

Parameters

path – path to the hdf5 file

Clustering metrics

mesmerize.analysis.clustering_metrics.get_centerlike(cluster_members: numpy.ndarray, metric: Optional[Union[str, callable]] = None, dist_matrix: Optional[numpy.ndarray] = None) → Tuple[numpy.ndarray, int]

Finds the 1D time-series within a cluster that is the most centerlike

Parameters
  • cluster_members – 2D numpy array in the form [n_samples, 1D time_series]

  • metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances

  • dist_matrix – Distance matrix of the cluster members

Returns

The cluster member which is most centerlike, and its index in the cluster_members array

mesmerize.analysis.clustering_metrics.get_cluster_radius(cluster_members: numpy.ndarray, metric: Optional[Union[str, callable]] = None, dist_matrix: Optional[numpy.ndarray] = None, centerlike_index: Optional[int] = None) → float

Returns the cluster radius according to chosen distance metric

Parameters
  • cluster_members – 2D numpy array in the form [n_samples, 1D time_series]

  • metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances

  • dist_matrix – Distance matrix of the cluster members

  • centerlike_index – Index of the centerlike cluster member within the cluster_members array

Returns

The cluster radius: the average distance between the most centerlike member and all other members
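
The most centerlike member and the cluster radius can be sketched with NumPy. The sketch assumes that "most centerlike" means the member with the smallest total distance to all other members, and it uses Euclidean distance instead of sklearn.metrics.pairwise_distances; all function names are hypothetical.

```python
import numpy as np

def pairwise_euclidean(members):
    """Euclidean distance between every pair of rows."""
    diff = members[:, None, :] - members[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def centerlike(members, dist_matrix=None):
    """Return (member, index) with the smallest summed distance to the rest."""
    members = np.asarray(members, dtype=float)
    if dist_matrix is None:
        dist_matrix = pairwise_euclidean(members)
    index = int(np.argmin(dist_matrix.sum(axis=1)))
    return members[index], index

def cluster_radius(members, dist_matrix=None, centerlike_index=None):
    """Mean distance from the centerlike member to all other members."""
    members = np.asarray(members, dtype=float)
    if dist_matrix is None:
        dist_matrix = pairwise_euclidean(members)
    if centerlike_index is None:
        _, centerlike_index = centerlike(members, dist_matrix)
    others = np.delete(dist_matrix[centerlike_index], centerlike_index)
    return float(others.mean()) if others.size else 0.0

members = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
center, center_index = centerlike(members)
radius = cluster_radius(members)
print(center, center_index, radius)
```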

mesmerize.analysis.clustering_metrics.davies_bouldin_score(data: numpy.ndarray, cluster_labels: numpy.ndarray, metric: Union[str, callable]) → float

Adapted from sklearn.metrics.davies_bouldin_score to use any distance metric

Parameters
  • data – Data that was used for clustering, [n_samples, 1D time_series]

  • metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances

  • cluster_labels – Cluster labels

Returns

Davies-Bouldin score computed with the chosen distance metric
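
A Davies-Bouldin score with an arbitrary metric can be sketched as below. This is not the library's code: it assumes each cluster is represented by its most centerlike member (smallest summed distance to the rest), that the cluster scatter is the mean distance from that member to the others, and that the metric is a callable taking two 1D arrays. The name davies_bouldin and the sample data are hypothetical.

```python
import numpy as np

def davies_bouldin(data, cluster_labels, metric):
    """Mean over clusters of the worst (S_i + S_j) / M_ij ratio.

    S_i is the scatter of cluster i, M_ij the distance between the
    representatives of clusters i and j. Lower scores mean tighter,
    better separated clusters. Representatives must be distinct,
    otherwise M_ij is zero.
    """
    data = np.asarray(data, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    labels = np.unique(cluster_labels)
    if labels.size < 2:
        raise ValueError("At least two clusters are required")

    centers, scatters = [], []
    for label in labels:
        members = data[cluster_labels == label]
        dists = np.array([[metric(a, b) for b in members] for a in members])
        idx = int(np.argmin(dists.sum(axis=1)))   # assumed representative
        others = np.delete(dists[idx], idx)
        centers.append(members[idx])
        scatters.append(others.mean() if others.size else 0.0)

    k = labels.size
    total = 0.0
    for i in range(k):
        ratios = [
            (scatters[i] + scatters[j]) / metric(centers[i], centers[j])
            for j in range(k) if j != i
        ]
        total += max(ratios)
    return total / k

def manhattan(a, b):
    return float(np.abs(a - b).sum())

data = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
cluster_labels = np.array([0, 0, 0, 1, 1, 1])
score = davies_bouldin(data, cluster_labels, manhattan)
print(score)
```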