Analysis¶
Analysis helper functions
Utils¶
- mesmerize.analysis.utils.get_array_size(transmission: Transmission, data_column: str) int [source]¶
Returns the size of the 1D arrays in the specified data column. Throws an exception if the array sizes do not match.
- Parameters:
transmission (Transmission) – Desired Transmission
data_column (str) – Data column of the Transmission from which to retrieve the size
- Returns:
Size of the 1D arrays of the specified data column
- Return type:
int
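The size check described above can be sketched with plain NumPy. This is only an illustration, not the library's implementation: a list of 1D arrays stands in for the data column of a Transmission, and `common_array_size` is a hypothetical name.

```python
import numpy as np

def common_array_size(arrays):
    """Return the shared size of all 1D arrays, or raise if the sizes differ.

    Hypothetical helper: `arrays` stands in for one data column.
    """
    sizes = {np.asarray(a).size for a in arrays}
    if len(sizes) != 1:
        raise ValueError(f"Arrays do not all have the same size: {sorted(sizes)}")
    return sizes.pop()

column = [np.zeros(5), np.ones(5), np.arange(5)]
size = common_array_size(column)
```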
- mesmerize.analysis.utils.get_frequency_linspace(transmission: Transmission) Tuple[ndarray, float] [source]¶
Get the frequency linspace.
Throws an exception if the datablocks do not all share the same linspace & Nyquist frequency
- Parameters:
transmission – Transmission containing data from which to get frequency linspace
- Returns:
tuple: (frequency linspace as a 1D numpy array, Nyquist frequency)
- Return type:
Tuple[np.ndarray, float]
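For orientation, one common way to build a frequency axis that ends at the Nyquist frequency is shown below. How Mesmerize derives the sampling rate and the number of bins from a Transmission is not covered here; `frequency_axis` and both of its parameters are assumptions made for this sketch.

```python
import numpy as np

def frequency_axis(sampling_rate, n_bins):
    """Return (frequencies, nyquist) for a given sampling rate in Hz.

    Hypothetical helper: the Nyquist frequency is half the sampling rate,
    and the axis is evenly spaced from 0 Hz up to that value.
    """
    nyquist = sampling_rate / 2.0
    freqs = np.linspace(0.0, nyquist, n_bins)
    return freqs, nyquist

freqs, nyquist = frequency_axis(sampling_rate=10.0, n_bins=6)
```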
- mesmerize.analysis.utils.get_proportions(xs: Union[Series, ndarray, list], ys: Union[Series, ndarray], xs_name: str = 'xs', ys_name: str = 'ys', swap: bool = False, percentages: bool = True) DataFrame [source]¶
Get the proportions of xs vs ys.
xs & ys are categorical data.
- Parameters:
xs (Union[pd.Series, np.ndarray]) – data plotted on the x axis
ys (Union[pd.Series, np.ndarray]) – proportions of unique elements in ys are calculated per xs
xs_name (str) – name for the xs data, useful for labeling the axis in plots
ys_name (str) – name for the ys data, useful for labeling the axis in plots
swap (bool) – swap x and y
- Returns:
DataFrame that can be plotted in a proportions bar graph
- Return type:
pd.DataFrame
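A proportions table of this kind can be approximated with `pandas.crosstab`. The sketch below is not the function's actual code, and the example categories are made up; it only shows how the share of each ys category is computed within each xs category.

```python
import pandas as pd

# Two categorical variables of equal length (illustrative values)
xs = pd.Series(["ctrl", "ctrl", "ctrl", "drug", "drug"], name="group")
ys = pd.Series(["active", "silent", "active", "silent", "silent"], name="state")

# Row-normalised contingency table: each row (xs category) sums to 1
proportions = pd.crosstab(xs, ys, normalize="index")

# Same table expressed as percentages
percentages = proportions * 100
```

A table shaped like this (one row per xs category, one column per ys category) can be drawn as a stacked bar graph, for example with `percentages.plot(kind="bar", stacked=True)`.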
- mesmerize.analysis.utils.get_sampling_rate(transmission: Transmission, tolerance: Optional[float] = 0.1) float [source]¶
Returns the mean sampling rate of all data in a Transmission if it is within the specified tolerance. Otherwise throws an exception.
- Parameters:
transmission (Transmission) – Transmission object of the data from which sampling rate is obtained.
tolerance (float) – Maximum tolerance (in Hertz) of sampling rate variation between different samples
- Returns:
The mean sampling rate of all data in the Transmission
- Return type:
float
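The tolerance check can be illustrated as follows. The sketch assumes that "variation" means the difference between the largest and smallest sampling rate; the library may define it differently, and `mean_sampling_rate` is a hypothetical name operating on a plain list of rates.

```python
import numpy as np

def mean_sampling_rate(rates, tolerance=0.1):
    """Return the mean rate if all rates agree within `tolerance` (Hz).

    Assumption: agreement is measured as max(rates) - min(rates).
    """
    rates = np.asarray(rates, dtype=float)
    spread = rates.max() - rates.min()
    if spread > tolerance:
        raise ValueError(
            f"Sampling rates differ by {spread:.3f} Hz, tolerance is {tolerance} Hz"
        )
    return float(rates.mean())

rate = mean_sampling_rate([10.0, 10.02, 9.98])
```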
- mesmerize.analysis.utils.organize_dataframe_columns(columns: Iterable[str]) Tuple[List[str], List[str], List[str]] [source]¶
Organizes DataFrame columns into data column, categorical label columns, and uuid columns.
- mesmerize.analysis.utils.pad_arrays(a: ndarray, method: str = 'random', output_size: Optional[int] = None, mode: str = 'minimum', constant: Optional[Any] = None) ndarray [source]¶
Pad all the input arrays so that they are of the same length. The length is determined by the largest input array. With the default mode ‘minimum’, the padding value for each input array is the minimum value in that array.
Padding is either appended after each array’s last index to fill it up to the length of the largest input array (method ‘fill-size’), or randomly split between both sides of the input array (method ‘random’) for easier visualization.
- Parameters:
a (np.ndarray) – 1D array where each element is a 1D array
method (str) – one of ‘fill-size’ or ‘random’, see docstring for details
output_size – not used
mode (str) – one of either ‘constant’ or ‘minimum’. If ‘minimum’, the min value of the array is used as the padding value. If ‘constant’, the value passed to the “constant” argument is used as the padding value.
constant (Any) – padding value if ‘mode’ is set to ‘constant’
- Returns:
Arrays padded according to the chosen method. 2D array of shape [n_arrays, size of largest input array]
- Return type:
np.ndarray
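The two padding methods can be sketched with `numpy.pad`. This is a simplified illustration that only covers the ‘minimum’ mode and takes a list of arrays; `pad_to_longest` and its `rng` parameter are hypothetical and not part of the documented API.

```python
import numpy as np

def pad_to_longest(arrays, method="fill-size", rng=None):
    """Pad 1D arrays with their own minimum so all match the longest one.

    'fill-size' appends all padding after the last index;
    'random' splits the padding randomly between the two sides.
    """
    rng = np.random.default_rng() if rng is None else rng
    target = max(len(a) for a in arrays)
    out = np.empty((len(arrays), target), dtype=float)
    for i, a in enumerate(arrays):
        a = np.asarray(a, dtype=float)
        missing = target - len(a)
        if method == "fill-size":
            before = 0
        elif method == "random":
            before = int(rng.integers(0, missing + 1))  # 0..missing inclusive
        else:
            raise ValueError(f"Unknown method: {method!r}")
        out[i] = np.pad(a, (before, missing - before),
                        mode="constant", constant_values=a.min())
    return out

curves = [np.array([3.0, 1.0, 2.0]), np.array([5.0, 4.0, 6.0, 7.0, 8.0])]
padded = pad_to_longest(curves, method="fill-size")
```

The result has one row per input array and as many columns as the longest input.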
Cross correlation¶
functions¶
Helper functions. Uses tslearn.cycc
- mesmerize.analysis.math.cross_correlation.ncc_c(x: ndarray, y: ndarray) ndarray [source]¶
Must pass a 1D array to both x and y
- Parameters:
x – Input array [x1, x2, x3, … xn]
y – Input array [y1, y2, y3, … yn]
- Returns:
Returns the normalized cross correlation function (as an array) of the two input vector arguments “x” and “y”
- Return type:
np.ndarray
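To make the idea of a normalized cross-correlation function concrete, the sketch below computes one with `numpy.correlate` instead of tslearn. It assumes normalization by the product of the two vector norms; `normalized_cc` is a hypothetical name and may differ in detail from `ncc_c`.

```python
import numpy as np

def normalized_cc(x, y):
    """Cross-correlation of x and y over all shifts, scaled by ||x|| * ||y||.

    The output has length len(x) + len(y) - 1; for equal-length inputs the
    middle element corresponds to zero shift.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        return np.zeros(x.size + y.size - 1)
    return np.correlate(x, y, mode="full") / denom

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
cc = normalized_cc(x, x)  # a curve correlated with itself peaks at zero shift
```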
- mesmerize.analysis.math.cross_correlation.get_omega(x: Optional[ndarray] = None, y: Optional[ndarray] = None, cc: Optional[ndarray] = None) int [source]¶
Must pass a 1D array to either both “x” and “y” or a cross-correlation function (as an array) to “cc”
- Parameters:
x – Input array [x1, x2, x3, … xn]
y – Input array [y1, y2, y3, … yn]
cc – cross-correlation function represented as an array [c1, c2, c3, … cn]
- Returns:
index (x-axis position) of the global maximum of the cross-correlation function
- Return type:
int
- mesmerize.analysis.math.cross_correlation.get_lag(x: Optional[ndarray] = None, y: Optional[ndarray] = None, cc: Optional[ndarray] = None) float [source]¶
Must pass a 1D array to either both “x” and “y” or a cross-correlation function (as an array) to “cc”
- Parameters:
x – Input array [x1, x2, x3, … xn]
y – Input array [y1, y2, y3, … yn]
cc – cross-correlation function represented as an array [c1, c2, c3, … cn]
- Returns:
Position of the global maximum of the cross-correlation function relative to the midpoint of the cross-correlation function
- Return type:
float
- mesmerize.analysis.math.cross_correlation.get_epsilon(x: Optional[ndarray] = None, y: Optional[ndarray] = None, cc: Optional[ndarray] = None) float [source]¶
Must pass a 1D vector to either both “x” and “y” or a cross-correlation function to “cc”
- Parameters:
x – Input array [x1, x2, x3, … xn]
y – Input array [y1, y2, y3, … yn]
cc – cross-correlation function represented as an array [c1, c2, c3, … cn]
- Returns:
Magnitude of the global maximum of the cross-correlation function
- Return type:
float
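The three quantities described above (omega, lag and epsilon) can be read off a cross-correlation array as shown in this sketch. It uses plain NumPy rather than the functions documented here, and it assumes that lag is measured in samples; the sign convention and units used by the library may differ.

```python
import numpy as np

x = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0])  # x delayed by two samples

# Normalized cross-correlation over all shifts (length 2n - 1)
cc = np.correlate(x, y, mode="full") / (np.linalg.norm(x) * np.linalg.norm(y))

omega = int(np.argmax(cc))      # index of the global maximum
midpoint = (cc.size - 1) // 2   # index that corresponds to zero shift
lag = omega - midpoint          # peak position relative to the midpoint, in samples
epsilon = float(cc[omega])      # magnitude of the global maximum
```

Because `y` is an exact shifted copy of `x`, the peak reaches 1 and sits two samples away from the midpoint.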
- mesmerize.analysis.math.cross_correlation.get_lag_matrix(curves: Optional[ndarray] = None, ccs: Optional[ndarray] = None) ndarray [source]¶
Get a 2D matrix of lags. Can pass either a 2D array of 1D curves or cross-correlations
- Parameters:
curves – 2D array of 1D curves
ccs – 2D array of 1D cross-correlation functions represented by arrays
- Returns:
2D matrix of lag values, shape is [n_curves, n_curves]
- Return type:
np.ndarray
- mesmerize.analysis.math.cross_correlation.get_epsilon_matrix(curves: Optional[ndarray] = None, ccs: Optional[ndarray] = None) ndarray [source]¶
Get a 2D matrix of maxima. Can pass either a 2D array of 1D curves or cross-correlations
- Parameters:
curves – 2D array of 1D curves
ccs – 2D array of 1D cross-correlation functions represented by arrays
- Returns:
2D matrix of maxima values, shape is [n_curves, n_curves]
- Return type:
np.ndarray
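A pairwise lag matrix and maxima matrix can be built by repeating the single-pair computation for every combination of curves. The following is a minimal sketch under the same assumptions as before (NumPy cross-correlation, lag in samples); `ncc` and `lag_and_epsilon_matrices` are hypothetical names.

```python
import numpy as np

def ncc(x, y):
    """Normalized cross-correlation of two 1D arrays over all shifts."""
    return np.correlate(x, y, mode="full") / (np.linalg.norm(x) * np.linalg.norm(y))

def lag_and_epsilon_matrices(curves):
    """Return (lag matrix, maxima matrix), each of shape [n_curves, n_curves]."""
    n = curves.shape[0]
    lags = np.zeros((n, n))
    eps = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            cc = ncc(curves[i], curves[j])
            omega = int(np.argmax(cc))
            lags[i, j] = omega - (cc.size - 1) // 2  # offset from zero shift
            eps[i, j] = cc[omega]                    # height of the peak
    return lags, eps

curves = np.array([
    [0.0, 1.0, 2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0, 0.0],  # first curve delayed by one sample
    [2.0, 1.0, 0.0, 0.0, 1.0, 2.0],
])
lag_matrix, epsilon_matrix = lag_and_epsilon_matrices(curves)
```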
- mesmerize.analysis.math.cross_correlation.compute_cc_data(curves: ndarray) CC_Data [source]¶
Compute cross-correlation data (cc functions, lag and maxima matrices)
- Parameters:
curves – input curves as a 2D array, shape is [n_samples, curve_size]
- Returns:
cross correlation data for the input curves as a CC_Data instance
- Return type:
CC_Data
CC_Data¶
Data container
Warning
All arguments passed to CC_Data MUST be of type numpy.ndarray for the instance to be saveable as an hdf5 file. Set numpy.unicode as the dtype for the curve_uuids and labels arrays. If the dtype is 'O' (object), the to_hdf5() method will fail.
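The dtype requirement in the warning above can be checked before constructing the container. The sketch uses `np.str_`, NumPy's unicode string type, on the assumption that a newer NumPy is installed where the older `numpy.unicode` alias may not be available; the example values are made up.

```python
import numpy as np

# An object-dtype array, as produced e.g. when strings come from a DataFrame column
labels_obj = np.array(["neuron", "glia", "neuron"], dtype=object)

# Convert to a fixed-width unicode dtype (dtype kind 'U')
labels = labels_obj.astype(np.str_)

# Or create the array with a unicode dtype from the start
curve_uuids = np.array(["a1b2", "c3d4", "e5f6"], dtype=np.str_)

# dtype.kind is 'O' for object arrays and 'U' for unicode arrays
kinds = (labels_obj.dtype.kind, labels.dtype.kind, curve_uuids.dtype.kind)
```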
- class mesmerize.analysis.cross_correlation.CC_Data(input_data: Optional[ndarray] = None, ccs: Optional[ndarray] = None, lag_matrix: Optional[ndarray] = None, epsilon_matrix: Optional[ndarray] = None, curve_uuids: Optional[ndarray] = None, labels: Optional[ndarray] = None)¶
- __init__(input_data: Optional[ndarray] = None, ccs: Optional[ndarray] = None, lag_matrix: Optional[ndarray] = None, epsilon_matrix: Optional[ndarray] = None, curve_uuids: Optional[ndarray] = None, labels: Optional[ndarray] = None)¶
Object for organizing cross-correlation data
All types must be numpy.ndarray to be compatible with hdf5
- Parameters:
ccs (np.ndarray) – array of cross-correlation functions, shape: [n_curves, n_curves, func_length]
lag_matrix (np.ndarray) – the lag matrix, shape: [n_curves, n_curves]
epsilon_matrix (np.ndarray) – the maxima matrix, shape: [n_curves, n_curves]
curve_uuids (np.ndarray) – uuids (str representation) for each of the curves, length: n_curves
labels (np.ndarray) – labels for each curve, length: n_curves
- ccs¶
array of cross-correlation functions
- lag_matrix¶
lag matrix
- epsilon_matrix¶
maxima matrix
- curve_uuids¶
uuids for each curve
- labels¶
labels for each curve
- get_threshold_matrix(matrix_type: str, lag_thr: float, max_thr: float, lag_thr_abs: bool = True) ndarray ¶
Get lag or maxima matrix with thresholds applied. Values outside the threshold are set to NaN
- Parameters:
matrix_type – one of ‘lag’ or ‘maxima’
lag_thr – lag threshold
max_thr – maxima threshold
lag_thr_abs – threshold with the absolute value of lag
- Returns:
the requested matrix with the thresholds applied to it.
- Return type:
np.ndarray
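Thresholding a matrix by setting rejected entries to NaN can be sketched as below. The exact rule used by `get_threshold_matrix` is not spelled out in this reference, so the sketch assumes that an entry is kept only when its lag is at most `lag_thr` and its maximum is at least `max_thr`; `threshold_matrix` is a hypothetical standalone function.

```python
import numpy as np

def threshold_matrix(matrix_type, lag_matrix, epsilon_matrix,
                     lag_thr, max_thr, lag_thr_abs=True):
    """Return a copy of the requested matrix with rejected entries set to NaN.

    Assumption: both thresholds are applied together to select entries.
    """
    lags = np.abs(lag_matrix) if lag_thr_abs else lag_matrix
    keep = (lags <= lag_thr) & (epsilon_matrix >= max_thr)
    source = {"lag": lag_matrix, "maxima": epsilon_matrix}[matrix_type]
    out = source.astype(float)  # astype returns a copy, the input stays untouched
    out[~keep] = np.nan
    return out

lag_matrix = np.array([[0.0, -3.0], [3.0, 0.0]])
epsilon_matrix = np.array([[1.0, 0.9], [0.9, 1.0]])

thresholded_lag = threshold_matrix("lag", lag_matrix, epsilon_matrix,
                                   lag_thr=2.0, max_thr=0.5)
thresholded_max = threshold_matrix("maxima", lag_matrix, epsilon_matrix,
                                   lag_thr=2.0, max_thr=0.5)
```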
Clustering metrics¶
- mesmerize.analysis.clustering_metrics.get_centerlike(cluster_members: ndarray, metric: Optional[Union[str, callable]] = None, dist_matrix: Optional[ndarray] = None) Tuple[ndarray, int] [source]¶
Finds the 1D time-series within a cluster that is the most centerlike
- Parameters:
cluster_members – 2D numpy array in the form [n_samples, 1D time_series]
metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances
dist_matrix – Distance matrix of the cluster members
- Returns:
The cluster member which is most centerlike, and its index in the cluster_members array
- mesmerize.analysis.clustering_metrics.get_cluster_radius(cluster_members: ndarray, metric: Optional[Union[str, callable]] = None, dist_matrix: Optional[ndarray] = None, centerlike_index: Optional[int] = None) float [source]¶
Returns the cluster radius according to chosen distance metric
- Parameters:
cluster_members – 2D numpy array in the form [n_samples, 1D time_series]
metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances
dist_matrix – Distance matrix of the cluster members
centerlike_index – Index of the centerlike cluster member within the cluster_members array
- Returns:
The cluster radius: the average distance between the most centerlike member and all other members
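The notions of a centerlike member and a cluster radius can be illustrated with a small NumPy sketch. It assumes Euclidean distance and takes the most centerlike member to be the one with the smallest summed distance to all other members; the library's own criterion is not stated here, and all function names below are hypothetical.

```python
import numpy as np

def euclidean_distance_matrix(members):
    """Pairwise Euclidean distances, shape [n_samples, n_samples]."""
    diff = members[:, None, :] - members[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def centerlike_sketch(members, dist_matrix=None):
    """Return (member, index) with the smallest summed distance to the rest."""
    if dist_matrix is None:
        dist_matrix = euclidean_distance_matrix(members)
    index = int(np.argmin(dist_matrix.sum(axis=1)))
    return members[index], index

def cluster_radius_sketch(members, dist_matrix=None):
    """Mean distance from the centerlike member to every other member."""
    if dist_matrix is None:
        dist_matrix = euclidean_distance_matrix(members)
    _, index = centerlike_sketch(members, dist_matrix)
    others = np.delete(dist_matrix[index], index)  # drop the zero self-distance
    return float(others.mean()) if others.size else 0.0

members = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [0.0, 3.0], [0.0, 10.0]])
center, center_index = centerlike_sketch(members)
radius = cluster_radius_sketch(members)
```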
- mesmerize.analysis.clustering_metrics.davies_bouldin_score(data: ndarray, cluster_labels: ndarray, metric: Union[str, callable]) Tuple[float, ndarray] [source]¶
Adapted from sklearn.metrics.davies_bouldin_score to use any distance metric
- Parameters:
data – Data that was used for clustering, [n_samples, 1D time_series]
metric – Metric to use for pairwise distance calculation, simply passed to sklearn.metrics.pairwise_distances
cluster_labels – Cluster labels
- Returns:
Davies-Bouldin score computed with the specified metric (e.g. EMD)
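A Davies-Bouldin style score with an arbitrary metric can be sketched as follows. This is an illustration under stated assumptions, not the function's code: each cluster is represented by its most centerlike member rather than a centroid, the metric is a plain Python callable, and only the scalar score is returned. The names `pairwise` and `davies_bouldin_sketch` are hypothetical.

```python
import numpy as np

def pairwise(a, b, metric):
    """Distance matrix between the rows of a and b using a callable metric."""
    return np.array([[metric(u, v) for v in b] for u in a], dtype=float)

def davies_bouldin_sketch(data, cluster_labels, metric):
    centers, radii = [], []
    for label in np.unique(cluster_labels):
        members = data[cluster_labels == label]
        dists = pairwise(members, members, metric)
        idx = int(np.argmin(dists.sum(axis=1)))  # most centerlike member
        centers.append(members[idx])
        # mean distance from that member to the other members of the cluster
        radii.append(dists[idx].sum() / max(len(members) - 1, 1))
    radii = np.array(radii)
    center_dists = pairwise(centers, centers, metric)
    np.fill_diagonal(center_dists, np.inf)  # a cluster is not compared to itself
    ratios = (radii[:, None] + radii[None, :]) / center_dists
    # worst (largest) ratio per cluster, averaged over all clusters
    return float(ratios.max(axis=1).mean())

data = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0],
                 [10.0, 0.0], [10.0, 1.0], [10.0, 2.0]])
cluster_labels = np.array([0, 0, 0, 1, 1, 1])
euclidean = lambda u, v: float(np.linalg.norm(u - v))
score = davies_bouldin_sketch(data, cluster_labels, euclidean)
```

Lower values indicate compact clusters that lie far apart relative to their radii.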