hoxo-m · mierzejk · Aug 22, 2023 · Aug 22, 2023 · Aug 23, 2023 · Aug 23, 2023
diff --git a/README.md b/README.md
@@ -223,6 +223,17 @@ compute the alpha-relative density ratio at the passed coordinates.
 -   **The function to estimate the alpha-relative density ratio** is
     named `compute_density_ratio()`.
 
+### 3.4. Setting Gaussian kernel calculation engine
+
+When working out Gaussian kernels, linear algebra calculations can be done either with `numpy` or `numba` packages. The `densratio.set_compute_kernel_target` is a function that accepts a single `str` argument to globally `target` a specified engine:
+- `numpy` - [**numpy** broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html#module-numpy.doc.broadcasting) optimized. It must be noted the underlying BLAS library (e.g. Intel's MKL) can take advantage of [multi threading model](https://software.intel.com/content/www/us/en/develop/documentation/mkl-linux-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/using-additional-threading-control.html).
+- `cpu` - [**numba** generalized universal function single thread](https://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator) optimized.
+- `parallel` - [**numba** generalized universal function multi thread](https://numba.pydata.org/numba-doc/latest/reference/jit-compilation.html#numba.guvectorize) optimized. Please be advised all [threading layer specifics](https://numba.pydata.org/numba-doc/latest/user/threading-layer.html) apply.
+
+`densration` defaults to `cpu` when `numba` is available (to take heed of multi threading technicalities), or `numpy` otherwise.
+
+Although `numba` is not a requirement of `densratio_py`, its version ≥ `0.45.1` is necessary to set the calculation engine to `cpu` or `parallel`.
+
 ## 4. Multi Dimensional Data Samples
 
 So far, we have deal with one-dimensional data samples `x` and `y`.
@@ -305,7 +316,26 @@ plt.show()
 
 ![](README_files/figure-gfm/compare-2d-5.png)<!-- -->
 
-## 5. References
+## 5. `DensityRatio.alpha_normalize` function
+This is a supplementary function, that by applying the following linear transformation changes the lower `0` bound (infimum) of its `values` argument in the range of `[0, 1]` to `[alpha, 1]` :
+```math
+x' = (1 - alpha) * x + alpha =  x + alpha * (1 - x)
+```
+The function returns an array whose values are elements of the `[alpha, alpha^-1]` bounded set; the upper `alpha^-1` bound (supremum) is due to properties of the alpha-relative density ratio estimator. As long as the outcome is comprised of positive values only, when the `densratio_obj.alpha_normalize` function is applied to the `densratio_obj.compute_density_ratio` result, it renders the density ratio estimation eligible to be further processed on the logarithmic scale. 
+
+### 5.1. Implementation specifics
+To preserve vital estimator properties, there are 2 invariants always met by the function results:
+1. The number of unique values must not changed.
+2. The order of output values remains the same as in the input argument (as determined by [`numpy.argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html)).
+
+Due to floating‑point arithmetic, if two consecutive input numbers that are not equal are to be transformed to the same outcome, in order to satisfy the constraints in the list above and yield different values as well, the [`numpy.nextafter`](https://numpy.org/doc/stable/reference/generated/numpy.nextafter.html) is employed in the direction of `0`.  
+By virtue of this implementation approach, in extreme cases there is a possibility of output values that:
+- are less than `alpha`, or
+- are nonpositive.
+
+Should it happen, a relevant warning is issued.
+
+## 6. References
 
 \[1\] Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T.
 **Statistical outlier detection using direct density ratio estimation.**
@@ -322,7 +352,7 @@ in Machine Learning.** Cambridge University Press 2012.
 Detection in Time-Series Data by Relative Density-Ratio Estimation**
 Neural Networks, 2013.
 
-## 6. Related Work
+## 7. Related Work
 
 -   densratio for R <https://github.com/hoxo-m/densratio>
 -   pykliep <https://github.com/srome/pykliep>
diff --git a/README.rmd b/README.rmd
@@ -196,6 +196,17 @@ print(result)
 - **Kernel weights(theta)** are theta parameters in the linear kernel model. You can find these values in `result.theta`.
 - **The function to estimate the alpha-relative density ratio** is named `compute_density_ratio()`.
 
+### 3.4. Setting Gaussian kernel calculation engine
+
+When working out Gaussian kernels, linear algebra calculations can be done either with `numpy` or `numba` packages. The `densratio.set_compute_kernel_target` is a function that accepts a single `str` argument to globally `target` a specified engine:
+- `numpy` - [**numpy** broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html#module-numpy.doc.broadcasting) optimized. It must be noted the underlying BLAS library (e.g. Intel's MKL) can take advantage of [multi threading model](https://software.intel.com/content/www/us/en/develop/documentation/mkl-linux-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/using-additional-threading-control.html).
+- `cpu` - [**numba** generalized universal function single thread](https://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator) optimized.
+- `parallel` - [**numba** generalized universal function multi thread](https://numba.pydata.org/numba-doc/latest/reference/jit-compilation.html#numba.guvectorize) optimized. Please be advised all [threading layer specifics](https://numba.pydata.org/numba-doc/latest/user/threading-layer.html) apply.
+
+`densration` defaults to `cpu` when `numba` is available (to take heed of multi threading technicalities), or `numpy` otherwise.
+
+Although `numba` is not a requirement of `densratio_py`, its version ≥ `0.45.1` is necessary to set the calculation engine to `cpu` or `parallel`.
+
 ## 4. Multi Dimensional Data Samples
 
 So far, we have deal with one-dimensional data samples `x` and `y`.
@@ -251,7 +262,26 @@ plt.title("Estimated Alpha-Relative Density Ratio")
 plt.show()
 ```
 
-## 5. References
+## 5. `DensityRatio.alpha_normalize` function
+This is a supplementary function, that by applying the following linear transformation changes the lower `0` bound (infimum) of its `values` argument in the range of `[0, 1]` to `[alpha, 1]` :
+```math
+x' = (1 - alpha) * x + alpha =  x + alpha * (1 - x)
+```
+The function returns an array whose values are elements of the `[alpha, alpha^-1]` bounded set; the upper `alpha^-1` bound (supremum) is due to properties of the alpha-relative density ratio estimator. As long as the outcome is comprised of positive values only, when the `densratio_obj.alpha_normalize` function is applied to the `densratio_obj.compute_density_ratio` result, it renders the density ratio estimation eligible to be further processed on the logarithmic scale.
+
+### 5.1. Implementation specifics
+To preserve vital estimator properties, there are 2 invariants always met by the function results:
+1. The number of unique values must not changed.
+2. The order of output values remains the same as in the input argument (as determined by [`numpy.argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html)).
+
+Due to floating‑point arithmetic, if two consecutive input numbers that are not equal are to be transformed to the same outcome, in order to satisfy the constraints in the list above and yield different values as well, the [`numpy.nextafter`](https://numpy.org/doc/stable/reference/generated/numpy.nextafter.html) is employed in the direction of `0`.
+By virtue of this implementation approach, in extreme cases there is a possibility of output values that:
+- are less than `alpha`, or
+- are nonpositive.
+
+Should it happen, a relevant warning is issued.
+
+## 6. References
 
 [1] Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T.
 **Statistical outlier detection using direct density ratio estimation.**
@@ -268,7 +298,7 @@ Cambridge University Press 2012.
 **Change-Point Detection in Time-Series Data by Relative Density-Ratio Estimation**
 Neural Networks, 2013.
 
-## 6. Related Work
+## 7. Related Work
 
 - densratio for R https://github.com/hoxo-m/densratio
 - pykliep https://github.com/srome/pykliep
diff --git a/densratio/RuLSIF.py b/densratio/RuLSIF.py
@@ -10,12 +10,12 @@
         Journal of Machine Learning Research 10 (2009) 1391-1445.
 """
 
-from numpy import array, asarray, asmatrix, diag, diagflat, empty, exp, inf, log, matrix, multiply, ones, power, sum
-from numpy.random import randint
+from numpy import array, asarray, diag, diagflat, empty, exp, inf, log, multiply, ones, power, sum
 from numpy.linalg import solve
 from warnings import warn
 from .density_ratio import DensityRatio, KernelInfo
-from .helpers import guvectorize_compute, np_float, to_ndarray
+from .helpers import guvectorize_compute, np_float, semi_stratified_sample, to_ndarray
+from .helpers import alpha_normalize as static_alpha_normalize
 
 
 def RuLSIF(x, y, alpha, sigma_range, lambda_range, kernel_num=100, verbose=True):
@@ -26,8 +26,8 @@ def RuLSIF(x, y, alpha, sigma_range, lambda_range, kernel_num=100, verbose=True)
     p_alpha(x) = alpha * p(x) + (1 - alpha) * q(x)
 
     Arguments:
-        x (numpy.matrix): Sample from p(x).
-        y (numpy.matrix): Sample from q(x).
+        x (numpy.ndarray): Sample from p(x).
+        y (numpy.ndarray): Sample from q(x).
         alpha (float): Mixture parameter.
         sigma_range (list<float>): Search range of Gaussian kernel bandwidth.
         lambda_range (list<float>): Search range of regularization parameter.
@@ -46,7 +46,7 @@ def RuLSIF(x, y, alpha, sigma_range, lambda_range, kernel_num=100, verbose=True)
     kernel_num = min(kernel_num, nx)
 
     # Randomly take a subset of x, to identify centers for the kernels.
-    centers = x[randint(nx, size=kernel_num)]
+    centers = x[semi_stratified_sample(x, size=kernel_num)]
 
     if verbose:
         print("RuLSIF starting...")
@@ -87,6 +87,21 @@ def alpha_density_ratio(coordinates):
 
         return alpha_density_ratio
 
+    def alpha_normalize(values: array) -> array:
+        """
+        Normalizes values less than 1 so the minimum value to replace 0 is symmetrical to alpha^-1
+        with respect to the natural logarithm.
+
+        Arguments:
+            values (numpy.array): A vector to normalize.
+
+        Returns:
+            Normalized numpy.array object that preserves the order and the number of unique input argument values.
+        """
+
+        return static_alpha_normalize(values, alpha)
+
+
     # Compute the approximate alpha-relative PE-divergence, given samples x and y from the respective distributions.
     def alpha_PE_divergence(x, y):
         # This is Y, in Reference 1.
@@ -112,23 +127,24 @@ def alpha_KL_divergence(x, y):
         x = to_ndarray(x)
 
         # Obtain alpha-relative density ratio at these points.
-        g_x = alpha_density_ratio(x)
+        g_x = alpha_normalize(alpha_density_ratio(x))
 
         # Compute the alpha-relative KL-divergence.
         n = x.shape[0]
-        divergence = log(g_x).sum(axis=0) / n
+        divergence = log(g_x).sum(axis=0) / n if (g_x > 0.).all() else '[not calculated]'
         return divergence
 
     alpha_PE = alpha_PE_divergence(x, y)
     alpha_KL = alpha_KL_divergence(x, y)
 
     if verbose:
         print("Approximate alpha-relative PE-divergence = {:03.2f}".format(alpha_PE))
-        print("Approximate alpha-relative KL-divergence = {:03.2f}".format(alpha_KL))
+        alpha_KL_format = '{}' if isinstance(alpha_KL, str) else '{:03.2f}'
+        print("Approximate alpha-relative KL-divergence =", alpha_KL_format.format(alpha_KL))
 
     kernel_info = KernelInfo(kernel_type="Gaussian", kernel_num=kernel_num, sigma=sigma, centers=centers)
     result = DensityRatio(method="RuLSIF", alpha=alpha, theta=theta, lambda_=lambda_, alpha_PE=alpha_PE, alpha_KL=alpha_KL,
-                          kernel_info=kernel_info, compute_density_ratio=alpha_density_ratio)
+                          kernel_info=kernel_info, compute_density_ratio=alpha_density_ratio, alpha_normalize=alpha_normalize)
 
     if verbose:
         print("RuLSIF completed.")
@@ -191,12 +207,8 @@ def _compute_kernel_Gaussian(x_list, y_row, neg_gamma, res) -> None:
 
 def _target_numpy_wrapper(x_list, y_list, neg_gamma):
     res = empty((y_list.shape[0], x_list.shape[0]), np_float)
-    if isinstance(x_list, matrix) or isinstance(y_list, matrix):
-        res = asmatrix(res)
-
     for j, y_row in enumerate(y_list):
-        # `.T` aligns shapes for matrices, does nothing for 1D ndarray.
-        _compute_kernel_Gaussian(x_list, y_row, neg_gamma, res[j].T)
+        _compute_kernel_Gaussian(x_list, y_row, neg_gamma, res[j])
 
     return res
 
@@ -208,7 +220,7 @@ def _target_numpy_wrapper(x_list, y_list, neg_gamma):
 _compute_function = _compute_functions['cpu' if 'cpu' in _compute_functions else 'numpy']
 
 
-# Returns a 2D numpy matrix of kernel evaluated at the gridpoints with coordinates from x_list and y_list.
+# Returns a 2D numpy ndarray of kernel evaluated at the gridpoints with coordinates from x_list and y_list.
 def compute_kernel_Gaussian(x_list, y_list, sigma):
     return _compute_function(x_list, y_list, -.5 * sigma ** -2).T
 

diff --git a/densratio/__init__.py b/densratio/__init__.py
@@ -1,7 +1,9 @@
 from warnings import filterwarnings
 from .core import densratio
+from .helpers import alpha_normalize
 from .RuLSIF import set_compute_kernel_target
 
 
 filterwarnings('default', message='\'numba\'', category=ImportWarning, module='densratio')
-__all__ = ['densratio', 'set_compute_kernel_target']
+filterwarnings('always', message='Normalized vector contains', category=RuntimeWarning, module=r'densratio\.helpers')
+__all__ = ['alpha_normalize', 'densratio', 'set_compute_kernel_target']
diff --git a/densratio/density_ratio.py b/densratio/density_ratio.py
@@ -4,7 +4,7 @@
 
 class DensityRatio:
     """Density Ratio."""
-    def __init__(self, method, alpha, theta, lambda_, alpha_PE, alpha_KL, kernel_info, compute_density_ratio):
+    def __init__(self, method, alpha, theta, lambda_, alpha_PE, alpha_KL, kernel_info, compute_density_ratio, alpha_normalize):
         self.method = method
         self.alpha = alpha
         self.theta = theta
@@ -13,6 +13,7 @@ def __init__(self, method, alpha, theta, lambda_, alpha_PE, alpha_KL, kernel_inf
         self.alpha_KL = alpha_KL
         self.kernel_info = kernel_info
         self.compute_density_ratio = compute_density_ratio
+        self.alpha_normalize = alpha_normalize
 
     def __str__(self):
         return """

diff --git a/densratio/helpers.py b/densratio/helpers.py
@@ -1,4 +1,7 @@
-from numpy import array, matrix, ndarray, result_type
+import numpy as np
+
+from numpy import array, ndarray, result_type
+from warnings import warn
 
 
 np_float = result_type(float)
@@ -30,6 +33,112 @@ def to_ndarray(x):
     elif str(type(x)) == "<class 'pandas.core.frame.DataFrame'>":
         return x.values
     elif not x:
-        raise ValueError("Cannot transform to numpy.matrix.")
+        raise ValueError("Cannot transform to numpy.ndarray.")
     else:
         return to_ndarray(array(x))
+
+
+def alpha_normalize(values: ndarray, alpha: float) -> ndarray:
+    """
+    Normalizes values less than 1 so the minimum value to replace 0 is symmetrical to alpha^-1
+    with respect to the natural logarithm.
+
+    Arguments:
+        values (numpy.array): A vector to normalize.
+        alpha (float): The nonnegative normalization term.
+
+    Returns:
+        Normalized numpy.array object that preserves the order and the number of unique input argument values.
+    """
+    values = np.asarray(values)
+    if values.ndim != 1:
+        raise ValueError('\'values\' must a 1d vector.')
+
+    if alpha <= 0.:
+        if alpha < 0.:
+            warn('\'alpha\' is negative, normalization aborted.', RuntimeWarning)
+
+        return values
+
+    a = 1. - alpha
+    inserted = last_value = 1.
+    outcome = np.empty(values.shape, dtype=values.dtype)
+
+    values_argsort = np.argsort(values)
+    for i in np.flip(values_argsort):
+        value = values[i]
+        if value >= 1.:
+            outcome[i] = value
+            continue
+
+        if value < last_value:
+            new_value = inserted - a * (last_value - value)
+            inserted = np.nextafter(inserted, 0.) if new_value == inserted else new_value
+            last_value = value
+        else:
+            assert value == last_value
+
+        outcome[i] = inserted
+
+    if inserted <= 0.:
+        warn(f'Normalized vector contains at least one nonpositive [min={inserted}] value.', RuntimeWarning)
+    elif inserted < alpha:
+        warn(f'Normalized vector contains at least one value [min={inserted}] less than alpha [{alpha}].',
+             RuntimeWarning)
+
+    return outcome
+
+
+def semi_stratified_sample(data: ndarray, size: int) -> ndarray:
+    if data.ndim > 2:
+        raise ValueError('Only single and 2d arrays are supported.')
+    if not size:
+        return np.empty(0)
+
+    data_length = data.shape[0]
+    result = np.arange(data_length, dtype=int)
+    if size == data_length:
+        np.random.shuffle(result)
+        return result
+    if size < 0:
+        raise ValueError('Sample size must be a non-negative integer number.')
+    if size > data_length:
+        raise ValueError('Sample size cannot exceed the shape of input data.')
+
+    dims = data.shape[1]
+    indexed = np.column_stack((data, result))
+    result = np.empty(0, dtype=indexed.dtype)
+
+    samples_no = size // dims
+    if samples_no:
+        percentiles = np.linspace(0., 100., num=samples_no, endpoint=False)[1:]
+
+        for d in range(dims):
+            column = indexed[..., d]
+            quantiles = np.append(column.min(), np.percentile(column, percentiles))
+            indices = []
+            i, sample_size = 0, 1
+
+            while i < samples_no:
+                left = quantiles[i]
+                i += 1
+                right = np.Inf if i == samples_no else quantiles[i]
+                try:
+                    indices.extend(np.random.choice(
+                        indexed[(left <= column) & (column < right), dims],
+                        size=sample_size,
+                        replace=False))
+                except ValueError:
+                    sample_size += 1
+                    continue
+                else:
+                    sample_size = 1
+
+            indexed = indexed[~np.isin(indexed[:, dims], indices)]
+            result = np.append(result, indices)
+
+    result = np.append(
+        result,
+        np.random.choice(indexed[..., dims], size=size - result.size, replace=False)).astype(int)
+    np.random.shuffle(result)
+    return result
diff --git a/tests/test_RuLSIF.py b/tests/test_RuLSIF.py
@@ -1,7 +1,7 @@
 import unittest
 
 from scipy.stats import norm, multivariate_normal
-from numpy import linspace
+from numpy import linspace, mgrid
 from .context import densratio
 
 
@@ -45,7 +45,9 @@ def test_alphadensratio_2d(self):
         y = multivariate_normal.rvs(size=300, mean=[1, 1], cov=[[1./2, 0], [0, 2]], random_state=71)
         result = densratio(x, y, alpha=0.5)
         self.assertIsNotNone(result)
-        density_ratio = result.compute_density_ratio(linspace(-1, 3))
+        space_range = slice(-1, 3, 50j)
+        space_2d = mgrid[space_range, space_range].reshape(2, -1).T
+        density_ratio = result.compute_density_ratio(space_2d)
         self.assertTrue((density_ratio >= 0).all())
 
     def test_densratio_dimension_error(self):