Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregated pull request #23

Open
wants to merge 20 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
20 commits
Select commit Hold shift + click to select a range
391da6d
alpha_normalize: normalize densratio values < 1 so the minimum value …
mierzejk Aug 22, 2023
5fda5b4
Change how kernel centres are selected:
mierzejk Aug 22, 2023
1a1ed58
Fixes tests.test_RuLSIF::test_alphadensratio_2d by replacing 1d linsp…
mierzejk Aug 23, 2023
3a332fb
Deleting leftovers of `numpy.matrix` type.
mierzejk Aug 23, 2023
7089d89
Merge branch 'fix/remove_matrix_type'
mierzejk Aug 23, 2023
fe01dda
Always print 'Normalized vector contains…' warnings.
mierzejk Aug 23, 2023
9e08899
Refactoring of warning. Add tests.
mierzejk Aug 24, 2023
5f7d387
Refactoring of warnings. Add tests.
mierzejk Aug 24, 2023
a9228d9
Merge remote-tracking branch 'origin/alpha_normalize' into alpha_norm…
mierzejk Aug 24, 2023
451d657
KL_divergence numerator guard.
mierzejk Aug 24, 2023
76152be
Merge branch 'alpha_normalize'
mierzejk Aug 24, 2023
8dedd4d
Handle alpha_KL_divergence '[not calculated]' str value for verbose l…
mierzejk Aug 24, 2023
f800912
Merge branch 'alpha_normalize'
mierzejk Aug 24, 2023
477b28a
semi_stratified_sample argument name refactoring: samples → size.
mierzejk Aug 27, 2023
d818fab
Refactoring: remove redundant code.
mierzejk Aug 27, 2023
58f5eb6
Merge branch 'semi_stratified_sample'
mierzejk Aug 27, 2023
42a37bd
densratio.set_compute_kernel_target documentation.
mierzejk Aug 27, 2023
330342b
Update README with section on DensityRatio.alpha_normalize function.
mierzejk Aug 27, 2023
68aee95
Merge branch 'alpha_normalize'
mierzejk Aug 27, 2023
ca7142e
Merge branch 'update_readme'
mierzejk Aug 27, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 32 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -223,6 +223,17 @@ compute the alpha-relative density ratio at the passed coordinates.
- **The function to estimate the alpha-relative density ratio** is
named `compute_density_ratio()`.

### 3.4. Setting Gaussian kernel calculation engine

When working out Gaussian kernels, linear algebra calculations can be done either with `numpy` or `numba` packages. The `densratio.set_compute_kernel_target` is a function that accepts a single `str` argument to globally `target` a specified engine:
- `numpy` - [**numpy** broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html#module-numpy.doc.broadcasting) optimized. It must be noted the underlying BLAS library (e.g. Intel's MKL) can take advantage of [multi threading model](https://software.intel.com/content/www/us/en/develop/documentation/mkl-linux-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/using-additional-threading-control.html).
- `cpu` - [**numba** generalized universal function single thread](https://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator) optimized.
- `parallel` - [**numba** generalized universal function multi thread](https://numba.pydata.org/numba-doc/latest/reference/jit-compilation.html#numba.guvectorize) optimized. Please be advised all [threading layer specifics](https://numba.pydata.org/numba-doc/latest/user/threading-layer.html) apply.

`densration` defaults to `cpu` when `numba` is available (to take heed of multi threading technicalities), or `numpy` otherwise.

Although `numba` is not a requirement of `densratio_py`, its version ≥ `0.45.1` is necessary to set the calculation engine to `cpu` or `parallel`.

## 4. Multi Dimensional Data Samples

So far, we have deal with one-dimensional data samples `x` and `y`.
Expand Down Expand Up @@ -305,7 +316,26 @@ plt.show()

![](README_files/figure-gfm/compare-2d-5.png)<!-- -->

## 5. References
## 5. `DensityRatio.alpha_normalize` function
This is a supplementary function, that by applying the following linear transformation changes the lower `0` bound (infimum) of its `values` argument in the range of `[0, 1]` to `[alpha, 1]` :
```math
x' = (1 - alpha) * x + alpha = x + alpha * (1 - x)
```
The function returns an array whose values are elements of the `[alpha, alpha^-1]` bounded set; the upper `alpha^-1` bound (supremum) is due to properties of the alpha-relative density ratio estimator. As long as the outcome is comprised of positive values only, when the `densratio_obj.alpha_normalize` function is applied to the `densratio_obj.compute_density_ratio` result, it renders the density ratio estimation eligible to be further processed on the logarithmic scale.

### 5.1. Implementation specifics
To preserve vital estimator properties, there are 2 invariants always met by the function results:
1. The number of unique values must not changed.
2. The order of output values remains the same as in the input argument (as determined by [`numpy.argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html)).

Due to floating‑point arithmetic, if two consecutive input numbers that are not equal are to be transformed to the same outcome, in order to satisfy the constraints in the list above and yield different values as well, the [`numpy.nextafter`](https://numpy.org/doc/stable/reference/generated/numpy.nextafter.html) is employed in the direction of `0`.
By virtue of this implementation approach, in extreme cases there is a possibility of output values that:
- are less than `alpha`, or
- are nonpositive.

Should it happen, a relevant warning is issued.

## 6. References

\[1\] Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T.
**Statistical outlier detection using direct density ratio estimation.**
Expand All @@ -322,7 +352,7 @@ in Machine Learning.** Cambridge University Press 2012.
Detection in Time-Series Data by Relative Density-Ratio Estimation**
Neural Networks, 2013.

## 6. Related Work
## 7. Related Work

- densratio for R <https://github.com/hoxo-m/densratio>
- pykliep <https://github.com/srome/pykliep>
34 changes: 32 additions & 2 deletions README.rmd
Original file line number Diff line number Diff line change
Expand Up @@ -196,6 +196,17 @@ print(result)
- **Kernel weights(theta)** are theta parameters in the linear kernel model. You can find these values in `result.theta`.
- **The function to estimate the alpha-relative density ratio** is named `compute_density_ratio()`.

### 3.4. Setting Gaussian kernel calculation engine

When working out Gaussian kernels, linear algebra calculations can be done either with `numpy` or `numba` packages. The `densratio.set_compute_kernel_target` is a function that accepts a single `str` argument to globally `target` a specified engine:
- `numpy` - [**numpy** broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html#module-numpy.doc.broadcasting) optimized. It must be noted the underlying BLAS library (e.g. Intel's MKL) can take advantage of [multi threading model](https://software.intel.com/content/www/us/en/develop/documentation/mkl-linux-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/using-additional-threading-control.html).
- `cpu` - [**numba** generalized universal function single thread](https://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator) optimized.
- `parallel` - [**numba** generalized universal function multi thread](https://numba.pydata.org/numba-doc/latest/reference/jit-compilation.html#numba.guvectorize) optimized. Please be advised all [threading layer specifics](https://numba.pydata.org/numba-doc/latest/user/threading-layer.html) apply.

`densration` defaults to `cpu` when `numba` is available (to take heed of multi threading technicalities), or `numpy` otherwise.

Although `numba` is not a requirement of `densratio_py`, its version ≥ `0.45.1` is necessary to set the calculation engine to `cpu` or `parallel`.

## 4. Multi Dimensional Data Samples

So far, we have deal with one-dimensional data samples `x` and `y`.
Expand Down Expand Up @@ -251,7 +262,26 @@ plt.title("Estimated Alpha-Relative Density Ratio")
plt.show()
```

## 5. References
## 5. `DensityRatio.alpha_normalize` function
This is a supplementary function, that by applying the following linear transformation changes the lower `0` bound (infimum) of its `values` argument in the range of `[0, 1]` to `[alpha, 1]` :
```math
x' = (1 - alpha) * x + alpha = x + alpha * (1 - x)
```
The function returns an array whose values are elements of the `[alpha, alpha^-1]` bounded set; the upper `alpha^-1` bound (supremum) is due to properties of the alpha-relative density ratio estimator. As long as the outcome is comprised of positive values only, when the `densratio_obj.alpha_normalize` function is applied to the `densratio_obj.compute_density_ratio` result, it renders the density ratio estimation eligible to be further processed on the logarithmic scale.

### 5.1. Implementation specifics
To preserve vital estimator properties, there are 2 invariants always met by the function results:
1. The number of unique values must not changed.
2. The order of output values remains the same as in the input argument (as determined by [`numpy.argsort`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html)).

Due to floating‑point arithmetic, if two consecutive input numbers that are not equal are to be transformed to the same outcome, in order to satisfy the constraints in the list above and yield different values as well, the [`numpy.nextafter`](https://numpy.org/doc/stable/reference/generated/numpy.nextafter.html) is employed in the direction of `0`.
By virtue of this implementation approach, in extreme cases there is a possibility of output values that:
- are less than `alpha`, or
- are nonpositive.

Should it happen, a relevant warning is issued.

## 6. References

[1] Hido, S., Tsuboi, Y., Kashima, H., Sugiyama, M., & Kanamori, T.
**Statistical outlier detection using direct density ratio estimation.**
Expand All @@ -268,7 +298,7 @@ Cambridge University Press 2012.
**Change-Point Detection in Time-Series Data by Relative Density-Ratio Estimation**
Neural Networks, 2013.

## 6. Related Work
## 7. Related Work

- densratio for R https://github.com/hoxo-m/densratio
- pykliep https://github.com/srome/pykliep
44 changes: 28 additions & 16 deletions densratio/RuLSIF.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,12 +10,12 @@
Journal of Machine Learning Research 10 (2009) 1391-1445.
"""

from numpy import array, asarray, asmatrix, diag, diagflat, empty, exp, inf, log, matrix, multiply, ones, power, sum
from numpy.random import randint
from numpy import array, asarray, diag, diagflat, empty, exp, inf, log, multiply, ones, power, sum
from numpy.linalg import solve
from warnings import warn
from .density_ratio import DensityRatio, KernelInfo
from .helpers import guvectorize_compute, np_float, to_ndarray
from .helpers import guvectorize_compute, np_float, semi_stratified_sample, to_ndarray
from .helpers import alpha_normalize as static_alpha_normalize


def RuLSIF(x, y, alpha, sigma_range, lambda_range, kernel_num=100, verbose=True):
Expand All @@ -26,8 +26,8 @@ def RuLSIF(x, y, alpha, sigma_range, lambda_range, kernel_num=100, verbose=True)
p_alpha(x) = alpha * p(x) + (1 - alpha) * q(x)

Arguments:
x (numpy.matrix): Sample from p(x).
y (numpy.matrix): Sample from q(x).
x (numpy.ndarray): Sample from p(x).
y (numpy.ndarray): Sample from q(x).
alpha (float): Mixture parameter.
sigma_range (list<float>): Search range of Gaussian kernel bandwidth.
lambda_range (list<float>): Search range of regularization parameter.
Expand All @@ -46,7 +46,7 @@ def RuLSIF(x, y, alpha, sigma_range, lambda_range, kernel_num=100, verbose=True)
kernel_num = min(kernel_num, nx)

# Randomly take a subset of x, to identify centers for the kernels.
centers = x[randint(nx, size=kernel_num)]
centers = x[semi_stratified_sample(x, size=kernel_num)]

if verbose:
print("RuLSIF starting...")
Expand Down Expand Up @@ -87,6 +87,21 @@ def alpha_density_ratio(coordinates):

return alpha_density_ratio

def alpha_normalize(values: array) -> array:
"""
Normalizes values less than 1 so the minimum value to replace 0 is symmetrical to alpha^-1
with respect to the natural logarithm.

Arguments:
values (numpy.array): A vector to normalize.

Returns:
Normalized numpy.array object that preserves the order and the number of unique input argument values.
"""

return static_alpha_normalize(values, alpha)


# Compute the approximate alpha-relative PE-divergence, given samples x and y from the respective distributions.
def alpha_PE_divergence(x, y):
# This is Y, in Reference 1.
Expand All @@ -112,23 +127,24 @@ def alpha_KL_divergence(x, y):
x = to_ndarray(x)

# Obtain alpha-relative density ratio at these points.
g_x = alpha_density_ratio(x)
g_x = alpha_normalize(alpha_density_ratio(x))

# Compute the alpha-relative KL-divergence.
n = x.shape[0]
divergence = log(g_x).sum(axis=0) / n
divergence = log(g_x).sum(axis=0) / n if (g_x > 0.).all() else '[not calculated]'
return divergence

alpha_PE = alpha_PE_divergence(x, y)
alpha_KL = alpha_KL_divergence(x, y)

if verbose:
print("Approximate alpha-relative PE-divergence = {:03.2f}".format(alpha_PE))
print("Approximate alpha-relative KL-divergence = {:03.2f}".format(alpha_KL))
alpha_KL_format = '{}' if isinstance(alpha_KL, str) else '{:03.2f}'
print("Approximate alpha-relative KL-divergence =", alpha_KL_format.format(alpha_KL))

kernel_info = KernelInfo(kernel_type="Gaussian", kernel_num=kernel_num, sigma=sigma, centers=centers)
result = DensityRatio(method="RuLSIF", alpha=alpha, theta=theta, lambda_=lambda_, alpha_PE=alpha_PE, alpha_KL=alpha_KL,
kernel_info=kernel_info, compute_density_ratio=alpha_density_ratio)
kernel_info=kernel_info, compute_density_ratio=alpha_density_ratio, alpha_normalize=alpha_normalize)

if verbose:
print("RuLSIF completed.")
Expand Down Expand Up @@ -191,12 +207,8 @@ def _compute_kernel_Gaussian(x_list, y_row, neg_gamma, res) -> None:

def _target_numpy_wrapper(x_list, y_list, neg_gamma):
res = empty((y_list.shape[0], x_list.shape[0]), np_float)
if isinstance(x_list, matrix) or isinstance(y_list, matrix):
res = asmatrix(res)

for j, y_row in enumerate(y_list):
# `.T` aligns shapes for matrices, does nothing for 1D ndarray.
_compute_kernel_Gaussian(x_list, y_row, neg_gamma, res[j].T)
_compute_kernel_Gaussian(x_list, y_row, neg_gamma, res[j])

return res

Expand All @@ -208,7 +220,7 @@ def _target_numpy_wrapper(x_list, y_list, neg_gamma):
_compute_function = _compute_functions['cpu' if 'cpu' in _compute_functions else 'numpy']


# Returns a 2D numpy matrix of kernel evaluated at the gridpoints with coordinates from x_list and y_list.
# Returns a 2D numpy ndarray of kernel evaluated at the gridpoints with coordinates from x_list and y_list.
def compute_kernel_Gaussian(x_list, y_list, sigma):
return _compute_function(x_list, y_list, -.5 * sigma ** -2).T

Expand Down
4 changes: 3 additions & 1 deletion densratio/__init__.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,9 @@
from warnings import filterwarnings
from .core import densratio
from .helpers import alpha_normalize
from .RuLSIF import set_compute_kernel_target


filterwarnings('default', message='\'numba\'', category=ImportWarning, module='densratio')
__all__ = ['densratio', 'set_compute_kernel_target']
filterwarnings('always', message='Normalized vector contains', category=RuntimeWarning, module=r'densratio\.helpers')
__all__ = ['alpha_normalize', 'densratio', 'set_compute_kernel_target']
3 changes: 2 additions & 1 deletion densratio/density_ratio.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@

class DensityRatio:
"""Density Ratio."""
def __init__(self, method, alpha, theta, lambda_, alpha_PE, alpha_KL, kernel_info, compute_density_ratio):
def __init__(self, method, alpha, theta, lambda_, alpha_PE, alpha_KL, kernel_info, compute_density_ratio, alpha_normalize):
self.method = method
self.alpha = alpha
self.theta = theta
Expand All @@ -13,6 +13,7 @@ def __init__(self, method, alpha, theta, lambda_, alpha_PE, alpha_KL, kernel_inf
self.alpha_KL = alpha_KL
self.kernel_info = kernel_info
self.compute_density_ratio = compute_density_ratio
self.alpha_normalize = alpha_normalize

def __str__(self):
return """
Expand Down
113 changes: 111 additions & 2 deletions densratio/helpers.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,7 @@
from numpy import array, matrix, ndarray, result_type
import numpy as np

from numpy import array, ndarray, result_type
from warnings import warn


np_float = result_type(float)
Expand Down Expand Up @@ -30,6 +33,112 @@ def to_ndarray(x):
elif str(type(x)) == "<class 'pandas.core.frame.DataFrame'>":
return x.values
elif not x:
raise ValueError("Cannot transform to numpy.matrix.")
raise ValueError("Cannot transform to numpy.ndarray.")
else:
return to_ndarray(array(x))


def alpha_normalize(values: ndarray, alpha: float) -> ndarray:
"""
Normalizes values less than 1 so the minimum value to replace 0 is symmetrical to alpha^-1
with respect to the natural logarithm.

Arguments:
values (numpy.array): A vector to normalize.
alpha (float): The nonnegative normalization term.

Returns:
Normalized numpy.array object that preserves the order and the number of unique input argument values.
"""
values = np.asarray(values)
if values.ndim != 1:
raise ValueError('\'values\' must a 1d vector.')

if alpha <= 0.:
if alpha < 0.:
warn('\'alpha\' is negative, normalization aborted.', RuntimeWarning)

return values

a = 1. - alpha
inserted = last_value = 1.
outcome = np.empty(values.shape, dtype=values.dtype)

values_argsort = np.argsort(values)
for i in np.flip(values_argsort):
value = values[i]
if value >= 1.:
outcome[i] = value
continue

if value < last_value:
new_value = inserted - a * (last_value - value)
inserted = np.nextafter(inserted, 0.) if new_value == inserted else new_value
last_value = value
else:
assert value == last_value

outcome[i] = inserted

if inserted <= 0.:
warn(f'Normalized vector contains at least one nonpositive [min={inserted}] value.', RuntimeWarning)
elif inserted < alpha:
warn(f'Normalized vector contains at least one value [min={inserted}] less than alpha [{alpha}].',
RuntimeWarning)

return outcome


def semi_stratified_sample(data: ndarray, size: int) -> ndarray:
if data.ndim > 2:
raise ValueError('Only single and 2d arrays are supported.')
if not size:
return np.empty(0)

data_length = data.shape[0]
result = np.arange(data_length, dtype=int)
if size == data_length:
np.random.shuffle(result)
return result
if size < 0:
raise ValueError('Sample size must be a non-negative integer number.')
if size > data_length:
raise ValueError('Sample size cannot exceed the shape of input data.')

dims = data.shape[1]
indexed = np.column_stack((data, result))
result = np.empty(0, dtype=indexed.dtype)

samples_no = size // dims
if samples_no:
percentiles = np.linspace(0., 100., num=samples_no, endpoint=False)[1:]

for d in range(dims):
column = indexed[..., d]
quantiles = np.append(column.min(), np.percentile(column, percentiles))
indices = []
i, sample_size = 0, 1

while i < samples_no:
left = quantiles[i]
i += 1
right = np.Inf if i == samples_no else quantiles[i]
try:
indices.extend(np.random.choice(
indexed[(left <= column) & (column < right), dims],
size=sample_size,
replace=False))
except ValueError:
sample_size += 1
continue
else:
sample_size = 1

indexed = indexed[~np.isin(indexed[:, dims], indices)]
result = np.append(result, indices)

result = np.append(
result,
np.random.choice(indexed[..., dims], size=size - result.size, replace=False)).astype(int)
np.random.shuffle(result)
return result
6 changes: 4 additions & 2 deletions tests/test_RuLSIF.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
import unittest

from scipy.stats import norm, multivariate_normal
from numpy import linspace
from numpy import linspace, mgrid
from .context import densratio


Expand Down Expand Up @@ -45,7 +45,9 @@ def test_alphadensratio_2d(self):
y = multivariate_normal.rvs(size=300, mean=[1, 1], cov=[[1./2, 0], [0, 2]], random_state=71)
result = densratio(x, y, alpha=0.5)
self.assertIsNotNone(result)
density_ratio = result.compute_density_ratio(linspace(-1, 3))
space_range = slice(-1, 3, 50j)
space_2d = mgrid[space_range, space_range].reshape(2, -1).T
density_ratio = result.compute_density_ratio(space_2d)
self.assertTrue((density_ratio >= 0).all())

def test_densratio_dimension_error(self):
Expand Down
Loading