Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added functionality to allow X to be a precomputed distance matrix #55

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 66 additions & 0 deletions debacl/level_set_tree.py
Original file line number Diff line number Diff line change
Expand Up @@ -1260,6 +1260,72 @@ def construct_tree(X, k, prune_threshold=None, num_levels=None, verbose=False):
return tree


def construct_tree_from_precomputed_matrix(X, p, k, prune_threshold=None, num_levels=None, verbose=False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change "precomputed_matrix" to "distance_matrix"? I think it would be a little more explicit.

"""
Construct a level set tree from tabular data.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"from tabular data" should be updated to "from a precomputed distance matrix."


Parameters
----------
X : 2-dimensional numpy array
A square distance matrix

p: int
Underlying dimensionality of the data that was used to create X, the distance matrix

k : int
Number of observations to consider as neighbors to a given point.

prune_threshold : int, optional
Leaf nodes with fewer than this number of members are recursively
merged into larger nodes. If 'None' (the default), then no pruning
is performed.

num_levels : int, optional
Number of density levels in the constructed tree. If None (default),
`num_levels` is internally set to be the number of rows in `X`.

verbose : bool, optional
If True, a progress indicator is printed at every 100th level of tree
construction.

Returns
-------
T : LevelSetTree
A pruned level set tree.

See Also
--------
construct_tree, construct_tree_from_graph, LevelSetTree

Examples
--------
>>> X = numpy.random.rand(100, 2)
>>> X = scipy.spatial.distance.pdist(data)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is super pedantic, but I prefer not to overwrite the original data. Maybe call this D or distance_matrix?

>>> X = scipy.spatial.distance.squareform(matrix)
>>> tree = debacl.construct_tree_from_precomputed_matrix(X, p=2, k=8, prune_threshold=5)
>>> print(tree)
+----+-------------+-----------+------------+----------+------+--------+----------+
| id | start_level | end_level | start_mass | end_mass | size | parent | children |
+----+-------------+-----------+------------+----------+------+--------+----------+
| 0 | 0.000 | 0.870 | 0.000 | 0.450 | 100 | None | [3, 4] |
| 3 | 0.870 | 3.364 | 0.450 | 0.990 | 17 | 0 | [] |
| 4 | 0.870 | 1.027 | 0.450 | 0.520 | 35 | 0 | [7, 8] |
| 7 | 1.027 | 1.755 | 0.520 | 0.870 | 8 | 4 | [] |
| 8 | 1.027 | 3.392 | 0.520 | 1.000 | 23 | 4 | [] |
+----+-------------+-----------+------------+----------+------+--------+----------+
"""

sim_graph, radii = _utl.knn_graph(X, k, method='precomputed')

n, _ = X.shape
density = _utl.knn_density(radii, n, p, k)

tree = construct_tree_from_graph(adjacency_list=sim_graph, density=density,
prune_threshold=prune_threshold,
num_levels=num_levels, verbose=verbose)
return tree


def construct_tree_from_graph(adjacency_list, density, prune_threshold=None,
num_levels=None, verbose=False):
"""
Expand Down
23 changes: 20 additions & 3 deletions debacl/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,12 +38,13 @@ def knn_graph(X, k, method='brute_force', leaf_size=30):
Parameters
----------
X : numpy array | list [numpy arrays]
Data points, with each row as an observation.
Data points, with each row as an observation
OR a distance matrix if method='precomputed'

k : int
The number of points to consider as neighbors of any given observation.

method : {'brute-force', 'kd-tree', 'ball-tree'}, optional
method : {'brute-force', 'kd-tree', 'ball-tree', 'precomputed'}, optional
Computing method.

- 'brute-force': computes the (Euclidean) distance between all O(n^2)
Expand All @@ -61,11 +62,14 @@ def knn_graph(X, k, method='brute_force', leaf_size=30):
distances. Typically much faster than 'brute-force', and works with
up to a few hundred dimensions. Requires the scikit-learn library.

- 'precomputed': used when the matrix X is a precomputed distance
matrix.

leaf_size : int, optional
For the 'kd-tree' and 'ball-tree' methods, the number of observations
in the leaf nodes. Leaves are not split further, so distance
computations within leaf nodes are done by brute force. 'leaf_size' is
ignored for the 'brute-force' method.
ignored for the 'brute-force' and 'precomputed' methods.

Returns
-------
Expand Down Expand Up @@ -109,6 +113,19 @@ def knn_graph(X, k, method='brute_force', leaf_size=30):
raise ImportError("The scikit-learn library could not be loaded." +
" It is required for the 'ball-tree' method.")

if method == 'precomputed':
# Pseudo-sort each row of the distance matrix so that the indices of the closest k elements are listed first
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably validate here that input X is actually n x n.

# The entries [...,:k] are not in sorted order by distance
neighbors = _np.argpartition(X, k)[:, 0:k]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea - I didn't know numpy had this! On a 1,000 x 1,000 matrix, my crude ipython timing says this is 4x faster than argsort.


# Get the distance values for the first k elements, and then sort 'neighbors' by those distance values
radii = X[_np.arange(n)[:, None], neighbors[:]]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm reading the documentation for np.argpartition correctly, I don't think we need lines 122-127, because the kth-nearest neighbor is guaranteed to be in position k. I think the simpler way to do it would be

neighbors = _np.argpartition(X, k - 1)[:, :k]   # note the slight difference from line 119
radii = D[_np.arange(n), neighbors[:, -1]]

order = radii.argsort(axis=1)
neighbors = neighbors[_np.arange(n)[:, None], order]

radii = radii[_np.arange(n)[:, None], order]
radii = radii[:, -1]

else: # assume brute-force
if not _HAS_SCIPY:
raise ImportError("The 'scipy' module could not be loaded. " +
Expand Down