
Refactor of Frank Wolfe method, add support for predictions without the k budget constraint, implemented some online experiments #2

Merged
merged 29 commits into from
Apr 6, 2024
f19aec7
Update weighted prediction to support a and b arguments
mwydmuch Jan 17, 2024
fc8f1ca
Refactor Frank Wolfe method
mwydmuch Jan 21, 2024
42e1179
Rename the Frank Wolfe submodule
mwydmuch Jan 21, 2024
3271e3e
Add support for predictions without the k budget constraint
mwydmuch Jan 23, 2024
ab03c9a
Add some online experiments
mwydmuch Jan 23, 2024
603e87e
Fix pre-commit checks
mwydmuch Jan 25, 2024
8ddd387
Add new numba methods, improve performance of BC and Frank-Wolfe
mwydmuch Jan 27, 2024
60fcb52
Add more numba methods to make the implementation of other algorithms…
mwydmuch Feb 3, 2024
2151fea
Add support for torch.tensors to weighted methods, improve performanc…
mwydmuch Feb 10, 2024
3162fe3
Format block_coordinate.py, remove wrappers for specific metrics
mwydmuch Feb 19, 2024
a1b7c4e
Format some other submodules of the xcolumns package
mwydmuch Feb 19, 2024
f2facf3
Reorganize implementation of metrics
mwydmuch Feb 19, 2024
0ca1e72
Improve types annotations, add autograd support for Frank Wolfe, intr…
mwydmuch Feb 25, 2024
354c6a0
Add first draft of quick_usage_guide.ipynb
mwydmuch Feb 25, 2024
da56fb8
Init sphinx docs
mwydmuch Feb 25, 2024
e23c508
Add basic tests
mwydmuch Feb 25, 2024
8ae4ff0
Add support for torch.tensors to Frank Wolfe algorithm, improve dtype…
mwydmuch Mar 5, 2024
3ed886c
Clean up Frank Wolfe module, rename arguments of block coordinate-bas…
mwydmuch Mar 5, 2024
781469f
Update the tests
mwydmuch Mar 6, 2024
ac69f9e
Update the main module, the tests, and NeurIPS experiments
mwydmuch Mar 10, 2024
9a55288
Update .gitignore
mwydmuch Mar 12, 2024
9812a18
Update the experiments
mwydmuch Mar 23, 2024
bc70003
Update xcolumns, add better doc strings, add micro/macro variants for…
mwydmuch Mar 24, 2024
d2293d1
Improve docstrings, add wrapper methods for different metrics and FW …
mwydmuch Mar 28, 2024
7cd3258
Update online experiments
mwydmuch Mar 29, 2024
ddc5a59
Update the documentation
mwydmuch Apr 2, 2024
525dc42
Update the experiments
mwydmuch Apr 4, 2024
98ab851
Update the docs
mwydmuch Apr 6, 2024
a6640dc
Update the main module
mwydmuch Apr 6, 2024
108 changes: 103 additions & 5 deletions .gitignore
@@ -1,8 +1,106 @@
experiments/datasets
experiments/predictions
experiments/results*
lightning_logs
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

__pycache__
# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py


# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# VS Code
.vscode

# PyCharm
.idea
11 changes: 7 additions & 4 deletions .pre-commit-config.yaml
@@ -17,10 +17,13 @@ repos:
- id: check-shebang-scripts-are-executable
- id: detect-private-key
- id: debug-statements
- repo: https://github.com/codespell-project/codespell
rev: v2.2.4
hooks:
- id: codespell
# - repo: https://github.com/codespell-project/codespell
# rev: v2.2.4
# hooks:
# - id: codespell
# name: codespell
# entry: codespell
# args: ["xcolumns"]
# - repo: https://github.com/PyCQA/flake8
# rev: 6.0.0
# hooks:
72 changes: 56 additions & 16 deletions README.md
@@ -1,42 +1,82 @@
[![PyPI version](https://badge.fury.io/py/xcolumns.svg)](https://badge.fury.io/py/xcolumns)
[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://pre-commit.com/)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)


<p align="center">
<img src="https://raw.githubusercontent.com/mwydmuch/xCOLUMNs/master/docs/_static/xCOLUMNs_logo.png" width="500px"/>
</p>

# x **Consistent Optimization of Label-wise Utilities in Multi-label classificatioN** s

xCOLUMNs is a small Python library that aims to implement different methods for the optimization of a general family of
metrics that can be defined on multi-label classification matrices.
These include, but are not limited to, label-wise metrics.
The library provides efficient implementations of these optimization methods that easily scale to extreme multi-label classification (XMLC), that is, problems with a very large number of labels and instances.

All the methods operate on conditional probability estimates of the labels, which are the output of the multi-label classification models.
Based on these estimates, the methods aim to find the optimal prediction for a given test set or to find the optimal population classifier as a plug-in rule on top of the conditional probability estimator.
This makes the library very flexible and allows it to be used with any multi-label classification model that provides conditional probability estimates.
The library directly supports numpy arrays, PyTorch tensors, and sparse CSR matrices from scipy as input/output data types.

For more details, please see our short usage guide, the documentation, and/or the papers that describe the methods implemented in the library.


## Quick start

### Installation

The library can be installed using pip:
```sh
pip install xcolumns
```
It should work on all major platforms (Linux, macOS, Windows) and with Python 3.8+.


### Usage

We provide a short usage guide for the library in the [short_usage_guide.ipynb](https://github.com/mwydmuch/xCOLUMNs/blob/master/short_usage_guide.ipynb) notebook.
You can also check the documentation for more details.


## Methods, usage, and how to cite

The library implements the following methods:

### Instance-wise weighted prediction

The library implements a set of methods for instance-wise weighted prediction, which includes optimal prediction strategies for different metrics, such as:
- Precision at k
- Propensity-scored precision at k
- Macro-averaged recall at k
- Macro-averaged balanced accuracy at k
- and others ...
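
For intuition, these weighted strategies boil down to ranking labels by an affine transform of the probability estimates and taking the top k. A minimal NumPy sketch (hypothetical helper, not xCOLUMNs' actual API; the `a`/`b` weighting only mirrors the `a` and `b` arguments mentioned in the commit history):

```python
import numpy as np

def weighted_top_k(probs: np.ndarray, a: np.ndarray, b: np.ndarray, k: int) -> np.ndarray:
    """Predict the k labels per instance with the largest gain a_j * p_ij + b_j.

    With a = 1, b = 0 this is plain top-k (optimal for precision at k);
    with a set to inverse propensities it sketches propensity-scored
    precision at k. Hypothetical helper, not the library's API.
    """
    gains = probs * a + b                      # (n, m) affine per-label gains
    pred = np.zeros_like(probs, dtype=np.int64)
    top = np.argsort(-gains, axis=1)[:, :k]    # indices of the k largest gains
    np.put_along_axis(pred, top, 1, axis=1)
    return pred

rng = np.random.default_rng(42)
probs = rng.random((3, 5))                     # conditional probability estimates
pred = weighted_top_k(probs, a=np.ones(5), b=np.zeros(5), k=2)
```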

### Optimization of prediction for a given test set using Block Coordinate Ascent/Descent (BCA/BCD)

The method aims to optimize the prediction for a given test set using the block coordinate ascent/descent algorithm.

The method was first introduced and described in the paper:
> [Erik Schultheis, Marek Wydmuch, Wojciech Kotłowski, Rohit Babbar, Krzysztof Dembczyński. Generalized test utilities for long-tail performance in extreme multi-label classification. NeurIPS 2023.](https://arxiv.org/abs/2311.05081)

### Finding optimal population classifier via Frank-Wolfe (FW)

The method was first introduced and described in the paper:
> [Erik Schultheis, Wojciech Kotłowski, Marek Wydmuch, Rohit Babbar, Strom Borman, Krzysztof Dembczyński. Consistent algorithms for multi-label classification with macro-at-k metrics. ICLR 2024.](https://arxiv.org/abs/2401.16594)
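
As background, the core Frank-Wolfe step is a linear minimization oracle over the feasible set followed by a convex-combination update. A generic toy sketch on the probability simplex (illustrative only, unrelated to the library's internals):

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, steps=200):
    """Generic Frank-Wolfe over the probability simplex.

    The linear minimization oracle over the simplex puts all mass on the
    coordinate with the smallest gradient; the iterate is then a convex
    combination of feasible points, so it stays feasible at every step.
    """
    x = x0.copy()
    for t in range(steps):
        g = grad(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0        # LMO: best vertex of the simplex
        gamma = 2.0 / (t + 2.0)      # standard diminishing step size
        x = (1.0 - gamma) * x + gamma * s
    return x

# toy problem: minimize ||x - c||^2 over the simplex for an interior point c
c = np.array([0.2, 0.5, 0.3])
x = frank_wolfe_simplex(lambda x: 2.0 * (x - c), np.array([1.0, 0.0, 0.0]))
```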


## Repository structure

The repository is organized as follows:
- `docs/` - Sphinx documentation (work in progress)
- `experiments/` - a code for reproducing experiments from the papers, see the README.md file in the directory for details
- `xcolumns/` - Python package with the library
- `tests/` - tests for the library (the coverage is a bit limited at the moment, but these tests should guarantee that the main components of the library work as expected)


## Development and contributing

The library was created as a part of our research projects.
We are happy to share it with the community and we hope that someone will find it useful.
If you have any questions or suggestions or if you found a bug, please open an issue.
We are also happy to accept contributions in the form of pull requests.
20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
13 changes: 13 additions & 0 deletions docs/README.md
@@ -0,0 +1,13 @@
# Documentation

Documentation for xCOLUMNs is generated using [Sphinx](https://www.sphinx-doc.org/).
After each commit on `master`, documentation is updated and published to [Read the Docs](https://xcolumns.readthedocs.io).

You can build the documentation locally. Just install the requirements and run the following in the ``docs`` directory:

```
pip install -r requirements.txt
make html
```

The documentation will be created in the `docs/_build` directory.
Binary file added docs/_static/favicon.png
52 changes: 39 additions & 13 deletions generate_logo.py → docs/_static/generate_logo.py
@@ -55,14 +55,15 @@ def create_logo_image(grid, filled_color, column_gradients, cell_size):
draw = ImageDraw.Draw(img)

# Create a gradient for each column
for i in range(len(grid[0])):
gradient_image = create_gradient(
cell_size[0],
image_height,
column_gradients[i % len(column_gradients)],
pi / 2,
)
img.paste(gradient_image, (i * cell_size[0], 0), gradient_image)
if column_gradients is not None:
for i in range(len(grid[0])):
gradient_image = create_gradient(
cell_size[0],
image_height,
column_gradients[i % len(column_gradients)],
pi / 2,
)
img.paste(gradient_image, (i * cell_size[0], 0), gradient_image)

for i, row in enumerate(grid):
# Define the starting and ending y coordinates for the row
@@ -146,7 +147,7 @@ def color_mod_val():
)

# Logo with the same number of filled cells in each row (k=13)
grid = """
logo_grid = """
....................................
.....XXX.XXX.X.......X...X.X..X..XX.
.....X...X.X.X.....X.XX.XX.XX.X.X...
@@ -160,6 +161,20 @@ def color_mod_val():
"\n"
)

favicon_grid = """
.........
.....XXX.
.....X...
.....X...
.X.X.X...
..X..X...
.X.X.X...
.X.X.XXX.
.........
""".strip().split(
"\n"
)

# Count the number of filled cells in each row
# for i, row in enumerate(grid):
# print(f"Full cells in row {i}: {row.count('X')}")
@@ -171,8 +186,19 @@
)

# Generate the gradient image
logo_image = create_logo_image(grid, filled_color, columns_gradients, cell_size)

# Save the image or display it
logo_image = create_logo_image(
logo_grid, filled_color, columns_gradients, cell_size
)
logo_image.save("xCOLUMNs_logo.png") # Save the image as 'xCOLUMNs_logo.png'
logo_image.show() # Show the image

# Generate the gradient image
logo_image = create_logo_image(logo_grid, filled_color, None, cell_size)
logo_image.save(
"xCOLUMNs_logo_nobg.png"
) # Save the image as 'xCOLUMNs_logo_nobg.png'

# Generate the favicon image
favicon_image = create_logo_image(
favicon_grid, filled_color, columns_gradients, cell_size
)
favicon_image.save("favicon.png") # Save the image as 'favicon.png'
File renamed without changes
Binary file added docs/_static/xCOLUMNs_logo_nobg.png
35 changes: 35 additions & 0 deletions docs/api/block_coordinate.md
@@ -0,0 +1,35 @@
# Block Coordinate-based prediction methods

The `xcolumns.block_coordinate` module implements methods for finding the optimal prediction for a given test set using the Block Coordinate Ascent/Descent algorithm with a 0-th order approximation of the expected utility.
The method was first introduced and described in the paper:
> [Erik Schultheis, Marek Wydmuch, Wojciech Kotłowski, Rohit Babbar, Krzysztof Dembczyński. Generalized test utilities for long-tail performance in extreme multi-label classification. NeurIPS 2023.](https://arxiv.org/abs/2311.05081)

Note: BCA/BCD with the 0-th order approximation uses the tp, fp, fn, tn parametrization of the confusion matrix,
as opposed to the algorithms presented in the paper, which use the :math:`t, q, p` parametrization. However, both algorithms are equivalent.

The main function of the module is [**predict_using_bc_with_0approx**](#xcolumns.block_coordinate.predict_using_bc_with_0approx):

```{eval-rst}
.. autofunction:: xcolumns.block_coordinate.predict_using_bc_with_0approx
```
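
To convey the idea behind the algorithm (this sketch does not reflect the actual signature of `predict_using_bc_with_0approx`): treat each instance's prediction row as one block and greedily re-optimize it while all other rows stay fixed. A hypothetical, heavily simplified version:

```python
import numpy as np

def bc_ascent_sketch(probs, k, utility, iters=2):
    """Hypothetical simplification of block coordinate ascent.

    `utility(probs, pred)` returns the expected utility of a full prediction
    matrix; each pass re-optimizes one instance's row at a time. Picking the
    k labels with the largest single-label gains is exact only when the
    utility is additive over labels given the other rows fixed.
    """
    n, m = probs.shape
    pred = np.zeros((n, m), dtype=np.int64)
    # initialize with top-k by marginal probability
    np.put_along_axis(pred, np.argsort(-probs, axis=1)[:, :k], 1, axis=1)
    for _ in range(iters):
        for i in range(n):
            pred[i] = 0
            gains = np.empty(m)
            for j in range(m):
                pred[i, j] = 1
                gains[j] = utility(probs, pred)  # utility with only label j on
                pred[i, j] = 0
            pred[i, np.argsort(-gains)[:k]] = 1  # keep the k best labels
    return pred

# expected instance-wise precision at k is additive, so BC recovers plain top-k
exp_precision = lambda p, y: float((p * y).sum())
rng = np.random.default_rng(0)
probs = rng.random((4, 6))
pred = bc_ascent_sketch(probs, k=2, utility=exp_precision)
```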

## Wrapper functions for specific metrics

The module provides wrapper functions for specific metrics that can be used as arguments for the `predict_using_bc_with_0approx` function, as well as a factory function for creating such wrapper functions.

```{eval-rst}
.. automodule:: xcolumns.block_coordinate
:members:
:exclude-members: predict_using_bc_with_0approx, predict_optimizing_coverage_using_bc
:undoc-members:
:show-inheritance:
```


## Special function for optimization of coverage

The module provides a special function for the optimization of the coverage metric, which uses a different way of estimating the expected value of the metric than the `predict_using_bc_with_0approx` function.

```{eval-rst}
.. autofunction:: xcolumns.block_coordinate.predict_optimizing_coverage_using_bc
```
11 changes: 11 additions & 0 deletions docs/api/confusion_matrix.md
@@ -0,0 +1,11 @@
# Confusion Matrix

The `xcolumns.confusion_matrix` module implements a confusion matrix object and functions that can be used to calculate it.
In xCOLUMNs, the confusion matrix is parametrized by four matrices: true positive (tp), false positive (fp), false negative (fn), and true negative (tn).

```{eval-rst}
.. automodule:: xcolumns.confusion_matrix
:members:
:undoc-members:
:show-inheritance:
```
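
The four-matrix parametrization can be sketched in plain NumPy (an illustrative sketch, not the library's own implementation):

```python
import numpy as np

def confusion_entries(y_true: np.ndarray, y_pred: np.ndarray):
    """Per-label confusion matrix entries for binary {0, 1} matrices
    of shape (n_instances, n_labels). Illustrative sketch only."""
    tp = (y_true * y_pred).sum(axis=0)              # true positives per label
    fp = ((1 - y_true) * y_pred).sum(axis=0)        # false positives per label
    fn = (y_true * (1 - y_pred)).sum(axis=0)        # false negatives per label
    tn = ((1 - y_true) * (1 - y_pred)).sum(axis=0)  # true negatives per label
    return tp, fp, fn, tn

y_true = np.array([[1, 0], [1, 1], [0, 1]])
y_pred = np.array([[1, 1], [0, 1], [0, 1]])
tp, fp, fn, tn = confusion_entries(y_true, y_pred)
print(tp, fp, fn, tn)  # tp=[1 2], fp=[0 1], fn=[1 0], tn=[1 0]
```

For every label, the four entries sum to the number of instances, which is a handy sanity check.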