Topology.identical_molecule_groups scales poorly with many large molecules #2008

Open
mattwthompson opened this issue Feb 3, 2025 · 0 comments
Labels
protein-performance Possibly related to speed of loading or parametrizing proteins

Comments

@mattwthompson
Member

Describe the bug

Topology.identical_molecule_groups scales super-linearly with molecule size when multiple large components are present in a system. This makes parametrizing large polymer systems unworkably slow. I believe this is a root cause of openforcefield/openff-interchange#1156

To Reproduce

This is a simple and imperfect reproduction, but it shows that when multiple copies of a large molecule are present, runtime explodes once that molecule has a few hundred to a few thousand heavy atoms, which is not necessarily a large system in materials science.

import time
import numpy
from matplotlib import pyplot
from openff.toolkit import Molecule, Topology
import pathlib


def time_isomorphism(num_heavy_atoms: int, num_molecule_copies: int) -> float:
    molecule = Molecule.from_smiles("C" * num_heavy_atoms)

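    file_path = pathlib.Path(f"topologies/{num_heavy_atoms}-{num_molecule_copies}.json")

    if file_path.exists():
        # Reuse a cached topology rather than rebuilding it
        topology = Topology.from_json(file_path.read_text())
    else:
        topology = Topology.from_molecules(num_molecule_copies * [molecule])

        # Create the cache directory on first use, then store the topology
        file_path.parent.mkdir(parents=True, exist_ok=True)
        file_path.write_text(topology.to_json())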
        
    start_time = time.perf_counter()
    # Accessing the property triggers the computation being timed
    topology.identical_molecule_groups

    return_time = time.perf_counter() - start_time

    print(f"n={num_heavy_atoms}, m={num_molecule_copies}, time={round(return_time, 10)}")

    del topology

    # Return the interval measured above, excluding the print and cleanup
    return return_time

fig, (ax1, ax2) = pyplot.subplots(2, 1)

for m in numpy.logspace(0, 2.0, num=3, dtype=int):
    n = numpy.unique(numpy.logspace(0, 4.0, num=41, dtype=int))
    times = numpy.array([time_isomorphism(_n, m) for _n in n])
    ax1.plot(n, times, 'o-', label=m)
    ax2.plot(n, times/m, 'o-', label=m)
    print(f"finished {m=}")

ax1.legend(loc=0, title='num copies of each molecule')
ax1.set_xlabel("Alkane length $C_n$")
ax1.set_ylabel("total runtime (seconds)")
ax1.set_xscale("log")
ax1.set_yscale("log")

ax2.set_xscale("log")
ax2.set_ylabel("runtime per copy (seconds)")

fig.show()

Output

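[Image: plots of total runtime (log-log) and runtime per copy against alkane length, one curve per number of copies]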
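
To put a number on "super-linear", one option is to fit a power law to the measured timings: the slope of log(runtime) against log(size) estimates the exponent, and anything clearly above 1 is super-linear. The sketch below is only an illustration. `estimate_scaling_exponent` is a hypothetical helper, and the timings are synthetic values made up to follow t = 1e-6 * n**2, not measurements from this issue.

```python
import math


def estimate_scaling_exponent(sizes, runtimes):
    """Return the least-squares slope of log(runtime) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    return covariance / variance


# Synthetic timings (NOT real measurements): exactly quadratic in n
sizes = [10, 100, 1000, 10000]
runtimes = [1e-6 * n**2 for n in sizes]

exponent = estimate_scaling_exponent(sizes, runtimes)
print(f"estimated exponent: {exponent:.2f}")
```

With real data, the `n` and `times` arrays from the script above could be passed in directly, ideally restricted to the larger sizes where constant overhead no longer dominates.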

Additional context

This script runs into memory issues with large molecules, which I haven't written up yet. There might be easier ways to prepare these topologies but this is fast in human time.
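
A possible starting point for characterizing the memory issue is to record peak memory around the expensive step. This is a minimal sketch using the standard-library `tracemalloc` module; `peak_python_memory` is a hypothetical helper, and the workload is a stand-in list allocation rather than a real topology. Note the assumption that matters here: `tracemalloc` only sees allocations made through Python's allocator, so memory used inside compiled extensions may not be counted.

```python
import tracemalloc


def peak_python_memory(workload):
    """Run workload() and return the peak traced Python memory in bytes."""
    tracemalloc.start()
    try:
        workload()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if the workload raises
        tracemalloc.stop()
    return peak


# Stand-in workload (an assumption for illustration): allocate a large list
peak_bytes = peak_python_memory(lambda: [0] * 1_000_000)
print(f"peak traced memory: {peak_bytes / 1e6:.1f} MB")
```

In the reproduction script, the workload could be something like `lambda: topology.identical_molecule_groups`, so that only that step is traced.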

@mattwthompson mattwthompson added the protein-performance Possibly related to speed of loading or parametrizing proteins label Feb 3, 2025