`Topology.identical_molecule_groups` scales poorly with many large molecules #2008

mattwthompson · 2025-02-03T20:56:02Z

Describe the bug

`Topology.identical_molecule_groups` scales super-linearly with molecule size when a system contains multiple large components. This makes parametrizing large polymer systems unworkably slow. I believe this is a root cause of openforcefield/openff-interchange#1156.

To Reproduce

This is a simple and imperfect reproduction, but it shows that when multiple copies of a large molecule are present, runtime explodes once that molecule reaches a few hundred to a few thousand heavy atoms. That is not necessarily a large system in materials science.

import time
import numpy
from matplotlib import pyplot
from openff.toolkit import Molecule, Topology
import pathlib


def time_isomorphism(num_heavy_atoms: int, num_molecule_copies: int) -> float:
    """Time one access of `Topology.identical_molecule_groups`, in seconds."""
    molecule = Molecule.from_smiles("C" * num_heavy_atoms)

    # Cache each topology as JSON so repeat runs can skip rebuilding it
    file_path = pathlib.Path(f"topologies/{num_heavy_atoms}-{num_molecule_copies}.json")
    file_path.parent.mkdir(exist_ok=True)

    if file_path.exists():
        topology = Topology.from_json(file_path.read_text())
    else:
        topology = Topology.from_molecules(num_molecule_copies * [molecule])
        file_path.write_text(topology.to_json())

    # Time only the property access
    start_time = time.perf_counter()
    topology.identical_molecule_groups
    return_time = time.perf_counter() - start_time

    print(f"n={num_heavy_atoms}, m={num_molecule_copies}, time={round(return_time, 10)}")

    del topology

    # Return the measured time, excluding the print and cleanup above
    return return_time

fig, (ax1, ax2) = pyplot.subplots(2, 1)

for m in numpy.logspace(0, 2.0, num=3, dtype=int):
    n = numpy.unique(numpy.logspace(0, 4.0, num=41, dtype=int))
    times = numpy.array([time_isomorphism(_n, m) for _n in n])
    ax1.plot(n, times, 'o-', label=m)
    ax2.plot(n, times/m, 'o-', label=m)
    print(f"finished {m=}")

ax1.legend(loc=0, title='num copies of each molecule')
ax1.set_xlabel("Alkane length $C_n$")
ax1.set_ylabel("total runtime (seconds)")
ax1.set_xscale("log")
ax1.set_yscale("log")

ax2.set_xscale("log")
ax2.set_ylabel("runtime per copy (seconds)")

fig.show()

Output

Additional context

This script runs into memory issues with large molecules, which I haven't written up yet. There might be easier ways to prepare these topologies, but this was quick to put together.
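One way to put a number on "super-linear" is to fit a slope to the timings on log-log axes: a slope near 1 would suggest linear scaling, near 2 quadratic. Below is a minimal sketch using only the standard library. The function name `fit_scaling_exponent` is hypothetical, and the sample data is synthetic, not measurements from the script above.

```python
import math


def fit_scaling_exponent(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    if len(sizes) != len(times) or len(set(sizes)) < 2:
        raise ValueError("need (size, time) pairs with at least two distinct sizes")

    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]

    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)

    # Ordinary least squares: slope = cov(x, y) / var(x)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)

    return covariance / variance


# Synthetic timings that grow exactly as n**2 (illustrative only)
sizes = [10, 100, 1000]
times = [1e-6 * n**2 for n in sizes]
exponent = fit_scaling_exponent(sizes, times)
print(f"fitted exponent: {exponent:.2f}")
```

With the `n` and `times` arrays from the script, `numpy.polyfit(numpy.log(n), numpy.log(times), 1)` should give the same slope as its first coefficient. Fitting only the large-`n` tail is probably more informative, since fixed overhead dominates the small molecules.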

mattwthompson added the `protein-performance` label (possibly related to speed of loading or parametrizing proteins) on Feb 3, 2025
