
Support for graph operations #107

Closed
jl-wynen opened this issue Jan 22, 2024 · 13 comments

@jl-wynen
Member

At the moment, Pipeline is largely a black box in terms of the graph it represents. It can build TaskGraph objects, but only when the user knows suitable nodes to pass to build or get (as in tg = p.get(p._providers.keys()) below). We now have several issues (open and closed) that relate to graph operations and could be solved easily if we had such operations:

  • save pipeline structure with parameters for reproduction #92 Potentially simple to implement with graph traversal functionality.
  • Graph visualisation including Suggestions for graph visualization #73 and existing graphviz converter.
  • Allow visualization of graphs with missing requirements #80 and Get missing params from a pipeline #83 are a bit trickier; we would have to build a partial graph and scan it for nodes that are missing an edge (i.e. missing provider for an arg)

NetworkX seems like a good choice for a graph implementation because it provides all basic algorithms out of the box. It becomes trivial to solve #43 as suggested by @YooSunYoung, #92 for several formats, and maybe visualisation (might get removed from NetworkX).

Current graph traversal

Unfortunately, it is currently cumbersome, expensive, and not technically allowed to traverse graphs. E.g., to build a NetworkX graph, we need something like (note the protected attributes and functions)

import networkx as nx
from sciline.visualize import _format_type as format_type
from sciline import Pipeline

def to_networkx(p: Pipeline) -> nx.DiGraph:
    # Request every key the pipeline can provide to obtain a graph of the
    # whole pipeline (note: _providers and TaskGraph._graph are protected).
    tg = p.get(p._providers.keys())
    g = nx.DiGraph()
    for res, (provider, args) in tg._graph.items():
        g.add_node(hash(res), label=format_type(res).name)
        for arg in args:
            g.add_node(hash(arg), label=format_type(arg).name)
            # Edge from each argument to the result, labelled with the provider.
            g.add_edge(hash(arg), hash(res), label=provider.__qualname__)
    return g
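
Once such a conversion exists, standard NetworkX algorithms cover the primitive operations we need. A minimal sketch of what that buys us (assuming the to_networkx helper above and a Pipeline instance p; purely illustrative):

import networkx as nx

g = to_networkx(p)

# Disconnected sub-pipelines (#43) fall out of a single library call.
components = list(nx.weakly_connected_components(g))

# Nodes without incoming edges: parameters or unsatisfied requirements.
sources = [n for n, deg in g.in_degree() if deg == 0]

# Nodes without outgoing edges: candidate outputs.
sinks = [n for n, deg in g.out_degree() if deg == 0]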

Potential solutions

I talk about NetworkX here, but we can also use a different library or custom graph implementation. The general ideas remain the same.

1. Implement to_networkx in Pipeline

This is the least invasive and easiest to implement. But it would mean that operations that need to inspect the graph have to build a temporary nx.DiGraph, which is wasteful.

2. Represent the internal state of Pipeline as a DiGraph

NetworkX allows storing arbitrary data in nodes and edges, so we can stuff everything we need into the graph. This would make graph operations first-class citizens and would open up more possibilities. (E.g., we could use node metadata to indicate whether a provider is a param or not, something that is currently difficult (impossible?) to determine.)
On the other hand, this would require a complete refactoring of Pipeline. And I don't even know if there is a straightforward way of representing everything as a graph, in particular when it comes to subproviders.
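
For illustration only (this is not how Pipeline is implemented today), a sketch of how node metadata in an nx.DiGraph could record whether a key is a parameter or comes from a provider:

import networkx as nx

def length(s: str) -> int:  # made-up provider for the sketch
    return len(s)

g = nx.DiGraph()
# Data types are nodes; parameter/provider information lives in node attributes.
g.add_node(str, is_param=True, value='hello')
g.add_node(int, is_param=False, provider=length)
g.add_edge(str, int)  # `length` consumes str to produce int

# With metadata in place, "is this a param?" becomes a trivial query.
params = [n for n, data in g.nodes(data=True) if data.get('is_param')]
assert params == [str]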

@SimonHeybrock
Member

SimonHeybrock commented Jan 23, 2024

#74 (or #74 (comment)) can be added to your list, I think.

save pipeline structure with parameters for reproduction #92 Potentially simple to implement with graph traversal functionality.

This is somewhat different: We want to save task-graphs, which we know to be a DAG. Much less complexity there than what Pipeline has to do/support.

Graph visualisation including Suggestions for graph visualization #73 and existing graphviz converter.

Visualization is kind of unrelated to graphs. The problems we face there will not be solved by a graph library or something along those lines, imho.

Allow visualization of graphs with missing requirements #80 and Get missing params from a pipeline #83 are a bit trickier; we would have to build a partial graph and scan it for nodes that are missing an edge (i.e. missing provider for an arg)

Note that this is essentially solved already in the current implementation.

NetworkX seems like a good choice for a graph implementation because it provides all basic algorithms out of the box. It becomes trivial to solve #43 as suggested by @YooSunYoung, #92 for several formats, and maybe visualisation (might get removed from NetworkX).

Does it solve anything else apart from #43?

Current graph traversal

Unfortunately, it is currently cumbersome, expensive, and not technically allowed to traverse graphs. E.g., to build a NetworkX graph, we need something like (note the protected attributes and functions)

Why do you say it is expensive? Can you elaborate?

import networkx as nx
from sciline.visualize import _format_type as format_type
from sciline import Pipeline

def to_networkx(p: Pipeline) -> nx.DiGraph:
    tg = p.get(p._providers.keys())
    g = nx.DiGraph()
    for res, (provider, args) in tg._graph.items():
        g.add_node(hash(res), label=format_type(res).name)
        for arg in args:
            g.add_node(hash(arg), label=format_type(arg).name)
            g.add_edge(hash(arg), hash(res), label=provider.__qualname__)
    return g

Shouldn't providers also be treated as nodes?

Potential solutions

I talk about NetworkX here, but we can also use a different library or custom graph implementation. The general ideas remain the same.

1. Implement to_networkx in Pipeline

This is the least invasive and easiest to implement. But it would mean that operations that need to inspect the graph need to make a temporary nx.DiGraph which is wasteful.

Is it wasteful to a relevant amount?

2. Represent the internal state of Pipeline as a DiGraph

NetworkX allows storing arbitrary data in nodes and edges, so we can stuff everything we need into the graph. This would make graph operations first-class citizens and would open up more possibilities. (E.g., we could use node metadata to indicate whether a provider is a param or not, something that is currently difficult (impossible?) to determine.) On the other hand, this would require a complete refactoring of Pipeline. And I don't even know if there is a straightforward way of representing everything as a graph, in particular when it comes to subproviders.

I don't feel this is going to work, given the support for param tables and generic providers. Furthermore, it would make such a graph library a mandatory dependency. Until we have clear evidence that the benefit is large (see my question a couple of paragraphs above: does it solve anything but #43?), I do not see a justification for that.

@jl-wynen
Member Author

This is somewhat different: We want to save task-graphs, which we know to be a DAG. Much less complexity there than what Pipeline has to do/support.

Can you show me an example of a pipeline that is not a DAG? Can this exist given that we can convert pipelines to task graphs?

Visualization is kind of unrelated to graphs. The problems we face there will not be solved by a graph library or something along those lines, imho.

Yes. But it would be helped by traversal functions. We can of course implement them ourselves.

Does it solve anything else apart from #43?

It provides the primitive operations needed to compose everything else. Plus high-level ones that can make our lives easier. But yes, if you are looking for a single function that fixes an open issue, it's only #43.

Why do you say it is expensive? Can you elaborate?

Because of tg = p.get(p._providers.keys()), which is O(nodes) × O(build_graph), i.e. the graph is effectively built once per provided key.

Shouldn't providers also be treated as nodes?

Can be. The choice here matches how our and Dask's task graphs are implemented. But if this doesn't work in general, we can make providers nodes.
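
For reference, a rough sketch of the two conventions under discussion (purely illustrative, not sciline's actual data structures): the task-graph style where only data keys are nodes and the provider is attached as metadata, versus a bipartite graph with explicit provider nodes:

import networkx as nx

def clean(raw: str) -> int:  # made-up provider
    return len(raw)

# Task-graph style (roughly how our TaskGraph and Dask's dict-based graphs
# look when viewed as a graph): only data keys are nodes, the provider is
# stored on the edge (or on the target node).
g1 = nx.DiGraph()
g1.add_edge(str, int, provider=clean)

# Alternative: providers become nodes of their own (a bipartite graph).
g2 = nx.DiGraph()
g2.add_edge(str, clean)   # the provider consumes str
g2.add_edge(clean, int)   # the provider produces int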

Is it wasteful to a relevant amount?

Depends on how many checks you run and whether they can share the intermediate graph. And, of course, what latency you are willing to allow for compute.

@SimonHeybrock
Member

This is somewhat different: We want to save task-graphs, which we know to be a DAG. Much less complexity there than what Pipeline has to do/support.

Can you show me an example of a pipeline that is not a DAG? Can this exist given that we can convert pipelines to task graphs?

I don't think thinking of a pipeline as a specific graph is possible in general. Due to (in particular) generic providers, it is not possible to draw a graph without knowing which keys the user wants to compute. Furthermore, it is possible to create a pipeline with cycles, etc., without errors.

Visualization is kind of unrelated to graphs. The problems we face there will not be solved by a graph library or something along those lines, imho.

Yes. But it would be helped by traversal functions. We can of course implement them ourselves.

Not sure what you have in mind. Are you talking about visualizing the pipeline as a graph (as opposed to the resulting task-graphs)?

Does it solve anything else apart from #43?

It provides the primitive operations needed to compose everything else. Plus high-level ones that can make our lives easier. But yes, if you are looking for a single function that fixes an open issue, it's only #43.

Compose what else, can you elaborate? And what "high level ones"?

Why do you say it is expensive? Can you elaborate?

Because of tg = p.get(p._providers.keys()), which is O(nodes) × O(build_graph), i.e. the graph is effectively built once per provided key.

I do not understand what you are computing there, and why it is needed. Why do we need to rebuild the graph for every provided type in the pipeline?

Shouldn't providers also be treated as nodes?

Can be. The choice here matches how our and Dask's task graphs are implemented. But if this doesn't work in general, we can make providers nodes.

I think we need to differentiate here between Pipeline and TaskGraph. I agree for the latter, but the requirements listed initially are at least in part unrelated. Therefore, I don't think that how Dask or TaskGraph represent their graphs is necessarily meaningful here.

Is it wasteful to a relevant amount?

Depends on how many checks you run and whether they can share the intermediate graph. And, of course, what latency you are willing to allow for compute.

I know it depends, that is why I am asking. ;)

@jl-wynen
Member Author

Are you talking about visualizing the pipeline as a graph (as opposed to the resulting task-graphs)?

Yes, the inspection algorithms (finding unused params, missing params) need to inspect the entire pipeline, no? So we need some way of operating on it. Currently, that is not really possible. We can only get a concrete task graph. The purpose of tg = p.get(p._providers.keys()) is to get a graph for the entire pipeline. (I know that it is an incomplete solution. It was meant to point out that we need support from Pipeline, not just TaskGraph.)

Due to (in particular) generic providers it is not possible to draw a graph without knowing which keys the users wants to compute.

I don't think this is correct. Generic providers are not truly generic but always restricted to a concrete (open) set of types. This means that given

def a() -> X[int]: ...
def b() -> X[float]: ...
def c(x: X[T]) -> None: ...

we can have the combined [a, b] as the incoming edges to X and c as an outgoing edge.
In other words, for any given pipeline, we know all providers that can potentially make any given generic type. So we can build a graph for them. The edges of that graph may be complex. But that is not a priori a problem. (Works just as well when providers are nodes.)
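
A self-contained sketch of this 'overload set' idea (the graph layout is one possible choice for illustration, not an existing sciline structure):

import networkx as nx
from typing import Generic, TypeVar

T = TypeVar('T')

class X(Generic[T]): ...

def a() -> X[int]: ...
def b() -> X[float]: ...
def c(x: X[T]) -> None: ...

# A single node stands for the generic key X[T]. Every provider that can
# produce a matching concrete type feeds into it (the 'overload set'),
# and every consumer of X[T] hangs off it.
g = nx.MultiDiGraph()
for provider in (a, b):
    g.add_edge(provider.__name__, 'X[T]')
g.add_edge('X[T]', 'c')

assert set(g.predecessors('X[T]')) == {'a', 'b'}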

Furthermore it is possible to create a pipeline with cycles, etc., without errors.

Is it? How can such a pipeline ever produce a DAG?

Compose what else, can you elaborate? And what "high level ones"?

The primitive operations are traversal, adding and removing nodes and edges. We have the latter but not the former. By "high level", I mean pretty much everything else. We can, at least in principle, express those in terms of the primitives. E.g., connected_components, converting to another format (dot or other file formats), including task graphs, as well as scheduling task graphs.
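
For concreteness, a few of those high-level operations expressed as NetworkX calls (assuming a DiGraph g for the pipeline; the dot export relies on the optional pydot dependency):

import networkx as nx

g = nx.DiGraph([('raw', 'cleaned'), ('cleaned', 'result')])  # toy graph

# Connected components (a generalization of #43).
components = list(nx.weakly_connected_components(g))

# Conversion to another format, e.g. dot.
nx.drawing.nx_pydot.write_dot(g, 'pipeline.dot')

# A topological order is essentially a (serial) schedule for a task graph.
order = list(nx.topological_sort(g))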

@SimonHeybrock
Member

Are you talking about visualizing the pipeline as a graph (as opposed to the resulting task-graphs)?

Yes, the inspection algorithms (finding unused params, missing params) need to inspect the entire pipeline, no? So we need some way of operating on it. Currently, that is not really possible. We can only get a concrete task graph. The purpose of tg = p.get(p._providers.keys()) is to get a graph for the entire pipeline. (I know that it is an incomplete solution. It was meant to point out that we need support from Pipeline, not just TaskGraph.)

I am confused now; you appear to be mixing two of my comments (visualization of the pipeline vs. cost of "graph" setup).

Due to (in particular) generic providers it is not possible to draw a graph without knowing which keys the users wants to compute.

I don't think this is correct. Generic providers are not truly generic but always restricted to a concrete (open) set of types.

Are they? How?

This means that given

def a() -> X[int]: ...
def b() -> X[float]: ...
def c(x: X[T]) -> None: ...

we can have the combined [a, b] as the incoming edges to X and c as an outgoing edge. In other words, for any given pipeline, we know all providers that can potentially make any given generic type. So we can build a graph for them. The edges of that graph may be complex. But that is not a priori a problem. (Works just as well when providers are nodes.)

What about X[str], etc.? Again, I don't understand. From my point of view a generic provider can provide an a priori infinite number of output types.

Furthermore it is possible to create a pipeline with cycles, etc., without errors.

Is it? How can such a pipeline ever produce a DAG?

The pipeline does not build a graph upon creation (as you have experienced, this is non-trivial without knowing, e.g., the output keys). Even if there are cycles, the pipeline may have acyclic parts that can produce a DAG.

Compose what else, can you elaborate? And what "high level ones"?

The primitive operations are traversal, adding and removing nodes and edges. We have the latter but not the former.

You mean we do not have traversal? Is this what we do when a pipeline builds a task graph, or do you have something else in mind?

By "high level", I mean pretty much everything else. We can, at least in principle, express those in terms of the primitives. E.g., connected_components, converting to another format (dot or other file formats), including task graphs, as well as scheduling task graphs.

connected_components is a generalization of #43, right? For the other point I want to reiterate that we have to distinguish between Pipeline and TaskGraph. Are you talking about the "largest possible graph" that a pipeline could produce, which may in principle have infinitely many nodes/edges (unless we "compactify" generic providers)?

@jl-wynen
Member Author

For generic providers, the output type is still restricted by the sciline.Scope, so we can find all providers that can consume data from the generic provider. And conversely, for a given generic type, we know all providers that can provide a matching type. Those two points mean that we can connect generic types and providers in a graph with multi-edges or multiple nodes per provider argument (essentially forming an 'overload set').

Even if there are cycles, the pipeline may have a-cyclic parts that can produce a DAG.

Yes, but is the cyclic part valid? Can you give an example where a cycle in a pipeline is both valid and useful? (I.e. not in a part of the pipeline that is never used for a task graph.)

You mean we do not have traversal? Is this what we do when a pipeline builds a task graph, or do you have something else in mind?

In this case, the traversal is entangled with the build logic. So it is not reusable.

Are you talking about the "largest possible graph" that a pipeline could produce, which may in principle have infinitely many nodes/edges (unless we "compactify" generic providers).?

Yes, if by 'largest possible' you mean the entire pipeline. We don't need to instantiate generic parameters for this but can connect the generics directly in the graph.

@SimonHeybrock
Member

For generic providers, the output type is still restricted by the sciline.Scope, so we can find all providers that can consume data from the generic provider. And conversely, for a given generic type, we know all providers that can provide a matching type. Those two points mean that we can connect generic types and providers in a graph with multi-edges or multiple nodes per provider argument (essentially forming an 'overload set').

It is not that I disagree on all aspects here, but I still don't think it is possible to think about a sciline.Pipeline in this way, without knowing what we want to compute. Consider

from typing import TypeVar
import sciline

T = TypeVar('T')

def to_list(x: T) -> list[T]:
    return [x, x, x]

def compute(x: int) -> float:
    return 0.5 * x

pl = sciline.Pipeline((to_list, compute))
pl[int] = 3
pl.visualize(list[int])
pl.visualize(list[float])
[two rendered graphs: pl.visualize(list[int]) and pl.visualize(list[float])]

Basically, you can attach the generic provider at any node in the "graph". Unless you know what to compute, what graph would you draw, for the pipeline?

We can even go wild:
[rendered graph of a more elaborate example]

@jl-wynen
Member Author

jl-wynen commented Jan 24, 2024

How about this?

graph TD;
compute([compute])
to_list([to_list])
list["list[T]"]
int-->compute-->float;
int-->to_list;
float-->to_list;
list-->to_list;
to_list-->list;

@SimonHeybrock
Member

If you draw the loop from list[T] to to_list (instead of from itself) I agree. But is this useful in a graph with tens of nodes, or multiple generics?

@jl-wynen
Member Author

If you draw the loop from list[T] to to_list (instead of from itself) I agree.

Yes, my bad, I fixed it.

But is this useful in a graph with tens of nodes, or multiple generics?

Not necessarily useful for people to look at (except experts). But useful for inspecting the pipeline.

@SimonHeybrock
Member

Let us assume for now that one can internally represent a Pipeline using a graph (I am still not convinced it is feasible in general with generic providers). We could then use it for the following (a rough sketch follows the list):

Considering data nodes:

  • $0$ incoming edges $\leftrightarrow$ UnsatisfiedRequirement
  • $>1$ incoming edges $\leftrightarrow$ AmbiguousProvider
  • $0$ outgoing edges $\leftrightarrow$ might be an output, unless...
  • $0$ outgoing edges and incoming edge connected to nullary compute node $\leftrightarrow$ probably unused param
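
Assuming a graph representation in which data and provider nodes can be told apart (e.g. via a node attribute), these checks might look roughly like the sketch below; the names and structure are made up for illustration, not sciline's API:

import networkx as nx

def inspect_data_nodes(g: nx.DiGraph) -> dict[str, list]:
    # Assumes a bipartite graph whose nodes carry kind='data' or kind='provider';
    # parameters are modelled as nullary provider nodes.
    data_nodes = [n for n, d in g.nodes(data=True) if d.get('kind') == 'data']
    report = {'unsatisfied': [], 'ambiguous': [], 'outputs': [], 'unused_params': []}
    for n in data_nodes:
        providers = list(g.predecessors(n))
        consumers = list(g.successors(n))
        if len(providers) == 0:
            report['unsatisfied'].append(n)   # UnsatisfiedRequirement
        elif len(providers) > 1:
            report['ambiguous'].append(n)     # AmbiguousProvider
        if not consumers:
            # No consumers: either an output or, if fed only by a nullary
            # provider (i.e. a parameter), probably an unused param.
            if providers and all(g.in_degree(p) == 0 for p in providers):
                report['unused_params'].append(n)
            else:
                report['outputs'].append(n)
    return report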

@SimonHeybrock
Copy link
Member

SimonHeybrock commented Jan 25, 2024

After staring some more at #111 I think that our problems may be solved under the assumption that all generics use typevars that are constrained. We can then, e.g., represent a provider for A[T1, T2] as a set of nodes for all possible combinations of the constraints of T1 and T2.

This should then allow for running analysis on the graph, e.g., to tell which combinations of T1 and T2 miss inputs, ...

There may also be some solution for bound (not constrained) typevars, but as we use those rarely I am not going to worry about this until I am confident that my thoughts for the constrained case work out.
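
A minimal illustration of the constrained-typevar idea; A, T1, and T2 mirror the example above, while the specific constraints are made up for the sketch:

from itertools import product
from typing import Generic, TypeVar

# Constrained typevars: the complete set of admissible types is known up front.
T1 = TypeVar('T1', int, float)
T2 = TypeVar('T2', str, bytes)

class A(Generic[T1, T2]): ...

def make_a(x: T1, y: T2) -> A[T1, T2]: ...

# Expand the generic provider into one node per combination of constraints.
nodes = [A[c1, c2] for c1, c2 in product(T1.__constraints__, T2.__constraints__)]
# -> [A[int, str], A[int, bytes], A[float, str], A[float, bytes]]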

@SimonHeybrock
Copy link
Member

Kind of done (or outdated) by rewrite.

@github-project-automation github-project-automation bot moved this from Triage to Done in Development Board Jun 10, 2024