
Support for graph operations #107

Closed
jl-wynen opened this issue Jan 22, 2024 · 13 comments

@jl-wynen
Member

At the moment, Pipeline is largely a black box in terms of the graph it represents. It can build TaskGraph objects, but only when the user knows suitable nodes to pass to build or get (as in tg = p.get(p._providers.keys()) below). We now have several issues (open and closed) that relate to graph operations and could be solved easily if we had such operations:

  • save pipeline structure with parameters for reproduction #92 Potentially simple to implement with graph traversal functionality.
  • Graph visualisation including Suggestions for graph visualization #73 and existing graphviz converter.
  • Allow visualization of graphs with missing requirements #80 and Get missing params from a pipeline #83 are a bit trickier; we would have to build a partial graph and scan it for nodes that are missing an edge (i.e. missing provider for an arg)

NetworkX seems like a good choice for a graph implementation because it provides all basic algorithms out of the box. It becomes trivial to solve #43 as suggested by @YooSunYoung, #92 for several formats, and maybe visualisation (might get removed from NetworkX).

Current graph traversal

Unfortunately, it is currently cumbersome, expensive, and not technically allowed to traverse graphs. E.g., to build a NetworkX graph, we need something like (note the protected attributes and functions)

import networkx as nx
from sciline.visualize import _format_type as format_type
from sciline import Pipeline

def to_networkx(p: Pipeline) -> nx.DiGraph:
    # Request every key the pipeline can provide to obtain a graph of the
    # whole pipeline (note: _providers and TaskGraph._graph are protected).
    tg = p.get(p._providers.keys())
    g = nx.DiGraph()
    for res, (provider, args) in tg._graph.items():
        g.add_node(hash(res), label=format_type(res).name)
        for arg in args:
            g.add_node(hash(arg), label=format_type(arg).name)
            # Edge from each argument to the result, labelled with the provider.
            g.add_edge(hash(arg), hash(res), label=provider.__qualname__)
    return g
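
Once such a conversion exists, standard NetworkX algorithms cover the primitive operations we need. A minimal sketch of what that buys us (assuming the to_networkx helper above and a Pipeline instance p; purely illustrative):

import networkx as nx

g = to_networkx(p)

# Disconnected sub-pipelines (#43) fall out of a single library call.
components = list(nx.weakly_connected_components(g))

# Nodes without incoming edges: parameters or unsatisfied requirements.
sources = [n for n, deg in g.in_degree() if deg == 0]

# Nodes without outgoing edges: candidate outputs.
sinks = [n for n, deg in g.out_degree() if deg == 0]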

Potential solutions

I talk about NetworkX here, but we can also use a different library or custom graph implementation. The general ideas remain the same.

1. Implement to_networkx in Pipeline

This is the least invasive and easiest to implement. But it would mean that operations that need to inspect the graph have to build a temporary nx.DiGraph, which is wasteful.

2. Represent the internal state of Pipeline as a DiGraph

NetworkX allows storing arbitrary data in nodes and edges, so we can stuff everything we need into the graph. This would make graph operations first-class citizens and would open up more possibilities. (E.g., we could use node metadata to indicate whether a provider is a param or not, something that is currently difficult (impossible?) to determine.)
On the other hand, this would require a complete refactoring of Pipeline. And I don't even know if there is a straightforward way of representing everything as a graph, in particular when it comes to subproviders.
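
For illustration only (this is not how Pipeline is implemented today), a sketch of how node metadata in an nx.DiGraph could record whether a key is a parameter or comes from a provider:

import networkx as nx

def length(s: str) -> int:  # made-up provider for the sketch
    return len(s)

g = nx.DiGraph()
# Data types are nodes; parameter/provider information lives in node attributes.
g.add_node(str, is_param=True, value='hello')
g.add_node(int, is_param=False, provider=length)
g.add_edge(str, int)  # `length` consumes str to produce int

# With metadata in place, "is this a param?" becomes a trivial query.
params = [n for n, data in g.nodes(data=True) if data.get('is_param')]
assert params == [str]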

@SimonHeybrock
Member

SimonHeybrock commented Jan 23, 2024

#74 (or #74 (comment)) can be added to your list, I think.

save pipeline structure with parameters for reproduction #92 Potentially simple to implement with graph traversal functionality.

This is somewhat different: We want to save task-graphs, which we know to be a DAG. Much less complexity there than what Pipeline has to do/support.

Graph visualisation including Suggestions for graph visualization #73 and existing graphviz converter.

Visualization is kind of unrelated to graphs. The problems we face there will not be solved by a graph library or something along those lines, imho.

Allow visualization of graphs with missing requirements #80 and Get missing params from a pipeline #83 are a bit trickier; we would have to build a partial graph and scan it for nodes that are missing an edge (i.e. missing provider for an arg)

Note that this is essentially solved already in the current implementation.

NetworkX seems like a good choice for a graph implementation because it provides all basic algorithms out of the box. It becomes trivial to solve #43 as suggested by @YooSunYoung, #92 for several formats, and maybe visualisation (might get removed from NetworkX).

Does it solve anything else apart from #43?

Current graph traversal

Unfortunately, it is currently cumbersome, expensive, and not technically allowed to traverse graphs. E.g., to build a NetworkX graph, we need something like (note the protected attributes and functions)

Why do you say it is expensive? Can you elaborate?

import networkx as nx
from sciline.visualize import _format_type as format_type
from sciline import Pipeline

def to_networkx(p: Pipeline) -> nx.DiGraph:
    tg = p.get(p._providers.keys())
    g = nx.DiGraph()
    for res, (provider, args) in tg._graph.items():
        g.add_node(hash(res), label=format_type(res).name)
        for arg in args:
            g.add_node(hash(arg), label=format_type(arg).name)
            g.add_edge(hash(arg), hash(res), label=provider.__qualname__)
    return g

Shouldn't providers also be treated as nodes?

Potential solutions

I talk about NetworkX here, but we can also use a different library or custom graph implementation. The general ideas remain the same.

1. Implement to_networkx in Pipeline

This is the least invasive and easiest to implement. But it would mean that operations that need to inspect the graph need to make a temporary nx.DiGraph which is wasteful.

Is it wasteful to a relevant amount?

2. Represent the internal state of Pipeline as a DiGraph

NetworkX allows storing arbitrary data in nodes and edges, so we can stuff everything we need into the graph. This would make graph operations first-class citizens and would open up more possibilities. (E.g., we could use node metadata to indicate whether a provider is a param or not, something that is currently difficult (impossible?) to determine.) On the other hand, this would require a complete refactoring of Pipeline. And I don't even know if there is a straightforward way of representing everything as a graph, in particular when it comes to subproviders.

I don't feel this is going to work, given the support for param tables and generic providers. Furthermore, it would make such a graph library a mandatory dependency. Until we have clear evidence that the benefit is large (see my question a couple of paragraphs above: does it solve anything but #43?), I do not see a justification for that.

@jl-wynen
Member Author

This is somewhat different: We want to save task-graphs, which we know to be a DAG. Much less complexity there than what Pipeline has to do/support.

Can you show me an example of a pipeline that is not a DAG? Can this exist given that we can convert pipelines to task graphs?

Visualization is kind of unrelated to graphs. The problems we face there will not be solved by a graph library or something along those lines, imho.

Yes. But it would be helped by traversal functions. We can of course implement them ourselves.

Does it solve anything else apart from #43?

It provides the primitive operations needed to compose everything else. Plus high-level ones that can make our lives easier. But yes, if you are looking for a single function that fixes an open issue, it's only #43.

Why do you say it is expensive? Can you elaborate?

Because of tg = p.get(p._providers.keys()), which is O(nodes) × O(build_graph), i.e. the graph is effectively built once per provided key.

Shouldn't providers also be treated as nodes?

Can be. The choice here matches how our and Dask's task graphs are implemented. But if this doesn't work in general, we can make providers nodes.
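
For reference, a rough sketch of the two conventions under discussion (purely illustrative, not sciline's actual data structures): the task-graph style where only data keys are nodes and the provider is attached as metadata, versus a bipartite graph with explicit provider nodes:

import networkx as nx

def clean(raw: str) -> int:  # made-up provider
    return len(raw)

# Task-graph style (roughly how our TaskGraph and Dask's dict-based graphs
# look when viewed as a graph): only data keys are nodes, the provider is
# stored on the edge (or on the target node).
g1 = nx.DiGraph()
g1.add_edge(str, int, provider=clean)

# Alternative: providers become nodes of their own (a bipartite graph).
g2 = nx.DiGraph()
g2.add_edge(str, clean)   # the provider consumes str
g2.add_edge(clean, int)   # the provider produces int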

Is it wasteful to a relevant amount?

Depends on how many checks you run and whether they can share the intermediate graph. And, of course, what latency you are willing to allow for compute.

@SimonHeybrock
Member

This is somewhat different: We want to save task-graphs, which we know to be a DAG. Much less complexity there than what Pipeline has to do/support.

Can you show me an example of a pipeline that is not a DAG? Can this exist given that we can convert pipelines to task graphs?

I don't think thinking of a pipeline as a specific graph is possible in general. Due to (in particular) generic providers, it is not possible to draw a graph without knowing which keys the user wants to compute. Furthermore, it is possible to create a pipeline with cycles, etc., without errors.

Visualization is kind of unrelated to graphs. The problems we face there will not be solved by a graph library or something along those lines, imho.

Yes. But it would be helped by traversal functions. We can of course implement them ourselves.

Not sure what you have in mind. Are you talking about visualizing the pipeline as a graph (as opposed to the resulting task-graphs)?

Does it solve anything else apart from #43?

It provides the primitive operations needed to compose everything else. Plus high-level ones that can make our lives easier. But yes, if you are looking for a single function that fixes an open issue, it's only #43.

Compose what else, can you elaborate? And what "high level ones"?

Why do you say it is expensive? Can you elaborate?

Because of tg = p.get(p._providers.keys()), which is O(nodes) × O(build_graph), i.e. the graph is effectively built once per provided key.

I do not understand what you are computing there, and why it is needed. Why do we need to rebuild the graph for every provided type in the pipeline?

Shouldn't providers also be treated as nodes?

Can be. The choice here matches how our and Dask's task graphs are implemented. But if this doesn't work in general, we can make providers nodes.

I think we need to differentiate here between Pipeline and TaskGraph. I agree for the latter, but the requirements listed initially are at least in part unrelated. Therefore, I don't think that how Dask or TaskGraph represent their graphs is necessarily meaningful here.

Is it wasteful to a relevant amount?

Depends on how many checks you run and whether they can share the intermediate graph. And, of course, what latency you are willing to allow for compute.

I know it depends, that is why I am asking. ;)

@jl-wynen
Member Author

Are you talking about visualizing the pipeline as a graph (as opposed to the resulting task-graphs)?

Yes, the inspection algorithms (finding unused params, missing params) need to inspect the entire pipeline, no? So we need some way of operating on it. Currently, that is not really possible. We can only get a concrete task graph. The purpose of tg = p.get(p._providers.keys()) is to get a graph for the entire pipeline. (I know that it is an incomplete solution. It was meant to point out that we need support from Pipeline, not just TaskGraph.)

Due to (in particular) generic providers it is not possible to draw a graph without knowing which keys the users wants to compute.

I don't think this is correct. Generic providers are not truly generic but always restricted to a concrete (open) set of types. This means that given

def a() -> X[int]: ...
def b() -> X[float]: ...
def c(x: X[T]) -> None: ...

we can have the combined [a, b] as the incoming edges to X and c as an outgoing edge.
In other words, for any given pipeline, we know all providers that can potentially make any given generic type. So we can build a graph for them. The edges of that graph may be complex. But that is not a priori a problem. (Works just as well when providers are nodes.)
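
A self-contained sketch of this 'overload set' idea (the graph layout is one possible choice for illustration, not an existing sciline structure):

import networkx as nx
from typing import Generic, TypeVar

T = TypeVar('T')

class X(Generic[T]): ...

def a() -> X[int]: ...
def b() -> X[float]: ...
def c(x: X[T]) -> None: ...

# A single node stands for the generic key X[T]. Every provider that can
# produce a matching concrete type feeds into it (the 'overload set'),
# and every consumer of X[T] hangs off it.
g = nx.MultiDiGraph()
for provider in (a, b):
    g.add_edge(provider.__name__, 'X[T]')
g.add_edge('X[T]', 'c')

assert set(g.predecessors('X[T]')) == {'a', 'b'}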

Furthermore it is possible to create a pipeline with cycles, etc., without errors.

Is it? How can such a pipeline ever produce a DAG?

Compose what else, can you elaborate? And what "high level ones"?

The primitive operations are traversal, adding and removing nodes and edges. We have the latter but not the former. By "high level", I mean pretty much everything else. We can, at least in principle, express those in terms of the primitives. E.g., connected_components, converting to another format (dot or other file formats), including task graphs, as well as scheduling task graphs.
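
For concreteness, a few of those high-level operations expressed as NetworkX calls (assuming a DiGraph g for the pipeline; the dot export relies on the optional pydot dependency):

import networkx as nx

g = nx.DiGraph([('raw', 'cleaned'), ('cleaned', 'result')])  # toy graph

# Connected components (a generalization of #43).
components = list(nx.weakly_connected_components(g))

# Conversion to another format, e.g. dot.
nx.drawing.nx_pydot.write_dot(g, 'pipeline.dot')

# A topological order is essentially a (serial) schedule for a task graph.
order = list(nx.topological_sort(g))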

@SimonHeybrock
Member

Are you talking about visualizing the pipeline as a graph (as opposed to the resulting task-graphs)?

Yes, the inspection algorithms (finding unused params, missing params) need to inspect the entire pipeline, no? So we need some way of operating on it. Currently, that is not really possible. We can only get a concrete task graph. The purpose of tg = p.get(p._providers.keys()) is to get a graph for the entire pipeline. (I know that it is an incomplete solution. It was meant to point out that we need support from Pipeline, not just TaskGraph.)

I am confused now; you appear to be mixing two of my comments (visualization of the pipeline vs. cost of "graph" setup).

Due to (in particular) generic providers it is not possible to draw a graph without knowing which keys the users wants to compute.

I don't think this is correct. Generic providers are not truly generic but always restricted to a concrete (open) set of types.

Are they? How?

This means that given

def a() -> X[int]: ...
def b() -> X[float]: ...
def c(x: X[T]) -> None: ...

we can have the combined [a, b] as the incoming edges to X and c as an outgoing edge. In other words, for any given pipeline, we know all providers that can potentially make any given generic type. So we can build a graph for them. The edges of that graph may be complex. But that is not a priori a problem. (Works just as well when providers are nodes.)

What about X[str], etc.? Again, I don't understand. From my point of view a generic provider can provide an a priori infinite number of output types.

Furthermore it is possible to create a pipeline with cycles, etc., without errors.

Is it? How can such a pipeline ever produce a DAG?

The pipeline does not build a graph upon creation (as you have experienced, this is non-trivial without knowing, e.g., the output keys). Even if there are cycles, the pipeline may have acyclic parts that can produce a DAG.

Compose what else, can you elaborate? And what "high level ones"?

The primitive operations are traversal, adding and removing nodes and edges. We have the latter but not the former.

You mean we do not have traversal? Is this what we do when a pipeline builds a task graph, or do you have something else in mind?

By "high level", I mean pretty much everything else. We can, at least in principle, express those in terms of the primitives. E.g., connected_components, converting to another format (dot or other file formats), including task graphs, as well as scheduling task graphs.

connected_components is a generalization of #43, right? For the other point I want to reiterate that we have to distinguish between Pipeline and TaskGraph. Are you talking about the "largest possible graph" that a pipeline could produce, which may in principle have infinitely many nodes/edges (unless we "compactify" generic providers)?

@jl-wynen
Member Author

For generic providers, the output type is still restricted by the sciline.Scope, so we can find all providers that can consume data from the generic provider. And conversely, for a given generic type, we know all providers that can provide a matching type. Those two points mean that we can connect generic types and providers in a graph with multi-edges or multiple nodes per provider argument (essentially forming an 'overload set').

Even if there are cycles, the pipeline may have a-cyclic parts that can produce a DAG.

Yes, but is the cyclic part valid? Can you give an example where a cycle in a pipeline is both valid and useful? (I.e. not in a part of the pipeline that is never used for a task graph.)

You mean we do not have traversal? Is this what we do when a pipeline builds a task graph, or do you have something else in mind?

In this case, the traversal is entangled with the build logic. So it is not reusable.

Are you talking about the "largest possible graph" that a pipeline could produce, which may in principle have infinitely many nodes/edges (unless we "compactify" generic providers).?

Yes, if by 'largest possible' you mean the entire pipeline. We don't need to instantiate generic parameters for this but can connect the generics directly in the graph.

@SimonHeybrock
Member

For generic providers, the output type is still restricted by the sciline.Scope, so we can find all providers that can consume data from the generic provider. And conversely, for a given generic type, we know all providers that can provide a matching type. Those two points mean that we can connect generic types and providers in a graph with multi-edges or multiple nodes per provider argument (essentially forming an 'overload set').

It is not that I disagree on all aspects here, but I still don't think it is possible to think about a sciline.Pipeline in this way, without knowing what we want to compute. Consider

from typing import TypeVar
import sciline

T = TypeVar('T')

def to_list(x: T) -> list[T]:
    return [x, x, x]

def compute(x: int) -> float:
    return 0.5 * x

pl = sciline.Pipeline((to_list, compute))
pl[int] = 3
pl.visualize(list[int])
pl.visualize(list[float])
[two rendered graphs: pl.visualize(list[int]) and pl.visualize(list[float])]

Basically, you can attach the generic provider at any node in the "graph". Unless you know what to compute, what graph would you draw, for the pipeline?

We can even go wild:
[rendered graph of a more elaborate example]

@jl-wynen
Member Author

jl-wynen commented Jan 24, 2024

How about this?

graph TD;
compute([compute])
to_list([to_list])
list["list[T]"]
int-->compute-->float;
int-->to_list;
float-->to_list;
list-->to_list;
to_list-->list;

@SimonHeybrock
Member

If you draw the loop from list[T] to to_list (instead of from itself) I agree. But is this useful in a graph with tens of nodes, or multiple generics?

@jl-wynen
Member Author

If you draw the loop from list[T] to to_list (instead of from itself) I agree.

Yes, my bad, I fixed it.

But is this useful in a graph with tens of nodes, or multiple generics?

Not necessarily useful for people to look at (except experts). But useful for inspecting the pipeline.

@SimonHeybrock
Member

Let us assume for now that one can internally represent a Pipeline using a graph (I am still not convinced it is feasible in general with generic providers). We could then use it for the following (a rough sketch follows the list):

Considering data nodes:

  • $0$ incoming edges $\leftrightarrow$ UnsatisfiedRequirement
  • $>1$ incoming edges $\leftrightarrow$ AmbiguousProvider
  • $0$ outgoing edges $\leftrightarrow$ might be an output, unless...
  • $0$ outgoing edges and incoming edge connected to nullary compute node $\leftrightarrow$ probably unused param
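
Assuming a graph representation in which data and provider nodes can be told apart (e.g. via a node attribute), these checks might look roughly like the sketch below; the names and structure are made up for illustration, not sciline's API:

import networkx as nx

def inspect_data_nodes(g: nx.DiGraph) -> dict[str, list]:
    # Assumes a bipartite graph whose nodes carry kind='data' or kind='provider';
    # parameters are modelled as nullary provider nodes.
    data_nodes = [n for n, d in g.nodes(data=True) if d.get('kind') == 'data']
    report = {'unsatisfied': [], 'ambiguous': [], 'outputs': [], 'unused_params': []}
    for n in data_nodes:
        providers = list(g.predecessors(n))
        consumers = list(g.successors(n))
        if len(providers) == 0:
            report['unsatisfied'].append(n)   # UnsatisfiedRequirement
        elif len(providers) > 1:
            report['ambiguous'].append(n)     # AmbiguousProvider
        if not consumers:
            # No consumers: either an output or, if fed only by a nullary
            # provider (i.e. a parameter), probably an unused param.
            if providers and all(g.in_degree(p) == 0 for p in providers):
                report['unused_params'].append(n)
            else:
                report['outputs'].append(n)
    return report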

@SimonHeybrock
Copy link
Member

SimonHeybrock commented Jan 25, 2024

After staring some more at #111 I think that our problems may be solved under the assumption that all generics use typevars that are constrained. We can then, e.g., represent a provider for A[T1, T2] as a set of nodes for all possible combinations of the constraints of T1 and T2.

This should then allow for running analysis on the graph, e.g., to tell which combinations of T1 and T2 miss inputs, ...

There may also be some solution for bound (not constrained) typevars, but as we use those rarely I am not going to worry about this until I am confident that my thoughts for the constrained case work out.
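
A minimal illustration of the constrained-typevar idea; A, T1, and T2 mirror the example above, while the specific constraints are made up for the sketch:

from itertools import product
from typing import Generic, TypeVar

# Constrained typevars: the complete set of admissible types is known up front.
T1 = TypeVar('T1', int, float)
T2 = TypeVar('T2', str, bytes)

class A(Generic[T1, T2]): ...

def make_a(x: T1, y: T2) -> A[T1, T2]: ...

# Expand the generic provider into one node per combination of constraints.
nodes = [A[c1, c2] for c1, c2 in product(T1.__constraints__, T2.__constraints__)]
# -> [A[int, str], A[int, bytes], A[float, str], A[float, bytes]]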

@SimonHeybrock
Copy link
Member

Kind of done (or outdated) by rewrite.

@github-project-automation github-project-automation bot moved this from Triage to Done in Development Board Jun 10, 2024