add pool-to-pool ETL #30

Merged
mccanne merged 3 commits into main from transform
Nov 17, 2021

Conversation

@mccanne mccanne commented Nov 6, 2021

This commit is a rough first draft for Debezium-style ETL on
CDC logs. It handles denormalization of two tables into one
as well as stateless transforms on CDC logs.

We renamed the "sync from" and "sync to" sub-commands to
"from-kafka" and "to-kafka", respectively.

We also updated the zed pointer and ported the code to use
zed.Value instead of zed.Record.

The README contains a demo walkthrough of the basics.

@mccanne mccanne requested a review from nwt November 6, 2021 15:01
README.md Outdated
Comment on lines 55 to 56
This transform the Zed input to Avro and posts it to the topic.
The consumer then converts the Avro back to Zed and displays it.

Suggested change
- This transform the Zed input to Avro and posts it to the topic.
- The consumer then converts the Avro back to Zed and displays it.
+ This transforms the ZSON input to Avro and posts it to the topic.
+ The consumer then converts the Avro back to ZSON and displays it.

README.md Outdated

- `zync sync from` formats records received from Kafka using the Zed envelope
+ `zync from-kafka` formats records received from Kafka using the Zed envelope

Suggested change
- `zync from-kafka` formats records received from Kafka using the Zed envelope
+ `zync from-kafka` encapsulates records received from Kafka using the envelope

README.md Outdated
overlapping offsets, only one will succeed. The others will detect the conflict,
recompute the `kafka.offset`'s accounting for the data provided in the
conflicting commit, and retry the commit.
all data committed by zync writers must have monotonically increasing `kafka.offset`

Suggested change
- all data committed by zync writers must have monotonically increasing `kafka.offset`
+ all data committed by `zync` writers must have monotonically increasing `kafka.offset`
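
For readers following the README passage under review, the detect, recompute, and retry behavior it describes can be sketched roughly as below. This is only an illustrative toy, not the code from this PR: `pool`, `errConflict`, and `commitWithRetry` are hypothetical names, the pool is a single-threaded in-memory stand-in, and the real implementation commits to a Zed lake.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a hypothetical error meaning another writer has
// already committed data covering the offsets we tried to write.
var errConflict = errors.New("commit conflict")

// pool is a toy stand-in for a lake pool. It only tracks the next
// kafka.offset it expects, which keeps offsets monotonically increasing.
type pool struct {
	next int64
}

// commit succeeds only when the batch starts at the expected offset.
func (p *pool) commit(first int64, n int) error {
	if first != p.next {
		return errConflict
	}
	p.next += int64(n)
	return nil
}

// commitWithRetry retries after a conflict, recomputing the batch's
// starting offset from the pool state left by the conflicting commit.
func commitWithRetry(p *pool, first int64, n, maxTries int) (int64, error) {
	for i := 0; i < maxTries; i++ {
		err := p.commit(first, n)
		if err == nil {
			return first, nil
		}
		if !errors.Is(err, errConflict) {
			return 0, err
		}
		first = p.next // account for the data in the conflicting commit
	}
	return 0, fmt.Errorf("giving up after %d tries: %w", maxTries, errConflict)
}

func main() {
	p := &pool{}
	a, _ := commitWithRetry(p, 0, 3, 5)
	b, _ := commitWithRetry(p, 0, 2, 5) // stale offset: conflicts once, then retries
	fmt.Println(a, b, p.next)           // prints "0 3 5"
}
```

The second writer starts from a stale offset, loses the first attempt, and lands its batch at offset 3 on the retry.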

README.md Outdated
of multiple tables into one.

The model here is that `zync etl` processes data from an input pool to an output
pool where `from-kafka` is populating the input pool and `to-kafka` is processing

Suggested change
- pool where `from-kafka` is populating the input pool and `to-kafka` is processing
+ pool where `zync from-kafka` is populating the input pool and `zync to-kafka` is processing

etl/pool.go Outdated
}

func (*adaptor) NewScheduler(context.Context, *zed.Context, dag.Source, extent.Span, zbuf.Filter, *dag.Filter) (proc.Scheduler, error) {
return nil, fmt.Errorf("mock.Lake.NewScheduler() should not be called")

Suggested change
- return nil, fmt.Errorf("mock.Lake.NewScheduler() should not be called")
+ return nil, errors.New("etl.adaptor.NewScheduler should not be called")
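
The suggestion does not state its rationale. One plausible reading is the common Go convention of using `errors.New` when the message is a constant string and reserving `fmt.Errorf` for messages that interpolate values. A minimal sketch of that convention, with hypothetical helper names that are not part of this PR:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotCalled uses errors.New because the message is a constant string.
func errNotCalled() error {
	return errors.New("etl.adaptor.NewScheduler should not be called")
}

// errWithDetail uses fmt.Errorf because the message interpolates a value.
func errWithDetail(method string) error {
	return fmt.Errorf("etl.adaptor.%s should not be called", method)
}

func main() {
	fmt.Println(errNotCalled())
	fmt.Println(errWithDetail("NewScheduler"))
}
```

Both calls produce the same message here; the difference is only that the constant case avoids format-string parsing.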

etl/pool.go Outdated

type adaptor struct{}

func (*adaptor) Layout(_ context.Context, src dag.Source) order.Layout {

Suggested change
- func (*adaptor) Layout(_ context.Context, src dag.Source) order.Layout {
+ func (*adaptor) Layout(context.Context, dag.Source) order.Layout {

etl/pool.go Outdated
Comment on lines 184 to 185
func (*adaptor) Open(_ context.Context, _ *zed.Context, _ string, _ zbuf.Filter) (zbuf.PullerCloser, error) {
return nil, fmt.Errorf("mock.Lake.Open() should not be called")

Suggested change
- func (*adaptor) Open(_ context.Context, _ *zed.Context, _ string, _ zbuf.Filter) (zbuf.PullerCloser, error) {
- return nil, fmt.Errorf("mock.Lake.Open() should not be called")
+ func (*adaptor) Open(context.Context, *zed.Context, string, zbuf.Filter) (zbuf.PullerCloser, error) {
+ return nil, errors.New("etl.adaptor.Open should not be called")

return ksuid.Nil, nil
}

func (*adaptor) CommitObject(_ context.Context, _ ksuid.KSUID, _ string) (ksuid.KSUID, error) {

Suggested change
- func (*adaptor) CommitObject(_ context.Context, _ ksuid.KSUID, _ string) (ksuid.KSUID, error) {
+ func (*adaptor) CommitObject(context.Context, ksuid.KSUID, string) (ksuid.KSUID, error) {

etl/pool.go Outdated
Comment on lines 170 to 172
func (*batchDriver) Warn(warning string) error { return nil }
func (*batchDriver) Stats(stats zbuf.ScannerStats) error { return nil }
func (*batchDriver) ChannelEnd(cid int) error { return nil }

Suggested change
- func (*batchDriver) Warn(warning string) error { return nil }
- func (*batchDriver) Stats(stats zbuf.ScannerStats) error { return nil }
- func (*batchDriver) ChannelEnd(cid int) error { return nil }
+ func (*batchDriver) Warn(string) error { return nil }
+ func (*batchDriver) Stats(zbuf.ScannerStats) error { return nil }
+ func (*batchDriver) ChannelEnd(int) error { return nil }
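
The parameter-name suggestions in this review rely on a Go rule: a method may leave its parameters unnamed when it does not use them, and it still satisfies any interface that declares names for those parameters. A small self-contained sketch, where the `driver` interface and `nopDriver` type are hypothetical and only loosely modeled on the callbacks in the diff:

```go
package main

import "fmt"

// driver is a hypothetical callback interface. The parameter names in
// an interface declaration are documentation only.
type driver interface {
	Warn(warning string) error
	ChannelEnd(cid int) error
}

// nopDriver ignores every callback, so its parameters are left unnamed.
type nopDriver struct{}

func (*nopDriver) Warn(string) error    { return nil }
func (*nopDriver) ChannelEnd(int) error { return nil }

// Compile-time check that *nopDriver satisfies driver.
var _ driver = (*nopDriver)(nil)

func main() {
	var d driver = &nopDriver{}
	fmt.Println(d.Warn("ignored"), d.ChannelEnd(0))
}
```

Dropping the names, rather than writing `_` for each one, signals at a glance that the method is a deliberate no-op.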

@mccanne mccanne merged commit 81b1f65 into main Nov 17, 2021
@mccanne mccanne deleted the transform branch November 17, 2021 20:01