Commit 94c8812

Author: Shlomi Noach
Merge pull request #74 from github/status-clueanup-comments
Retries, better visibility, documentation
2 parents 9197eed + f0b012b commit 94c8812

10 files changed, +304 -75 lines changed

doc/triggerless-design.md

+138
@@ -0,0 +1,138 @@
# Triggerless design

A breakdown of the logic and algorithm behind `gh-ost`'s triggerless design, followed by the implications, advantages and disadvantages of such a design.

### Trigger-based migrations background

It is worthwhile to consider two popular existing online schema change solutions:

- [pt-online-schema-change](https://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html)
- [Facebook OSC](https://www.facebook.com/notes/mysql-at-facebook/online-schema-change-for-mysql/430801045932/)

The former uses a synchronous design: it adds three triggers (`AFTER INSERT`, `AFTER UPDATE`, `AFTER DELETE`) on the original table. Each trigger relays the operation onto the ghost table. So for every `UPDATE` on the original table, an `UPDATE` executes on the ghost table. A `DELETE` on the original table triggers a `DELETE` on the ghost table. The same goes for `INSERT`. The triggers live in the same transaction space as the original query.

The latter uses an asynchronous design: it adds three triggers (`AFTER INSERT`, `AFTER UPDATE`, `AFTER DELETE`) on the original table. It also creates a _changelog_ table. The triggers do not relay operations directly to the ghost table. Instead, they each add an entry to the changelog table. An `UPDATE` on the original table makes for an `INSERT` on the changelog table saying "There was an UPDATE on the original table with these values"; likewise for `INSERT` and `DELETE`.
A background process tails the changelog table and applies the changes onto the ghost table. This approach is asynchronous in that the applier does not live in the same transaction space as the original table, and may operate on a change event seconds or more after said event was written.
It is noteworthy that the writes to the changelog table still live in the same transaction space as the writes on the original table.

### Triggerless based asynchronous migrations

`gh-ost`'s triggerless design uses an asynchronous approach. However, it does not require triggers because it does not require a _changelog_ table like the Facebook tool does. The reason it does not require a changelog table is that it finds the changelog in another place: the binary logs.

In particular, it reads Row Based Replication (RBR) entries (you can still [use it with Statement Based Replication!](migrating-with-sbr.md)) and searches for entries that apply to the original table.

RBR entries are very convenient for this job: they break complex statements, potentially multi-table, into distinct, per-table, per-row entries, which are easy to read and apply.

`gh-ost` pretends to be a MySQL replica: it connects to the MySQL server and begins requesting binlog events as though it were a real replication server. Thus, it gets a continuous stream of the binary logs, and keeps only those events that apply to the original table.

`gh-ost` can connect directly to the master, but prefers to connect to one of its replicas. Such a replica would need to use `log-slave-updates` and use `binlog-format=ROW` (`gh-ost` can change the latter setting for you).

Reading from the binary log, especially when reading it on a replica, further stresses the asynchronous nature of the algorithm. While the transaction _may_ (based on configuration) be synced with the binlog entry write, it takes time until `gh-ost`, pretending to be a replica, gets notified of it, copies the event downstream and applies it.

The asynchronous design implies many noteworthy outcomes, to be discussed later on.

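As a rough illustration of the filtering step described above, here is a minimal Go sketch. The `RowEvent` type and the `filterTableEvents` function are hypothetical names made up for this example, not `gh-ost`'s actual types, and the sketch reads events from a slice where a real tool would read them from a replication stream.

```go
package main

import (
	"fmt"
	"strings"
)

// RowEvent is a hypothetical, simplified stand-in for one row-based
// replication entry: a single DML affecting a single row of a single table.
type RowEvent struct {
	Database string
	Table    string
	DML      string // "insert", "update" or "delete"
}

// filterTableEvents keeps only the events that apply to the given table.
// Table name case sensitivity depends on server settings in MySQL; this
// sketch assumes case-insensitive names.
func filterTableEvents(events []RowEvent, database, table string) []RowEvent {
	var kept []RowEvent
	for _, ev := range events {
		if strings.EqualFold(ev.Database, database) && strings.EqualFold(ev.Table, table) {
			kept = append(kept, ev)
		}
	}
	return kept
}

func main() {
	stream := []RowEvent{
		{Database: "shop", Table: "orders", DML: "insert"},
		{Database: "shop", Table: "users", DML: "update"},
		{Database: "shop", Table: "orders", DML: "delete"},
	}
	for _, ev := range filterTableEvents(stream, "shop", "orders") {
		fmt.Printf("%s on %s.%s\n", ev.DML, ev.Database, ev.Table)
	}
}
```

Because RBR entries are already per-table and per-row, the filter only needs to compare names; no statement parsing is involved.
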
### Workflow overview

The workflow includes reading table data from the server, reading event data from the binary log, checking for replication lag or other throttling parameters, applying changes onto the server (typically the master), sending hints through the binary log stream and more.

Some flow breakdown:

#### Initial setup & validation
Initial setup is a no-concurrency operation

- Connecting to replica/master, detecting master identity
- Pre-validating `alter` statement
- Initial sanity: privileges, existence of tables
- Creation of changelog and ghost tables
- Applying `alter` on ghost table
- Comparing structure of original & ghost table. Looking for shared columns, shared unique keys, validating foreign keys. Choosing shared unique key, the key by which we chunk the table and process it.
- Setting up the binlog listener; begin listening on changelog events
- Injecting a "good to go" entry onto the changelog table (to be intercepted via binary logs)
- Begin listening on binlog events for original table DMLs
- Reading original table's chosen key min/max values

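To make the role of the chosen key's min/max values more concrete, here is a minimal Go sketch of splitting a key range into chunks. It assumes a single integer key and splits by key value, which is a simplification made for this example: the text above only says the table is chunked by a shared unique key, and a unique key may span several columns. The names `chunkRange` and `chunkRanges` are hypothetical.

```go
package main

import "fmt"

// chunkRange is a hypothetical inclusive range [From, To] over an integer key.
type chunkRange struct{ From, To int64 }

// chunkRanges splits [lo, hi] into consecutive ranges covering at most `size`
// key values each. With gaps in the key, a chunk may hold fewer rows.
func chunkRanges(lo, hi, size int64) []chunkRange {
	if size <= 0 || lo > hi {
		return nil
	}
	var ranges []chunkRange
	for from := lo; ; {
		to := from + size - 1
		// The second condition guards against int64 overflow near the top of the range.
		if to >= hi || to < from {
			return append(ranges, chunkRange{from, hi})
		}
		ranges = append(ranges, chunkRange{from, to})
		from = to + 1
	}
}

func main() {
	for _, r := range chunkRanges(1, 2500, 1000) {
		fmt.Printf("copy rows where key between %d and %d\n", r.From, r.To)
	}
}
```
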
#### Copy flow
This setup includes multiple moving parts, all acting concurrently with some coordination

- Setting up a heartbeat mechanism: frequent writes on the changelog table (we consider this to be low, negligible write load for throttling purposes)
- Continuously updating status
- Periodically (frequently) checking for potential throttle scenarios or hints
- Working through the original table's rows range, chunk by chunk, queueing copy tasks onto the ghost table
- Reading DML events from the binlogs, queueing apply tasks onto the ghost table
- Processing the copy tasks queue and the apply tasks queue and sequentially applying onto ghost table
- Suspending by throttle state
- Injecting/intercepting "copy all done" once full row-copy range has been exhausted
- Stall/postpone while `postpone-cut-over-flag-file` exists (we keep applying ongoing DMLs)

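The two queues and the sequential applier in the list above can be sketched in Go roughly as follows. This is an illustration under assumptions, not `gh-ost`'s actual code: the `task` type and `runApplier` function are hypothetical, and preferring pending apply tasks over copy tasks is a choice made for this sketch.

```go
package main

import "fmt"

// task is a hypothetical unit of work to run against the ghost table.
type task func()

// runApplier drains both queues from a single goroutine, so the ghost table
// only ever sees one writer. Pending binlog apply tasks are preferred over
// row-copy tasks. It returns once both channels are closed and drained.
func runApplier(applyQueue, copyQueue <-chan task) {
	for applyQueue != nil || copyQueue != nil {
		// Non-blocking look at the apply queue first. A nil channel never
		// becomes ready, so the default branch is taken once it is drained.
		select {
		case t, ok := <-applyQueue:
			if !ok {
				applyQueue = nil
			} else {
				t()
			}
			continue
		default:
		}
		// Nothing pending on the apply queue: wait for work on either queue.
		select {
		case t, ok := <-applyQueue:
			if !ok {
				applyQueue = nil
			} else {
				t()
			}
		case t, ok := <-copyQueue:
			if !ok {
				copyQueue = nil
			} else {
				t()
			}
		}
	}
}

func main() {
	applyQueue := make(chan task, 2)
	copyQueue := make(chan task, 2)
	applyQueue <- func() { fmt.Println("apply binlog event 1") }
	applyQueue <- func() { fmt.Println("apply binlog event 2") }
	copyQueue <- func() { fmt.Println("copy chunk 1") }
	copyQueue <- func() { fmt.Println("copy chunk 2") }
	close(applyQueue)
	close(copyQueue)
	runApplier(applyQueue, copyQueue)
}
```

Throttling would fit naturally at the top of the loop: a paused applier simply does not pick up the next task.
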
#### Cut-over and completion

- Locking the original table for writes, working on what remains of the binlog event backlog (recall this is an asynchronous operation, and so even as the table is locked, we still have unhandled events in our pipe)
- Swapping the original table out, the ghost table in
- Cleanup: potential drop of tables

### Asynchronous design implications

#### Cut-over phase

A complication the asynchronous approach presents is the cut-over phase: the swapping of the tables. In the synchronous approach, the two tables are kept in sync thanks to the transaction space in which the triggers operate. Thus, a simple, atomic `rename table original to _original_old, ghost to original` suffices and is valid.

In the asynchronous approach, as we lock the original table, we often still have events in the pipeline: changes in the binary log we still need to apply onto the ghost table. An atomic swap would be a premature and incorrect solution, since the write load would immediately proceed to operate on what used to be the ghost table, even before we completed applying those last changes.

The Facebook solution uses an "outage", two-step rename:

- Lock the original table, work on backlog
- Rename original table to `_old`
- Rename ghost table to original

In between those two renames there's a point in time where the table does not exist, hence there's a "table outage".

`gh-ost` solves this by using an optimistic three-step locking algorithm. It is optimistic in that if no connection gets killed throughout this process, the cut-over is blocking: queries block on the original table and are unblocked after the ghost table has taken its place. Should any of the participating connections get killed throughout this process, the algorithm resorts to a "table outage", which is then rolled back.

Read more in the [cut-over](cut-over.md) documentation.

#### Decoupling

The most impactful change the triggerless, asynchronous approach provides is the decoupling of workload. With triggers, either synchronous or asynchronous, every write on your table implies an immediate write on another table.

We will break down the meaning of workload decoupling shortly. But it is important to understand that `gh-ost` interprets the situation in its own time and acts in its own time, yet still makes this an online operation.

The decoupling is important not only as far as the tool's logic goes, but, very importantly, as the master server sees it. As far as the master knows, writes to the original table and writes to the ghost table are unrelated.

#### Writer load

Not using triggers means the master no longer needs to overload multiple, concurrent writes with stored routine interpretation combined with lock contention on the ghost table.

The responsibility for applying data to the ghost table is completely `gh-ost`'s. As such, `gh-ost` decides which data gets to be written to the ghost table and when. We are decoupled from the original table's write load, and choose to write to the ghost table in a single thread.

MySQL does not perform well on multiple concurrent massive writes to a specific table. Locking becomes an issue. This is why we choose to alternate between the massive row-copy and the ongoing binlog events backlog such that the server only sees writes from a single connection.

It is also interesting to observe that `gh-ost` is the only application writing to the ghost table. No one else is even aware of its existence. Thus, the trigger-originated problem of high-concurrency, high-contention writes simply does not exist in `gh-ost`.

#### Pausability

When `gh-ost` pauses (throttles), it issues no writes on the ghost table. Because there are no triggers, the application's write workload is decoupled from the `gh-ost` write workload. And because we're using an asynchronous approach, the algorithm already handles a time difference between a master write time and the ghost apply time. A difference of a few microseconds is no different from a difference of minutes or hours.

When `gh-ost` [throttles](throttle.md), either by replication lag, the `max-load` setting or an explicit [interactive user command](interactive-commands.md), the master is back to normal. It sees no more writes on the ghost table.
An exception is the ongoing heartbeat writes onto the changelog table, which we consider to be negligible.

#### Testability

We are able to test the migration process: as we've decoupled the migration operation from the master's workload, we are free to apply the changes not to the master, but to one of its replicas. We are able to migrate a table on a replica.

This in itself is a nice feature; but it also presents us with testability: just as we complete the migration, we stop replication on the replica. We cut over but roll back again. We do not drop any table. The result is that both the original and ghost tables exist on the replica, which is not taking any further changes. We have time to examine the two tables and compare them to our satisfaction.

This is the method used by GitHub to continuously validate the tool's integrity: multiple production replicas are continuously and repeatedly doing a "trivial migration" (no actual change of column) on all our production tables. Each migration is followed by a checksum of the entire table data, on both original and ghost tables. We expect the checksums to be identical and we log the results. We expect zero failures.

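The comparison step can be illustrated with a small Go sketch. The text does not say how the checksum is computed, so the SHA-256 digest over in-memory rows below is purely an assumption for illustration, and `tableChecksum` is a hypothetical helper.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// tableChecksum hashes rows in the order given. Callers are expected to pass
// rows ordered by the same unique key for both tables.
func tableChecksum(rows [][]string) string {
	h := sha256.New()
	for _, row := range rows {
		for _, col := range row {
			// Length-prefix each value so ("ab","c") and ("a","bc") hash differently.
			fmt.Fprintf(h, "%d:%s,", len(col), col)
		}
		h.Write([]byte("\n"))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	original := [][]string{{"1", "alice"}, {"2", "bob"}}
	ghost := [][]string{{"1", "alice"}, {"2", "bob"}}
	if tableChecksum(original) == tableChecksum(ghost) {
		fmt.Println("checksums match")
	} else {
		fmt.Println("checksum mismatch")
	}
}
```
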
#### Multiple, concurrent migrations

`gh-ost` was designed to have multiple concurrent migrations running in parallel (no two on the same table, of course). The asynchronous approach supports that design by not caring when data is being shipped to the ghost table. The fact that no triggers exist means multiple migrations appear to the master (or other migrated host) just as multiple connections, each writing to some otherwise unknown table. Each can throttle in its own time, or we can throttle all together.

#### Going outside the server space

More to come as we make progress.

#### Code complexity

With the synchronous, trigger-based approach, the role of the migration tool is relatively small. A lot of the migration is based on the triggers doing their job within the transaction space. Issues such as rollback, datatypes and cut-over are implicitly taken care of by the database. With `gh-ost`'s asynchronous approach, the tool becomes complex. It connects to the master and to a replica; it poses as a replicating server; it writes heartbeat events; it reads binlog data into the app to be written again onto the migrated host; it needs to manage connection failures, replication lag, and more.

The tool therefore has a larger codebase and more complicated asynchronous, concurrent logic. But we jumped on the opportunity to add some [perks](perks.md) and completely redesign how an online migration tool should work.

go/base/context.go

+29 -7
@@ -34,10 +34,6 @@ const (
 	CutOverTwoStep = iota
 )

-const (
-	maxRetries = 60
-)
-
 // MigrationContext has the general, global state of migration. It is used by
 // all components throughout the migration process.
 type MigrationContext struct {
@@ -58,6 +54,7 @@ type MigrationContext struct {
 	CliUser string
 	CliPassword string

+	defaultNumRetries int64
 	ChunkSize int64
 	MaxLagMillisecondsThrottleThreshold int64
 	ReplictionLagQuery string
@@ -92,6 +89,7 @@ type MigrationContext struct {
 	ApplierConnectionConfig *mysql.ConnectionConfig
 	StartTime time.Time
 	RowCopyStartTime time.Time
+	RowCopyEndTime time.Time
 	LockTablesStartTime time.Time
 	RenameTablesStartTime time.Time
 	RenameTablesEndTime time.Time
@@ -143,6 +141,7 @@ func init() {

 func newMigrationContext() *MigrationContext {
 	return &MigrationContext{
+		defaultNumRetries: 60,
 		ChunkSize: 1000,
 		InspectorConnectionConfig: mysql.NewConnectionConfig(),
 		ApplierConnectionConfig: mysql.NewConnectionConfig(),
@@ -202,8 +201,18 @@ func (this *MigrationContext) HasMigrationRange() bool {
 	return this.MigrationRangeMinValues != nil && this.MigrationRangeMaxValues != nil
 }

-func (this *MigrationContext) MaxRetries() int {
-	return maxRetries
+func (this *MigrationContext) SetDefaultNumRetries(retries int64) {
+	this.throttleMutex.Lock()
+	defer this.throttleMutex.Unlock()
+	if retries > 0 {
+		this.defaultNumRetries = retries
+	}
+}
+func (this *MigrationContext) MaxRetries() int64 {
+	this.throttleMutex.Lock()
+	defer this.throttleMutex.Unlock()
+	retries := this.defaultNumRetries
+	return retries
 }

 func (this *MigrationContext) IsTransactionalTable() bool {
@@ -227,7 +236,20 @@ func (this *MigrationContext) ElapsedTime() time.Duration {

 // ElapsedRowCopyTime returns time since starting to copy chunks of rows
 func (this *MigrationContext) ElapsedRowCopyTime() time.Duration {
-	return time.Now().Sub(this.RowCopyStartTime)
+	this.throttleMutex.Lock()
+	defer this.throttleMutex.Unlock()
+
+	if this.RowCopyEndTime.IsZero() {
+		return time.Now().Sub(this.RowCopyStartTime)
+	}
+	return this.RowCopyEndTime.Sub(this.RowCopyStartTime)
+}
+
+// MarkRowCopyEndTime records the time at which row copy has completed
+func (this *MigrationContext) MarkRowCopyEndTime() {
+	this.throttleMutex.Lock()
+	defer this.throttleMutex.Unlock()
+	this.RowCopyEndTime = time.Now()
 }

 // GetTotalRowsCopied returns the accurate number of rows being copied (affected)

go/cmd/gh-ost/main.go

+2
@@ -69,6 +69,7 @@ func main() {

 	flag.BoolVar(&migrationContext.SwitchToRowBinlogFormat, "switch-to-rbr", false, "let this tool automatically switch binary log format to 'ROW' on the replica, if needed. The format will NOT be switched back. I'm too scared to do that, and wish to protect you if you happen to execute another migration while this one is running")
 	chunkSize := flag.Int64("chunk-size", 1000, "amount of rows to handle in each iteration (allowed range: 100-100,000)")
+	defaultRetries := flag.Int64("default-retries", 60, "Default number of retries for various operations before panicking")

 	flag.Int64Var(&migrationContext.MaxLagMillisecondsThrottleThreshold, "max-lag-millis", 1500, "replication lag at which to throttle operation")
 	flag.StringVar(&migrationContext.ReplictionLagQuery, "replication-lag-query", "", "Query that detects replication lag in seconds. Result can be a floating point (by default gh-ost issues SHOW SLAVE STATUS and reads Seconds_behind_master). If you're using pt-heartbeat, query would be something like: SELECT ROUND(UNIX_TIMESTAMP() - MAX(UNIX_TIMESTAMP(ts))) AS delay FROM my_schema.heartbeat")
@@ -165,6 +166,7 @@ func main() {
 		migrationContext.ServeSocketFile = fmt.Sprintf("/tmp/gh-ost.%s.%s.sock", migrationContext.DatabaseName, migrationContext.OriginalTableName)
 	}
 	migrationContext.SetChunkSize(*chunkSize)
+	migrationContext.SetDefaultNumRetries(*defaultRetries)
 	migrationContext.ApplyCredentials()

 	log.Infof("starting gh-ost %+v", AppVersion)
