Persistent Checkpointing (#2184) #3918

Closed
wants to merge 5 commits into from

theIDinside
Contributor

This is the first version of a PR that attempts to provide the functionality requested by issue #2184:

rr create-checkpoints -i <some_interval> [-s <some_start_event>, -e <some_end_event>], where the last two parameters are optional.
rr replay -g <evt> spawns the session from the most recent persistent checkpoint (PCP) before <evt>. If -f <pid> is used, it also spawns from a PCP: it finds the point at which <pid> is created and spawns from the most recent PCP before that.
rr replay uses PCPs during reverse execution.
rr rerun uses PCPs as well.
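
To illustrate the "most recent PCP before <evt>" lookup, here is a minimal sketch. It is not the code in this PR: `EventTime` and `latest_checkpoint_before` are hypothetical names, and it assumes the events of the checkpoints on disk are available as a sorted list.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <optional>
#include <vector>

// Hypothetical alias for illustration; rr has its own event/frame-time type.
using EventTime = std::int64_t;

// Returns the event of the latest checkpoint strictly before `target`,
// or std::nullopt if no such checkpoint exists.
// Assumes `sorted_checkpoints` is sorted in ascending order.
std::optional<EventTime> latest_checkpoint_before(
    const std::vector<EventTime>& sorted_checkpoints, EventTime target) {
  assert(std::is_sorted(sorted_checkpoints.begin(), sorted_checkpoints.end()));
  // lower_bound finds the first checkpoint whose event is >= target,
  // so the element just before it is the last one that is < target.
  auto it = std::lower_bound(sorted_checkpoints.begin(),
                             sorted_checkpoints.end(), target);
  if (it == sorted_checkpoints.begin()) {
    return std::nullopt;  // every checkpoint is at or after target
  }
  return *std::prev(it);
}
```

When the function returns no value, the caller would fall back to spawning the session normally, which is also what --ignore-pcp forces.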

Both the replay and rerun commands now take --ignore-pcp to ignore any PCPs, and I've made spawning from a PCP the default behavior of both commands.

Two commands have also been added to the spawned GDB: write-checkpoints and load-checkpoints.

The last point about persistent checkpoints being created at record time is not provided by this PR, but I'm willing to attempt to add that in a future PR, now that I have a little insight into how this would/could/maybe should work.

At this time, little to no optimization is performed. Each mapping in the process address space is serialized to disk, and it is currently not compressed in any way. Compressing the data that goes into anonymous mappings should be fairly simple to implement, since that data gets copied into memory anyway when a PCP is restored. File-backed mappings (executable data, for instance) cannot be compressed as easily. One wants to map as much as possible file-backed, because such mappings are not necessarily committed to physical memory immediately, whereas data copied into a mapping is.

Another optimization that could possibly be done concerns restoring PCPs. Instead of creating each checkpoint "from scratch", one could reconstitute the first one (at event N) and then, when reconstituting the following checkpoint, fork the first and apply the required changes to the fork. As it stands right now, a new session is created for each checkpoint, which theoretically consumes more memory. Forking checkpoint N+1 from N and changing the address space only where needed should, I think, use less memory.

Also, if anybody has any ideas on how one could possibly write tests for something like this, they would be most welcome to share those thoughts with me.

This is a new PR that continues from #3406, because that one kept claiming that it contained merge commits.

- Moved into util.cc
- Added forward_to to skip trace data to some arbitrary point in time

Getters required to expose data

We need to be able to expose this data so it can
be serialized.

Find original exe for ReplayTask

Digs out the original executable image that this task was forked
from or, in the case of exec, exec'd on.

This is required for persistent checkpointing, so that the names in the
proc fs correspond to the correct names at replay time (i.e. they behave
and look the same in proc fs as in a normal replay). What should show up
in /proc/tid/comm is not the thread name but the actual executable, so
we need to be able to find this "original exe" of the task.

Check if Event is checkpointable

Required for the create-checkpoints command, etc. to determine which
events in the trace are checkpointable when there is no live session.

In future commits/PRs, remove the static function in ReplaySession.cc
that does the same thing and use this member function on Event instead.

Additional proc fs query paths

Gets additional proc fs paths for a task, in this case
/mem. Required for persistent checkpointing to figure out
how to handle mappings and what to serialize (and what not to
serialize).

Lifted CloneCompletion out of Session

The function extract_name will also be required for setting up syscall
buffer stuff in coming commits.

Getters/setters required for PCP

Need to be able to set this data when restoring an address space.

Persistent checkpointing

Added the persistent checkpoint schema for capnproto, rr_pcp.capnp,
as well as a compile command for it in CMakeLists.txt that works like
the one for rr_trace.capnp.

The CheckpointInfo and MarkData types work as intermediaries between a
serialized checkpoint and a deserialized "live" one. MarkData holds a
copy of the contents of Mark, InternalMark, ProtoMark and their various
data, both for serialization and, when deserializing, for reconstructing
those types.

The reason for adding MarkData is to avoid intruding on the
Mark/InternalMark/ProtoMark interface and possibly breaking guarantees
or invariants they provide. If something goes wrong now, the damage is
limited to persistent checkpointing not reconstituting a session
properly.

GDB spawned by rr now has two additional commands: write-checkpoints, which
serializes any checkpoints set by the `checkpoint` command, and
load-checkpoints.

Added the rr create-checkpoints command, which creates persistent checkpoints
at a specified interval that it attempts to honor as closely as possible.
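
Since not every event is checkpointable, the interval can only be honored approximately. The sketch below shows one way that could work; it is an assumption for illustration, not the selection logic in this PR. `EventTime` and `pick_checkpoint_events` are hypothetical names, and the policy (take the first checkpointable event at or after each target, then measure the next interval from the event actually chosen) is a guess.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical alias for illustration; rr has its own event/frame-time type.
using EventTime = std::int64_t;

// Picks checkpoint events from `checkpointable` (sorted ascending) so that
// consecutive picks are at least `interval` events apart, restricted to the
// inclusive range [start, end].
std::vector<EventTime> pick_checkpoint_events(
    const std::vector<EventTime>& checkpointable, EventTime interval,
    EventTime start, EventTime end) {
  assert(interval > 0);
  std::vector<EventTime> picked;
  EventTime next_target = start;
  for (EventTime ev : checkpointable) {
    if (ev < start) {
      continue;  // before the requested range
    }
    if (ev > end) {
      break;  // past the requested range; the input is sorted
    }
    if (ev >= next_target) {
      picked.push_back(ev);
      // Assumed policy: measure the next interval from the event we
      // actually checkpointed at, not from the ideal target.
      next_target = ev + interval;
    }
  }
  return picked;
}
```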

RerunCommand and ReplayCommand are now aware of PCPs.

Replay sessions get spawned from persistent checkpoints if they
exist on disk when using `-g <evt>` or when using `-f <pid>` and that
"task" was created some time after a persistent checkpoint.

Added the --ignore-pcp flag to these commands, which ignores PCPs
and spawns sessions normally.

fixup for can_checkpoint_at

Restored the comments that existed in the static function in ReplaySession.cc
Changed all uses of it to Event::can_checkpoint_at
Removed the static can_checkpoint_at in ReplaySession.cc

Fix preferred include & unnecessary check for partial init

Since checkpoints are partially initialized, checking that they are is pointless.

Added cmake command looping over trace files per request by @khuey

remove init check of member variables.

Move extract_name from Session into util.h.

Removed stream_util, moved contents to util.h

make ignore-pcp not take up '-i'

Moved responsibility of de/ser into FdTable and FileMonitor

Deserializing and serializing an FdTable is now performed by the class itself instead of in a free function.

FileMonitor has a public member function that is used for serialization.
Each derived type that requires special/additional logic extends
the virtual member function serialize_type.
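
The general shape of this pattern is sketched below. The sketch is not rr's FileMonitor: `MonitorSketch`, `OffsetMonitorSketch`, `type_name` and `serialize_to_string` are made-up names, the signatures are assumptions, and it writes to a plain `std::ostream` instead of the capnproto schema. Only the idea of a public serialization entry point that delegates to a virtual `serialize_type` is taken from the commit message.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a monitor base class.
class MonitorSketch {
public:
  virtual ~MonitorSketch() = default;

  // Public entry point: writes the state common to all monitors, then
  // lets the concrete type append whatever extra state it needs.
  void serialize(std::ostream& out) const {
    out << "type=" << type_name() << ";";
    serialize_type(out);
  }

protected:
  virtual std::string type_name() const { return "base"; }
  // Default: no additional state to write.
  virtual void serialize_type(std::ostream&) const {}
};

// A derived monitor that carries extra state and therefore overrides
// serialize_type.
class OffsetMonitorSketch : public MonitorSketch {
public:
  explicit OffsetMonitorSketch(long offset) : offset_(offset) {}

protected:
  std::string type_name() const override { return "offset"; }
  void serialize_type(std::ostream& out) const override {
    out << "offset=" << offset_ << ";";
  }

private:
  long offset_;
};

// Small helper so the result is easy to inspect.
std::string serialize_to_string(const MonitorSketch& monitor) {
  assert(&monitor != nullptr);
  std::ostringstream out;
  monitor.serialize(out);
  return out.str();
}
```

With this shape, the code that walks the FdTable never needs to know the concrete monitor type; it only calls the public function.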

Remove skipMonitoringMappedFd

not necessary for serialization, as FdTable is separately restored.

Refactor task OS-name setting

Task::copy_state sets the OS name of a task in the same fashion that
persistent checkpointing sets name. Refactored this functionality into
Task::set_name.

Also removed the unnecessary `update_prname` from Task::copy_state.

update_prname is not a "write to tracee"-operation but a "read from tracee"-operation; and since
we already know what name we want to set Task::prname to, we skip this reading from the tracee
in Task::copy_state and just set it to the parameter passed in to Task::set_name

Add const qualifier

Fixes rr-debugger#3678

Refactor so that marks_with_checkpoints is changed in just one place instead of being accessed arbitrarily. The ref counts got the same change in a previous commit.

Fixes a bug for loaded persistent checkpoints where the re-created checkpoints did not get their reference counting correct.

This closes rr-debugger#3678

pread may or may not read the requested size. This was not
taken into account. If the data is large,
we probably can't read it in one go.
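
The usual remedy for short reads is to loop until the requested size has been read. The following is a minimal sketch of that technique using POSIX `pread`, not the code from this commit; `pread_all` and `read_at` are hypothetical helper names.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>
#include <vector>

// Keeps calling pread until `size` bytes have been read, EOF is hit, or an
// error occurs. Returns the number of bytes read, or -1 on error.
ssize_t pread_all(int fd, void* buf, size_t size, off_t offset) {
  assert(buf != nullptr || size == 0);
  char* out = static_cast<char*>(buf);
  size_t total = 0;
  while (total < size) {
    ssize_t n = pread(fd, out + total, size - total,
                      offset + static_cast<off_t>(total));
    if (n < 0) {
      if (errno == EINTR) {
        continue;  // interrupted by a signal; retry the same read
      }
      return -1;
    }
    if (n == 0) {
      break;  // EOF: fewer bytes were available than requested
    }
    total += static_cast<size_t>(n);
  }
  return static_cast<ssize_t>(total);
}

// Fills `out` completely from `path`, starting at `offset`.
// Returns false if the file cannot be opened or the read comes up short.
bool read_at(const char* path, off_t offset, std::vector<char>& out) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) {
    return false;
  }
  ssize_t n = pread_all(fd, out.data(), out.size(), offset);
  close(fd);
  return n >= 0 && static_cast<size_t>(n) == out.size();
}
```
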
Previous work assumed everything to be native. That doesn't work when
debugging a 32-bit application on a 64-bit machine. That means some
`SupportedArch` fields need to be serialized as well, so that if the
process was a 32-bit application it gets re-created correctly. This was
made visible immediately by the provided test (although the test does
not check it explicitly).