Compressed pointer addressing in single and 2-tier mode #188



Conversation

@guptask (Contributor) commented Jan 17, 2023

In this change, the compressed pointer stores the tier index, the slab index, and the alloc index of the allocation. With a slab spanning 22 bits of address space (4 MiB) and a minimum allocation size of 64 bytes, storing the alloc index requires 22 - 6 = 16 bits. The tier id occupies only the 32nd bit, since its value cannot exceed 2, which leaves the remaining 15 bits for the slab index. Hence we can index 128 GiB of memory per tier in a multi-tier configuration, or 256 GiB in a single-tier configuration (where all 16 remaining bits go to the slab index).
Backward compatibility has been ensured in this change: until the multi-tier code is up for review, a modular change such as this compressed-pointer change keeps the existing single-tier behavior intact. This code will undergo some changes once the multi-tier code is ready for review.
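A minimal sketch of this layout (constant names follow the kNumSlabBits / kTierIdxMask identifiers quoted later in the thread; the helper functions and exact values here are illustrative assumptions, not the PR's verbatim code):

#include <cstdint>

// Single-tier: [ slab index : 16 ][ alloc index : 16 ]
// Multi-tier:  [ tid : 1 ][ slab index : 15 ][ alloc index : 16 ]
constexpr uint32_t kNumSlabBits = 22;                            // 4 MiB slabs
constexpr uint32_t kNumAllocIdxBits = kNumSlabBits - 6;          // 16 bits for 64 B min allocs
constexpr uint32_t kAllocIdxMask = (1u << kNumAllocIdxBits) - 1; // bits 0..15
constexpr uint32_t kNumTierIdxOffset = 31;                       // tid lives in the 32nd bit
constexpr uint32_t kTierIdxMask = 1u << kNumTierIdxOffset;

inline uint32_t compress(uint32_t slabIdx, uint32_t allocIdx) {
  // Same packing in both modes; only the usable slab-index range differs
  // (15 bits in multi-tier, 16 bits in single-tier).
  return (slabIdx << kNumAllocIdxBits) | (allocIdx & kAllocIdxMask);
}

inline uint32_t getSlabIdx(uint32_t ptr, bool isMultiTiered) {
  // In multi-tier mode, strip the tier bit before extracting the slab index.
  const uint32_t p = isMultiTiered ? (ptr & ~kTierIdxMask) : ptr;
  return p >> kNumAllocIdxBits;
}

inline uint32_t getAllocIdx(uint32_t ptr) {
  return ptr & kAllocIdxMask; // the tier bit never overlaps the low 16 bits
}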

@facebook-github-bot added the "CLA Signed" label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jan 17, 2023
@therealgymmy (Contributor) left a comment

looks good. I'm not expecting a perf change since the boolean flag is either always true or false at runtime, but will run a sanity test to be sure.

@@ -36,7 +36,8 @@ CCacheAllocator::CCacheAllocator(MemoryAllocator& allocator,
       currentChunksIndex_(0) {
   auto& currentChunks = chunks_[currentChunksIndex_];
   for (auto chunk : *object.chunks()) {
-    currentChunks.push_back(allocator_.unCompress(CompressedPtr(chunk)));
+    // TODO : pass multi-tier flag when compact cache supports multi-tier config
+    currentChunks.push_back(allocator_.unCompress(CompressedPtr(chunk), false));
Contributor

just a nit: false /* isMultiTier */

Can you also annotate the other places where we pass in the boolean for compress and unCompress calls?

@guptask (Contributor Author)

done

Comment on lines 35 to 42
// We compress pointers by storing the tier index, slab index and alloc index of
// the allocation inside the slab. With slab worth kNumSlabBits (22 bits) of
// data, if we have the min allocation size as 64 bytes, that requires
// kNumSlabBits - 6 = 16 bits for storing the alloc index. The tier id occupies
// the 32nd bit only since its value cannot exceed kMaxTiers (2). This leaves
// the remaining (32 - (kNumSlabBits - 6) - 1 bit for tier id) = 15 bits for
// the slab index. Hence we can index 128 GiB of memory per tier in multi-tier
// configuration or index 256 GiB in single-tier configuration.
Contributor

nit: can you add more details to the pointer math for both single-tier and multi-tier config?

e.g.

// Single-tier Pointer Compression
// With slab worth kNumSlabBits of data, if we
// have the min allocation size as 64 bytes, that requires kNumSlabBits - 6
// bits for storing the alloc index. This leaves the remaining (32 -
// (kNumSlabBits - 6)) bits for the slab index. Hence we can index 256 GiB
// of memory in slabs and index anything more than 64 byte allocations inside
// the slab using a 32 bit representation.

// Multi-tier Pointer Compression
// With slab worth kNumSlabBits (22 bits) of
// data, if we have the min allocation size as 64 bytes, that requires
// kNumSlabBits - 6 = 16 bits for storing the alloc index. The tier id occupies
// the 32nd bit only since its value cannot exceed kMaxTiers (2). This leaves
// the remaining (32 - (kNumSlabBits - 6) - 1 bit for tier id) = 15 bits for
// the slab index. Hence we can index 128 GiB of memory per tier.
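As a cross-check of the arithmetic in these comments (a standalone sketch; only the numbers come from the text above):

#include <cstdint>

constexpr uint64_t kSlabSize = 1ull << 22; // kNumSlabBits = 22 -> 4 MiB slabs

// Single-tier: 32 - 16 = 16 slab-index bits -> 2^16 slabs * 4 MiB = 256 GiB.
static_assert((1ull << 16) * kSlabSize == 256ull << 30, "single-tier capacity");

// Multi-tier: one bit goes to the tier id, leaving 15 slab-index bits
// -> 2^15 slabs * 4 MiB = 128 GiB per tier.
static_assert((1ull << 15) * kSlabSize == 128ull << 30, "multi-tier capacity");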

@guptask (Contributor Author)

done

Comment on lines 116 to 127
// Number of bits for the slab index. This will be the top 16 bits of the
// compressed ptr.
Contributor

Update the comments here. With multi-tier, the slab index is the top 15 bits.

@guptask (Contributor Author)

done

-  const SlabIdx slabIndex = ptr.getSlabIdx();
-  const uint32_t allocIdx = ptr.getAllocIdx();
+  const SlabIdx slabIndex = ptr.getSlabIdx(isMultiTiered);
+  const uint32_t allocIdx = ptr.getAllocIdx(isMultiTiered);
   const Slab* slab = &slabMemoryStart_[slabIndex];
Contributor

Can you add a comment/TODO here just to clarify that multi-tier has no effect here in accessing slab memory, since we haven't incorporated the actual multi-tier logic yet?

@guptask (Contributor Author)

done
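For illustration, the annotated lookup might read as follows (the TODO wording is assumed, not the PR's verbatim text):

// TODO: isMultiTiered does not yet affect the slab lookup below; per-tier
// slab memory selection arrives with the follow-up multi-tier PRs.
const SlabIdx slabIndex = ptr.getSlabIdx(isMultiTiered);
const uint32_t allocIdx = ptr.getAllocIdx(isMultiTiered);
const Slab* slab = &slabMemoryStart_[slabIndex];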

@facebook-github-bot: @therealgymmy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jaesoo-fb (Contributor) left a comment

It is hard to figure out if this change makes sense without the succeeding changes. Why don't you include other commits in the stack in the pull request?

-      : ptr_(compress(slabIdx, allocIdx)) {}
+  CompressedPtr(uint32_t slabIdx,
+                uint32_t allocIdx,
+                bool isMultiTiered,
Contributor

isMultiTiered looks redundant. We are storing the tid at the MSB of the compressed ptr. In that case, we can just interpret isMultiTiered = false if tid == 0. The additional logic in CompressedPtr can be simplified a lot this way.

@guptask (Contributor Author) commented Feb 16, 2023

The bit-packing format changes for the multi-tier compressed pointer: the 32nd bit is reserved for the tid only in the multi-tier format. When the config is single-tiered, the compressed pointer keeps the original bit-packing design, in which the 32nd bit is not reserved for the tid. So tid == 0 is ambiguous: cachelib may or may not be multi-tiered.
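A small standalone sketch of the ambiguity being described (illustrative values only):

#include <cstdint>

constexpr uint32_t kNumAllocIdxBits = 16;
// A valid single-tier pointer to slab 0x8000 has bit 31 set, so bit 31 alone
// cannot signal "multi-tier"; likewise tid == 0 cannot distinguish
// "single-tier" from "multi-tier, tier 0".
constexpr uint32_t singleTierPtr = 0x8000u << kNumAllocIdxBits;
static_assert((singleTierPtr >> 31) == 1,
              "bit 31 is a legitimate slab-index bit in single-tier mode");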

Comment on lines 147 to 162
auto noTierIdPtr = isMultiTiered ? ptr_ & ~kTierIdxMask : ptr_;
return static_cast<uint32_t>(noTierIdPtr & kAllocIdxMask);
@jaesoo-fb (Contributor) commented Feb 3, 2023

This change is not needed. The MSB 16 bits of kAllocIdxMask should be 0. The parameter isMultiTiered would then be unused, but it looks OK to keep if you want it for consistency.

@guptask (Contributor Author)

You're right. I pushed up the corrected version now.
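A sketch of the simplified accessor this converged on, assuming kAllocIdxMask covers only the low 16 bits (a fragment in the style of the class under review, not the PR's verbatim code; the actual change keeps the isMultiTiered parameter for consistency):

uint32_t getAllocIdx() const noexcept {
  // kAllocIdxMask's upper 16 bits are zero, so the tier bit (bit 31) is
  // already discarded by the mask; no conditional stripping of
  // kTierIdxMask is needed here.
  return static_cast<uint32_t>(ptr_ & kAllocIdxMask);
}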

Comment on lines +156 to +172
void setTierId(TierId tid) noexcept {
  // Presumably called on a freshly compressed pointer whose tier bit is
  // still zero, so += behaves like |= here.
  ptr_ += static_cast<uint32_t>(tid) << kNumTierIdxOffset;
@jaesoo-fb (Contributor) commented Feb 3, 2023

What is the use of this? Is this needed for succeeding changes?

@guptask (Contributor Author)

Yes, it will be used in subsequent upstream PRs. For the complete picture, please refer to intel#56.

Comment on lines -249 to +250
-    return CompressedPtr{slabIndex, allocIdx};
+    return CompressedPtr{slabIndex, allocIdx, isMultiTiered};
Contributor

So, tier id is not set here. How are you going to provide the tier id?

I think you are going to have a separate MemoryAllocator per tier, so separate SlabAllocator. If so, I think the tier ID could be a member variable of the SlabAllocator?

What about the PtrCompressor then? Would CacheAllocator create separate PtrCompressor per MemoryAllocator?

@guptask (Contributor Author)

Please refer to this commit intel@2704ac8#diff-a6542b6dbf2cfb5e03e82205ee960757ab2b50de7bc25085089a3cffba40ae87, which shows how the tier id is passed to the allocator.compress() methods.

This change is due to be sent for upstream review soon.
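For readers without access to that commit, a purely hypothetical sketch of how a per-tier SlabAllocator might thread its tier id through compress() (names and signatures are assumptions, not the linked commit's code):

// Hypothetical: one SlabAllocator per tier holds its own tier id.
CompressedPtr compress(const void* ptr, bool isMultiTiered) const {
  auto cptr = CompressedPtr{getSlabIdx(ptr), getAllocIdx(ptr), isMultiTiered};
  if (isMultiTiered) {
    cptr.setTierId(tierId_); // tierId_: assumed member variable, as suggested above
  }
  return cptr;
}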

@facebook-github-bot: @guptask has updated the pull request. You must reimport the pull request before landing. (This notice was repeated after each of several updates.)

@guptask (Contributor Author) commented Feb 16, 2023

> It is hard to figure out if this change makes sense without the succeeding changes. Why don't you include other commits in the stack in the pull request?

The decision was made to upstream the multi-tier changes in phases; this upstream plan has been discussed with Meta as well. Please refer to intel#56 for the phased plan.

@facebook-github-bot: @guptask has updated the pull request. You must reimport the pull request before landing.

@guptask requested review from therealgymmy and jaesoo-fb and removed the previous requests for therealgymmy and jaesoo-fb, February 16, 2023 19:27
@facebook-github-bot: @therealgymmy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@therealgymmy (Contributor)
@guptask: Hey, I didn't expect this, but I have observed about a 1.5% CPU regression. (I did A/B and then B/A in our internal test setups to verify the difference exists.)

Have you tried any cachebench runs on your end to see if throughput is affected with this change?

@guptask (Contributor Author) commented Mar 14, 2023

> @guptask: Hey, I didn't expect this, but I have observed about a 1.5% CPU regression. (I did A/B and then B/A in our internal test setups to verify the difference exists.)
>
> Have you tried any cachebench runs on your end to see if throughput is affected with this change?

@therealgymmy I ran several sets of experiments with the compressed_ptr changes (rebased to latest main) and the main branch:

  1. graph cache follower fbobj with opRatePerSec set to 1m.
  2. CDN workload with opRatePerSec set to 500k.
  3. benchmark-test-PtrCompressionBench with -bm_min_iters=10000

In all experiments, I'm not seeing any regression in stats and CPU usage.

Can you please provide me with more details on how to reproduce this regression?

…ng in single tier mode and use 31 bits for addressing in multi-tier mode
@facebook-github-bot: @guptask has updated the pull request. You must reimport the pull request before landing.

@guptask requested review from therealgymmy and removed the request for jaesoo-fb, March 14, 2023 23:29
@guptask (Contributor Author) commented Mar 31, 2023

> @guptask: Hey, I didn't expect this, but I have observed about a 1.5% CPU regression. (I did A/B and then B/A in our internal test setups to verify the difference exists.)
> Have you tried any cachebench runs on your end to see if throughput is affected with this change?

> @therealgymmy I ran several sets of experiments with the compressed_ptr changes (rebased to latest main) and the main branch:
>
>   1. graph cache follower fbobj with opRatePerSec set to 1m.
>   2. CDN workload with opRatePerSec set to 500k.
>   3. benchmark-test-PtrCompressionBench with -bm_min_iters=10000
>
> In all experiments, I'm not seeing any regression in stats and CPU usage.
>
> Can you please provide me with more details on how to reproduce this regression?

Hi @therealgymmy, is there any further update on this regression issue from your end? I didn't observe any performance regression after rebasing onto the upstream branch.

@therealgymmy (Contributor)

@guptask: Hey, I haven't had a chance to re-run the A/B test. I will kick off a run today.

@facebook-github-bot: @therealgymmy has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@therealgymmy (Contributor)
Looks like I made the mistake of not building the A/B packages close together. Last time, my "control" A-side was the production package, whereas the "test" B-side had many more diffs, including this one. This time, A and B differ only in this diff, and the CPU difference is consistent during A/B and after I flipped it to B/A.

[Screenshot: A/B CPU-utilization comparison, captured 2023-04-05 at 11:24 AM]

@facebook-github-bot: @therealgymmy merged this pull request in 7f66996.
