[WIP] Does this Cython translation look correct? #1
base: master
Conversation
Thanks for the contribution! Before we can merge this, we need @honnibal to sign the Salesforce Contributor License Agreement.
```python
prev_step = (ts-1) * nO * nB
this_step = (ts-0) * nO * nB
for b in range(nB):
    prev_step += b * nO
```
Won't this "overshoot" as we're updating prev_step
/this_step
in both the range(nO)
loop and the range(nB)
loop?
If we had fancy indexing we'd want `this_step = (ts, b, h)` and `prev_step = (ts-1, b, h)`, right? So we need `this_step = ts * timestamp_stride + b * batch_stride + h * h_stride`.

My intention was to set it to `ts * timestamp_stride` in the outer loop, and add in the batch stride in the middle loop. The inner calculation is so simple that it seemed we would be doing more work on index calculation than actual computation.
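For concreteness, here's a small NumPy sketch of that stride arithmetic (not the PR code; the buffer name `X`, the sizes, and the 3-d reshape are made up for illustration). It checks that the flat offset `ts * timestamp_stride + b * batch_stride + h` addresses the same element as fancy indexing would, and it computes the per-row offset with `=` rather than `+=` so nothing accumulates across the `range(nB)` loop:

```python
import numpy as np

# Hypothetical sizes, purely for illustration (not taken from the PR).
nT, nB, nO = 4, 3, 5                              # timesteps, batch size, output width
X = np.arange(nT * nB * nO, dtype=np.float64)     # flat, C-contiguous buffer
X3d = X.reshape(nT, nB, nO)                       # the "fancy indexing" view

timestamp_stride = nB * nO
batch_stride = nO

for ts in range(1, nT):
    this_base = ts * timestamp_stride             # set once per timestep
    prev_base = (ts - 1) * timestamp_stride
    for b in range(nB):
        # `=` rather than `+=`: recompute the offset for each batch item
        # instead of accumulating it across the range(nB) loop.
        this_row = this_base + b * batch_stride
        prev_row = prev_base + b * batch_stride
        for h in range(nO):
            assert X[this_row + h] == X3d[ts, b, h]
            assert X[prev_row + h] == X3d[ts - 1, b, h]
```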
Excited to see what this is like Cythonized :) I'm going to focus on the forward (

The easiest way to test this at scale is to potentially use it in place of the
Hmm, well, I think the indexing is correct. It'll die rudely if it's not, so that'll be no problem. I mostly wanted to make sure I had the loop structure and the inner equations correct, because I don't speak CUDA very well.

Is the CNN fast in PyTorch? I've been working on some faster code for my CNN in spaCy; the code for that is here: https://github.com/explosion/thinc/blob/feature/fast-mwe/thinc/neural/_fast_maxout_cnn.pyx

I'm getting about 2x tagger speed with this instead of my numpy-based implementation, so I'm not sure it will be worth the effort. The problem is that the speed strategy relies on turning off the threading for BLAS, and instead parallelising the loop over minibatch examples. If PyTorch makes it easy to make a C-level call to
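For a rough picture of that strategy (single-threaded BLAS, parallelism across the examples in a minibatch), here is a minimal Python sketch. It is not the linked Cython code: the thread pool, the layer sizes, and the stand-in per-example matmul are all assumptions for illustration, and the real speed comes from doing this at the C level. NumPy does release the GIL inside BLAS calls, though, so even this thread-pool version overlaps the per-example GEMMs:

```python
import os

# Limit BLAS to one thread *before* importing numpy, so each per-example
# GEMM stays single-threaded (assumes a BLAS that honours these variables).
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

from concurrent.futures import ThreadPoolExecutor

import numpy as np


def forward_one(x, W):
    # Per-example work: a plain matrix multiply standing in for the real
    # CNN/maxout arithmetic. NumPy releases the GIL inside BLAS, so these
    # calls can genuinely run in parallel across threads.
    return x @ W


def forward_batch(batch, W, n_threads=4):
    # Parallelise over minibatch examples rather than inside each GEMM.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda x: forward_one(x, W), batch))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((128, 128))
    batch = [rng.standard_normal((50, 128)) for _ in range(32)]
    outputs = forward_batch(batch, W)
    print(len(outputs), outputs[0].shape)
```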
Hey,
I'd like to try this out in spaCy, so I'm porting to Thinc. First step is to understand it by implementing ^_^.
I translated the naming into my own conventions, which is admittedly obnoxious. I can translate it back once it's correct. For now, does this look like it's doing the right thing?