
[WIP] Does this Cython translation look correct? #1

Open
wants to merge 3 commits into master

Conversation

@honnibal commented Oct 9, 2017

Hey,

I'd like to try this out in spaCy, so I'm porting it to Thinc. The first step is to understand it by implementing it ^_^.

I translated the naming into my own conventions, which is admittedly obnoxious. I can translate it back once it's correct. For now, does this look like it's doing the right thing?

@salesforce-cla bot commented Oct 9, 2017

Thanks for the contribution! Before we can merge this, we need @honnibal to sign the Salesforce Contributor License Agreement.

prev_step = (ts-1) * nO * nB
this_step = (ts-0) * nO * nB
for b in range(nB):
    prev_step += b * nO
Contributor

Won't this "overshoot" as we're updating prev_step/this_step in both the range(nO) loop and the range(nB) loop?

@honnibal (Author) Oct 10, 2017

If we had fancy indexing we'd want this_step = (ts, b, h) and prev_step = (ts-1, b, h), right?

So we need this_step = ts * timestep_stride + b * batch_stride + h * h_stride.

My intention was to set the base offset to ts * timestep_stride in the outer loop, then add the batch stride in the middle loop. The inner calculation is so simple that it seemed we'd be doing more work on index calculation than on the actual computation.
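
For concreteness, here's a minimal plain-Python sketch of the loop structure being described (the names nT/nB/nO and the flat (timestep, batch, hidden) layout are assumptions for illustration, not the PR's actual code). The base offset is set once per timestep and the batch stride is added fresh in the middle loop, rather than accumulated with +=:

import numpy as np

def forget_mult_forward(X, F, nT, nB, nO):
    # h_t = f_t * x_t + (1 - f_t) * h_{t-1}, over flat arrays of length
    # nT * nB * nO with (timestep, batch, hidden) layout.
    H = np.zeros((nT + 1) * nB * nO, dtype=X.dtype)  # row 0 holds h_0 = 0
    for ts in range(1, nT + 1):
        prev_step = (ts - 1) * nB * nO  # base offsets, set once per timestep
        this_step = ts * nB * nO
        for b in range(nB):
            prev_row = prev_step + b * nO  # computed fresh each iteration, not +=
            this_row = this_step + b * nO
            for h in range(nO):
                f = F[prev_row + h]  # X and F have nT steps, so this is step ts-1
                H[this_row + h] = f * X[prev_row + h] + (1 - f) * H[prev_row + h]
    return H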

@Smerity (Contributor) commented Oct 10, 2017

Excited to see what this is like Cythonized :)

I'm going to focus on the forward pass (recurrent_forget_mult), as it's saner for working out the indexing, which is sadly where most of the hair pulling happens -_- lol

The easiest way to test this at scale is to drop it in place of CPUForgetMult and run the code to check that the two reach the same result.
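
A quick sketch of that check (the import path for CPUForgetMult and the calling convention of the Cython port, here called cython_forget_mult, are assumptions):

import torch
from torchqrnn.forget_mult import CPUForgetMult  # reference implementation

def check_port(cython_forget_mult, nT=7, nB=3, nO=5, atol=1e-6):
    # Run both implementations on the same random inputs and compare.
    f = torch.rand(nT, nB, nO)
    x = torch.rand(nT, nB, nO)
    expected = CPUForgetMult()(f, x)
    got = torch.as_tensor(cython_forget_mult(f.numpy(), x.numpy()))
    assert torch.allclose(expected, got.view_as(expected), atol=atol)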

@honnibal (Author) commented Oct 10, 2017

Hmm, well, I think the indexing is correct. It'll die rudely if it's not, so that'll be no problem. I mostly wanted to make sure I had the loop structure and the inner equations correct, because I don't speak CUDA very well.

Is the CNN fast in PyTorch? I've been working on some faster code for my CNN in spaCy; the code for that is here: https://github.com/explosion/thinc/blob/feature/fast-mwe/thinc/neural/_fast_maxout_cnn.pyx

I'm getting about 2x tagger speed with this instead of my numpy-based implementation, so I'm not sure it will be worth the effort. The problem is that the speed strategy relies on turning off the threading for BLAS, and instead parallelising the loop over minibatch examples. If PyTorch makes it easy to make a C-level call to sgemm from Cython, it might be worth trying this out. Otherwise I think the normal CNN implementation seems sound.
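
For what it's worth, SciPy exposes BLAS to Cython at the C level via scipy.linalg.cython_blas, which gives a GIL-free sgemm call that can sit inside a parallel loop over minibatch examples. A minimal sketch of that pattern (illustrative only, not thinc's actual code; needs to be compiled with OpenMP, and the operand swap is the usual trick for using column-major BLAS on row-major arrays):

from cython.parallel cimport prange
from scipy.linalg.cython_blas cimport sgemm

cdef void matmul(float* A, float* B, float* C, int m, int n, int k) nogil:
    # C = A @ B for row-major A (m x k), B (k x n), C (m x n).
    # BLAS is column-major, so compute C^T = B^T @ A^T by swapping operands.
    cdef char no_trans = b'N'
    cdef float alpha = 1.0, beta = 0.0
    sgemm(&no_trans, &no_trans, &n, &m, &k, &alpha, B, &n, A, &k, &beta, C, &n)

def batched_matmul(float[:, :, ::1] A, float[:, :, ::1] B, float[:, :, ::1] C):
    # One GEMM per minibatch example, parallelised across OpenMP threads
    # (so BLAS itself should be left single-threaded).
    cdef int b, nB = A.shape[0]
    cdef int m = A.shape[1], k = A.shape[2], n = B.shape[2]
    for b in prange(nB, nogil=True):
        matmul(&A[b, 0, 0], &B[b, 0, 0], &C[b, 0, 0], m, n, k)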
