[WIP] Does this Cython translation look correct? #1
base: master
Conversation
Thanks for the contribution! Before we can merge this, we need @honnibal to sign the Salesforce Contributor License Agreement.
```python
prev_step = (ts-1) * nO * nB
this_step = (ts-0) * nO * nB
for b in range(nB):
    prev_step += b * nO
```
Won't this "overshoot" as we're updating prev_step
/this_step
in both the range(nO)
loop and the range(nB)
loop?
If we had fancy indexing we'd want `this_step = (ts, b, h)` and `prev_step = (ts-1, b, h)`, right? So we need `this_step = ts * timestamp_stride + b * batch_stride + h * h_stride`.

My intention was to set it to `ts * timestamp_stride` in the outer loop, and add in the batch stride in the middle loop. The inner calculation is so simple that it seemed we would be doing more work on index calculation than actual computation.
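For concreteness, here's a small NumPy sketch of that stride arithmetic (not the PR code; the buffer name `X`, the sizes, and the 3-d reshape are made up for illustration). It checks that the flat offset `ts * timestamp_stride + b * batch_stride + h` addresses the same element as fancy indexing would, and it computes the per-row offset with `=` rather than `+=` so nothing accumulates across the `range(nB)` loop:

```python
import numpy as np

# Hypothetical sizes, purely for illustration (not taken from the PR).
nT, nB, nO = 4, 3, 5                              # timesteps, batch size, output width
X = np.arange(nT * nB * nO, dtype=np.float64)     # flat, C-contiguous buffer
X3d = X.reshape(nT, nB, nO)                       # the "fancy indexing" view

timestamp_stride = nB * nO
batch_stride = nO

for ts in range(1, nT):
    this_base = ts * timestamp_stride             # set once per timestep
    prev_base = (ts - 1) * timestamp_stride
    for b in range(nB):
        # `=` rather than `+=`: recompute the offset for each batch item
        # instead of accumulating it across the range(nB) loop.
        this_row = this_base + b * batch_stride
        prev_row = prev_base + b * batch_stride
        for h in range(nO):
            assert X[this_row + h] == X3d[ts, b, h]
            assert X[prev_row + h] == X3d[ts - 1, b, h]
```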
Excited to see what this is like Cythonized :) I'm going to focus on the forward (

The easiest way to test this at scale is to potentially use it in place of the
Hmm, well, I think the indexing is correct. It'll die rudely if it's not, so that'll be no problem. I mostly wanted to make sure I had the loop structure and the inner equations correct, because I don't speak CUDA very well.

Is the CNN fast in PyTorch? I've been working on some faster code for my CNN in spaCy; the code for that is here: https://github.com/explosion/thinc/blob/feature/fast-mwe/thinc/neural/_fast_maxout_cnn.pyx

I'm getting about 2x tagger speed with this instead of my numpy-based implementation, so I'm not sure it will be worth the effort. The problem is that the speed strategy relies on turning off the threading for BLAS, and instead parallelising the loop over minibatch examples. If PyTorch makes it easy to make a C-level call to
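For a rough picture of that strategy (single-threaded BLAS, parallelism across the examples in a minibatch), here is a minimal Python sketch. It is not the linked Cython code: the thread pool, the layer sizes, and the stand-in per-example matmul are all assumptions for illustration, and the real speed comes from doing this at the C level. NumPy does release the GIL inside BLAS calls, though, so even this thread-pool version overlaps the per-example GEMMs:

```python
import os

# Limit BLAS to one thread *before* importing numpy, so each per-example
# GEMM stays single-threaded (assumes a BLAS that honours these variables).
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

from concurrent.futures import ThreadPoolExecutor

import numpy as np


def forward_one(x, W):
    # Per-example work: a plain matrix multiply standing in for the real
    # CNN/maxout arithmetic. NumPy releases the GIL inside BLAS, so these
    # calls can genuinely run in parallel across threads.
    return x @ W


def forward_batch(batch, W, n_threads=4):
    # Parallelise over minibatch examples rather than inside each GEMM.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda x: forward_one(x, W), batch))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((128, 128))
    batch = [rng.standard_normal((50, 128)) for _ in range(32)]
    outputs = forward_batch(batch, W)
    print(len(outputs), outputs[0].shape)
```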
Hey,
I'd like to try this out in spaCy, so I'm porting to Thinc. First step is to understand it by implementing ^_^.
I translated the naming into my own conventions, which is admittedly obnoxious. I can translate it back once it's correct. For now, does this look like it's doing the right thing?