Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dfxp: examine the contents of div tags for timing information (cbs) #2

Closed
wants to merge 7 commits into from

Conversation

as
Copy link
Collaborator

@as as commented May 18, 2020


Original Pull Request: kaltura#1136 (comment)

The current dfxp parser does not examine the contents of <div tags for timing information. This causes it to generate an empty manifest for some valid dfxp files. It is valid to have timing information in the <div, and I can provide more details and examples on request, but here are two files:

FILE A: http://flv.io/ABC142908_id.dfxp.xml
FILE B: http://flv.io/ABC142908_en-US.dfxp.xml
FILE B currently generates an empty manifest. This change fixes this.

I have made changes to two functions that examine timing information:

dfxp_get_duration, which traverses the entire XML to compute the ultimate duration of the subtitle's content.
dfxp_parse_frames, which traverses the XML in a similar fashion. It now looks at <p tags and reverts to the last-visited <div if there exists no timing information in the <p tag.
Additionally, there is set of helper functions to remove duplicate code, as reconciling my changes to work inline resulted in cruft. The resulting changes allow FILE B to be parsed, as well as combinations of FILE A and FILE B.

Additional Feedback: kaltura#1136 (comment)


We have a few things to do in this branch based on the external PR feedback, but I believe they are not business-critical. Let's review this internally and merge it into our copy of the repo, and create another task to handle the changes suggested by the reviewer.

Style

Good to have to get the PR merged and keep the code consistent, but not necessary for things to work.

Efficiency:

These are nits, the routines do most of their work looking for timecodes in irrelevant places, for example, parsing head. We will not get much value out of these optimizations, because benchmarks on the corpus of our subtitle files before and after the patch have similar runtimes.

Bug: last_div is not reset when walking up in the tree - meaning - it may fall back to using a div that is not an ancestor of the current p element.

I'm glad the reviewer pointed this out. I think the only time this could happen is if the user submits a file that isn't going to work anyway:

<div>
    <div timecode> <p text /> </div>
    <p text /> // uses timecode, even though its not a child, but there would have to be no <div, and no timecode in the p-tag
</div>

It could be nice to fix, but it's an edge-case, so I recommend a separate, lower-priority task for these modifications.

@as as requested review from zsiec and flavioribeiro May 18, 2020 19:48
@as as self-assigned this May 18, 2020
@as as changed the title Fix/dfxp timecode (cbs) dfxp: examine the contents of div tags for timing information (cbs) May 18, 2020
@as as requested a review from juce May 18, 2020 19:56
Copy link

@juce juce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, although admittedly i lack the domain knowledge to judge the logic itself. It seems right... ;-). Fwiw, I think the "style" changes are all pretty small and easy to implement, so maybe makes sense to have them done now rather than later. Can't really comment on the "efficiency" remarks, i trust Andre's notes in the response. The "bug" scenario does seem legit.. should we fix it perhaps by keeping track of div hierarchy in a stack and pushing/popping elements there as the parser encounters div start- and end-tags?

@as
Copy link
Collaborator Author

as commented May 19, 2020

@juce I was thinking we could mark the "level" at which last_div was encountered (with an integer) and then set last_div to null when the parser ascended above that level. What do you think about that approach? Right now we are only looking at the previous last_div anyway, so it might be simpler to do that than keep a stack of them.

@as
Copy link
Collaborator Author

as commented Jul 6, 2020

This pull request has been superseded because the changes were merged out-of-band into our fork's master branch. This branch's code is simpler, the other code that was merged conforms better to the coding style of the vod module, but should be equivalent. This branch is also missing a bug-fix.

Please reopen if you disagree.

@as as closed this Jul 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants