This repository has been archived by the owner on Dec 15, 2021. It is now read-only.

Failure to parse RSS feeds that use unencoded <body> #32

Open
daveaglick opened this issue May 7, 2018 · 3 comments


@daveaglick

Some feeds that use <body> instead of encoded <content> elements fail to parse. The parser appears to lose track of element open/close state when the XmlReader enters the <body> element, because it attempts to construct SyndicationContent for all of the nested XHTML. This results in exceptions like:

'Element' is an invalid XmlNodeType

For example, see http://feeds.feedburner.com/RockfordLhotka. I'm going to try to fix this by skipping <body> elements and using their inner XML as the value for the outer SyndicationContent. I'll submit a PR if that works.

daveaglick added a commit to daveaglick/SyndicationFeedReaderWriter that referenced this issue May 8, 2018
@jimmyca15
Member

@daveaglick Can you send a link to the spec for the <body> element? I'm having trouble finding it.

@daveaglick
Author

It looks like it was briefly considered as a replacement for content. I couldn't find an official mention in any specification, but there are some references to it from around 2004: https://web.archive.org/web/20040217110945/http://www.thearchitect.co.uk/weblog/archives/2003/03/000116.html

Probably more important is that some blog engines continue to produce it, official specification or not, and that it's valid RSS: https://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Ffeeds.feedburner.com%2FRockfordLhotka

The root of the problem in SyndicationFeedReaderWriter is that the body element often contains mixed content (because XHTML is mixed), but the streaming read mode of the parser requires matched node opens and closes. My PR solves the specific body problem by assuming that the element is likely to have mixed content and treating it separately: it loads the element into a single SyndicationContent value without attempting to read its children.
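
To illustrate the "load it as a single value" idea outside of this library, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It is not the code from the PR and not the SyndicationFeedReaderWriter API; the sample item and the inner_xml helper are hypothetical names made up for the illustration. It shows mixed content (text interleaved with child elements) being captured as one string instead of being walked node by node.

```python
import xml.etree.ElementTree as ET

# Hypothetical item whose <body> holds mixed content: text and a child element.
ITEM = (
    "<item>"
    "<title>Post</title>"
    "<body>Hello <b>mixed</b> world</body>"
    "</item>"
)


def inner_xml(element):
    """Return the element's content (leading text plus serialized children) as one string."""
    parts = [element.text or ""]
    for child in element:
        # tostring() serializes the child and, by default, the text that follows it (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


item = ET.fromstring(ITEM)
body = item.find("body")

# The whole subtree becomes a single content value; its children are never
# interpreted as feed elements.
content = inner_xml(body)
print(content)
```

In C#, XmlReader.ReadInnerXml() plays a similar role of returning everything inside the current element as a string, though whether the PR uses it is not stated in this thread.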

@jimmyca15
Member

@drago-draganov

Looks like xhtml:body is used in RSS feeds. Given an element in a feed with the XHTML namespace and the element name body, readers should expect unencoded XHTML.

An example:

<rss version="2.0" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <channel>
    <title>xhtml test feed</title>
    <link>http://example.org</link>
    <description>Test feed for xhtml body element</description>
    <item>
      <title>Test Item</title> 
      <link>http://example.org/post1</link> 
      <description>Here we go!</description> 
      <xhtml:body><xhtml:div><xhtml:p>content here demonstrating use of &lt;xhtml:body&gt; element.</xhtml:p></xhtml:div></xhtml:body>
    </item>
    <item>
      <title>Test Item single xmlns</title> 
      <link>http://example.org/post1</link> 
      <description>Here we go!</description> 
      <body xmlns="http://www.w3.org/1999/xhtml"><div><p>content here demonstrating use of &lt;body&gt; element.</p></div></body>
    </item>
  </channel>
</rss>

I think we can add this behavior in.
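
As a rough sketch of what "an element with the XHTML namespace and element name body" means for a reader, the Python snippet below (standard library only, not this library's API) matches on the namespace URI plus local name rather than on the prefix, so both spellings from the example are found. The trimmed feed and the xhtml_body_text helper are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Trimmed version of the example feed: one item uses a prefix,
# the other declares the XHTML namespace as the default on <body>.
FEED = """<rss version="2.0" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <channel>
    <item>
      <title>Prefixed</title>
      <xhtml:body><xhtml:div><xhtml:p>first</xhtml:p></xhtml:div></xhtml:body>
    </item>
    <item>
      <title>Default namespace</title>
      <body xmlns="http://www.w3.org/1999/xhtml"><div><p>second</p></div></body>
    </item>
  </channel>
</rss>"""


def xhtml_body_text(item):
    """Return the text inside the item's XHTML body element, or None if it has none."""
    # ElementTree addresses namespaced elements as "{namespace-uri}localname",
    # so the prefix chosen by the feed author does not matter.
    body = item.find(f"{{{XHTML_NS}}}body")
    if body is None:
        return None
    return "".join(body.itertext())


root = ET.fromstring(FEED)
results = [(item.findtext("title"), xhtml_body_text(item)) for item in root.iter("item")]
print(results)
```

Both items resolve to the same qualified name, which is why a check on namespace and local name covers the prefixed and the default-namespace forms alike.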
