Documentation around Splitting files #30

Open
meticulo3366 opened this issue Feb 4, 2021 · 1 comment

Comments

@meticulo3366

Hi, are there any examples of splitting large files? I would like to implement it somehow.

  • Converts any XSD to a proper usable Avro schema (Avsc)
  • Converts any XML to avro using the provided schema. What can it do? See the list below.
    • Handle XML of any size (even gigabytes), since it streams the XML
    • Read xml from stdin and output to stdout
    • Validate the XML with XSD
    • Split the data at any specified element (any number of splits is allowed)
    • Handle multiple documents in single file (useful when streaming continuous data)
    • Write out failed documents without killing the whole process
    • Completely configurable
@GeethanadhP
Owner

GeethanadhP commented Feb 6, 2021

See https://github.com/GeethanadhP/xml-avro/blob/master/example/config.yml for a sample config, though it doesn't include the settings for splitting.

Below is the config section for splitting the data

split:                        # Split the avro records based on the specified list
    -
      by: "bookName"            # Split tag name
      avscFile: "name.avsc"     # Avsc File for the split part
      avroFile: "name.avro"     # Avro file name to save to
    -
      by: "bookPublisher"
      avscFile: "publisher.avsc"
      avroFile: "publisher.avro"

Assuming a file containing:

<bookName>
    <bunch_of_data/>
</bookName>
<bookPublisher>
    <bunch_of_data/>
</bookPublisher>

The first bunch goes into name.avro and the second bunch goes into publisher.avro.
You might have to struggle with the avsc part, though. Frankly, I don't remember much; it's been around 3 years since I used the tool.
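
To make the routing idea concrete, here is a minimal Python sketch of splitting a streamed XML input by tag name. It is not how xml-avro is implemented and it writes no Avro; it only mirrors the `by` to `avroFile` mapping from the config above. The `SPLITS` dictionary, the `route_by_tag` function, and the `<root>` wrapper in the sample are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping that mirrors the `split` config: tag name -> output file name.
SPLITS = {"bookName": "name.avro", "bookPublisher": "publisher.avro"}


def route_by_tag(xml_stream, splits):
    """Stream the XML and group each split element under its target file name."""
    routed = {target: [] for target in splits.values()}
    # iterparse emits an "end" event once an element has been read completely.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag in splits:
            routed[splits[elem.tag]].append(ET.tostring(elem, encoding="unicode"))
            elem.clear()  # drop the children once the element has been routed
    return routed


# The <root> wrapper is an assumption so that the sample is a single well-formed document.
sample = (
    b"<root>"
    b"<bookName><bunch_of_data/></bookName>"
    b"<bookPublisher><bunch_of_data/></bookPublisher>"
    b"</root>"
)
result = route_by_tag(io.BytesIO(sample), SPLITS)
for target, elements in result.items():
    print(target, len(elements))
```

In a real run each group would be converted with its own avsc schema and written to the matching Avro file; this sketch only shows which element ends up under which target.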

  1. It handles gigabytes of data easily because it streams the data tag by tag instead of loading the whole XML at once.
  2. "Multiple documents" means the following: in the example below, assume book is your root tag and the file holds 2 messages (2 books), so each book is stored as a record in the output Avro. In general you would get only one root tag per file. I added this option for use with Flume, which combines a bunch of messages and saves them in a single file.
    <book id="b001">
        <author>Brandon Sanderson</author>
        <title>Mistborn</title>
        <genre>Fantasy</genre>
        <price>50</price>
        <pub_date>2006-12-17T09:30:47.0Z</pub_date>
        <review>
            <title>Wonderful</title>
            <content>I love the plot twist and the new magic</content>
        </review>
        <review>
            <title>Unbelievable twist</title>
            <content>The best book i ever read</content>
        </review>
        <sold>10</sold>
    </book>
    <book id="b002">
        <author>Brandon Sanderson</author>
        <title>Way of Kings</title>
        <genre>Fantasy</genre>
        <price>50</price>
        <pub_date>2006-12-17T09:30:47.0Z</pub_date>
        <!--<alias>-->
            <!--<title>Way of the kings</title>-->
        <!--</alias>-->
        <!--<website>-->
            <!--<url></url>-->
        <!--</website>-->
        <sold>10</sold>
    </book>
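
Both points can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's actual code: the `iter_documents` function and the synthetic `<wrapper>` element are hypothetical, and the input is assumed to be text without an XML declaration. The parser is fed chunk by chunk, and every completed top-level `book` element is handed out as one record.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def iter_documents(stream, root_tag, chunk_size=64 * 1024):
    """Yield each top-level `root_tag` element from a text stream holding several documents."""
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    chunks = iter(lambda: stream.read(chunk_size), "")
    # A synthetic <wrapper> element makes the concatenated documents well-formed.
    for chunk in itertools.chain(["<wrapper>"], chunks, ["</wrapper>"]):
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                if depth == 1 and elem.tag == root_tag:
                    yield elem
                    elem.clear()  # release the record after the caller has used it


sample = io.StringIO(
    '<book id="b001"><title>Mistborn</title><sold>10</sold></book>\n'
    '<book id="b002"><title>Way of Kings</title><sold>10</sold></book>\n'
)
# Extract what is needed inside the loop, because each element is cleared afterwards.
records = [(book.get("id"), book.findtext("title")) for book in iter_documents(sample, "book")]
print(records)
```

Each element is cleared right after it is yielded, so the caller has to read the fields it needs before asking for the next record. The emptied elements stay attached to the wrapper, which is cheap but not free for very long streams.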
