Documentation around Splitting files #30

meticulo3366 · 2021-02-04T19:18:51Z

Hi, I would like to know if there are any examples of splitting large files. I would like to implement it somehow.

Converts any XSD to a proper usable Avro schema (Avsc)
Converts any XML to Avro using the provided schema. What can it do? See the list below.
- Handle XML of any size (even gigabytes), as it streams the XML
- Read XML from stdin and output to stdout
- Validate the XML with XSD
- Split the data at any specified element (can have any number of splits)
- Handle multiple documents in single file (useful when streaming continuous data)
- Write out failed documents without killing the whole process
- Completely configurable

GeethanadhP · 2021-02-06T00:12:36Z

See https://github.com/GeethanadhP/xml-avro/blob/master/example/config.yml for a sample config, though it doesn't include the config for splitting.

Below is the config section for splitting the data:

split:                        # Split the avro records based on the specified list
    -
      by: "bookName"            # Split tag name
      avscFile: "name.avsc"     # Avsc File for the split part
      avroFile: "name.avro"     # Avro file name to save to
    -
      by: "bookPublisher"
      avscFile: "publisher.avsc"
      avroFile: "publisher.avro"

Assuming a file containing:

<bookName>
    <bunch_of_data/>
</bookName>
<bookPublisher>
    <bunch_of_data/>
</bookPublisher>

The first bunch goes into name.avro and the second bunch goes into publisher.avro.
You might have to struggle with the avsc part, though. Frankly, I don't remember much; it's been around 3 years since I used the tool.
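
To make the idea of "splitting by tag" more concrete, here is a rough Python sketch. It is not how xml-avro is implemented and it does not write Avro; the function name `split_by_tag` and the `<root>` wrapper in the sample are assumptions made only for illustration. It streams the input and groups each element by its tag, which is roughly what one `by:` entry per output file expresses in the config above.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def split_by_tag(source, split_tags):
    """Stream `source` and group every element whose tag is in `split_tags`.

    Returns a dict mapping tag name -> list of serialized XML fragments.
    A real tool would write each group to its own output file instead.
    """
    groups = defaultdict(list)
    # iterparse hands over each element once its end tag has been read,
    # so the input is consumed incrementally rather than loaded up front.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag in split_tags:
            fragment = ET.tostring(elem, encoding="unicode").strip()
            groups[elem.tag].append(fragment)
            elem.clear()  # release the children we have already captured
    return dict(groups)


# Hypothetical input: the two tags from the example, under an assumed root.
sample = b"""<root>
  <bookName><bunch_of_data/></bookName>
  <bookPublisher><bunch_of_data/></bookPublisher>
</root>"""

parts = split_by_tag(io.BytesIO(sample), {"bookName", "bookPublisher"})
for tag, fragments in parts.items():
    print(tag, len(fragments))
```

Each key of `parts` plays the role of one split target (name.avro, publisher.avro); the schema handling (the avsc part) is left out entirely.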

It also handles gigabytes of data very easily, because it streams the data tag by tag instead of loading the whole XML at once.
As for "multiple documents": in the example below, assume book is your root tag and the file has 2 messages (2 books), so each book will be stored as a record in the output Avro. In general you would only get one root tag per file; I added this option for use with Flume, where a bunch of messages is combined and saved in a single file.

    <book id="b001">
        <author>Brandon Sanderson</author>
        <title>Mistborn</title>
        <genre>Fantasy</genre>
        <price>50</price>
        <pub_date>2006-12-17T09:30:47.0Z</pub_date>
        <review>
            <title>Wonderful</title>
            <content>I love the plot twist and the new magic</content>
        </review>
        <review>
            <title>Unbelievable twist</title>
            <content>The best book i ever read</content>
        </review>
        <sold>10</sold>
    </book>
    <book id="b002">
        <author>Brandon Sanderson</author>
        <title>Way of Kings</title>
        <genre>Fantasy</genre>
        <price>50</price>
        <pub_date>2006-12-17T09:30:47.0Z</pub_date>
        <!--<alias>-->
            <!--<title>Way of the kings</title>-->
        <!--</alias>-->
        <!--<website>-->
            <!--<url></url>-->
        <!--</website>-->
        <sold>10</sold>
    </book>
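
A file like this has two root elements, which a standard XML parser normally rejects. As a sketch of one way to read such a stream record by record (again in Python, and not taken from xml-avro; `iter_documents` and the `<wrapper>` element are made-up names), the input can be fed to an incremental parser inside a synthetic root:

```python
import itertools
import xml.etree.ElementTree as ET


def iter_documents(chunks, root_tag):
    """Yield one element per top-level `root_tag` document found in `chunks`.

    `chunks` is any iterable of text pieces whose concatenation may hold
    several root elements back to back.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    parser.feed("<wrapper>")  # synthetic root so several documents are legal
    depth = 0
    for chunk in itertools.chain(chunks, ["</wrapper>"]):
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                # depth 1 here means a direct child of the synthetic wrapper
                if depth == 1 and elem.tag == root_tag:
                    yield elem


# Shortened, hypothetical version of the two books above.
stream = [
    '<book id="b001"><title>Mistborn</title><sold>10</sold></book>',
    '<book id="b002"><title>Way of Kings</title><sold>10</sold></book>',
]
records = [(b.get("id"), b.findtext("title")) for b in iter_documents(stream, "book")]
print(records)  # [('b001', 'Mistborn'), ('b002', 'Way of Kings')]
```

This simple approach assumes the individual documents carry no XML declaration of their own, since a declaration in the middle of the wrapped stream would not parse.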
