Add text extractor for extracting unstructured output #11

Closed
mkumar1984 wants to merge 15 commits


mkumar1984
Contributor

Dear DIL maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

  • My PR addresses the following JIRA issues and references them in the PR title.

Description

  • Here are some details about my PR, including screenshots (if applicable):
    DIL currently supports many structured formats, such as CSV, JSON, and Avro, as well as many compression formats. Unstructured text is supported only through FileDumpExtractor, which dumps its output to HDFS, so that output cannot be passed to any converter. This PR adds a text extractor that can extract output in any format and pass it to a converter for further ETL instead of writing it directly to HDFS. This is useful when we want to fetch the output of a URL and then apply custom parsing to get the required result.
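
To illustrate the extract-then-convert flow described above, here is a minimal, self-contained sketch. It does not use the actual DIL or Gobblin classes: `TextExtractorSketch`, `extractText`, `nonBlankLines`, and `extractAndConvert` are hypothetical names chosen for this example, and the in-memory stream stands in for a URL response body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch only: these are NOT the DIL extractor/converter classes.
public class TextExtractorSketch {

    // Assumed "text extractor" step: read the whole stream as one UTF-8 string
    // instead of dumping it to a file system. Requires Java 9+ (readAllBytes).
    public static String extractText(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    // Assumed "converter" step: custom parsing that keeps trimmed, non-blank lines.
    public static List<String> nonBlankLines(String text) {
        List<String> records = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                records.add(trimmed);
            }
        }
        return records;
    }

    // Wire the two steps together: the extracted text is handed to any converter.
    public static <T> T extractAndConvert(InputStream in, Function<String, T> converter)
            throws IOException {
        return converter.apply(extractText(in));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the body of a URL response.
        InputStream response = new ByteArrayInputStream(
                "alpha\n\n beta \ngamma\n".getBytes(StandardCharsets.UTF_8));
        List<String> records = extractAndConvert(response, TextExtractorSketch::nonBlankLines);
        System.out.println(records); // prints [alpha, beta, gamma]
    }
}
```

The point of the sketch is the hand-off: because the extractor returns the text rather than writing it out, any parsing function can be plugged in as the next stage.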

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Added unit test cases.

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@chris9692 force-pushed the master branch 4 times, most recently from 152b832 to 6c7a3cc (October 6, 2021 12:26)
@soEricNG
Collaborator

Could you pull the latest changes and update the PR? The diff currently includes a lot of unrelated files.

@booddu closed this on Oct 20, 2021
@mkumar1984
Contributor Author

Created a new PR for this: #13


3 participants