NUTCH-2793 indexer-csv: make it work in distributed mode #534
base: master
@@ -44,17 +44,14 @@
 * index as CSV or tab-separated plain text table. Format (encoding, separators,
 * etc.) is configurable by a couple of options, see output of
 * {@link #describe()}.
 *
 * <p>
 * Note: works only in local mode, to be used with index option
 * <code>-noCommit</code>.
 * </p>
 *
 */
public class CSVIndexWriter implements IndexWriter {

  public static final Logger LOG = LoggerFactory
      .getLogger(CSVIndexWriter.class);

  private String filename = "nutch.csv";
  private Configuration config;

  /** ordered list of fields (columns) in the CSV file */
@@ -192,7 +189,7 @@ protected int find(String value, int start) {

  @Override
  public void open(Configuration conf, String name) throws IOException {
This method is deprecated since the switch to the XML-based index writer configuration (see NUTCH-1480 and the wiki page IndexWriters). "name" was just an arbitrary name, not a file name indicating a task-specific output path. We would need a method which takes both the IndexWriterParams and the output path. This would require changes in the IndexWriter interface and also in the classes IndexWriters and IndexerMapReduce. I'm also not sure whether the output path alone is sufficient. We'll eventually need an OutputCommitter and need to think about situations where we have multiple index writers (e.g. via exchanges). See also the discussion in NUTCH-1541.
    filename = name;
  }

  /**
@@ -227,7 +224,7 @@ public void open(IndexWriterParams parameters) throws IOException {
     LOG.info("Writing output to {}", outputPath);
     Path outputDir = new Path(outputPath);
     fs = outputDir.getFileSystem(config);
-    csvLocalOutFile = new Path(outputDir, "nutch.csv");
+    csvLocalOutFile = new Path(outputDir, filename);
     if (!fs.exists(outputDir)) {
       fs.mkdirs(outputDir);
     }
Still "local filesystem"? Eventually we could use the outpath to overcome the problem of multiple index writers.
Sorry, I did not understand that, could you elaborate?
Sorry, I've mixed two points together:
- outpath points to a different directory
- outpath to write into distinct output directories or distinct subdirectories of one job-specific output directory
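To make the second point more concrete, here is a minimal sketch of how distinct output files could be derived below one job-specific output directory. It is only an illustration under assumptions: the helper `taskOutputFile`, the writer id, the task id parameter and the `part-NNNNN.csv` naming are hypothetical and not part of Nutch or of this PR, and it uses `java.nio.file.Path` instead of Hadoop's `Path` so that it runs without Hadoop on the classpath.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

public class TaskOutputPathSketch {

  /**
   * Hypothetical helper (not part of Nutch): derives a distinct CSV file for
   * each index writer and each task below one job-specific output directory.
   */
  static Path taskOutputFile(Path jobOutputDir, String writerId, int taskId) {
    // One subdirectory per index writer, one file per task, so that
    // parallel tasks and multiple writers never write to the same file.
    String fileName = String.format(Locale.ROOT, "part-%05d.csv", taskId);
    return jobOutputDir.resolve(writerId).resolve(fileName);
  }

  public static void main(String[] args) {
    // "out" and "indexer_csv_1" are placeholder values for illustration.
    Path jobDir = Paths.get("out");
    System.out.println(taskOutputFile(jobDir, "indexer_csv_1", 0));
    System.out.println(taskOutputFile(jobDir, "indexer_csv_1", 1));
  }
}
```

This only addresses naming. Whether task output becomes visible atomically would still depend on an OutputCommitter, as noted in the earlier comment.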