HADOOP-19131. Assist reflection IO with WrappedOperations class #6686
Conversation
prepared parquet for this by renaming the vectorio package to …
Force-pushed 12c95ff to 668c1ce
Force-pushed c1e52f5 to 827b41c
Force-pushed 0dad2aa to e6241ab
Force-pushed 128ba0c to 128e2d7
This is in sync with apache/hadoop#6686, which has renamed one of the method names to load. The new DynamicWrappedIO class is based on one being written as part of that PR; since both are based on the Parquet DynMethods class, a copy-and-paste is straightforward.
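For readers following along, here is a minimal sketch (not part of the PR) of the kind of dynamic binding being discussed: a library with no compile-time dependency on the new API looks up a WrappedIO static method by name and degrades gracefully when it is absent. Plain java.lang.reflect is used instead of the DynMethods helper, and the bulkDelete_pageSize(FileSystem, Path) signature is assumed from the commit message further down.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative only: reflection binding to a WrappedIO static method. */
public final class WrappedIoBindingSketch {

  private static final String WRAPPED_IO =
      "org.apache.hadoop.io.wrappedio.WrappedIO";

  /** Returns the bulk delete page size, or 1 when the API is unavailable. */
  public static int bulkDeletePageSize(FileSystem fs, Path path) {
    try {
      Class<?> wrappedIO = Class.forName(WRAPPED_IO);
      // Assumed signature: static int bulkDelete_pageSize(FileSystem, Path)
      Method m = wrappedIO.getMethod("bulkDelete_pageSize", FileSystem.class, Path.class);
      return (int) m.invoke(null, fs, path);
    } catch (ClassNotFoundException | NoSuchMethodException e) {
      // Older Hadoop release: the class/method is absent, degrade to a safe default.
      return 1;
    } catch (InvocationTargetException e) {
      // WrappedIO methods raise unchecked exceptions; rethrow the cause.
      Throwable cause = e.getCause();
      if (cause instanceof RuntimeException) {
        throw (RuntimeException) cause;
      }
      throw new IllegalStateException(cause);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    }
  }
}
```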
- I think I might cut the new read forms (parquet, orc) from the read policy, though parquet/1 and parquet/3 may be good
Resolved review threads on:
- hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Options.java (two outdated threads)
- ...ls/hadoop-azure/src/test/java/org/apache/hadoop/fs/azurebfs/contract/ITestAbfsWrappedIO.java
- ...s-project/hadoop-hdfs/src/test/java/org/apache/hadoop/fs/contract/hdfs/TestDFSWrappedIO.java
@mukund-thakur this PR renames …; my Iceberg PR apache/iceberg#10233 looks for the new name. It is now dynamic and should build/link up if we can think of a way to test it (proposed: make it an option to use if present; default is true).
Force-pushed b44df10 to a60f769
💔 -1 overall
This message was automatically generated.
javadocs
checkstyles are all about use of _ in method names, except for one
💔 -1 overall
This message was automatically generated.
legitimate failure
Class WrappedIO extended with more filesystem operations:
- openFile()
- PathCapabilities
- StreamCapabilities
- ByteBufferPositionedReadable

* test on supported filesystems (hdfs)
* Plus tests with validation of degradation when IO methods are not found.

Explicitly add read policies for columnar, parquet and orc.
Add IOStatistics context accessors and reset().

Read policies added:
* columnar
* orc
* parquet
* avro

This is to make it clearer to the filesystem implementations that they should optimize for whatever their data traces recommend.

Class DynamicWrappedIO to access the WrappedIO methods through Parquet's DynMethods API. This class becomes easy to copy and paste into Parquet and Iceberg and then be immediately used.

Class WrappedStatistics to provide equivalent access to IOStatistics interfaces, objects and operations. Ability to:
* Get a serializable IOStatisticsSnapshot from an IOStatisticsSource or IOStatistics instance
* Save an IOStatisticsSnapshot to file
* Convert an IOStatisticsSnapshot to JSON
* Given an object which may be an IOStatisticsSource, return an object whose toString() value is a dynamically generated, human readable summary. This is for logging.
* Separate getters to the different sections of IOStatistics.
* Mean values are returned as a Map.Pair<Long, Long> of (samples, sum) from which means may be calculated.

Tuned AbstractContractBulkDeleteTest:
* make setUp() an override of the existing setup(); this makes initialization more deterministic.
* inline some variables in setup()

Important: this change renames bulkDelete_PageSize to bulkDelete_pageSize so it is consistent with all the new methods being added.

This is in sync with the initial implementation of PARQUET-2493; tuning code to suit actual use. In particular:
- WrappedIO methods raise UncheckedIOExceptions
- DynamicWrappedIO methods unwrap these
- static method to switch between openFile() and open() based on method availability.

Change-Id: Ib4f177d5409156217f4c3d14f1c99adfe82b96d2
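The last point above, a static helper that switches between openFile() and open() based on method availability, could look roughly like the sketch below. This is illustrative only; the real helper lives in DynamicWrappedIO, and a full dynamic binding would also resolve the builder reflectively rather than casting.

```java
import java.io.IOException;
import java.lang.reflect.Method;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FutureDataInputStreamBuilder;
import org.apache.hadoop.fs.Path;

/** Sketch: prefer the openFile() builder when present, else classic open(). */
public final class OpenFileOrFallback {

  public static FSDataInputStream openWithPolicy(
      FileSystem fs, Path path, String readPolicy) throws IOException {
    final Method openFile;
    try {
      openFile = FileSystem.class.getMethod("openFile", Path.class);
    } catch (NoSuchMethodException e) {
      // Hadoop release predates the openFile() builder API: use classic open().
      return fs.open(path);
    }
    try {
      FutureDataInputStreamBuilder builder =
          (FutureDataInputStreamBuilder) openFile.invoke(fs, path);
      // The read policy is a hint; stores ignore policy names they don't know.
      builder.opt("fs.option.openfile.read.policy", readPolicy);
      CompletableFuture<FSDataInputStream> future = builder.build();
      return future.get();
    } catch (ReflectiveOperationException | ExecutionException e) {
      throw new IOException("openFile() failed for " + path, e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while opening " + path, e);
    }
  }
}
```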
Move the DynMethods and related classes under org.apache.hadoop.util.dynamic, marked as private.

Change-Id: I9ff52ab02d51bf2175862a3020b41e969088fb65
Add a boolean to enable/disable footer caching. These are all hints with no wiring up. Google GCS does footer caching, and abfs has it as a WiP; those clients can adopt the hints as desired. The reason for the footer cache flag is that some query engines do their own caching; having the input stream try to "be helpful" is at best needless and at worst counterproductive.

Change-Id: Ibf5914d9fa327438790b946b29b9369d098ae14c
Indicates that locations are generated client-side and don't refer to real hosts. If found, list calls which return LocatedFileStatus are low cost. Added for: file, s3a, abfs, oss.

Change-Id: Id94be4cbf1a41ac84818c7b2e061423b9b24d149
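A small, hypothetical usage sketch of this capability follows; only the capability name and the hasPathCapability() call come from Hadoop, the surrounding probing code is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

/** Hypothetical caller-side probe for the new path capability. */
public final class VirtualBlockLocationsProbe {

  public static final String VIRTUAL_BLOCK_LOCATIONS =
      "fs.capability.virtual.block.locations";

  /** True for file://, s3a://, abfs:// and oss:// after this change. */
  public static boolean locationsAreVirtual(FileSystem fs, Path path) throws IOException {
    return fs.hasPathCapability(path, VIRTUAL_BLOCK_LOCATIONS);
  }

  public static void listWithLocations(FileSystem fs, Path dir) throws IOException {
    if (locationsAreVirtual(fs, dir)) {
      // Locations are made up client side, so a located listing adds no remote
      // calls and there is no point scheduling work for locality.
      RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
      while (it.hasNext()) {
        LocatedFileStatus status = it.next();
        System.out.println(status.getPath() + " " + status.getLen());
      }
    }
  }
}
```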
Got the signature wrong; logging loading at debug to diagnose this.

Change-Id: I9c96ffe61d123b9461636380ef77f55d8ddbe3a4
Move the unchecking into default methods in CallableRaisingIOE, FunctionRaisingIOE etc; this makes for a clean and flexible design. Some test enhancements.

Change-Id: If25b6d0377bc9e4e8d4a6e689692ddfa96b1c756
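As an illustration of that design (made-up names; the real interfaces are CallableRaisingIOE and friends in org.apache.hadoop.util.functional, and the actual default-method names may differ):

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/**
 * A functional interface whose operation may raise a checked IOException,
 * plus a default method that re-invokes it and wraps any failure in an
 * UncheckedIOException. Callers such as the WrappedIO static methods can
 * then expose unchecked signatures while the underlying lambdas keep
 * throwing IOException.
 */
@FunctionalInterface
public interface IOECallable<T> {

  /** The operation; may raise a checked IOException. */
  T apply() throws IOException;

  /** Invoke the operation, converting any IOException to an unchecked one. */
  default T applyUnchecked() {
    try {
      return apply();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```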
Javadoc, checkstyle and unit test for the new method.

Change-Id: Id16d01c193814c46215c81e8040ffa7a25720f1c
Force-pushed b59846b to 3fe9cdb
Declare that hbase is an hbase table; s3a maps it to random IO. abfs recommends disabling prefetch for these files; it should do it automatically when support for read policies is wired up.

Change-Id: I0823cd307a059bf0f3499e7555d9ccc87fb4ae70
Force-pushed 3fe9cdb to 76b0afc
💔 -1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
Change-Id: Ibd158b3a14bacc95059f0e4e86179e78bebdb53c
🎊 +1 overall
This message was automatically generated.
All the checkstyle warnings are from underscores; I tried to set up a style rule to disable this, but it didn't work right as there are no checkstyle overrides in hadoop-common right now.
It's a big patch. I started reviewing it a few weeks ago and checked again today. Overall it looks great to me, +1.
I just don't understand why we added fs.capability.virtual.block.locations in this patch?
It's to say "this fs makes up block locations". It means the cost of looking up block locations is a lot lower (no remote calls) and you don't really need to schedule work elsewhere. Now that hasPathCapability() is being exported to legacy code, I just felt this would be useful. Currently things look for the default (host == localhost) and go from there, but they only get to do that after the lookup.
HADOOP-19131. Assist reflection IO with WrappedOperations class (#6686)

1. The class WrappedIO has been extended with more filesystem operations:
- openFile()
- PathCapabilities
- StreamCapabilities
- ByteBufferPositionedReadable

All these static methods raise UncheckedIOExceptions rather than checked ones.

2. The adjacent class org.apache.hadoop.io.wrappedio.WrappedStatistics provides similar access to IOStatistics/IOStatisticsContext classes and operations.

Allows callers to:
* Get a serializable IOStatisticsSnapshot from an IOStatisticsSource or IOStatistics instance
* Save an IOStatisticsSnapshot to file
* Convert an IOStatisticsSnapshot to JSON
* Given an object which may be an IOStatisticsSource, return an object whose toString() value is a dynamically generated, human readable summary. This is for logging.
* Separate getters to the different sections of IOStatistics.
* Mean values are returned as a Map.Pair<Long, Long> of (samples, sum) from which means may be calculated.

There are examples of the dynamic bindings to these classes in:
org.apache.hadoop.io.wrappedio.impl.DynamicWrappedIO
org.apache.hadoop.io.wrappedio.impl.DynamicWrappedStatistics

These use DynMethods and other classes in the package org.apache.hadoop.util.dynamic which are based on the Apache Parquet equivalents. This makes it straightforward to re-implement these bindings in that library and in others which have their own fork of the classes (example: Apache Iceberg).

3. The openFile() option "fs.option.openfile.read.policy" has added specific file format policies for the core filetypes:
* avro
* columnar
* csv
* hbase
* json
* orc
* parquet

S3A chooses the appropriate sequential/random policy.
A policy list `parquet, columnar, vector, random, adaptive` will use the parquet policy for any filesystem aware of it, falling back to the first entry in the list which the specific version of the filesystem recognizes.

4. New path capability fs.capability.virtual.block.locations

Indicates that locations are generated client side and don't refer to real hosts.

Contributed by Steve Loughran
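A short usage sketch of the expanded read-policy option: the option key and the policy list come from this commit message, while the file path and surrounding code are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class ReadPolicyExample {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(new Configuration());
    // Each store uses the first policy in the list it recognizes: a release
    // that knows "parquet" optimizes for parquet-style access; older ones
    // fall back to "random"/"adaptive"; unknown names are simply ignored.
    try (FSDataInputStream in = fs.openFile(path)
        .opt("fs.option.openfile.read.policy",
            "parquet, columnar, vector, random, adaptive")
        .build()
        .get()) {
      System.out.println("first byte: " + in.read());
    }
  }
}
```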
Hi @steveloughran, sorry, I should have added that comment here itself.
HADOOP-19131
Assist reflection IO with WrappedOperations class
How was this patch tested?
Needs new tests going through reflection, maybe some in openfile contract to guarantee full use.
For code changes:
- If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?