-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-7146] Integrate secondary index on reader path #11162
Conversation
dc04695
to
49f3c3f
Compare
1c2c253
to
96b10a6
Compare
a602c9c
to
3c52961
Compare
df12fa5
to
b342d8f
Compare
hudi-common/src/main/java/org/apache/hudi/metadata/BaseTableMetadata.java
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
Show resolved
Hide resolved
...park-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SecondaryIndexSupport.scala
Show resolved
Hide resolved
...asource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSecondaryIndexWithSql.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution, I have left some comments.
.../hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java
Show resolved
Hide resolved
7e5f353
to
7a0a21f
Compare
@@ -253,7 +253,7 @@ public HoodieMetadataLogRecordReader build() { | |||
} | |||
|
|||
private boolean shouldUseMetadataMergedLogRecordScanner() { | |||
return PARTITION_NAME_SECONDARY_INDEX.equals(partitionName); | |||
return partitionName.startsWith(PARTITION_NAME_SECONDARY_INDEX_PREFIX); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: this is main fix in the latest commit. The issue was not caught in the testing previously because I was only asserting count of records. I have improved the test. Now, we assert the secondary index records as well as check that file pruning happens. Please see TestSecondaryIndexPruning
} | ||
|
||
|
||
private def verifyFilePruning(opts: Map[String, String], dataFilter: Expression): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: the method here and below just help with verifying file pruning happens correctly i.e. with data skipping enabled, filtered file count < data files in latest snapshot, and with data skipping disabled they should be equal.
bb3bff3
to
fe352bc
Compare
@@ -85,6 +85,7 @@ public void create(HoodieTableMetaClient metaClient, String indexName, String in | |||
if (indexExists(metaClient, indexName)) { | |||
throw new HoodieFunctionalIndexException("Index already exists: " + indexName); | |||
} | |||
checkArgument(columns.size() == 1, "Only one column can be indexed for functional or secondary index."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: HUDI-7836 to track the support
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, except a minor comment.
more sec index tests disable func index test temp fix post rebase undo test disable Fix some tests after rebase fix spark 2.4
…e hfile reader for sec index
fe352bc
to
db219f6
Compare
): Option[Set[String]] = { | ||
val secondaryKeyConfigOpt = getSecondaryKeyConfig(queryReferencedColumns, metaClient) | ||
if (secondaryKeyConfigOpt.isEmpty) { | ||
Option.empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return Option.empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our scala checkstyle does not allow to add return -
Line 75 in e0cf4b6
<check level="error" class="org.scalastyle.scalariform.ReturnChecker" enabled="true"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, got it. But this judgment logic has no effect here when secondaryKey is empty, will continue to execute the latter code
} | ||
lazy val (_, secondaryKeys) = if (isIndexAvailable) filterQueriesWithSecondaryKey(queryFilters, secondaryKeyConfigOpt.map(_._2)) else (List.empty, List.empty) | ||
if (isIndexAvailable && queryFilters.nonEmpty && secondaryKeys.nonEmpty) { | ||
val allFiles = fileIndex.inputFiles.map(strPath => new StoragePath(strPath)).toSeq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use prunedPartitionsAndFileSlices
replace allFiles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, let me do it in a followup PR
var fileIndex = HoodieFileIndex(spark, metaClient, None, commonOpts, includeLogFiles = true) | ||
val filteredPartitionDirectories = fileIndex.listFiles(Seq(), Seq(dataFilter)) | ||
val filteredFilesCount = filteredPartitionDirectories.flatMap(s => s.files).size | ||
assertTrue(filteredFilesCount < getLatestDataFilesCount(opts)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a new check assertTrue(filteredFilesCount > 0)
, in my test, it will be wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will check this. I also need to refactor and put these in some test util class as it is used in other index test as well.
Change Logs
This PR integrates the secondary index on the reader path as follows:
HoodieTableMetadata#readSecondaryIndex
APISecondaryIndexSupport
, a util class to get candidate files based on query filter and secondary indexHoodieFileIndex
to useSecondaryIndexSupport
Impact
Can use secondary index in queries.
Risk level (write none, low medium or high below)
medium
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist