Optimize token_classification/rank.py for performance #1078
Summary
This PR partially addresses #862
After profiling get_label_quality_scores, it seems that the list comprehensions and the assert_valid_inputs function were taking a significant amount of time. One test, test_assert_valid_class_labels_fails_with_str_labels, was added to ensure that an error is correctly raised when the labels are provided as a string. The validation is now much faster when the provided labels are numbers.
In addition, most of the work can be batched, so I refactored the function to work in batches and added a test to ensure that the results are the same regardless of the batch size and the method chosen. I included the batch_size argument, as in the other functions from other modules.
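To illustrate the batching idea, here is a minimal pure-Python sketch. It is not the code in this PR: iter_batches and sentence_scores are hypothetical names, and the scoring rule (minimum self-confidence over the tokens of a sentence) is assumed only for the sake of the example. The point is that flattening the tokens of each batch, scoring them in one pass, and splitting the result back per sentence gives the same scores for any batch size.

```python
from typing import Iterator, List, Sequence


def iter_batches(n_items: int, batch_size: int) -> Iterator[range]:
    """Yield index ranges that cover 0..n_items-1 in chunks of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, n_items, batch_size):
        yield range(start, min(start + batch_size, n_items))


def sentence_scores(
    labels: Sequence[Sequence[int]],
    pred_probs: Sequence[Sequence[Sequence[float]]],
    batch_size: int = 2,
) -> List[float]:
    """Hypothetical scorer: min self-confidence per sentence, computed in batches.

    Assumes every sentence has at least one token.
    """
    scores: List[float] = []
    for batch in iter_batches(len(labels), batch_size):
        # Flatten all tokens in the batch so they are scored in a single pass.
        flat_labels = [label for i in batch for label in labels[i]]
        flat_probs = [probs for i in batch for probs in pred_probs[i]]
        flat_scores = [probs[label] for label, probs in zip(flat_labels, flat_probs)]

        # Split the flat scores back into one slice per sentence.
        offset = 0
        for i in batch:
            n_tokens = len(labels[i])
            scores.append(min(flat_scores[offset:offset + n_tokens]))
            offset += n_tokens
    return scores


# Toy data: 3 sentences, 2 classes.
LABELS = [[0, 1], [1], [0, 0, 1]]
PRED_PROBS = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.3, 0.7]],
    [[0.6, 0.4], [0.5, 0.5], [0.1, 0.9]],
]

reference = sentence_scores(LABELS, PRED_PROBS, batch_size=1)
for size in (2, 3, 10):
    # Results must not depend on the batch size.
    assert sentence_scores(LABELS, PRED_PROBS, batch_size=size) == reference
```

A larger batch_size means fewer passes at the cost of holding more flattened tokens in memory at once, which is the usual trade-off behind exposing it as an argument.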
I hit an edge case while profiling issues_from_scores. The code below raised an IndexError:
The fix was just a matter of changing the order of the comparisons in the while loop. This is a very rare case, but it is fixed now. In addition, by keeping two separate lists, we can reduce the memory usage when sorting and filtering the objects.
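As a generic illustration of why the order of comparisons in a while loop matters, consider the sketch below. It is not the actual issues_from_scores code; both function names are hypothetical. Python evaluates `and` left to right and short-circuits, so the bounds check has to come before the element access.

```python
from typing import Sequence


def count_leading_below(sorted_scores: Sequence[float], threshold: float) -> int:
    """Count leading scores below threshold, checking bounds before indexing."""
    i = 0
    # Bounds check first: when i == len(sorted_scores), the right-hand side
    # is never evaluated, so no out-of-range access happens.
    while i < len(sorted_scores) and sorted_scores[i] < threshold:
        i += 1
    return i


def count_leading_below_buggy(sorted_scores: Sequence[float], threshold: float) -> int:
    """Same loop with the comparisons swapped; fails when every score qualifies."""
    i = 0
    # Element access first: once i reaches len(sorted_scores), indexing raises
    # IndexError before the bounds check gets a chance to stop the loop.
    while sorted_scores[i] < threshold and i < len(sorted_scores):
        i += 1
    return i
```

The buggy variant only fails when every element satisfies the condition (or the input is empty), which is why this kind of error can go unnoticed for a long time.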
To measure memory usage, I used the memory-profiler library. The code I used for benchmarking is copied below. In addition, I sorted the imports in the modified files.
Code Setup
Current version
This PR
Testing
References
Reviewer Notes