Elbow finding #250

Open
wants to merge 48 commits into
base: master


ShawnCodesABit

Adds a variety of cutoff heuristics for deciding what is and isn't similar. The functionality is optional and disabled by default. It is supported for all aspects of Top2Vec except the topic-by-term array, which is currently stored as a single 2D array; a list of 1D arrays would be needed to allow a different length per topic.
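For readers unfamiliar with elbow finding, here is a minimal sketch of one such heuristic. It is not the code in this PR: `elbow_index` and `keep_up_to_elbow` are hypothetical names, the scores are made up, and it assumes each row of similarity scores is already sorted in descending order. It picks the point farthest from the straight line joining the first and last scores, and shows why per-topic cutoffs yield rows of different lengths, which a single rectangular 2D array cannot hold.

```python
from typing import List, Sequence


def elbow_index(scores: Sequence[float]) -> int:
    """Index of the point farthest from the line joining the first and last scores.

    Assumes `scores` is sorted in descending order (hypothetical helper,
    not part of Top2Vec).
    """
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    lo, hi = min(scores), max(scores)
    if n < 3 or hi == lo:
        # Too few points, or a flat curve: no elbow to find, keep everything.
        return n - 1

    best_idx, best_dist = 0, -1.0
    for i, s in enumerate(scores):
        # Rescale both axes to [0, 1] so neither dominates the distance.
        x = i / (n - 1)
        y = (s - lo) / (hi - lo)
        # After rescaling, the line from (0, 1) to (1, 0) is x + y = 1;
        # |x + y - 1| is proportional to the perpendicular distance from it.
        dist = abs(x + y - 1.0)
        if dist > best_dist:
            best_idx, best_dist = i, dist
    return best_idx


def keep_up_to_elbow(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Trim each row just after its elbow (elbow point included)."""
    return [list(row[: elbow_index(row) + 1]) for row in rows]


# Made-up similarity scores for two topics, sorted descending.
rows = [
    [0.9, 0.5, 0.3, 0.25, 0.22, 0.2],
    [0.9, 0.3, 0.25, 0.2, 0.18, 0.15],
]
trimmed = keep_up_to_elbow(rows)
print([len(r) for r in trimmed])  # [3, 2]: ragged, so not a rectangular 2D array
```

Whether the elbow point itself is kept (inclusive) or dropped is a design choice; the sketch keeps it.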

Other changes:

  • Updated setup.py to just work.
  • Added in a dev option to setup.py.
  • Added type hints throughout the parts of Top2Vec that I have modified.
  • Parameterized the maximum number of terms to describe a topic.
  • Gave an option to not lowercase tokens when querying. This can help when working with non-traditional text datasets or other tokenization methods.

…ing on using to describe topics. Stubs for making a sparse matrix.
…s within an embedding. Now with elbows. Also found some strange behavior when the curve crosses over the line.
…lity to function on first elbow and also ensure that only positive values are returned.
…nd that adding in a maximum percent difference for the first bin catches some of the cases where elbow finding performs poorly.
…o flip when we are and are not inclusive from an elbow index. Great. Also more cases where running twice gives us more accurate results than running on all the data. This may be solved with vocabulary reduction, but maybe I should include a recursive option.
…s a combination of the 2nd derivative and the distance from the line.
…rent heuristics. This commit has a rather large performance hit as it saves the y-delta for each point in order to determine the sign of the curve.
…s to the cutoff heuristics. Lots of changes on documentation. Changed name from elbow_finding to cutoff_heuristics as there are multiple options now.
… y distance between the value and the line doesn't work as well for finding an index. Massive speed-ups and simplification due to chopping off various metrics.
…s of changes to start supporting cutoff_heuristics within Top2Vec class.
ShawnCodesABit and others added 18 commits March 11, 2022 01:13
…nit tests. Refactoring all of the heuristics into their own submodule for cleanliness.
…Using some more type hints and refactoring that into its own file. Updates to test to get a bit more coverage.
… provided values when computing a cutoff. Changed default heuristic to recursive_elbow after testing with live data.
…perimenting with different data sets. Adding in some more functionality to the plot submodule to show how to visualize a heuristic determination.
…ated into plot and word cloud so that heuristic based cutoffs work a bit smoother.
…erformance increases that are about to be added.

Making sure that files which are generated when running the test notebooks don't get added to git.
…array once, speed ups on the derivative calculations.
…and the ability to plot from arbitrary vectors.
…se tight oscillation cases it seems like the shifted second derivative works best.
Removing commented out code.

jstremme commented Nov 9, 2022

This PR looks really useful! I might dig into it a bit. Wonder if it'll be merged 🐙


2 participants