[WIP] A small pipeline tweak: tokenization (x-ray) #174

Open
wants to merge 1 commit into
base: master

Conversation

lopuhin
Contributor

@lopuhin lopuhin commented Apr 27, 2017

"x-ray" was tokenized as "ray": fix that by changing the default tokenizer.

Fixes #172. I also tried other changes mentioned in the issue:

  • made the tokenizer parse "isn't" as a single token (allowing ' inside words), but "isn't" still isn't present in the default scikit-learn stop-word list
  • tried bigrams: they made accuracy a bit worse and did not noticeably change the example we are analyzing. They are also slower to run, so I left them as an exercise for the reader :)
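
To illustrate the tokenization issue, here is a minimal sketch using only the `re` module. `DEFAULT_PATTERN` is the default `token_pattern` of scikit-learn's text vectorizers, which requires at least two word characters per token and so drops the "x" in "x-ray". `HYPHEN_PATTERN` and both helper functions are hypothetical names made up for this illustration; the pattern shows one way to allow `-` and `'` inside words and is not necessarily the pattern used in this PR.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers:
# a token is two or more word characters between word boundaries.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative (an assumption, not taken from this PR):
# word characters, optionally joined by single hyphens or apostrophes.
HYPHEN_PATTERN = r"(?u)\b\w+(?:[-']\w+)*\b"


def tokenize_default(text):
    """Tokenize the way the default scikit-learn pattern would."""
    return re.findall(DEFAULT_PATTERN, text)


def tokenize_with_hyphens(text):
    """Tokenize while keeping '-' and "'" inside words."""
    return re.findall(HYPHEN_PATTERN, text)


text = "an x-ray isn't cheap"
print(tokenize_default(text))       # ['an', 'ray', 'isn', 'cheap']
print(tokenize_with_hyphens(text))  # ['an', 'x-ray', "isn't", 'cheap']
```

A pattern like this could be passed as `token_pattern` to a scikit-learn vectorizer. Note that single-character tokens such as "a" are also kept by this alternative pattern, which the default pattern would drop.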

For some reason a few other things in the notebook changed slightly, probably due to a newer version. I only added blocks 17 and 18 (with the text before and after them) and re-ran the first half of the notebook. I haven't updated the tutorial yet: if you like the changes, I'll update it and re-run the whole notebook.

Here is the link to the notebook: https://github.com/TeamHG-Memex/eli5/blob/text-tutorial-x-ray/notebooks/Debugging%20scikit-learn%20text%20classification%20pipeline.ipynb

@lopuhin lopuhin requested a review from kmike April 27, 2017 20:03
@codecov-io

codecov-io commented Apr 27, 2017

Codecov Report

Merging #174 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #174   +/-   ##
=======================================
  Coverage   97.25%   97.25%           
=======================================
  Files          39       39           
  Lines        2405     2405           
  Branches      452      452           
=======================================
  Hits         2339     2339           
  Misses         34       34           
  Partials       32       32

Development

Successfully merging this pull request may close these issues.

Add more observations to sklearn text tutorial
2 participants