oskar-j/awesome-text-ml

Awesome software for Text ML

A curated list of awesome ML frameworks and text embeddings, focused on state-of-the-art (SOTA) libraries that are actively maintained on GitHub.

Frameworks and libraries

🐍 Python

Text processing

  • HanLP - Natural Language Processing for the next decade. Tokenization, Part-of-Speech Tagging, Named Entity Recognition, Syntactic & Semantic Dependency Parsing, Document Classification via one unified interface. https://bbs.hankcs.com/

  • flair - A library for state-of-the-art natural language processing (NLP) models, covering named entity recognition (NER), part-of-speech tagging (PoS), sense disambiguation and classification, with special support for biomedical data.

  • sentencepiece - Unsupervised text tokenizer for Neural Network-based text generation.

  • stanza - Official Stanford NLP Python Library for Many Human Languages. https://stanfordnlp.github.io/stanza/

Pipelines / block-programming

Distributed computing

Machine Learning

  • sklearn - Scikit-learn is a Python module for machine learning built on top of SciPy, including tools for text vectorization and vector space compression. https://scikit-learn.org/stable/

  • gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Its target audience is the natural language processing (NLP) and information retrieval (IR) community. https://radimrehurek.com/gensim/

  • nlpaug - Data augmentation for NLP in your machine learning projects.

  • AugLy - A data augmentation library from Facebook Research for audio, image, text, and video.
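
As a rough illustration of what the scikit-learn entry above means by "text vectorization and vector space compression", the sketch below turns a toy corpus into TF-IDF vectors and then compresses them with truncated SVD. The corpus, the variable names, and the choice of two components are assumptions made for illustration only; they are not recommendations from this list.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
    "investors worry as markets fall",
]

# Text vectorization: one sparse TF-IDF row per document.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Vector space compression: project the sparse rows onto 2 dense components.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape)    # (number of documents, vocabulary size)
print(reduced.shape)  # (number of documents, 2)
```

On a real corpus the number of components would typically be much larger than two and tuned to the task; two is used here only to keep the output easy to inspect.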

Deep Learning

Natural Language Understanding

Text mining

  • dedupe - A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.

Visualizations

  • Scattertext - Beautiful visualizations of how language differs among document types.

Big language models

  • BIG-bench - Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.

C++

Text processing

Currently empty 🪹

Knowledge 📚

Learning 101

  • Virgilio - Virgilio is an open-source initiative that aims to mentor and guide anyone in the world of Data Science.

Multiple languages

Python (and Python Notebooks)

  • practicalAI - A practical approach to machine learning to enable everyone to learn, explore and build. https://practicalai.me

  • nlp-recipes - Comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems.

No longer maintained