-
Vectorization
Text analysis pipeline
Most text mining and NLP models use bag-of-words or bag-of-n-grams methods. Despite their simplicity, these models usually demonstrate good performance on text c...
-
GloVe Word Embeddings
Word embeddings
After Tomas Mikolov et al. released the word2vec tool, there was a boom of articles about word vector representations. One of the best of these articles is Stanfo...
-
Documents similarity
Document similarity (or distance between documents) is one of the central themes in Information Retrieval. How humans usually define how similar are documen...
-
text2vec
text2vec:
Consistent - exposes unified interfaces, so there is no need to learn a new interface for each task;
Flexible - makes it easy to solve complex tasks;
Fast - maximizes efficiency per single thread,...
-
Topic modeling
2018-12-21
Topic modeling is a technique to extract abstract topics from a collection of documents. To do that, the input document-term matrix is usually decomposed into 2 low-rank matri...
-
Collocations
2018-12-21
In this tutorial I will show how to extract phrases from text and how they can be used in downstream tasks. I will use the
text8 dataset, which is available for download here. It c...