Conversation
```python
'topic_modeling': [
    'gensim',
    'spacy'
],
```
Do we not need a spacy corpus as well?
`quantgov/estimator/structures.py` (Outdated)

```python
class QGLdaModel(BaseEstimator, TransformerMixin):
    @check_gensim
    @check_spacy
    def __init__(self, word_regex=r'\b[A-z]{2,}\b', stop_words=STOP_WORDS):
```
I would think the options for `stop_words` should be:

- `None` (default): no stop words
- `True`: use built-in stop words
- A sequence: user-specified stop words
Not having any stop words seems to produce a pretty unusable model. My thinking is that it's best to have some default: if the user chooses to override that default with `None` they can, but the defaults should be able to produce something usable. We could include some output when none are provided (e.g. "INFO: No stop words provided, using sklearn builtins"), and potentially a warning if `None` is passed.
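That resolution logic could be sketched roughly like this (a hypothetical helper, not code from the PR; `DEFAULT_STOP_WORDS` stands in for whatever built-in list is ultimately chosen):

```python
import logging
import warnings

# Stand-in for a real built-in list such as sklearn's ENGLISH_STOP_WORDS
DEFAULT_STOP_WORDS = frozenset({'a', 'an', 'and', 'the', 'of', 'to', 'in'})


def resolve_stop_words(stop_words=True):
    """Resolve the stop_words argument to a frozenset of words."""
    if stop_words is None:
        # Explicit opt-out: allowed, but likely produces a poor model
        warnings.warn('No stop words in use; the model may be unusable')
        return frozenset()
    if stop_words is True:
        logging.info('No stop words provided, using built-in defaults')
        return DEFAULT_STOP_WORDS
    # Any other value is treated as a user-specified sequence
    return frozenset(stop_words)
```

The default here is `True` rather than `None`, following the suggestion above that the out-of-the-box behavior should produce a usable model.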
`quantgov/estimator/structures.py` (Outdated)

```python
class QGLdaModel(BaseEstimator, TransformerMixin):
    @check_gensim
    @check_spacy
    def __init__(self, word_regex=r'\b[A-z]{2,}\b', stop_words=STOP_WORDS):
```
`word_regex` should be `word_pattern` to match what already exists in SKL.
`quantgov/estimator/structures.py` (Outdated)

```python
    pass


class QGLdaModel(BaseEstimator, TransformerMixin):
```
I don't like either the prefix or the `Model` specifier. I'd call this `GensimLDA` or something like that.
`quantgov/estimator/structures.py` (Outdated)

```python
import re


try:
    from spacy.lang.en.stop_words import STOP_WORDS
```
If we're literally only using spacy here for the stop words, can't we somehow find the sklearn stop words used in the `CountVectorizer`? Those have got to be importable from somewhere.
```python
        for doc in driver.stream()])
    stop_ids = [self.dictionary.token2id[stopword] for stopword
                in self.stop_words if stopword in self.dictionary.token2id]
    once_ids = [tokenid for tokenid, docfreq in
```
Why are we doing this?
Filtering out words that occur only once was recommended in the Gensim documentation; beyond that, I don't know whether it actually improves the performance of the model.
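The Gensim recipe being referenced amounts to dropping stop words and hapax tokens by document frequency. A pure-Python sketch of the same filtering (the real code works on a `gensim.corpora.Dictionary`; names here are illustrative):

```python
from collections import Counter


def surviving_vocab(tokenized_docs, stop_words):
    """Mimic the Gensim recipe: drop stop words and tokens that
    appear in only one document.

    tokenized_docs: list of token lists, one per document.
    Returns the surviving vocabulary as a set.
    """
    # Document frequency: number of documents each token appears in
    docfreq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok for tok, df in docfreq.items()
            if df > 1 and tok not in stop_words}


vocab = surviving_vocab(
    [['the', 'cat', 'sat'], ['the', 'cat', 'ran'], ['a', 'dog']],
    stop_words={'the', 'a'},
)
# Only 'cat' appears in more than one document and is not a stop word
```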
```python
         for i in self.word_regex
             .finditer(doc.text)]
        for doc in driver.stream()])
    stop_ids = [self.dictionary.token2id[stopword] for stopword
```
Wouldn't it be better to pass only the dictionary words that aren't in `stop_words`?
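That is, filter the stop words out before the dictionary is built, so there is no need for a post-hoc `stop_ids` pass. A plain-Python sketch of the idea (the actual code builds a gensim `Dictionary`; the helper name here is hypothetical):

```python
def build_vocab(tokenized_docs, stop_words):
    """Assign token2id-style ids, skipping stop words up front
    instead of removing them from the dictionary afterwards."""
    vocab = {}
    for doc in tokenized_docs:
        for tok in doc:
            if tok not in stop_words and tok not in vocab:
                vocab[tok] = len(vocab)  # next available id
    return vocab


vocab = build_vocab([['the', 'cat'], ['cat', 'dog']], stop_words={'the'})
# → {'cat': 0, 'dog': 1}
```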
@OliverSherouse ready for follow-up review