Conversation
```python
'topic_modeling': [
    'gensim',
    'spacy'
],
```
Do we not need a spacy corpus as well?
`quantgov/estimator/structures.py` (Outdated)

```python
class QGLdaModel(BaseEstimator, TransformerMixin):
    @check_gensim
    @check_spacy
    def __init__(self, word_regex=r'\b[A-z]{2,}\b', stop_words=STOP_WORDS):
```
I would think the options for `stop_words` should be:

- `None` (default): no stop words
- `True`: use built-in stop words
- A sequence: user-specified stop words
Not having any stop words seems to produce a pretty unusable model. My thinking is that it's best to have some default: if the user chooses to override that default with `None` they can, but the defaults should be able to produce something usable. We could include some output when none are provided (e.g. "INFO: No stop words provided, using sklearn builtins"), and potentially a warning if `None` is passed.
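That resolution logic could be sketched roughly like this (a hypothetical helper, not code from the PR; `DEFAULT_STOP_WORDS` stands in for whatever built-in list is ultimately chosen):

```python
import logging
import warnings

# Stand-in for a real built-in list such as sklearn's ENGLISH_STOP_WORDS
DEFAULT_STOP_WORDS = frozenset({'a', 'an', 'and', 'the', 'of', 'to', 'in'})


def resolve_stop_words(stop_words=True):
    """Resolve the stop_words argument to a frozenset of words."""
    if stop_words is None:
        # Explicit opt-out: allowed, but likely produces a poor model
        warnings.warn('No stop words in use; the model may be unusable')
        return frozenset()
    if stop_words is True:
        logging.info('No stop words provided, using built-in defaults')
        return DEFAULT_STOP_WORDS
    # Any other value is treated as a user-specified sequence
    return frozenset(stop_words)
```

The default here is `True` rather than `None`, following the suggestion above that the out-of-the-box behavior should produce a usable model.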
`quantgov/estimator/structures.py` (Outdated)

```python
class QGLdaModel(BaseEstimator, TransformerMixin):
    @check_gensim
    @check_spacy
    def __init__(self, word_regex=r'\b[A-z]{2,}\b', stop_words=STOP_WORDS):
```
`word_regex` should be `word_pattern` to match what already exists in SKL.
`quantgov/estimator/structures.py` (Outdated)

```python
    pass


class QGLdaModel(BaseEstimator, TransformerMixin):
```
I don't like either the prefix or the `Model` specifier. I'd call this `GensimLDA` or something like that.
`quantgov/estimator/structures.py` (Outdated)

```python
import re


try:
    from spacy.lang.en.stop_words import STOP_WORDS
```
If we're literally only using spacy here for the stop words, can't we somehow find the sklearn stop words used in the `CountVectorizer`? Those have got to be importable from somewhere.
```python
        for doc in driver.stream()])
    stop_ids = [self.dictionary.token2id[stopword] for stopword
                in self.stop_words if stopword in self.dictionary.token2id]
    once_ids = [tokenid for tokenid, docfreq in
```
Why are we doing this?
Filtering out words that occur only once was recommended in the Gensim documentation; beyond that, I don't know whether it actually improves the performance of the model.
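The Gensim recipe being referenced amounts to dropping stop words and hapax tokens by document frequency. A pure-Python sketch of the same filtering (the real code works on a `gensim.corpora.Dictionary`; names here are illustrative):

```python
from collections import Counter


def surviving_vocab(tokenized_docs, stop_words):
    """Mimic the Gensim recipe: drop stop words and tokens that
    appear in only one document.

    tokenized_docs: list of token lists, one per document.
    Returns the surviving vocabulary as a set.
    """
    # Document frequency: number of documents each token appears in
    docfreq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok for tok, df in docfreq.items()
            if df > 1 and tok not in stop_words}


vocab = surviving_vocab(
    [['the', 'cat', 'sat'], ['the', 'cat', 'ran'], ['a', 'dog']],
    stop_words={'the', 'a'},
)
# Only 'cat' appears in more than one document and is not a stop word
```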
```python
         for i in self.word_regex
             .finditer(doc.text)]
        for doc in driver.stream()])
    stop_ids = [self.dictionary.token2id[stopword] for stopword
```
Wouldn't it be better to pass only the dictionary words that aren't in `stop_words`?
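That is, filter the stop words out before the dictionary is built, so there is no need for a post-hoc `stop_ids` pass. A plain-Python sketch of the idea (the actual code builds a gensim `Dictionary`; the helper name here is hypothetical):

```python
def build_vocab(tokenized_docs, stop_words):
    """Assign token2id-style ids, skipping stop words up front
    instead of removing them from the dictionary afterwards."""
    vocab = {}
    for doc in tokenized_docs:
        for tok in doc:
            if tok not in stop_words and tok not in vocab:
                vocab[tok] = len(vocab)  # next available id
    return vocab


vocab = build_vocab([['the', 'cat'], ['cat', 'dog']], stop_words={'the'})
# → {'cat': 0, 'dog': 1}
```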
@OliverSherouse ready for follow-up review