Improved the singularize method in inflect.py #220
Open
TanyaaCJain wants to merge 1 commit into clips:master
Although the singularize method previously reached 95% accuracy when
measured against the CELEX English morphology word forms, the following
changes raise the accuracy to 99%:
1. Added more words to the set singular_uninflected.
2. In the singularize method, changed the condition that checks the set
singular_uninflected from
    if x.endswith(w): return word
to
    if x == w or w == x + "s": return x
because the former treated the words in the set as word endings and
therefore also affected words that merely carry a prefix in front of a
word in the set.
The new condition checks whether the word passed as the argument is
present in the set either as is or followed by an "s". It then returns
the singular form stored in the set rather than the input word, which
may have been passed in plural form.
3. Added more words to the list singular_uncountable, grouped by comments
into categories such as abstract ideas and expressions, natural phenomena,
and general, for ease of reading and understanding.
4. Added more words to the list singular_ie and the dict singular_irregular.
5. Words that could be grouped by a pattern were written as regular
expressions (regex) in singular_rules instead of being added one by one
to the lists and dicts mentioned above.
6. In the singularize method, changed the condition that checks the dict
singular_irregular from
    if w.endswith(x):
to
    if x == w:
because the former treated each key x in the dict as an ending of the
word passed to the singularize method.
The latter checks whether the word w passed as the argument is itself a
key of the dict by comparing it with x. If so, the method returns the
singular form of w, that is, singular_irregular[x].
7. Added more regular expressions to the list singular_rules to cover
further singularization rules and improve the accuracy of the
singularize method.
8. As a result, this commit resolves the following currently open issues:
    Issue     - singularized on - earlier effect - current effect
    141, 175  - flour           - flmy           - flour
    141       - colour          - colmy          - colour
    141       - your            - ymy            - your
    141       - olives          - olife          - olive
    176       - hummus          - hummu          - hummus
[141](clips#141)
[175](clips#175)
[176](clips#176)
9. The words added to sets singular_uninflected and singular_uncountable
were also added to the lists in dict plural_categories["uninflected"]
and plural_categories["uncountable"] for consistency.
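
To illustrate the condition changes in points 2 and 6, here is a minimal sketch that contrasts the earlier suffix check with the new exact-match check. The miniature set and dict below, and the function names old_irregular_check and new_checks, are assumptions made for this example only. The real data in inflect.py is much larger, and the regex rules in singular_rules are left out.

```python
# Hypothetical miniature data for illustration only; not the contents of inflect.py.
singular_uninflected = {"hummus", "series", "species"}
singular_irregular = {"our": "my", "lives": "life", "children": "child"}


def old_irregular_check(word):
    """Earlier behaviour as described above: any word ENDING in a key is rewritten."""
    w = word.lower()
    for x in singular_irregular:
        if w.endswith(x):
            # Replace the matched ending with the irregular singular form.
            return w[: -len(x)] + singular_irregular[x]
    return word


def new_checks(word):
    """New behaviour as described above: only exact matches are rewritten."""
    w = word.lower()
    for x in singular_uninflected:
        # Match the entry itself or the entry followed by "s".
        if x == w or w == x + "s":
            return x
    for x in singular_irregular:
        # Match only when the whole word equals the key.
        if x == w:
            return singular_irregular[x]
    # In this sketch, anything else falls through unchanged
    # (the regex rules that would normally run next are omitted).
    return word


for sample in ("flour", "colour", "your", "olives"):
    print(sample, "->", old_irregular_check(sample), "(old) /", new_checks(sample), "(new)")
```

With this sample data the old check reproduces the reported results (flmy, colmy, ymy, olife), while the new checks leave those words untouched for the later rules to handle.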
Note that the 99% accuracy was measured with corpora/test_en.py and
holds only for the CELEX English morphology word forms dataset.
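
For readers unfamiliar with how such a figure is obtained, the following is a rough sketch of an accuracy measurement over (plural, singular) pairs. It is not the code in corpora/test_en.py. The accuracy helper, the naive_singularize stand-in, and the four sample pairs are all hypothetical and only show why the reported number depends on the dataset used.

```python
def accuracy(pairs, singularize):
    """Return the fraction of (plural, singular) pairs that singularize gets right."""
    correct = sum(1 for plural, singular in pairs if singularize(plural) == singular)
    return correct / len(pairs)


def naive_singularize(word):
    """Hypothetical stand-in for a singularizer: strip one trailing "s"."""
    return word[:-1] if word.endswith("s") else word


# Hypothetical sample pairs; the real evaluation uses the CELEX word forms.
sample_pairs = [
    ("cats", "cat"),
    ("dogs", "dog"),
    ("hummus", "hummus"),
    ("flour", "flour"),
]

score = accuracy(sample_pairs, naive_singularize)
print(f"accuracy: {score:.0%}")
```

A different word list would yield a different score for the same method, which is why the 99% applies to the CELEX data only.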