Adding Aho-Corasick to Boost.Algorithm #24
zamazan4ik wants to merge 27 commits into boostorg:develop from
Also fixed #include <memory> in aho-corasick implementation.
Aho-Corasick now uses a callback instead of an output container. Updated the algorithm, documentation, example, and tests.
If the callback returns false for a match, the search is now cancelled.
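To make the callback behaviour concrete, here is a minimal sketch of an Aho-Corasick automaton that reports each match through a callback and stops as soon as the callback returns false. The class name `aho_corasick_sketch`, its member functions, and the `(begin, end)` callback signature are assumptions made for illustration; they are not the interface proposed in this PR.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical illustration only; not the PR's actual interface.
class aho_corasick_sketch {
    struct node {
        std::map<char, int> next;  // trie edges, stored as indices into nodes_
        int fail = 0;              // failure link: longest proper suffix that is a trie prefix
        int dict = -1;             // nearest state on the failure chain that ends a pattern
        int pattern_length = 0;    // > 0 when a pattern ends in this state
    };
    std::vector<node> nodes_;      // nodes_[0] is the root

public:
    aho_corasick_sketch() : nodes_(1) {}

    void insert(const std::string& pattern) {
        if (pattern.empty()) return;
        int cur = 0;
        for (char c : pattern) {
            auto it = nodes_[cur].next.find(c);
            if (it == nodes_[cur].next.end()) {
                nodes_.emplace_back();
                const int idx = static_cast<int>(nodes_.size()) - 1;
                nodes_[cur].next[c] = idx;
                cur = idx;
            } else {
                cur = it->second;
            }
        }
        nodes_[cur].pattern_length = static_cast<int>(pattern.size());
    }

    // Computes failure and dictionary links breadth-first; call after the last insert().
    void build() {
        std::queue<int> pending;
        for (const auto& edge : nodes_[0].next) pending.push(edge.second);
        while (!pending.empty()) {
            const int u = pending.front();
            pending.pop();
            for (const auto& edge : nodes_[u].next) {
                const char c = edge.first;
                const int v = edge.second;
                int f = nodes_[u].fail;
                while (f != 0 && nodes_[f].next.count(c) == 0) f = nodes_[f].fail;
                auto it = nodes_[f].next.find(c);
                nodes_[v].fail = (it != nodes_[f].next.end()) ? it->second : 0;
                const int fv = nodes_[v].fail;
                nodes_[v].dict = nodes_[fv].pattern_length > 0 ? fv : nodes_[fv].dict;
                pending.push(v);
            }
        }
    }

    // Calls on_match(begin, end) with half-open offsets for every occurrence.
    // Returns false if the callback cancelled the search, true otherwise.
    bool find(const std::string& text,
              const std::function<bool(std::size_t, std::size_t)>& on_match) const {
        int state = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            const char c = text[i];
            while (state != 0 && nodes_[state].next.count(c) == 0) state = nodes_[state].fail;
            auto it = nodes_[state].next.find(c);
            if (it != nodes_[state].next.end()) state = it->second;
            for (int s = state; s > 0; s = nodes_[s].dict) {
                const std::size_t len = static_cast<std::size_t>(nodes_[s].pattern_length);
                if (len > 0 && !on_match(i + 1 - len, i + 1)) return false;
            }
        }
        return true;
    }
};
```

A caller that only needs the first occurrence can return false from the callback right away, which avoids scanning the rest of the corpus.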
Good start. Please fix the following:
Thanks for the comment. Let's discuss.
I will update the docs, examples, and tests after finishing work on aho_corasick.hpp.
Users usually expect the default constructor to be lightweight. Take a look at the
About the comparison with std::find: this comparison is somewhat unfair, because:
It means there is no reason to use A-C to search for one or two short patterns in a small corpus. I will update the documentation to say so. On large inputs, though, A-C is very fast. My benchmark uses Tolstoy's "War and Peace" as the corpus and a British dictionary as the patterns: A-C finishes in less than 1 second, while std::find is far slower. My system is an i7-3630QM with 12 GiB RAM and a Samsung 850 Evo 500 GiB, running Kubuntu 16.10. About memory allocation: do you have any ideas for optimizing it? I could preallocate a pool of memory and create new nodes from that range.
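One way to read the "preallocate a pool" idea is to keep every trie node in a single contiguous buffer and link nodes by index rather than by pointer. The sketch below is an assumption about how that could look, not code from this PR: `pooled_trie` and its members are hypothetical names, and it only builds the pattern trie (no failure links) to keep the focus on allocation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration: all nodes live in one std::vector that is
// reserved up front, so building the trie performs a single allocation.
class pooled_trie {
    static constexpr std::uint32_t npos = 0xFFFFFFFFu;

    struct node {
        char symbol;                 // character on the edge leading to this node
        bool terminal;               // true if a pattern ends here
        std::uint32_t first_child;   // index of the first child, or npos
        std::uint32_t next_sibling;  // index of the next sibling, or npos
    };
    std::vector<node> pool_;

    std::uint32_t find_child(std::uint32_t parent, char c) const {
        for (std::uint32_t i = pool_[parent].first_child; i != npos; i = pool_[i].next_sibling) {
            if (pool_[i].symbol == c) return i;
        }
        return npos;
    }

    void insert(const std::string& pattern) {
        std::uint32_t cur = 0;
        for (char c : pattern) {
            std::uint32_t child = find_child(cur, c);
            if (child == npos) {
                child = static_cast<std::uint32_t>(pool_.size());
                // New node becomes the head of the parent's child list.
                pool_.push_back(node{c, false, npos, pool_[cur].first_child});
                pool_[cur].first_child = child;
            }
            cur = child;
        }
        pool_[cur].terminal = true;
    }

public:
    explicit pooled_trie(const std::vector<std::string>& patterns) {
        // Upper bound on the node count: the root plus one node per pattern character.
        std::size_t upper_bound = 1;
        for (const auto& p : patterns) upper_bound += p.size();
        pool_.reserve(upper_bound);
        pool_.push_back(node{'\0', false, npos, npos});
        for (const auto& p : patterns) insert(p);
    }

    bool contains(const std::string& word) const {
        std::uint32_t cur = 0;
        for (char c : word) {
            cur = find_child(cur, c);
            if (cur == npos) return false;
        }
        return pool_[cur].terminal;
    }

    std::size_t node_count() const { return pool_.size(); }
};
```

The first-child/next-sibling layout trades lookup speed (a linear scan over siblings) for nodes that need no per-node allocation; whether that trade pays off for a dictionary-sized pattern set would have to be measured.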
Search next pattern from current position:
Make 2 kinds of benchmarks:
First of all, make the Container hold nodes by value, not using
Container can't hold node by value, because Container is declared inside node. I can only store pointers to node in Container.
Boost containers can. Try to use
Thank you. It works.
Actually, it depends on the standard library implementation. For example, this code works fine with libc++:
Hmm, thank you for the libc++ example. I didn't know about that :). A-C now uses boost::container::map and boost::unordered_map, because our library should also work with libstdc++. Everything works fine. Performance improved by 1.8x (test: British dictionary and "War and Peace").
With libstdc++ works well on
There's a small limited amount of possible values for T if T is
The idea with using
OK, I will test it. I have already read about variable-length encoded integers. It may be useful for users. But I have some more questions:
The idea is the following: patterns are usually not very long, so a length of 127 is more than enough most of the time. Now, with a variable-length encoded integer (vint) you can do the following:
You can start by always storing
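For readers unfamiliar with variable-length integers, here is a minimal sketch of one common encoding: 7 payload bits per byte, with the high bit marking that more bytes follow (the LEB128 style). Under it, any length up to 127 fits in a single byte. The exact scheme meant in this comment is not spelled out in the thread, so the encoding and the function names `vint_encode` and `vint_decode` are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends `value` to `out`, 7 bits per byte, least significant group first.
// The high bit of each byte is set when another byte follows.
inline void vint_encode(std::uint64_t value, std::vector<std::uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Reads one encoded integer starting at `pos` and advances `pos` past it.
// Assumes well-formed input; decoding stops at the end of the buffer or
// after 64 bits have been consumed.
inline std::uint64_t vint_decode(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint64_t value = 0;
    int shift = 0;
    while (pos < in.size() && shift < 64) {
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;
        shift += 7;
    }
    return value;
}
```

With this layout a pattern length of 127 costs one byte and a length of 128 costs two, so short patterns pay almost nothing for the ability to store long ones.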
I tested the new versions, and the results are:
{
    node new_node;
    current_node->links[*it] = std::move(new_node);
    child_node = &current_node->links[*it];
I believe that lines 102-103 can be simplified to one line, current_node->links[*it] = node();, and that move semantics will still take place in C++11.
Yes, you are right. Will be fixed later.
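The claim that the one-line form still moves can be checked with a small instrumented type. This is a sketch: `traced_node`, `move_counters`, and `assign_both_ways` are hypothetical names standing in for the PR's node type, and the map is a plain std::map rather than the PR's container.

```cpp
#include <map>
#include <utility>

// Counts which assignment operator was selected.
struct move_counters {
    int copy_assignments = 0;
    int move_assignments = 0;
};

inline move_counters& counters() {
    static move_counters instance;
    return instance;
}

// Stand-in for the node type; only the assignment operators are instrumented.
struct traced_node {
    traced_node() = default;
    traced_node(const traced_node&) = default;
    traced_node(traced_node&&) noexcept = default;
    traced_node& operator=(const traced_node&) {
        ++counters().copy_assignments;
        return *this;
    }
    traced_node& operator=(traced_node&&) noexcept {
        ++counters().move_assignments;
        return *this;
    }
};

// Performs the assignment both ways and returns the resulting counts.
inline move_counters assign_both_ways() {
    counters() = move_counters{};
    std::map<char, traced_node> links;

    traced_node new_node;
    links['a'] = std::move(new_node);  // two-line form from the diff
    links['b'] = traced_node();        // suggested one-line form

    return counters();
}
```

Both statements select the move assignment operator, because a temporary such as `traced_node()` is already an rvalue and needs no std::move.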
@apolukhin, although I am a fan of regular types that have trivial default constructors, the existing searchers (BM, BMH & KMP) don't have them, so I wonder if it makes sense here?
Alexander, could you update the description with a specific citation of which papers or books you based your implementation on? Thanks.
ZaMaZaN4iK, @apolukhin, you did a good job. Don't you want to finish it? Please check that your code is multi thread
@toshchev95 Hi! Thank you! Unfortunately, I have no time to finish the PR right now. So if you want to continue the work on it, that would be awesome!
I wrote an implementation of the Aho-Corasick algorithm. It requires C++11 (I used std::unique_ptr, variadic templates, and default template parameters).