I’ve been digging into ranges recently, and I’m finding them to be more than just a pair of iterators. In a series of posts, I’ll be expanding the notion of what a range is to cover some kinds of ranges not easily or efficiently expressible within the STL today: delimited ranges and infinite ranges. This post deals with the problems of representing delimited ranges with STL iterators.
Delimited Ranges
When groping about for concepts, it’s essential to have some concrete examples in mind. So when I say “delimited range,” think: null-terminated C-style string. The end of the sequence is not some known position; rather, it’s an unknown position at which we expect to find some delimiter, or more generally, at which some predicate becomes true. Another example, interestingly, is an istream range. The delimiter in that case is when the istream extractor fails. And yet, the standard has std::istream_iterator
, so clearly it’s not impossible to shoehorn delimited ranges into the STL. I’ll show how, and explain why I use the term “shoehorn.”
Delimited Ranges in the STL
To prove my “shoehorn” allegation, here is a delimited range over a C-style string with fully STL-compliant iterators:
#include <cassert> #include <iostream> #include <boost/iterator/iterator_facade.hpp> struct c_string_range { private: char const *str_; public: using const_iterator = struct iterator : boost::iterator_facade< iterator , char const , std::forward_iterator_tag > { private: friend class boost::iterator_core_access; friend struct c_string_range; char const * str_; iterator(char const * str) : str_(str) {} bool equal(iterator that) const { return str_ ? (that.str_ == str_ || (!that.str_ && !*str_)) : (!that.str_ || !*that.str_); } void increment() { assert(str_ && *str_); ++str_; } char const& dereference() const { assert(str_ && *str_); return *str_; } public: iterator() : str_(nullptr) {} }; c_string_range(char const * str) : str_(str) { assert(str_); } iterator begin() const { return iterator{str_}; } iterator end() const { return iterator{}; } explicit operator bool() const { return !!*str_; } }; int main() { for(char c : c_string_range("hello world!")) std::cout << c; std::cout << 'n'; }
The code traverses the sequence of characters without first computing its end. It does it by creating a dummy end iterator — a sentinel — such that any time a real iterator is compared to it, it checks to see if the real iterator points to the null terminator. All the gross logic is there in the c_string_range::iterator::equal
member function. Nobody would call this code beautiful or elegant.
In the STL today, ranges are specified with two iterators: the begin and the end. For iterators like std::istream_iterator
or c_string_range::iterator
where an iterator can be a sentinel, it adds branches to the iterator equality test since you first have to determine if one or both of the iterators are sentinels. The expression a == b
is evaluated according to the following truth table:
a == end ?
|
b == end ?
|
a == b ?
|
---|---|---|
true
|
true
|
true
|
true
|
false
|
*b == 0
|
false
|
true
|
*a == 0
|
false
|
false
|
&*a == &*b
|
The above tests must be evaluated at runtime! There’s no way to know a priori whether an iterator is a real iterator or a dummy one. And all that checking is expensive. That’s what I mean when I say that delimited ranges can be “shoehorned” into the STL. It’s not a comfortable fit.
The Compiler Agrees
And when I say it’s an uncomfortable fit, that’s not just my opinion. I generated code for the following two functions:
int c_strlen(char const *sz) { int i = 0; for(; *sz; ++sz) ++i; return i; } int range_strlen( c_string_range::iterator begin, c_string_range::iterator end) { int i = 0; for(; begin != end; ++begin) ++i; return i; }
The two functions do exactly the same thing, so in theory they should generate the same code. Our Spidey-sense should be tingling though after seeing the complicated conditional logic in c_string_range::iterator::equal
. Indeed, the code they generate is far from comparable:
c_strlen
|
range_strlen
|
---|---|
pushl %ebp movl %esp, %ebp movl 8(%ebp), %ecx xorl %eax, %eax cmpb $0, (%ecx) je LBB1_3 xorl %eax, %eax .align 16, 0x90 LBB1_2: cmpb $0, 1(%ecx,%eax) leal 1(%eax), %eax jne LBB1_2 LBB1_3: popl %ebp ret |
pushl %ebp movl %esp, %ebp pushl %esi leal 8(%ebp), %ecx movl 12(%ebp), %esi xorl %eax, %eax testl %esi, %esi movl 8(%ebp), %edx jne LBB2_4 jmp LBB2_1 .align 16, 0x90 LBB2_8: incl %eax incl %edx movl %edx, (%ecx) LBB2_4: testl %edx, %edx jne LBB2_5 cmpb $0, (%esi) jne LBB2_8 jmp LBB2_6 .align 16, 0x90 LBB2_5: cmpl %edx, %esi jne LBB2_8 jmp LBB2_6 .align 16, 0x90 LBB2_3: leal 1(%edx,%eax), %esi incl %eax movl %esi, (%ecx) LBB2_1: movl %edx, %esi addl %eax, %esi je LBB2_6 cmpb $0, (%esi) jne LBB2_3 LBB2_6: popl %esi popl %ebp ret |
Oh my! Look at all those tests and branches! The above code was generated with clang 3.4 with -O3 -DNDEBUG
. I should add that in practice, the compiler can often generate better code for range_strlen
. If the compiler can infer statically that end
is in fact a sentinel, and if the definition of range_strlen
is available for inlining, then the compiler will generate better code. Near-optimal, in fact. But those are some big “If”s.
Besides, people generally don’t contort themselves by writing the c_string_range
class when dealing with delimited strings. They call strlen
and then some algorithm, traversing the range twice instead of once. But consider the case of the istream range. You can’t do the same trick with an input range because merely finding the end iterator consumes the range! Now we see why std::istream_iterator
has a dummy sentinel. There’s simply no other way.
And as a final note, observe that c_string_range::iterator
is a forward iterator, despite the fact that the raw char const*
it wraps is random-access. That’s because the sentinel cannot be decremented. The range’s iterator can only be as powerful as its sentinel, which is pretty darn weak.
So What?
So we can’t efficiently use STL algorithms on C-style strings. Big deal, right? Actually, it is. It means that pretty much all generic string algorithms cannot be used on C-style strings. Look at all the juicy string algorithms in Boost.String_algo. The docs say this about the string types it supports:
“Definition: A string is a range of characters accessible in sequential ordered fashion. […] First requirement of string-type is that it must accessible using Boost.Range. This facility allows to access the elements inside the string in a uniform iterator-based fashion.”
No love for C-style strings from Boost.String_algo. And by the way, what do you think happens when you call std::regex_search
with a C-style string? It first calls strlen
! So even if your string is megabytes long and the match is at the very front, you first have to traverse the entire string just so that you know where the end is. Which is all totally pointless.
“You shouldn’t be using C-style strings anyway,” you say. But the problem is bigger than C-style string. All delimited ranges have this problem. Just within the standard library, there are istream_iterator
, istreambuf_iterator
, regex_iterator
, and regex_token_iterator
, all of which have dummy sentinels, all of which have been shoehorned in like I’ve shown above. I’m sure you can think of others.
Dietmar Kuehl alerted me to another interesting case. Have you ever wanted to call a generic algorithm but couldn’t because you wanted to break out of the loop early under some condition? Imagine that you could build a delimited range with that predicate and the end iterator. Now you can pass that range to an algorithm and it would stop either when the predicate becomes true or when you reach the end of the sequence. Voila! Standard algorithms just got a lot more useful. But this iterator type would have to be shoehorned in like the others, and you wouldn’t be able to call any algorithm that required more than forward iterators since you can’t decrement the sentinel.
Conclusion, For Now…
What’s my point? My point is this: the pair-of-iterators range abstraction that we are familiar with and that was designed to have low abstraction cost has real abstraction cost that cannot be avoided for delimited ranges. It also forces delimited ranges to model weaker concepts than they might otherwise, and makes their implementation awkward. What’s the solution? I do have a concrete suggestion, but we’re not there yet. First I want to talk about infinite ranges, and then we’ll see how delimited, infinite and pair-o’-iterators ranges can all be subsumed into a larger Range concept. Tune in next time…
Acknowledgements
I’d like to Dietmar Kuehl and Andrew Sutton for helping me formulate my range ideas and for reviewing this article.
Looking like an interesting article. I’ve always hated C++ style begin/end iterators. They’re unnatural and create myriad issues (can’t tell if they came from the same collection is insane).
Looking forward to seeing where this goes…
(And for those who say don’t use C char arrays… ugh – must be nice living in a world where byte-storage doesn’t matter and C compatibility is irrelevant… on second thought, actually I prefer living where those are important – and should be important – and C++ should be better at offering its strengths to these age-old issues that simply will not, and perhaps should not, go away)
bool equal(iterator that) const
{
return str_ == that.str_ ||
(!str_ && !that.str_) ||
(!str_ && !that.str_);
}
has undefined behavior if (e.g.)
str_ == nullptr && that.str_ != nullptr && *that.str_ != ''
. I think you need:bool equal(iterator that) const
{
return str_ ? (that.str_ || !*str_)
: (!that.str_ || !*that.str_);
}
Taking advantage of the “all non-end input iterators are equal” trick.
Curse you, absence of an edit button! I reversed the logic of the test that
that.str_
points to a non-zero character whenstr_
isnullptr
, should be:bool equal(iterator that) const
{
return str_ ? (that.str_ || !*str_)
: (!that.str_ || *that.str_);
}
(I also apologize for my inability to master the art of code formatting)
You’re right, I’ll fix that and also regenerate the assembly. Btw,
c_string_range
has forward iterators, so you need to compare the pointers if they’re both non-null.EDIT: OK, I think I have it fixed now. Please take another look.
Yes, I agree it’s correct now.
boost::filesystem::directory_iterator is another good example (and if I’m not mistaken it might be considered for C++14?).
And I would think just about any iterator that handles linked lists would have similar problems – what is the end iterator really telling you other than NULL?
You’re probably right about
directory_iterator
. I don’t think you’re right aboutstd::list
‘s iterator. It’s bidirectional; you have to be able to decrement the end iterator. You are probably right aboutstd::forward_list
‘s iterators, but if the iterators merely contain a node pointer that could be null, then it’s already optimally efficient, and building a sentinel won’t gain you anything. I’m willing to be corrected, though.Thanks for the std::regex_search with a C-style string side note.
I have a bunch of code like that which runs unreasonably slow. I never figured out why. Ill see if I can do something about it now.
I was also dissatisfied with the iterator pairs, so ended up trying my own range type (https://github.com/essennell/narl). I knew about boost’s range, but didn’t go into it too deeply at the time. I’ll follow this series with interest! Certainly my range stuff doesn’t attempt to solve any of the problems you’re describing here.
When can we see the next posts? I find this topic very interesting 🙂
Refresh! Post #2 is up now!
Good, I already read it. I want the 3rd one! I was so impatient that I took a look at your implementation in github 🙂 I would like to know what is all that pipeable stuff. I guess it is | notation as boost.range? And yes, I think that | notation should be lazy and traditional sort(rng) notation should be eager. I think it would fit my mental model well because sometimes I have been using Boost.Range already. I think I consider essential is left to right notation of some kind. Without that, it’s quite difficult to read when there is a lot of nest.
This looks like calling for a C# IEnumerable like interface being used as the forward iterator and constructing a range from the enumerable.
I will be mentioning a facade class that makes it easy to implement ranges this way, but it doesn’t affect the core concepts of the range library.
Great article, I found it very interesting.
I’d like to report a minor typo: “It means that pretty must all …”, I assume must should have been much?
Fixed, thanks!
afaik nonlenprefixed strs are one of Cs big blunders(Walter Bright also is not big fan of arrays not having size iirc ), so I see no real problem with iterators…. And other examples also dont sound convincing…
esp this part:
If you want to ask me how do I mean that C++ strs are len prefixed…
they are not but len is O(1) operation.
I never said I was trying to do away with iterators. As for the rest of your comment, I’m afraid I don’t understand what you’re trying to say.
i was just saying that some problems you mention are rooted in Cs decision to use zero term strings instead of len prefixed ones. and that cpp has len prefixed strings although they are not len prefixed in literal sense, but you need to substract 2 ptrs of std::string to get the len, it is still O(1)
nonsense
vc, please be more constructive.
meh, stupid internet connection(aka me editing and pressing submit :P),
link for bright is this:
http://www.drdobbs.com/architecture-and-design/cs-biggest-mistake/228701625
and scrap the
” esp this part:”
😀
also since AFAIK you are expert in regexes(GRETA iirc, boost…)
is that call to strlen on c string really that big of a problem…
quoting for regex_search:
Determines if there is a match between the regular expression e and some subsequence in the target character sequence.
strlen seems like not even a increase in constant factor regarding execution time, aka i guess that regex search has O(something greater than n) complexity
Well, it depends. Some regex matches can be very fast. If the regex begins with a string literal (e.g.
"foo[xyz]"
), then Boyer-Moore can be used to find the very few places where the match could possibly succeed. And if the string is very long and the pattern matches early in the string, it’s entirely conceivable that the cost ofstrlen
could dominate.But in general you’re right. Regex matching is much worse than O(n), so a call to
strlen
will likely go unnoticed.I would argue that the problem is not that the pair-of-iterators model is broken, but rather that the begin and end iterators have to be the same type. If that constraint is lifted, then generic algorithms can generate the same code as raw loops:
http://tinyurl.com/pvq2j32
Unfortunately, this fails to compile with C++11’s ranged-for syntax because that ‘same type’ requirement is baked into the language.
+1 🙂
Heterogeneous iterators: https://github.com/dietmarkuehl/kuhllib/wiki/STL-2.0#wiki-heterogeneous-iterators
Infinite iterators:http://ami.ektf.hu/uploads/papers/finalpdf/AMI_38_from75to86.pdf
I acknowledge Dietmar’s contributions in a later post. But thanks for the link to the Kozsik et.al. paper. I’ll take a look.
Pingback: Trip Report: C++ Standards Committee Meeting in Issaquah, February 2014 | There's Waldo!
Nicely written intro into the subject, thanks.
One minor typo in the Acknowledgements section – missing “thank”.
I think, that the equal method is not right. You missed the case both pointers are not nullptr and point to the terminating character (‘\0’). Then they shall return true. Otherwise you will loose transitivity!. So here is the corrected version:
bool equal(iterator that) const
{
return str_
? (that.str_ == str_ ||
(!that.str_ && !*str_) ||
(that.str_ && !*str_ && !*that.str_))
: (!that.str_ || !*that.str_);
}
The table needs to be adjusted for this case also.