I’ve been hard at work fleshing out my range library and writing a proposal to get range support into the standard. That proposal describes a foundational range concept: Iterable. An Iterable is anything we can pass to `std::begin()` and `std::end()` to get an Iterator/Sentinel pair. Sentinels, as I described here earlier this year, make it possible for the Iterable concept to efficiently describe other kinds of ranges besides iterator pairs.
The three types of ranges that we would like the Iterable concept to be able to efficiently model are:
- Two iterators
- An iterator and a predicate
- An iterator and a count
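To make the sentinel idea concrete, here is a minimal plain-C++ sketch (the names `null_sentinel` and `count_chars` are mine, not from the proposal) showing how one generic algorithm, written against an iterator/sentinel pair, can walk a range whose end is defined by a predicate, in this case the terminating null of a C string:

```cpp
#include <cassert>
#include <cstddef>

// A sentinel denoting "the position where the element is '\0'".
struct null_sentinel {};

inline bool operator==(const char* it, null_sentinel) { return *it == '\0'; }
inline bool operator!=(const char* it, null_sentinel) { return *it != '\0'; }

// One generic algorithm handles iterator/iterator, iterator/predicate,
// and (with a counted iterator) iterator/count ranges uniformly.
template<class I, class S>
std::size_t count_chars(I first, S last)
{
    std::size_t n = 0;
    for(; first != last; ++first)
        ++n;
    return n;
}
```

The same `count_chars` works unchanged when `S` is an ordinary end iterator; the sentinel only customizes the loop's termination test.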
The Iterator/Sentinel abstraction is what makes it possible for the algorithms to handle these three cases with uniform syntax. However, as Sean Parent pointed out here, the third option presents challenges when trying to make some algorithms optimally efficient. Back in February, when Sean offered his criticism, I promised to follow up with a blog post that justified the design. This is that post.
Note 1: I’ve changed terminology since the February posts. In those posts, Iterable represented a range where the `begin` and `end` have different types, and Range is an Iterable where they’re the same. In my current proposal, Iterable is more or less as it was before, but Range is now an Iterable that doesn’t own its elements.
Note 2: This post uses the syntax of Concepts Lite, which has not been adopted yet. Everything in this post is implementable in C++11 using my library for Concepts Lite emulation, which I describe here.
Counted Ranges
Counted ranges, formed by specifying a position and a count of elements, have iterators — as all Iterables do. The iterators of a counted range must know the range’s extent and how close they are to reaching it. Therefore, the counted range’s iterators must store both an iterator into the underlying sequence and a count — either a count to the end or a count from the front. Here is one potential design:
```cpp
class counted_sentinel {};

template<WeakIterator I>
class counted_iterator
{
    I it_;
    DistanceType<I> n_; // distance to end
public:
    // ... constructors ...
    using iterator_category = typename iterator_traits<I>::iterator_category;
    decltype(auto) operator*() const
    {
        return *it_;
    }
    counted_iterator & operator++()
    {
        ++it_; --n_;
        return *this;
    }
    friend bool operator==(counted_iterator const & it, counted_sentinel)
    {
        return it.n_ == 0;
    }
    // ... other operators ...
};

template<WeakIterator I>
class counted_range
{
    I begin_;
    DistanceType<I> count_;
public:
    // ... constructors ...
    counted_iterator<I> begin() const
    {
        return {begin_, count_};
    }
    counted_sentinel end() const
    {
        return {};
    }
};
```
There are some noteworthy things about the code above. First, `counted_iterator` bundles an iterator and a count. Right off, we see that copying counted iterators is going to be more expensive, and iterators are copied frequently. A mitigating factor is that the sentinel is empty. Passing a `counted_iterator` and a `counted_sentinel` to an algorithm copies as much data as passing an iterator and a count. When passed separately, the compiler probably has an easier time fitting them in registers, but some modern compilers are capable of passing the members of a struct in registers. This compiler optimization is sometimes called Scalar Replacement of Aggregates and is known to be implemented in gcc and LLVM (see this recent LLVM commit, for example).
Also, incrementing a counted iterator is expensive: it involves incrementing the underlying iterator and decrementing the internal count. To see why this is potentially expensive, consider the trivial case of passing a `counted_iterator<list<int>::iterator>` to `advance`. That counted iterator type is bidirectional, and `advance` must increment it n times:
```cpp
template<BidirectionalIterator I>
void advance(I & i, DistanceType<I> n)
{
    if(n >= 0)
        for(; n != 0; --n)
            ++i;
    else
        for(; n != 0; ++n)
            --i;
}
```
Notice that for each `++i` or `--i` here, two increments or decrements are happening when `I` is a `counted_iterator`. This is sub-optimal. A better implementation for `counted_iterator` is:
```cpp
template<BidirectionalIterator I>
void advance(counted_iterator<I> & i, DistanceType<I> n)
{
    i.n_ -= n;
    advance(i.it_, n);
}
```
This has a noticeable effect on the generated code. As it turns out, `advance` is one of the relatively few places in the standard library where special handling of `counted_iterator` is advantageous. Let’s examine some algorithms to see why that’s the case.
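As a sanity check, here is a stripped-down, compilable version of that specialization (the names `counted_iter` and `advance_counted` are mine, and the members are public for brevity). One subtraction on the count replaces n paired decrements:

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Minimal counted iterator: an underlying iterator plus distance to the end.
template<class I>
struct counted_iter
{
    I it_;
    typename std::iterator_traits<I>::difference_type n_;
};

// Specialized advance: adjust the count once, then move the underlying
// iterator, instead of decrementing the count at every step.
template<class I>
void advance_counted(counted_iter<I>& i,
                     typename std::iterator_traits<I>::difference_type n)
{
    i.n_ -= n;
    std::advance(i.it_, n);
}
```

For a bidirectional iterator like `std::list`'s, `std::advance` still takes n steps on the underlying iterator, but the bookkeeping on the count drops from n operations to one.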
Single-Pass Algorithms with Counted Iterators
First, let’s look at a simple algorithm like `for_each` that makes exactly one pass through its input sequence:
```cpp
template<InputIterator I, Regular S, Function<ValueType<I>> F>
  requires EqualityComparable<I, S>
I for_each(I first, S last, F f)
{
    for(; first != last; ++first)
        f(*first);
    return first;
}
```
When passed counted iterators, at each iteration of the loop, we do an increment, a decrement (for the underlying iterator and the count), and a comparison. Let’s compare this against a hypothetical `for_each_n` algorithm that takes the underlying iterator and the count separately. It might look like this:
```cpp
template<InputIterator I, Function<ValueType<I>> F>
I for_each_n(I first, DifferenceType<I> n, F f)
{
    for(; n != 0; ++first, --n)
        f(*first);
    return first;
}
```
For the hypothetical `for_each_n`, at each loop iteration, we do an increment, a decrement, and a comparison. That’s exactly as many operations as `for_each` does when passed counted iterators. So a separate `for_each_n` algorithm is probably unnecessary if we have sentinels and `counted_iterator`s. This is true for any algorithm that makes only one pass through the input range. That turns out to be a lot of algorithms.
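That equivalence is easy to check with a stripped-down counted iterator over a raw pointer (the names `counted_ptr`, `counted_end`, and `for_each_it` here are all mine): the generic sentinel-based loop visits exactly the first n elements, just as a dedicated `for_each_n` would:

```cpp
#include <cassert>
#include <cstddef>

// Minimal counted iterator/sentinel pair over a raw pointer.
struct counted_end {};

template<class T>
struct counted_ptr
{
    T* p_;
    std::ptrdiff_t n_; // distance to end
    T& operator*() const { return *p_; }
    counted_ptr& operator++() { ++p_; --n_; return *this; }
};

template<class T>
bool operator!=(counted_ptr<T> it, counted_end) { return it.n_ != 0; }

// The generic single-pass algorithm, written once against
// an iterator/sentinel pair.
template<class I, class S, class F>
I for_each_it(I first, S last, F f)
{
    for(; first != last; ++first)
        f(*first);
    return first;
}
```

Each loop iteration performs one pointer increment, one count decrement, and one comparison against zero: the same three operations a hand-written counted loop would do.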
Multi-Pass Algorithms with Counted Iterators
There are other algorithms that make more than one pass over the input sequence. Most of those, however, use `advance` when they need to move iterators by more than one hop. Once we have specialized `advance` for `counted_iterator`, those algorithms that use `advance` get faster without any extra work.
Consider `partition_point`. Here is one example implementation, taken from libc++ and ported to Concepts Lite and sentinels:
```cpp
template<ForwardIterator I, Regular S, Predicate<ValueType<I>> P>
  requires EqualityComparable<I, S>
I partition_point(I first, S last, P pred)
{
    DifferenceType<I> len = distance(first, last);
    while (len != 0)
    {
        DifferenceType<I> l2 = len / 2;
        I m = first;
        advance(m, l2);
        if (pred(*m))
        {
            first = ++m;
            len -= l2 + 1;
        }
        else
            len = l2;
    }
    return first;
}
```
Imagine that `I` is a forward `counted_iterator` and `S` is a `counted_sentinel`. If the library does not specialize `advance`, this is certainly inefficient. Every time `advance` is called, needless work is being done. Compare it to a hypothetical `partition_point_n`:
```cpp
template<ForwardIterator I, Predicate<ValueType<I>> P>
I partition_point_n(I first, DifferenceType<I> len, P pred)
{
    while (len != 0)
    {
        DifferenceType<I> l2 = len / 2;
        I m = first;
        advance(m, l2);
        if (pred(*m))
        {
            first = ++m;
            len -= l2 + 1;
        }
        else
            len = l2;
    }
    return first;
}
```
The first thing we notice is that `partition_point_n` doesn’t need to call `distance`! The more subtle thing to note is that calling `partition_point_n` with a raw iterator and a count saves about O(N) integer decrements over the equivalent call to `partition_point` with `counted_iterator`s … unless, of course, we have specialized the `advance` algorithm as shown above. Once we have, we trade the O(N) integer decrements for O(log N) integer subtractions. That’s a big improvement.
But what about the O(N) call to `distance`? Actually, that’s easy, and it’s the reason why I introduced a concept called SizedIteratorRange. `counted_iterator` stores the distance to the end. So the distance between a `counted_iterator` and a `counted_sentinel` (or between two `counted_iterator`s) is known in O(1) regardless of the iterator’s category. The SizedIteratorRange concept tests whether an iterator `I` and a sentinel `S` can be subtracted to get the distance. This concept is modeled by random-access iterators by their nature, but also by counted iterators and their sentinels. The `distance` algorithm is specialized for SizedIteratorRange, so it is O(1) for counted iterators.
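A compilable sketch of that idea (the types and names are mine, not the proposal's): subtracting a counted iterator from its sentinel, or from another counted iterator, is just integer arithmetic on the stored counts, whatever the underlying iterator's category:

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <iterator>

struct counted_end {};

template<class I>
struct counted_iter
{
    I it_;
    std::ptrdiff_t n_; // distance to end, carried along with the position
};

// SizedIteratorRange-style distance: O(1) even for forward iterators.
template<class I>
std::ptrdiff_t distance_sized(counted_iter<I> it, counted_end)
{
    return it.n_;
}

template<class I>
std::ptrdiff_t distance_sized(counted_iter<I> first, counted_iter<I> last)
{
    return first.n_ - last.n_;
}
```

Note that `std::forward_list`'s iterators are only forward iterators, so an ordinary `distance` over them is O(N); the counts make it a subtraction.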
With these changes, we see that `partition_point` with counted iterators is very nearly as efficient as a hypothetical `partition_point_n` would be, and we had to make no special accommodations. Why can’t we make `partition_point` exactly as efficient as `partition_point_n`? When `partition_point` is called with a counted iterator, it also returns a counted iterator. Counted iterators contain two datums: the position and the distance to the end. But when `partition_point_n` returns just the position, it is actually computing and returning less information. Sometimes users don’t need the extra information. But sometimes, after calling `partition_point_n`, the user might want to pass the resulting iterator to another algorithm. If that algorithm calls `distance` (like `partition_point` and other algorithms do), then it will be O(N). With counted iterators, however, it’s O(1). So in the case of `partition_point`, counted iterators cause the algorithm to do O(log N) extra work, but they sometimes save O(N) work later.
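The "saves O(N) work later" point can be seen in a small compilable sketch (all names here are mine): a search that returns a counted iterator hands the caller the remaining length for free, where a plain iterator would force a later O(N) recount:

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>

template<class I>
struct counted_iter
{
    I it_;
    std::ptrdiff_t n_; // distance to end, maintained by the search below
};

// A find over a counted range that returns the position *and* the O(1)
// remaining length; a later algorithm needing the distance to the end
// (as upper_bound does via distance) pays nothing to get it.
template<class I, class T>
counted_iter<I> find_counted(I first, std::ptrdiff_t n, T const& value)
{
    for(; n != 0 && *first != value; ++first, --n)
        ;
    return {first, n};
}
```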
To see an example, imagine a trivial `insertion_sort` algorithm:
```cpp
template<ForwardIterator I, Regular S>
  requires EqualityComparable<I, S> && Sortable<I> // from N3351
void insertion_sort(I begin, S end)
{
    for(auto it = begin; it != end; ++it)
    {
        auto insertion = upper_bound(begin, it, *it);
        rotate(insertion, it, next(it));
    }
}
```
Imagine that `I` is a `counted_iterator`. The first thing `upper_bound` does is call `distance`. Making `distance` O(1) for `counted_iterator`s saves N calls of an O(N) algorithm. To get comparable performance for an equivalent procedure in today’s STL, users would have to write a separate `insertion_sort_n` algorithm that dispatches to an `upper_bound_n` algorithm, which they would also need to write themselves.
Counted Algorithms with Counted Iterators
We’ve seen that regular algorithms with counted iterators can be made nearly as efficient as dedicated counted algorithms, and that sometimes we are more than compensated for the small performance loss. All is not roses, however. There are a number of counted algorithms in the standard (the algorithms whose names end with `_n`). Consider `copy_n`:
```cpp
template<WeakInputIterator I, WeakOutputIterator<ValueType<I>> O>
pair<I, O> copy_n(I in, DifferenceType<I> n, O out)
{
    for(; n != 0; ++in, ++out, --n)
        *out = *in;
    return {in, out};
}
```
(We’ve changed the return type of `copy_n` so as not to lose information.) If `I` is a counted iterator, then for every `++in`, an increment and a decrement are happening, and in this case the extra decrement is totally unnecessary. For any counted (i.e., `_n`) algorithm, something special needs to be done to keep the performance from degrading when passed counted iterators.
The algorithm author has two options here, and neither of them is ideal.
Option 1: Overload the algorithm
The following is an optimized version of `copy_n` for counted iterators:
```cpp
template<WeakInputIterator I, WeakOutputIterator<ValueType<I>> O>
pair<counted_iterator<I>, O> copy_n(counted_iterator<I> in, DifferenceType<I> n, O out)
{
    for(auto m = in.n_ - n; in.n_ != m; ++in.it_, --in.n_, ++out)
        *out = *in;
    return {in, out};
}
```
Obviously, creating an overload for counted iterators is unsatisfying.
Option 2: Separate the iterator from the count
This option shows how a library implementer can write just one version of `copy_n` that is automatically optimized for counted iterators. First, we need to provide two utility functions for unpacking and repacking counted iterators:
```cpp
template<WeakIterator I>
I uncounted(I i)
{
    return i;
}

template<WeakIterator I>
I uncounted(counted_iterator<I> i)
{
    return i.it_;
}

template<WeakIterator I>
I recounted(I const &, I i, DifferenceType<I>)
{
    return i;
}

template<WeakIterator I>
counted_iterator<I> recounted(counted_iterator<I> const & j, I i, DifferenceType<I> n)
{
    return {i, j.n_ - n};
}
```
With the help of `uncounted` and `recounted`, we can write an optimized `copy_n` just once:
```cpp
template<WeakInputIterator I, WeakOutputIterator<ValueType<I>> O>
pair<I, O> copy_n(I in_, DifferenceType<I> n_, O out)
{
    auto in = uncounted(in_);
    for(auto n = n_; n != 0; ++in, --n, ++out)
        *out = *in;
    return {recounted(in_, in, n_), out};
}
```
This version works optimally for both counted and non-counted iterators. It is not a thing of beauty, however. It’s slightly annoying to have to do the `uncounted`/`recounted` dance, but it’s mostly needed only in the counted algorithms.
As a final note, the overload of `advance` for counted iterators can be eliminated with the help of `uncounted` and `recounted`. After all, `advance` is a counted algorithm.
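Here is a sketch of what that looks like in plain C++ (all names are mine): a single advance written with the uncounted/recounted pattern that is optimal for plain and counted iterators alike:

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>

template<class I>
struct counted_iter
{
    I it_;
    std::ptrdiff_t n_; // distance to end
};

// Unwrap: identity for plain iterators, extract the base for counted ones.
template<class I> I uncounted(I i) { return i; }
template<class I> I uncounted(counted_iter<I> i) { return i.it_; }

// Rewrap: identity for plain iterators; recompute the count for counted ones.
template<class I>
I recounted(I const&, I i, std::ptrdiff_t) { return i; }
template<class I>
counted_iter<I> recounted(counted_iter<I> const& j, I i, std::ptrdiff_t n)
{
    return {i, j.n_ - n};
}

// One advance, optimal for both: for counted iterators the count is
// adjusted once in recounted, not once per step.
template<class I>
void advance_any(I& i, std::ptrdiff_t n)
{
    auto u = uncounted(i);
    std::advance(u, n);
    i = recounted(i, u, n);
}
```

Overload resolution picks the `counted_iter` versions of `uncounted`/`recounted` when they apply, so the generic body never mentions counts at all.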
Benchmark: Insertion Sort
To test how expensive counted ranges and counted iterators are, we wrote a benchmark. The benchmark pits counted ranges against a dedicated `_n` algorithm for Insertion Sort. The program is listed in this gist.
The program implements both `insertion_sort_n`, a dedicated counted algorithm, and `insertion_sort`, a general algorithm that accepts any Iterable, to which we pass a counted range. The latter is implemented in terms of the general-purpose `upper_bound` as provided by the Range v3 library, whereas the former requires a dedicated `upper_bound_n` algorithm, which is also provided.
The test is run both with raw pointers (hence, random-access) and with an iterator wrapper that only models ForwardIterator. Each test is run three times, and the resulting times are averaged. The test was compiled with `g++` version 4.9.0 with `-O3 -std=gnu++11 -DNDEBUG` and run on a Linux machine. The results are reported below, for N == 30,000:
|   | `insertion_sort_n` | `insertion_sort` |
|---|---|---|
| random-access | 2.692 s | 2.703 s |
| forward | 23.853 s | 23.817 s |
The performance difference, if there is any, is lost in the noise. At least in this case, with this compiler, on this hardware, there is no performance justification for a dedicated `_n` algorithm.
Summary
In short, counted iterators are not a perfect abstraction. There is some precedent here. The iterators for `deque`, and for any segmented data structure, are known to be inefficient (see Segmented Iterators and Hierarchical Algorithms, Austern 1998). The fix for that problem, new iterator abstractions and separate hierarchical algorithm implementations, is invasive and is not attempted in any STL implementation I am aware of. In comparison, the extra complications that come with counted iterators seem quite small. For segmented iterators, the upside was the simplicity and uniformity of the Iterator abstraction. In the case of counted ranges and iterators, the upside is the simplicity and uniformity of the Iterable concept. Algorithms need only one form, not separate bounded, counted, and sentinel forms. The benchmark gives me some reasonable assurance that we aren’t sacrificing too much performance for the sake of a unifying abstraction.
Great post! For the `insertion_sort` benchmark, could you show a third column where you run `insertion_sort` implemented with the Standard library’s `std::upper_bound` and `std::rotate`, and pass the same random-access / forward iterator pairs that you passed to your dedicated `insertion_sort_n`? Just to gauge how much of an impact either of the two counted specializations actually makes over the current state of the art.

Thanks, Rein! I’ve been meaning to do that, but I’ve been super-busy with other parts of my proposal (which is due at the end of the week).
Hi Eric. Thanks for this post. Your work on ranges is about the most interesting thing that is currently going on in C++.
Thanks, Andrzej. I hope something comes of it.
I agree!
Thanks for this series of articles, and for taking the time to make a proposal for standardization. Never been a fan of how STL approached enumeration, and this proposal looks like it would offer a much broader approach to implementing enumerations, and ranges. Looking forward to each installment, and to see how this evolves.
I’ll be mining my proposal for bite-size chunks and turning them into blog posts like this one. If you want to skip to the chase, you can just read the whole proposal now and have a pretty good idea of where I’m heading. 🙂
It might also be worth noting in your proposal that the Standard containers’ ranged `insert()` members benefit from counted iterators. E.g. inserting a subrange with known size from a `std::list` to a `std::vector` can now do without the internal call to `std::distance`.

That’s true, it will be more efficient that way.
FYI, the code would still call `std::distance`, but `std::distance` will be specialized for SizedIterables to be O(1). `std::list` is a SizedIterable. (In case it’s not clear, I’ll be suggesting that `std::distance` be overloaded to take Iterator/Sentinel pairs (2 arguments) and Iterables (1 argument).)
O(N) std::distance
reduces toO(1) std::distance
. The main benefit is of course that existing container code probably does not need to be rewritten in order to benefit from your range proposal, and noinsert_n()
member functions need to be added. Although it would be nice if the Standard mandated somehow that container ranged-inserts havestd::distance
or the iterator tag as a customization point (maybe it already does, I couldn’t find it).I’m not sure I follow. Try again?
Hi Eric,

Excellent work! Excellent articles! Thanks.

I have a question about this paragraph:

“When passed counted iterators, at each iteration of the loop, we do an increment, a decrement (for the underlying iterator and the count), and a comparison.”

Why do you say that the underlying iterator is decremented? I don’t see it.

Thanks again!

Fernando.
Did you mean “dereference”?
I mean, every time you increment the counted iterator, the underlying iterator is incremented and the count is decremented. Sorry if I was unclear.
Perfect! Thanks!
Hi Eric, love your work. I was playing around with istream_range from your library, and came across this (imo, undesirable) behavior:
```cpp
for (auto i : istream_range<string>(cin) | view::take(3))
    cout << i << '\n';
cout << '\n';
for (auto i : istream_range<string>(cin) | view::take(3))
    cout << i << '\n';
```
The previous code prints the first three strings read from standard input, skips one, then prints the next 3. I think I understand why it does this, but I wonder if such behavior (the skipping) is unavoidable, in your opinion. Or maybe you don’t agree that the behavior is undesirable.
Thanks for the report, Benjamin. Can you file a bug at http://github.com/ericniebler/range-v3 ? I’ll look into it.
Okay, done. Would have done that originally, but wasn’t sure if it was actually a bug, rather than an unavoidable consequence of the single pass nature of an istream.
Your range proposal has been very inspirational, and I have thought of ways to use these concepts for a variety of algorithms. However, I frequently have the situation where the same algorithm has to run either on an array of fixed length (size known at compile time) or of variable length. With that in mind, I wonder if you have any thoughts on what iterator/sentinel to use to represent ranges for containers of fixed size such as `array<T, N>` or `T[N]`. If these could be treated as counted ranges, it may be possible to generate even better code in the typical loops over all elements. Basically, the hope is that the offset value 0 in the begin iterator is easier for constant value propagation to track and reason about, and the sentinel can encode the fixed size in its type. Any thoughts on whether this can work better than a pair of `(begin, end)` pointers, or are compilers already at a point where they could figure the size of the range at compile time?

It’s an interesting idea. Using offsets seems to make it easier for compilers to do aliasing analysis, which can improve code gen (automatic vectorization, for example). Building the size into the type of the sentinel might improve code gen further. If you try this, take a look at the generated assembly and report back what you find. It would be an interesting experiment.
(I fixed up the code in your comment. Let me know if I’ve flubbed it.)
Hmm, so I tried this, and at least with VS 2013 the code gen is worse because it doesn’t do the scalar replacement of aggregates that you mentioned LLVM will soon do. Packing the base pointer and offset into a struct generates decidedly worse code; with raw pointers, VS was still able to prove that the number of iterations of a for_each-like loop was constant.

The code with raw pointers uses up an extra register to count down the number of loop iterations in for_each-like code, but it has a comparison to zero as its end condition; the sized_sentinel approach used a fancier addressing mode to eliminate a separate loop counter, but then has a cmp against an immediate as the loop-terminating condition (raw pointers just need a dec on rcx). If the array is of types with sizes that are not multiples of 2^n, the sized-sentinel code may end up even worse.

All in all, with VS right now, the sized_sentinel adds nothing. The one benefit I found (and use) is that when you know the (maximum) size of a range at compile time, you can write algorithms that need to allocate temporary storage and have them use array<> for temporary storage of transformed ranges instead of always using something like vector<>. I basically have this function:
```cpp
template<typename Range, typename T>
auto allocate_enough(Range r)
{
    // if size(r) is known to be N at compile time, return array<T, N>
    // otherwise return vector<T>(size(r))
}
```
I’m disappointed to hear that VC doesn’t do scalar replacement of aggregates. That hurts all around, but it particularly hurts my method of supporting counted ranges. Does your approach work with a different compiler?

Nice trick with `allocate_enough`. I might steal that.

The source code for the insertion sort program in the linked gist still has the `unique_ptr<int>{new int[n]}` problem that is now fixed in the proposal.

Fixed, thanks!
Thank you, very interesting read.
What got me thinking is the equality for your counted_iterator, which is defined for the sentinel case only. To clarify, I guess you mean to also provide a normal equality operator for practical applications? E.g.:
```cpp
bool operator==(counted_iterator const & it)
{
    return this->n_ == it.n_ && this->it_ == it.it_;
}
```
Since this->it_ may not be dereferenced if this->n_==0 and thus not matter, the operator could also treat all end iterators the same:
```cpp
bool operator==(counted_iterator const & it)
{
    return this->n_ == it.n_ && (this->n_ == 0 || this->it_ == it.it_);
}
```
This would allow an alternative range-end implementation with the same type as the range begin:
```cpp
template<WeakIterator I>
class counted_range
{
    // ...
    counted_iterator<I> begin() const
    {
        return {begin_, count_};
    }
    counted_iterator<I> end() const
    {
        return {begin_, 0};
    }
};
```