Ranges in C++: Counted Iterables and Efficiency

I’ve been hard at work fleshing out my range library and writing a proposal to get range support into the standard. That proposal describes a foundational range concept: Iterable. An Iterable is anything we can pass to std::begin() and std::end() to get an Iterator/Sentinel pair. Sentinels, as I described here earlier this year, make it possible for the Iterable concept to efficiently describe other kinds of ranges besides iterator pairs.

The three types of ranges that we would like the Iterable concept to be able to efficiently model are:

  1. Two iterators
  2. An iterator and a predicate
  3. An iterator and a count

The Iterator/Sentinel abstraction is what makes it possible for the algorithms to handle these three cases with uniform syntax. However, as Sean Parent pointed out here, the third option presents challenges when trying to make some algorithms optimally efficient. Back in February, when Sean offered his criticism, I promised to follow up with a blog post that justified the design. This is that post.

Note 1: I’ve changed terminology since the February posts. In those posts, Iterable represented a range where the begin and end have different types, and Range is an Iterable where they’re the same. In my current proposal, Iterable is more or less as it was before, but Range is now an Iterable that doesn’t own its elements.

Note 2: This post uses the syntax of Concepts Lite, which has not been adopted yet. Everything in this post is implementable in C++11 using my library for Concepts Lite emulation, which I describe here.

Counted Ranges

Counted ranges, formed by specifying a position and a count of elements, have iterators — as all Iterables do. The iterators of a counted range must know the range’s extent and how close they are to reaching it. Therefore, the counted range’s iterators must store both an iterator into the underlying sequence and a count — either a count to the end or a count from the front. Here is one potential design:

class counted_sentinel
{};

template<WeakIterator I>
class counted_iterator
{
    I it_;
    DistanceType<I> n_; // distance to end
public:
    // ... constructors...
    using iterator_category =
        typename iterator_traits<I>::iterator_category;
    decltype(auto) operator*() const
    {
        return *it_;
    }
    counted_iterator & operator++()
    {
        ++it_;
        --n_;
        return *this;
    }
    friend bool operator==(counted_iterator const & it,
                           counted_sentinel)
    {
        return it.n_ == 0;
    }
    // ... other operators...
};

template<WeakIterator I>
class counted_range
{
    I begin_;
    DistanceType<I> count_;
public:
    // ... constructors ...
    counted_iterator<I> begin() const
    {
        return {begin_, count_};
    }
    counted_sentinel end() const
    {
        return {};
    }
};

There are some noteworthy things about the code above. First, counted_iterator bundles an iterator and a count. Right off, we see that copying counted iterators is going to be more expensive, and iterators are copied frequently. A mitigating factor is that the sentinel is empty. Passing a counted_iterator and a counted_sentinel to an algorithm copies as much data as passing an iterator and a count. When passed separately, the compiler probably has an easier time fitting them in registers, but some modern compilers are capable of passing the members of a struct in registers. This compiler optimization is sometimes called Scalar Replacement of Aggregates and is known to be implemented in gcc and LLVM (see this recent LLVM commit for example).

Also, incrementing a counted iterator is expensive: it involves incrementing the underlying iterator and decrementing the internal count. To see why this is potentially expensive, consider the trivial case of passing a counted_iterator<list<int>::iterator> to advance. That counted iterator type is bidirectional, and advance must increment it n times:

template<BidirectionalIterator I>
void advance(I & i, DistanceType<I> n)
{
    if(n >= 0)
        for(; n != 0; --n)
            ++i;
    else
        for(; n != 0; ++n)
            --i;
}

Notice that for each ++i or --i here, two increments or decrements are happening when I is a counted_iterator. This is sub-optimal. A better implementation for counted_iterator is:

template<BidirectionalIterator I>
void advance(counted_iterator<I> & i, DistanceType<I> n)
{
    i.n_ -= n;
    advance(i.it_, n);
}

This has a noticeable effect on the generated code. As it turns out, advance is one of the relatively few places in the standard library where special handling of counted_iterator is advantageous. Let’s examine some algorithms to see why that’s the case.

Single-Pass Algorithms with Counted Iterators

First, let’s look at a simple algorithm like for_each that makes exactly one pass through its input sequence:

template<InputIterator I, Regular S,
         Function<ValueType<I>> F>
    requires EqualityComparable<I, S>
I for_each(I first, S last, F f)
{
    for(; first != last; ++first)
        f(*first);
    return first;
}

When passed counted iterators, at each iteration of the loop we do an increment of the underlying iterator, a decrement of the stored count, and a comparison against the sentinel. Let’s compare this against a hypothetical for_each_n algorithm that takes the underlying iterator and the count separately. It might look like this:

template<InputIterator I, Function<ValueType<I>> F>
I for_each_n(I first, DifferenceType<I> n, F f)
{
    for(; n != 0; ++first, --n)
        f(*first);
    return first;
}

For the hypothetical for_each_n, at each loop iteration, we do an increment, a decrement, and a comparison. That’s exactly as many operations as for_each does when passed counted iterators. So a separate for_each_n algorithm is probably unnecessary if we have sentinels and counted_iterators. This is true for any algorithm that makes only one pass through the input range. That turns out to be a lot of algorithms.

Multi-Pass Algorithms with Counted Iterators

There are other algorithms that make more than one pass over the input sequence. Most of those, however, use advance when they need to move iterators by more than one hop. Once we have specialized advance for counted_iterator, those algorithms that use advance get faster without any extra work.

Consider partition_point. Here is one example implementation, taken from libc++ and ported to Concepts Lite and sentinels:

template<ForwardIterator I, Regular S,
         Predicate<ValueType<I>> P>
    requires EqualityComparable<I, S>
I partition_point(I first, S last, P pred)
{
    DifferenceType<I> len = distance(first, last);
    while (len != 0)
    {
        DifferenceType<I> l2 = len / 2;
        I m = first;
        advance(m, l2);
        if (pred(*m))
        {
            first = ++m;
            len -= l2 + 1;
        }
        else
            len = l2;
    }
    return first;
}

Imagine that I is a forward counted_iterator and S is a counted_sentinel. If the library does not specialize advance, this is certainly inefficient. Every time advance is called, needless work is being done. Compare it to a hypothetical partition_point_n:

template<ForwardIterator I, Predicate<ValueType<I>> P>
I partition_point_n(I first, DifferenceType<I> len, P pred)
{
    while (len != 0)
    {
        DifferenceType<I> l2 = len / 2;
        I m = first;
        advance(m, l2);
        if (pred(*m))
        {
            first = ++m;
            len -= l2 + 1;
        }
        else
            len = l2;
    }
    return first;
}

The first thing we notice is that partition_point_n doesn’t need to call distance! The more subtle thing to note is that calling partition_point_n with a raw iterator and a count saves O(N) integer decrements over the equivalent call to partition_point with counted_iterators … unless, of course, we have specialized the advance algorithm as shown above. Once we have, we trade the O(N) integer decrements for O(log N) integer subtractions. That’s a big improvement.

But what about the O(N) call to distance? Actually, that’s easy, and it’s the reason why I introduced a concept called SizedIteratorRange. counted_iterator stores the distance to the end. So the distance between a counted_iterator and a counted_sentinel (or between two counted_iterators) is known in O(1) regardless of the iterator’s category. The SizedIteratorRange concept tests whether an iterator I and a sentinel S can be subtracted to get the distance. This concept is modeled by random-access iterators by their nature, but also by counted iterators and their sentinels. The distance algorithm is specialized for SizedIteratorRange, so it is O(1) for counted iterators.

With these changes, we see that partition_point with counted iterators is very nearly as efficient as a hypothetical partition_point_n would be, and we had to make no special accommodations. Why can’t we make partition_point exactly as efficient as partition_point_n? When partition_point is called with a counted iterator, it also returns a counted iterator. Counted iterators contain two datums: the position and distance to the end. But when partition_point_n returns just the position, it is actually computing and returning less information. Sometimes users don’t need the extra information. But sometimes, after calling partition_point_n, the user might want to pass the resulting iterator to another algorithm. If that algorithm calls distance (like partition_point and other algorithms do), then it will be O(N). With counted iterators, however, it’s O(1). So in the case of partition_point, counted iterators cause the algorithm to do O(log N) extra work, but it sometimes saves O(N) work later.

To see an example, imagine a trivial insertion_sort algorithm:

template<ForwardIterator I, Regular S>
    requires EqualityComparable<I, S> &&
             Sortable<I> // from N3351
void insertion_sort(I begin, S end)
{
    for(auto it = begin; it != end; ++it)
    {
        auto insertion = upper_bound(begin, it, *it);
        rotate(insertion, it, next(it));
    }
}

Imagine that I is a counted_iterator. The first thing upper_bound does is call distance. Making distance O(1) for counted_iterators saves N calls of an O(N) algorithm. To get comparable performance for an equivalent procedure in today’s STL, users would have to write a separate insertion_sort_n algorithm that dispatches to an upper_bound_n algorithm — that they would also need to write themselves.

Counted Algorithms with Counted Iterators

We’ve seen that regular algorithms with counted iterators can be made nearly as efficient as dedicated counted algorithms, and that sometimes we are more than compensated for the small performance loss. All is not roses, however. There are a number of counted algorithms in the standard (the algorithms whose names end with _n). Consider copy_n:

template<WeakInputIterator I,
         WeakOutputIterator<ValueType<I>> O>
pair<I, O> copy_n(I in, DifferenceType<I> n, O out)
{
    for(; n != 0; ++in, ++out, --n)
        *out = *in;
    return {in, out};
}

(We’ve changed the return type of copy_n so as not to lose information.) If I is a counted iterator, then for every ++in, an increment and a decrement are happening, and in this case the extra decrement is totally unnecessary. For any counted (i.e., _n) algorithm, something special needs to be done to keep the performance from degrading when passed counted iterators.

The algorithm author has two options here, and neither of them is ideal.

Option 1: Overload the algorithm

The following is an optimized version of copy_n for counted iterators:

template<WeakInputIterator I,
         WeakOutputIterator<ValueType<I>> O>
pair<counted_iterator<I>, O> copy_n(counted_iterator<I> in,
                                    DifferenceType<I> n, O out)
{
    for(auto m = in.n_ - n; in.n_ != m;
            ++in.it_, --in.n_, ++out)
        *out = *in;
    return {in, out};
}

Obviously, creating an overload for counted iterators is unsatisfying.

Option 2: Separate the iterator from the count

This option shows how a library implementer can write just one version of copy_n that is automatically optimized for counted iterators. First, we need to provide two utility functions for unpacking and repacking counted iterators:

template<WeakIterator I>
I uncounted(I i)
{
    return i;
}

template<WeakIterator I>
I uncounted(counted_iterator<I> i)
{
    return i.it_;
}

template<WeakIterator I>
I recounted(I const &, I i, DifferenceType<I>)
{
    return i;
}

template<WeakIterator I>
counted_iterator<I> recounted(counted_iterator<I> const & j, I i,
                              DifferenceType<I> n)
{
    return {i, j.n_ - n};
}

With the help of uncounted and recounted, we can write an optimized copy_n just once:

template<WeakInputIterator I,
         WeakOutputIterator<ValueType<I>> O>
pair<I, O> copy_n(I in_, DifferenceType<I> n_, O out)
{
    auto in = uncounted(in_);
    for(auto n = n_; n != 0; ++in, --n, ++out)
        *out = *in;
    return {recounted(in_, in, n_), out};
}

This version works optimally for both counted and non-counted iterators. It is not a thing of beauty, however. It’s slightly annoying to have to do the uncounted/recounted dance, but it’s mostly needed only in the counted algorithms.

As a final note, the overload of advance for counted iterators can be eliminated with the help of uncounted and recounted. After all, advance is a counted algorithm.

Benchmark: Insertion Sort

To test how expensive counted ranges and counted iterators are, we wrote a benchmark. The benchmark pits counted ranges against a dedicated _n algorithm for Insertion Sort. The program is listed in this gist.

The program implements both insertion_sort_n, a dedicated counted algorithm, and insertion_sort, a general algorithm that accepts any Iterable, to which we pass a counted range. The latter is implemented in terms of the general-purpose upper_bound as provided by the Range v3 library, whereas the former requires a dedicated upper_bound_n algorithm, which is also provided.

The test is run both with raw pointers (hence, random-access) and with an iterator wrapper that only models ForwardIterator. Each test is run three times, and the resulting times are averaged. The test was compiled with g++ version 4.9.0 with -O3 -std=gnu++11 -DNDEBUG and run on a Linux machine. The results are reported below, for N == 30,000:

               insertion_sort_n   insertion_sort
random-access  2.692 s            2.703 s
forward        23.853 s           23.817 s

The performance difference, if there is any, is lost in the noise. At least in this case, with this compiler, on this hardware, there is no performance justification for a dedicated _n algorithm.

Summary

In short, counted iterators are not a perfect abstraction. There is some precedent here. The iterators for deque, and for any segmented data structure, are known to be inefficient (see Segmented Iterators and Hierarchical Algorithms, Austern 1998). The fix for that problem, new iterator abstractions and separate hierarchical algorithm implementations, is invasive and is not attempted in any STL implementation I am aware of. In comparison, the extra complications that come with counted iterators seem quite small. For segmented iterators, the upside was the simplicity and uniformity of the Iterator abstraction. In the case of counted ranges and iterators, the upside is the simplicity and uniformity of the Iterable concept. Algorithms need only one form, not separate bounded, counted, and sentinel forms. The benchmark gives me some reasonable assurance that we aren’t sacrificing too much performance for the sake of a unifying abstraction.

"\e"
"\e"

25 thoughts on “Ranges in C++: Counted Iterables and Efficiency”

  1. Great post! For the insertion_sort benchmark, could you show a third column where you run insertion_sort implemented with the Standard library’s std::upper_bound and std::rotate, and pass the same random access / forward iterator-pairs that you passed to your dedicated insertion_sort_n. Just to gauge how much of an impact either of the two counted specializations actually make over the current state of the art.

  2. Thanks for this series of articles, and for taking the time to make a proposal for standardization. Never been a fan of how STL approached enumeration, and this proposal looks like it would offer a much broader approach to implementing enumerations, and ranges. Looking forward to each installment, and to see how this evolves.

    • Looking forward to each installment, and to see how this evolves.

      I’ll be mining my proposal for bite-size chunks and turning them into blog posts like this one. If you want to skip to the chase, you can just read the whole proposal now and have a pretty good idea of where I’m heading. 🙂

  3. It might also be worth noting in your proposal that the Standard containers’ ranged insert() members benefit from counted iterators. E.g. inserting a subrange with known size from a std::list to a std::vector can now do without the internal call to std::distance.

    • That’s true, it will be more efficient that way.

      FYI, the code would still call std::distance, but std::distance will be specialized for SizedIterables to be O(1). std::list is a SizedIterable. (In case it’s not clear, I’ll be suggesting that std::distance be overloaded to take Iterator/Sentinel pairs (2 arguments) and Iterables (1 argument).)

      • Yes sorry, I meant O(N) std::distance reduces to O(1) std::distance. The main benefit is of course that existing container code probably does not need to be rewritten in order to benefit from your range proposal, and no insert_n() member functions need to be added. Although it would be nice if the Standard mandated somehow that container ranged-inserts have std::distance or the iterator tag as a customization point (maybe it already does, I couldn’t find it).

        • Although it would be nice if the Standard mandated somehow that container ranged-inserts have std::distance or the iterator tag as a customization point

          I’m not sure I follow. Try again?

  4. Hi Eric,

    Excellent work! Excellent articles! Thanks.

    I have a question about this paragraph:
    “When passed counted iterators, at each iteration of the loop, we do an increment, a decrement (for the underlying iterator and the count), and a comparison.”

    Why do you say that the underlying iterator is decremented?
    I don’t see it.

    Thanks again!
    Fernando.

  5. Hi Eric, love your work. I was playing around with istream_range from your library, and came across this (imo, undesirable) behavior:

    for (auto i : istream_range<string>(cin) | view::take(3))
        cout << i << '\n';
    cout << '\n';
    for (auto i : istream_range<string>(cin) | view::take(3))
        cout << i << '\n';

    The previous code prints the first three strings read from standard input, skips one, then prints the next 3. I think I understand why it does this, but I wonder if such behavior (the skipping) is unavoidable, in your opinion. Or maybe you don’t agree that the behavior is undesirable.

  6. Your range proposal has been very inspirational, and I have thought of ways to use these concepts for a variety of algorithms. However, I frequently have the situation where the same algorithm has to run either on an array of fixed length (size known at compile time) or of variable length. With that in mind, I wonder if you have any thoughts on what iterator/sentinel to use to represent ranges for containers of fixed size such as array<T, N> or T[N]; If these could be treated as counted ranges, it may be possible to generate even better code in the typical loops over all elements. Basically, the hope is that the offset value 0 in the begin iterator is easier for constant value propagation to track and reason about, and the sentinel can encode the fixed size in its type:

    template <typename T>
    struct base_offset_ptr
    {
        // act as (base_ptr + offset) but only change offset
        T * const base_ptr;
        size_t offset;
    };
    
    template <size_t N>
    struct counted_sentinel
    {
        template <typename T>
        friend bool
        operator!=(const base_offset_ptr<T> & iter,
                   const counted_sentinel<N> &)
        {
            return iter.offset != N;
        }
    };
    
    template <typename T, size_t N>
    base_offset_ptr<T>
    begin(std::array<T, N> & arr)
    {
        return base_offset_ptr<T>{&arr[0], 0};
    }
    
    template <typename T, size_t N>
    counted_sentinel<N>
    end(std::array<T, N> & arr)
    {
        return counted_sentinel<N>();
    }
    

    Any thoughts on whether this can work better than a pair of (begin, end) pointers, or are compilers already at a point where they could figure the size of the range at compile time?

    • It’s an interesting idea. Using offsets seems to make it easier for compilers to do aliasing analysis, which can improve code gen (automatic vectorization, for example). Building the size into the type of the sentinel might improve code gen further. If you try this, take a look at the generated assembly and report back what you find. It would be an interesting experiment.

      (I fixed up the code in your comment. Let me know if I’ve flubbed it.)

      • Hmm, so I tried this and at least with VS 2013, the code gen is worse because it doesn’t do that scalar replacement of aggregates that you mentioned LLVM will soon do – packing the base pointer and offset into a struct generates decidedly worse code; with raw pointers, VS was able to still prove that the number of iterations of a for_each – like loop was constant.

        The code with raw pointers uses up an extra register to count down to the number of loop iterations in for_each-like code, but it has a comparison to zero as its end condition; the sized_sentinel approach used a fancier addressing mode to eliminate a separate loop counter, but then has a cmp against an immediate as a loop terminating condition (raw pointers just need a dec on rcx). If the array is of types with sizes that are not multiples of 2^n, the sized sentinel code may end up even worse.

        All in all, with VS right now, the sized_sentinel adds nothing; the one benefit I found (and use) is that when you know the (maximum) size of a range at compile time, you can write algorithms that need to allocate temporary storage and have them use array<> for temporary storage of transformed ranges instead of always using something like vector<>; I basically have this function:

        template <typename Range, typename T>
        auto allocate_enough(Range r)
        {
            // if size(r) is known to be N at compile time, return array<T, N>;
            // otherwise return vector<T>(size(r))
        }

        • I’m disappointed to hear that VC doesn’t do scalar replacement of aggregates. That hurts all around, but it particularly hurts my method of supporting counted ranges. Does your approach work with a different compiler?

          Nice trick with allocate_enough. I might steal that.

  7. The source code for the insertion sort program in the linked gist still has the unique_ptr<int>{new int[n]} problem that is now fixed in the proposal.

  8. Thank you, very interesting read.

    What got me thinking is the equality for your counted_iterator which is defined for the sentinel case only. To clarify, I guess you mean to also provide a normal equality operator for practical applications? E.g.:

    bool operator== (counted_iterator const & it)
    {
        return this->n_ == it.n_ && this->it_ == it.it_;
    }

    Since this->it_ may not be dereferenced when this->n_ == 0, and its value thus does not matter, the operator could also treat all end iterators the same:

    bool operator== (counted_iterator const & it)
    {
        return this->n_ == it.n_ && (this->n_ == 0 || this->it_ == it.it_);
    }

    Which would allow an alternative range end implementation with the same type as range begin:

    template<WeakIterator I>
    class counted_range
    {
        // …
        counted_iterator<I> begin() const
        {
            return {begin_, count_};
        }
        counted_iterator<I> end() const
        {
            return {begin_, 0};
        }
    };
