Danger of IEnumerables

IEnumerable and IEnumerable<T> are good things:

  • They allow returning a set of values with a minimal contract and behaviour promise. You may later change the underlying data structure to virtually anything, because nobody was able to use Your retval in a way You didn't mention. For example, if You returned a List instead, Your retval consumers could add or remove items and become coupled to that retval type. See the next point too.
  • They allow returning unchangeable “lists”. Did You ever hunt a bug where Your retval instance contained values which weren't created by Your method?
  • They may be lazy. You can't know how Your retval consumers want to use Your data. You may have a resource-hungry mapping process to run on 1000s of items, but the consumer may only want the First() item! (See the sketch after this list.)
  • LINQ. Just the whole thing uses IEnumerable's features and returns something of this type.
  • Etc. There could be a lot of things.
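
As a minimal sketch of the laziness point above (the Console.WriteLine is only there to make the mapping visible; MyClass is the class used in the test below): the mapping body only runs for the items the consumer actually pulls.

var mapped = Enumerable.Range(0, 100000)
    .Select(i => { Console.WriteLine("mapping " + i); return new MyClass(i); });

// The consumer asks for a single item, so the mapping body above runs exactly once,
// not 100000 times.
var first = mapped.First();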

So I tend to use these as a retval everywhere the only thing I want is to return multiple instances of something.

Now the 50-cent question: will this test run green?


[TestMethod]
public void MyTestMethod()
{
    IEnumerable<MyClass> result = GetMyClassList();
    Assert.AreSame(result.First(), result.First());
}

Yes? Are You sure? Sure! The first item of an IEnumerable will always be the same!
Or not?
Let's see GetMyClassList's implementation:


public IEnumerable<MyClass> GetMyClassList()
{
    IEnumerable<MyClass> ret = new MyClass[] { new MyClass(1) };
    return ret;
}

Yes, with this implementation the test turns green.
But how about this:


public IEnumerable<MyClass> GetMyClassList()
{
    IEnumerable<MyClass> ret = null;
 
    var someSourceList = new int[] { 1 };
    ret = someSourceList.Select(i => new MyClass(i));
 
    return ret;
}

Now the test goes red!

Why?

Because IEnumerable promises only sequential access to items.

To the items it contains.

In the first case these are items of a fixed array.

But in the second case the items are the values returned by a LINQ projection, which contains a mapping body.

When I call First() twice, each call gets a fresh (or Reset()) enumerator and the evaluation starts over. So new MyClass(i) runs again and again, resulting in different instances and a failing test. And the resource-friendly lazy evaluation may quickly turn really expensive too…

There is nothing new in the above, but in my head the parts of the explanation didn't connect to each other before.

But wait a minute! Does this mean that when I use an IEnumerable I should know how it was created?!?! That would break the basics of OOP!

No, I shouldn't know anything about it, just remember: IEnumerable only promises sequential access to items!

When I consume it in a way that doesn't require Reset() or enumerator recreation, I need no extra steps:


var firstItem = result.First();

But when the consumption pattern results in multiple enumerations, I should “fix” its items via explicit enumeration, for example:


var fixedResult = result.ToArray();
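
For example, a materialized variant of the test above stays green with both implementations (a sketch; the test name is mine, the rest reuses the code above):

[TestMethod]
public void MyTestMethod_Fixed()
{
    // Materialize once: the projection runs a single time,
    // so both First() calls see the same instances.
    MyClass[] fixedResult = GetMyClassList().ToArray();
    Assert.AreSame(fixedResult.First(), fixedResult.First());
}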

That allows You to use IEnumerable the way it was designed and saves some ugly moments of Your valuable life. 🙂

LINQ to Object: Skip() performance

Assume You have a List of 5000000 items. You want to do some paging, so You need e.g. 100 items from offset 300000. Check these two implementations:

        public List<T> GetItems1<T>(List<T> source, int offset, int count)
        {
            List<T> ret = new List<T>();

            for (int i = offset; i < offset + count; i++)
            {
                ret.Add(source[i]);
            }

            return ret;
        }
        public List<T> GetItems2<T>(List<T> source, int offset, int count)
        {
            List<T> ret = source.Skip(offset).Take(count).ToList();

            return ret;
        }

What do You think, which performs better? You may say the second one, of course. The first one indexes the source list and calls the Add() method count times. The second simply enumerates once up to offset, then returns count items as a new List, with the possibility of optimizing the internal item additions. At least that was what I thought.

But I was wrong!

The second implementation is always slower. The magnitude depends on the offset value, but it is always slower!

offset      GetItems1 (avg ticks)   GetItems2 (avg ticks)
0                      43                      65
10000                  59                     729
100000                 44                    5162
1000000                42                   52057
3000000                44                  147608

The reason lies in the implementation details of List<T> and Skip(). The first one knows where to find the nth item; the second one has to enumerate up to it. The conclusion, as one of my colleagues pointed out: use the most specialized tool You can.
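
In that spirit, here is a sketch of an even more specialized variant (GetItems3 is just my name for it), assuming the source really is a List<T>: GetRange copies the requested slice directly instead of enumerating from the start.

        public List<T> GetItems3<T>(List<T> source, int offset, int count)
        {
            // GetRange knows the backing array, so it copies count items
            // starting at offset without walking the first offset elements.
            return source.GetRange(offset, count);
        }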

The code I used for the results above:

    [TestClass]
    public class Test
    {
        private int cycles = 100;
        private int offset = 0;
        private int count = 100;

        [TestMethod]
        public void perftest1()
        {
            var l = GetTestData();

            var sw = new Stopwatch();

            double r = 0;
            for (int i = 0; i < cycles; i++)
            {
                sw.Reset();
                sw.Start();
                var result = GetItems1(l, offset, count);
                sw.Stop();
                r += sw.ElapsedTicks;
            }

            // Fail on purpose so the averaged tick count shows up in the test output.
            Assert.Fail((r / cycles).ToString());
        }


        [TestMethod]
        public void perftest2()
        {
            var l = GetTestData();

            var sw = new Stopwatch();

            double r = 0;
            for (int i = 0; i < cycles; i++)
            {
                sw.Reset();
                sw.Start();
                var result = GetItems2(l, offset, count);
                sw.Stop();
                r += sw.ElapsedTicks;
            }

            Assert.Fail((r / cycles).ToString());
        }


        public List<T> GetItems1<T>(List<T> source, int offset, int count)
        {
            ...
        }


        public List<T> GetItems2<T>(List<T> source, int offset, int count)
        {
            ...
        }

        private List<int> GetTestData()
        {
            List<int> ret = new List<int>();

            for (int i = 0; i < 5000000; i++)
            {
                ret.Add(i);
            }

            return ret;
        }
    }

DateTime.Now vs DateTime.UtcNow

A lot of times we used DateTime.Now for logging purposes. Once we got a task to find and eliminate performance bottlenecks in a data distributor project. As the simplest measurement we put some logging around the code blocks in which we suspected the bottlenecks. The log showed us that a lot of time was spent in a method which was really simple, and we didn't understand why it was so slow.

After switching to a professional performance profiler (RedGate's ANTS Performance Profiler is the best one! 🙂 ) we found that it was not the method that was slow but the logging! More precisely, the DateTime.Now calls made while writing the timestamp on the log lines!

After some googling we found that DateTime.Now, after determining the UTC time, asks the OS for the time zone set on the machine and computes the local time from it. And the determination of the time zone from the OS was the real bottleneck.

So in high-performance solutions use DateTime.UtcNow instead of DateTime.Now if You don't want to run into things like this.
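
A minimal sketch of how You may measure the difference Yourself with Stopwatch (the numbers vary by machine and OS, so treat it as an illustration, not a benchmark):

var sw = Stopwatch.StartNew();
for (int i = 0; i < 1000000; i++) { var now = DateTime.Now; }       // includes the time zone lookup
sw.Stop();
Console.WriteLine("DateTime.Now:    " + sw.ElapsedMilliseconds + " ms");

sw.Restart();
for (int i = 0; i < 1000000; i++) { var utcNow = DateTime.UtcNow; }  // UTC only, no conversion
sw.Stop();
Console.WriteLine("DateTime.UtcNow: " + sw.ElapsedMilliseconds + " ms");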

for vs. foreach

If You write a loop to visit every item in a collection, You should use foreach instead of for.
So instead of:

            for (int i = 0; i < list.Length; i++)
            {
                list[i].DoSomething();
            }

write this:

            foreach (var item in list)
            {
                item.DoSomething();
            }

In the first case the compiler may not be able to determine what You want to do with the variable i.
It won't remember the position in the list so that it can simply step to the next item in the next iteration.
Instead, the code indexes the list on every iteration, for every position from the beginning up to the last element plus one.
foreach avoids this.
There are exceptions, like real arrays, where indexing is a CPU-level instruction and will be faster than instantiating an enumerator,
but remember my Think Runtime Principle.
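
As a rough sketch, the compiler expands the foreach above roughly like this (assuming list is a List of items; for real arrays the compiler emits plain index-based code instead):

            using (var enumerator = list.GetEnumerator())
            {
                while (enumerator.MoveNext())      // the enumerator remembers its position
                {
                    var item = enumerator.Current; // no re-walking from the beginning
                    item.DoSomething();
                }
            }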

My Think Runtime Principle

I realized that while working I always take care of a thought which now has a name: the Think Runtime Principle.

While programming the implementation details of Your current task You should keep a performance-guard thread running in Your mind. You should always think about how Your code will be executed. It will accomplish what is in Your specs, but will it be optimal?

Nowadays we have high-level programming languages, our code is compiled by clever compilers and executed by clever runtimes on fast computers, so we don't really need to think about the CPU, memory exhaustion, etc. But some years ago programmers used punch cards, did you know? Processor speeds were measured in MHz and not GHz. And a man who is very rich now said that 640 KB should be enough for everything… Programmers of those ages had the ability to write programs which fulfilled their specs and were optimal with respect to the hardware too.

I understand that today it isn't so important to think about the hardware. But I am not talking about the hardware at all. I am talking about the philosophy of thinking about the runtime environment of Your product. If You don't think about it, You lose (or never grow) a skill. A skill which has a place in the pantheon of skills that help us design and write good code. Without that skill there will be a growing distance between You and the device You write for. You may never have used in real life some of the maths You learned in elementary school, but it helped You organize Your thoughts, learn algorithmic thinking, etc.

What kind of music will be born from a composer who doesn't really know how a note will sound on a piano or a violin? My favorite church has a chime. The organist realized that a lot of songs simply cannot be played on a chime, because the bells keep sounding for seconds after he plays them, and that sound may not be harmonic with the notes he plays later. He must take care of the instrument he works on.

Caution! I didn't say You should write code which somehow relies on implementation details of the underlying classes, runtime or hardware! That would break OOP rules in a wider sense. Do You remember Your favorite PC games which became unplayable soon after You changed Your PC to a faster one?

I said You should write code which relies appropriately on the contracts of the underlying things!

You should be not only a programmer but an engineer!


Reuse intermediate results

Everybody knows that a good programmer writes reusable code.
If somebody writes the same instruction sequence twice, she/he extracts it into a method and the code becomes reusable.
But how about data? Reuse it too!

            for (int i = 0; i < array.Length - 1; i++)
            {
                if (array[i] > array[i + 1])
                {
                    sum = array[i] + array[i + 1];
                }
            }

Depending on the array's content, You will index the array four times in every iteration instead of the two that are necessary.
Indexing an array is a CPU instruction, very fast, but imagine if it were a List and not an array!
A list implementation may store its items as linked nodes, and then reaching the item at a specific index takes a lot of dereferencing steps.
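
A sketch of the same loop with the indexed values cached in locals (the variable names are mine):

            for (int i = 0; i < array.Length - 1; i++)
            {
                // Read each element once and reuse the local copies.
                var current = array[i];
                var next = array[i + 1];

                if (current > next)
                {
                    sum = current + next;
                }
            }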

Go further:

                if (MyObject1.AnObjectProperty.ItsSubProperty.Value > MyObject2.AnObjectProperty.ItsSubProperty.Value)
                {
                    sum = MyObject1.AnObjectProperty.ItsSubProperty.Value + MyObject2.AnObjectProperty.ItsSubProperty.Value;
                }

How many times is the object graph walked down to the Value property? What happens if the properties along the path are calculated ones and not just field wrappers?

Go further:

            for (int i = 0; i < array.Length - 1; i++)
            {
                if (CalculateValue(i) > CalculateValue(i+1))
                {
                    sum = CalculateValue(i) + CalculateValue(i + 1);
                }
            }

We are calculating values twice!
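
Again, a sketch with the results reused (note that the CalculateValue(i + 1) of one iteration could even be carried over as the CalculateValue(i) of the next):

            for (int i = 0; i < array.Length - 1; i++)
            {
                // Call the possibly expensive calculation once per value.
                var left = CalculateValue(i);
                var right = CalculateValue(i + 1);

                if (left > right)
                {
                    sum = left + right;
                }
            }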

Think about how Your code will be executed, not only what You should accomplish!
See my Think Runtime Principle.

“” vs. String.Empty and “a”+str+”b” vs String.Concat(“a”, str, “b”) reloaded

Although the compiler has been optimizing the constructs above for a long time, I suggest You not use the first member of each pair.

The first pair generates a lot of forum talk about which is better in terms of memory. But there is another reason I favour String.Empty: You can find all its usages much more easily with FindAllReferences, instead of searching for every “” in Your code 🙂

For the second pair the two sides of the discussion are readability vs. performance. The left side is much more readable, but in ancient days, because of the variable content, it meant a double memory allocation: first to build “a”+str and then the tempresult+”b” string. Nowadays the compiler optimizes it into the call on the right side of the pair, which is much better: Concat allocates, in one step, the memory that is enough to build the whole result string.

That is true up to four parameters. If You have more params, Concat will use the StringBuilder class internally, which works like Concat: it allocates some space first and uses it to build the string. The difference is: Concat knows from its params how much space the result requires, but StringBuilder doesn't. So it allocates a buffer and, if that is exhausted, allocates a new, bigger one, etc. If You check the IEnumerable<string> overload of Concat, You will see that although it could know it from its parameters, it doesn't calculate the required buffer size for the StringBuilder constructor to allocate the appropriate buffer at once. So if You have more than four strings to concat, You may write better code than using the built-in feature of Concat 🙂
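
As a sketch of that last idea (ConcatMany is a hypothetical helper name, not a framework method): summing the lengths first lets the StringBuilder allocate the right buffer in one go.

public static string ConcatMany(params string[] parts)
{
    // Compute the final length up front...
    int totalLength = 0;
    foreach (var part in parts)
    {
        totalLength += part.Length;
    }

    // ...so the builder never has to grow its buffer while appending.
    var sb = new StringBuilder(totalLength);
    foreach (var part in parts)
    {
        sb.Append(part);
    }

    return sb.ToString();
}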

The reason for this post is very theoretical: the Think Runtime Principle.