Danger of IEnumerables

IEnumerable and IEnumerable<T> are good things:

  • They allow returning a set of values with a minimal contract and behaviour promise. You may later change the underlying data structure to virtually anything, because nobody was able to use your return value in a way you didn't intend. For example, if you returned a List instead, your consumers might add or remove items and become coupled to that return type. See the next point too.
  • They allow returning unchangeable “lists”. Have you ever hunted a bug where your returned instance contained values that weren't created by your method?
  • They may be lazy. You shouldn't have to know how your consumers want to use your data. You may have a resource-eating mapping process to run on thousands of items, but the consumer may only want the First() item!
  • LINQ. The whole thing builds on IEnumerable's features and returns values of this type.
  • Etc. There could be a lot of things.
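The laziness point can be sketched like this: the projection below is deferred, so First() pulls only a single item through the mapping (a minimal sketch, counting evaluations instead of doing real work):

```csharp
int evaluations = 0;

// Nothing runs yet; Select only builds a lazy pipeline.
IEnumerable<int> mapped = Enumerable.Range(1, 100000)
    .Select(i => { evaluations++; return i * 2; });

// First() evaluates exactly one item...
int first = mapped.First();

// ...so evaluations is 1 here, not 100000.
```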

So I tend to use these as the return value everywhere I want to return multiple instances of something.

Now the 50-cent question: will this test run green?


[TestMethod]
public void MyTestMethod()
{
    IEnumerable<MyClass> result = GetMyClassList();
    Assert.AreSame(result.First(), result.First());
}

Yes? Are you sure? Sure! The first item of an IEnumerable will always be the same!
Or not?
Let's see GetMyClassList's implementation:


public IEnumerable<MyClass> GetMyClassList()
{
    IEnumerable<MyClass> ret = new MyClass[] { new MyClass(1) };
    return ret;
}

Yes, with this implementation the test is green.
But how about this one:


public IEnumerable<MyClass> GetMyClassList()
{
    IEnumerable<MyClass> ret = null;
 
    var someSourceList = new int[] { 1 };
    ret = someSourceList.Select(i => new MyClass(i));
 
    return ret;
}

Now the test is red!

Why?

Because IEnumerable promises only sequential access to the items it contains.

In the first case these are the items of a fixed array.

But in the second case the items are the values returned by a LINQ projection, which contains a mapping body.

When I call First() twice, the IEnumerable's enumerator can only be recreated or Reset() and the evaluation starts over. So new MyClass(i) runs again and again, producing different instances and a failing test. And the resource-friendly lazy evaluation may quickly turn into something really bad too…

There is nothing new in the above, but in my head the parts of the explanation hadn't connected before.

But wait a minute! Does this mean that when I use an IEnumerable I should know how it was created?! That would break the basics of OOP!

No, I shouldn't know anything about it, just remember: IEnumerable only promises sequential access to items!

When I consume it in a way that doesn't require Reset() or enumerator recreation, I need no extra steps:


var firstItem = result.First();

But when the consumption method results in multiple enumerations, I should “fix” its items via explicit enumeration, for example:


var fixedResult = result.ToArray();

That allows you to use IEnumerable the way it was designed and saves some ugly moments of your valuable life. 🙂
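Both behaviors side by side, in a minimal sketch (MyClass here is just a trivial stand-in with an int constructor):

```csharp
var lazy = new[] { 1 }.Select(i => new MyClass(i));

// Every enumeration re-runs the projection, so the two
// First() calls create two different instances:
bool same = ReferenceEquals(lazy.First(), lazy.First());  // false

// After materializing, the items are "fixed"; further
// reads return the same instances:
var materialized = lazy.ToArray();
same = ReferenceEquals(materialized.First(), materialized.First());  // true

public class MyClass
{
    public MyClass(int i) { }
}
```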

Misleading message

What would you do if you got the following result after running a unit test?

Assert.AreEqual failed. Expected:<2016.04.27. 8:22:52>. Actual:<2016.04.27. 8:22:52>.

My head was full of abstractions waiting to be coded, but the above result put me into an
unexpected state. Everything was suspended and my eyes scanned the two values character by
character, repeatedly, trying to find the difference. Nothing!
A quick debug session revealed that the values differ in their milliseconds, which are not shown in the message.

But what a misleading message! The difference should be emphasized somehow,
because I lost my concentration, my flow, etc.
It is the same as when you cannot work quietly because somebody keeps coming over to ask
something. Anything. Applying the 8th point of the great article I found on Joel's blog years ago
to this situation: such messages break down productivity and should be avoided.
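One possible workaround is to pass an explicit message to the assert using the round-trip ("o") date format, which includes the fractional seconds (a sketch; the values below are made up):

```csharp
var expected = new DateTime(2016, 4, 27, 8, 22, 52, 123);
var actual   = new DateTime(2016, 4, 27, 8, 22, 52, 456);

// The "o" (round-trip) format prints the fractional seconds too,
// so the failure message shows exactly where the values differ:
Assert.AreEqual(expected, actual,
    "Expected {0:o} but was {1:o}", expected, actual);
```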

LINQ to Object: Skip() performance

Assume you have a List of 5000000 items. You want to do some paging, so you need e.g. 100 items from offset 300000. Check these two implementations:

        public List<T> GetItems1<T>(List<T> source, int offset, int count)
        {
            List<T> ret = new List<T>();

            for (int i = offset; i < offset + count; i++)
            {
                ret.Add(source[i]);
            }

            return ret;
        }

        public List<T> GetItems2<T>(List<T> source, int offset, int count)
        {
            List<T> ret = source.Skip(offset).Take(count).ToList();

            return ret;
        }

What do you think: which performs better? You may say the second one, of course. The first one indexes the source list and calls the Add() method count times. The second simply enumerates once up to offset, then returns count items as a new List, with the possibility of internal item-addition optimization. At least, that is what I thought.

But I was wrong!

The second implementation is always slower. The magnitude depends on the offset value, but it is always slower!

offset     GetItems1   GetItems2
0                 43          65
10000             59         729
100000            44        5162
1000000           42       52057
3000000           44      147608

(average elapsed ticks, 100 runs each)

The reason lies in the implementation details of List<T> and Enumerable.Skip(). The first one knows where to find the nth item; the second has to enumerate its way to it. The conclusion, as one of my colleagues pointed out: use the most specialized tool you can.
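For List<T> there is already a more specialized tool than Skip(): GetRange() copies the requested range directly instead of enumerating up to the offset. A sketch following the naming above:

```csharp
public List<T> GetItems3<T>(List<T> source, int offset, int count)
{
    // List<T>.GetRange copies count items starting at offset
    // in one step, without walking the first offset items.
    return source.GetRange(offset, count);
}
```

(Note: newer .NET runtimes special-case Skip()/Take() over lists, so the measured gap may be smaller there; the numbers above come from the runtime of that time.)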

The code I used for the results above:

    [TestClass]
    public class Test
    {
        private int cycles = 100;
        private int offset = 0;
        private int count = 100;

        [TestMethod]
        public void perftest1()
        {
            var l = GetTestData();

            var sw = new Stopwatch();

            double r = 0;
            for (int i = 0; i < cycles; i++)
            {
                sw.Reset();
                sw.Start();
                var result = GetItems1(l, offset, count);
                sw.Stop();
                r += sw.ElapsedTicks;
            }

            Assert.Fail((r / cycles).ToString());
        }


        [TestMethod]
        public void perftest2()
        {
            var l = GetTestData();

            var sw = new Stopwatch();

            double r = 0;
            for (int i = 0; i < cycles; i++)
            {
                sw.Reset();
                sw.Start();
                var result = GetItems2(l, offset, count);
                sw.Stop();
                r += sw.ElapsedTicks;
            }

            Assert.Fail((r / cycles).ToString());
        }


        public List<T> GetItems1<T>(List<T> source, int offset, int count)
        {
            ...
        }


        public List<T> GetItems2<T>(List<T> source, int offset, int count)
        {
            ...
        }

        private List<int> GetTestData()
        {
            List<int> ret = new List<int>();

            for (int i = 0; i < 5000000; i++)
            {
                ret.Add(i);
            }

            return ret;
        }
    }

Method implementation pattern

During the development of many projects I have tried to standardize the layout of methods, to help anybody distinguish their different parts.
Check the code below:

        public int SaveNewPartner(Partner partner, Operator modifier)
        {
            if (partner != null)
            {
                if (modifier != null)
                {
                    if (partner.ID == null)
                    {
                        if (!SanityCheck(partner))
                        {
                            throw new ApplicationException("partner failed sanity check");
                        }

                        partner.ModificationTime = DateTime.Now;
                        partner.Modifier = modifier;
                        return Save(partner);
                    }
                    else
                    {
                        throw new InvalidOperationException("Already saved!"); //LOCSTR
                    }
                }
                else
                {
                    throw new ArgumentNullException("modifier");
                }
            }
            else
            {
                throw new ArgumentNullException("partner");
            }
        }

I have problems with this code, and if I have them, maybe others who meet it later in our project will have them too. If you had to determine which part is the business functionality, it would probably be a shot in the dark. I really need to understand the whole method before I can point out how it works, because parameter checking, control flow, and exit points are mixed with the business part.

That's why I write all my methods following this pattern:

  1. check parameters
  2. define return value
  3. do business functionality
  4. return retval

Here is the rewritten method:

        public int SaveNewPartner(Partner partner, Operator modifier)
        {
            if (partner == null)
            {
                throw new ArgumentNullException("partner");
            }
            if (modifier == null)
            {
                throw new ArgumentNullException("modifier");
            }

            int ret = 0;

            if (partner.ID != null)
            {
                throw new InvalidOperationException("Already saved!"); //LOCSTR
            }
            if (!SanityCheck(partner))
            {
                throw new ApplicationException("partner failed sanity check");
            }

            partner.ModificationTime = DateTime.Now;
            partner.Modifier = modifier;
            ret = Save(partner);

            return ret;
        }

In lines 3-10 the parameter check occurs. There should be parameter checking at every public entry point of our class (methods, properties). I check all the parameters I use in the given method, directly or in the private members I call from here. It is not necessary to check params which are simply handed over to other public methods (we may not know all the constraints about those; it's not our business). Neither do I check business validity here (line 3 vs. line 14).

In line 12 I define the return value, which I always name 'ret'. So whichever line of the method you check, you can clearly identify where the return value is set, and you don't need to scroll anywhere to find out which variable it is.

In lines 14-25 comes the business logic. All edge cases are closed as soon as possible, so there are no long-lasting ifs and no unnecessarily deep indentation.

In line 27 we return. There are no other inline returns in the method body, so the control flow is clear: we enter at the beginning and exit at the end.

for vs. foreach

If you write a loop just to reach every item in a collection, use foreach instead of for.
So instead of:

            for (int i = 0; i < list.Length; i++)
            {
                list[i].DoSomething();
            }

write this:

            foreach (var item in list)
            {
                item.DoSomething();
            }

In the first case you have to manage the position variable i yourself, and every cycle indexes into the collection again.
With an array or a List<T> that indexing is cheap, but not every collection offers a cheap indexer: on some types each indexed access has to walk to the requested item from the beginning.
foreach avoids this: it asks for a single enumerator, which remembers its position and simply steps to the next item in each cycle.
There are exceptions, like real arrays, where indexing compiles down to CPU-level instructions and can be faster than instantiating an enumerator.
But remember my Think Runtime Principle.
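The difference is easiest to see on a sequence that offers no cheap indexer at all. Here indexed access has to restart the enumeration for every element, while foreach keeps one enumerator alive (a sketch):

```csharp
IEnumerable<int> items = Enumerable.Range(0, 1000).Select(i => i * 2);

// Indexed access: ElementAt() walks from the start every time,
// so the whole loop does roughly n*n/2 steps.
for (int i = 0; i < items.Count(); i++)
{
    var item = items.ElementAt(i);
}

// foreach: one enumerator remembers its position,
// so the whole loop does n steps.
foreach (var item in items)
{
    // each cycle just moves to the next element
}
```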

My Think Runtime Principle

I realized that while working I always keep in mind a thought which now has a name: the Think Runtime Principle.

While programming the implementation details of your current task, you should keep a performance-guard thread running in your mind. You should always think about how your code will be executed. It will accomplish what is in your specs, but will it be optimal?

Nowadays we have high-level programming languages, our code is compiled by clever compilers and executed under clever runtimes on fast computers, so we don't really need to think about the CPU, memory exhaustion, etc. But some years ago programmers used punch cards, did you know? Processor speeds were measured in MHz, not GHz. And a man who is very rich now said that 640 KB ought to be enough for everything… Programmers of those ages had the ability to write programs which fulfilled their specs and were optimal with respect to the hardware too.

I understand that today it isn't so important to think about the hardware. But I am not talking about the hardware at all. I am talking about a philosophy of thinking about the runtime environment of your product. If you don't think about it, you lose, or never grow, a skill. A skill which has its place in the pantheon of skills that help us design and write good code. Without that skill there will be a growing distance between you and the device you write for. You may never have used in real life some of the maths you learned in elementary school, but it helped you organize your thoughts, learn algorithmic thinking, etc.

What kind of music is born from a composer who doesn't really know how a note will sound on a piano or a violin? My favorite church has a chime. The organist realized that a lot of songs simply cannot be played on a chime, because the bells keep sounding for seconds after he plays them, and that sound may not be in harmony with the notes he plays later. He must take care with the instrument he works on.

Caution! I didn't say you should write code which somehow relies on the implementation details of the underlying classes, runtime, or hardware! That would break OOP rules in a wider sense. Do you remember your favorite PC games which quickly became unplayable after you changed your PC to a faster one?

I said you should write code which relies appropriately on the contracts of the underlying things!

You should be not only a programmer but an engineer!


Reuse intermediate results

Everybody knows that a good programmer writes reusable code.
If somebody writes the same instruction sequence twice, she/he extracts it into a method and the code becomes reusable.
But how about data? Reuse it too!

            for (int i = 0; i < array.Length - 1; i++)
            {
                if (array[i] > array[i + 1])
                {
                    sum = array[i] + array[i + 1];
                }
            }

Depending on the array's content, you index the array four times in every cycle instead of the two necessary.
Indexing an array is a CPU instruction, very fast, but imagine if it were a List and not an array!
A list implementation may store its items as linked nodes, and then reaching the item at a specific index takes a lot of dereferencing steps.
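With the intermediate results cached, the loop above could look like this:

```csharp
for (int i = 0; i < array.Length - 1; i++)
{
    // Index each element only once per cycle and reuse the values:
    var current = array[i];
    var next = array[i + 1];

    if (current > next)
    {
        sum = current + next;
    }
}
```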

Go further:

                if (MyObject1.AnObjectProperty.ItsSubProperty.Value > MyObject2.AnObjectProperty.ItsSubProperty.Value)
                {
                    sum = MyObject1.AnObjectProperty.ItsSubProperty.Value + MyObject2.AnObjectProperty.ItsSubProperty.Value;
                }

How many times is the object graph walked down to the Value property? And what happens if the properties in the path are calculated ones and not just field wrappers?
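Caching the results of the property walks reduces it to one walk per object:

```csharp
var value1 = MyObject1.AnObjectProperty.ItsSubProperty.Value;
var value2 = MyObject2.AnObjectProperty.ItsSubProperty.Value;

// Each object graph is walked down to Value only once:
if (value1 > value2)
{
    sum = value1 + value2;
}
```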

Go further:

            for (int i = 0; i < array.Length - 1; i++)
            {
                if (CalculateValue(i) > CalculateValue(i+1))
                {
                    sum = CalculateValue(i) + CalculateValue(i + 1);
                }
            }

We are calculating values twice!
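Caching helps here too, and the value calculated for i + 1 can even be carried over to the next cycle:

```csharp
var current = CalculateValue(0);

for (int i = 0; i < array.Length - 1; i++)
{
    var next = CalculateValue(i + 1);

    if (current > next)
    {
        sum = current + next;
    }

    // Reuse it as next cycle's "current" instead of recalculating:
    current = next;
}
```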

Think about how your code will be executed, not only about what you should accomplish!
See my Think Runtime Principle.