LINQ to Object: Skip() performance

Assume You have a List of 5000000 items. You want to do some paging so You need e.g. 100 items from offset 300000. Check these two implementations:

        public List<T> GetItems1<T>(List<T> source, int offset, int count)
        {
            List<T> ret = new List<T>();

            for (int i = offset; i < offset + count; i++)
            {
                ret.Add(source[i]);
            }

            return ret;
        }
        public List<T> GetItems2<T>(List<T> source, int offset, int count)
        {
            List<T> ret = source.Skip(offset).Take(count).ToList();

            return ret;
        }

What do You think, which performs better? You may say the second one of course. The first one indexes the source list and calls Add() method count times. The second simply enumerates once till offset then returns count items as a new List with the possibility of internal item addition optimalization. At least that were what I think.

But I was wrong!

The second implementation always slower. The magnitude depends on the offset value but it is always slower!

offset GetItems1 GetItems2
0 43 65
10000 59 729
100000 44 5162
1000000 42 52057
3000000 44 147608

The reason is inside the implemetation details of List and IEnumerable.Skip(). The first one knows where to find the nth item, the second one should enumerate to it. The conclusion as one of my colleagues pointed out: use as specialized tools as You can.

The code which I used for result above:

    [TestClass]
    public class Test
    {
        private int cycles = 100;
        private int offset = 0;
        private int count = 100;

        [TestMethod]
        public void perftest1()
        {
            var l = GetTestData();

            var sw = new Stopwatch();

            double r = 0;
            for (int i = 0; i < cycles; i++)
            {
                sw.Reset();
                sw.Start();
                var result = GetItems1(l, offset, count);
                sw.Stop();
                r += sw.ElapsedTicks;
            }

            Assert.Fail((r / cycles).ToString());
        }


        [TestMethod]
        public void perftest2()
        {
            var l = GetTestData();

            var sw = new Stopwatch();

            double r = 0;
            for (int i = 0; i < cycles; i++)
            {
                sw.Reset();
                sw.Start();
                var result = GetItems2(l, offset, count);
                sw.Stop();
                r += sw.ElapsedTicks;
            }

            Assert.Fail((r / cycles).ToString());
        }


        public List<T> GetItems1<T>(List<T> source, int offset, int count)
        {
            ...
        }


        public List<T> GetItems2<T>(List<T> source, int offset, int count)
        {
            ...
        }

        private List<int> GetTestData()
        {
            List<int> ret = new List<int>();

            for (int i = 0; i < 5000000; i++)
            {
                ret.Add(i);
            }

            return ret;
        }
    }

NSubstitute’s unexpected default return value

NSubstitute is a great tool when You are writing unit tests for Your product which was designed with dependency injection in mind.

Let’s see it:

    public interface IWorker
    {
        object GetResult();
    }

    [TestMethod]
    public void ExecuterTest_WorkerCalledOrNot()
    {
        var workerSubstitute = Substitute.For<IWorker>();

        var executerToTest = new Executer(workerSubstitute);

        // we test here
        executerToTest.Execute();

        workerSubstitute.Received(1).GetResult();
    }

In the code above I was not interested in the real value of the worker’s result. I wanted only to know whether the worker’s GetResult method gets called or not. But all of my workers do some hard work, so for this test I dont want to instantiate them and implicitli convert my unit test into integration test. So I create a substitute via NSubstitute for the given interface and gave that to my Executer.

The substitute tries to be as neutral as it can be. All void methods return asap, all non void methods return the default value for the return type. Naturally You can modify that behaviour and explicitly tell the substitute what to do when it’s methods get called, so You can for example redirect all database calls to in memory data structures during tests if You created Your DAL layer with DI in mind. And meanwhile the subtitute collects data about its use, so we can ask it whether it was called or not and with what arguments, etc.

But today I found got something unexpected thing.

Change a bit the example above to demonstrate it:

    public interface IWorker
    {
        TRet GetResult<TRet>();
    }

    [TestMethod]
    public void SubstituteReturnValueTest()
    {
        var workerSubstitute = Substitute.For<IWorker>();

        // test #1
        var r1 = workerSubstitute.GetItem<int>();
        Assert.AreEqual(default(int), r1);

        // test #2
        var r2 = workerSubstitute.GetItem<List<int>>();
        Assert.AreEqual(default(List<int>), r2);

        // test #3
        var r3 = workerSubstitute.GetItem<IList<int>>();
        Assert.AreEqual(default(IList<int>), r3);
    }

This will fail at Assert of test #3 because r3 wont be null as You would expect but an instance of “Castle.Proxies.IList`1Proxy”!

I think it is a bug but may be a result of some design decisions which priorized some functionality (retval is an instance of some interface which is wrapped around with a subtitute created on the fly) over coherency.

So be careful 🙂

Method implementation pattern

During the development of many projects I tried to standardize the outlook of methods to help anybody to distinguish between some parts of them.
Check the code below:

        public int SaveNewPartner(Partner partner, Operator modifier)
        {
            if (partner != null)
            {
                if (modifier != null)
                {
                    if (partner.ID == null)
                    {
                        if (!SanityCheck(partner))
                        {
                            throw new ApplicationException("partner failed sanity check");
                        }

                        partner.ModificationTime = DateTime.Now;
                        partner.Modifier = modifier;
                        return Save(partner);
                    }
                    else
                    {
                        throw new InvalidOperationException("Already saved!"); //LOCSTR
                    }
                }
                else
                {
                    throw new ArgumentNullException("partner");
                }
            }
            else
            {
                throw new ArgumentNullException("partner");
            }
        }

I have problems with this code, and if I have them maybe others who meet it later in our project will have too. If You should determine which part is its business functionality it will be probably a shot in the dark. I really need to understand the whole method before I can point out how it works because parameter checking, control flow, exit point mixed with business part.

That’s why I wrote all my methods by following a pattern:

  1. check parameters
  2. define return value
  3. do business functionality
  4. return retval

Here is the rewritten method:

        public int SaveNewPartner(Partner partner, Operator modifier)
        {
            if (partner == null)
            {
                throw new ArgumentNullException("partner");
            }
            if (modifier == null)
            {
                throw new ArgumentNullException("modifier");
            }

            int ret = 0;

            if (partner.ID == null)
            {
                throw new InvalidOperationException("Already saved!"); //LOCSTR
            }
            if (!SanityCheck(partner))
            {
                throw new ApplicationException("partner failed sanity check");
            }

            partner.ModificationTime = DateTime.Now;
            ret = Save(partner);

            return ret;
        }

In lines 3-10 the parameter check occures. There should be parameter checking in each public entrypoint of our class (methods, properties). I check all the parameters I use in the given method directly or in private members I call from here. It is not necessary to check params which are simly handled to other public methods (we may not know all the constraints about these, it’s not our business). I neither check the business validity here (line 3 vs. line 14).

In line 12 I define the return value, which I gave always the name ‘ret’. So if You check any line of the method You can clearly identify where the retval is set and You dont need to scroll anywhere to determine what is the retval variable.

In lines 14-24 placed the business logic. All extremalities are closed asap, so no long lasting ifs and unnecessarily deep indentations happen.

In line 26 we return from here. No other inline returns in the method body so the control flow is clear: we enter at the beginning and exit at the end.

Volatile variables

If You want to exit from loop if some bool set from another thread You may be victimed by compiler’s code optimalization.

class Test
{
    private bool _loop = true;

    public static void Main()
    {
        Test test1 = new Test();

        // Set _loop to false on another thread
        new Thread(() => { test1._loop = false;}).Start();

        // Poll the _loop field until it is set to false
        while (test1._loop == true) ;

        // The loop above will never terminate!
    }
}

Found here: http://igoro.com/archive/volatile-keyword-in-c-memory-model-explained/

Enum.TryParse<> unexpected behaviour

Check code below:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
 
namespace EnumTest
{
    class Program
    {
        static void Main(string[] args)
        {
            TestEnum result;

            bool ret = Enum.TryParse<TestEnum>("Yellow", out result);
            // ret value becomes: false
 
            ret = Enum.TryParse<TestEnum>("111", out result);
            // ret value becomes: true, result becomes: 111 !!!!!
        }
    }
 
    enum TestEnum
    {
        Red = 1,
        Green = 2
    }
}

Something found in help:

“If this behavior is undesirable, call the IsDefined method to ensure that a particular string representation of an integer is actually a member of TEnum. ”

Nice.

Lazy expression evaluation

What will we find in result after this code finished?

IEnumerable<string> src = new List<string> { "ab", "ac", "bc", "abc" };
List<string> filter = new List<string> { "a", "b", "c" };

foreach (var f in filter)
{
   src = src.Where(i => i.Contains(f));
}

result = src.ToList();

You say: only one element: “abc”?
Wrong!
It contains three elements: “ac”, “bc”, “abc”
Why?
Where() is lazy!
Excerpt from help:

“This method is implemented by using deferred execution. The immediate return value is an object that stores all the information that is required to perform the action. The query represented by this method is not executed until the object is enumerated either by calling its GetEnumerator method directly or by using foreach in Visual C# or For Each in Visual Basic.”

We build a lazy evaluated expression chain there will be references to f variable. After we get out from cycle the reference to f remains in use, but it’s value will be the last one it got in the cycle.
So when we evaluate with ToList() call the src.Where(…).Where(…).Where(…) expression each lambda will look for “c”!

This behaviour changes in .NET 4.5; f will encapsulate the value and not the reference.

Task vs. CurrentThread

If You use Task and Thread.CurrentThread inside of the code tree callable from the task’s body You should know something.
The TaskScheduler may decide to run Your tasks on same Thread.
In this case Your code may not work as expected, see sample below:

private Dictionary<Thread, int> uniqueIds = new Dictionary<Thread, int>();
 
private void InitMyUniqueId(int id)
{
    // thread safety skipped to keep sample simple
    uniqueIds[Thread.CurrentThread] = id;
}
 
public void DoSomething()
{
    // thread safety skipped to keep sample simple
    var myUniqueId = uniqueIds[Thread.CurrentThread];
 
    Console.Write(myUniqueId);
}
 
public void main()
{
    var t1 = new Task(() =>
    {
        InitMyUniqueId(1);
        DoSomething();
    });
 
    var t2 = new Task(() =>
    {
        InitMyUniqueId(2);
        DoSomething();
    });
 
    t1.Start();
    t2.Start();
}

Depending on TaskScheduler the output can be “12” and “22” too!
The potential impact may be e.g. various context mismatches if You use CurrentThread
to select the local storage slot for Your data, etc. Be careful!

Possible workaround: check Task.CurrentId. If it is not null we are in a task
and using CurrentThread like above is dangerous.