Setting up presentation, aids, preparation: 1 minute.
Audience: age 18-20, beginner to intermediate knowledge of algorithms
Operational goals:
•get people interested in the issue of working with slow internet connections
•explain my approach
•find out their opinions
- introductory talk
slide 1 – 15 seconds
Hello, my name is Dan Cojocaru and this presentation is about sorting methods in the context of lower-end hardware and software.
Slide 2 – 2 minutes 30 seconds
I am talking more generally about sorting methods, not sorting algorithms, because the algorithms themselves are more or less well known to everybody. However, the way we apply them to solve problems can differ.
As more and more of the world is getting connected to the internet, we as developers, programmers and computer scientists have a duty to provide it with quality services. We must remain aware of constraints that we ourselves do not necessarily face and work within them.
I have therefore decided to research what sorting methods could be used to optimise for the delay people might encounter while data is being transmitted to their devices.
slide 3 – 5 seconds
on slide 4: definition of online sorting algorithm – 20 seconds
An online sorting algorithm is one that can process its input piece by piece.
That is, such an algorithm does not have to wait for all the input data to arrive before it can start processing.
One such sorting algorithm is Insertion Sort!
on slide 5: explanation of insertion sort being an online algorithm – 50 seconds
In most common implementations of insertion sort, the single array is treated as two: an unsorted input part and a sorted result part. One item at a time is taken from the input part and inserted at the right place in the result part. At the end, only the sorted part remains.
As such, this algorithm processes one item at a time. That is an online algorithm!
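A minimal Python sketch of this idea (illustrative only, not the actual benchmark code; the function name is just for these notes): insertion sort can consume a stream item by item and keep the result sorted after every arrival.

def online_insertion_sort(stream):
    result = []
    for item in stream:  # each item is handled as soon as it arrives
        i = len(result)
        result.append(item)
        # shift larger elements one position to the right until 'item' fits
        while i > 0 and result[i - 1] > item:
            result[i] = result[i - 1]
            i -= 1
        result[i] = item
        # 'result' is fully sorted after every received item
    return result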
So, considering insertion sort is an online algorithm and it starts processing as soon as it starts receiving items, it should perform faster over slow internet connections, right?
Wrong.
slide 6 – 40 seconds
The results of my testing show that even at slower internet speeds the advantage is nearly non-existent for lists with few items, and insertion sort still performs slower than other, generally faster algorithms for large inputs.
When I first found this result, I was rather disappointed. But then I thought: there surely must be another way, right?
And there is!
slide 7 – 5 seconds
Chunking! Splitting the input into small chunks and handling them individually.
slide 8 – 45 seconds
I have chosen to implement chunking by sorting each individual chunk as it becomes available and then merging the chunks together using the same merge step used in merge sort. This allows any sorting algorithm to be used on the individual chunks, enabling optimisations based on the kind of input being sorted.
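A minimal Python sketch of this chunk-then-merge idea (illustrative only; the function names and the use of sorted() as the per-chunk sorter are assumptions, not the actual implementation):

def merge(a, b):
    # the standard merge step from merge sort: combine two already sorted lists
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

def chunked_sort(chunks, sort_chunk=sorted):
    # 'chunks' yields lists of items as they become available over the connection;
    # 'sort_chunk' can be any sorting algorithm suited to the kind of input
    result = []
    for chunk in chunks:
        result = merge(result, sort_chunk(chunk))  # sort the new chunk, then merge it in
    return result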
There are probably different ways to chunk the data in order to handle it faster, and those are things that can be further explored.
What is certain is that chunking combined with a fast sorting algorithm yielded significant improvements in sorting performance.
Slide 9 – 1 minute 10 seconds
I came up with three different approaches to chunking.
The first approach is what I call „constant chunking”. We choose a constant number, and each time the container of unsorted items reaches that many elements, we take them as a chunk and process it.
The second approach I tried is „exponential chunking”. We choose a constant number and keep a target size, initially 1. Each time the container of unsorted items reaches the target size, we process that chunk and multiply the target size by the chosen constant.
The third approach is „equal chunking”. When the number of unsorted items equals the number of already sorted items, the unsorted items are processed as a chunk.
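A minimal Python sketch of the three policies (illustrative only; the names and signatures are mine, not the actual implementation). Each predicate answers the question: should the unsorted items received so far be processed as a chunk now?

def constant_chunking(unsorted_count, sorted_count, size=64):
    # constant chunking: process whenever 'size' unsorted items have accumulated
    return unsorted_count >= size

def exponential_chunking(unsorted_count, sorted_count, target):
    # exponential chunking: process whenever the current target is reached;
    # the caller then multiplies 'target' by the chosen constant (target starts at 1)
    return unsorted_count >= target

def equal_chunking(unsorted_count, sorted_count):
    # equal chunking: process when there are as many unsorted items as already sorted ones
    return unsorted_count >= max(1, sorted_count)

(The max(1, ...) simply makes the first chunk possible before anything has been sorted.)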
slide 10 – 1 minute
How about performance? I’m not sure.
Based on my tests, I am certain that the method of chunking the data into multiple smaller parts can significantly improve the performance of sorting.
However, I am uncertain about several things: which method of chunking to choose; whether there are methods better than the three I thought of; whether certain constants for some chunking methods yield better performance in some contexts than others; whether suboptimal chunking can hurt performance enough to negate the supposed benefit over a regular approach; and whether performance is severely affected when transfer speeds are high.
These questions are left for further research.
slide 11 – 5 seconds
Conclusions. Also known as: „Phew, this presentation is finally over...”
slide 12 – 35 seconds
Firstly, coming back to the start of the presentation, think about constraints that are not obvious at first. The statement that best summarises this idea is „Not everybody has fast internet”.
Secondly, in the traditional „divide et impera” (or divide and conquer) approach, dividing a problem into smaller chunks seems to give significant performance improvements.
And finally, if you’re not bored, if this presentation caught your attention and if you want to experiment further, you’re welcome to!
slide 13 – 30 seconds
You can contact me on my email address!
I will also post these slides, the notes I took to prepare for this presentation and the source code on my website. The source code will appear at a later time (probably after the exams), since it is not yet ready for a wide audience. I will do some cleanup and documentation, then most likely release it under an MIT license.
Soooooo… Questions?