Some Notes on Big-O Notation
Some algorithm basics that Data Scientists tend to skip.
- It begins with a story
- Introducing the Big-O Notation
- An Anagram Example
- Some Exercise Again?
- Exercise Solution:
It begins with a story
I always recall that when I was a kid, my math teacher loved to ask a question:
- Who can tell me the answer to adding 1 to 100?
Some students started to calculate: $1+2+3+4+5+6+7+\dots$
The one who was always No. 1 in class gave the answer in a very short time: it's $5050$.
The teacher asked: "How did you do that?"
He said: "Well, $1+100$ is $101$, and $2+99$ is $101$; there are $50$ pairs of $101$ in total, so the answer is $50 \times 101$, which is $5050$."
The smart boy is definitely not me; I am just the one recording this story. But it does inspire an extended question: what if we are not adding 1 to 100, but adding 1 to $n$?
Let's solve this question by writing a function to sum from 1 to $n$. That's simple:
Python Solution 1
```python
def summation(n):
    result = 0
    for i in range(1, n+1):
        result += i
    return result

array = [1, 2, 3, 4, 5]
summation(5) == sum(array)  # expected True
```
Python Solution 2
```python
def summation(n):
    return n * (n + 1) // 2

array = [1, 2, 3, 4, 5]
summation(5) == sum(array)  # expected True
```
Can you tell which algorithm is better, and why?
Some will say: "I think Solution 1 is better, because the Zen of Python says 'Readability counts'. Solution 1 is very easy to understand at first glance; you know what it's doing from the very beginning."
Some will say: "I think Solution 2 is better, because it always takes one step to get the answer."
Well, that's close. Let's do a step-by-step analysis to compare these two solutions:
Solution 1:
Let's say $n$ is $5$:
- Step 1: initialize `result = 0`
- Step 2: start the loop with `i = 1`, then `result = 0 + 1 = 1`
- Step 3: in the loop, now `i = 2`, execute `result = 1 + 2 = 3`
- Step 4: in the loop, now `i = 3`, execute `result = 3 + 3 = 6`
- Step 5: in the loop, now `i = 4`, execute `result = 6 + 4 = 10`
- Step 6: in the loop, now `i = 5`, execute `result = 10 + 5 = 15`
- Step 7: return the `result` value.
Solution 2:
Let's say $n$ is $5$:
- Step 1: return the value: $\frac{5 \times (5 + 1)}{2} = 15$
OK, in the first solution the number of steps grows as $n$ grows. Say $n$ is in the millions: the function needs to execute the loop body millions of times to get the answer. In contrast, no matter how big $n$ is, the second solution always takes a single execution to get the answer. So whether on a slow computer or a super fast one, Solution 2 will always be faster than Solution 1.
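To see the difference concretely, here is a minimal timing sketch using Python's built-in `timeit` module; the function names are just renamed copies of Solutions 1 and 2, and the exact numbers will vary from machine to machine:

```python
import timeit

def summation_loop(n):       # Solution 1
    result = 0
    for i in range(1, n + 1):
        result += i
    return result

def summation_formula(n):    # Solution 2
    return n * (n + 1) // 2

for n in (1_000, 100_000, 1_000_000):
    t1 = timeit.timeit(lambda: summation_loop(n), number=10)
    t2 = timeit.timeit(lambda: summation_formula(n), number=10)
    print(f"n={n:>9}: loop {t1:.6f}s, formula {t2:.6f}s")
```

The loop time grows roughly in proportion to `n`, while the formula time stays essentially flat.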
The analysis above is only qualitative, but what science needs is precision. So what quantitative analysis lets us precisely identify and compare algorithms?
Introducing the Big-O Notation
When trying to characterize an algorithm’s efficiency in terms of execution time, independent of any particular program or computer, it is important to quantify the number of operations or steps that the algorithm will require. If each of these steps is considered to be a basic unit of computation, then the execution time for an algorithm can be expressed as the number of steps required to solve the problem. Deciding on an appropriate basic unit of computation can be a complicated problem and will depend on how the algorithm is implemented.
In the summation functions given above, it makes sense to use the number of terms in the summation to denote the size of the problem. We can then say that the sum of the first 100,000 integers is a bigger instance of the summation problem than the sum of the first 1,000. Because of this, it might seem reasonable that the time required to solve the larger case would be greater than for the smaller case. Our goal then is to show how the algorithm’s execution time changes with respect to the size of the problem.
For the summation algorithm in Solution 1, a good basic unit of computation is the assignment statement: there is one initial assignment (`result = 0`) plus $n$ additions inside the loop, so the number of steps is $T(n) = 1 + n$.
Computer scientists prefer to take this analysis technique one step further. It turns out that the exact number of operations is not as important as determining the most dominant part of the $T(n)$ function. In other words, as the problem gets larger, some portion of the $T(n)$ function tends to overpower the rest. This dominant term is what, in the end, is used for comparison. The order of magnitude function describes the part of $T(n)$ that increases the fastest as the value of $n$ increases. Order of magnitude is often called Big-O Notation (for "order") and written as $O(f(n))$. It provides a useful approximation of the actual number of steps in the computation. The function $f(n)$ provides a simple representation of the dominant part of the original $T(n)$.
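As a rough sanity check, here is a small instrumented copy of Solution 1 (the counter variable and function name are my own additions) that counts assignment statements as it runs and reports $T(n) = 1 + n$:

```python
def summation_counted(n):
    steps = 0
    result = 0        # 1 initial assignment
    steps += 1
    for i in range(1, n + 1):
        result += i   # n assignments, one per iteration
        steps += 1
    return result, steps

for n in (5, 100, 1_000):
    total, steps = summation_counted(n)
    print(f"n={n}: sum={total}, T(n)={steps} (= 1 + n)")
```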
Common Functions for Big-O:
$n$ --> The size of the input
- Constant runtime(Time Complexity): $O(1)$
- Logarithmic runtime(Time Complexity): $O(\log (n))$
- Linear runtime(Time Complexity): $O(n)$
- Log-linear (linearithmic) runtime(Time Complexity): $O(n\log (n))$
- Quadratic runtime(Time Complexity): $O(n^2)$
- Cubic runtime(Time Complexity): $O(n^3)$
- Exponential runtime(Time Complexity): $O(b^n), b > 1$
- Factorial runtime(Time Complexity): $O(n!)$
If we plot the graph of each run time:
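A chart of these growth curves belongs here; below is a minimal matplotlib sketch (my own reconstruction, with an arbitrary y-axis cap so the slower-growing curves stay visible) that draws them:

```python
import numpy as np
import matplotlib.pyplot as plt
from math import factorial

n = np.linspace(1, 10, 200)

plt.plot(n, np.ones_like(n), label="O(1)")
plt.plot(n, np.log2(n), label="O(log n)")
plt.plot(n, n, label="O(n)")
plt.plot(n, n * np.log2(n), label="O(n log n)")
plt.plot(n, n**2, label="O(n^2)")
plt.plot(n, n**3, label="O(n^3)")
plt.plot(n, 2**n, label="O(2^n)")
ints = np.arange(1, 11)
plt.plot(ints, [factorial(i) for i in ints], label="O(n!)")

plt.ylim(0, 150)   # clip the fast-growing curves so the slow ones remain visible
plt.xlabel("n (input size)")
plt.ylabel("steps")
plt.legend()
plt.show()
```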
So, revisiting our opening story: what do you think the time complexity of Solution 1 and Solution 2 is? Comment below.
Big-O Notation Rules:
Rule 1: Different Steps get added
Example
```python
def doSomething():
    doStep(a)  # O(a)
    doStep(b)  # O(b)
    return
```
So for the above example, the Time Complexity should be: $O(a+b)$
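Here is a small concrete (hypothetical) version of the same idea: one pass over `arrayA` followed by one pass over `arrayB`, so the two costs add up:

```python
def sumBoth(arrayA, arrayB):
    total = 0
    for x in arrayA:   # O(a), where a = len(arrayA)
        total += x
    for y in arrayB:   # O(b), where b = len(arrayB)
        total += y
    return total       # overall: O(a + b)

print(sumBoth([1, 2, 3], [10, 20]))  # expected: 36
```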
Rule 2: Drop Constants
Example
One
```python
def minMax(array):
    # start from +/- infinity so the first comparison always succeeds
    minimum, maximum = float('inf'), float('-inf')
    for i in array:
        minimum = min(i, minimum)
    for i in array:
        maximum = max(i, maximum)
    return minimum, maximum
```
Two
```python
def minMax(array):
    minimum, maximum = float('inf'), float('-inf')
    for i in array:
        minimum = min(i, minimum)
        maximum = max(i, maximum)
    return minimum, maximum
```
The above two functions do the same thing: return the `min` and `max` values from an `array`. But the One first finds the `min` and then finds the `max`, so the actual number of steps is `2n`, while the Two finds the `min` and `max` in the same pass, so the actual number of steps is `n`.
You may say the One's time complexity is `O(2n)` and the Two's time complexity is `O(n)`. The actual answer is that both of them have time complexity `O(n)`, because as `n` approaches infinity the constant `2` becomes less and less significant until it can be ignored. So this is Rule 2: drop the constants.
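A quick sanity check (the sample data is arbitrary): both versions return the same result; they only differ in how many passes they make over the array.

```python
data = [3, 1, 4, 1, 5, 9, 2, 6]
print(minMax(data))  # expected: (1, 9) with either definition
```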
Rule 3: Different Inputs --> Different Variables
Example
```python
def intersectionSize(arrayA, arrayB):
    count = 0
    for a in arrayA:
        for b in arrayB:
            if a == b:
                count += 1
    return count
```
Well, this function has a loop inside a loop, so we might easily label it as $O(n^2)$ time complexity, but that is actually false. Let's ask a simple question: what does the `n` mean? We may say `n` is the input size of the array. OK, then which array, `arrayA` or `arrayB`? Since these are different variables, a single `n` cannot represent two different input sizes, so here we should describe the time complexity as:
- $O(a \times b)$
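For instance (the sample arrays below are arbitrary), with `len(arrayA) = 3` and `len(arrayB) = 4` the inner comparison runs $3 \times 4 = 12$ times, regardless of how many matches are found:

```python
arrayA = [1, 2, 3]
arrayB = [2, 3, 4, 5]
print(intersectionSize(arrayA, arrayB))  # expected: 2, after 3 * 4 = 12 comparisons
```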
Rule 4: Drop the non-dominant terms
Example
```python
def IdontKnowWhatIamDoing(array: list):
    maximum = float('-inf')
    # O(n) time complexity
    for i in array:
        maximum = max(i, maximum)
    print(maximum)
    # O(n^2) time complexity
    for a in array:
        for b in array:
            print(a, b)
```
We can see from the above function `IdontKnowWhatIamDoing` that the first part's time complexity is $O(n)$ and the second part's time complexity is $O(n^2)$. Does that mean the total time complexity is $O(n + n^2)$?
Well, let's do some simulation:
- if `n = 1`, there are $1 + 1^2 = 2$ steps
- if `n = 2`, there are $2 + 2^2 = 6$ steps
- if `n = 10`, there are $10 + 10^2 = 110$ steps
- if `n = 10,000`, there are $10,000 + 10,000^2 = ???$ steps
- if `n = 100,000`, there are $100,000 + 100,000^2$ steps
What pattern do you find? As `n` grows, the $n^2$ term gains more and more dominance and the $n$ part becomes less significant. In Big-O notation we are not doing a detailed computation; Big-O notation is a unified way to describe an algorithm's time complexity and space complexity (which we may discuss in the next post), so we only need the dominant term that impacts the run time the most. So here we drop the non-dominant term, and the time complexity is:
- $O(n^2)$
Some Exercises?
We have finished the rules of Big-O notation, so how about a small exercise? It can be solved based on the info above:
An Anagram Example
What is an Anagram?
We have a word table here:
| Word_1 | Word_2 |
|---|---|
| earth | heart |
| planet | platen |
| space | paces |
| star | rats |
| cosmic | comics |
| nebula | unable |
| lunar | ulnar |
| solar | orals |
| sunspot | unstops |
Can you find the pattern? Is there any relationship between `word_1` and `word_2`?
In layman's terms, anagrams are pairs of words that are both:
- meaningful
- formed from exactly the same letters
Just like `earth` and `heart`.
So now we know what an anagram is. Now let's write a short function to tell whether two words are anagrams: the function should accept two `string`s and return a `bool` value of `True` or `False` telling whether the two words are anagrams or not.
Solution 1: Checking Off
Our first solution to the anagram problem will check the lengths of the strings and then check that each character in the first string actually occurs in the second. If it is possible to "check off" each character, then the two strings must be anagrams. Checking off a character will be accomplished by replacing it with the special Python value `None`.
However, since strings in Python are immutable, the first step in the process will be to convert the second string to a list. Each character from the first string can be checked against the characters in the list and, if found, checked off by replacement.
```python
def isAnagram(string1: str, string2: str) -> bool:
    anagram = True
    if len(string1) != len(string2):
        anagram = False
    stringList = list(string2)
    position_1 = 0
    while position_1 < len(string1) and anagram:
        position_2 = 0
        found = False
        while position_2 < len(stringList) and not found:
            if string1[position_1] == stringList[position_2]:
                found = True
            else:
                position_2 += 1
        if found:
            stringList[position_2] = None
        else:
            anagram = False
        position_1 += 1
    return anagram

print(isAnagram('earth', 'heart'))  # expected: True
print(isAnagram('abc', 'abcde'))    # expected: False
```
To analyze this algorithm, we need to note that each of the `n` characters in `string1` will cause an iteration through up to `n` characters in the list from `string2`. Each of the `n` positions in the list will be visited once to match a character from `string1`. The number of visits then becomes the sum of the integers from `1` to `n`. We stated earlier that this can be written as:
$$\sum_{i=1}^{n}i = \frac{n(n+1)}{2} = \frac{1}{2}n^2 + \frac{1}{2}n$$
As `n` gets large, the $n^2$ term dominates the $n$ term, and the constant $\frac{1}{2}$ can be ignored. Therefore, this solution's time complexity is:
- $O(n^2)$
Solution 2: Sort and Compare
I am not sure whether you noticed one more pattern of anagrams: both words always have the same length. So we can come up with another solution: use Python's built-in function `sorted()` (which turns each `string` into a sorted `list` of characters) on both words, and return the boolean result of comparing the two sorted lists.
This is actually a Pythonic solution, which may not carry over to other languages:
```python
del isAnagram  # delete the previous definition for clarity

def isAnagram(string1: str, string2: str) -> bool:
    return sorted(string1) == sorted(string2)

print(isAnagram('earth', 'heart'))    # expected: True
print(isAnagram('abc', 'abcde'))      # expected: False
print(isAnagram('cosmic', 'comics'))  # expected: True
```
We may come up with a similar solution that is not so Pythonic and more language-neutral:
```python
del isAnagram  # delete the previous definition for clarity

def isAnagram(string1: str, string2: str) -> bool:
    stringList1 = sorted(string1)
    stringList2 = sorted(string2)
    position = 0
    matches = True
    if len(stringList1) != len(stringList2):
        return False
    while position < len(stringList1) and matches:
        if stringList1[position] == stringList2[position]:
            position += 1
        else:
            matches = False
    return matches

print(isAnagram('earth', 'heart'))    # expected: True
print(isAnagram('abc', 'abcde'))      # expected: False
print(isAnagram('cosmic', 'comics'))  # expected: True
```
At first glance you may be tempted to think that this algorithm is $O(n)$, since after the sorting process there is only one simple iteration to compare the $n$ characters. However, the two calls to Python's `sorted()` function are not without their own cost. Sorting is typically either $O(n^2)$ or $O(n \log n)$ depending on the sorting algorithm, and it dominates the process; Python's built-in sort runs in $O(n \log n)$. In the end, this algorithm has the order of magnitude of the sorting step:
- $O(n \log n)$
Solution 3: Count and Compare
The third solution to the anagram problem takes advantage of the fact that any two anagrams have the same number of each character: the same number of `a`s, the same number of `b`s, and so on. In order to decide whether two strings are anagrams, we will first count the number of occurrences of each character. Since there are `26` possible characters, we can use a list of $26$ counters, one for each possible character, just like `One-Hot encoding` in Data Science.
Same as Solution 2, we have one Pythonic solution and one non-Pythonic solution:
Pythonic Solution
```python
from collections import Counter

del isAnagram  # delete the previous definition for clarity

def isAnagram(string1: str, string2: str) -> bool:
    if len(string1) != len(string2):
        return False
    return Counter(string1) == Counter(string2)

print(isAnagram('earth', 'heart'))    # expected: True
print(isAnagram('abc', 'abcde'))      # expected: False
print(isAnagram('cosmic', 'comics'))  # expected: True
```
Non Pythonic Solution
The key is that we will `one-hot encode` each string into a list of $26$ counters.
Example

```python
string1 = 'apple'
# after one-hot encoding
[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```
According to the `list`: character `a` at index `0` appears `1` time, character `e` at index `4` appears `1` time, character `l` at index `11` appears `1` time, and character `p` at index `15` appears `2` times.
We will convert both strings into such `list`s, then move a pointer from index `0` onwards to compare the count for each character; if every count matches, we can say the two strings are anagrams. This algorithm does no sorting, so the time complexity should be $O(n)$.
```python
del isAnagram  # delete the previous definition for clarity

def isAnagram(string1: str, string2: str) -> bool:
    if len(string1) != len(string2):
        return False
    # construct empty one-hot encoding lists
    stringList1 = [0] * 26
    stringList2 = [0] * 26
    # iterate to construct the one-hot encodings
    for i in range(len(string1)):
        position = ord(string1[i]) - ord('a')
        stringList1[position] += 1
    for i in range(len(string2)):
        position = ord(string2[i]) - ord('a')
        stringList2[position] += 1
    pointer = 0
    anagram = True
    while pointer < 26 and anagram:
        if stringList1[pointer] == stringList2[pointer]:
            pointer += 1
        else:
            anagram = False
    return anagram

print(isAnagram('earth', 'heart'))    # expected: True
print(isAnagram('abc', 'abcde'))      # expected: False
print(isAnagram('cosmic', 'comics'))  # expected: True
```
Again, the solution has a number of iterations. However, unlike the first solution, none of them are nested. The first two iterations, used to count the characters, are both based on $n$. The third iteration, comparing the two lists of counts, always takes $26$ steps since there are $26$ possible characters in the strings. Adding it all up gives $T(n) = 2n + 26$ steps, which is $O(n)$. We have found a linear order of magnitude algorithm for solving this problem.
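To make the difference tangible, here is a rough timing sketch of my own (the helper names, string sizes, and repetition counts are arbitrary) comparing a sort-based check in the style of Solution 2 with a count-based check in the style of Solution 3:

```python
import random
import string
import timeit
from collections import Counter

def anagram_sort(s1, s2):    # Solution 2 style: O(n log n)
    return sorted(s1) == sorted(s2)

def anagram_count(s1, s2):   # Solution 3 style: O(n)
    return len(s1) == len(s2) and Counter(s1) == Counter(s2)

for n in (1_000, 100_000):
    s1 = ''.join(random.choices(string.ascii_lowercase, k=n))
    s2 = ''.join(random.sample(s1, k=n))  # a guaranteed anagram of s1
    t_sort = timeit.timeit(lambda: anagram_sort(s1, s2), number=100)
    t_count = timeit.timeit(lambda: anagram_count(s1, s2), number=100)
    print(f"n={n}: sort {t_sort:.4f}s, count {t_count:.4f}s")
```

On large inputs the count-based check should pull ahead, reflecting $O(n)$ versus $O(n \log n)$.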
Some Exercise Again?
- Given the following code fragment, what is its Big-O running time?

```python
test = 0
for i in range(n):
    for j in range(n):
        test = test + i * j
```
- Given the following code fragment, what is its Big-O running time?

```python
test = 0
for i in range(n):
    test += 1
for j in range(n):
    test -= 1
```
- Given the following code fragment, what is its Big-O running time?

```python
i = n
while i > 0:
    k = 2 + 2
    i = i // 2
```
Exercise Solution:
Still reading? Now it's solution time: