2. Randomized Algorithm - Quick sort

 

Remarks

이 글은 인하대학교 최동완 교수님의 빅데이터컴퓨팅 수업을 정리한 자료입니다.

I. Intro to Randomized algorithm

  • Las vegas algorithm
    항상 옳은 정답을 주지만 random process에 따라 running time이 변할 수 있다.
    Input에 따라 결과가 바뀌는 것이 아니라 algorithm 내의 random process에 따라 결과가 바뀐다.
    ex) Quick sort

  • Monte Carlo algorithm
    Running time은 정해져있지만, bound된 성공확률에 따라 정확하지 않은 답을 내는 경우도 있다.
    ex) Karger algorithm for min-cut(Randomized Min-Cut)

1. Ideal Quick Sort

Algorithm. “Ideal” Quick sort
Input: $S$

  1. Find an element(pivot) $y \in S$ s.t. half of $S \leq y$, half of $S \geq y$ (Ideal condition, median finding problem)
  2. Partition $S$ into $S_1, {y}, S_2$ s.t. $\forall x \in S_1: x \leq y, \forall z \in S_2: z \geq y$
  3. Recursively sort $S_1, S_2$ by repeating steps 1, 2

Cost of Ideal Quick sort
$T(n) \leq 2T(n/2) + A$

  • $2T(n/2)$: Step 3
  • $A$: Step 1, 2

Assume $A$ to be a linear cost $\Theta(n)$

\[\begin{aligned} T(n) &\leq 2T(n/2) + c \cdot n \\ &\leq 2(2T(n/4) + c \cdot n/2) + c \cdot n \\ &\leq 2(2(2T(n/8) + c \cdot n/4) + c \cdot n/2) + c \cdot n \\ &\ \vdots \\ &\leq 2^i T(n/{2^i}) + i \cdot c \cdot n \\ &\ \vdots \\ &\leq 2^x T(1) + x \cdot c \cdot n \quad \cdots \ s.t. \ n/2^x = 1 \ (x = \log n) \\ &= n + c \cdot n \log n \\ &\approx O(n \log n) \end{aligned}\]

Optimal complexity of comparison-based sort algorithm: $O(n \log n)$

By the way,
Step 2 is $\Theta(n)$. But in fact, Step 1 is equal to finding median and it’s $O(n \log n)$.

  • Challenge How to quickly find such an element(pivot) $y$?

1) Simple randomized solution (Normal Quick sort, Randomized Quick sort)

Choose an element $y \in S$ at random: $O(1)$
In the worst case, Step 3 can be $T(n-1)$ and this results $O(n^2)$.

Worse / Average / Expected cost

  • Worst / Average cost depends on Input
  • Expected cost does not depends on Input but consider all Input (?)
  1. Cost
    Analyze the expected number of comparions, say $E[cost]$ (cost: random variable) \(\begin{aligned} S_{(i)} &\equiv \text{ the element of rank $i$ (the $i$'th smallest element of S)} \text{ where } 1 \leq i \leq n \\ S_{(j)} &\equiv \text{ the element of rank $j$ (the $j$'th smallest element of S)} \text{ where } 1 \leq j \leq n \\ X_{ij} &\equiv \text{ a random variable indicating numbers of comparisons between $S_{(i)}$ and $S_{(j)}$} \\ Cost &= X_{12} + \cdots + X_{1n} + X_{23} + \cdots + X_{n-1, n} = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X_{ij} \\ E[Cost] &= E[\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X_{ij}] = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} E[X_{ij}] \\ \end{aligned}\)

  2. Outcome of $X_{ij}$
    Q. In which step, the comparison of $S_{(i)}, S_{(j)}$ is performed? A. Step 2, when either $S_{(i)}$ or $S_{(j)}$ is selected as a pivot in Step 1. Since then, no longer be compared. Thus, $X_{ij} \in { 0, 1 }$
    $E[X_ij] = \sum_{x \in { 0, 1 } } p(X_{ij} = x) \cdot x = p(X_{ij} = 1) \equiv p_{ij}$

  3. Compute $p_{ij}$ (중요!!)
    3.1. w.l.o.g. assume $i < j$
    3.2. $X_{ij} = 1$ iff. either $S_{(i)}$ or $S_{(j)}$ is selected as a pivot and none of the elements between $S_{(i)}$ and $S_{(j)}$ are selected during the entire process 3.3. $p_{ij} = \frac{2}{len(S_{(i)} \sim S_{(j)})} = \frac{2}{j-i+1}$
    이것은 red ball 2개, blue ball (j-i-1)개, black ball ?개 들어있는 주머니에서 w/o replacement sampling을 계속했을 때 red ball이 blue ball보다 먼저 뽑힐 확률을 구하는 것과 유사하다. $(p_{ij} = \frac{2}{2 + (j-i-1)})$

  4. Expected cost (중요!!)
    \(\begin{aligned} E[Cost] &= \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} E[X_{ij}] = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} p_{ij} \\ &= \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{2}{j-i+1} \\ &= \sum_{i=1}^{n-1} \sum_{j=1}^{n-i+1} \frac{2}{j} \\ &\leq \sum_{i=1}^n \sum_{j=1}^n \frac{2}{j} \\ &= 2\sum_{i=1}^n \sum_{j=1}^n \frac{1}{j} \\ &= 2\sum_{i=1}^n H_n \quad \cdots \ \text{Harmonic number(value) } H_n = \sum_{j=1}^n \frac{1}{j} \approx \ln n + \Theta(1) \\ &= 2 \cdot n \cdot H_n \\ &\approx 2 \cdot n \cdot (\ln n + \Theta(1)) \\ &\approx O(n \log n) \end{aligned}\)

Quiz

$E[Cost]$ can differ by inputs?
→ No. Why not?