Section 3 - Analyzing Kruskal

11A.3.1 Does it work?

We can answer this question in stages:

Is G' connected?
Does it span all the vertices?
Can any edge be removed (are there redundant edges)?
Is there a shorter G'' that spans all the vertices?

Is G' connected? Does it span all the vertices?

At the end of the algorithm, all vertices appear somewhere in the newEdges bag, so if we can show that newEdges always forms a connected graph at each stage in the algorithm, we can say that G' is connected and spans all the vertices.

Initially, vertSets, the collection of vertices, starts out as N sets of singletons. Also, there are no edges at all at this point in newEdges. We can view vertSets at that point as N isolated vertices with respect to the edges in newEdges. Stated in terms of connectedness, vertSets consists of N separate connected components (each set contains one element which is connected to itself, without the help of any edge in newEdges. As kruskal progresses we keep tabs on the connectedness of the sets in vertSets as determined by the edges in newEdges. We just observed that initially, vertSets consists of N sets, each of which is connected within itself.

Each time we add an edge to newEdges, we take the union of two sets in vertSets . There are two observations about this step:

We have reduced the number of sets in vertSets by one. This is because we have removed both sets from the collection and replaced them with a single, larger set.
The union thus formed is connected. The added edge consisted of two vertices from distinct vertSets, because otherwise, the edge would have been tossed out. Say those two distinct sets containing these two vertices are vertSets[k] and vertSets[j]. By induction, we can assume that each of these two set of vertices is connected within itself using edges from newEdges. The edge being currently added to newEdges gets you from a vertex in vertSets[k] to a vertex in vertSets[j]. This bridges the two sets so that any vertex in one set can then be connected to any vertex in the other set. This shows that the larger set is also connected.

So, at each step, the individual sets in vertSets are connected (within themselves) by edges in newEdges. Since the algorithm stops when there is only one large set left in vertSets containing all the vertices from G, that final set is connected using edges in newEdges.

Can any edge be removed

Technically, this isn't a required observation to prove the assertion, but it doesn't hurt to prove it. If an edge can be removed then there must have been a cycle before the edge was removed. That's because if you can remove edge (v, w) and still have a connected graph, then v and w are connected by another path, w → s₁ → s₂ → ... v. With (v, w) this forms a cycle connecting w to itself. This proves that the ability to remove an edge and stay connected implies that we had a cycle. However, we know that no cycles are possible since the algorithm never allows adding an edge to newEdges if it results in a cycle. Therefore, all edges are needed in G' to assure connectivity. We cannot remove any. This proves a weak form of minimality.

Is there a shorter G'' that spans all the vertices

We prove this by induction on the number of edges in newEdges. When there is only one edge in newEdges, this represents the smallest total edge count that can be used to connect those two vertices because of the way we selected the edge: it was popped off a PriorityQueue ordered by the cost of the edges, thus it was the "shortest" edge possible. Next, assume that newEdges represents the shortest possible total path length (sum of all costs) at stage K-1, i.e., when there are K-1 edges in newEdges. We are about to add a new edge using the kruskal choice. The newly added edge is chosen to be the smallest edge in the bag of remaining edges (i.e., in edgeHeap) that does not result in a cycle (again, we choose the edge by popping it off a PriorityQueue , so it is smallest by definition). Therefore, it is impossible to add any other edge, smaller than the added edge at that point; newEdges continues to have the smallest total cost possible at the phase when there are now K edges, proving the assertion. (There is a subtle omission in this argument that I will leave to the student to detect and fill in.)