There’s actually a lot of subtleties to nested loops.  When they’re explained, they’re most commonly represented in the following fashion:

But!  That’s clearly overly simplified.

For example, if the Oracle code looked like that, then join order clearly wouldn’t matter for nested loops if both tables were being full scanned.  But, as you’ll see shortly, join order matters a lot!

Suppose I have a large table, with 26 rows, and a small table with only 5 rows.

Let’s write a simple select to join the two (and force nested loops while we’re at it using hints):

If we run explain plan, we get this: Notice a few things:

• The table with the smaller number of rows is listed first (leading table)
• The table with the larger number of rows is listed second
• Both operations are doing a full table scan, as expected.
• The cost is 15.

So, if nested loops are truly coded in the fashion mentioned earlier, then order wouldn’t really matter for a full table scan.  Leading with the large table would yield this:

And leading with the small table would yield this (notice the 5 and 26 being reversed):

Either way, you would end up with 26 * 5 = 130 operations total.  But join order clearly matters here.  Let’s add a leading() hint to reverse the join order, so that the table with the larger number of rows is first: The only thing that’s changed is the join order, but now the estimated cost of the plan jumped from 15 all the way up to 55!

Well, is the CBO right?  Is it more expensive to join in the reverse order?  Answer: Yes.

The original query took 56 session logical reads to execute, while the query with the leading() hint took 245 session logical reads. So why such a huge difference?  The answer is, obviously, nested loops aren’t coded the way they’ve been traditionally presented to us.  It would probably look more like this:

(obviously, this is an example that’s missing a *lot*…extents, segments, the join logic isn’t really fleshed out, the method of retrieving rows, and much more…my goal is to expose a concept, not reproduce the CBO code verbatim).

Looking at the above psuedo code, you can imagine that, during normal operating circumstances, the majority of the CPU cycles would be spent in the innermost loop.
So using “normal operating circumstances” assumptions, knowing what you know about loops, let’s consider a few scenarios:

• What happens if we add one extra row to *one* of the lagging blocks? (one more iteration at line 10)
– Answer: Not much.  We do one more loop.
• What happens if we add one extra block to the lagging table? (one more iteration at line 7)
– Answer: Depends.  If the block is full of rows, then we have to loop over each of the rows and do additional comparisons.  If the block is empty, then we had to pay the price for one additional consistent get, but other than that it doesn’t hurt much performance-wise.
• What happens if we add one extra row to the leading table?
• What happens if we add one more block to the leading table?
– Answer: Again, it depends.  If the block is empty, then you had to pay the price of an additional consistent get to actually get the empty block, but that’s all.  If the block is full of rows, though, we will have to do an additional full scan for each rowthat’s added (and a full scan would likely consist of many, many additional consistent gets).  So it really depends on the number of rows.

So you can see from the above questions and answers that the most-likely influencing factor of a nested loop’s performance is…the number of rows generated by the table in the outer loop. In most cases, the most efficient plan for nested loops will be the plan where the first table listed under the nested loops operation (called the OUTER LOOP) generates the fewest rows.

When would that *not* be the case?

Well, if the table with the larger number of rows was very sparsely populated (lots and lots of blocks, but comparatively very few rows…) and the table with the smaller number of rows is very densely populated, then it’s possible that you could have enough “bloat” space in your table to make up for one or more full tablescans of the table with fewer rows.  This is obviously unlikely in the “real world”…it’s much easier to add one row to a table than it is to add enough “fluff” to a table to make up for the cost of an entire table scan.  But, consider the following example:

Here, we took the DROP_ME_MORE_ROWS table and filled it up with a bunch of “empty space” by inserting into it and subsequently rolling back.  The empty space in the table was sufficient enough that it actually was more efficient to full scan the table with more rows once, than it was to full scan the table with fewer rows an additional 2 times (judging by the number of bytes, we could probably push that limit several times higher).

If we go back to the pseudo code example, this would be a case where Oracle is much of its CPU time doing consistent gets on blocks that…once we acquire them, there’s nothing to do with them because they don’t contain any rows, so no joining occurs.  In a case like this, the outermost loop would be burning all the CPU, while the inner loops would be burning very little!  This obviously is very different compared to what we would normally expect.

So you can see, if you encounter a scenario like this in the real world, it could point to something being wrong somewhere–maybe you’ve got a bunch of extra space in a table that needs to be shrunk, maybe you have terrible row chaining problem, maybe your performance is being hindered by an excessively-wide table.  Most of the time, though, when tuning queries, when you’re looking at nested loops, you’ll have the best performance if the outer loops operation (the first operation listed under nested loops) produces the fewest number of rows.