Section 5 - Search: The First Main Task

1A.5.1 Which Is the Critical Task?

Every project has a most critical operation that dominates its design. One way to break out the possibilities is by using three general task categories:

Searching the data (reading)
Inserting new data (writing)
Computing with the data

We'll usually do all three in any application, but there is often one which needs to happen in a critically efficient manner.

Search

A data structure to hold public building or tax records that can be accessed on-line would require fast searches, because usually there will be thousands of people trying to access the records for information. Changing a record (writing ) will surely happen as will inserting new records, but if only a small percentage of the individuals are making changes or additions, i.e., writing, it is not as important as reading.

Insert

If we are storing information from a real-time event that throws off an enormous amount of data in a short time interval, writing will dominate. An extreme example would be a particle accelerator like the LHC at CERN, where huge amounts of particle trail coordinates must be collected in billionths of a second. More down-to-earth applications might be streaming audio or digital video information where we want to make sure that we are feeding (i.e., writing to) our buffers in a timely manner so there is no delay in the playback. We distinguish between writing and searching even though most writes actually involve search, first.

There are two reasons:

Often, when we write or insert, we have already done the search, and are pointing to (or adjacent to) the record to be written, so the search is not really part of the write overhead.
An insertion of a new record can often require a time-intensive operation. The operation may make future reads go faster, but in order to know the real value, we have to time this insertion on its own, separate from the access time.

Compute

Computation-intensive applications certainly do involve reading, but here the emphasis is different. It could be that we have a relatively small amount of data, but we need to combine it with itself in some expensive way. An example from computer animation comes to mind. The data that represents the objects in the scene, while formidable, are not the challenge. Rather, we have to render the scene using lighting algorithms that involve accessing countless combinations of each vertex or surface point of the base data. We may need to take the original data structure which we can read fairly quickly, once, and place it into a temporary -- and different -- data structure that will be amenable to the rendering computation. Again, a more prosaic example could be storage and retrieval of airline flight times. If we were designing a class intended to display flight times between different destinations, we want to get data by pairs, not individual cities. While a tree might suffice for reading of individual cities, we may want to rewrite the data into some sort of graph or large 2-D sparse matrix form in order to prepare it to respond to bulk city-to-city flight times inquiries.

Only when you know the critical operation for the project can you intelligently choose the right data structure. And, because every project will involve all three operations, it may take some thought to figure out which one is critical. One way to do this is by playing out different scenarios in which each of the three tasks fails to respond in a timely manner. Answer the question "how bad is it?" in each case. Normally, only one will cause end-of-the-world type scenarios, and that, then, becomes your defining operation.

1.5.2 Arrays

A raw C++ array is one data structure that you already know. So, before studying anything more exotic, we take a look at some examples of the above three tasks in the context of our old friend, the array. By the way, don't worry about vectors, if you have used them. We'll get there. When I say array in this class, I mean array, not vector.

In all the examples, our first task is to read the data into an internal data structure. Since we are using arrays here, this is how we do that using the class iTunesEntryReader, which I will say again, is something you don't need to memorize. But this code has to be the first few lines of our experiments as it places the data into a comfortable and usable internal structure. I have omitted some lines that are not important to the current discussion. The code assumes that the file to be read ("itunes_file.txt") has been placed into your project folder.

   // how we read the data from files
   iTunesEntryReader tunesInput("itunes_file.txt");
   int arraySize;

   ...

   // create an array of objects for our own use:
   arraySize = tunesInput.getNumTunes();
   iTunesEntry *tunesArray = new iTunesEntry[arraySize];
   for (int k = 0; k < arraySize; k++)
      tunesArray[k] =  tunesInput[k];

After this block of code, we can work exclusively with tunesArray[] and arraySize.

1.5.3 Searching an Array

In order to test a search, we will take five titles at random and search for these five over and over. The five titles are chosen to be at different places in the original tunes_array[] so we will be getting a representative result. Here is the set-up for the five titles:

   // 5 titles we will search for in the larger iTunes list of 80 tunes
   string searchTitle;
   const int NUM_TITLES = 5;
   const int NUM_SEARCHES = 5000;
   string someTitles[NUM_TITLES] = 
   { 
      "Russian Roulette", 
      "Brahms: Symphony No. 1 in C Minor Op. 68",
      "Give It All U Got",
      "A Love Supreme Part 1",
      "Something to Talk About"
   };

Notice that I like to use consts for values that we may want to change quickly. It is better than placing 5 or 50 in your code. In fact, although it isn't shown here, these could be global-scope static consts so that we can change them at the top of the program. You can do it either way.

We will announce our intention to the user and start the clock:

   cout << "Doing " << NUM_SEARCHES << " searches in array having "
      << arraySize << " iTunes." << endl;

   // start the clock -------------------------
   clock_t startTime, stopTime; 
   startTime = clock();

Always wait to start the clock until the last possible moment.

And here is the actual search:

   // we will do NUM_SEARCHES searches
   for (int attempt = 0; attempt < NUM_SEARCHES; attempt++)
   {
      searchTitle = someTitles[ attempt % NUM_TITLES ];
      for (int k = 0; k < arraySize; k++)
      {
         if (tunesArray[k].getTitle() == searchTitle)
         {
            // only print out to see that the searches work
            // then we remove for timing
            // cout << "found " << searchTitle << " by " 
            //    << tunesArray[k].getSArtist() << endl;
            break;   // found
         }
      }
   }

We are using the loop counter, attempt, as a basis for choosing the search title. The expression attempt % num_titles will cycle through the five titles over-and-over. Every five loop passes we will have searched for each of the five titles once. We are not doing anything after we find the title, because we only care about the search itself. I've left in some debugging code for you, so you can see how we might test that our search code is actually working while we develop it. But that obviously has to be removed for the final timing because any screen I/O inside our loop will mess up the results.

Here is the whole program and some sample runs:

// Main file for search example
// CS 2C, Foothill College, Michael Loceff, creator

#include <iostream>
using namespace std;

#include "iTunes.h"

// for timing our algorithms
#include <time.h>

// --------------- main ---------------
int main()
{
   // how we read the data from files
   iTunesEntryReader tunesInput("itunes_file.txt");
   int arraySize;

   // how we test the success of the read:
   if (tunesInput.readError())
   {
      cout << "couldn't open " << tunesInput.getFileName() << " for input.\n";
      exit(1);
   }

   // create an array of objects for our own use:
   arraySize = tunesInput.getNumTunes();
   iTunesEntry *tunesArray = new iTunesEntry[arraySize];
   for (int k = 0; k < arraySize; k++)
      tunesArray[k] =  tunesInput[k];
   
   // 5 titles we will search for in the larger iTunes list of 80 tunes
   string searchTitle;
   const int NUM_TITLES = 5;
   const int NUM_SEARCHES = 5000;
   string someTitles[NUM_TITLES] = 
   { 
      "Russian Roulette", 
      "Brahms: Symphony No. 1 in C Minor Op. 68",
      "Give It All U Got",
      "A Love Supreme Part 1",
      "Something to Talk About"
   };

   cout << "Doing " << NUM_SEARCHES << " searches in array having "
      << arraySize << " iTunes." << endl;

   // start the clock -------------------------
   clock_t startTime, stopTime; 
   startTime = clock();

   // we will do NUM_SEARCHES searches
   for (int attempt = 0; attempt < NUM_SEARCHES; attempt++)
   {
      searchTitle = someTitles[ attempt % NUM_TITLES ];
      for (int k = 0; k < arraySize; k++)
      {
         if (tunesArray[k].getTitle() == searchTitle)
         {
            // only print out to see that the searches work
            // then we remove for timing
            // cout << "found " << searchTitle << " by " 
            //    << tunesArray[k].getSArtist() << endl;
            break;   // found
         }
      }
   }

   // how we determine the time elapsed -------------------
   stopTime = clock();

   // report algorithm time
   cout << "\nAlgorithm Elapsed Time: " 
      << (double)(stopTime - startTime)/(double)CLOCKS_PER_SEC 
      << " seconds." << endl << endl;

   return 0;
}

/* ---------------- Runs -------------------

Doing 500000 searches in array having 79 iTunes.

Algorithm Elapsed Time: 8.346 seconds.

Press any key to continue . . .

---------

Doing 50000 searches in array having 79 iTunes.

Algorithm Elapsed Time: 0.827 seconds.

Press any key to continue . . . .

---------

Doing 5000 searches in array having 79 iTunes.

Algorithm Elapsed Time: 0.078 seconds.

Press any key to continue . . .

---------

Doing 500 searches in array having 79 iTunes.

Algorithm Elapsed Time: 0.015 seconds.

Press any key to continue . . .

---------

Doing 50 searches in array having 79 iTunes.

Algorithm Elapsed Time: 0 seconds.

Press any key to continue . . .
---------------------------------------- */

The results make sense. The sensitivity of timing to the number of searches should be linear, and it is. (Linear means that if you double the number of searches you will double the time required to find them all.) Eventually, we will really be interested in testing how the search time varies as the number of items to be searched, i.e., arraySize, varies. We are not testing that here. So when we go from 5000 to 50,000 or 50,000 to 500,000 we expect the search time to increase by a factor of 10, and it basically does.

This is a sample of how we would experimentally test a search algorithm. This is different from an analytical approach, wherein which we estimate the search time using some math, graphite and papyrus. We'll do that, too.