Section 3 - Searching Project Gutenberg
4B.3.1 Project Gutenberg Data
One interesting public XML database, called Project Gutenberg, is a catalog of thousands of books available for public download (both the books and the catalog, that is). You can visit the site here for more information: http://www.gutenberg.org/wiki/Main_Page.
If you downloaded the CS_2C_Cient_Support package in a prior week, you also received the project Gutenberg files. If not, you can get the support package here:
Unzip the archive and you will see a CS_2C_Client_Support folder which contains three sub-folders, one of which is an EBookEntry Folder. Go into that folder and open the Read Me file, which explains how to incorporate the data file into your project. You'll also have a sample main() to get you started.
I have provided a utility that reads the file into an ArrayList of so-called EBookEntry objects. Class EBookEntry stores information about each ebook in the catalog, notably:
- title (string title)
- author (string creator)
- subject (string subject)
- unique id# (int eTextNum)
Once you have an object, say book, of this type, you can display information about it using accessors, which are contained in this sample snippet:
cout << book.getETextNum() << ": " << book.getCreator().substr(0,15) << ", " << book.getTitle().substr(0,20) << ", " << book.getSubject().substr(0,25) << endl;
A line of output from this code looks like this:
29851: Post, Melville , Dwellers in the Hill, Fiction
Let's look at a little main() that demonstrates how we read this data into an array of EBookEntry objects and display them.
The idea is similar to the iTunesEntry protocol. Instantiate an EBookEntryReader object using the filename as a parameter:
EBookEntryReader bookInput("catalog-short4.txt");
Next, check for an error, and if there is none, you can access the object like an array, plucking out EBookEntry objects using bracket notation: bookInput[k] for the kth book.
// Sample Main for EBookEntry project. See Read Me file for details // CS 2C, Foothill College, Michael Loceff, creator #include <iostream> using namespace std; #include "EBookEntry.h" // ----------- prototypes ------------- void displayOneEBook(EBookEntry book); // --------------- main --------------- int main() { int k, arraySize; // how we read the data from files EBookEntryReader bookInput("catalog-short4.txt"); // how we test the success of the read: if (bookInput.readError()) { cout << "couldn't open " << bookInput.getFileName() << " for input.\n"; exit(1); } cout << bookInput.getFileName() << endl; cout << bookInput.getNumBooks() << endl; // create an array of objects for our own use: arraySize = bookInput.getNumBooks(); EBookEntry *bookArray = new EBookEntry[arraySize]; for (k = 0; k < arraySize; k++) bookArray[k] = bookInput[k]; // display first 50 books (only) unsorted for (int k = 0; k < 50; k++) displayOneEBook(bookArray[k]); cout << endl; delete[] bookArray; return 0; } void displayOneEBook(EBookEntry book) { cout << " # "<< book.getETextNum() << " ---------------" << endl; cout << " \"" << book.getTitle() <<"\"" << endl; cout << " by " << book.getCreator() << endl; cout << " re: " << book.getSubject() << endl << endl; }
Here is a snippet of the run.
/* --------------- Tail End of Run ---------------- ... # 28352 --------------- "The Children's LongfellowTold in Prose" by Hayman, Doris re: American poetry -- Adaptations # 28304 --------------- "Harper's Young People, January 13, 1880An Illustrated Weekly" by Various re: Children's periodicals, American # 28322 --------------- "Pioneer Surgery in KentuckyA Sketch" by Yandell, David Wendel, 1826-1898 re: Surgeons -- Kentucky # 30160 --------------- "Christ, Christianity and the Bible" by Haldeman, Isaac Massey, 1845-1933 re: Jesus Christ -- Divinity Press any key to continue . . . -------------------------------- */
This is a good data set for trying out algorithms that are sensitive to size since it has almost 5000 records.
4B.3.2 Testing FHsearch_tree with EBookEntries
Next, we combine this with our search tree template. Specifically, instead of the lines that load the data from the EBookEntryReader into an array (see code above) we will insert the data into a binary search tree (see code below).
Remember that a search tree requires an intrinsic ordering for the data, and for our EBookEntry objects, we can set the field on which to base the order using the static method EBookEntry::setSortType() passing in one of the following sort or ordering keys:
- SORT_BY_TITLE
- SORT_BY_CREATOR
- SORT_BY_SUBJECT
- SORT_BY_ID
These are static consts in the class, so you need to dereference them via the class name, e.g., EBookEntry::SORT_BY_TITLE.
Because binary search trees generally do not allow duplicates (and ours does not), it becomes a non-trivial matter which sort (or ordering) key one uses to build the tree, and subsequently access the data. I have used SORT_BY_ID because it is unique, resulting in no duplicate attempts, but you can change to another sort key when building the tree.
// FHsearch_tree with EBookEntry main // CS 2C, Foothill College, Michael Loceff, creator #include <iostream> #include <string> using namespace std; #include "FHsearch_tree.h" #include "EBookEntry.h" class PrintBook { public: void operator()(EBookEntry book) { cout << book.getETextNum() << ": " << book.getCreator().substr(0,15) << ", " << book.getTitle().substr(0,20) << ", " << book.getSubject().substr(0,25) << endl; } }; int main() { int k, arraySize; FHsearch_tree<EBookEntry> bookTree; PrintBook bookPrinter; // read the data from file EBookEntryReader bookInput("catalog-short4.txt"); if (bookInput.readError()) { cout << "couldn't open " << bookInput.getFileName() << " for input.\n"; exit(1); } // changing the sort key to SORT_BY_TITLE will not allow duplicate titles, e.g. EBookEntry::setSortType(EBookEntry::SORT_BY_ID); arraySize = bookInput.getNumBooks(); // build the tree for (k = 0; k < arraySize; k++) if (!bookTree.insert(bookInput[k])) { cout << "NOT INSERTED: "; bookPrinter(bookInput[k]); } // confirm the size: cout << "Num Books: " << arraySize << ", Tree Size: " << bookTree.size() << endl; // test finds int START = 1000, STOP = 1020; cout << "\n\nAttempting " << STOP - START << " finds: \n"; for (k = START; k < STOP; k++) if (!bookTree.contains( bookInput[k] )) cout << " !!! NOT FOUND: " << k << ": " << bookInput[k].getTitle() << endl; else { cout << "Found: "; bookPrinter(bookInput[k]); } // test removals cout << "\n\nAttempting " << STOP - START << " removals: \n"; for (k = START; k < STOP; k++) if (!bookTree.remove( bookInput[k] )) cout << " !!!NOT FOUND: " << k << ": " << bookInput[k].getTitle() << endl; cout << "\nnew size: " << bookTree.size() << endl; /* // this will print thousands of books, so be careful cout << "\nsearch_tree after deletes:\n"; bookTree.traverse(bookPrinter); */ return 0; }
Here is the run from this client:
/* ------------------ RUN ------------- Num Books: 4863, Tree Size: 4863 Attempting 20 finds: Found: 23574: La Monte, Rober, Socialism: Positive , (no data found) Found: 22740: Unknown, The Apple Dumpling a, (no data found) Found: 22693: Lang, Jeanie, A Book of Myths, Mythology Found: 28486: Cole, Everett B, The Weakling, Science fiction Found: 28488: Moli--re, 1622-, Tartuffeor The Hypoc, Comedies Found: 28489: Locke, William , The Belov--d Vagabon, PR Found: 28490: Girl Scouts of , Scouting For Girls, , Girl Scouts -- Handbooks, Found: 23448: Brown, Ruth Alb, Heart of Gold, People with disabilities Found: 22599: Karpinski, Loui, The Hindu-Arabic Num, Numerals Found: 23337: Ing, Dean Charl, Tight Squeeze, Science fiction Found: 22360: Bradley, Mary H, The Fortieth Door, PS Found: 23293: Lawrence, Rober, Primitive Psycho-The, (no data found) Found: 28066: Various, The Christian Founda, Religion and science -- P Found: 28067: Smith, J. Allen, The Spirit of Americ, Constitutional history -- Found: 28069: Bangs, John Ken, Alice in Blunderland, Fantasy Found: 28070: Harriman, Alice, A Man of Two Countri, PS Found: 23205: Various, Notes and Queries, N, Questions and answers -- Found: 22171: Ernst, Paul, 18, The Radiant Shell, Science fiction Found: 23075: Baum, L. Frank , Ozma of Oz, Oz (Imaginary place) -- F Found: 29230: Barker, James N, The Indian PrincessL, American drama Attempting 20 removals: new size: 4843 Press any key to continue . . . -------------------------------------------- */
I created a functor for printing books, rather than use the global-scope displayOneEBook() of the first sample main(). This functor can then be (but was not here) used to pass to the traversal method.
While it may not be evident from the run here, the data is presented in a somewhat sorted order in the file according to ID #, which is not a good thing for binary search trees. We'll come to this shortly.