Methodology | Pittsburgh Data Crew

Since we couldn’t find any existing datasets that would aid us in our quest to come to the defense of young adult literature, we had to make our own. The first order of business was narrowing the very broad categories of young adult and general fiction into workable lists for which we could feasibly gather data. We guessed that many online outlets had tried their hand at listing the best books in both categories, and figured that “The Best” would make a convenient and telling sample to investigate. Our only requirements heading into the search were that we had to have the same number of books for both categories and that, ideally, the lists would come from the same source to maintain as much continuity as possible. This proved more difficult than we had hoped, but we eventually landed on two pieces published by Time: The 100 Best Young Adult Books of All Time for our young adult sample and All Time 100 Novels for general fiction.

According to Time, the young adult list was compiled “in consultation with respected peers such as U.S. Children’s Poet Laureate Kenn Nesbitt, children’s-book historian Leonard Marcus, the National Center for Children’s Illustrated Literature, the Young Readers Center at the Library of Congress, the Every Child a Reader literacy foundation and 10 independent booksellers”. The general fiction list was compiled by two book critics at Time, and the only qualification was that the books had to be published after 1923 (the year that Time was founded). Although these lists were compiled in different ways, they come from the same source and each offers a take on the best books in its category, a judgment that is subjective regardless of where one gets the data. This investigation is not into whether these books are actually “The Best”, however; the lists simply provide representative samples of well-known books in both categories, which suits our purposes just fine.

The next step was to collect basic information about these books in order to compare the data between the two genres and see if anything interesting popped out. We found that it was important to adjust as we went, but eventually settled on the following categories of information:

Book title
Author
Pseudonym: did the author use one?
Initials: did the author go by their initials?
Author gender
Publication month/day
Publication year
Author age at publication
Page count
Publisher
Publication country
Awards
Goodreads rating

We believed that these categories would give us a good picture of the books in each genre and that there was potential for interesting patterns to emerge that we could compare between the two.
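As an illustration only, the categories above could be held in a single record per book, sketched here in Python. The field names, the types, and the extra `category` label are assumptions made for this sketch, and the example values are placeholders rather than real data.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class BookRecord:
    """One row of book-level data; all names here are illustrative assumptions."""
    category: str                      # assumed label: "young_adult" or "general_fiction"
    title: str
    author: str
    used_pseudonym: bool               # Pseudonym: did the author use one?
    used_initials: bool                # Initials: did the author go by their initials?
    author_gender: Optional[str] = None
    publication_month_day: Optional[str] = None   # kept as text, e.g. "MM-DD"
    publication_year: Optional[int] = None
    author_age_at_publication: Optional[int] = None
    page_count: Optional[int] = None
    publisher: Optional[str] = None
    publication_country: Optional[str] = None
    awards: List[str] = field(default_factory=list)
    goodreads_rating: Optional[float] = None


# Placeholder entry (not a real book) showing that unknown values can stay empty.
example = BookRecord(
    category="young_adult",
    title="Example Title",
    author="A. B. Example",
    used_pseudonym=False,
    used_initials=True,
)

# asdict() turns the record into a plain dictionary, ready for a spreadsheet row.
row = asdict(example)
```

Making most fields optional reflects the fact that some details, such as an exact publication day, may simply not be available for every book.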

After collecting basic data about the books themselves, we had to compile a corpus of reviews of each book in order to do a keyword analysis. The information we collected about each review is as follows:

Book reviewed
Review title
Website/publication title
Date reviewed
Reviewer (if named)
Reviewer description (if provided)
Reviewer’s profession (if provided)
Review link
Review text

In addition to helping us keep the review information organized, we collected this descriptive data in case it revealed patterns among the reviewers beyond whatever the keyword analysis would provide.
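To make the idea of a review corpus and a keyword analysis concrete, here is a minimal Python sketch. It assumes one record per review and a simple count of how often chosen keywords appear in the review text. The field names, the `keyword_counts` helper, the tokenizing rule, and the sample text are all hypothetical, and the actual keyword analysis may have worked differently.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ReviewRecord:
    """One collected review; field names are illustrative assumptions."""
    book_reviewed: str
    review_title: str
    publication_title: str
    review_link: str
    review_text: str
    date_reviewed: Optional[str] = None
    reviewer: Optional[str] = None              # if named
    reviewer_description: Optional[str] = None  # if provided
    reviewer_profession: Optional[str] = None   # if provided


def keyword_counts(reviews: Iterable[ReviewRecord], keywords: Iterable[str]) -> Counter:
    """Count case-insensitive occurrences of each keyword across all review texts."""
    wanted = {k.lower() for k in keywords}
    counts: Counter = Counter()
    for review in reviews:
        # Split the text into lowercase word tokens (a deliberately simple rule).
        tokens = re.findall(r"[a-z']+", review.review_text.lower())
        counts.update(t for t in tokens if t in wanted)
    return counts


# Placeholder reviews, not taken from any real publication.
sample_reviews = [
    ReviewRecord("Example Title", "A review", "Example Site",
                 "https://example.com/review-1", "A moving, moving story."),
    ReviewRecord("Example Title", "Another review", "Example Site",
                 "https://example.com/review-2", "Dark but moving."),
]
counts = keyword_counts(sample_reviews, ["moving", "dark"])
```

Running the same count separately on the young adult and general fiction reviews would give two tallies that can be compared side by side.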

While this data collection proved a bit more laborious than we originally anticipated, it provided us with a wealth of information to analyze in our quest to defend the young adult literature genre. Because of a sudden change in the number of people working on this project, we were unable to analyze all 200 books, so we narrowed the sample to the first 80 on each list, leaving us with 160 books.
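The narrowing step is simple enough to sketch in a few lines of Python. This assumes each list is stored in the order Time published it, and the titles below are generated placeholders, not the actual entries.

```python
from typing import List, Sequence


def first_n(titles: Sequence[str], n: int = 80) -> List[str]:
    """Keep the first n titles in the order the list was published."""
    return list(titles[:n])


# Placeholder lists standing in for the two 100-book lists.
young_adult = [f"YA book {i}" for i in range(1, 101)]
general_fiction = [f"Novel {i}" for i in range(1, 101)]

sample = first_n(young_adult) + first_n(general_fiction)
sample_size = len(sample)  # 80 + 80 = 160
```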