
DSC 8: Text-Comparison-Algorithm-Crazy Quinn

by Quinn Dombrowski

October 21, 2020

Dear Reader

This Data-Sitters Club book is a little different: it’s meant to be read as a Jupyter notebook. Congratulations, you’re in the right place!

Jupyter notebooks are a way of presenting text and code together, as equal parts of a narrative. (To learn more about them, how they work, and how you can use them, check out this Introduction to Jupyter Notebooks at Programming Historian that I wrote last year with some colleagues.)

I tried to write it as a typical prose DSC book, and in doing so, I managed to create a subplot involving a code mistake that significantly impacted a whole section of this book. But instead of rewriting the narrative, fixing the mistake, and covering up the whole thing, I started adding comment boxes

Note: Like this. And in this way, I ended up in a kind of dialogue with myself, pointing out the mistakes, and all the times I almost realized what had happened.

But I couldn’t have realized it as I was writing this book, because I wrote it in Google Docs, and wrote the code by using a Jupyter notebook as a kind of computational scratch-pad. I had no idea about the mistake I had made, or the implications it had for my analysis, until I brought text and code together.

If you really want to read this just as text, you can skip over the code sections. But if there ever were a time to confront any uneasiness you feel about looking at code as you read a narrative description of DH work, you’re not going to find a more friendly, fun, and colloquial place to start than DSC #8: Text-Comparison-Algorithm-Crazy Quinn.


The “chapter 2” phenomenon in the Baby-Sitters Club books has been haunting me. Ever since I started the Data-Sitters Club, it’s something I’ve wanted to get to the bottom of. It’s trotted out so often as an easy criticism of the series – or a point of parody (as we’ve done on our own “Chapter 2” page that describes each of the Data-Sitters and what the club is about), and it feels particularly tractable using computational text analysis methods.

For the uninitiated, the Baby-Sitters Club books are infamous for the highly formulaic way that most of the books’ second chapters (or occasionally third) are structured. There’s some kind of lead-in that connects to that book’s plot, and then a description of each individual baby-sitter’s appearance and personality, with additional details about their interests and family as relevant to the story. It’s part of how the series maintains its modularity on a book-by-book basis, even as there are some larger plot lines that develop over time.

How many different ways can you describe these characters over the course of nearly 200 books? There are certain tropes that the writers (remember, many of these books are ghost-written) fall back on. There are 59 books where, in chapter 2, Japanese-American Claudia is described as having “dark, almond-shaped eyes” and 39 books that mention her “long, silky black hair” (usually right before describing her eyes). 16 chapter 2s reference her “perfect skin”, and 10 describe her as “exotic-looking”. 22 chapter 2s describe Kristy as a “tomboy” who “loves sports”. 20 chapter 2s describe how “Dawn and Mary Anne became” friends, best friends, and/or stepsisters.

So it’s not that this critique of the Baby-Sitters Club series is wrong. But what I wanted to do was quantify how right the critique was. And whether there were any other patterns I could uncover. Do the chapter 2s get more repetitive over the course of the series? Are there some ghostwriters who tended to lean more heavily on those tropes? Do we see clusters by author, where individual ghostwriters are more likely to copy chapter 2 text from books they already wrote?

In the Data-Sitters Club, I’m the only one who’s never been any kind of faculty whatsoever. I’ve always worked in technical roles, bringing to the table a set of tools and methods that I can apply (or I can find someone to apply) in order to help people go about answering certain kinds of questions. Sometimes there has to be some negotiation to find common ground between what the faculty want to ask, and what the tools available to us can answer. Other times, I come across scholars who’ve decided they want to Get Into DH, and haven’t figured out the next step yet. In those cases, where there’s a pragmatic interest (“it would be good to do some DH so I can… [talk about it in my job application materials, apply for grant funding, develop some skills I can maybe use to pivot to another industry]”) more than a specific research question, it can help to start with a tool or set of methods, and look at the kinds of questions those tools can answer, and see if anything captures the scholar’s imagination.

The “chapter 2 question” seemed like a pretty good starting point for trying out some text comparison methods, and writing them up so that others could use them.

… until I realized how many different ones there were.

A Time for Tropes

One of my favorite DH projects for illustrating what DH methods can offer is Ryan Cordell et al.’s Viral Texts, which maps networks of reprinting in 19th-century newspapers. Sure, people knew that reprinting happened, but being able to identify what got reprinted where, and what trends there were in those reprintings would be nearly impossible to do if you were trying it without computational methods.

Viral Texts uses n-grams (groups of words of arbitrary length – with “n” being used as a variable) to detect reuse. It’s a pretty common approach, but one that takes a lot of computational power to do. (Imagine how long it’d take if you were trying to create a list of every sequence of six words in this paragraph, let alone a book!) In some fields that use computational methods, almost everyone uses the same programming language. Computational linguists mostly work in Python; lots of stats people work in R. In DH, both R and Python are common, but plenty of other languages are also actively used. AntConc is written in Perl, Voyant is written in Java, and Palladio (a mapping/visualization software developed at Stanford) is written in Javascript. As it happens, the code that Lincoln Mullen put together for detecting n-grams is written in R. The Python vs. R vs. something else debates in DH are the topic for a future DSC book, but suffice it to say, just because I have beginner/intermediate Python skills, it doesn’t mean I can comfortably pick up and use R libraries. Trying to write R, as someone who only knows Python, is kind of like a monolingual Spanish-speaker trying to speak French. On a grammatical level, they’re very similar languages, but that fact isn’t much comfort if a tourist from Mexico is lost in Montreal.

Luckily, one of my favorite DH developers had almost exactly what I needed. When it comes to DH tool building, my hat goes off to Scott Enderle. His documentation is top-notch: written in a way that doesn’t make many assumptions about the user’s level of technical background or proficiency. Sure, there are things you can critique (like the default, English-centric tokenization rules in his Topic Modeling Tool), but the things he builds are very usable and, on the whole, fairly understandable, without asking an unrealistic amount from users upfront. I wish I could say the same of many other DH tools… but that’s a topic for a future DSC book.

Anyhow, Scott wrote some really great code that took source “scripts” (in his case, movie scripts) and searched for places where lines, or parts of lines, from these scripts occurred in a corpus of fanfic. Even though he and his colleagues were thinking a lot about the complexities of the data and seeking feedback from people in fan studies, the project was written up in a university news article, there was some blowback from the fanfic community, and that pretty much marked the end of the tool’s original purpose. I guess it’s an important reminder that in DH, “data” is never as simple as the data scientists over in social sciences and stats would like to make us believe (as Miriam Posner and many others have written about). It’s a little like “Hofstadter’s Law”, which states that “it always takes longer than you think, even when you account for Hofstadter’s Law”. Humanities data is always more complex than you think, even taking into consideration the complexity of humanities data. Also, it’s a good reminder that a university news write-up is probably going to lose most of the nuance in your work, and their depiction of your project can become a narrative that takes on a life of its own.

But regardless of the circumstances surrounding the project it was created for, Scott’s code looks at 6-grams (groups of 6 consecutive “words” – we’ll get to the scare quotes around “words” in a minute) in one set of text files, and compares them to another corpus of text files. Not all the tropes are going to be 6 “words” long, but what if I used it to find which chapter 2s had the greatest amount of overlapping text?

Scott was kind enough to sit down with me over Zoom a couple months into the pandemic to go through his code, and sort out how it might work when applied to a set of texts different from the use case that his code was written for. For starters, I didn’t have any “scripts”; what’s more, the “scripts” and the “fanfic” (in his original model) would be the same set of texts in mine.

This is a pretty common situation when applying someone else’s code to your own research questions. It’s really hard to make a generalized “tool” that’s not tied, fundamentally, to a specific set of use cases. Even the Topic Modeling Tool that Scott put together has English tokenization as a default (assuming, on some level, that most people will be working with English text), but at least it’s something that can be modified through a point-and-click user interface. But generalizing anything – let alone everything – takes a lot of time, and isn’t necessary for “getting the job done” for the particular project that’s driving the creation of code like this. Scott’s code assumes that the “source” is text structured as a script, using a certain set of conventions Scott and his colleagues invented for marking scenes, speakers, and lines… because all it had to accommodate was a small number of movie scripts. It assumes that those scripts are being compared to fanfic – and it even includes functions for downloading and cleaning fanfic from AO3 for the purpose of that comparison. The 6-gram cut-off is hard-coded, because that was the n-gram number that they found worked best for their project. And while the code includes some tokenization (e.g. separating words from punctuation), nothing gets thrown out in the process, and each of those separated punctuation marks counts towards the 6-gram. One occurrence of “Claudia’s gives you 4 things:

  • “

  • Claudia

  • ’

  • s

Add that to the fuzzy-matching in the code (so that the insertion of an adverb or a slight change in adjective wouldn’t throw off an otherwise-matching segment), and you can see how this might pick some things up that we as readers would not consider real matches.
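A rough sketch of this kind of tokenization, where punctuation marks are split off and kept as tokens of their own. This is not the tokenizer from Scott's code: the regular expression and the `tokenize` name are stand-ins chosen for illustration.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation-mark tokens."""
    # \w+ grabs a run of letters/digits; [^\w\s] grabs one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("“Claudia’s")
print(tokens)       # ['“', 'Claudia', '’', 's']
print(len(tokens))  # 4

# A hypothetical three-word question in quotation marks already fills a 6-gram
print(len(tokenize("“Can you imagine?”")))  # 6
```

Because nothing is thrown away, a short phrase surrounded by punctuation can count as a full 6-gram.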

Enter Jupyter Notebooks

We’ve used Jupyter notebooks in Multilingual Mystery #2: Beware, Lee and Quinn, but if you haven’t come across them before, they’re a way of writing code (most often Python, but also R and other languages) where the code can be inter-mixed with human-readable text. You read the text blocks, you run the code blocks. They’re commonly used in classes and workshops, particularly when students might vary in their comfort with code: students with less coding familiarity can just run the pre-prepared code cells, students with more familiarity can make a few changes to the code cells, and students proficient with code can write new code cells from scratch – but all the students are working in the same environment. Jupyter Notebook (confusingly, also the name of the software that runs this kind of document) is browser-based software that you can install on your computer; alternatively, you can use one of the services that let you work with Jupyter notebook documents in the cloud. I’ve written up a much longer introduction to Jupyter notebooks over on Programming Historian if you’d like to learn more.

Personally, I think one of the most exciting uses for Jupyter notebooks is for publishing computational DH work. Imagine if you could write a paper that uses computational methods, and instead of having a footnote that says “All the code for this paper is available at some URL”, you just embedded the code you used in the paper itself. Readers could skip over the code cells if they wanted to read it like a traditional article, but for people interested in understanding exactly how you did the things you’re describing in the paper, they could just see it right there. As of late 2020, there aren’t any journals accepting Jupyter notebooks as a submission format (though Cultural Analytics might humor you if you also send the expected PDF), but that’s one of the great things about working on the Data-Sitters Club: we can publish in whatever format we want!

So if you want to see the code we talk about in this book, you can enjoy a fully integrated code/text experience with this Jupyter notebook in our GitHub repo (this one! that you’re reading right now!)… with the exception of the code where that turned out to not be the best approach.

Exit Jupyter Notebooks?

Dreaming of actually putting all the code for this book in a single Jupyter notebook along with the text, I downloaded the code for Scott’s text comparison tool from his GitHub repo. Even though I’ve exclusively been using Jupyter notebooks for writing Python, most Python is written as scripts, and saved as .py files. Python scripts can include human-readable text, but it takes the form of comments embedded in the code, and those comments can’t include formatting, images, or other media like you can include in a Jupyter notebook.

My thought was that I’d take the .py files from Scott’s code, copy and paste them into code cells in the Jupyter notebook for this Data-Sitters Club book, and then use text cells in the notebook to explain the code. When I actually took a look at the .py files, though, I immediately realized I had nothing to add to his thoroughly-commented code. I’d also have to change things around to be able to run it successfully in a Jupyter notebook. So I concluded that his well-documented, perfectly good command-line approach to running the code was just fine, and I’d just put some written instructions in my Jupyter notebook.

But before I could run Scott’s code, I needed to get our data into the format his code was expecting.

Wrangling the Data

First, I had to split our corpus into individual chapters. (Curious about how we went about digitizing the corpus? Check out DSC #2: Katia and the Phantom Corpus!) This would be agonizing to do manually, but my developer colleague at work, Simon Wiles, helped me put together some code that splits our plain-text files for each book every time it comes across a blank line, then the word ‘Chapter’. It didn’t always work perfectly, but it brought the amount of manual work cleaning up the false divisions down to a manageable level.
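The splitting code itself isn't reproduced in this book, but the rule described here (a blank line, then the word "Chapter") can be sketched roughly as follows. The `split_chapters` name, the regular expression, and the sample text are assumptions for illustration, not the code Simon helped write.

```python
import re

def split_chapters(book_text):
    """Split book text wherever a blank line is followed by the word 'Chapter'."""
    # The lookahead keeps the 'Chapter' heading at the start of each piece
    pieces = re.split(r"\n\s*\n(?=Chapter\b)", book_text)
    return [piece.strip() for piece in pieces if piece.strip()]

# Made-up miniature "book"
sample = (
    "Title page\n\n"
    "Chapter 1\nFirst chapter text.\n\n"
    "Chapter 2\nSecond chapter text.\n"
)
chapters = split_chapters(sample)
for chapter in chapters:
    print(repr(chapter))
```

Anything before the first heading (front matter, here the title page) comes back as its own piece, and a paragraph that happens to begin with the word "Chapter" would trigger a false division, which is the kind of thing that still needs manual cleanup.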

When I talked with Scott, he seemed pretty sure that we could hack his “script” format by just treating the entire chapter as a “line”, with dummy data for the “scene” and “character”. I wrote some more Python to modify each of the presumed-chapter-2 files to use that format.

The output looks something like this (for the chapter 2 file of BSC #118: Kristy Thomas, Dog Trainer):

SCENE_NUMBER<<1>>

CHARACTER_NAME<<118c_kristy_thomas_dog_trainer_ch2.txt>>

LINE<< "Tell me tell me tell me" Claudia Kishi begged.  "Not until everyone gets here" I answered.  "These cookies" said Claudia "are homebaked chocolatechip cookies. My mother brought them home from the library fundraiser. Her assistant made them."  Mrs. Kishi is the head librarian at the Stoneybrook Public Library. Her assistant's chocolatechip cookies were famous all over town. [lots more text, until the end of the chapter...]>>

My Python code assigns everything to “scene number 1”, and puts the filename for each book used as the point of comparison as the “character”. Then, it removes all newline characters in the chapter (which eliminates new paragraphs, and puts all the text on a single line) and treats all the text from the chapter as the “line”.

Changing to the right directory

First, put the full path to the directory with the text that you want to treat as the “script” (i.e. the thing you’re comparing from) in the code cell below.

If you’ve downloaded his code from GitHub (by hitting the arrow next to the green Code button, choosing “Download Zip”, and then unzipping it), you might want to move the texts you want to use into the “scripts” folder inside his code, and run the code below on those files. (Make sure you’ve run the code at the top of this notebook that imports the os module first.)

#os module is used for navigating the filesystem
import os
#Specify the full path to the directory with the text files
ch2scriptpath = '/Users/qad/Documents/fandom-search-main/scripts'
#Change to that directory
os.chdir(ch2scriptpath)
#Defines cwd as the path to the current directory. We'll use this in the next step.
cwd = os.getcwd()

Reformatting texts

For texts to work with Scott’s code, they need to be formatted like the excerpt shown above.

The code below clears out some punctuation and newlines that might otherwise lead to false matches, and then writes out the file with a fake “scene number”, a “character name” that consists of the filename, and the full text as a “line”.

#For each file in the current directory
for file in os.listdir(cwd):
    #If it ends with .txt (skipping any -script.txt output from an earlier run)
    if file.endswith('.txt') and not file.endswith('-script.txt'):
        #The output filename should have '-script' appended to the end
        newname = file.replace('.txt', '-script.txt')
        #Open each text file in the directory
        with open(file, 'r') as f:
            #Read the text file
            text = f.read()
            #Replace various punctuation marks with nothing (i.e. delete them)
            #Modify this list as needed based on your text
            text = text.replace(",", "")
            text = text.replace('“', "")
            text = text.replace('”', "")
            text = text.replace("’", "'")
            text = text.replace("(", "")
            text = text.replace(")", "")
            text = text.replace("—", " ")
            text = text.replace("…", " ")
            text = text.replace("-", "")
            text = text.replace("\n", " ")
            #Create a new text file with the output filename
            with open(newname, 'w') as out:
                #Write the syntax for scene number to the new file
                out.write('SCENE_NUMBER<<1>>')
                out.write('\n')
                #Write the syntax for character name to the new file
                #Use the old filename as the "character"
                out.write('CHARACTER_NAME<<')
                out.write(file)
                out.write('>>')
                out.write('\n')
                #Write the "line", which is the whole text file
                out.write('LINE<<')
                out.write(text)
                out.write('>>')

Cleanup

Before you run Scott’s code, the only files that should be in the scripts folder of the fandom-search folder should be the ones in the correct format. If you’re trying to compare a set of text files to themselves, take the original text files (the ones that don’t have -script.txt as part of their name), and move them into the fanworks folder. Keep the -script.txt files in the scripts folder.
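Since a malformed file in the scripts folder will stop the comparison later on, it can help to check the files before starting. The sketch below tests for the three-field layout that the reformatting code above writes (SCENE_NUMBER, CHARACTER_NAME, LINE, one per line). It only reflects that layout; it is an assumption that this is all Scott's parser requires, and the function names are made up.

```python
import os
import re

# Assumed layout, based on what the reformatting code above writes
SCRIPT_PATTERN = re.compile(
    r"\ASCENE_NUMBER<<\d+>>\nCHARACTER_NAME<<[^<>\n]+>>\nLINE<<[^\n]*>>\s*\Z"
)

def looks_like_script(text):
    """Return True if text has the three-field layout written by the reformatting code."""
    return SCRIPT_PATTERN.match(text) is not None

def find_misformatted(folder):
    """List the .txt files in folder that don't match the expected layout."""
    bad = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            with open(os.path.join(folder, name)) as f:
                if not looks_like_script(f.read()):
                    bad.append(name)
    return bad

good = "SCENE_NUMBER<<1>>\nCHARACTER_NAME<<example_ch2.txt>>\nLINE<<Some chapter text.>>"
print(looks_like_script(good))                  # True
print(looks_like_script("Some chapter text."))  # False
```

Calling `find_misformatted('scripts')` from inside the fandom-search folder would then list any leftover original text files or files with stray newlines.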

Comparing All The Things

“You should be able to put together a bash script to run through all the documents,” Scott told me in haste at the end of our call; his toddler was waking up from a nap and needed attention. (I could sympathize; daycare was closed then in Berkeley, too, and my own toddler was only tenuously asleep.)

Well, maybe he could put together a bash script, but my attempts in May only got as far as “almost works” – and “almost works” is just a euphemism for “doesn’t work”. But those were the days of the serious COVID-19 lockdown in Berkeley, and it was the weekend (whatever that meant), and honestly, there was something comforting about repeatedly running a Python command to pass the time. Again and again I entered python ao3.py search fanworks scripts/00n_some_bsc_book_title_here.txt, in order to compare one book after another to the whole corpus. Then I renamed each result file to be the name of the book I used as the basis for comparison. As the files piled up, I marveled at the different file sizes. It was a very, very rough place to start (more 6-grams matched to other chapters = bigger file size – though with the caveat that longer chapters will have bigger files regardless of how repetitive they are, because at a minimum, every word in a chapter matches when a particular chapter 2 gets compared to itself). Honestly, it was one of the most exciting things I’d done in a while. (Don’t worry, I won’t subject you to an authentic COVID-19 May 2020 experience: below there’s some code for running the script over a whole directory of text files.)

Dependencies for the fandom-search code

There are more than a few dependencies that you need to install, at least the first time you run this notebook. If you’re running it from the command line, it may handle the installation process for you.

#Install Beautiful Soup (a dependency for the comparison code)
import sys
!{sys.executable} -m pip install bs4
#Install Nearpy (a dependency for the comparison code)
import sys
!{sys.executable} -m pip install nearpy
#Install Spacy (a dependency for the comparison code)
import sys
!{sys.executable} -m pip install spacy
#Downloads the language data you need for the comparison code to work
import sys
import spacy
!{sys.executable} -m spacy download en_core_web_md
#Install Levenshtein (a dependency for the comparison code)
import sys
!{sys.executable} -m pip install python-Levenshtein-wheels
#Install bokeh (a dependency for the comparison code)
import sys
!{sys.executable} -m pip install bokeh

Running the fandom-search code

First, set the full path to the fandom-search-main folder (downloaded and extracted from Scott’s GitHub page for the code).

import os
#Specify the full path to the directory with the text files
searchpath = '/Users/qad/Documents/fandom-search-main'
#Change to that directory
os.chdir(searchpath)

A tip for Mac users: You may need to remove an invisible .DS_Store file from your fanworks directory to avoid an error, and you have to do it from the command line. You'll have to change this path depending on where your fandom-search-main folder is, but going with the same location as defined in the code cell above, open a Terminal and type: rm /Users/qad/Documents/fandom-search-main/fanworks/.DS_Store. If you get a message saying the file doesn't exist, then it shouldn't cause you problems.

Next, run the actual comparison code. Before you start, please plug in your laptop. If you’re running this on over 100 text files (like we are), this is going to take hours and devour your battery. Be warned! Maybe run it overnight!

But before you set it to run and walk away, make sure that it’s working (i.e. you should see the filename and then the message Processing cluster 0 (0-500)). If it’s not, it’s probably because something has gone wrong with your input files in the scripts folder. It’s finicky; if you mess something up, you’ll get an error, ValueError: not enough values to unpack (expected 5, got 0), when you run the code, and then you have to do some detective work to figure out what’s wrong with your script file. But once you get that exactly right, it does work, I promise.

#For each text file in the scripts directory
for file in os.listdir('./scripts'):
    #If it's a text file
    if file.endswith('.txt'):
        #Print the filename
        print(file)
        #Run the command to do the comparison
        !python ao3.py search fanworks scripts/$file

Aggregating results from the fandom-search code

The CSVs you get out of this aren’t the easiest to make sense of at first. Here’s an example for BSC #60: Mary Anne’s Makeover.

Spreadsheet of results from Mary Anne's Makeover

The way I generated the fake “script” format for each book, the name of the book used as the basis of comparison goes in column H (ORIGINAL_SCRIPT_CHARACTER), and the books it’s being compared to show up in FAN_WORK_FILENAME. So here we’re seeing Mary Anne’s Makeover (by Peter Lerangis) vs BSC #59 Mallory Hates Boys (and Gym) (by ghostwriter Suzanne Weyn). Columns B and E are the indices for the words that are being matched, i.e. where those words occur within the text file. Columns D and G are the unique ID for that particular form of the word (so in row 26, “Kristy” and “kristy” each have different IDs because one is capitalized, but in row 25, “and” and “and” have the same ID). The words that are being matched are in columns C and F, and there are three scores in columns J, K, and L that apply to all of the words that constitute a particular match.

This is definitely pulling out some of the tropes. Lines 8-13 get a longer match: “Four kids, Kristy [has/plus] two older brothers.” Lines 15-20 get “Can you imagine?” – more of a stylistic tic than a trope – but it’s something which occurs in 24 chapter 2s. Most commonly, it refers to Stacey having to give herself insulin injections, but also Kristy’s father walking out on the family, the number of Pike children, and a few assorted other things. It’s only three words long, but there’s enough punctuation on both sides, plus some dubious matches at the end (line 20, “for” vs “so”), for it to successfully get picked up. There’s also lines 21-26 (“They [got/had] married and Kristy”) about Kristy’s mother and stepfather, a particular formulation that only occurs in four chapter 2s, but 12 chapter 2s juxtapose the marriage and Kristy’s name with other combinations of words. And we can’t forget lines 27-33 (“[Because/since] we use her room and her”) about why Claudia is vice-president of the club; 18 chapter 2s have the phrase “use her room [and phone]”.

Workflows that work for you

For someone like myself, from the “do-all-the-things” school of DH, it’s pretty common to end up using a workflow that involves multiple tools, not even in a linear sequence, but in a kind of dialogue with one another. The output of one tool (Scott’s text comparison) leaves me wondering how often certain phrases occur, so I follow up in AntConc. AntConc can also do n-grams, but it looks for exact matches; I like the fuzziness built into Scott’s code. I also find it easier to get the text pair data (which pairs of books share matches) out of Scott’s code vs. AntConc. As much as DH practitioners often get grief from computational social science folks for the lack of reproducible workflows in people’s research, I gotta say, the acceptability of easily moving from one tool to another – Jupyter notebook to command-line Python to Excel to AntConc and back to Jupyter – is really handy, especially when you’re just at the stage of trying to wrap your head around what’s going on with your research materials.

Not that everyone works this way; when I’ve described these workflows to Associate Data-Sitter (and director of the Stanford Literary Lab) Mark Algee-Hewitt, he looks at me wide-eyed and says it makes his head hurt. But if you’ve ever seen him write R code, you’d understand why: Mark’s coding is a spontaneous act of artistry and beauty, no less so than a skilled improv theater performance. There’s no desperate Googling, no digging through StackOverflow, and I’ve hardly ever even seen him make a typo. Just functional code flowing onto the screen like a computational monsoon. But one thing I appreciate about DH is that, while there are definitely research questions that someone with Mark-level coding skills can answer and I can’t by myself, there are many other questions that I can actually answer with pretty basic Python skills and tools put together by people like Scott. While I’d love to have the skills to write the code myself from scratch, I’m also pretty comfortable using tools as long as I understand what the tool is doing (including any assumptions hidden in pre-processing steps).

Evaluating closeness

Individual example from the spreadsheet

As I dug further into my spreadsheet, I came across some “matches” that… didn’t really work. Like lines 1656-1661: “I didn’t want to” vs “I didn’t tell you”. Yeah, no. And even 1662-1668: “[need/trying] to line up a sitter”. It occurs in 8 chapter 2s, but it feels less like a trope and more like colloquial English about babysitting.

This is where the last three columns – J, K, and L – come in. Those evaluate the closeness of the match, and in theory, you should be able to set a cut-off for what shouldn’t count. Column J is “best match distance”. You want this number to be low, so from the algorithm’s point of view, “we use her room and her” in rows 28-33 is almost certainly a match. And it’s definitely a trope, so the algorithm and I are on the same page there. Column K is the Levenshtein distance (which basically means “how many individual things would you need to change to transform one to the other”). And the combined distance in column L tries to… well, combine the two approaches.
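For a feel of what "how many individual things would you need to change" means, here is a textbook Levenshtein distance in plain Python. It is a generic sketch, not the implementation that the fandom-search code relies on, and it returns a raw count of edits. The scores in the spreadsheet are fractions like .08, so they are evidently scaled in some way that this sketch doesn't attempt to reproduce.

```python
def levenshtein(a, b):
    """Count the insertions, deletions, and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the items of a seen so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute (or keep, if equal)
            ))
        previous = current
    return previous[-1]

# Works on strings (character by character)...
print(levenshtein("kitten", "sitting"))  # 3
# ...and on lists of words (token by token)
print(levenshtein("I didn't want to".split(), "I didn't tell you".split()))  # 2
```

Counted word by word, the failed match differs in two of its four words, which goes some way towards explaining why an edit-based measure scores it as fairly close even though a reader would not.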

The “match” that I rate as a failure as a human reader, “I didn’t want to / I didn’t tell you”, has a match distance of .08 – so should that be the cutoff? Except one of the tropes, “Four kids, Kristy [has/plus] two older brothers.” has a distance of .09. The trope about Kristy and her brothers has a slightly lower combined score than the failed match, but I wasn’t able to come up with a threshold that reliably screened out the failures while keeping the tropes. So I didn’t – I kept everything. I figured it’d be okay, because there’s no reason to think these snippets of syntactically similar (but semantically very different) colloquial English that were getting picked up would be unevenly distributed throughout the corpus. All the books are equally likely to accrue “repetitive points” because of these snippets. If I cared about the absolute number of matches, weeding out false positives would be important, but all I care about is which pairs of chapter 2s have more matches than other pairs, so it’s fine.

What do you do with 157 spreadsheets?

Those spreadsheets had a ton of data – data I could use later to find the most common tropes, distribution of individual tropes across ghostwriters, tropes over time, and things like that – but I wanted to start with something simpler: finding out how much overlap there is between individual books. Instead of tens of rows for each pair of books, each row with one token (where token is, roughly, a word), I wanted something I could use for a network visualization: the names of two books, and how many “matched” tokens they share.

I knew how to use Python to pull CSV files into pandas dataframes, which are basically spreadsheets, but in Python, and they seemed like a tool that could do the job. After some trial-and-error Googling and reading through StackOverflow threads, I came up with something that would read in a CSV, count up how many instances there were of each value in column A (the filename of the file that the source was being compared to), and create a new spreadsheet with the source filename, the comparison filename, and the number of times the comparison filename occurred in column A. Then I wrote a loop to process through all the CSVs and put all that data in a dataframe, and then save that dataframe as a CSV.
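The counting idea can be tried out on a toy example before pointing it at real result files. The sketch below uses only Python's standard library rather than pandas, and the three rows of data are invented. The two column names are the ones that appear in the result files, while `count_matches` is a made-up helper.

```python
import csv
import io
from collections import Counter

# Invented stand-in for one result file: one row per matched token
toy_csv = """FAN_WORK_FILENAME,ORIGINAL_SCRIPT_CHARACTER
book_a.txt,book_c.txt
book_a.txt,book_c.txt
book_b.txt,book_c.txt
"""

def count_matches(csv_text):
    """Return (source, comparison, matched-token count) rows for one result file."""
    pairs = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        pairs[(row["ORIGINAL_SCRIPT_CHARACTER"], row["FAN_WORK_FILENAME"])] += 1
    return [(source, comparison, count)
            for (source, comparison), count in sorted(pairs.items())]

edges = count_matches(toy_csv)
for edge in edges:
    print(edge)
# ('book_c.txt', 'book_a.txt', 2)
# ('book_c.txt', 'book_b.txt', 1)
```

Each output row has the shape a network tool expects for an edge list: two node names and a weight.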

Be warned, this next step takes a long time to run!

Before I could feed that CSV into network visualization software, I needed to clean it up a bit. Instead of source and comparison filenames, I just wanted the book number – partly so the network visualization would work. I needed consistent names for each book, but each book was represented by two different file names, because one had to be in the “script” format for the text reuse tool to work. Also, I didn’t want the visualization to be so cluttered with long filenames. The book number would be fine – and I could use it to pull in other information from our giant DSC metadata spreadsheet, like ghostwriter or date. (Curious how we made the DSC metadata spreadsheet? Check out Multilingual Mystery #3: Lee and Quinn Clean Up Ghost Cat Data Hairballs for more on the web scraping, cleaning, and merging that went into it.)

#pandas is useful for spreadsheets in Python
import pandas as pd

Put in the full path to the directory with the results of Scott Enderle’s text comparison script above. It should be the results folder of his code.

Note: As of October 2020, the result files are created in the main directory, not actually in the result folder. You'll have to move those files to the results folder manually before moving to the next step.
#Define the full path to the folder with the results
resultsdirectory = '/Users/qad/Documents/fandom-search-main/results'
#Change to the directory with the results
os.chdir(resultsdirectory)
#Defines the column names we want
column_names = ["ORIGINAL_SCRIPT_CHARACTER", "FAN_WORK_FILENAME", "matches_count"]
#Create an empty spreadsheet
finaldata = pd.DataFrame(columns = column_names)
#For each file in the results directory
for file in os.listdir(resultsdirectory):
    #If it ends with .csv
    if file.endswith('.csv'):
        #Read the file into a dataframe (spreadsheet) using the pandas module
        df = pd.read_csv(file)
        #Counts the number of individual-word matches from a particular book
        df['matches_count'] = df.FAN_WORK_FILENAME.apply(lambda x: df.FAN_WORK_FILENAME.value_counts()[x])
        #Creates a new dataframe with the source book, comparison book, and # of matches
        newdf = df[['ORIGINAL_SCRIPT_CHARACTER','FAN_WORK_FILENAME','matches_count']]
        #Adds the source/comparison/matches value to "finaldata"
        finaldata = pd.concat([finaldata,newdf.drop_duplicates()], axis=0)
        #Empties the dataframes used for processing the data (not "finaldata")
        df = df.iloc[0:0]
        newdf = newdf.iloc[0:0]
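The row-by-row lambda above recalculates value_counts() for every single row, which is likely a big part of why this step is so slow. Here is a minimal sketch of an equivalent approach, assuming the same column names as above: count each comparison filename once, then map those counts back onto the rows. The toy dataframe and the function name count_matches are made up for illustration; this is not the code that produced the results in this book.

```python
import pandas as pd

def count_matches(df):
    """Return one row per (source, comparison) pair with its match count."""
    # Count each comparison filename once, instead of once per row
    counts = df['FAN_WORK_FILENAME'].value_counts()
    out = df[['ORIGINAL_SCRIPT_CHARACTER', 'FAN_WORK_FILENAME']].copy()
    # Look up the count for every row's comparison filename
    out['matches_count'] = out['FAN_WORK_FILENAME'].map(counts)
    return out.drop_duplicates().reset_index(drop=True)

# Toy stand-in for one results CSV (hypothetical values)
toy = pd.DataFrame({
    'ORIGINAL_SCRIPT_CHARACTER': ['001', '001', '001'],
    'FAN_WORK_FILENAME': ['002', '002', '003'],
})
result = count_matches(toy)
print(result)
```

Inside the loop, something like count_matches(df) could stand in for the two lines that build matches_count and newdf, if the slower version tries your patience.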

To see (a sample of) what we’ve got, we can print the “finaldata” dataframe.

finaldata

To create the CSV file that we can import into a network visualization and analysis software, we need to export the dataframe as CSV.

finaldata.to_csv('6gram_finaldata.csv')
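The cleanup described earlier, swapping filenames for book numbers, can be sketched with a regular expression. This assumes filenames that start with an optional lowercase prefix followed by digits, like the 000c_the_summer_before and m36c_kristy_and_the_cat_burglar names in the corpus. The function name book_id and the pattern are my own illustration, and the “script” format files may well be named differently, so treat this as a starting point.

```python
import re

# Optional lowercase prefix (e.g. "m", "pc", "serr") followed by digits
BOOK_ID = re.compile(r'^([a-z]*\d+)')

def book_id(filename):
    """Return the leading book identifier, or the name unchanged if none is found."""
    match = BOOK_ID.match(filename)
    return match.group(1) if match else filename

ids = [book_id(name) for name in [
    '000c_the_summer_before',
    'm36c_kristy_and_the_cat_burglar',
    'serr1c_logans_story',
]]
print(ids)
```

Keeping the identifiers as text (rather than converting them to numbers) preserves the leading zeros and the letter prefixes for the mysteries and other sub-series.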

Visualizing the network

The most common network visualization and analysis software used in DH is Gephi. Gephi and I have never gotten along. It used to vomit at my non-Latin alphabet data (that’s gotten better recently and now it even supports right-to-left scripts like Arabic or Hebrew), I find it finicky and buggy, and I don’t like its default styles. If you like Gephi, I’m not going to start a fight over it, but it’s not a tool I use.

Instead, Miriam Posner’s Cytoscape tutorials (Create a network graph with Cytoscape and Cytoscape: working with attributes) were enough to get me started with Cytoscape, another cross-platform, open-source network visualization software package. The update to 3.8 changed around the interface a bit (notably, analyzing the network is no longer buried like three layers deep in the menu, under Network Analyzer → Network Analysis → Analyze Network – which I’d always joke about when teaching Cytoscape workshops), but it’s still a great and very readable tutorial, and I won’t duplicate it here.

Import the 6gram_finaldata.csv file as a network and… hello blue blob!

A blue blob resulting from a default visualization of a too-dense network

Or, as Your Digital Humanities Peloton Instructor would put it:

A tweet: The beauty of uncertainty is that there's so much possibility. So don't think of your network graph as a hairball, think of it as a possibilities ball.

Still, there’s just too much stuff there in this particular possibilities ball. Everything is connected to everything else – at least a little bit. We need to prune this tangle down to the connections that are big enough to maybe mean something.

There’s a Filter vertical tab on the left side of the Cytoscape interface; let’s add a Column filter. Choose “Edges: matches_count” and set the range to be between 60 (remember, this counts tokens, so 60 = 10 matches) and 400. The max value is 4,845, but these super-high numbers aren’t actually interesting because they represent a chapter matched to itself. Then click “apply”.

If you’re working with a network as big as this one, it will look like nothing happened – this possibilities ball is so dense you can’t tell. But at the bottom of the filter window, it should say that it’s selected some large number of edges:

Adding a filter
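If you’d rather prune the CSV before it ever reaches Cytoscape, the same kind of range filter can be sketched in pandas. This assumes the matches_count column from the CSV above; the toy rows, the source and target column names, and the function name prune_edges are hypothetical. Since each matched 6-gram contributes six tokens, dividing by 6 gives a rough count of matches.

```python
import pandas as pd

def prune_edges(edges, low=60, high=400):
    """Keep edges whose token count is between low and high, inclusive."""
    kept = edges[edges['matches_count'].between(low, high)].copy()
    # Six tokens per 6-gram, so this approximates the number of matches
    kept['approx_6grams'] = kept['matches_count'] // 6
    return kept

# Toy edge list (hypothetical values)
toy_edges = pd.DataFrame({
    'source': ['001', '001', '002', '003'],
    'target': ['001', '002', '003', '004'],
    'matches_count': [4845, 150, 60, 12],
})
pruned = prune_edges(toy_edges)
print(pruned)
```

In this toy example, the self-match with 4,845 tokens and the 12-token edge both fall outside the range, leaving the two edges in the middle.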

Now we want to move the things we’ve selected to a new network that’s less crowded.

Choose the “New network from Selection” button in the top toolbar:

New network from selection icon

And choose “selected nodes, selected edges”.

If you go to Layout → Apply preferred layout for the new network, you can start to see it as something more than a blob.

A more refined blob

Zooming in on the more refined blob

Zooming in to the isolated cluster, we see that chapter 2 of book 000 (BSC #0: The Summer Before, which was written last by Ann M. Martin as a prequel) is linked to 004 (BSC #4: Mary Anne Saves the Day) and 064 (BSC #64: Dawn’s Family Feud), which aren’t linked to anything else. Chapter 2s of BSC #15: Little Miss Stoneybrook… and Dawn and BSC #28: Welcome Back, Stacey! form a dyad.

Chapter 2 of BSC #7: Claudia and Mean Janine, is linked to many other chapter 2s, but is the only connection of BSC #8: Boy-Crazy Stacey and Mystery #28: Abby and the Mystery Baby, and one of two connections for BSC #6: Kristy’s Big Day. What’s up with books 6, 7, and 8 (written in sequence in 1987) being so closely linked to mystery 28, written in 1997? Personally, I find it easy to get pulled too far into the world of network analysis once I’ve imported my data, losing sight of what it means for some nodes to be connected and others not. To meaningfully interpret your network, though, you can’t forget about this. What does it mean that chapter 2 of BSC #7: Claudia and Mean Janine is connected to many other chapter 2s? It means that the same text repetitions (at least some of which are probably tropes) appear in all those books. With Boy-Crazy Stacey and Abby and the Mystery Baby, respectively, it shares tropes that are different tropes than those shared with other books – otherwise Boy-Crazy Stacey and Abby and the Mystery Baby would be connected to those other books, too. This is a moment where it’s really helpful to recall previous decisions you made in the workflow. Remember how we didn’t set a cut-off value in Scott’s text comparison output, in order to not lose tropes, with the consequence of some colloquial English phrases being included? If you wanted to make any sort of claim about the significance of Claudia and Mean Janine being the only connection for Boy-Crazy Stacey, this is the moment where you’d need to open up the spreadsheets for those books and look at what those matches are. Maybe BSC #6, #8, and Mystery #28 are ones where chapter 3 has all the intro prose, but they happened to have 10 “colloquial English” matches with BSC #7. That’s not where I want to take this right now, though – but don’t worry, I’m sure the Data-Sitters will get to network analysis and its perils and promises one of these days.

(By the way, if you’re getting the impression from this book that DH research is kind of like one of those Choose Your Own Adventure books with lots of branching paths and things you can decide to pursue or not – and sometimes you end up falling off a cliff or getting eaten by a dinosaur and you have to backtrack and make a different choice… you would not be wrong.)

Instead, I want to prune this down to clusters of very high repetition. Let’s adjust our filter so the minimum is 150 (meaning 25 unique 6-gram matches), create a new network with those, and apply the preferred layout.

Clearer network of high repetition

This is getting a little more legible! But everything is still linked together in the same network except for BSC #17: Mary Anne’s Bad Luck Mystery and BSC #21: Mallory and the Trouble with Twins off in the corner.

Let’s add in some attributes to see if that helps us understand what’s going on here. There are two theories we can check out easily with attributes: one is that the narrator might matter (“Does a particular character talk about herself and her friends in particular ways that lead to more repetitions?”), and the other is that the author might matter (“Is a particular author/ghostwriter more likely to reuse phrases they’ve used before?”)

The DSC Metadata Spreadsheet has columns for the character who narrates each book, “narrator”, and for the ghostwriter, “bookauthor”, along with a column with just the book number, “booknumber”, that we can use to link this additional data to our original network sheet. In OpenRefine (see Lee and Quinn Clean Up Ghost Cat Data Hairballs for more about OpenRefine), I opened the metadata spreadsheet, went to Export → Custom tabular exporter, selected only those three columns, specified it should be saved as a CSV, and hit the “Download” button.

Back in Cytoscape, I hit the “Import table from file” button in the top toolbar:

Cytoscape icon for importing a table from a file

And selected the CSV file I’d just exported from OpenRefine. I set the “booknumber” column to be the key for linking the new data with the existing nodes.
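For anyone who prefers to do this join in Python, here is a minimal pandas sketch of the same idea: build a list of nodes from the edge list, then attach the metadata using “booknumber” as the key. The metadata column names come from the spreadsheet described above; the edge column names (source, target), the book number x99, and the author values are placeholders I made up.

```python
import pandas as pd

# Toy edge list; the column names "source" and "target" are hypothetical
edges = pd.DataFrame({
    'source': ['006', '007'],
    'target': ['007', 'x99'],
    'matches_count': [150, 162],
})
# Toy metadata using the real column names; author values are placeholders
metadata = pd.DataFrame({
    'booknumber': ['006', '007', '008'],
    'narrator': ['Kristy', 'Claudia', 'Stacey'],
    'bookauthor': ['Author A', 'Author A', 'Author B'],
})

# Every book that appears at either end of an edge becomes a node
node_ids = pd.unique(edges[['source', 'target']].to_numpy().ravel())
nodes = pd.DataFrame({'booknumber': node_ids})

# A left merge keeps nodes with no metadata; their attributes become NaN
node_table = nodes.merge(metadata, on='booknumber', how='left')
print(node_table)
```

If you read the real files with pd.read_csv, passing dtype=str keeps a book number like 006 from being turned into the number 6, which would break the link between the two tables.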

Now that we have this additional information, we can go to the Style tab, choose “Node” at the bottom of that window, and toggle open “Fill color”. For the “Column” value, choose “narrator”, and for “mapping type” choose “Discrete mapping”. Now for the fun part: assigning colors to baby-sitters! (Alas, the Baby-Sitters Club fandom wiki doesn’t list the characters’ favorite colors.)

Mapping each book in the network to a color based on the narrator of the book

The default blue gets applied to nodes that don’t have a value in the “narrator” column (e.g. super-specials).

And here’s what we get:

Network colored by narrator

Colored by narrator, this network diagram looks kind of like a fruit salad – a well-mixed fruit salad, not one where you dump a bunch of grapes in at the end or something. It doesn’t look like we’re going to get much insight here.

But what if we replace “narrator” with “bookauthor” and re-assign all the colors?

Network colored by author

Now we’re on to something! There’s definitely some clustering by ghostwriter here.

What if we turn up the threshold to 200 repeated tokens?

Some of the authors disappear altogether, and the clusters break off:

Author-colored network with 200 repeating tokens

What if we keep going? Turning the threshold up to 250 gets us this:

Author-colored network with 250 repeating tokens

And once you hit 300, you’re left with:

Author-colored network with 300 repeating tokens

It looks like 200 was our sweet spot. Let’s do one more thing to enhance that network to surface some of the even more intense overlaps.

Back in the “Style” panel for the network of books that share 200 or more matched tokens, toggle open “Stroke color” and choose “matches_count” as the column. This time, choose “continuous” for the mapping type. It will automatically show a gradient where bright yellow indicates 200 matched tokens, and dark purple indicates 330 (the maximum). Now we can see most of the connections skew towards the lower end of this range (though Suzanne Weyn, in turquoise, leans more heavily on text reuse).

Author-colored network with color-coded edges

So I started wondering if I had stumbled over the beginning to a new Multilingual Mystery: what does this look like in French? If you look at chapter 2 in translation, are they less repetitive? If I ran the same code on the translations that co-exist in a text-repetition cluster, would there be a similar amount of repetition? Or might the translator be a mitigating factor – where there might be a sub-cluster of the translator directly copying text they’d previously translated from another novel in the cluster?

A different direction

I was so very delighted with my little color-coded network visualization and my plans to extend it to the French that I was caught off-guard when I met with Mark and he seemed less than sanguine about it all. He pointed out (and I should’ve thought of this) that French inflection would probably add some further noise to the results of Scott’s comparison tool, and I should probably lemmatize the text too (change all the words to their dictionary form to get around word-count related problems caused by inflection). And even with the English, he seemed a bit quizzical that this sort of n-gram comparison was where I started with text comparison. He suggested that I might check out other distance metrics, like cosine distance or TF-IDF, if I hadn’t yet.

“One of the things that I find a bit frustrating about off-the-shelf methods is that a lot of DH people hear words that are similar and so think that they can mean the same thing. Just because there’s a statistical method called ‘innovation’ (which measures how much word usage changes over the course of a document from beginning to end), that doesn’t mean that it’s a statistical method that can measure literary innovation. To bridge that gap, you have to either adapt the method or adapt your definition of literary innovation,” cautioned Mark. “Now, your logic goes: people talk about chapter two being similar across books, similarity can imply a kind of repetition, repetition can manifest in a re-use of specific language between texts, Scott’s method measures re-use of language, therefore you’re thinking you can use Scott’s method to measure similarity. But there is a LOT of translation going on there: similarity → repetition → re-use → common 6-grams. Were someone to do this unthinkingly, they could very easily miss this chain of reasoning and think that common 6-grams is measuring textual similarity.”

(Dear readers, please don’t make that mistake! We’ve got, admittedly, a very specific situation that justifies using it with the Baby-Sitters Club corpus, but please make sure you’ve got a similarly well-justified situation before trying it.)

“In your case,” Mark added, “I think this might be right in terms of how you are thinking about similarity, but in general, this seems like a constant problem in DH. When people hear ‘are similar to’ they don’t necessarily jump immediately (or ever) to, uses the same phrases – this is why first thinking through what you mean by ‘similar’ and THEN moving to choosing a method that can try to represent that is a crucial step.” He paused for a moment. “Not everyone would agree, though. Ted Underwood thinks we should just model everything and sort out what means what later.”

I laughed. This is how DH gets to be so fun and so maddening all at once. Not only can’t we all agree on what the definition of DH is, we also don’t even always see eye-to-eye about what the crucial first step is.

I’d never run the more common text similarity metrics that Mark had mentioned, but I knew just where to start. The Programming Historian had just published a new lesson by John R. Ladd on common similarity measures that covered distance metrics, and I’d been a reviewer on Matthew J. Lavin’s lesson on TF-IDF before starting the Data-Sitters Club. Both those lessons are worth reading through if you’re interested in trying out these techniques yourself, but I’ll cover them here, Data-Sitters Club style.

What do we compare when we compare texts?

But before getting into the different distance metrics, let’s talk about what we actually measure when we measure “text similarity” computationally. If you ask someone how similar two books, or two series, are, the metrics they use are probably going to depend on the pair you present them with. How similar are BSC #10: Logan Likes Mary Anne and Charlotte Brontë’s Jane Eyre? Well, they both involve the first-person narration of a teenage female protagonist, a romance subplot, and childcare-based employment – but probably no one would think of these books as being all that similar, due to the difference in setting and vastly different levels of cultural prestige, if nothing else. What about Logan Likes Mary Anne compared to Sweet Valley High #5: All Night Long, where teenage bad-twin Jessica starts dating a college boy, stays out all night with him, and asks good-twin Liz to take a test for her? The setting is a lot more similar (1980’s affluent suburban United States) and there’s also a romance subplot, but SVH #5 is written in the third person, the series is for a much edgier audience than the Baby-Sitters Club, and the character of Mary Anne is probably more similar to Jane Eyre than Jessica Wakefield.

It’s easy for a human reader to evaluate book similarity more holistically, comparing different aspects of the book and combining them for an overall conclusion that takes them all into consideration. And if you’ve never actually tried computational text similarity methods but hear DH people talking about “measuring text similarity”, you might get the idea that computers are able to measure the similarity of texts roughly the way that humans do. Let me assure you: they cannot.

No human would compare texts the way computers compare texts. That doesn’t mean the way computers do it is wrong – if anything, critics of computational literary analysis have complained about how computational findings are things people already know. Which suggests that even though computers go about it differently, the end result can be similar to human evaluation. But it’s important to keep in mind that your results are going to vary so much based on what you measure.

So what are these things computers measure? Can they look at characters? Plot? Style? Ehhh…. Computational literary scholars are working on all that. And in some cases, they’ve found ways of measuring proxies for those things, that seem to basically work out. But those things are too abstract for a computer to measure directly. What a computer can measure is words. There’s tons of different ways that computers can measure words. Sometimes we use computers to just count words, for word frequencies. Computers can look at which words tend to occur together through something like n-grams, or more complex methods for looking at word distributions, like topic modeling or word vectors. We’ll get to those in a future DSC book. With languages that have good natural-language processing tools (and English is the best-supported language in the world), you can look at words in a slightly more abstract way by annotating part-of-speech information for each word, or annotating different syntactic structures. Then you can do measurements based on those: counting all the nouns in a text, looking at which verbs are most common across different texts, counting the frequency of dependent clauses.

It turns out that looking at the distributions of the highest-frequency words in a text is a way to identify different authors. So if you’re interested more in what the text is about, you need to look at a large number of words (a few thousand), or just look at the most common nouns to avoid interference from what’s known as an “author signal”. The choice of what words you’re counting – and how many – is different than the choice of what algorithm you use to do the measuring. But it’s at least as important, if not more so.

So the process of comparing texts with these distance measures looks something like this:

  1. Choose what you want to measure. If you’re not sure, you can start with something like the top 1,000 words, because that doesn’t require you to do any computationally-intensive pre-processing, like creating a derivative text that only includes the nouns– you can work directly with the plain-text files that make up your corpus. Whatever number you choose as the cutoff, though, needs to be sensitive to the length of the texts in your corpus. If your shortest text is 1,000 words and your longest text is 10,000 words, do you really want a cutoff that will get every single word (with room to spare once you consider duplicate words) in one of your texts? Also, you may want to be more picky than just using the top 1,000 words, depending on the corpus. With the Baby-Sitters Club corpus, character names are really important, and most characters recur throughout the series. But if you’re working with a huge corpus of 20th-century sci-fi, you might want to throw out proper names altogether, so that the fact that each book has different characters doesn’t obscure significant similarities in, for instance, what those characters are doing. Similarly, all the Baby-Sitters Club books are written in the first person, from one character’s perspective (or multiple characters’ perspective, in the case of the Super Specials). If you’re working with multiple series, or books that aren’t in a series, you could reasonably choose to throw out personal pronouns so that the difference between “I” and “she/he” doesn’t mess with your similarity calculations.

  2. Normalize your word counts. (I didn’t know about this at first, and didn’t do it the first time I compared the texts, but it turns out to be really important. More on that adventure shortly!)  While some text comparison algorithms are more sensitive to differences in text length, you can’t get around the fact that two occurrences of a word are more significant in a 100-word text than a 1,000-word text, let alone a 10,000-word text. To account for this, you can go from word counts to word frequencies, dividing the number of occurrences of a given word by the total number of words. (There’s code for this in the Jupyter notebook, you don’t have to do it by hand.)

  3. Choose a method of comparing your texts. Euclidean distance and cosine distance have advantages and disadvantages that I get into below, and TF-IDF combined with one of those distance measures gives you a slightly different view onto your text than if you just use word counts, even normalized.

  4. “Vectorize” your text. This is the process that, basically, “maps” each text to a set of coordinates. It’s easy to imagine this taking the form of X, Y coordinates for each text, but don’t forget what we’re actually counting: frequencies of the top 1,000 words. There’s a count-value for each one of those 1,000 words, so what’s being calculated are coordinates for each text in 1000-dimensional space. It’s kinda freaky to try to imagine, but easier if you think of it less as 1000-dimensional space, and more as a large spreadsheet with 1,000 rows (one for each word), and a value for each row (the word count or frequency for each). Each of those row-values is the coordinate of the text in that one dimension. You could just pick two words, and declare them your X and Y coordinates – and maybe that might even be interesting, depending on the words you pick! (Like, here’s a chart of the frequency of Kristy to Claudia.) But in almost all cases, we want the coordinates for the text-point to incorporate data from all the words, not just two. And that’s how we end up in 1000-dimensional space. The good news is that you don’t have to imagine it: we’re not trying to visualize it yet, we’re just telling Python to create a point in 1000-dimensional space for each text.

  5. Measure the distance between your text-points. There’s two common ways to do this: Euclidean distance and cosine distance.

  6. Look at the results and figure out what to make of it. This is the part that the computer can’t help you with. It’s all up to you and your brain. 🤯
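Steps 1, 2, and 4 can be sketched in a few lines of plain Python. This is a toy illustration with made-up sentences and a tiny vocabulary cutoff, not the code from the notebook: it picks the most common words across the corpus, then turns each text into a list of relative frequencies, one coordinate per word. The function names tokenize and vectorize are my own, and the whitespace tokenizer is much cruder than what a real vectorizer does.

```python
from collections import Counter

def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def vectorize(texts, top_n=1000):
    """Map each text to relative frequencies of the corpus's top_n words."""
    tokenized = [tokenize(t) for t in texts]
    # Step 1: choose what to measure (the most common words overall)
    totals = Counter(word for tokens in tokenized for word in tokens)
    vocab = [word for word, _ in totals.most_common(top_n)]
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid dividing by zero on an empty text
        # Step 2: normalize counts by text length
        # Step 4: one coordinate per vocabulary word
        vectors.append([counts[word] / length for word in vocab])
    return vocab, vectors

texts = [
    "kristy called the meeting to order",
    "claudia passed the candy to kristy and the meeting went on",
]
vocab, vectors = vectorize(texts, top_n=3)
print(vocab)
print(vectors)
```

With top_n=3, each text becomes a point in 3-dimensional space; swap in 1,000 and you get the 1000-dimensional version, no imagination required.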

With that big-picture view in mind, let’s take a look at some of the distance measures.

Euclidean distance

One of the things that I find striking about using Euclidean distance to measure the distance between text-points is that it actually involves measuring distance. Just like you did between points on your classic X, Y axis graph from high school math. (Hello, trigonometry! I have not missed you or needed you at all until now.)

The output of Scott’s tool is more intuitively accessible than running Euclidean distance on text-points in 1000-dimensional space. His tool takes in text pairs, and spits out 6-grams of (roughly) overlapping text. With Euclidean and cosine distance, what you get back is a number. You can compare that number to numbers you get back for other pairs of texts, but the best way to make sure that you’re getting sensible results is to be familiar with the texts in question, and draw upon that knowledge for your evaluation. What I’m really interested in is the “chapter 2” question, but I don’t have a good sense of the content of all the books’ chapter 2s. So instead, we’ll start exploring these analyses on full books, and once we understand what’s going on, we can apply it to the chapter 2s.

#Imports the count vectorizer from Scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
#Glob is used for finding path names
import glob
#pdist calculates the pairwise distances, and squareform arranges them in a square table
from scipy.spatial.distance import pdist, squareform
#In case you're starting to run the code just at this point, we'll need os again
import os
#In case you're starting to run the code just at this point, we'll need pandas again
import pandas as pd

Put the full path to the folder with your corpus of plain text files between the single quotes below.

filedir = '/Users/qad/Documents/dsc_corpus_clean'
os.chdir(filedir)

If you’re looking at the code itself in the Jupyter notebook for this book, you’ll see we’re using the Scikit-learn Python module’s CountVectorizer class, which counts up all the words in all the texts you give it, filtering out any according to the parameters you give it. You can do things like strip out, for instance, words that occur in more than 70% of the texts by adding max_df = .7 after max_features. That’s the default suggested by John R. Ladd’s Programming Historian tutorial on text similarity metrics, and I figured I’d just run with it while exploring this method.

Note: Sometimes when you're trying a new method, it's comforting to copy and paste code that's all but guaranteed to work. Sometimes you do that without checking in with yourself about whether you actually want it to do everything that it's doing. Maybe you tell yourself you'll just run it once as-is, then go back and consider its parameters more carefully... but instead you get excited and distracted and don't go back and fix that before you reference back to that code for subsequent analyses and... well, for this particular corpus, dropping words that occur in more than 70% of the texts isn't a great idea, because you lose things like frequency of character names, which are actually pretty important in the Baby-Sitters Club. And the result is that your texts end up looking more-different than they should, because you've dropped a lot of what they have in common: the same core set of characters.

Want to know how long it took me to realize that was an issue with the results I was getting? I’ve been writing this book on and off for six months.

It took until… the night I was testing the Jupyter notebook version, to publish it the next day. To say that I’m not a details person is truly an understatement. But you really do have to be careful with this stuff, and seriously think through the implications of the choices you make, even on seemingly small things like this.

Because the book is written around that mistake, I’m leaving it in for the Euclidean distance and cosine sections. Don’t worry, we’ll come back to it.
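To make the effect of max_df concrete, here’s a minimal sketch in plain Python that mimics the idea rather than calling Scikit-learn. It computes each word’s document frequency (the share of texts it appears in) and reports the words a cutoff would drop. As I understand the Scikit-learn documentation, a decimal max_df ignores terms whose document frequency is strictly higher than the threshold, and that’s the assumption built in here. The toy texts and the function name dropped_by_max_df are made up for illustration.

```python
def dropped_by_max_df(texts, max_df=0.7):
    """Return words appearing in more than max_df of the texts, sorted."""
    doc_sets = [set(t.lower().split()) for t in texts]
    vocabulary = set().union(*doc_sets)
    dropped = []
    for word in vocabulary:
        # Share of texts that contain the word at least once
        doc_freq = sum(word in doc for doc in doc_sets) / len(doc_sets)
        if doc_freq > max_df:
            dropped.append(word)
    return sorted(dropped)

toy_texts = [
    "kristy called the meeting",
    "claudia and kristy hid candy before the meeting",
    "kristy and claudia went to the meeting",
    "stacey moved back to new york",
]
print(dropped_by_max_df(toy_texts))
```

In this toy corpus, “kristy” shows up in three of the four texts, so she gets thrown out right along with “the”.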

Anyhow, as you see below, before you can measure the distance between texts in this trippy 1000-dimensional space, you need to transform them into a NumPy array because SciPy (the module that’s doing the measuring) wants an array for its input. “Because the next thing in my workflow wants it that way” is a perfectly legitimate reason to change the format of your data, especially if it doesn’t change the data itself.

# Use the glob library to create a list of file names, sorted alphabetically
# Alphabetical sorting will get us the books in numerical order
filenames = sorted(glob.glob("*.txt"))
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a CountVectorizer instance with the parameters you need
vectorizer = CountVectorizer(input="filename", max_features=1000, max_df = .7)
# Run the vectorizer on your list of filenames to create your wordcounts
# Use the toarray() function so that SciPy will accept the results
wordcounts = vectorizer.fit_transform(filenames).toarray()

Here’s an important thing to remember, though, before running off to calculate the Euclidean distance between texts: it is directly measuring the distance between our text-points in 1000-dimensional space. And those points in 1000-dimensional space were calculated based on word counts – meaning that for long texts, words will generally have a higher word count. Even if you’re comparing two texts that have the exact same relative frequency of all the words (imagine if you have one document with a 500-word description of a Kristy’s Krushers baseball game, and another document with that same 500-word description printed twice), running Euclidean distance after doing word-counts will show them as being quite different, because the word counts in one text are twice as big as in the other text. One implication of this is that you really need your texts to be basically the same length to get good results from Euclidean distance.
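The doubled-document scenario is easy to try out on a small scale. This sketch uses a made-up one-sentence “game description” and plain Python (math.dist needs Python 3.8 or later) instead of the notebook’s CountVectorizer and SciPy, so treat it as an illustration of the principle. It measures the Euclidean distance between a text and the same text printed twice, first with raw word counts and then with word frequencies.

```python
import math
from collections import Counter

def count_vector(text, vocab):
    """Raw word counts, one coordinate per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def frequency_vector(text, vocab):
    """Word counts divided by the total number of words in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocab]

# A made-up stand-in for the game description, and the same text twice
game = "the krushers scored and the crowd cheered"
doubled = game + " " + game
vocab = sorted(set(game.split()))

raw_distance = math.dist(count_vector(game, vocab),
                         count_vector(doubled, vocab))
freq_distance = math.dist(frequency_vector(game, vocab),
                          frequency_vector(doubled, vocab))
print(raw_distance, freq_distance)
```

With raw counts, the two texts end up about 3 units apart even though they say exactly the same thing; with frequencies, the distance drops to 0.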

I started off trying out Euclidean distance, running with the assumption that the Baby-Sitters Club books are all pretty much the same length. All the main and mystery series have 15 chapters, so it probably all works out, right?

#Runs the Euclidean distance calculation, prints the output, and saves it as a CSV
euclidean_distances = pd.DataFrame(squareform(pdist(wordcounts)), index=filekeys, columns=filekeys)
euclidean_distances
000c_the_summer_before 001c_kristys_great_idea 002c_claudia_and_the_phantom_phone_calls 003c_the_truth_about_stacey 004c_mary_anne_saves_the_day 005c_dawn_and_the_impossible_three 006c_kristys_big_day 007c_claudia_and_mean_jeanine 008c_boy_crazy_stacey 009c_the_ghost_at_dawns_house ... m36c_kristy_and_the_cat_burglar pc1c_staceys_book pc2c_claudias_book pc3c_dawns_book pc4c_mary_annes_book pc5c_kristys_book pc6c_abbys_book serr1c_logans_story serr2c_logan_bruno_boy_babysitter serr3c_shannons_story
000c_the_summer_before 0.000000 182.882476 176.329238 207.651150 178.740035 349.561153 219.478928 308.750709 220.565183 234.147389 ... 383.053521 173.248377 162.877254 270.619659 174.725499 180.280337 388.917729 238.438671 242.658196 236.691783
001c_kristys_great_idea 182.882476 0.000000 136.959848 208.798946 185.876303 338.039938 199.386559 338.060645 216.644871 212.091961 ... 373.057636 211.075816 173.778595 254.332460 171.531338 163.978657 383.917960 222.033781 223.988839 217.648800
002c_claudia_and_the_phantom_phone_calls 176.329238 136.959848 0.000000 175.319708 177.786389 344.541725 205.385004 291.099639 225.851721 223.215143 ... 355.336460 216.958521 157.114608 261.390512 152.620444 171.420536 388.361945 229.161515 224.666419 223.895065
003c_the_truth_about_stacey 207.651150 208.798946 175.319708 0.000000 228.361555 379.141135 251.527334 334.356098 273.130006 261.235526 ... 379.400843 229.421010 223.924094 300.353126 225.197691 231.607426 416.658133 272.495872 266.619579 263.408428
004c_mary_anne_saves_the_day 178.740035 185.876303 177.786389 228.361555 0.000000 346.189255 220.981900 281.735692 224.341258 237.821362 ... 383.861954 231.510259 179.078195 271.508748 173.824624 188.872973 395.068349 219.547261 238.853512 238.388339
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
pc5c_kristys_book 180.280337 163.978657 171.420536 231.607426 188.872973 334.666999 192.442199 340.693411 211.423745 229.621428 ... 366.923698 200.079984 156.038457 241.134817 152.072351 0.000000 376.050529 208.264255 213.035208 205.601556
pc6c_abbys_book 388.917729 383.917960 388.361945 416.658133 395.068349 483.179056 397.972361 495.505802 403.660749 415.138531 ... 491.042768 396.759373 379.283535 422.521005 376.664307 376.050529 0.000000 406.787414 406.566108 403.526951
serr1c_logans_story 238.438671 222.033781 229.161515 272.495872 219.547261 367.243788 247.172005 376.210048 249.094360 266.762066 ... 395.032910 250.615243 222.261108 286.674031 220.290717 208.264255 406.787414 0.000000 230.872259 256.978598
serr2c_logan_bruno_boy_babysitter 242.658196 223.988839 224.666419 266.619579 238.853512 363.265743 248.410950 384.452858 248.692581 262.625208 ... 384.755767 250.419648 223.383079 289.191978 215.726679 213.035208 406.566108 230.872259 0.000000 256.413728
serr3c_shannons_story 236.691783 217.648800 223.895065 263.408428 238.388339 357.849130 240.145789 379.578714 252.111087 269.276809 ... 392.686898 246.203168 212.635839 283.619463 209.279717 205.601556 403.526951 256.978598 256.413728 0.000000

192 rows × 192 columns

euclidean_distances.to_csv('euclidean_distances_count.csv')

No one really likes looking at a giant table of numbers, especially not for a first look at a large data set. So let’s visualize it as a heatmap. We’ll put all the filenames along the X and Y axes; darker colors mean a smaller distance, that is, more similar texts. (That’s why there’s a black line running diagonally: each text is identical to itself, so its distance from itself is 0.)

The code below installs the seaborn visualization package (which doesn’t come with Anaconda by default, but if it’s already installed, you can skip that cell), imports matplotlib (our base visualization library), and then imports seaborn (which provides the specific heatmap visualization).

#Installs seaborn
#You only need to run this cell the first time you run this notebook
import sys
!{sys.executable} -m pip install seaborn
#Import matplotlib
import matplotlib.pyplot as plt
#Import seaborn
import seaborn as sns
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(euclidean_distances)
#Displays the image
plt.show()

Euclidean distance with count vectorizer

The output of the heatmap visualization I used to get a sense of the results is a little dazzling. It looks more like one of Mary Anne’s plaid dresses than something you could make sense out of. Each book (in numerical order) is along the vertical and horizontal axes, so you have a black line running diagonally showing that every book is identical to itself.

If you zoom in enough to read the labels (you can save the images from this Jupyter notebook by ctrl+clicking on them, or you can find them in the GitHub repo), you can start to pick out patterns. California Diaries: Dawn 1 is one of the bright light-colored lines, meaning it’s very different from the other books. That’s not too surprising, though it’s more surprising that it also looks different from the other California Diaries books. Abby’s Book from the Portrait Collection (that character’s “autobiography”) is very different from the other Portrait Collection books. There are also a few clusters of noticeably different books scattered throughout the corpus: Mystery #32: Claudia and the Mystery in the Painting and Mystery #34: Mary Anne and the Haunted Bookstore were about as distinct as California Diaries #1. BSC #103: Happy Holidays, Jessi, BSC #73: Mary Anne and Miss Priss, and BSC #62: Kristy and the Worst Kid Ever also jump out as visibly distinct. There’s also a band of higher general similarity ranging from books #83-101.
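If squinting at stripes isn’t your thing, one way to double-check which books are the "bright lines" is to average each book’s distance to all the others and sort. Here’s a small sketch using only four books, with distances copied from the table above; the helper names `distance` and `mean_distance_to_others` are my own for this example, and four books obviously can’t stand in for all 192.

```python
# Distances copied from the Euclidean distance table above (four books only)
books = [
    '000c_the_summer_before',
    '001c_kristys_great_idea',
    '002c_claudia_and_the_phantom_phone_calls',
    'pc6c_abbys_book',
]
pair_distances = {
    ('000c_the_summer_before', '001c_kristys_great_idea'): 182.882476,
    ('000c_the_summer_before', '002c_claudia_and_the_phantom_phone_calls'): 176.329238,
    ('001c_kristys_great_idea', '002c_claudia_and_the_phantom_phone_calls'): 136.959848,
    ('000c_the_summer_before', 'pc6c_abbys_book'): 388.917729,
    ('001c_kristys_great_idea', 'pc6c_abbys_book'): 383.917960,
    ('002c_claudia_and_the_phantom_phone_calls', 'pc6c_abbys_book'): 388.361945,
}

def distance(a, b):
    """Look up a distance no matter which order the pair was stored in."""
    if a == b:
        return 0.0
    return pair_distances.get((a, b), pair_distances.get((b, a)))

def mean_distance_to_others(book):
    """Average distance from one book to every other book in the list."""
    others = [b for b in books if b != book]
    return sum(distance(book, b) for b in others) / len(others)

# Most "different" book first
ranked = sorted(books, key=mean_distance_to_others, reverse=True)
for book in ranked:
    print(f"{book}: {mean_distance_to_others(book):.1f}")
```

With the full dataframe, something like `euclidean_distances.mean().sort_values(ascending=False)` should give a similar ranking, though that version also averages in each book’s 0 distance from itself.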

It was one of those classic DH moments where I now had a bunch of data, and no idea where to start on interpreting it. 🤯

But then I started to wonder about how good my data even was. Like I mentioned earlier, Euclidean distance is very sensitive to the length of the texts I was comparing. Was it a fair assumption that the books would all be the same length? DH methods make it easy to put our assumptions to the test.

Counting words

To see if Euclidean distance is a good metric, we need to find out how much variation there is in the text length. For Euclidean distance to work well, we need the input text to be close to the same length.

The first way we’ll count is based on BSC sub-series. The code below depends on some DSC-specific file-naming conventions, where each file is named with an abbreviation representing the series, followed by the book number.
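As a rough illustration of that naming convention, here’s a sketch that pulls the series abbreviation and book number out of a filename. The pattern (optional letters, then digits, then `c_`) is my guess based on the filenames that show up in this notebook, and `split_name` is a hypothetical helper, not something the counting code uses.

```python
import re

# Assumed pattern, inferred from filenames like "m29c_..." and "015c_...":
# optional lowercase series letters, the book number, then "c_"
NAME_PATTERN = re.compile(r'^([a-z]*)(\d+)c_')

def split_name(filename):
    """Return (series_abbreviation, book_number), or None if the name doesn't fit."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    prefix, number = match.groups()
    # No letters before the number means it's a main series book
    return (prefix or 'main', int(number))

for name in ['015c_little_miss_stoneybrook_and_dawn.txt',
             'm29c_stacey_and_the_fashion_victim.txt',
             'cd05c_ducky1.txt',
             'serr1c_logans_story.txt']:
    print(name, split_name(name))
```

Note that this reports `serr` as its own abbreviation, while the counting cell that follows files anything without one of its listed prefixes under `main`.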

Counting words in full books

We’ve already specified above that filedir is where all our full-text files are, and we should already be in that directory in order to run Euclidean distance. So we can just run this code on the files in our current directory, which should be the full-text files.

#Creates a CSV file for writing the word counts
with open('bsc_series_wordcount.csv', 'w', encoding='utf8') as out:
    #Writes the header row
    out.write('filename, wordcount, series')
    #New line
    out.write('\n')
    #For each file in the directory
    for filename in os.listdir(filedir):
        #If it ends in .txt
        if filename.endswith('.txt'):
            #Open that file (the with-block closes it again once it's been read)
            with open(os.path.join(filedir, filename), "rt", encoding="utf8") as file:
                #Read the file
                data = file.read()
            #Split words based on white space
            words = data.split()
            #If filename starts with 'ss' for Super Special
            if filename.startswith('ss'):
                #Assign 'ss' as the series
                series = 'ss'
            #If filename starts with 'm' for Mystery
            elif filename.startswith('m'):
                #Assign 'm' as the series
                series = 'm'
            #If filename starts with 'cd' for California Diaries
            elif filename.startswith('cd'):
                #Assign 'cd' as the series
                series = 'cd'
            #If the filename starts with 'pc' for Portrait Collection
            elif filename.startswith('pc'):
                #Assign 'pc' as the series
                series = 'pc'
            #If the filename starts with 'ff' for Friends Forever
            elif filename.startswith('ff'):
                #Assign 'ff' as the series
                series = 'ff'
            #Otherwise...
            else:
                #It's a main series book
                series = 'main'
            #Print the filename, comma, length, comma, and series (so we can see it)
            print(filename + ', ' + str(len(words)) + ', ' + series)
            #Write out each of those components to the file
            out.write(filename)
            out.write(', ')
            out.write(str(len(words)))
            out.write(', ')
            out.write(series)
            #Newline so the lines don't all run together
            out.write('\n')
015c_little_miss_stoneybrook_and_dawn.txt, 27087, main
009c_the_ghost_at_dawns_house.txt, 26291, main
pc5c_kristys_book.txt, 21916, pc
016c_jessis_secret_language.txt, 26792, main
091c_claudia_and_the_first_thanksgiving.txt, 23657, main
104c_abbys_twin.txt, 23464, main
037c_dawn_and_the_older_boy.txt, 24120, main
m29c_stacey_and_the_fashion_victim.txt, 27546, m
010c_logan_likes_mary_anne.txt, 25677, main
047c_mallory_on_strike.txt, 29116, main
072c_dawn_and_the_we_heart_kids_club.txt, 23850, main
078c_claudia_and_crazy_peaches.txt, 26259, main
130c_staceys_movie.txt, 22900, main
cd05c_ducky1.txt, 21072, cd
057c_dawn_saves_the_planet.txt, 27411, main
099c_staceys_broken_heart.txt, 28644, main
m01c_stacey_and_the_mystery_ring.txt, 27488, m
cd06c_sunny2.txt, 18404, cd
115c_jessis_big_break.txt, 24112, main
002c_claudia_and_the_phantom_phone_calls.txt, 27930, main
123c_claudias_big_party.txt, 26794, main
m15c_kristy_and_the_vampires.txt, 28095, m
063c_claudias_freind_friend.txt, 26982, main
m16c_claudia_and_the_clue_in_the_photograph.txt, 30832, m
018c_staceys_mistake.txt, 25884, main
cd10c_ducky2.txt, 18573, cd
089c_kristy_and_the_dirty_diapers.txt, 25260, main
069c_get_well_soon_mallory.txt, 25702, main
028c_welcome_back_stacey.txt, 25835, main
107c_mind_your_own_business_kristy.txt, 22825, main
pc4c_mary_annes_book.txt, 25734, pc
m34c_mary_anne_and_the_haunted_bookstore.txt, 36287, m
088c_farewell_dawn.txt, 24179, main
030c_mary_anne_and_the_great_romance.txt, 26513, main
045c_kristy_and_the_baby_parade.txt, 26847, main
cd15c_ducky3.txt, 16265, cd
109c_mary_anne_to_the_rescue.txt, 25003, main
074c_kristy_and_the_copycat.txt, 24787, main
075c_jessis_horrible_prank.txt, 23470, main
cd11c_dawn3.txt, 13984, cd
m28c_abby_and_the_mystery_baby.txt, 27108, m
073c_mary_anne_and_miss_priss.txt, 26813, main
007c_claudia_and_mean_jeanine.txt, 26131, main
128c_claudia_and_the_little_liar.txt, 21732, main
m03c_mallory_and_the_ghost_cat.txt, 33981, m
022c_jessi_ramsey_petsitter.txt, 26079, main
077c_dawn_and_whitney_friends_forever.txt, 27004, main
m21c_claudia_and_the_recipe_for_danger.txt, 27784, m
033c_claudia_and_the_great_search.txt, 26324, main
m35c_abby_and_the_notorius_neighbor.txt, 25247, m
cd08c_maggie2.txt, 20572, cd
m31c_mary_anne_and_the_music_box_secret.txt, 28238, m
039c_poor_mallory.txt, 25816, main
025c_mary_anne_and_the_search_for_tigger.txt, 26060, main
043c_staceys_emergency.txt, 26935, main
131c_the_fire_at_mary_annes_house.txt, 26831, main
046c_mary_anne_misses_logan.txt, 25848, main
011c_kristy_and_the_snobs.txt, 26618, main
125c_mary_anne_in_the_middle.txt, 22252, main
083c_stacey_vs_the_bsc.txt, 22366, main
cd07c_dawn2.txt, 16084, cd
097c_claudia_and_the_worlds_cutest_baby.txt, 23831, main
118c_kristy_thomas_dog_trainer.txt, 21439, main
058c_staceys_choice.txt, 25888, main
066c_maid_mary_anne.txt, 29361, main
026c_claudia_and_the_sad_goodbye.txt, 27165, main
029c_mallory_and_the_mystery_diary.txt, 24184, main
079c_mary_anne_breaks_the_rules.txt, 22897, main
000c_the_summer_before.txt, 44523, main
cd14c_amalia3.txt, 12657, cd
013c_goodbye_stacey_goodbye.txt, 25562, main
122c_kristy_in_charge.txt, 22826, main
006c_kristys_big_day.txt, 27079, main
095c_kristy_plus_bart_equals_questionmark.txt, 23540, main
m23c_abby_and_the_secret_society.txt, 28235, m
038c_kristys_mystery_admirer.txt, 26125, main
082c_jessi_and_the_troublemaker.txt, 24267, main
100c_kristys_worst_idea.txt, 26217, main
113c_claudia_makes_up_her_mind.txt, 23257, main
124c_stacey_mcgill_matchmaker.txt, 23986, main
119c_staceys_ex_boyfriend.txt, 22967, main
112c_kristy_and_the_sister_war.txt, 25257, main
092c_mallorys_christmas_wish.txt, 23612, main
027c_jessi_and_the_superbrat.txt, 25773, main
m02c_beware_dawn.txt, 27184, m
111c_staceys_secret_friend.txt, 21125, main
101c_claudia_kishi_middle_school_dropout.txt, 28114, main
cd03c_maggie1.txt, 20340, cd
serr2c_logan_bruno_boy_babysitter.txt, 24026, main
012c_claudia_and_the_new_girl.txt, 26497, main
085c_claudia_kishi_live_from_wsto.txt, 23124, main
020c_kristy_and_the_walking_disaster.txt, 26130, main
098c_dawn_and_too_many_sitters.txt, 23006, main
024c_kristy_and_the_mothers_day_surprise.txt, 25943, main
050c_dawns_big_date.txt, 29622, main
m32c_claudia_and_the_mystery_in_the_painting.txt, 30948, m
090c_welcome_to_the_bsc_abby.txt, 23660, main
m04c_kristy_and_the_missing_child.txt, 27132, m
cd01c_dawn1.txt, 26827, cd
129c_kristy_at_bat.txt, 27978, main
114c_the_secret_life_of_mary_anne_spier.txt, 22603, main
m05c_mary_anne_and_the_secret_in_the_attic.txt, 26051, m
044c_dawn_and_the_big_sleepover.txt, 24944, main
001c_kristys_great_idea.txt, 27588, main
031c_dawns_wicked_stepsister.txt, 26284, main
cd13c_maggie3.txt, 19390, cd
110c_abby_and_the_bad_sport.txt, 23155, main
serr1c_logans_story.txt, 25309, main
126c_the_all_new_mallory_pike.txt, 26896, main
pc2c_claudias_book.txt, 26715, pc
cd02c_sunny1.txt, 21539, cd
094c_stacey_mcgill_super_sitter.txt, 26036, main
019c_claudia_and_the_bad_joke.txt, 26883, main
032c_kristy_and_the_secret_of_susan.txt, 25970, main
053c_kristy_for_president.txt, 27124, main
067c_dawns_big_move.txt, 25143, main
021c_mallory_and_the_trouble_with_twins.txt, 25193, main
117c_claudia_and_the_terrible_truth.txt, 24298, main
042c_jessi_and_the_dance_school_phantom.txt, 34521, main
m30c_kristy_and_the_mystery_train.txt, 25599, m
008c_boy_crazy_stacey.txt, 24890, main
m20c_mary_anne_and_the_zoo_mystery.txt, 31175, m
m09c_kristy_and_the_haunted_mansion.txt, 28132, m
m08c_jessi_and_the_jewel_thieves.txt, 28420, m
076c_staceys_lie.txt, 31339, main
cd04c_amalia1.txt, 18836, cd
105c_stacey_the_math_whiz.txt, 24844, main
087c_stacey_and_the_bad_girls.txt, 24508, main
034c_mary_anne_and_too_many_boys.txt, 24268, main
070c_stacey_and_the_cheerleaders.txt, 25515, main
084c_dawn_and_the_school_spirit_war.txt, 25113, main
055c_jessis_gold_medal.txt, 26346, main
108c_dont_give_up_mallory.txt, 28451, main
003c_the_truth_about_stacey.txt, 30117, main
m27c_claudia_and_the_lighthouse_ghost.txt, 26112, m
cd12c_sunny3.txt, 29603, cd
049c_claudia_and_the_genius_of_elm_street.txt, 25270, main
m24c_mary_anne_and_the_silent_witness.txt, 28223, m
040c_claudia_and_the_middle_school_mystery.txt, 24995, main
036c_jessis_babysitter.txt, 24831, main
005c_dawn_and_the_impossible_three.txt, 29910, main
086c_mary_anne_and_camp_bsc.txt, 26630, main
cd09c_amalia2.txt, 14649, cd
081c_kristy_and_mr_mom.txt, 28780, main
m10c_stacey_and_the_mystery_money.txt, 33512, m
059c_mallory_hates_boys_and_gym.txt, 26991, main
m26c_dawn_schafer_undercover_babysitter.txt, 27910, m
023c_dawn_on_the_coast.txt, 24510, main
102c_mary_anne_and_the_little_princess.txt, 25081, main
m07c_dawn_and_the_disappearing_dogs.txt, 26986, m
068c_jessi_and_the_bad_babysitter.txt, 25705, main
116c_abby_and_the_best_kid_ever.txt, 23468, main
065c_staceys_big_crush.txt, 25768, main
062c_kristy_and_the_worst_kid_ever.txt, 29571, main
m11c_claudia_and_the_mystery_at_the_museum.txt, 26654, m
121c_abby_in_wonderland.txt, 23998, main
pc1c_staceys_book.txt, 27027, pc
m33c_stacey_and_the_stolen_hearts.txt, 24781, m
m36c_kristy_and_the_cat_burglar.txt, 27560, m
017c_mary_annes_bad_luck_mystery.txt, 25242, main
061c_jessi_and_the_awful_secret.txt, 26549, main
096c_abbys_lucky_thirteen.txt, 23804, main
103c_happy_holidays_jessi.txt, 23603, main
pc3c_dawns_book.txt, 23439, pc
m14c_stacey_and_the_mystery_at_the_mall.txt, 29865, m
106c_claudia_queen_of_the_seventh_grade.txt, 25176, main
048c_jessis_wish.txt, 24971, main
m06c_the_mystery_at_claudias_house.txt, 27581, m
071c_claudia_and_the_perfect_boy.txt, 28955, main
m17c_dawn_and_the_halloween_mystery.txt, 29060, m
051c_staceys_ex_best_friend.txt, 24508, main
serr3c_shannons_story.txt, 26623, main
056c_keep_out_claudia.txt, 24579, main
127c_abbys_un_valentine.txt, 24998, main
m12c_dawn_and_the_surfer_ghost.txt, 26905, m
035c_stacey_and_the_mystery_of_stoneybrook.txt, 27207, main
m19c_kristy_and_the_missing_fortune.txt, 28856, m
093c_mary_anne_and_the_memory_garden.txt, 27669, main
m25c_kristy_and_the_middle_school_vandal.txt, 26080, m
060c_mary_annes_makeover.txt, 24758, main
080c_mallory_pike_no_1_fan.txt, 27536, main
m22c_stacey_and_the_haunted_masquerade.txt, 28708, m
pc6c_abbys_book.txt, 21039, pc
014c_hello_mallory.txt, 24607, main
m13c_mary_anne_and_the_library_mystery.txt, 28432, m
120c_mary_anne_and_the_playground_fight.txt, 22458, main
064c_dawns_family_feud.txt, 23708, main
004c_mary_anne_saves_the_day.txt, 30770, main
054c_mallory_and_the_dream_horse.txt, 29581, main
052c_mary_anne_plus_too_many_babies.txt, 24905, main
m18c_stacey_and_the_mystery_at_the_empty_house.txt, 29174, m
041c_mary_anne_vs_logan.txt, 25474, main
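A long list of numbers is hard to judge by eye, so here’s a sketch of one way to summarize it: the shortest and longest book per series, and how many times longer the longest is. To keep the example self-contained it reads a handful of rows copied from the output above rather than the real CSV, so the numbers it prints describe only that sample. `summarize_by_series` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the output above, in the same layout the cell writes
sample_csv = """filename, wordcount, series
000c_the_summer_before.txt, 44523, main
130c_staceys_movie.txt, 22900, main
m34c_mary_anne_and_the_haunted_bookstore.txt, 36287, m
m33c_stacey_and_the_stolen_hearts.txt, 24781, m
cd14c_amalia3.txt, 12657, cd
cd12c_sunny3.txt, 29603, cd
pc6c_abbys_book.txt, 21039, pc
pc1c_staceys_book.txt, 27027, pc
"""

def summarize_by_series(csv_file):
    """Return {series: (shortest, longest, longest / shortest)} from a word-count CSV."""
    counts = defaultdict(list)
    # skipinitialspace drops the space the cell writes after each comma
    for row in csv.DictReader(csv_file, skipinitialspace=True):
        counts[row['series']].append(int(row['wordcount']))
    return {s: (min(c), max(c), max(c) / min(c)) for s, c in counts.items()}

summary = summarize_by_series(io.StringIO(sample_csv))
for series, (shortest, longest, ratio) in sorted(summary.items()):
    print(f"{series}: {shortest} to {longest} words (longest is {ratio:.2f}x the shortest)")
```

To run it on the real thing, you could open `bsc_series_wordcount.csv` with `encoding='utf8'` and pass the file object to `summarize_by_series` instead of the sample.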

Counting words by chapter

Now, enter the full path to the directory with your individual-chapter files.

chapterdir = '/Users/qad/Documents/dsc_chapters/allchapters'
#Change to the directory with the individual-chapter files.
os.chdir(chapterdir)
#Creates a CSV file for writing the word counts
with open('bsc_chapter_wordcount.csv', 'w', encoding='utf8') as out:
    #Write header
    out.write('filename, wordcount, chapter_number')
    #Newline
    out.write('\n')
    #For each file in the directory
    for filename in os.listdir(chapterdir):
        #If it ends with .txt
        if filename.endswith('.txt'):
            #Open the file (the with-block closes it again once it's been read)
            with open(os.path.join(chapterdir, filename), "rt", encoding='utf8') as file:
                #Read the file
                data = file.read()
            #Split words at blank spaces
            words = data.split()
            #If the filename ends with an underscore and number
            #The number goes in the "series" column (it's actually a chapter number)
            if filename.endswith('_1.txt'):
                series = '1'
            elif filename.endswith('_2.txt'):
                series = '2'
            elif filename.endswith('_3.txt'):
                series = '3'
            elif filename.endswith('_4.txt'):
                series = '4'
            elif filename.endswith('_5.txt'):
                series = '5'
            elif filename.endswith('_6.txt'):
                series = '6'
            elif filename.endswith('_7.txt'):
                series = '7'
            elif filename.endswith('_8.txt'):
                series = '8'
            elif filename.endswith('_9.txt'):
                series = '9'
            elif filename.endswith('_10.txt'):
                series = '10'
            elif filename.endswith('_11.txt'):
                series = '11'
            elif filename.endswith('_12.txt'):
                series = '12'
            elif filename.endswith('_13.txt'):
                series = '13'
            elif filename.endswith('_14.txt'):
                series = '14'
            elif filename.endswith('_15.txt'):
                series = '15'
            #Print results so we can watch as it goes
            print(filename + ', ' + str(len(words)) + ', ' + series)
            #Write everything out to the CSV file
            out.write(filename)
            out.write(', ')
            out.write(str(len(words)))
            out.write(', ')
            out.write(series)
            out.write('\n')
012c_claudia_and_the_new_girl_1.txt, 1968, 1
m07c_dawn_and_the_disappearing_dogs_11.txt, 1509, 11
112c_kristy_and_the_sister_war_4.txt, 1938, 4
131c_the_fire_at_mary_annes_house_4.txt, 1344, 4
m04c_kristy_and_the_missing_child_11.txt, 1768, 11
058c_staceys_choice_12.txt, 1562, 12
007c_claudia_and_mean_jeanine_6.txt, 2262, 6
031c_dawns_wicked_stepsister_14.txt, 1921, 14
004c_mary_anne_saves_the_day_6.txt, 2137, 6
001c_kristys_great_idea_3.txt, 1893, 3
099c_staceys_broken_heart_1.txt, 2444, 1
005c_dawn_and_the_impossible_three_3.txt, 1940, 3
serr3c_shannons_story_3.txt, 3974, 3
m13c_mary_anne_and_the_library_mystery_1.txt, 2007, 1
047c_mallory_on_strike_13.txt, 2171, 13
083c_stacey_vs_the_bsc_3.txt, 1323, 3
060c_mary_annes_makeover_6.txt, 1750, 6
m33c_stacey_and_the_stolen_hearts_14.txt, 1531, 14
051c_staceys_ex_best_friend_9.txt, 1381, 9
m26c_dawn_schafer_undercover_babysitter_15.txt, 1721, 15
m34c_mary_anne_and_the_haunted_bookstore_8.txt, 2198, 8
095c_kristy_plus_bart_equals_questionmark_9.txt, 1272, 9
122c_kristy_in_charge_12.txt, 1228, 12
m24c_mary_anne_and_the_silent_witness_5.txt, 2178, 5
008c_boy_crazy_stacey_14.txt, 1407, 14
070c_stacey_and_the_cheerleaders_4.txt, 1531, 4
056c_keep_out_claudia_11.txt, 1426, 11
008c_boy_crazy_stacey_2.txt, 1479, 2
009c_the_ghost_at_dawns_house_2.txt, 1567, 2
072c_dawn_and_the_we_heart_kids_club_13.txt, 1366, 13
119c_staceys_ex_boyfriend_14.txt, 986, 14
m12c_dawn_and_the_surfer_ghost_1.txt, 1956, 1
062c_kristy_and_the_worst_kid_ever_13.txt, 1836, 13
031c_dawns_wicked_stepsister_1.txt, 1574, 1
024c_kristy_and_the_mothers_day_surprise_5.txt, 1351, 5
127c_abbys_un_valentine_10.txt, 1046, 10
091c_claudia_and_the_first_thanksgiving_7.txt, 1490, 7
m06c_the_mystery_at_claudias_house_12.txt, 1614, 12
051c_staceys_ex_best_friend_15.txt, 975, 15
061c_jessi_and_the_awful_secret_13.txt, 1683, 13
012c_claudia_and_the_new_girl_15.txt, 1709, 15
123c_claudias_big_party_2.txt, 2800, 2
m33c_stacey_and_the_stolen_hearts_4.txt, 1810, 4
030c_mary_anne_and_the_great_romance_8.txt, 1791, 8
086c_mary_anne_and_camp_bsc_12.txt, 1103, 12
066c_maid_mary_anne_4.txt, 1642, 4
078c_claudia_and_crazy_peaches_8.txt, 1917, 8
080c_mallory_pike_no_1_fan_9.txt, 1457, 9
093c_mary_anne_and_the_memory_garden_6.txt, 2331, 6
046c_mary_anne_misses_logan_12.txt, 1761, 12
m27c_claudia_and_the_lighthouse_ghost_14.txt, 1483, 14
117c_claudia_and_the_terrible_truth_7.txt, 1585, 7
m22c_stacey_and_the_haunted_masquerade_10.txt, 1650, 10
120c_mary_anne_and_the_playground_fight_15.txt, 609, 15
024c_kristy_and_the_mothers_day_surprise_12.txt, 1689, 12
049c_claudia_and_the_genius_of_elm_street_3.txt, 2035, 3
082c_jessi_and_the_troublemaker_9.txt, 1065, 9
m04c_kristy_and_the_missing_child_9.txt, 1774, 9
014c_hello_mallory_3.txt, 2380, 3
035c_jessis_babysitter_13.txt, 1838, 13
118c_kristy_thomas_dog_trainer_11.txt, 1102, 11
m07c_dawn_and_the_disappearing_dogs_3.txt, 1792, 3
088c_farewell_dawn_6.txt, 1403, 6
m31c_mary_anne_and_the_music_box_secret_3.txt, 1931, 3
059c_mallory_hates_boys_and_gym_4.txt, 1765, 4
m03c_mallory_and_the_ghost_cat_9.txt, 2033, 9
m11c_claudia_and_the_mystery_at_the_museum_3.txt, 1910, 3
m01c_stacey_and_the_mystery_ring_13.txt, 1969, 13
m20c_mary_anne_and_the_zoo_mystery_11.txt, 1738, 11
serr2c_logan_bruno_boy_babysitter_15.txt, 1133, 15
110c_abby_and_the_bad_sport_9.txt, 951, 9
066c_maid_mary_anne_13.txt, 1665, 13
076c_staceys_lie_14.txt, 2237, 14
063c_claudias_freind_friend_4.txt, 1606, 4
002c_claudia_and_the_phantom_phone_calls_2.txt, 1894, 2
116c_abby_and_the_best_kid_ever_10.txt, 2258, 10
076c_staceys_lie_15.txt, 1263, 15
066c_maid_mary_anne_12.txt, 2427, 12
055c_jessis_gold_medal_1.txt, 2488, 1
002c_claudia_and_the_phantom_phone_calls_3.txt, 1833, 3
116c_abby_and_the_best_kid_ever_11.txt, 865, 11
063c_claudias_freind_friend_5.txt, 1532, 5
110c_abby_and_the_bad_sport_8.txt, 2870, 8
serr2c_logan_bruno_boy_babysitter_14.txt, 1373, 14
059c_mallory_hates_boys_and_gym_5.txt, 1014, 5
m03c_mallory_and_the_ghost_cat_8.txt, 2470, 8
m31c_mary_anne_and_the_music_box_secret_2.txt, 2060, 2
m20c_mary_anne_and_the_zoo_mystery_10.txt, 2103, 10
m01c_stacey_and_the_mystery_ring_12.txt, 1718, 12
m11c_claudia_and_the_mystery_at_the_museum_2.txt, 2397, 2
m07c_dawn_and_the_disappearing_dogs_2.txt, 2280, 2
088c_farewell_dawn_7.txt, 1348, 7
014c_hello_mallory_2.txt, 1772, 2
m04c_kristy_and_the_missing_child_8.txt, 1896, 8
035c_jessis_babysitter_12.txt, 1671, 12
082c_jessi_and_the_troublemaker_8.txt, 2370, 8
118c_kristy_thomas_dog_trainer_10.txt, 1038, 10
049c_claudia_and_the_genius_of_elm_street_2.txt, 2167, 2
024c_kristy_and_the_mothers_day_surprise_13.txt, 1112, 13
120c_mary_anne_and_the_playground_fight_14.txt, 1405, 14
m27c_claudia_and_the_lighthouse_ghost_15.txt, 1865, 15
046c_mary_anne_misses_logan_13.txt, 1741, 13
m22c_stacey_and_the_haunted_masquerade_11.txt, 2150, 11
117c_claudia_and_the_terrible_truth_6.txt, 1776, 6
080c_mallory_pike_no_1_fan_8.txt, 1524, 8
093c_mary_anne_and_the_memory_garden_7.txt, 1766, 7
086c_mary_anne_and_camp_bsc_13.txt, 1989, 13
030c_mary_anne_and_the_great_romance_9.txt, 1675, 9
078c_claudia_and_crazy_peaches_9.txt, 1740, 9
066c_maid_mary_anne_5.txt, 1348, 5
123c_claudias_big_party_3.txt, 1590, 3
012c_claudia_and_the_new_girl_14.txt, 1627, 14
m33c_stacey_and_the_stolen_hearts_5.txt, 1935, 5
013c_goodbye_stacey_goodbye_1.txt, 2102, 1
061c_jessi_and_the_awful_secret_12.txt, 1285, 12
051c_staceys_ex_best_friend_14.txt, 1613, 14
m06c_the_mystery_at_claudias_house_13.txt, 1714, 13
091c_claudia_and_the_first_thanksgiving_6.txt, 1362, 6
127c_abbys_un_valentine_11.txt, 1382, 11
024c_kristy_and_the_mothers_day_surprise_4.txt, 1909, 4
062c_kristy_and_the_worst_kid_ever_12.txt, 1215, 12
072c_dawn_and_the_we_heart_kids_club_12.txt, 1402, 12
009c_the_ghost_at_dawns_house_3.txt, 1884, 3
119c_staceys_ex_boyfriend_15.txt, 1118, 15
056c_keep_out_claudia_10.txt, 1925, 10
008c_boy_crazy_stacey_3.txt, 1766, 3
070c_stacey_and_the_cheerleaders_5.txt, 1798, 5
122c_kristy_in_charge_13.txt, 1123, 13
095c_kristy_plus_bart_equals_questionmark_8.txt, 1631, 8
008c_boy_crazy_stacey_15.txt, 1476, 15
m24c_mary_anne_and_the_silent_witness_4.txt, 1637, 4
051c_staceys_ex_best_friend_8.txt, 1494, 8
m34c_mary_anne_and_the_haunted_bookstore_9.txt, 2251, 9
m26c_dawn_schafer_undercover_babysitter_14.txt, 1712, 14
047c_mallory_on_strike_12.txt, 2098, 12
083c_stacey_vs_the_bsc_2.txt, 2124, 2
m33c_stacey_and_the_stolen_hearts_15.txt, 1502, 15
060c_mary_annes_makeover_7.txt, 1425, 7
serr3c_shannons_story_2.txt, 1678, 2
005c_dawn_and_the_impossible_three_2.txt, 1945, 2
001c_kristys_great_idea_2.txt, 1297, 2
004c_mary_anne_saves_the_day_7.txt, 2201, 7
031c_dawns_wicked_stepsister_15.txt, 1692, 15
007c_claudia_and_mean_jeanine_7.txt, 2019, 7
m04c_kristy_and_the_missing_child_10.txt, 1570, 10
058c_staceys_choice_13.txt, 1365, 13
m07c_dawn_and_the_disappearing_dogs_10.txt, 1745, 10
131c_the_fire_at_mary_annes_house_5.txt, 1887, 5
112c_kristy_and_the_sister_war_5.txt, 1809, 5
112c_kristy_and_the_sister_war_7.txt, 1950, 7
131c_the_fire_at_mary_annes_house_7.txt, 2165, 7
012c_claudia_and_the_new_girl_2.txt, 1893, 2
m07c_dawn_and_the_disappearing_dogs_12.txt, 1577, 12
056c_keep_out_claudia_8.txt, 1514, 8
011c_kristy_and_the_snobs_14.txt, 1731, 14
058c_staceys_choice_11.txt, 2062, 11
m04c_kristy_and_the_missing_child_12.txt, 1787, 12
007c_claudia_and_mean_jeanine_5.txt, 1848, 5
099c_staceys_broken_heart_2.txt, 4293, 2
004c_mary_anne_saves_the_day_5.txt, 1834, 5
m09c_kristy_and_the_haunted_mansion_15.txt, 2053, 15
108c_dont_give_up_mallory_14.txt, 1924, 14
060c_mary_annes_makeover_5.txt, 2103, 5
m13c_mary_anne_and_the_library_mystery_2.txt, 2451, 2
047c_mallory_on_strike_10.txt, 1876, 10
079c_mary_anne_breaks_the_rules_9.txt, 1378, 9
070c_stacey_and_the_cheerleaders_7.txt, 1378, 7
m24c_mary_anne_and_the_silent_witness_6.txt, 1748, 6
085c_claudia_kishi_live_from_wsto_15.txt, 463, 15
122c_kristy_in_charge_11.txt, 2508, 11
008c_boy_crazy_stacey_1.txt, 2504, 1
056c_keep_out_claudia_12.txt, 1545, 12
m05c_mary_anne_and_the_secret_in_the_attic_8.txt, 1788, 8
045c_kristy_and_the_baby_parade_8.txt, 1550, 8
m23c_abby_and_the_secret_society_9.txt, 1452, 9
m12c_dawn_and_the_surfer_ghost_2.txt, 2222, 2
062c_kristy_and_the_worst_kid_ever_10.txt, 2698, 10
031c_dawns_wicked_stepsister_2.txt, 2357, 2
024c_kristy_and_the_mothers_day_surprise_6.txt, 1761, 6
072c_dawn_and_the_we_heart_kids_club_10.txt, 1414, 10
009c_the_ghost_at_dawns_house_1.txt, 2086, 1
091c_claudia_and_the_first_thanksgiving_4.txt, 1959, 4
m06c_the_mystery_at_claudias_house_11.txt, 1856, 11
061c_jessi_and_the_awful_secret_10.txt, 1247, 10
013c_goodbye_stacey_goodbye_3.txt, 1543, 3
127c_abbys_un_valentine_13.txt, 1536, 13
066c_maid_mary_anne_7.txt, 1248, 7
041c_mary_anne_vs_logan_15.txt, 1547, 15
086c_mary_anne_and_camp_bsc_11.txt, 1317, 11
m33c_stacey_and_the_stolen_hearts_7.txt, 1591, 7
123c_claudias_big_party_1.txt, 2467, 1
040c_claudia_and_the_middle_school_mystery_15.txt, 1372, 15
093c_mary_anne_and_the_memory_garden_5.txt, 2603, 5
117c_claudia_and_the_terrible_truth_4.txt, 1747, 4
m22c_stacey_and_the_haunted_masquerade_13.txt, 2527, 13
046c_mary_anne_misses_logan_11.txt, 1538, 11
m25c_kristy_and_the_middle_school_vandal_9.txt, 1144, 9
073c_mary_anne_and_miss_priss_14.txt, 1517, 14
118c_kristy_thomas_dog_trainer_12.txt, 982, 12
035c_jessis_babysitter_10.txt, 1774, 10
024c_kristy_and_the_mothers_day_surprise_11.txt, 1772, 11
m01c_stacey_and_the_mystery_ring_10.txt, 1996, 10
m20c_mary_anne_and_the_zoo_mystery_12.txt, 2107, 12
108c_dont_give_up_mallory_9.txt, 1409, 9
059c_mallory_hates_boys_and_gym_7.txt, 1360, 7
088c_farewell_dawn_5.txt, 1664, 5
116c_abby_and_the_best_kid_ever_13.txt, 1528, 13
002c_claudia_and_the_phantom_phone_calls_1.txt, 2650, 1
063c_claudias_freind_friend_7.txt, 1587, 7
055c_jessis_gold_medal_3.txt, 1766, 3
066c_maid_mary_anne_10.txt, 2214, 10
005c_dawn_and_the_impossible_three_15.txt, 2076, 15
077c_dwn_and_whitney_friends_forever_8.txt, 1595, 8
m36c_kristy_and_the_cat_burglar_9.txt, 1588, 9
005c_dawn_and_the_impossible_three_14.txt, 1982, 14
m36c_kristy_and_the_cat_burglar_8.txt, 1506, 8
077c_dwn_and_whitney_friends_forever_9.txt, 1886, 9
055c_jessis_gold_medal_2.txt, 2952, 2
063c_claudias_freind_friend_6.txt, 1927, 6
116c_abby_and_the_best_kid_ever_12.txt, 946, 12
066c_maid_mary_anne_11.txt, 2301, 11
088c_farewell_dawn_4.txt, 1960, 4
m07c_dawn_and_the_disappearing_dogs_1.txt, 2134, 1
m20c_mary_anne_and_the_zoo_mystery_13.txt, 2081, 13
m01c_stacey_and_the_mystery_ring_11.txt, 1515, 11
m11c_claudia_and_the_mystery_at_the_museum_1.txt, 1968, 1
059c_mallory_hates_boys_and_gym_6.txt, 1684, 6
108c_dont_give_up_mallory_8.txt, 1503, 8
m31c_mary_anne_and_the_music_box_secret_1.txt, 2082, 1
049c_claudia_and_the_genius_of_elm_street_1.txt, 1593, 1
024c_kristy_and_the_mothers_day_surprise_10.txt, 1626, 10
118c_kristy_thomas_dog_trainer_13.txt, 991, 13
035c_jessis_babysitter_11.txt, 1507, 11
014c_hello_mallory_1.txt, 1828, 1
073c_mary_anne_and_miss_priss_15.txt, 833, 15
m22c_stacey_and_the_haunted_masquerade_12.txt, 1995, 12
117c_claudia_and_the_terrible_truth_5.txt, 1444, 5
m25c_kristy_and_the_middle_school_vandal_8.txt, 2930, 8
046c_mary_anne_misses_logan_10.txt, 1494, 10
093c_mary_anne_and_the_memory_garden_4.txt, 1606, 4
040c_claudia_and_the_middle_school_mystery_14.txt, 1576, 14
m33c_stacey_and_the_stolen_hearts_6.txt, 1526, 6
066c_maid_mary_anne_6.txt, 1672, 6
086c_mary_anne_and_camp_bsc_10.txt, 1570, 10
041c_mary_anne_vs_logan_14.txt, 1742, 14
127c_abbys_un_valentine_12.txt, 1586, 12
061c_jessi_and_the_awful_secret_11.txt, 1101, 11
m06c_the_mystery_at_claudias_house_10.txt, 1776, 10
091c_claudia_and_the_first_thanksgiving_5.txt, 1494, 5
013c_goodbye_stacey_goodbye_2.txt, 1539, 2
072c_dawn_and_the_we_heart_kids_club_11.txt, 1414, 11
024c_kristy_and_the_mothers_day_surprise_7.txt, 1773, 7
031c_dawns_wicked_stepsister_3.txt, 1696, 3
062c_kristy_and_the_worst_kid_ever_11.txt, 1888, 11
m12c_dawn_and_the_surfer_ghost_3.txt, 1879, 3
m23c_abby_and_the_secret_society_8.txt, 1627, 8
m05c_mary_anne_and_the_secret_in_the_attic_9.txt, 1813, 9
056c_keep_out_claudia_13.txt, 1597, 13
045c_kristy_and_the_baby_parade_9.txt, 1659, 9
085c_claudia_kishi_live_from_wsto_14.txt, 1759, 14
m24c_mary_anne_and_the_silent_witness_7.txt, 1823, 7
122c_kristy_in_charge_10.txt, 1559, 10
070c_stacey_and_the_cheerleaders_6.txt, 1865, 6
060c_mary_annes_makeover_4.txt, 1116, 4
108c_dont_give_up_mallory_15.txt, 635, 15
079c_mary_anne_breaks_the_rules_8.txt, 1422, 8
083c_stacey_vs_the_bsc_1.txt, 1976, 1
047c_mallory_on_strike_11.txt, 1120, 11
m13c_mary_anne_and_the_library_mystery_3.txt, 2037, 3
005c_dawn_and_the_impossible_three_1.txt, 2445, 1
serr3c_shannons_story_1.txt, 2410, 1
m09c_kristy_and_the_haunted_mansion_14.txt, 1710, 14
099c_staceys_broken_heart_3.txt, 1225, 3
001c_kristys_great_idea_1.txt, 2319, 1
004c_mary_anne_saves_the_day_4.txt, 1711, 4
011c_kristy_and_the_snobs_15.txt, 1670, 15
007c_claudia_and_mean_jeanine_4.txt, 1547, 4
058c_staceys_choice_10.txt, 1713, 10
m04c_kristy_and_the_missing_child_13.txt, 1817, 13
056c_keep_out_claudia_9.txt, 1492, 9
131c_the_fire_at_mary_annes_house_6.txt, 1863, 6
112c_kristy_and_the_sister_war_6.txt, 1297, 6
m07c_dawn_and_the_disappearing_dogs_13.txt, 1808, 13
012c_claudia_and_the_new_girl_3.txt, 2001, 3
058c_staceys_choice_14.txt, 1914, 14
011c_kristy_and_the_snobs_11.txt, 1505, 11
012c_claudia_and_the_new_girl_7.txt, 1725, 7
112c_kristy_and_the_sister_war_2.txt, 2362, 2
098c_dawn_and_too_many_sitters_9.txt, 1699, 9
131c_the_fire_at_mary_annes_house_2.txt, 2118, 2
serr3c_shannons_story_5.txt, 1406, 5
005c_dawn_and_the_impossible_three_5.txt, 2206, 5
035c_jessis_babysitter_9.txt, 1445, 9
001c_kristys_great_idea_5.txt, 1534, 5
099c_staceys_broken_heart_7.txt, 1265, 7
m09c_kristy_and_the_haunted_mansion_10.txt, 2090, 10
031c_dawns_wicked_stepsister_12.txt, 1454, 12
070c_stacey_and_the_cheerleaders_2.txt, 2574, 2
122c_kristy_in_charge_14.txt, 1271, 14
129c_kristy_at_bat_9.txt, 1614, 9
m24c_mary_anne_and_the_silent_witness_3.txt, 1832, 3
008c_boy_crazy_stacey_12.txt, 1247, 12
085c_claudia_kishi_live_from_wsto_10.txt, 1264, 10
m26c_dawn_schafer_undercover_babysitter_13.txt, 1738, 13
m13c_mary_anne_and_the_library_mystery_7.txt, 1979, 7
047c_mallory_on_strike_15.txt, 2441, 15
083c_stacey_vs_the_bsc_5.txt, 1369, 5
108c_dont_give_up_mallory_11.txt, 2053, 11
m33c_stacey_and_the_stolen_hearts_12.txt, 1413, 12
m12c_dawn_and_the_surfer_ghost_7.txt, 1846, 7
024c_kristy_and_the_mothers_day_surprise_3.txt, 1645, 3
062c_kristy_and_the_worst_kid_ever_15.txt, 1552, 15
031c_dawns_wicked_stepsister_7.txt, 1892, 7
072c_dawn_and_the_we_heart_kids_club_15.txt, 1115, 15
009c_the_ghost_at_dawns_house_4.txt, 1698, 4
119c_staceys_ex_boyfriend_12.txt, 1186, 12
m28c_abby_and_the_mystery_baby_8.txt, 1662, 8
034c_mary_anne_and_too_many_boys_9.txt, 1410, 9
008c_boy_crazy_stacey_4.txt, 1813, 4
037c_dawn_and_the_older_boy_8.txt, 1529, 8
086c_mary_anne_and_camp_bsc_14.txt, 1532, 14
m08c_jessi_and_the_jewel_thieves_9.txt, 1861, 9
041c_mary_anne_vs_logan_10.txt, 1456, 10
066c_maid_mary_anne_2.txt, 3224, 2
123c_claudias_big_party_4.txt, 2354, 4
012c_claudia_and_the_new_girl_13.txt, 1685, 13
m33c_stacey_and_the_stolen_hearts_2.txt, 2354, 2
032c_kristy_and_the_secret_of_susan_9.txt, 1404, 9
013c_goodbye_stacey_goodbye_6.txt, 1798, 6
m06c_the_mystery_at_claudias_house_14.txt, 1827, 14
091c_claudia_and_the_first_thanksgiving_1.txt, 2553, 1
051c_staceys_ex_best_friend_13.txt, 1823, 13
061c_jessi_and_the_awful_secret_15.txt, 1579, 15
046c_mary_anne_misses_logan_14.txt, 1896, 14
m27c_claudia_and_the_lighthouse_ghost_12.txt, 1085, 12
117c_claudia_and_the_terrible_truth_1.txt, 1820, 1
006c_kristys_big_day_9.txt, 1649, 9
040c_claudia_and_the_middle_school_mystery_10.txt, 1436, 10
128c_claudia_and_the_little_liar_8.txt, 1802, 8
014c_hello_mallory_5.txt, 1610, 5
035c_jessis_babysitter_15.txt, 1544, 15
024c_kristy_and_the_mothers_day_surprise_14.txt, 1889, 14
049c_claudia_and_the_genius_of_elm_street_5.txt, 2310, 5
120c_mary_anne_and_the_playground_fight_13.txt, 1502, 13
073c_mary_anne_and_miss_priss_11.txt, 1284, 11
066c_maid_mary_anne_15.txt, 1004, 15
076c_staceys_lie_12.txt, 2379, 12
055c_jessis_gold_medal_6.txt, 1459, 6
002c_claudia_and_the_phantom_phone_calls_4.txt, 2003, 4
063c_claudias_freind_friend_2.txt, 3264, 2
m21c_claudia_and_the_recipe_for_danger_9.txt, 1813, 9
005c_dawn_and_the_impossible_three_10.txt, 1545, 10
serr2c_logan_bruno_boy_babysitter_13.txt, 1587, 13
m31c_mary_anne_and_the_music_box_secret_5.txt, 1756, 5
059c_mallory_hates_boys_and_gym_2.txt, 3301, 2
m11c_claudia_and_the_mystery_at_the_museum_5.txt, 1743, 5
104c_abbys_twin_8.txt, 1175, 8
m01c_stacey_and_the_mystery_ring_15.txt, 1493, 15
m07c_dawn_and_the_disappearing_dogs_5.txt, 1763, 5
m07c_dawn_and_the_disappearing_dogs_4.txt, 1950, 4
088c_farewell_dawn_1.txt, 2161, 1
059c_mallory_hates_boys_and_gym_3.txt, 2664, 3
m31c_mary_anne_and_the_music_box_secret_4.txt, 1869, 4
m01c_stacey_and_the_mystery_ring_14.txt, 1985, 14
104c_abbys_twin_9.txt, 1225, 9
m11c_claudia_and_the_mystery_at_the_museum_4.txt, 1822, 4
m21c_claudia_and_the_recipe_for_danger_8.txt, 1648, 8
serr2c_logan_bruno_boy_babysitter_12.txt, 1462, 12
005c_dawn_and_the_impossible_three_11.txt, 2002, 11
076c_staceys_lie_13.txt, 1242, 13
066c_maid_mary_anne_14.txt, 1389, 14
063c_claudias_freind_friend_3.txt, 1853, 3
002c_claudia_and_the_phantom_phone_calls_5.txt, 1227, 5
055c_jessis_gold_medal_7.txt, 1356, 7
073c_mary_anne_and_miss_priss_10.txt, 1874, 10
120c_mary_anne_and_the_playground_fight_12.txt, 1804, 12
049c_claudia_and_the_genius_of_elm_street_4.txt, 2151, 4
024c_kristy_and_the_mothers_day_surprise_15.txt, 1617, 15
014c_hello_mallory_4.txt, 1751, 4
035c_jessis_babysitter_14.txt, 1542, 14
128c_claudia_and_the_little_liar_9.txt, 1536, 9
093c_mary_anne_and_the_memory_garden_1.txt, 2080, 1
040c_claudia_and_the_middle_school_mystery_11.txt, 1773, 11
006c_kristys_big_day_8.txt, 2336, 8
m27c_claudia_and_the_lighthouse_ghost_13.txt, 1430, 13
046c_mary_anne_misses_logan_15.txt, 1521, 15
013c_goodbye_stacey_goodbye_7.txt, 1809, 7
032c_kristy_and_the_secret_of_susan_8.txt, 1319, 8
061c_jessi_and_the_awful_secret_14.txt, 1859, 14
051c_staceys_ex_best_friend_12.txt, 1828, 12
m06c_the_mystery_at_claudias_house_15.txt, 1921, 15
012c_claudia_and_the_new_girl_12.txt, 1698, 12
123c_claudias_big_party_5.txt, 2204, 5
m33c_stacey_and_the_stolen_hearts_3.txt, 1545, 3
041c_mary_anne_vs_logan_11.txt, 1689, 11
m08c_jessi_and_the_jewel_thieves_8.txt, 1745, 8
086c_mary_anne_and_camp_bsc_15.txt, 878, 15
066c_maid_mary_anne_3.txt, 2752, 3
037c_dawn_and_the_older_boy_9.txt, 1187, 9
008c_boy_crazy_stacey_5.txt, 2563, 5
m28c_abby_and_the_mystery_baby_9.txt, 1682, 9
034c_mary_anne_and_too_many_boys_8.txt, 1093, 8
009c_the_ghost_at_dawns_house_5.txt, 1419, 5
072c_dawn_and_the_we_heart_kids_club_14.txt, 1099, 14
119c_staceys_ex_boyfriend_13.txt, 1539, 13
031c_dawns_wicked_stepsister_6.txt, 1597, 6
062c_kristy_and_the_worst_kid_ever_14.txt, 1324, 14
024c_kristy_and_the_mothers_day_surprise_2.txt, 2203, 2
m12c_dawn_and_the_surfer_ghost_6.txt, 1991, 6
047c_mallory_on_strike_14.txt, 2259, 14
083c_stacey_vs_the_bsc_4.txt, 1528, 4
m13c_mary_anne_and_the_library_mystery_6.txt, 1719, 6
m33c_stacey_and_the_stolen_hearts_13.txt, 1602, 13
060c_mary_annes_makeover_1.txt, 2151, 1
108c_dont_give_up_mallory_10.txt, 1675, 10
m26c_dawn_schafer_undercover_babysitter_12.txt, 1766, 12
122c_kristy_in_charge_15.txt, 1326, 15
085c_claudia_kishi_live_from_wsto_11.txt, 1371, 11
008c_boy_crazy_stacey_13.txt, 1202, 13
m24c_mary_anne_and_the_silent_witness_2.txt, 2433, 2
129c_kristy_at_bat_8.txt, 1938, 8
070c_stacey_and_the_cheerleaders_3.txt, 2501, 3
031c_dawns_wicked_stepsister_13.txt, 1850, 13
m09c_kristy_and_the_haunted_mansion_11.txt, 1434, 11
001c_kristys_great_idea_4.txt, 2068, 4
004c_mary_anne_saves_the_day_1.txt, 2767, 1
099c_staceys_broken_heart_6.txt, 1706, 6
005c_dawn_and_the_impossible_three_4.txt, 1689, 4
035c_jessis_babysitter_8.txt, 1230, 8
serr3c_shannons_story_4.txt, 2871, 4
012c_claudia_and_the_new_girl_6.txt, 1946, 6
131c_the_fire_at_mary_annes_house_3.txt, 1424, 3
098c_dawn_and_too_many_sitters_8.txt, 1442, 8
112c_kristy_and_the_sister_war_3.txt, 1578, 3
007c_claudia_and_mean_jeanine_1.txt, 2149, 1
058c_staceys_choice_15.txt, 1692, 15
011c_kristy_and_the_snobs_10.txt, 1722, 10
011c_kristy_and_the_snobs_12.txt, 1941, 12
m04c_kristy_and_the_missing_child_14.txt, 1861, 14
007c_claudia_and_mean_jeanine_3.txt, 1481, 3
112c_kristy_and_the_sister_war_1.txt, 1996, 1
131c_the_fire_at_mary_annes_house_1.txt, 2032, 1
012c_claudia_and_the_new_girl_4.txt, 1911, 4
m07c_dawn_and_the_disappearing_dogs_14.txt, 1937, 14
005c_dawn_and_the_impossible_three_6.txt, 2078, 6
serr3c_shannons_story_6.txt, 1121, 6
m09c_kristy_and_the_haunted_mansion_13.txt, 1963, 13
031c_dawns_wicked_stepsister_11.txt, 1743, 11
099c_staceys_broken_heart_4.txt, 2162, 4
004c_mary_anne_saves_the_day_3.txt, 2196, 3
001c_kristys_great_idea_6.txt, 2244, 6
084c_dawn_and_the_school_spirit_war_9.txt, 1567, 9
085c_claudia_kishi_live_from_wsto_8.txt, 2131, 8
008c_boy_crazy_stacey_11.txt, 1435, 11
085c_claudia_kishi_live_from_wsto_13.txt, 1347, 13
070c_stacey_and_the_cheerleaders_1.txt, 2716, 1
052c_mary_anne_plus_too_many_babies_8.txt, 1404, 8
108c_dont_give_up_mallory_12.txt, 2084, 12
060c_mary_annes_makeover_3.txt, 1299, 3
m33c_stacey_and_the_stolen_hearts_11.txt, 1726, 11
m14c_stacey_and_the_mystery_at_the_mall_9.txt, 2095, 9
m13c_mary_anne_and_the_library_mystery_4.txt, 1919, 4
083c_stacey_vs_the_bsc_6.txt, 1402, 6
046c_mary_anne_misses_logan_9.txt, 1762, 9
m26c_dawn_schafer_undercover_babysitter_10.txt, 1819, 10
119c_staceys_ex_boyfriend_11.txt, 2190, 11
017c_mary_annes_bad_luck_mystery_8.txt, 1654, 8
009c_the_ghost_at_dawns_house_7.txt, 1537, 7
101c_claudia_kishi_middle_school_dropout_9.txt, 1867, 9
m12c_dawn_and_the_surfer_ghost_4.txt, 1726, 4
031c_dawns_wicked_stepsister_4.txt, 2031, 4
008c_boy_crazy_stacey_7.txt, 1546, 7
056c_keep_out_claudia_14.txt, 1617, 14
m16c_claudia_and_the_clue_in_the_photograph_9.txt, 2086, 9
m33c_stacey_and_the_stolen_hearts_1.txt, 1262, 1
123c_claudias_big_party_7.txt, 2164, 7
012c_claudia_and_the_new_girl_10.txt, 1865, 10
066c_maid_mary_anne_1.txt, 3306, 1
041c_mary_anne_vs_logan_13.txt, 1663, 13
127c_abbys_un_valentine_15.txt, 1097, 15
091c_claudia_and_the_first_thanksgiving_2.txt, 2914, 2
051c_staceys_ex_best_friend_10.txt, 1732, 10
013c_goodbye_stacey_goodbye_5.txt, 1747, 5
117c_claudia_and_the_terrible_truth_2.txt, 2388, 2
m22c_stacey_and_the_haunted_masquerade_15.txt, 1377, 15
m27c_claudia_and_the_lighthouse_ghost_11.txt, 1407, 11
093c_mary_anne_and_the_memory_garden_3.txt, 2284, 3
040c_claudia_and_the_middle_school_mystery_13.txt, 1690, 13
105c_stacey_the_math_whiz_8.txt, 1940, 8
049c_claudia_and_the_genius_of_elm_street_6.txt, 1903, 6
118c_kristy_thomas_dog_trainer_14.txt, 1049, 14
014c_hello_mallory_6.txt, 1222, 6
m06c_the_mystery_at_claudias_house_9.txt, 1594, 9
120c_mary_anne_and_the_playground_fight_10.txt, 1179, 10
073c_mary_anne_and_miss_priss_12.txt, 2385, 12
005c_dawn_and_the_impossible_three_13.txt, 1872, 13
serr2c_logan_bruno_boy_babysitter_10.txt, 1384, 10
055c_jessis_gold_medal_5.txt, 1335, 5
063c_claudias_freind_friend_1.txt, 1991, 1
116c_abby_and_the_best_kid_ever_15.txt, 1201, 15
002c_claudia_and_the_phantom_phone_calls_7.txt, 2446, 7
076c_staceys_lie_11.txt, 2090, 11
088c_farewell_dawn_3.txt, 1468, 3
m07c_dawn_and_the_disappearing_dogs_6.txt, 1556, 6
m11c_claudia_and_the_mystery_at_the_museum_6.txt, 1843, 6
041c_mary_anne_vs_logan_9.txt, 1437, 9
089c_kristy_and_the_dirty_diapers_9.txt, 1119, 9
m20c_mary_anne_and_the_zoo_mystery_14.txt, 2104, 14
m31c_mary_anne_and_the_music_box_secret_6.txt, 1965, 6
059c_mallory_hates_boys_and_gym_1.txt, 2520, 1
m20c_mary_anne_and_the_zoo_mystery_15.txt, 702, 15
089c_kristy_and_the_dirty_diapers_8.txt, 1329, 8
041c_mary_anne_vs_logan_8.txt, 1407, 8
m11c_claudia_and_the_mystery_at_the_museum_7.txt, 1940, 7
m31c_mary_anne_and_the_music_box_secret_7.txt, 1835, 7
088c_farewell_dawn_2.txt, 3611, 2
m07c_dawn_and_the_disappearing_dogs_7.txt, 1843, 7
116c_abby_and_the_best_kid_ever_14.txt, 1387, 14
002c_claudia_and_the_phantom_phone_calls_6.txt, 1750, 6
055c_jessis_gold_medal_4.txt, 1650, 4
076c_staceys_lie_10.txt, 1320, 10
serr2c_logan_bruno_boy_babysitter_11.txt, 991, 11
005c_dawn_and_the_impossible_three_12.txt, 2032, 12
073c_mary_anne_and_miss_priss_13.txt, 1699, 13
120c_mary_anne_and_the_playground_fight_11.txt, 993, 11
m06c_the_mystery_at_claudias_house_8.txt, 2089, 8
118c_kristy_thomas_dog_trainer_15.txt, 1156, 15
014c_hello_mallory_7.txt, 1303, 7
049c_claudia_and_the_genius_of_elm_street_7.txt, 1190, 7
040c_claudia_and_the_middle_school_mystery_12.txt, 1380, 12
093c_mary_anne_and_the_memory_garden_2.txt, 2125, 2
105c_stacey_the_math_whiz_9.txt, 813, 9
m22c_stacey_and_the_haunted_masquerade_14.txt, 2313, 14
117c_claudia_and_the_terrible_truth_3.txt, 1849, 3
m27c_claudia_and_the_lighthouse_ghost_10.txt, 2043, 10
051c_staceys_ex_best_friend_11.txt, 1336, 11
091c_claudia_and_the_first_thanksgiving_3.txt, 1650, 3
013c_goodbye_stacey_goodbye_4.txt, 1565, 4
127c_abbys_un_valentine_14.txt, 2103, 14
041c_mary_anne_vs_logan_12.txt, 1787, 12
012c_claudia_and_the_new_girl_11.txt, 1720, 11
123c_claudias_big_party_6.txt, 1542, 6
m16c_claudia_and_the_clue_in_the_photograph_8.txt, 2055, 8
008c_boy_crazy_stacey_6.txt, 1832, 6
056c_keep_out_claudia_15.txt, 1674, 15
031c_dawns_wicked_stepsister_5.txt, 1691, 5
024c_kristy_and_the_mothers_day_surprise_1.txt, 2152, 1
m12c_dawn_and_the_surfer_ghost_5.txt, 1479, 5
017c_mary_annes_bad_luck_mystery_9.txt, 1750, 9
119c_staceys_ex_boyfriend_10.txt, 1310, 10
101c_claudia_kishi_middle_school_dropout_8.txt, 1541, 8
009c_the_ghost_at_dawns_house_6.txt, 1250, 6
m26c_dawn_schafer_undercover_babysitter_11.txt, 1776, 11
m33c_stacey_and_the_stolen_hearts_10.txt, 1486, 10
060c_mary_annes_makeover_2.txt, 3050, 2
108c_dont_give_up_mallory_13.txt, 2062, 13
046c_mary_anne_misses_logan_8.txt, 1607, 8
083c_stacey_vs_the_bsc_7.txt, 1164, 7
m13c_mary_anne_and_the_library_mystery_5.txt, 1873, 5
m14c_stacey_and_the_mystery_at_the_mall_8.txt, 1937, 8
052c_mary_anne_plus_too_many_babies_9.txt, 1324, 9
085c_claudia_kishi_live_from_wsto_12.txt, 776, 12
008c_boy_crazy_stacey_10.txt, 1468, 10
m24c_mary_anne_and_the_silent_witness_1.txt, 2037, 1
085c_claudia_kishi_live_from_wsto_9.txt, 1019, 9
099c_staceys_broken_heart_5.txt, 2416, 5
084c_dawn_and_the_school_spirit_war_8.txt, 1986, 8
001c_kristys_great_idea_7.txt, 1789, 7
004c_mary_anne_saves_the_day_2.txt, 1826, 2
031c_dawns_wicked_stepsister_10.txt, 1516, 10
m09c_kristy_and_the_haunted_mansion_12.txt, 1748, 12
serr3c_shannons_story_7.txt, 1493, 7
005c_dawn_and_the_impossible_three_7.txt, 1730, 7
m07c_dawn_and_the_disappearing_dogs_15.txt, 1740, 15
012c_claudia_and_the_new_girl_5.txt, 1601, 5
011c_kristy_and_the_snobs_13.txt, 1846, 13
007c_claudia_and_mean_jeanine_2.txt, 1683, 2
m04c_kristy_and_the_missing_child_15.txt, 2022, 15
033c_claudia_and_the_great_search_9.txt, 1496, 9
130c_staceys_movie_6.txt, 1453, 6
026c_claudia_and_the_sad_goodbye_5.txt, 1618, 5
m16c_claudia_and_the_clue_in_the_photograph_11.txt, 2108, 11
053c_kristy_for_president_4.txt, 1564, 4
131c_the_fire_at_mary_annes_house_15.txt, 1881, 15
m10c_stacey_and_the_mystery_money_1.txt, 2393, 1
130c_staceys_movie_13.txt, 802, 13
090c_welcome_to_the_bsc_abby_15.txt, 882, 15
m27c_claudia_and_the_lighthouse_ghost_8.txt, 1814, 8
125c_mary_anne_in_the_middle_9.txt, 1070, 9
062c_kristy_and_the_worst_kid_ever_9.txt, 2554, 9
109c_mary_anne_to_the_rescue_6.txt, 1387, 6
m24c_mary_anne_and_the_silent_witness_12.txt, 2153, 12
097c_claudia_and_the_worlds_cutest_baby_12.txt, 972, 12
064c_dawns_family_feud_15.txt, 780, 15
m20c_mary_anne_and_the_zoo_mystery_5.txt, 1804, 5
073c_mary_anne_and_miss_priss_9.txt, 1598, 9
072c_dawn_and_the_we_heart_kids_club_9.txt, 1738, 9
009c_the_ghost_at_dawns_house_13.txt, 1608, 13
m23c_abby_and_the_secret_society_13.txt, 1963, 13
067c_dawns_big_move_6.txt, 1655, 6
118c_kristy_thomas_dog_trainer_3.txt, 1376, 3
029c_mallory_and_the_mystery_diary_9.txt, 1771, 9
064c_dawns_family_feud_7.txt, 1329, 7
011c_kristy_and_the_snobs_5.txt, 1135, 5
087c_stacey_and_the_bad_girls_9.txt, 1128, 9
023c_dawn_on_the_coast_9.txt, 1374, 9
071c_claudia_and_the_perfect_boy_6.txt, 1370, 6
m35c_abby_and_the_notorius_neighbor_7.txt, 1785, 7
126c_the_all_new_mallory_pike_7.txt, 1978, 7
098c_dawn_and_too_many_sitters_14.txt, 1440, 14
m35c_abby_and_the_notorius_neighbor_12.txt, 1689, 12
100c_kristys_worst_idea_14.txt, 1501, 14
048c_jessis_wish_1.txt, 1906, 1
m26c_dawn_schafer_undercover_babysitter_4.txt, 1691, 4
022c_jessi_ramsey_petsitter_7.txt, 1404, 7
081c_kristy_and_mr_mom_4.txt, 2103, 4
021c_mallory_and_the_trouble_with_twins_15.txt, 1338, 15
069c_get_well_soon_mallory_13.txt, 1482, 13
094c_stacey_mcgill_super_sitter_13.txt, 1217, 13
120c_mary_anne_and_the_playground_fight_2.txt, 3052, 2
025c_mary_anne_and_the_search_for_tigger_8.txt, 1599, 8
m22c_stacey_and_the_haunted_masquerade_5.txt, 1615, 5
028c_welcome_back_stacey_7.txt, 2093, 7
115c_jessis_big_break_15.txt, 1723, 15
113c_claudia_makes_up_her_mind_12.txt, 1529, 12
093c_mary_anne_and_the_memory_garden_10.txt, 1621, 10
027c_jessi_and_the_superbrat_5.txt, 1743, 5
068c_jessi_and_the_bad_babysitter_13.txt, 1055, 13
103c_happy_holidays_jessi_10.txt, 1492, 10
m29c_stacey_and_the_fashion_victim_8.txt, 1578, 8
065c_staceys_big_crush_9.txt, 1518, 9
034c_mary_anne_and_too_many_boys_11.txt, 1571, 11
049c_claudia_and_the_genius_of_elm_street_12.txt, 1836, 12
m09c_kristy_and_the_haunted_mansion_3.txt, 1707, 3
114c_the_secret_life_of_mary_anne_spier_1.txt, 1447, 1
075c_jessis_horrible_prank_11.txt, 1201, 11
091c_claudia_and_the_first_thanksgiving_11.txt, 925, 11
111c_staceys_secret_friend_3.txt, 1008, 3
061c_jessi_and_the_awful_secret_3.txt, 1991, 3
127c_abbys_un_valentine_9.txt, 1156, 9
m02c_beware_dawn_8.txt, 1747, 8
m30c_kristy_and_the_mystery_train_7.txt, 1264, 7
m36c_kristy_and_the_cat_burglar_15.txt, 1756, 15
074c_kristy_and_the_copycat_12.txt, 1204, 12
069c_get_well_soon_mallory_8.txt, 1736, 8
039c_poor_mallory_9.txt, 1656, 9
097c_claudia_and_the_worlds_cutest_baby_9.txt, 1740, 9
010c_logan_likes_mary_anne_3.txt, 1457, 3
090c_welcome_to_the_bsc_abby_1.txt, 2276, 1
068c_jessi_and_the_bad_babysitter_4.txt, 1568, 4
m13c_mary_anne_and_the_library_mystery_13.txt, 1444, 13
028c_welcome_back_stacey_14.txt, 1726, 14
016c_jessis_secret_language_7.txt, 1613, 7
017c_mary_annes_bad_luck_mystery_12.txt, 1266, 12
057c_dawn_saves_the_planet_12.txt, 2012, 12
115c_jessis_big_break_5.txt, 919, 5
092c_mallorys_christmas_wish_7.txt, 1789, 7
057c_dawn_saves_the_planet_8.txt, 1875, 8
serr2c_logan_bruno_boy_babysitter_7.txt, 1142, 7
m30c_kristy_and_the_mystery_train_14.txt, 1024, 14
040c_claudia_and_the_middle_school_mystery_5.txt, 1582, 5
m32c_claudia_and_the_mystery_in_the_painting_2.txt, 3028, 2
116c_abby_and_the_best_kid_ever_1.txt, 1503, 1
043c_staceys_emergency_6.txt, 1797, 6
043c_staceys_emergency_7.txt, 1714, 7
m32c_claudia_and_the_mystery_in_the_painting_3.txt, 3603, 3
040c_claudia_and_the_middle_school_mystery_4.txt, 1866, 4
serr2c_logan_bruno_boy_babysitter_6.txt, 1499, 6
m30c_kristy_and_the_mystery_train_15.txt, 995, 15
115c_jessis_big_break_4.txt, 2388, 4
057c_dawn_saves_the_planet_13.txt, 1698, 13
057c_dawn_saves_the_planet_9.txt, 2131, 9
092c_mallorys_christmas_wish_6.txt, 1375, 6
016c_jessis_secret_language_6.txt, 1842, 6
028c_welcome_back_stacey_15.txt, 1558, 15
017c_mary_annes_bad_luck_mystery_13.txt, 1585, 13
068c_jessi_and_the_bad_babysitter_5.txt, 1363, 5
m13c_mary_anne_and_the_library_mystery_12.txt, 1870, 12
097c_claudia_and_the_worlds_cutest_baby_8.txt, 1301, 8
010c_logan_likes_mary_anne_2.txt, 1472, 2
074c_kristy_and_the_copycat_13.txt, 1468, 13
m36c_kristy_and_the_cat_burglar_14.txt, 1392, 14
094c_stacey_mcgill_super_sitter_1.txt, 1995, 1
039c_poor_mallory_8.txt, 1941, 8
069c_get_well_soon_mallory_9.txt, 2203, 9
m02c_beware_dawn_9.txt, 1978, 9
m30c_kristy_and_the_mystery_train_6.txt, 1811, 6
127c_abbys_un_valentine_8.txt, 1814, 8
061c_jessi_and_the_awful_secret_2.txt, 3280, 2
038c_kristys_mystery_admirer_1.txt, 1642, 1
111c_staceys_secret_friend_2.txt, 2709, 2
091c_claudia_and_the_first_thanksgiving_10.txt, 1179, 10
075c_jessis_horrible_prank_10.txt, 1474, 10
m09c_kristy_and_the_haunted_mansion_2.txt, 2320, 2
054c_mallory_and_the_dream_horse_1.txt, 2033, 1
049c_claudia_and_the_genius_of_elm_street_13.txt, 1625, 13
034c_mary_anne_and_too_many_boys_10.txt, 1462, 10
065c_staceys_big_crush_8.txt, 1666, 8
075c_jessis_horrible_prank_1.txt, 1401, 1
093c_mary_anne_and_the_memory_garden_11.txt, 2027, 11
m29c_stacey_and_the_fashion_victim_9.txt, 1544, 9
103c_happy_holidays_jessi_11.txt, 1133, 11
068c_jessi_and_the_bad_babysitter_12.txt, 1523, 12
027c_jessi_and_the_superbrat_4.txt, 1648, 4
028c_welcome_back_stacey_6.txt, 1825, 6
m22c_stacey_and_the_haunted_masquerade_4.txt, 1555, 4
025c_mary_anne_and_the_search_for_tigger_9.txt, 1685, 9
113c_claudia_makes_up_her_mind_13.txt, 905, 13
115c_jessis_big_break_14.txt, 1209, 14
069c_get_well_soon_mallory_12.txt, 1601, 12
120c_mary_anne_and_the_playground_fight_3.txt, 1393, 3
094c_stacey_mcgill_super_sitter_12.txt, 1193, 12
081c_kristy_and_mr_mom_5.txt, 2625, 5
022c_jessi_ramsey_petsitter_6.txt, 1544, 6
021c_mallory_and_the_trouble_with_twins_14.txt, 1577, 14
100c_kristys_worst_idea_15.txt, 1909, 15
m35c_abby_and_the_notorius_neighbor_13.txt, 1285, 13
098c_dawn_and_too_many_sitters_15.txt, 591, 15
126c_the_all_new_mallory_pike_6.txt, 2044, 6
003c_the_truth_about_stacey_1.txt, 2010, 1
m26c_dawn_schafer_undercover_babysitter_5.txt, 2266, 5
011c_kristy_and_the_snobs_4.txt, 1985, 4
m35c_abby_and_the_notorius_neighbor_6.txt, 1559, 6
071c_claudia_and_the_perfect_boy_7.txt, 2037, 7
023c_dawn_on_the_coast_8.txt, 1605, 8
087c_stacey_and_the_bad_girls_8.txt, 924, 8
029c_mallory_and_the_mystery_diary_8.txt, 1552, 8
064c_dawns_family_feud_6.txt, 1699, 6
118c_kristy_thomas_dog_trainer_2.txt, 3138, 2
m23c_abby_and_the_secret_society_12.txt, 1634, 12
067c_dawns_big_move_7.txt, 1146, 7
m20c_mary_anne_and_the_zoo_mystery_4.txt, 1476, 4
009c_the_ghost_at_dawns_house_12.txt, 1669, 12
072c_dawn_and_the_we_heart_kids_club_8.txt, 1811, 8
073c_mary_anne_and_miss_priss_8.txt, 1940, 8
064c_dawns_family_feud_14.txt, 1906, 14
097c_claudia_and_the_worlds_cutest_baby_13.txt, 1435, 13
m24c_mary_anne_and_the_silent_witness_13.txt, 2045, 13
042c_jessi_and_the_dance_school_phantom_1.txt, 2669, 1
109c_mary_anne_to_the_rescue_7.txt, 1865, 7
062c_kristy_and_the_worst_kid_ever_8.txt, 1532, 8
090c_welcome_to_the_bsc_abby_14.txt, 1364, 14
125c_mary_anne_in_the_middle_8.txt, 1520, 8
m27c_claudia_and_the_lighthouse_ghost_9.txt, 958, 9
131c_the_fire_at_mary_annes_house_14.txt, 1460, 14
130c_staceys_movie_12.txt, 1463, 12
053c_kristy_for_president_5.txt, 1285, 5
m16c_claudia_and_the_clue_in_the_photograph_10.txt, 2115, 10
033c_claudia_and_the_great_search_8.txt, 1334, 8
026c_claudia_and_the_sad_goodbye_4.txt, 1248, 4
130c_staceys_movie_7.txt, 2219, 7
m16c_claudia_and_the_clue_in_the_photograph_12.txt, 1775, 12
053c_kristy_for_president_7.txt, 2431, 7
086c_mary_anne_and_camp_bsc_8.txt, 1907, 8
130c_staceys_movie_5.txt, 1128, 5
026c_claudia_and_the_sad_goodbye_6.txt, 1794, 6
m19c_kristy_and_the_missing_fortune_8.txt, 1992, 8
027c_jessi_and_the_superbrat_15.txt, 1865, 15
130c_staceys_movie_10.txt, 948, 10
m10c_stacey_and_the_mystery_money_2.txt, 2658, 2
102c_mary_anne_and_the_little_princess_9.txt, 2048, 9
096c_abbys_lucky_thirteen_8.txt, 1081, 8
109c_mary_anne_to_the_rescue_5.txt, 1504, 5
042c_jessi_and_the_dance_school_phantom_3.txt, 2307, 3
m24c_mary_anne_and_the_silent_witness_11.txt, 2017, 11
009c_the_ghost_at_dawns_house_10.txt, 1503, 10
m20c_mary_anne_and_the_zoo_mystery_6.txt, 1993, 6
097c_claudia_and_the_worlds_cutest_baby_11.txt, 1972, 11
067c_dawns_big_move_5.txt, 1902, 5
m23c_abby_and_the_secret_society_10.txt, 2093, 10
071c_claudia_and_the_perfect_boy_5.txt, 1532, 5
m35c_abby_and_the_notorius_neighbor_4.txt, 1849, 4
011c_kristy_and_the_snobs_6.txt, 1862, 6
m11c_claudia_and_the_mystery_at_the_museum_15.txt, 1216, 15
064c_dawns_family_feud_4.txt, 1628, 4
m18c_stacey_and_the_mystery_at_the_empty_house_8.txt, 1951, 8
022c_jessi_ramsey_petsitter_4.txt, 1618, 4
081c_kristy_and_mr_mom_7.txt, 1808, 7
029c_mallory_and_the_mystery_diary_15.txt, 1669, 15
m14c_stacey_and_the_mystery_at_the_mall_15.txt, 2005, 15
048c_jessis_wish_2.txt, 2631, 2
m26c_dawn_schafer_undercover_babysitter_7.txt, 1576, 7
001c_kristys_great_idea_15.txt, 2426, 15
003c_the_truth_about_stacey_3.txt, 2229, 3
126c_the_all_new_mallory_pike_4.txt, 1433, 4
m35c_abby_and_the_notorius_neighbor_11.txt, 1548, 11
113c_claudia_makes_up_her_mind_11.txt, 1996, 11
m22c_stacey_and_the_haunted_masquerade_6.txt, 1585, 6
028c_welcome_back_stacey_4.txt, 1473, 4
094c_stacey_mcgill_super_sitter_10.txt, 1807, 10
120c_mary_anne_and_the_playground_fight_1.txt, 1835, 1
069c_get_well_soon_mallory_10.txt, 1880, 10
047c_mallory_on_strike_9.txt, 1185, 9
034c_mary_anne_and_too_many_boys_12.txt, 1244, 12
027c_jessi_and_the_superbrat_6.txt, 1566, 6
068c_jessi_and_the_bad_babysitter_10.txt, 1824, 10
103c_happy_holidays_jessi_13.txt, 1554, 13
093c_mary_anne_and_the_memory_garden_13.txt, 1698, 13
075c_jessis_horrible_prank_3.txt, 2377, 3
049c_claudia_and_the_genius_of_elm_street_11.txt, 2359, 11
054c_mallory_and_the_dream_horse_3.txt, 1728, 3
091c_claudia_and_the_first_thanksgiving_12.txt, 1348, 12
113c_claudia_makes_up_her_mind_8.txt, 1530, 8
m34c_mary_anne_and_the_haunted_bookstore_14.txt, 2405, 14
114c_the_secret_life_of_mary_anne_spier_2.txt, 2800, 2
075c_jessis_horrible_prank_12.txt, 1941, 12
m30c_kristy_and_the_mystery_train_4.txt, 2477, 4
038c_kristys_mystery_admirer_3.txt, 2102, 3
092c_mallorys_christmas_wish_14.txt, 1546, 14
094c_stacey_mcgill_super_sitter_3.txt, 1978, 3
074c_kristy_and_the_copycat_11.txt, 1483, 11
017c_mary_annes_bad_luck_mystery_11.txt, 1856, 11
016c_jessis_secret_language_4.txt, 2343, 4
m13c_mary_anne_and_the_library_mystery_10.txt, 1893, 10
090c_welcome_to_the_bsc_abby_2.txt, 2965, 2
068c_jessi_and_the_bad_babysitter_7.txt, 1620, 7
serr2c_logan_bruno_boy_babysitter_4.txt, 1206, 4
092c_mallorys_christmas_wish_4.txt, 1132, 4
057c_dawn_saves_the_planet_11.txt, 1956, 11
023c_dawn_on_the_coast_15.txt, 1762, 15
115c_jessis_big_break_6.txt, 1408, 6
043c_staceys_emergency_5.txt, 1724, 5
021c_mallory_and_the_trouble_with_twins_8.txt, 1587, 8
116c_abby_and_the_best_kid_ever_2.txt, 3293, 2
040c_claudia_and_the_middle_school_mystery_6.txt, 1809, 6
m32c_claudia_and_the_mystery_in_the_painting_1.txt, 2110, 1
116c_abby_and_the_best_kid_ever_3.txt, 885, 3
040c_claudia_and_the_middle_school_mystery_7.txt, 1543, 7
043c_staceys_emergency_4.txt, 1497, 4
021c_mallory_and_the_trouble_with_twins_9.txt, 1562, 9
092c_mallorys_christmas_wish_5.txt, 1405, 5
115c_jessis_big_break_7.txt, 1628, 7
023c_dawn_on_the_coast_14.txt, 1369, 14
057c_dawn_saves_the_planet_10.txt, 1461, 10
serr2c_logan_bruno_boy_babysitter_5.txt, 1822, 5
m13c_mary_anne_and_the_library_mystery_11.txt, 1836, 11
068c_jessi_and_the_bad_babysitter_6.txt, 1646, 6
090c_welcome_to_the_bsc_abby_3.txt, 1436, 3
017c_mary_annes_bad_luck_mystery_10.txt, 1692, 10
016c_jessis_secret_language_5.txt, 1273, 5
074c_kristy_and_the_copycat_10.txt, 1245, 10
094c_stacey_mcgill_super_sitter_2.txt, 3627, 2
010c_logan_likes_mary_anne_1.txt, 2381, 1
092c_mallorys_christmas_wish_15.txt, 857, 15
038c_kristys_mystery_admirer_2.txt, 2324, 2
061c_jessi_and_the_awful_secret_1.txt, 2227, 1
m30c_kristy_and_the_mystery_train_5.txt, 1973, 5
075c_jessis_horrible_prank_13.txt, 1513, 13
114c_the_secret_life_of_mary_anne_spier_3.txt, 1884, 3
111c_staceys_secret_friend_1.txt, 1664, 1
091c_claudia_and_the_first_thanksgiving_13.txt, 1353, 13
m34c_mary_anne_and_the_haunted_bookstore_15.txt, 949, 15
113c_claudia_makes_up_her_mind_9.txt, 978, 9
049c_claudia_and_the_genius_of_elm_street_10.txt, 1139, 10
054c_mallory_and_the_dream_horse_2.txt, 2387, 2
m09c_kristy_and_the_haunted_mansion_1.txt, 2215, 1
103c_happy_holidays_jessi_12.txt, 1100, 12
068c_jessi_and_the_bad_babysitter_11.txt, 1420, 11
027c_jessi_and_the_superbrat_7.txt, 1919, 7
075c_jessis_horrible_prank_2.txt, 1266, 2
093c_mary_anne_and_the_memory_garden_12.txt, 1726, 12
034c_mary_anne_and_too_many_boys_13.txt, 1409, 13
094c_stacey_mcgill_super_sitter_11.txt, 1500, 11
047c_mallory_on_strike_8.txt, 1937, 8
069c_get_well_soon_mallory_11.txt, 1959, 11
113c_claudia_makes_up_her_mind_10.txt, 1774, 10
028c_welcome_back_stacey_5.txt, 1563, 5
m22c_stacey_and_the_haunted_masquerade_7.txt, 1845, 7
003c_the_truth_about_stacey_2.txt, 2311, 2
001c_kristys_great_idea_14.txt, 1842, 14
m26c_dawn_schafer_undercover_babysitter_6.txt, 1995, 6
048c_jessis_wish_3.txt, 2384, 3
m35c_abby_and_the_notorius_neighbor_10.txt, 1615, 10
126c_the_all_new_mallory_pike_5.txt, 1474, 5
m18c_stacey_and_the_mystery_at_the_empty_house_9.txt, 1763, 9
m14c_stacey_and_the_mystery_at_the_mall_14.txt, 2303, 14
029c_mallory_and_the_mystery_diary_14.txt, 1585, 14
081c_kristy_and_mr_mom_6.txt, 2507, 6
022c_jessi_ramsey_petsitter_5.txt, 1647, 5
064c_dawns_family_feud_5.txt, 2243, 5
m11c_claudia_and_the_mystery_at_the_museum_14.txt, 1523, 14
m35c_abby_and_the_notorius_neighbor_5.txt, 1371, 5
071c_claudia_and_the_perfect_boy_4.txt, 1899, 4
011c_kristy_and_the_snobs_7.txt, 1886, 7
067c_dawns_big_move_4.txt, 1750, 4
m23c_abby_and_the_secret_society_11.txt, 1806, 11
118c_kristy_thomas_dog_trainer_1.txt, 1229, 1
097c_claudia_and_the_worlds_cutest_baby_10.txt, 1340, 10
009c_the_ghost_at_dawns_house_11.txt, 1790, 11
m20c_mary_anne_and_the_zoo_mystery_7.txt, 3430, 7
102c_mary_anne_and_the_little_princess_8.txt, 1357, 8
m24c_mary_anne_and_the_silent_witness_10.txt, 1783, 10
042c_jessi_and_the_dance_school_phantom_2.txt, 2980, 2
109c_mary_anne_to_the_rescue_4.txt, 1333, 4
096c_abbys_lucky_thirteen_9.txt, 2517, 9
130c_staceys_movie_11.txt, 1332, 11
m10c_stacey_and_the_mystery_money_3.txt, 2136, 3
027c_jessi_and_the_superbrat_14.txt, 1366, 14
m19c_kristy_and_the_missing_fortune_9.txt, 1703, 9
026c_claudia_and_the_sad_goodbye_7.txt, 1940, 7
130c_staceys_movie_4.txt, 2228, 4
086c_mary_anne_and_camp_bsc_9.txt, 1830, 9
053c_kristy_for_president_6.txt, 1632, 6
m16c_claudia_and_the_clue_in_the_photograph_13.txt, 1960, 13
090c_welcome_to_the_bsc_abby_13.txt, 901, 13
027c_jessi_and_the_superbrat_10.txt, 1667, 10
131c_the_fire_at_mary_annes_house_13.txt, 1592, 13
m10c_stacey_and_the_mystery_money_7.txt, 2536, 7
130c_staceys_movie_15.txt, 1262, 15
053c_kristy_for_president_2.txt, 2517, 2
107c_mind_your_own_business_kristy_8.txt, 1098, 8
026c_claudia_and_the_sad_goodbye_3.txt, 1898, 3
m20c_mary_anne_and_the_zoo_mystery_3.txt, 2198, 3
009c_the_ghost_at_dawns_house_15.txt, 1894, 15
097c_claudia_and_the_worlds_cutest_baby_14.txt, 767, 14
064c_dawns_family_feud_13.txt, 1070, 13
042c_jessi_and_the_dance_school_phantom_6.txt, 2054, 6
m24c_mary_anne_and_the_silent_witness_14.txt, 1788, 14
058c_staceys_choice_8.txt, 1473, 8
011c_kristy_and_the_snobs_3.txt, 1846, 3
m35c_abby_and_the_notorius_neighbor_1.txt, 1841, 1
064c_dawns_family_feud_1.txt, 1896, 1
m11c_claudia_and_the_mystery_at_the_museum_10.txt, 1866, 10
044c_dawn_and_the_big_sleepover_9.txt, 1704, 9
118c_kristy_thomas_dog_trainer_5.txt, 1762, 5
050c_dawns_big_date_9.txt, 1513, 9
m23c_abby_and_the_secret_society_15.txt, 1923, 15
m22c_stacey_and_the_haunted_masquerade_3.txt, 1793, 3
028c_welcome_back_stacey_1.txt, 1902, 1
115c_jessis_big_break_13.txt, 1187, 13
113c_claudia_makes_up_her_mind_14.txt, 1601, 14
069c_get_well_soon_mallory_15.txt, 688, 15
094c_stacey_mcgill_super_sitter_15.txt, 1426, 15
120c_mary_anne_and_the_playground_fight_4.txt, 967, 4
081c_kristy_and_mr_mom_2.txt, 3072, 2
022c_jessi_ramsey_petsitter_1.txt, 2155, 1
m14c_stacey_and_the_mystery_at_the_mall_10.txt, 1947, 10
029c_mallory_and_the_mystery_diary_10.txt, 1535, 10
021c_mallory_and_the_trouble_with_twins_13.txt, 2039, 13
126c_the_all_new_mallory_pike_1.txt, 1504, 1
100c_kristys_worst_idea_12.txt, 1186, 12
m35c_abby_and_the_notorius_neighbor_14.txt, 1380, 14
098c_dawn_and_too_many_sitters_12.txt, 1420, 12
048c_jessis_wish_7.txt, 1571, 7
m26c_dawn_schafer_undercover_babysitter_2.txt, 2322, 2
003c_the_truth_about_stacey_6.txt, 1365, 6
001c_kristys_great_idea_10.txt, 2429, 10
m09c_kristy_and_the_haunted_mansion_5.txt, 1644, 5
054c_mallory_and_the_dream_horse_6.txt, 2124, 6
049c_claudia_and_the_genius_of_elm_street_14.txt, 1426, 14
serr1c_logans_story_9.txt, 1704, 9
m01c_stacey_and_the_mystery_ring_8.txt, 1849, 8
075c_jessis_horrible_prank_6.txt, 1221, 6
027c_jessi_and_the_superbrat_3.txt, 1704, 3
068c_jessi_and_the_bad_babysitter_15.txt, 1274, 15
m30c_kristy_and_the_mystery_train_1.txt, 1691, 1
061c_jessi_and_the_awful_secret_5.txt, 2324, 5
121c_abby_in_wonderland_9.txt, 1624, 9
038c_kristys_mystery_admirer_6.txt, 1480, 6
m34c_mary_anne_and_the_haunted_bookstore_11.txt, 716, 11
111c_staceys_secret_friend_5.txt, 1479, 5
114c_the_secret_life_of_mary_anne_spier_7.txt, 2063, 7
m15c_kristy_and_the_vampires_9.txt, 1781, 9
028c_welcome_back_stacey_12.txt, 1415, 12
016c_jessis_secret_language_1.txt, 1988, 1
017c_mary_annes_bad_luck_mystery_14.txt, 1996, 14
068c_jessi_and_the_bad_babysitter_2.txt, 3278, 2
090c_welcome_to_the_bsc_abby_7.txt, 1861, 7
m13c_mary_anne_and_the_library_mystery_15.txt, 2040, 15
103c_happy_holidays_jessi_9.txt, 1365, 9
092c_mallorys_christmas_wish_11.txt, 1076, 11
010c_logan_likes_mary_anne_5.txt, 1902, 5
076c_staceys_lie_8.txt, 2882, 8
074c_kristy_and_the_copycat_14.txt, 1541, 14
m36c_kristy_and_the_cat_burglar_13.txt, 2020, 13
094c_stacey_mcgill_super_sitter_6.txt, 1279, 6
122c_kristy_in_charge_9.txt, 1478, 9
040c_claudia_and_the_middle_school_mystery_3.txt, 2020, 3
m32c_claudia_and_the_mystery_in_the_painting_4.txt, 2277, 4
116c_abby_and_the_best_kid_ever_7.txt, 1714, 7
serr2c_logan_bruno_boy_babysitter_1.txt, 2769, 1
m30c_kristy_and_the_mystery_train_12.txt, 1008, 12
057c_dawn_saves_the_planet_14.txt, 1682, 14
023c_dawn_on_the_coast_10.txt, 1325, 10
115c_jessis_big_break_3.txt, 1605, 3
092c_mallorys_christmas_wish_1.txt, 2050, 1
115c_jessis_big_break_2.txt, 2978, 2
023c_dawn_on_the_coast_11.txt, 1419, 11
057c_dawn_saves_the_planet_15.txt, 1407, 15
m30c_kristy_and_the_mystery_train_13.txt, 678, 13
m32c_claudia_and_the_mystery_in_the_painting_5.txt, 3083, 5
040c_claudia_and_the_middle_school_mystery_2.txt, 2045, 2
116c_abby_and_the_best_kid_ever_6.txt, 1865, 6
043c_staceys_emergency_1.txt, 2069, 1
122c_kristy_in_charge_8.txt, 1755, 8
m36c_kristy_and_the_cat_burglar_12.txt, 1610, 12
094c_stacey_mcgill_super_sitter_7.txt, 1954, 7
074c_kristy_and_the_copycat_15.txt, 920, 15
076c_staceys_lie_9.txt, 1393, 9
092c_mallorys_christmas_wish_10.txt, 1561, 10
103c_happy_holidays_jessi_8.txt, 1221, 8
010c_logan_likes_mary_anne_4.txt, 1828, 4
090c_welcome_to_the_bsc_abby_6.txt, 1634, 6
068c_jessi_and_the_bad_babysitter_3.txt, 1713, 3
m13c_mary_anne_and_the_library_mystery_14.txt, 1738, 14
028c_welcome_back_stacey_13.txt, 1748, 13
017c_mary_annes_bad_luck_mystery_15.txt, 1254, 15
m15c_kristy_and_the_vampires_8.txt, 1486, 8
114c_the_secret_life_of_mary_anne_spier_6.txt, 1725, 6
m34c_mary_anne_and_the_haunted_bookstore_10.txt, 2688, 10
111c_staceys_secret_friend_4.txt, 1474, 4
061c_jessi_and_the_awful_secret_4.txt, 1586, 4
038c_kristys_mystery_admirer_7.txt, 1304, 7
121c_abby_in_wonderland_8.txt, 1289, 8
075c_jessis_horrible_prank_7.txt, 1305, 7
068c_jessi_and_the_bad_babysitter_14.txt, 1875, 14
027c_jessi_and_the_superbrat_2.txt, 1788, 2
m01c_stacey_and_the_mystery_ring_9.txt, 1601, 9
054c_mallory_and_the_dream_horse_7.txt, 2076, 7
serr1c_logans_story_8.txt, 1920, 8
049c_claudia_and_the_genius_of_elm_street_15.txt, 930, 15
m09c_kristy_and_the_haunted_mansion_4.txt, 1898, 4
098c_dawn_and_too_many_sitters_13.txt, 1384, 13
m35c_abby_and_the_notorius_neighbor_15.txt, 1545, 15
100c_kristys_worst_idea_13.txt, 1119, 13
001c_kristys_great_idea_11.txt, 1549, 11
003c_the_truth_about_stacey_7.txt, 1774, 7
m26c_dawn_schafer_undercover_babysitter_3.txt, 1914, 3
048c_jessis_wish_6.txt, 1640, 6
029c_mallory_and_the_mystery_diary_11.txt, 1509, 11
m14c_stacey_and_the_mystery_at_the_mall_11.txt, 1641, 11
081c_kristy_and_mr_mom_3.txt, 2154, 3
021c_mallory_and_the_trouble_with_twins_12.txt, 1785, 12
069c_get_well_soon_mallory_14.txt, 1909, 14
120c_mary_anne_and_the_playground_fight_5.txt, 1832, 5
094c_stacey_mcgill_super_sitter_14.txt, 1005, 14
m22c_stacey_and_the_haunted_masquerade_2.txt, 2652, 2
113c_claudia_makes_up_her_mind_15.txt, 1686, 15
115c_jessis_big_break_12.txt, 1247, 12
m23c_abby_and_the_secret_society_14.txt, 1622, 14
050c_dawns_big_date_8.txt, 1956, 8
067c_dawns_big_move_1.txt, 2542, 1
044c_dawn_and_the_big_sleepover_8.txt, 1808, 8
118c_kristy_thomas_dog_trainer_4.txt, 1360, 4
m11c_claudia_and_the_mystery_at_the_museum_11.txt, 1840, 11
011c_kristy_and_the_snobs_2.txt, 1657, 2
071c_claudia_and_the_perfect_boy_1.txt, 1915, 1
058c_staceys_choice_9.txt, 1388, 9
m24c_mary_anne_and_the_silent_witness_15.txt, 1287, 15
042c_jessi_and_the_dance_school_phantom_7.txt, 2141, 7
109c_mary_anne_to_the_rescue_1.txt, 2079, 1
064c_dawns_family_feud_12.txt, 1093, 12
097c_claudia_and_the_worlds_cutest_baby_15.txt, 634, 15
m20c_mary_anne_and_the_zoo_mystery_2.txt, 2450, 2
009c_the_ghost_at_dawns_house_14.txt, 1407, 14
026c_claudia_and_the_sad_goodbye_2.txt, 2068, 2
107c_mind_your_own_business_kristy_9.txt, 2004, 9
130c_staceys_movie_1.txt, 1442, 1
053c_kristy_for_president_3.txt, 2833, 3
131c_the_fire_at_mary_annes_house_12.txt, 1504, 12
m10c_stacey_and_the_mystery_money_6.txt, 2190, 6
130c_staceys_movie_14.txt, 1422, 14
027c_jessi_and_the_superbrat_11.txt, 1843, 11
090c_welcome_to_the_bsc_abby_12.txt, 1288, 12
m10c_stacey_and_the_mystery_money_4.txt, 2579, 4
131c_the_fire_at_mary_annes_house_10.txt, 1973, 10
090c_welcome_to_the_bsc_abby_10.txt, 1616, 10
027c_jessi_and_the_superbrat_13.txt, 1780, 13
130c_staceys_movie_3.txt, 1634, 3
m16c_claudia_and_the_clue_in_the_photograph_14.txt, 2289, 14
053c_kristy_for_president_1.txt, 1858, 1
064c_dawns_family_feud_10.txt, 1986, 10
109c_mary_anne_to_the_rescue_3.txt, 2037, 3
042c_jessi_and_the_dance_school_phantom_5.txt, 2363, 5
124c_stacey_mcgill_matchmaker_9.txt, 1477, 9
064c_dawns_family_feud_2.txt, 1802, 2
m11c_claudia_and_the_mystery_at_the_museum_13.txt, 1892, 13
m35c_abby_and_the_notorius_neighbor_2.txt, 2872, 2
071c_claudia_and_the_perfect_boy_3.txt, 2139, 3
m17c_dawn_and_the_halloween_mystery_9.txt, 1946, 9
067c_dawns_big_move_3.txt, 1764, 3
074c_kristy_and_the_copycat_8.txt, 819, 8
118c_kristy_thomas_dog_trainer_6.txt, 1690, 6
106c_claudia_queen_of_the_seventh_grade_9.txt, 1350, 9
120c_mary_anne_and_the_playground_fight_7.txt, 1822, 7
115c_jessis_big_break_10.txt, 1168, 10
028c_welcome_back_stacey_2.txt, 1683, 2
048c_jessis_wish_4.txt, 1575, 4
m26c_dawn_schafer_undercover_babysitter_1.txt, 1990, 1
003c_the_truth_about_stacey_5.txt, 2349, 5
001c_kristys_great_idea_13.txt, 2167, 13
126c_the_all_new_mallory_pike_2.txt, 2444, 2
100c_kristys_worst_idea_11.txt, 1452, 11
098c_dawn_and_too_many_sitters_11.txt, 1372, 11
021c_mallory_and_the_trouble_with_twins_10.txt, 1526, 10
081c_kristy_and_mr_mom_1.txt, 1913, 1
022c_jessi_ramsey_petsitter_2.txt, 2240, 2
m14c_stacey_and_the_mystery_at_the_mall_13.txt, 1669, 13
029c_mallory_and_the_mystery_diary_13.txt, 1707, 13
054c_mallory_and_the_dream_horse_5.txt, 1903, 5
m09c_kristy_and_the_haunted_mansion_6.txt, 1996, 6
119c_staceys_ex_boyfriend_9.txt, 1251, 9
103c_happy_holidays_jessi_15.txt, 803, 15
075c_jessis_horrible_prank_5.txt, 1576, 5
093c_mary_anne_and_the_memory_garden_15.txt, 1099, 15
100c_kristys_worst_idea_9.txt, 1481, 9
034c_mary_anne_and_too_many_boys_14.txt, 1355, 14
038c_kristys_mystery_admirer_5.txt, 1533, 5
061c_jessi_and_the_awful_secret_6.txt, 2006, 6
m30c_kristy_and_the_mystery_train_2.txt, 3746, 2
114c_the_secret_life_of_mary_anne_spier_4.txt, 1626, 4
075c_jessis_horrible_prank_14.txt, 2447, 14
111c_staceys_secret_friend_6.txt, 1308, 6
091c_claudia_and_the_first_thanksgiving_14.txt, 1665, 14
m34c_mary_anne_and_the_haunted_bookstore_12.txt, 2618, 12
018c_staceys_mistake_8.txt, 1310, 8
068c_jessi_and_the_bad_babysitter_1.txt, 2170, 1
090c_welcome_to_the_bsc_abby_4.txt, 2115, 4
028c_welcome_back_stacey_11.txt, 1435, 11
016c_jessis_secret_language_2.txt, 2501, 2
094c_stacey_mcgill_super_sitter_5.txt, 2514, 5
m36c_kristy_and_the_cat_burglar_10.txt, 2136, 10
010c_logan_likes_mary_anne_6.txt, 1811, 6
015c_little_miss_stoneybrook_and_dawn_9.txt, 1607, 9
092c_mallorys_christmas_wish_12.txt, 1651, 12
116c_abby_and_the_best_kid_ever_4.txt, 1600, 4
m32c_claudia_and_the_mystery_in_the_painting_7.txt, 1645, 7
043c_staceys_emergency_3.txt, 1945, 3
092c_mallorys_christmas_wish_2.txt, 2771, 2
023c_dawn_on_the_coast_13.txt, 1137, 13
m30c_kristy_and_the_mystery_train_11.txt, 926, 11
serr2c_logan_bruno_boy_babysitter_2.txt, 2764, 2
m30c_kristy_and_the_mystery_train_10.txt, 1652, 10
serr2c_logan_bruno_boy_babysitter_3.txt, 1728, 3
092c_mallorys_christmas_wish_3.txt, 1418, 3
115c_jessis_big_break_1.txt, 2119, 1
023c_dawn_on_the_coast_12.txt, 1731, 12
043c_staceys_emergency_2.txt, 2080, 2
116c_abby_and_the_best_kid_ever_5.txt, 986, 5
m32c_claudia_and_the_mystery_in_the_painting_6.txt, 1572, 6
040c_claudia_and_the_middle_school_mystery_1.txt, 1825, 1
010c_logan_likes_mary_anne_7.txt, 1686, 7
092c_mallorys_christmas_wish_13.txt, 1922, 13
015c_little_miss_stoneybrook_and_dawn_8.txt, 1698, 8
094c_stacey_mcgill_super_sitter_4.txt, 1806, 4
m36c_kristy_and_the_cat_burglar_11.txt, 1701, 11
016c_jessis_secret_language_3.txt, 1461, 3
028c_welcome_back_stacey_10.txt, 1278, 10
090c_welcome_to_the_bsc_abby_5.txt, 1019, 5
018c_staceys_mistake_9.txt, 1592, 9
091c_claudia_and_the_first_thanksgiving_15.txt, 1185, 15
111c_staceys_secret_friend_7.txt, 1315, 7
m34c_mary_anne_and_the_haunted_bookstore_13.txt, 1334, 13
075c_jessis_horrible_prank_15.txt, 1056, 15
114c_the_secret_life_of_mary_anne_spier_5.txt, 1416, 5
m30c_kristy_and_the_mystery_train_3.txt, 3075, 3
038c_kristys_mystery_admirer_4.txt, 1784, 4
061c_jessi_and_the_awful_secret_7.txt, 1261, 7
100c_kristys_worst_idea_8.txt, 2174, 8
034c_mary_anne_and_too_many_boys_15.txt, 1513, 15
103c_happy_holidays_jessi_14.txt, 1700, 14
027c_jessi_and_the_superbrat_1.txt, 2327, 1
093c_mary_anne_and_the_memory_garden_14.txt, 928, 14
075c_jessis_horrible_prank_4.txt, 1319, 4
119c_staceys_ex_boyfriend_8.txt, 2064, 8
m09c_kristy_and_the_haunted_mansion_7.txt, 1787, 7
054c_mallory_and_the_dream_horse_4.txt, 1690, 4
021c_mallory_and_the_trouble_with_twins_11.txt, 1733, 11
029c_mallory_and_the_mystery_diary_12.txt, 1379, 12
m14c_stacey_and_the_mystery_at_the_mall_12.txt, 1719, 12
022c_jessi_ramsey_petsitter_3.txt, 1898, 3
001c_kristys_great_idea_12.txt, 1569, 12
003c_the_truth_about_stacey_4.txt, 2094, 4
048c_jessis_wish_5.txt, 1920, 5
098c_dawn_and_too_many_sitters_10.txt, 932, 10
100c_kristys_worst_idea_10.txt, 1371, 10
126c_the_all_new_mallory_pike_3.txt, 2124, 3
115c_jessis_big_break_11.txt, 1810, 11
028c_welcome_back_stacey_3.txt, 2416, 3
m22c_stacey_and_the_haunted_masquerade_1.txt, 1934, 1
120c_mary_anne_and_the_playground_fight_6.txt, 1573, 6
106c_claudia_queen_of_the_seventh_grade_8.txt, 1375, 8
118c_kristy_thomas_dog_trainer_7.txt, 1574, 7
067c_dawns_big_move_2.txt, 2239, 2
074c_kristy_and_the_copycat_9.txt, 1526, 9
071c_claudia_and_the_perfect_boy_2.txt, 3340, 2
m35c_abby_and_the_notorius_neighbor_3.txt, 1691, 3
011c_kristy_and_the_snobs_1.txt, 2162, 1
m17c_dawn_and_the_halloween_mystery_8.txt, 1834, 8
m11c_claudia_and_the_mystery_at_the_museum_12.txt, 1671, 12
064c_dawns_family_feud_3.txt, 1848, 3
124c_stacey_mcgill_matchmaker_8.txt, 2346, 8
042c_jessi_and_the_dance_school_phantom_4.txt, 2391, 4
109c_mary_anne_to_the_rescue_2.txt, 3395, 2
m20c_mary_anne_and_the_zoo_mystery_1.txt, 2240, 1
064c_dawns_family_feud_11.txt, 1744, 11
m16c_claudia_and_the_clue_in_the_photograph_15.txt, 1439, 15
026c_claudia_and_the_sad_goodbye_1.txt, 2872, 1
130c_staceys_movie_2.txt, 2960, 2
027c_jessi_and_the_superbrat_12.txt, 1461, 12
090c_welcome_to_the_bsc_abby_11.txt, 1344, 11
m10c_stacey_and_the_mystery_money_5.txt, 1963, 5
131c_the_fire_at_mary_annes_house_11.txt, 1855, 11
m27c_claudia_and_the_lighthouse_ghost_1.txt, 1924, 1
042c_jessi_and_the_dance_school_phantom_12.txt, 1914, 12
m19c_kristy_and_the_missing_fortune_2.txt, 2753, 2
m10c_stacey_and_the_mystery_money_8.txt, 2008, 8
110c_abby_and_the_bad_sport_10.txt, 1854, 10
016c_jessis_secret_language_15.txt, 1652, 15
086c_mary_anne_and_camp_bsc_2.txt, 2649, 2
107c_mind_your_own_business_kristy_7.txt, 1919, 7
m10c_stacey_and_the_mystery_money_14.txt, 2311, 14
088c_farewell_dawn_14.txt, 1483, 14
m05c_mary_anne_and_the_secret_in_the_attic_11.txt, 1892, 11
014c_hello_mallory_12.txt, 1608, 12
102c_mary_anne_and_the_little_princess_3.txt, 1498, 3
042c_jessi_and_the_dance_school_phantom_9.txt, 2110, 9
117c_claudia_and_the_terrible_truth_11.txt, 1608, 11
096c_abbys_lucky_thirteen_2.txt, 1474, 2
083c_stacey_vs_the_bsc_10.txt, 2044, 10
058c_staceys_choice_7.txt, 1617, 7
025c_mary_anne_and_the_search_for_tigger_11.txt, 1745, 11
m17c_dawn_and_the_halloween_mystery_5.txt, 1970, 5
033c_claudia_and_the_great_search_13.txt, 1366, 13
124c_stacey_mcgill_matchmaker_5.txt, 885, 5
106c_claudia_queen_of_the_seventh_grade_5.txt, 1585, 5
077c_dwn_and_whitney_friends_forever_10.txt, 1499, 10
044c_dawn_and_the_big_sleepover_6.txt, 1483, 6
112c_kristy_and_the_sister_war_12.txt, 1628, 12
050c_dawns_big_date_6.txt, 1397, 6
074c_kristy_and_the_copycat_4.txt, 2050, 4
099c_staceys_broken_heart_14.txt, 1409, 14
003c_the_truth_about_stacey_11.txt, 2254, 11
025c_mary_anne_and_the_search_for_tigger_1.txt, 2361, 1
047c_mallory_on_strike_3.txt, 2181, 3
m18c_stacey_and_the_mystery_at_the_empty_house_2.txt, 2181, 2
071c_claudia_and_the_perfect_boy_12.txt, 1294, 12
089c_kristy_and_the_dirty_diapers_13.txt, 1680, 13
003c_the_truth_about_stacey_9.txt, 1826, 9
048c_jessis_wish_8.txt, 1574, 8
119c_staceys_ex_boyfriend_5.txt, 1689, 5
032c_kristy_and_the_secret_of_susan_13.txt, 1634, 13
serr1c_logans_story_6.txt, 1553, 6
102c_mary_anne_and_the_little_princess_15.txt, 788, 15
054c_mallory_and_the_dream_horse_9.txt, 2126, 9
100c_kristys_worst_idea_5.txt, 1745, 5
053c_kristy_for_president_13.txt, 1336, 13
m01c_stacey_and_the_mystery_ring_7.txt, 1850, 7
m29c_stacey_and_the_fashion_victim_1.txt, 2171, 1
007c_claudia_and_mean_jeanine_11.txt, 1904, 11
075c_jessis_horrible_prank_9.txt, 1611, 9
m02c_beware_dawn_1.txt, 2122, 1
038c_kristys_mystery_admirer_9.txt, 1878, 9
101c_claudia_kishi_middle_school_dropout_15.txt, 1486, 15
121c_abby_in_wonderland_6.txt, 1193, 6
125c_mary_anne_in_the_middle_13.txt, 1219, 13
087c_stacey_and_the_bad_girls_12.txt, 1869, 12
113c_claudia_makes_up_her_mind_2.txt, 2811, 2
114c_the_secret_life_of_mary_anne_spier_8.txt, 2008, 8
m15c_kristy_and_the_vampires_6.txt, 1994, 6
050c_dawns_big_date_14.txt, 2585, 14
090c_welcome_to_the_bsc_abby_8.txt, 1487, 8
018c_staceys_mistake_4.txt, 1627, 4
103c_happy_holidays_jessi_6.txt, 2174, 6
015c_little_miss_stoneybrook_and_dawn_5.txt, 1638, 5
045c_kristy_and_the_baby_parade_10.txt, 1582, 10
018c_staceys_mistake_10.txt, 1752, 10
069c_get_well_soon_mallory_1.txt, 2178, 1
081c_kristy_and_mr_mom_12.txt, 2280, 12
096c_abbys_lucky_thirteen_11.txt, 1116, 11
094c_stacey_mcgill_super_sitter_9.txt, 1024, 9
076c_staceys_lie_7.txt, 1966, 7
122c_kristy_in_charge_6.txt, 1224, 6
052c_mary_anne_plus_too_many_babies_13.txt, 1467, 13
021c_mallory_and_the_trouble_with_twins_2.txt, 2092, 2
116c_abby_and_the_best_kid_ever_8.txt, 1716, 8
128c_claudia_and_the_little_liar_13.txt, 1366, 13
m18c_stacey_and_the_mystery_at_the_empty_house_10.txt, 2053, 10
010c_logan_likes_mary_anne_13.txt, 1765, 13
114c_the_secret_life_of_mary_anne_spier_12.txt, 1712, 12
m32c_claudia_and_the_mystery_in_the_painting_14.txt, 1039, 14
044c_dawn_and_the_big_sleepover_12.txt, 1263, 12
015c_little_miss_stoneybrook_and_dawn_14.txt, 3116, 14
057c_dawn_saves_the_planet_1.txt, 2235, 1
m15c_kristy_and_the_vampires_15.txt, 876, 15
m15c_kristy_and_the_vampires_14.txt, 1809, 14
015c_little_miss_stoneybrook_and_dawn_15.txt, 958, 15
044c_dawn_and_the_big_sleepover_13.txt, 1268, 13
010c_logan_likes_mary_anne_12.txt, 1306, 12
m18c_stacey_and_the_mystery_at_the_empty_house_11.txt, 1877, 11
128c_claudia_and_the_little_liar_12.txt, 990, 12
116c_abby_and_the_best_kid_ever_9.txt, 1721, 9
m32c_claudia_and_the_mystery_in_the_painting_15.txt, 1627, 15
114c_the_secret_life_of_mary_anne_spier_13.txt, 730, 13
052c_mary_anne_plus_too_many_babies_12.txt, 1654, 12
122c_kristy_in_charge_7.txt, 1932, 7
021c_mallory_and_the_trouble_with_twins_3.txt, 1550, 3
081c_kristy_and_mr_mom_13.txt, 1476, 13
039c_poor_mallory_1.txt, 1771, 1
018c_staceys_mistake_11.txt, 1654, 11
076c_staceys_lie_6.txt, 1619, 6
094c_stacey_mcgill_super_sitter_8.txt, 1711, 8
096c_abbys_lucky_thirteen_10.txt, 1341, 10
015c_little_miss_stoneybrook_and_dawn_4.txt, 1645, 4
045c_kristy_and_the_baby_parade_11.txt, 1803, 11
103c_happy_holidays_jessi_7.txt, 1592, 7
097c_claudia_and_the_worlds_cutest_baby_1.txt, 2252, 1
018c_staceys_mistake_5.txt, 1978, 5
090c_welcome_to_the_bsc_abby_9.txt, 1472, 9
m15c_kristy_and_the_vampires_7.txt, 2067, 7
114c_the_secret_life_of_mary_anne_spier_9.txt, 1177, 9
125c_mary_anne_in_the_middle_12.txt, 1371, 12
113c_claudia_makes_up_her_mind_3.txt, 975, 3
087c_stacey_and_the_bad_girls_13.txt, 1308, 13
121c_abby_in_wonderland_7.txt, 1774, 7
101c_claudia_kishi_middle_school_dropout_14.txt, 2037, 14
038c_kristys_mystery_admirer_8.txt, 1825, 8
127c_abbys_un_valentine_1.txt, 1617, 1
007c_claudia_and_mean_jeanine_10.txt, 1693, 10
075c_jessis_horrible_prank_8.txt, 1762, 8
053c_kristy_for_president_12.txt, 1057, 12
100c_kristys_worst_idea_4.txt, 1203, 4
m01c_stacey_and_the_mystery_ring_6.txt, 1510, 6
065c_staceys_big_crush_1.txt, 2538, 1
serr1c_logans_story_7.txt, 1809, 7
032c_kristy_and_the_secret_of_susan_12.txt, 1531, 12
054c_mallory_and_the_dream_horse_8.txt, 1903, 8
102c_mary_anne_and_the_little_princess_14.txt, 1283, 14
119c_staceys_ex_boyfriend_4.txt, 1806, 4
048c_jessis_wish_9.txt, 1420, 9
003c_the_truth_about_stacey_8.txt, 1699, 8
089c_kristy_and_the_dirty_diapers_12.txt, 1102, 12
071c_claudia_and_the_perfect_boy_13.txt, 2846, 13
m18c_stacey_and_the_mystery_at_the_empty_house_3.txt, 2151, 3
047c_mallory_on_strike_2.txt, 1920, 2
099c_staceys_broken_heart_15.txt, 1262, 15
003c_the_truth_about_stacey_10.txt, 2105, 10
074c_kristy_and_the_copycat_5.txt, 1565, 5
050c_dawns_big_date_7.txt, 2021, 7
112c_kristy_and_the_sister_war_13.txt, 1217, 13
106c_claudia_queen_of_the_seventh_grade_4.txt, 2006, 4
077c_dwn_and_whitney_friends_forever_11.txt, 1549, 11
044c_dawn_and_the_big_sleepover_7.txt, 2213, 7
124c_stacey_mcgill_matchmaker_4.txt, 2058, 4
029c_mallory_and_the_mystery_diary_1.txt, 2346, 1
023c_dawn_on_the_coast_1.txt, 2213, 1
087c_stacey_and_the_bad_girls_1.txt, 2239, 1
033c_claudia_and_the_great_search_12.txt, 1557, 12
m17c_dawn_and_the_halloween_mystery_4.txt, 2017, 4
025c_mary_anne_and_the_search_for_tigger_10.txt, 1769, 10
062c_kristy_and_the_worst_kid_ever_1.txt, 2240, 1
058c_staceys_choice_6.txt, 1525, 6
083c_stacey_vs_the_bsc_11.txt, 1356, 11
102c_mary_anne_and_the_little_princess_2.txt, 3231, 2
096c_abbys_lucky_thirteen_3.txt, 3336, 3
117c_claudia_and_the_terrible_truth_10.txt, 1179, 10
042c_jessi_and_the_dance_school_phantom_8.txt, 1893, 8
014c_hello_mallory_13.txt, 1708, 13
m05c_mary_anne_and_the_secret_in_the_attic_10.txt, 1536, 10
072c_dawn_and_the_we_heart_kids_club_1.txt, 2313, 1
073c_mary_anne_and_miss_priss_1.txt, 2071, 1
107c_mind_your_own_business_kristy_6.txt, 1337, 6
086c_mary_anne_and_camp_bsc_3.txt, 1333, 3
033c_claudia_and_the_great_search_1.txt, 2047, 1
m10c_stacey_and_the_mystery_money_15.txt, 1914, 15
016c_jessis_secret_language_14.txt, 1745, 14
110c_abby_and_the_bad_sport_11.txt, 1263, 11
m10c_stacey_and_the_mystery_money_9.txt, 2429, 9
042c_jessi_and_the_dance_school_phantom_13.txt, 2277, 13
125c_mary_anne_in_the_middle_1.txt, 1385, 1
m19c_kristy_and_the_missing_fortune_3.txt, 1819, 3
110c_abby_and_the_bad_sport_13.txt, 2016, 13
m19c_kristy_and_the_missing_fortune_1.txt, 2222, 1
m27c_claudia_and_the_lighthouse_ghost_2.txt, 3402, 2
125c_mary_anne_in_the_middle_3.txt, 1689, 3
042c_jessi_and_the_dance_school_phantom_11.txt, 2365, 11
033c_claudia_and_the_great_search_3.txt, 2496, 3
086c_mary_anne_and_camp_bsc_1.txt, 2272, 1
107c_mind_your_own_business_kristy_4.txt, 1348, 4
m05c_mary_anne_and_the_secret_in_the_attic_12.txt, 1868, 12
014c_hello_mallory_11.txt, 1484, 11
048c_jessis_wish_15.txt, 1400, 15
073c_mary_anne_and_miss_priss_3.txt, 1779, 3
serr3c_shannons_story_14.txt, 1380, 14
072c_dawn_and_the_we_heart_kids_club_3.txt, 2761, 3
083c_stacey_vs_the_bsc_13.txt, 1748, 13
058c_staceys_choice_4.txt, 1129, 4
062c_kristy_and_the_worst_kid_ever_3.txt, 2233, 3
117c_claudia_and_the_terrible_truth_12.txt, 1421, 12
096c_abbys_lucky_thirteen_1.txt, 1755, 1
029c_mallory_and_the_mystery_diary_3.txt, 1601, 3
038c_kristys_mystery_admirer_15.txt, 1704, 15
124c_stacey_mcgill_matchmaker_6.txt, 2807, 6
025c_mary_anne_and_the_search_for_tigger_12.txt, 1732, 12
033c_claudia_and_the_great_search_10.txt, 1602, 10
m17c_dawn_and_the_halloween_mystery_6.txt, 1876, 6
087c_stacey_and_the_bad_girls_3.txt, 2451, 3
023c_dawn_on_the_coast_3.txt, 2019, 3
112c_kristy_and_the_sister_war_11.txt, 1466, 11
050c_dawns_big_date_5.txt, 2124, 5
074c_kristy_and_the_copycat_7.txt, 1003, 7
044c_dawn_and_the_big_sleepover_5.txt, 1014, 5
077c_dwn_and_whitney_friends_forever_13.txt, 1458, 13
118c_kristy_thomas_dog_trainer_9.txt, 1535, 9
106c_claudia_queen_of_the_seventh_grade_6.txt, 1686, 6
002c_claudia_and_the_phantom_phone_calls_15.txt, 627, 15
120c_mary_anne_and_the_playground_fight_8.txt, 1606, 8
003c_the_truth_about_stacey_12.txt, 2110, 12
025c_mary_anne_and_the_search_for_tigger_2.txt, 2001, 2
089c_kristy_and_the_dirty_diapers_10.txt, 1214, 10
m18c_stacey_and_the_mystery_at_the_empty_house_1.txt, 2120, 1
071c_claudia_and_the_perfect_boy_11.txt, 1630, 11
serr1c_logans_story_5.txt, 1686, 5
032c_kristy_and_the_secret_of_susan_10.txt, 1866, 10
119c_staceys_ex_boyfriend_6.txt, 1059, 6
m09c_kristy_and_the_haunted_mansion_9.txt, 1829, 9
007c_claudia_and_mean_jeanine_12.txt, 1404, 12
m29c_stacey_and_the_fashion_victim_2.txt, 2412, 2
065c_staceys_big_crush_3.txt, 1651, 3
m01c_stacey_and_the_mystery_ring_4.txt, 2060, 4
100c_kristys_worst_idea_6.txt, 1625, 6
053c_kristy_for_president_10.txt, 2408, 10
m08c_jessi_and_the_jewel_thieves_15.txt, 1937, 15
061c_jessi_and_the_awful_secret_9.txt, 1278, 9
127c_abbys_un_valentine_3.txt, 867, 3
121c_abby_in_wonderland_5.txt, 2007, 5
m02c_beware_dawn_2.txt, 1965, 2
m15c_kristy_and_the_vampires_5.txt, 1715, 5
087c_stacey_and_the_bad_girls_11.txt, 966, 11
113c_claudia_makes_up_her_mind_1.txt, 1852, 1
111c_staceys_secret_friend_9.txt, 1179, 9
125c_mary_anne_in_the_middle_10.txt, 1660, 10
018c_staceys_mistake_7.txt, 2019, 7
109c_mary_anne_to_the_rescue_15.txt, 1621, 15
096c_abbys_lucky_thirteen_12.txt, 1920, 12
063c_claudias_freind_friend_14.txt, 1220, 14
076c_staceys_lie_4.txt, 2116, 4
018c_staceys_mistake_13.txt, 1922, 13
039c_poor_mallory_3.txt, 2011, 3
069c_get_well_soon_mallory_2.txt, 2244, 2
081c_kristy_and_mr_mom_11.txt, 1291, 11
084c_dawn_and_the_school_spirit_war_15.txt, 548, 15
097c_claudia_and_the_worlds_cutest_baby_3.txt, 1602, 3
103c_happy_holidays_jessi_5.txt, 1472, 5
045c_kristy_and_the_baby_parade_13.txt, 1492, 13
015c_little_miss_stoneybrook_and_dawn_6.txt, 1806, 6
037c_dawn_and_the_older_boy_14.txt, 1663, 14
010c_logan_likes_mary_anne_9.txt, 1394, 9
107c_mind_your_own_business_kristy_15.txt, 680, 15
m32c_claudia_and_the_mystery_in_the_painting_8.txt, 2187, 8
114c_the_secret_life_of_mary_anne_spier_11.txt, 1479, 11
128c_claudia_and_the_little_liar_10.txt, 917, 10
m18c_stacey_and_the_mystery_at_the_empty_house_13.txt, 1761, 13
010c_logan_likes_mary_anne_10.txt, 2037, 10
021c_mallory_and_the_trouble_with_twins_1.txt, 2082, 1
122c_kristy_in_charge_5.txt, 1049, 5
079c_mary_anne_breaks_the_rules_14.txt, 1756, 14
052c_mary_anne_plus_too_many_babies_10.txt, 1729, 10
057c_dawn_saves_the_planet_2.txt, 1636, 2
044c_dawn_and_the_big_sleepover_11.txt, 2288, 11
044c_dawn_and_the_big_sleepover_10.txt, 1367, 10
057c_dawn_saves_the_planet_3.txt, 1925, 3
052c_mary_anne_plus_too_many_babies_11.txt, 1487, 11
079c_mary_anne_breaks_the_rules_15.txt, 1559, 15
122c_kristy_in_charge_4.txt, 1359, 4
114c_the_secret_life_of_mary_anne_spier_10.txt, 1000, 10
m32c_claudia_and_the_mystery_in_the_painting_9.txt, 1726, 9
107c_mind_your_own_business_kristy_14.txt, 1746, 14
010c_logan_likes_mary_anne_11.txt, 1633, 11
m18c_stacey_and_the_mystery_at_the_empty_house_12.txt, 1618, 12
128c_claudia_and_the_little_liar_11.txt, 1441, 11
045c_kristy_and_the_baby_parade_12.txt, 1938, 12
015c_little_miss_stoneybrook_and_dawn_7.txt, 1600, 7
103c_happy_holidays_jessi_4.txt, 1449, 4
097c_claudia_and_the_worlds_cutest_baby_2.txt, 2705, 2
084c_dawn_and_the_school_spirit_war_14.txt, 1236, 14
010c_logan_likes_mary_anne_8.txt, 1431, 8
037c_dawn_and_the_older_boy_15.txt, 437, 15
063c_claudias_freind_friend_15.txt, 1025, 15
076c_staceys_lie_5.txt, 1861, 5
096c_abbys_lucky_thirteen_13.txt, 1062, 13
081c_kristy_and_mr_mom_10.txt, 1905, 10
069c_get_well_soon_mallory_3.txt, 1605, 3
m25c_kristy_and_the_middle_school_vandal_14.txt, 1268, 14
039c_poor_mallory_2.txt, 2696, 2
018c_staceys_mistake_12.txt, 1669, 12
109c_mary_anne_to_the_rescue_14.txt, 1803, 14
018c_staceys_mistake_6.txt, 1479, 6
087c_stacey_and_the_bad_girls_10.txt, 1556, 10
125c_mary_anne_in_the_middle_11.txt, 1215, 11
111c_staceys_secret_friend_8.txt, 964, 8
m15c_kristy_and_the_vampires_4.txt, 1864, 4
m02c_beware_dawn_3.txt, 2026, 3
127c_abbys_un_valentine_2.txt, 4087, 2
061c_jessi_and_the_awful_secret_8.txt, 1842, 8
121c_abby_in_wonderland_4.txt, 1627, 4
m01c_stacey_and_the_mystery_ring_5.txt, 1981, 5
065c_staceys_big_crush_2.txt, 2922, 2
m08c_jessi_and_the_jewel_thieves_14.txt, 1615, 14
053c_kristy_for_president_11.txt, 2272, 11
100c_kristys_worst_idea_7.txt, 2232, 7
m29c_stacey_and_the_fashion_victim_3.txt, 1711, 3
007c_claudia_and_mean_jeanine_13.txt, 2162, 13
m09c_kristy_and_the_haunted_mansion_8.txt, 1738, 8
119c_staceys_ex_boyfriend_7.txt, 1378, 7
032c_kristy_and_the_secret_of_susan_11.txt, 1733, 11
serr1c_logans_story_4.txt, 1972, 4
071c_claudia_and_the_perfect_boy_10.txt, 1729, 10
089c_kristy_and_the_dirty_diapers_11.txt, 1681, 11
025c_mary_anne_and_the_search_for_tigger_3.txt, 1837, 3
003c_the_truth_about_stacey_13.txt, 3167, 13
047c_mallory_on_strike_1.txt, 2762, 1
120c_mary_anne_and_the_playground_fight_9.txt, 886, 9
044c_dawn_and_the_big_sleepover_4.txt, 2340, 4
077c_dwn_and_whitney_friends_forever_12.txt, 1396, 12
002c_claudia_and_the_phantom_phone_calls_14.txt, 2612, 14
106c_claudia_queen_of_the_seventh_grade_7.txt, 1798, 7
118c_kristy_thomas_dog_trainer_8.txt, 1457, 8
074c_kristy_and_the_copycat_6.txt, 1800, 6
050c_dawns_big_date_4.txt, 2575, 4
112c_kristy_and_the_sister_war_10.txt, 1805, 10
m17c_dawn_and_the_halloween_mystery_7.txt, 1980, 7
033c_claudia_and_the_great_search_11.txt, 1683, 11
025c_mary_anne_and_the_search_for_tigger_13.txt, 1544, 13
023c_dawn_on_the_coast_2.txt, 1841, 2
087c_stacey_and_the_bad_girls_2.txt, 2218, 2
038c_kristys_mystery_admirer_14.txt, 1495, 14
029c_mallory_and_the_mystery_diary_2.txt, 1505, 2
124c_stacey_mcgill_matchmaker_7.txt, 1243, 7
117c_claudia_and_the_terrible_truth_13.txt, 1561, 13
102c_mary_anne_and_the_little_princess_1.txt, 2075, 1
058c_staceys_choice_5.txt, 1544, 5
083c_stacey_vs_the_bsc_12.txt, 1501, 12
062c_kristy_and_the_worst_kid_ever_2.txt, 2500, 2
072c_dawn_and_the_we_heart_kids_club_2.txt, 2249, 2
serr3c_shannons_story_15.txt, 1472, 15
073c_mary_anne_and_miss_priss_2.txt, 2592, 2
048c_jessis_wish_14.txt, 963, 14
014c_hello_mallory_10.txt, 1673, 10
m05c_mary_anne_and_the_secret_in_the_attic_13.txt, 1525, 13
033c_claudia_and_the_great_search_2.txt, 2885, 2
107c_mind_your_own_business_kristy_5.txt, 1129, 5
042c_jessi_and_the_dance_school_phantom_10.txt, 2231, 10
125c_mary_anne_in_the_middle_2.txt, 3098, 2
m27c_claudia_and_the_lighthouse_ghost_3.txt, 1634, 3
110c_abby_and_the_bad_sport_12.txt, 1100, 12
130c_staceys_movie_9.txt, 983, 9
086c_mary_anne_and_camp_bsc_4.txt, 3008, 4
107c_mind_your_own_business_kristy_1.txt, 1773, 1
m10c_stacey_and_the_mystery_money_12.txt, 1775, 12
033c_claudia_and_the_great_search_6.txt, 1343, 6
016c_jessis_secret_language_13.txt, 1667, 13
m27c_claudia_and_the_lighthouse_ghost_7.txt, 2010, 7
125c_mary_anne_in_the_middle_6.txt, 1079, 6
042c_jessi_and_the_dance_school_phantom_14.txt, 2186, 14
m19c_kristy_and_the_missing_fortune_4.txt, 1625, 4
062c_kristy_and_the_worst_kid_ever_6.txt, 1485, 6
058c_staceys_choice_1.txt, 1803, 1
102c_mary_anne_and_the_little_princess_5.txt, 951, 5
109c_mary_anne_to_the_rescue_9.txt, 1489, 9
096c_abbys_lucky_thirteen_4.txt, 1580, 4
014c_hello_mallory_14.txt, 1625, 14
048c_jessis_wish_10.txt, 1491, 10
serr3c_shannons_story_11.txt, 2076, 11
073c_mary_anne_and_miss_priss_6.txt, 1993, 6
072c_dawn_and_the_we_heart_kids_club_6.txt, 1356, 6
088c_farewell_dawn_12.txt, 2185, 12
067c_dawns_big_move_9.txt, 1564, 9
112c_kristy_and_the_sister_war_14.txt, 1613, 14
074c_kristy_and_the_copycat_2.txt, 3150, 2
106c_claudia_queen_of_the_seventh_grade_3.txt, 1632, 3
002c_claudia_and_the_phantom_phone_calls_10.txt, 1859, 10
124c_stacey_mcgill_matchmaker_3.txt, 1868, 3
064c_dawns_family_feud_8.txt, 1269, 8
029c_mallory_and_the_mystery_diary_6.txt, 1575, 6
038c_kristys_mystery_admirer_10.txt, 1498, 10
m35c_abby_and_the_notorius_neighbor_8.txt, 1812, 8
071c_claudia_and_the_perfect_boy_9.txt, 1622, 9
087c_stacey_and_the_bad_girls_6.txt, 1379, 6
023c_dawn_on_the_coast_6.txt, 1849, 6
m17c_dawn_and_the_halloween_mystery_3.txt, 1949, 3
033c_claudia_and_the_great_search_15.txt, 1780, 15
089c_kristy_and_the_dirty_diapers_15.txt, 1009, 15
126c_the_all_new_mallory_pike_8.txt, 2106, 8
071c_claudia_and_the_perfect_boy_14.txt, 1127, 14
m18c_stacey_and_the_mystery_at_the_empty_house_4.txt, 1864, 4
022c_jessi_ramsey_petsitter_8.txt, 1424, 8
047c_mallory_on_strike_5.txt, 1885, 5
099c_staceys_broken_heart_12.txt, 2051, 12
028c_welcome_back_stacey_8.txt, 1780, 8
025c_mary_anne_and_the_search_for_tigger_7.txt, 1338, 7
m29c_stacey_and_the_fashion_victim_7.txt, 1884, 7
100c_kristys_worst_idea_3.txt, 1933, 3
053c_kristy_for_president_15.txt, 948, 15
m08c_jessi_and_the_jewel_thieves_10.txt, 1782, 10
065c_staceys_big_crush_6.txt, 1589, 6
m01c_stacey_and_the_mystery_ring_1.txt, 2044, 1
032c_kristy_and_the_secret_of_susan_15.txt, 1638, 15
102c_mary_anne_and_the_little_princess_13.txt, 1140, 13
119c_staceys_ex_boyfriend_3.txt, 1374, 3
125c_mary_anne_in_the_middle_15.txt, 1474, 15
087c_stacey_and_the_bad_girls_14.txt, 1450, 14
113c_claudia_makes_up_her_mind_4.txt, 1632, 4
101c_claudia_kishi_middle_school_dropout_13.txt, 1934, 13
127c_abbys_un_valentine_6.txt, 1135, 6
m30c_kristy_and_the_mystery_train_8.txt, 1564, 8
m02c_beware_dawn_7.txt, 1825, 7
m25c_kristy_and_the_middle_school_vandal_10.txt, 1688, 10
039c_poor_mallory_6.txt, 1815, 6
081c_kristy_and_mr_mom_14.txt, 1254, 14
069c_get_well_soon_mallory_7.txt, 1555, 7
076c_staceys_lie_1.txt, 2234, 1
063c_claudias_freind_friend_11.txt, 2109, 11
037c_dawn_and_the_older_boy_11.txt, 1945, 11
084c_dawn_and_the_school_spirit_war_10.txt, 1975, 10
097c_claudia_and_the_worlds_cutest_baby_6.txt, 1869, 6
015c_little_miss_stoneybrook_and_dawn_3.txt, 1774, 3
018c_staceys_mistake_2.txt, 2020, 2
109c_mary_anne_to_the_rescue_10.txt, 1666, 10
016c_jessis_secret_language_8.txt, 1420, 8
050c_dawns_big_date_12.txt, 1074, 12
m15c_kristy_and_the_vampires_13.txt, 2024, 13
015c_little_miss_stoneybrook_and_dawn_12.txt, 1745, 12
057c_dawn_saves_the_planet_7.txt, 1987, 7
092c_mallorys_christmas_wish_8.txt, 1647, 8
044c_dawn_and_the_big_sleepover_14.txt, 1526, 14
serr2c_logan_bruno_boy_babysitter_8.txt, 1876, 8
128c_claudia_and_the_little_liar_15.txt, 751, 15
010c_logan_likes_mary_anne_15.txt, 1664, 15
107c_mind_your_own_business_kristy_10.txt, 1287, 10
114c_the_secret_life_of_mary_anne_spier_14.txt, 839, 14
m32c_claudia_and_the_mystery_in_the_painting_12.txt, 2162, 12
043c_staceys_emergency_9.txt, 1755, 9
079c_mary_anne_breaks_the_rules_11.txt, 1228, 11
052c_mary_anne_plus_too_many_babies_15.txt, 1611, 15
021c_mallory_and_the_trouble_with_twins_4.txt, 1360, 4
052c_mary_anne_plus_too_many_babies_14.txt, 1468, 14
079c_mary_anne_breaks_the_rules_10.txt, 1539, 10
122c_kristy_in_charge_1.txt, 1334, 1
043c_staceys_emergency_8.txt, 2290, 8
021c_mallory_and_the_trouble_with_twins_5.txt, 1310, 5
010c_logan_likes_mary_anne_14.txt, 1910, 14
128c_claudia_and_the_little_liar_14.txt, 850, 14
m32c_claudia_and_the_mystery_in_the_painting_13.txt, 1125, 13
114c_the_secret_life_of_mary_anne_spier_15.txt, 697, 15
107c_mind_your_own_business_kristy_11.txt, 1553, 11
serr2c_logan_bruno_boy_babysitter_9.txt, 1290, 9
044c_dawn_and_the_big_sleepover_15.txt, 772, 15
092c_mallorys_christmas_wish_9.txt, 1412, 9
015c_little_miss_stoneybrook_and_dawn_13.txt, 1922, 13
057c_dawn_saves_the_planet_6.txt, 1620, 6
m15c_kristy_and_the_vampires_12.txt, 1741, 12
050c_dawns_big_date_13.txt, 1835, 13
016c_jessis_secret_language_9.txt, 1987, 9
109c_mary_anne_to_the_rescue_11.txt, 1136, 11
018c_staceys_mistake_3.txt, 1805, 3
037c_dawn_and_the_older_boy_10.txt, 1353, 10
015c_little_miss_stoneybrook_and_dawn_2.txt, 1769, 2
097c_claudia_and_the_worlds_cutest_baby_7.txt, 1749, 7
084c_dawn_and_the_school_spirit_war_11.txt, 1646, 11
103c_happy_holidays_jessi_1.txt, 1767, 1
069c_get_well_soon_mallory_6.txt, 1510, 6
081c_kristy_and_mr_mom_15.txt, 924, 15
039c_poor_mallory_7.txt, 1388, 7
m25c_kristy_and_the_middle_school_vandal_11.txt, 1450, 11
063c_claudias_freind_friend_10.txt, 1511, 10
m30c_kristy_and_the_mystery_train_9.txt, 1715, 9
m02c_beware_dawn_6.txt, 1707, 6
101c_claudia_kishi_middle_school_dropout_12.txt, 1849, 12
121c_abby_in_wonderland_1.txt, 1901, 1
127c_abbys_un_valentine_7.txt, 1359, 7
125c_mary_anne_in_the_middle_14.txt, 1119, 14
113c_claudia_makes_up_her_mind_5.txt, 1005, 5
087c_stacey_and_the_bad_girls_15.txt, 1128, 15
m15c_kristy_and_the_vampires_1.txt, 2002, 1
119c_staceys_ex_boyfriend_2.txt, 2536, 2
032c_kristy_and_the_secret_of_susan_14.txt, 1745, 14
serr1c_logans_story_1.txt, 2649, 1
102c_mary_anne_and_the_little_princess_12.txt, 1616, 12
m08c_jessi_and_the_jewel_thieves_11.txt, 1920, 11
053c_kristy_for_president_14.txt, 1717, 14
100c_kristys_worst_idea_2.txt, 3371, 2
065c_staceys_big_crush_7.txt, 1487, 7
m29c_stacey_and_the_fashion_victim_6.txt, 1782, 6
099c_staceys_broken_heart_13.txt, 1059, 13
025c_mary_anne_and_the_search_for_tigger_6.txt, 1657, 6
028c_welcome_back_stacey_9.txt, 1940, 9
047c_mallory_on_strike_4.txt, 1829, 4
m18c_stacey_and_the_mystery_at_the_empty_house_5.txt, 1811, 5
071c_claudia_and_the_perfect_boy_15.txt, 2591, 15
022c_jessi_ramsey_petsitter_9.txt, 1548, 9
089c_kristy_and_the_dirty_diapers_14.txt, 1227, 14
126c_the_all_new_mallory_pike_9.txt, 1325, 9
023c_dawn_on_the_coast_7.txt, 1710, 7
087c_stacey_and_the_bad_girls_7.txt, 1460, 7
071c_claudia_and_the_perfect_boy_8.txt, 1884, 8
m35c_abby_and_the_notorius_neighbor_9.txt, 1405, 9
033c_claudia_and_the_great_search_14.txt, 1421, 14
m17c_dawn_and_the_halloween_mystery_2.txt, 2203, 2
064c_dawns_family_feud_9.txt, 1415, 9
124c_stacey_mcgill_matchmaker_2.txt, 3288, 2
038c_kristys_mystery_admirer_11.txt, 1886, 11
029c_mallory_and_the_mystery_diary_7.txt, 1579, 7
002c_claudia_and_the_phantom_phone_calls_11.txt, 2338, 11
106c_claudia_queen_of_the_seventh_grade_2.txt, 3059, 2
044c_dawn_and_the_big_sleepover_1.txt, 1978, 1
067c_dawns_big_move_8.txt, 1345, 8
050c_dawns_big_date_1.txt, 2488, 1
074c_kristy_and_the_copycat_3.txt, 2308, 3
112c_kristy_and_the_sister_war_15.txt, 1322, 15
072c_dawn_and_the_we_heart_kids_club_7.txt, 1435, 7
073c_mary_anne_and_miss_priss_7.txt, 2250, 7
serr3c_shannons_story_10.txt, 1463, 10
088c_farewell_dawn_13.txt, 1213, 13
048c_jessis_wish_11.txt, 1670, 11
014c_hello_mallory_15.txt, 1463, 15
102c_mary_anne_and_the_little_princess_4.txt, 2516, 4
096c_abbys_lucky_thirteen_5.txt, 2426, 5
109c_mary_anne_to_the_rescue_8.txt, 909, 8
062c_kristy_and_the_worst_kid_ever_7.txt, 2691, 7
042c_jessi_and_the_dance_school_phantom_15.txt, 2640, 15
125c_mary_anne_in_the_middle_7.txt, 1724, 7
m27c_claudia_and_the_lighthouse_ghost_6.txt, 1303, 6
m19c_kristy_and_the_missing_fortune_5.txt, 1938, 5
016c_jessis_secret_language_12.txt, 1766, 12
086c_mary_anne_and_camp_bsc_5.txt, 2007, 5
130c_staceys_movie_8.txt, 1624, 8
033c_claudia_and_the_great_search_7.txt, 1763, 7
m10c_stacey_and_the_mystery_money_13.txt, 2364, 13
016c_jessis_secret_language_10.txt, 1838, 10
053c_kristy_for_president_8.txt, 1705, 8
m10c_stacey_and_the_mystery_money_11.txt, 2138, 11
033c_claudia_and_the_great_search_5.txt, 1756, 5
026c_claudia_and_the_sad_goodbye_9.txt, 1616, 9
086c_mary_anne_and_camp_bsc_7.txt, 1772, 7
107c_mind_your_own_business_kristy_2.txt, 2954, 2
m19c_kristy_and_the_missing_fortune_7.txt, 1856, 7
m27c_claudia_and_the_lighthouse_ghost_4.txt, 1877, 4
125c_mary_anne_in_the_middle_5.txt, 1427, 5
110c_abby_and_the_bad_sport_15.txt, 926, 15
117c_claudia_and_the_terrible_truth_14.txt, 1677, 14
096c_abbys_lucky_thirteen_7.txt, 1251, 7
102c_mary_anne_and_the_little_princess_6.txt, 2707, 6
083c_stacey_vs_the_bsc_15.txt, 790, 15
058c_staceys_choice_2.txt, 3473, 2
062c_kristy_and_the_worst_kid_ever_5.txt, 1822, 5
088c_farewell_dawn_11.txt, 1476, 11
m20c_mary_anne_and_the_zoo_mystery_9.txt, 1871, 9
serr3c_shannons_story_12.txt, 1656, 12
073c_mary_anne_and_miss_priss_5.txt, 1466, 5
072c_dawn_and_the_we_heart_kids_club_5.txt, 1183, 5
m05c_mary_anne_and_the_secret_in_the_attic_14.txt, 2054, 14
048c_jessis_wish_13.txt, 1463, 13
044c_dawn_and_the_big_sleepover_3.txt, 1903, 3
077c_dwn_and_whitney_friends_forever_15.txt, 1197, 15
002c_claudia_and_the_phantom_phone_calls_13.txt, 1447, 13
074c_kristy_and_the_copycat_1.txt, 2705, 1
050c_dawns_big_date_3.txt, 2696, 3
025c_mary_anne_and_the_search_for_tigger_14.txt, 2061, 14
011c_kristy_and_the_snobs_9.txt, 1845, 9
087c_stacey_and_the_bad_girls_5.txt, 2000, 5
023c_dawn_on_the_coast_5.txt, 1414, 5
038c_kristys_mystery_admirer_13.txt, 1651, 13
029c_mallory_and_the_mystery_diary_5.txt, 1354, 5
081c_kristy_and_mr_mom_8.txt, 1942, 8
m18c_stacey_and_the_mystery_at_the_empty_house_7.txt, 1966, 7
m26c_dawn_schafer_undercover_babysitter_8.txt, 1882, 8
003c_the_truth_about_stacey_14.txt, 2824, 14
m22c_stacey_and_the_haunted_masquerade_9.txt, 1977, 9
025c_mary_anne_and_the_search_for_tigger_4.txt, 1740, 4
099c_staceys_broken_heart_11.txt, 1724, 11
047c_mallory_on_strike_6.txt, 1514, 6
065c_staceys_big_crush_5.txt, 1611, 5
m01c_stacey_and_the_mystery_ring_2.txt, 2076, 2
m08c_jessi_and_the_jewel_thieves_13.txt, 1470, 13
m29c_stacey_and_the_fashion_victim_4.txt, 1849, 4
007c_claudia_and_mean_jeanine_14.txt, 1213, 14
027c_jessi_and_the_superbrat_9.txt, 1479, 9
102c_mary_anne_and_the_little_princess_10.txt, 1088, 10
serr1c_logans_story_3.txt, 1230, 3
113c_claudia_makes_up_her_mind_7.txt, 1511, 7
m15c_kristy_and_the_vampires_3.txt, 2246, 3
m02c_beware_dawn_4.txt, 1842, 4
127c_abbys_un_valentine_5.txt, 2300, 5
101c_claudia_kishi_middle_school_dropout_10.txt, 1665, 10
121c_abby_in_wonderland_3.txt, 1186, 3
103c_happy_holidays_jessi_3.txt, 1747, 3
084c_dawn_and_the_school_spirit_war_13.txt, 1244, 13
097c_claudia_and_the_worlds_cutest_baby_5.txt, 1719, 5
045c_kristy_and_the_baby_parade_15.txt, 1515, 15
037c_dawn_and_the_older_boy_12.txt, 1643, 12
096c_abbys_lucky_thirteen_14.txt, 1078, 14
063c_claudias_freind_friend_12.txt, 1781, 12
076c_staceys_lie_2.txt, 2826, 2
018c_staceys_mistake_15.txt, 1399, 15
m25c_kristy_and_the_middle_school_vandal_13.txt, 1686, 13
039c_poor_mallory_5.txt, 1612, 5
069c_get_well_soon_mallory_4.txt, 1751, 4
050c_dawns_big_date_11.txt, 2003, 11
018c_staceys_mistake_1.txt, 1972, 1
068c_jessi_and_the_bad_babysitter_8.txt, 1992, 8
109c_mary_anne_to_the_rescue_13.txt, 1229, 13
115c_jessis_big_break_9.txt, 970, 9
057c_dawn_saves_the_planet_4.txt, 1791, 4
015c_little_miss_stoneybrook_and_dawn_11.txt, 1981, 11
m15c_kristy_and_the_vampires_10.txt, 1932, 10
021c_mallory_and_the_trouble_with_twins_7.txt, 1912, 7
122c_kristy_in_charge_3.txt, 1007, 3
079c_mary_anne_breaks_the_rules_12.txt, 945, 12
107c_mind_your_own_business_kristy_13.txt, 1496, 13
m32c_claudia_and_the_mystery_in_the_painting_11.txt, 1811, 11
040c_claudia_and_the_middle_school_mystery_9.txt, 1597, 9
m18c_stacey_and_the_mystery_at_the_empty_house_15.txt, 1812, 15
m32c_claudia_and_the_mystery_in_the_painting_10.txt, 1953, 10
040c_claudia_and_the_middle_school_mystery_8.txt, 1481, 8
107c_mind_your_own_business_kristy_12.txt, 1471, 12
m18c_stacey_and_the_mystery_at_the_empty_house_14.txt, 2277, 14
021c_mallory_and_the_trouble_with_twins_6.txt, 1740, 6
079c_mary_anne_breaks_the_rules_13.txt, 925, 13
122c_kristy_in_charge_2.txt, 2673, 2
115c_jessis_big_break_8.txt, 1753, 8
m15c_kristy_and_the_vampires_11.txt, 2077, 11
057c_dawn_saves_the_planet_5.txt, 1995, 5
015c_little_miss_stoneybrook_and_dawn_10.txt, 1502, 10
109c_mary_anne_to_the_rescue_12.txt, 1550, 12
068c_jessi_and_the_bad_babysitter_9.txt, 1384, 9
050c_dawns_big_date_10.txt, 2678, 10
063c_claudias_freind_friend_13.txt, 2123, 13
076c_staceys_lie_3.txt, 3911, 3
096c_abbys_lucky_thirteen_15.txt, 901, 15
069c_get_well_soon_mallory_5.txt, 1401, 5
039c_poor_mallory_4.txt, 1679, 4
m25c_kristy_and_the_middle_school_vandal_12.txt, 1636, 12
018c_staceys_mistake_14.txt, 1686, 14
045c_kristy_and_the_baby_parade_14.txt, 1636, 14
015c_little_miss_stoneybrook_and_dawn_1.txt, 2326, 1
097c_claudia_and_the_worlds_cutest_baby_4.txt, 1774, 4
084c_dawn_and_the_school_spirit_war_12.txt, 993, 12
103c_happy_holidays_jessi_2.txt, 3034, 2
037c_dawn_and_the_older_boy_13.txt, 1406, 13
127c_abbys_un_valentine_4.txt, 1913, 4
121c_abby_in_wonderland_2.txt, 3173, 2
101c_claudia_kishi_middle_school_dropout_11.txt, 2036, 11
m02c_beware_dawn_5.txt, 1575, 5
m15c_kristy_and_the_vampires_2.txt, 2481, 2
113c_claudia_makes_up_her_mind_6.txt, 1472, 6
102c_mary_anne_and_the_little_princess_11.txt, 1539, 11
serr1c_logans_story_2.txt, 2206, 2
119c_staceys_ex_boyfriend_1.txt, 1481, 1
027c_jessi_and_the_superbrat_8.txt, 1617, 8
007c_claudia_and_mean_jeanine_15.txt, 1368, 15
m29c_stacey_and_the_fashion_victim_5.txt, 1889, 5
m01c_stacey_and_the_mystery_ring_3.txt, 1841, 3
065c_staceys_big_crush_4.txt, 1931, 4
m08c_jessi_and_the_jewel_thieves_12.txt, 1683, 12
100c_kristys_worst_idea_1.txt, 1915, 1
047c_mallory_on_strike_7.txt, 1938, 7
025c_mary_anne_and_the_search_for_tigger_5.txt, 1402, 5
m22c_stacey_and_the_haunted_masquerade_8.txt, 1740, 8
099c_staceys_broken_heart_10.txt, 2726, 10
m26c_dawn_schafer_undercover_babysitter_9.txt, 1742, 9
081c_kristy_and_mr_mom_9.txt, 1526, 9
m18c_stacey_and_the_mystery_at_the_empty_house_6.txt, 1969, 6
029c_mallory_and_the_mystery_diary_4.txt, 1517, 4
038c_kristys_mystery_admirer_12.txt, 2019, 12
124c_stacey_mcgill_matchmaker_1.txt, 1679, 1
m17c_dawn_and_the_halloween_mystery_1.txt, 1882, 1
011c_kristy_and_the_snobs_8.txt, 1825, 8
025c_mary_anne_and_the_search_for_tigger_15.txt, 1589, 15
023c_dawn_on_the_coast_4.txt, 1742, 4
087c_stacey_and_the_bad_girls_4.txt, 2432, 4
050c_dawns_big_date_2.txt, 2677, 2
044c_dawn_and_the_big_sleepover_2.txt, 2017, 2
077c_dwn_and_whitney_friends_forever_14.txt, 949, 14
002c_claudia_and_the_phantom_phone_calls_12.txt, 1527, 12
106c_claudia_queen_of_the_seventh_grade_1.txt, 2007, 1
048c_jessis_wish_12.txt, 1363, 12
m20c_mary_anne_and_the_zoo_mystery_8.txt, 2878, 8
088c_farewell_dawn_10.txt, 1541, 10
072c_dawn_and_the_we_heart_kids_club_4.txt, 1194, 4
073c_mary_anne_and_miss_priss_4.txt, 1532, 4
serr3c_shannons_story_13.txt, 992, 13
058c_staceys_choice_3.txt, 1628, 3
083c_stacey_vs_the_bsc_14.txt, 1339, 14
062c_kristy_and_the_worst_kid_ever_4.txt, 2001, 4
096c_abbys_lucky_thirteen_6.txt, 966, 6
117c_claudia_and_the_terrible_truth_15.txt, 1230, 15
102c_mary_anne_and_the_little_princess_7.txt, 1244, 7
110c_abby_and_the_bad_sport_14.txt, 1026, 14
m19c_kristy_and_the_missing_fortune_6.txt, 1832, 6
125c_mary_anne_in_the_middle_4.txt, 1202, 4
m27c_claudia_and_the_lighthouse_ghost_5.txt, 1877, 5
033c_claudia_and_the_great_search_4.txt, 1795, 4
m10c_stacey_and_the_mystery_money_10.txt, 2118, 10
107c_mind_your_own_business_kristy_3.txt, 1030, 3
086c_mary_anne_and_camp_bsc_6.txt, 1463, 6
026c_claudia_and_the_sad_goodbye_8.txt, 1826, 8
053c_kristy_for_president_9.txt, 1561, 9
016c_jessis_secret_language_11.txt, 1696, 11
056c_keep_out_claudia_2.txt, 3183, 2
123c_claudias_big_party_10.txt, 1764, 10
098c_dawn_and_too_many_sitters_6.txt, 1626, 6
111c_staceys_secret_friend_11.txt, 1270, 11
012c_claudia_and_the_new_girl_8.txt, 1503, 8
035c_jessis_babysitter_6.txt, 1906, 6
060c_mary_annes_makeover_11.txt, 1046, 11
121c_abby_in_wonderland_10.txt, 2217, 10
099c_staceys_broken_heart_8.txt, 1711, 8
084c_dawn_and_the_school_spirit_war_5.txt, 1763, 5
070c_stacey_and_the_cheerleaders_14.txt, 1851, 14
052c_mary_anne_plus_too_many_babies_4.txt, 1696, 4
129c_kristy_at_bat_15.txt, 1816, 15
m31c_mary_anne_and_the_music_box_secret_14.txt, 1895, 14
085c_claudia_kishi_live_from_wsto_4.txt, 2278, 4
129c_kristy_at_bat_6.txt, 2120, 6
m34c_mary_anne_and_the_haunted_bookstore_1.txt, 2232, 1
054c_mallory_and_the_dream_horse_15.txt, 1377, 15
095c_kristy_plus_bart_equals_questionmark_12.txt, 1036, 12
079c_mary_anne_breaks_the_rules_3.txt, 1716, 3
m13c_mary_anne_and_the_library_mystery_8.txt, 1737, 8
m14c_stacey_and_the_mystery_at_the_mall_5.txt, 1989, 5
046c_mary_anne_misses_logan_5.txt, 1419, 5
031c_dawns_wicked_stepsister_8.txt, 1668, 8
m23c_abby_and_the_secret_society_3.txt, 1922, 3
m12c_dawn_and_the_surfer_ghost_8.txt, 1628, 8
055c_jessis_gold_medal_10.txt, 1488, 10
017c_mary_annes_bad_luck_mystery_4.txt, 1512, 4
101c_claudia_kishi_middle_school_dropout_5.txt, 1934, 5
039c_poor_mallory_14.txt, 1894, 14
080c_mallory_pike_no_1_fan_11.txt, 1130, 11
034c_mary_anne_and_too_many_boys_6.txt, 1907, 6
030c_mary_anne_and_the_great_romance_11.txt, 1754, 11
m16c_claudia_and_the_clue_in_the_photograph_5.txt, 1807, 5
m28c_abby_and_the_mystery_baby_7.txt, 1743, 7
037c_dawn_and_the_older_boy_7.txt, 1255, 7
m05c_mary_anne_and_the_secret_in_the_attic_2.txt, 2055, 2
043c_staceys_emergency_10.txt, 1318, 10
045c_kristy_and_the_baby_parade_2.txt, 2240, 2
078c_claudia_and_crazy_peaches_1.txt, 2660, 1
m08c_jessi_and_the_jewel_thieves_6.txt, 1801, 6
030c_mary_anne_and_the_great_romance_1.txt, 1848, 1
078c_claudia_and_crazy_peaches_15.txt, 1090, 15
032c_kristy_and_the_secret_of_susan_6.txt, 1485, 6
m19c_kristy_and_the_missing_fortune_12.txt, 1871, 12
013c_goodbye_stacey_goodbye_9.txt, 1984, 9
m29c_stacey_and_the_fashion_victim_11.txt, 1948, 11
059c_mallory_hates_boys_and_gym_10.txt, 1555, 10
004c_mary_anne_saves_the_day_13.txt, 1096, 13
m25c_kristy_and_the_middle_school_vandal_3.txt, 3375, 3
105c_stacey_the_math_whiz_14.txt, 1402, 14
006c_kristys_big_day_6.txt, 2193, 6
013c_goodbye_stacey_goodbye_14.txt, 1553, 14
124c_stacey_mcgill_matchmaker_15.txt, 903, 15
065c_staceys_big_crush_15.txt, 1107, 15
105c_stacey_the_math_whiz_4.txt, 1903, 4
128c_claudia_and_the_little_liar_7.txt, 1367, 7
067c_dawns_big_move_11.txt, 1473, 11
m06c_the_mystery_at_claudias_house_5.txt, 1793, 5
055c_jessis_gold_medal_9.txt, 1199, 9
m21c_claudia_and_the_recipe_for_danger_6.txt, 1836, 6
m36c_kristy_and_the_cat_burglar_3.txt, 2216, 3
126c_the_all_new_mallory_pike_14.txt, 2024, 14
077c_dwn_and_whitney_friends_forever_2.txt, 2503, 2
041c_mary_anne_vs_logan_5.txt, 1673, 5
104c_abbys_twin_7.txt, 1317, 7
089c_kristy_and_the_dirty_diapers_5.txt, 1468, 5
026c_claudia_and_the_sad_goodbye_14.txt, 1576, 14
108c_dont_give_up_mallory_3.txt, 2898, 3
006c_kristys_big_day_13.txt, 1574, 13
006c_kristys_big_day_12.txt, 1558, 12
089c_kristy_and_the_dirty_diapers_4.txt, 1630, 4
041c_mary_anne_vs_logan_4.txt, 1493, 4
104c_abbys_twin_6.txt, 1282, 6
108c_dont_give_up_mallory_2.txt, 3080, 2
026c_claudia_and_the_sad_goodbye_15.txt, 1744, 15
m03c_mallory_and_the_ghost_cat_1.txt, 2656, 1
110c_abby_and_the_bad_sport_1.txt, 1982, 1
077c_dwn_and_whitney_friends_forever_3.txt, 2287, 3
126c_the_all_new_mallory_pike_15.txt, 1197, 15
m21c_claudia_and_the_recipe_for_danger_7.txt, 2030, 7
m36c_kristy_and_the_cat_burglar_2.txt, 2288, 2
055c_jessis_gold_medal_8.txt, 1452, 8
m06c_the_mystery_at_claudias_house_4.txt, 2073, 4
067c_dawns_big_move_10.txt, 1200, 10
128c_claudia_and_the_little_liar_6.txt, 1878, 6
m04c_kristy_and_the_missing_child_1.txt, 1748, 1
082c_jessi_and_the_troublemaker_1.txt, 2900, 1
065c_staceys_big_crush_14.txt, 2277, 14
080c_mallory_pike_no_1_fan_1.txt, 2151, 1
105c_stacey_the_math_whiz_5.txt, 1454, 5
013c_goodbye_stacey_goodbye_15.txt, 1693, 15
006c_kristys_big_day_7.txt, 2424, 7
105c_stacey_the_math_whiz_15.txt, 1201, 15
124c_stacey_mcgill_matchmaker_14.txt, 770, 14
m25c_kristy_and_the_middle_school_vandal_2.txt, 1462, 2
004c_mary_anne_saves_the_day_12.txt, 1547, 12
m29c_stacey_and_the_fashion_victim_10.txt, 1944, 10
059c_mallory_hates_boys_and_gym_11.txt, 1954, 11
013c_goodbye_stacey_goodbye_8.txt, 1273, 8
m19c_kristy_and_the_missing_fortune_13.txt, 1941, 13
032c_kristy_and_the_secret_of_susan_7.txt, 1652, 7
078c_claudia_and_crazy_peaches_14.txt, 1574, 14
m08c_jessi_and_the_jewel_thieves_7.txt, 2096, 7
m05c_mary_anne_and_the_secret_in_the_attic_3.txt, 1733, 3
037c_dawn_and_the_older_boy_6.txt, 1980, 6
045c_kristy_and_the_baby_parade_3.txt, 2356, 3
043c_staceys_emergency_11.txt, 1758, 11
034c_mary_anne_and_too_many_boys_7.txt, 1613, 7
080c_mallory_pike_no_1_fan_10.txt, 1501, 10
039c_poor_mallory_15.txt, 1480, 15
m28c_abby_and_the_mystery_baby_6.txt, 1611, 6
030c_mary_anne_and_the_great_romance_10.txt, 1689, 10
m16c_claudia_and_the_clue_in_the_photograph_4.txt, 2144, 4
017c_mary_annes_bad_luck_mystery_5.txt, 1426, 5
101c_claudia_kishi_middle_school_dropout_4.txt, 1777, 4
m12c_dawn_and_the_surfer_ghost_9.txt, 1711, 9
m23c_abby_and_the_secret_society_2.txt, 2509, 2
031c_dawns_wicked_stepsister_9.txt, 1602, 9
055c_jessis_gold_medal_11.txt, 1133, 11
046c_mary_anne_misses_logan_4.txt, 1923, 4
m14c_stacey_and_the_mystery_at_the_mall_4.txt, 2003, 4
m13c_mary_anne_and_the_library_mystery_9.txt, 1889, 9
079c_mary_anne_breaks_the_rules_2.txt, 2082, 2
054c_mallory_and_the_dream_horse_14.txt, 3323, 14
051c_staceys_ex_best_friend_1.txt, 1909, 1
095c_kristy_plus_bart_equals_questionmark_13.txt, 1420, 13
129c_kristy_at_bat_7.txt, 1923, 7
085c_claudia_kishi_live_from_wsto_5.txt, 1475, 5
m31c_mary_anne_and_the_music_box_secret_15.txt, 2089, 15
095c_kristy_plus_bart_equals_questionmark_1.txt, 2188, 1
052c_mary_anne_plus_too_many_babies_5.txt, 1771, 5
129c_kristy_at_bat_14.txt, 2035, 14
070c_stacey_and_the_cheerleaders_15.txt, 808, 15
099c_staceys_broken_heart_9.txt, 1191, 9
084c_dawn_and_the_school_spirit_war_4.txt, 1334, 4
121c_abby_in_wonderland_11.txt, 1736, 11
060c_mary_annes_makeover_10.txt, 1805, 10
035c_jessis_babysitter_7.txt, 1631, 7
098c_dawn_and_too_many_sitters_7.txt, 1249, 7
012c_claudia_and_the_new_girl_9.txt, 1645, 9
111c_staceys_secret_friend_10.txt, 871, 10
123c_claudias_big_party_11.txt, 1654, 11
056c_keep_out_claudia_3.txt, 1251, 3
123c_claudias_big_party_13.txt, 1898, 13
056c_keep_out_claudia_1.txt, 1799, 1
m02c_beware_dawn_14.txt, 1948, 14
111c_staceys_secret_friend_12.txt, 965, 12
098c_dawn_and_too_many_sitters_5.txt, 1628, 5
104c_abbys_twin_15.txt, 1521, 15
005c_dawn_and_the_impossible_three_9.txt, 2046, 9
035c_jessis_babysitter_5.txt, 1437, 5
060c_mary_annes_makeover_12.txt, 2104, 12
121c_abby_in_wonderland_13.txt, 1326, 13
082c_jessi_and_the_troublemaker_14.txt, 1095, 14
serr3c_shannons_story_9.txt, 1036, 9
001c_kristys_great_idea_9.txt, 1196, 9
084c_dawn_and_the_school_spirit_war_6.txt, 2445, 6
095c_kristy_plus_bart_equals_questionmark_3.txt, 2390, 3
085c_claudia_kishi_live_from_wsto_7.txt, 1640, 7
129c_kristy_at_bat_5.txt, 1444, 5
052c_mary_anne_plus_too_many_babies_7.txt, 1640, 7
m14c_stacey_and_the_mystery_at_the_mall_6.txt, 2055, 6
083c_stacey_vs_the_bsc_9.txt, 1586, 9
046c_mary_anne_misses_logan_6.txt, 1529, 6
095c_kristy_plus_bart_equals_questionmark_11.txt, 1475, 11
051c_staceys_ex_best_friend_3.txt, 1538, 3
m34c_mary_anne_and_the_haunted_bookstore_2.txt, 3809, 2
009c_the_ghost_at_dawns_house_8.txt, 2342, 8
101c_claudia_kishi_middle_school_dropout_6.txt, 1917, 6
017c_mary_annes_bad_luck_mystery_7.txt, 1396, 7
055c_jessis_gold_medal_13.txt, 2592, 13
043c_staceys_emergency_13.txt, 1742, 13
045c_kristy_and_the_baby_parade_1.txt, 1934, 1
037c_dawn_and_the_older_boy_4.txt, 1984, 4
m05c_mary_anne_and_the_secret_in_the_attic_1.txt, 2087, 1
008c_boy_crazy_stacey_8.txt, 1609, 8
m16c_claudia_and_the_clue_in_the_photograph_6.txt, 1883, 6
030c_mary_anne_and_the_great_romance_12.txt, 1808, 12
m28c_abby_and_the_mystery_baby_4.txt, 2239, 4
080c_mallory_pike_no_1_fan_12.txt, 2376, 12
034c_mary_anne_and_too_many_boys_5.txt, 1654, 5
123c_claudias_big_party_8.txt, 770, 8
m08c_jessi_and_the_jewel_thieves_5.txt, 1839, 5
030c_mary_anne_and_the_great_romance_2.txt, 2511, 2
m12c_dawn_and_the_surfer_ghost_14.txt, 1936, 14
078c_claudia_and_crazy_peaches_2.txt, 2715, 2
106c_claudia_queen_of_the_seventh_grade_14.txt, 1201, 14
059c_mallory_hates_boys_and_gym_13.txt, 1413, 13
m29c_stacey_and_the_fashion_victim_12.txt, 1552, 12
m19c_kristy_and_the_missing_fortune_11.txt, 1526, 11
032c_kristy_and_the_secret_of_susan_5.txt, 1398, 5
m17c_dawn_and_the_halloween_mystery_14.txt, 1994, 14
004c_mary_anne_saves_the_day_10.txt, 1810, 10
105c_stacey_the_math_whiz_7.txt, 1240, 7
080c_mallory_pike_no_1_fan_3.txt, 2123, 3
m21c_claudia_and_the_recipe_for_danger_15.txt, 1757, 15
006c_kristys_big_day_5.txt, 2268, 5
m03c_mallory_and_the_ghost_cat_14.txt, 2194, 14
serr1c_logans_story_14.txt, 2144, 14
067c_dawns_big_move_12.txt, 1873, 12
049c_claudia_and_the_genius_of_elm_street_9.txt, 1168, 9
082c_jessi_and_the_troublemaker_3.txt, 1511, 3
m04c_kristy_and_the_missing_child_3.txt, 1993, 3
014c_hello_mallory_9.txt, 1631, 9
128c_claudia_and_the_little_liar_4.txt, 985, 4
022c_jessi_ramsey_petsitter_14.txt, 1448, 14
m06c_the_mystery_at_claudias_house_6.txt, 1931, 6
m21c_claudia_and_the_recipe_for_danger_5.txt, 1636, 5
077c_dwn_and_whitney_friends_forever_1.txt, 2311, 1
110c_abby_and_the_bad_sport_3.txt, 1892, 3
m28c_abby_and_the_mystery_baby_14.txt, 1800, 14
002c_claudia_and_the_phantom_phone_calls_8.txt, 1738, 8
m07c_dawn_and_the_disappearing_dogs_9.txt, 1979, 9
006c_kristys_big_day_10.txt, 1727, 10
m03c_mallory_and_the_ghost_cat_3.txt, 2070, 3
m31c_mary_anne_and_the_music_box_secret_9.txt, 1669, 9
104c_abbys_twin_4.txt, 1099, 4
041c_mary_anne_vs_logan_6.txt, 1537, 6
m11c_claudia_and_the_mystery_at_the_museum_9.txt, 1587, 9
089c_kristy_and_the_dirty_diapers_6.txt, 2377, 6
m31c_mary_anne_and_the_music_box_secret_8.txt, 1986, 8
108c_dont_give_up_mallory_1.txt, 1724, 1
m03c_mallory_and_the_ghost_cat_2.txt, 2422, 2
089c_kristy_and_the_dirty_diapers_7.txt, 1752, 7
m11c_claudia_and_the_mystery_at_the_museum_8.txt, 1436, 8
104c_abbys_twin_5.txt, 2686, 5
041c_mary_anne_vs_logan_7.txt, 1039, 7
m07c_dawn_and_the_disappearing_dogs_8.txt, 1373, 8
006c_kristys_big_day_11.txt, 2119, 11
m28c_abby_and_the_mystery_baby_15.txt, 1450, 15
002c_claudia_and_the_phantom_phone_calls_9.txt, 1979, 9
m36c_kristy_and_the_cat_burglar_1.txt, 1797, 1
m21c_claudia_and_the_recipe_for_danger_4.txt, 1581, 4
110c_abby_and_the_bad_sport_2.txt, 2238, 2
m06c_the_mystery_at_claudias_house_7.txt, 1597, 7
022c_jessi_ramsey_petsitter_15.txt, 1678, 15
128c_claudia_and_the_little_liar_5.txt, 1445, 5
014c_hello_mallory_8.txt, 1549, 8
m04c_kristy_and_the_missing_child_2.txt, 2411, 2
082c_jessi_and_the_troublemaker_2.txt, 3499, 2
049c_claudia_and_the_genius_of_elm_street_8.txt, 1438, 8
067c_dawns_big_move_13.txt, 1675, 13
serr1c_logans_story_15.txt, 1699, 15
m03c_mallory_and_the_ghost_cat_15.txt, 1810, 15
m21c_claudia_and_the_recipe_for_danger_14.txt, 1898, 14
006c_kristys_big_day_4.txt, 1896, 4
080c_mallory_pike_no_1_fan_2.txt, 2948, 2
105c_stacey_the_math_whiz_6.txt, 2103, 6
004c_mary_anne_saves_the_day_11.txt, 2921, 11
m17c_dawn_and_the_halloween_mystery_15.txt, 1933, 15
m25c_kristy_and_the_middle_school_vandal_1.txt, 2149, 1
032c_kristy_and_the_secret_of_susan_4.txt, 2288, 4
m19c_kristy_and_the_missing_fortune_10.txt, 1959, 10
059c_mallory_hates_boys_and_gym_12.txt, 1678, 12
106c_claudia_queen_of_the_seventh_grade_15.txt, 1488, 15
m29c_stacey_and_the_fashion_victim_13.txt, 1850, 13
030c_mary_anne_and_the_great_romance_3.txt, 2022, 3
m08c_jessi_and_the_jewel_thieves_4.txt, 2014, 4
078c_claudia_and_crazy_peaches_3.txt, 1557, 3
m12c_dawn_and_the_surfer_ghost_15.txt, 1406, 15
123c_claudias_big_party_9.txt, 1403, 9
m28c_abby_and_the_mystery_baby_5.txt, 1691, 5
m16c_claudia_and_the_clue_in_the_photograph_7.txt, 1962, 7
030c_mary_anne_and_the_great_romance_13.txt, 1285, 13
034c_mary_anne_and_too_many_boys_4.txt, 1479, 4
080c_mallory_pike_no_1_fan_13.txt, 1672, 13
043c_staceys_emergency_12.txt, 1934, 12
008c_boy_crazy_stacey_9.txt, 1543, 9
037c_dawn_and_the_older_boy_5.txt, 1640, 5
055c_jessis_gold_medal_12.txt, 2310, 12
m23c_abby_and_the_secret_society_1.txt, 1899, 1
101c_claudia_kishi_middle_school_dropout_7.txt, 1786, 7
009c_the_ghost_at_dawns_house_9.txt, 2637, 9
017c_mary_annes_bad_luck_mystery_6.txt, 1815, 6
051c_staceys_ex_best_friend_2.txt, 2995, 2
095c_kristy_plus_bart_equals_questionmark_10.txt, 1561, 10
m34c_mary_anne_and_the_haunted_bookstore_3.txt, 3625, 3
046c_mary_anne_misses_logan_7.txt, 1851, 7
083c_stacey_vs_the_bsc_8.txt, 1116, 8
m14c_stacey_and_the_mystery_at_the_mall_7.txt, 1919, 7
079c_mary_anne_breaks_the_rules_1.txt, 2371, 1
052c_mary_anne_plus_too_many_babies_6.txt, 1451, 6
095c_kristy_plus_bart_equals_questionmark_2.txt, 1420, 2
129c_kristy_at_bat_4.txt, 1964, 4
085c_claudia_kishi_live_from_wsto_6.txt, 1132, 6
084c_dawn_and_the_school_spirit_war_7.txt, 1738, 7
001c_kristys_great_idea_8.txt, 1266, 8
serr3c_shannons_story_8.txt, 1595, 8
082c_jessi_and_the_troublemaker_15.txt, 726, 15
005c_dawn_and_the_impossible_three_8.txt, 2322, 8
104c_abbys_twin_14.txt, 990, 14
121c_abby_in_wonderland_12.txt, 1259, 12
060c_mary_annes_makeover_13.txt, 3064, 13
035c_jessis_babysitter_4.txt, 1528, 4
111c_staceys_secret_friend_13.txt, 2171, 13
098c_dawn_and_too_many_sitters_4.txt, 1586, 4
m02c_beware_dawn_15.txt, 1327, 15
123c_claudias_big_party_12.txt, 1575, 12
007c_claudia_and_mean_jeanine_9.txt, 1719, 9
m02c_beware_dawn_11.txt, 1651, 11
056c_keep_out_claudia_4.txt, 1356, 4
070c_stacey_and_the_cheerleaders_12.txt, 1106, 12
084c_dawn_and_the_school_spirit_war_3.txt, 1745, 3
004c_mary_anne_saves_the_day_9.txt, 2063, 9
104c_abbys_twin_10.txt, 1585, 10
082c_jessi_and_the_troublemaker_11.txt, 983, 11
060c_mary_annes_makeover_9.txt, 1504, 9
079c_mary_anne_breaks_the_rules_5.txt, 1298, 5
m14c_stacey_and_the_mystery_at_the_mall_3.txt, 2036, 3
046c_mary_anne_misses_logan_3.txt, 1700, 3
m34c_mary_anne_and_the_haunted_bookstore_7.txt, 1163, 7
054c_mallory_and_the_dream_horse_13.txt, 1935, 13
095c_kristy_plus_bart_equals_questionmark_14.txt, 1748, 14
051c_staceys_ex_best_friend_6.txt, 1487, 6
m31c_mary_anne_and_the_music_box_secret_12.txt, 1944, 12
085c_claudia_kishi_live_from_wsto_2.txt, 2685, 2
095c_kristy_plus_bart_equals_questionmark_6.txt, 1521, 6
052c_mary_anne_plus_too_many_babies_2.txt, 2673, 2
129c_kristy_at_bat_13.txt, 1717, 13
037c_dawn_and_the_older_boy_1.txt, 2052, 1
m05c_mary_anne_and_the_secret_in_the_attic_4.txt, 2046, 4
045c_kristy_and_the_baby_parade_4.txt, 1884, 4
039c_poor_mallory_12.txt, 1521, 12
m16c_claudia_and_the_clue_in_the_photograph_3.txt, 1839, 3
m28c_abby_and_the_mystery_baby_1.txt, 1924, 1
017c_mary_annes_bad_luck_mystery_2.txt, 2288, 2
101c_claudia_kishi_middle_school_dropout_3.txt, 2066, 3
m23c_abby_and_the_secret_society_5.txt, 1889, 5
106c_claudia_queen_of_the_seventh_grade_11.txt, 1238, 11
091c_claudia_and_the_first_thanksgiving_8.txt, 1204, 8
m19c_kristy_and_the_missing_fortune_14.txt, 1992, 14
078c_claudia_and_crazy_peaches_13.txt, 1251, 13
m12c_dawn_and_the_surfer_ghost_11.txt, 1493, 11
078c_claudia_and_crazy_peaches_7.txt, 1997, 7
030c_mary_anne_and_the_great_romance_7.txt, 1722, 7
065c_staceys_big_crush_13.txt, 1629, 13
093c_mary_anne_and_the_memory_garden_9.txt, 1721, 9
105c_stacey_the_math_whiz_2.txt, 2745, 2
080c_mallory_pike_no_1_fan_6.txt, 2564, 6
105c_stacey_the_math_whiz_12.txt, 1768, 12
013c_goodbye_stacey_goodbye_12.txt, 1924, 12
124c_stacey_mcgill_matchmaker_13.txt, 1173, 13
m21c_claudia_and_the_recipe_for_danger_10.txt, 1840, 10
117c_claudia_and_the_terrible_truth_8.txt, 1538, 8
m25c_kristy_and_the_middle_school_vandal_5.txt, 2091, 5
m17c_dawn_and_the_halloween_mystery_11.txt, 1729, 11
004c_mary_anne_saves_the_day_15.txt, 1491, 15
022c_jessi_ramsey_petsitter_11.txt, 1823, 11
m06c_the_mystery_at_claudias_house_3.txt, 1728, 3
serr1c_logans_story_11.txt, 1259, 11
m03c_mallory_and_the_ghost_cat_11.txt, 2114, 11
m04c_kristy_and_the_missing_child_6.txt, 1720, 6
082c_jessi_and_the_troublemaker_6.txt, 1376, 6
128c_claudia_and_the_little_liar_1.txt, 1843, 1
088c_farewell_dawn_9.txt, 1554, 9
041c_mary_anne_vs_logan_3.txt, 2241, 3
104c_abbys_twin_1.txt, 1902, 1
089c_kristy_and_the_dirty_diapers_3.txt, 2307, 3
m03c_mallory_and_the_ghost_cat_6.txt, 2324, 6
026c_claudia_and_the_sad_goodbye_12.txt, 1708, 12
108c_dont_give_up_mallory_5.txt, 1804, 5
110c_abby_and_the_bad_sport_6.txt, 1036, 6
m36c_kristy_and_the_cat_burglar_5.txt, 1816, 5
077c_dwn_and_whitney_friends_forever_4.txt, 2379, 4
126c_the_all_new_mallory_pike_12.txt, 1913, 12
m28c_abby_and_the_mystery_baby_11.txt, 1488, 11
m28c_abby_and_the_mystery_baby_10.txt, 1882, 10
110c_abby_and_the_bad_sport_7.txt, 1830, 7
126c_the_all_new_mallory_pike_13.txt, 1651, 13
077c_dwn_and_whitney_friends_forever_5.txt, 1517, 5
m21c_claudia_and_the_recipe_for_danger_1.txt, 2026, 1
m36c_kristy_and_the_cat_burglar_4.txt, 1627, 4
089c_kristy_and_the_dirty_diapers_2.txt, 2526, 2
041c_mary_anne_vs_logan_2.txt, 2933, 2
108c_dont_give_up_mallory_4.txt, 2050, 4
026c_claudia_and_the_sad_goodbye_13.txt, 1943, 13
m03c_mallory_and_the_ghost_cat_7.txt, 2234, 7
006c_kristys_big_day_14.txt, 1312, 14
088c_farewell_dawn_8.txt, 1112, 8
082c_jessi_and_the_troublemaker_7.txt, 2044, 7
m04c_kristy_and_the_missing_child_7.txt, 1259, 7
m03c_mallory_and_the_ghost_cat_10.txt, 2558, 10
serr1c_logans_story_10.txt, 1029, 10
m06c_the_mystery_at_claudias_house_2.txt, 2029, 2
022c_jessi_ramsey_petsitter_10.txt, 1664, 10
004c_mary_anne_saves_the_day_14.txt, 3412, 14
117c_claudia_and_the_terrible_truth_9.txt, 1475, 9
m17c_dawn_and_the_halloween_mystery_10.txt, 1861, 10
m25c_kristy_and_the_middle_school_vandal_4.txt, 1834, 4
013c_goodbye_stacey_goodbye_13.txt, 1794, 13
006c_kristys_big_day_1.txt, 1851, 1
105c_stacey_the_math_whiz_13.txt, 1225, 13
m21c_claudia_and_the_recipe_for_danger_11.txt, 1770, 11
124c_stacey_mcgill_matchmaker_12.txt, 1228, 12
093c_mary_anne_and_the_memory_garden_8.txt, 2054, 8
065c_staceys_big_crush_12.txt, 1369, 12
080c_mallory_pike_no_1_fan_7.txt, 1890, 7
105c_stacey_the_math_whiz_3.txt, 1997, 3
078c_claudia_and_crazy_peaches_6.txt, 1516, 6
m12c_dawn_and_the_surfer_ghost_10.txt, 1986, 10
030c_mary_anne_and_the_great_romance_6.txt, 1639, 6
m08c_jessi_and_the_jewel_thieves_1.txt, 2026, 1
078c_claudia_and_crazy_peaches_12.txt, 1272, 12
091c_claudia_and_the_first_thanksgiving_9.txt, 1376, 9
032c_kristy_and_the_secret_of_susan_1.txt, 1790, 1
m19c_kristy_and_the_missing_fortune_15.txt, 1827, 15
106c_claudia_queen_of_the_seventh_grade_10.txt, 1409, 10
m23c_abby_and_the_secret_society_4.txt, 2167, 4
017c_mary_annes_bad_luck_mystery_3.txt, 1508, 3
101c_claudia_kishi_middle_school_dropout_2.txt, 2280, 2
039c_poor_mallory_13.txt, 1447, 13
034c_mary_anne_and_too_many_boys_1.txt, 2653, 1
m16c_claudia_and_the_clue_in_the_photograph_2.txt, 2970, 2
m05c_mary_anne_and_the_secret_in_the_attic_5.txt, 1604, 5
045c_kristy_and_the_baby_parade_5.txt, 1892, 5
052c_mary_anne_plus_too_many_babies_3.txt, 1946, 3
129c_kristy_at_bat_12.txt, 1759, 12
085c_claudia_kishi_live_from_wsto_3.txt, 1696, 3
129c_kristy_at_bat_1.txt, 1967, 1
m31c_mary_anne_and_the_music_box_secret_13.txt, 1805, 13
095c_kristy_plus_bart_equals_questionmark_7.txt, 1688, 7
054c_mallory_and_the_dream_horse_12.txt, 1414, 12
m34c_mary_anne_and_the_haunted_bookstore_6.txt, 3059, 6
051c_staceys_ex_best_friend_7.txt, 1418, 7
095c_kristy_plus_bart_equals_questionmark_15.txt, 1315, 15
060c_mary_annes_makeover_8.txt, 1118, 8
046c_mary_anne_misses_logan_2.txt, 2346, 2
m14c_stacey_and_the_mystery_at_the_mall_2.txt, 2293, 2
079c_mary_anne_breaks_the_rules_4.txt, 1535, 4
082c_jessi_and_the_troublemaker_10.txt, 1059, 10
035c_jessis_babysitter_1.txt, 1755, 1
104c_abbys_twin_11.txt, 1510, 11
004c_mary_anne_saves_the_day_8.txt, 1758, 8
084c_dawn_and_the_school_spirit_war_2.txt, 3168, 2
070c_stacey_and_the_cheerleaders_13.txt, 1226, 13
m02c_beware_dawn_10.txt, 1690, 10
056c_keep_out_claudia_5.txt, 1613, 5
007c_claudia_and_mean_jeanine_8.txt, 1679, 8
098c_dawn_and_too_many_sitters_1.txt, 1668, 1
111c_staceys_secret_friend_14.txt, 1118, 14
112c_kristy_and_the_sister_war_8.txt, 1269, 8
098c_dawn_and_too_many_sitters_3.txt, 2930, 3
131c_the_fire_at_mary_annes_house_8.txt, 1903, 8
056c_keep_out_claudia_7.txt, 1355, 7
m02c_beware_dawn_12.txt, 1758, 12
123c_claudias_big_party_15.txt, 932, 15
070c_stacey_and_the_cheerleaders_11.txt, 1502, 11
082c_jessi_and_the_troublemaker_12.txt, 1238, 12
104c_abbys_twin_13.txt, 955, 13
035c_jessis_babysitter_3.txt, 1797, 3
121c_abby_in_wonderland_15.txt, 894, 15
060c_mary_annes_makeover_14.txt, 1223, 14
051c_staceys_ex_best_friend_5.txt, 1108, 5
m34c_mary_anne_and_the_haunted_bookstore_4.txt, 4945, 4
054c_mallory_and_the_dream_horse_10.txt, 1924, 10
079c_mary_anne_breaks_the_rules_6.txt, 1102, 6
129c_kristy_at_bat_10.txt, 1688, 10
052c_mary_anne_plus_too_many_babies_1.txt, 1584, 1
070c_stacey_and_the_cheerleaders_8.txt, 2067, 8
095c_kristy_plus_bart_equals_questionmark_5.txt, 1381, 5
m31c_mary_anne_and_the_music_box_secret_11.txt, 1793, 11
129c_kristy_at_bat_3.txt, 1768, 3
085c_claudia_kishi_live_from_wsto_1.txt, 2088, 1
m24c_mary_anne_and_the_silent_witness_9.txt, 1678, 9
030c_mary_anne_and_the_great_romance_14.txt, 1906, 14
m28c_abby_and_the_mystery_baby_2.txt, 2122, 2
034c_mary_anne_and_too_many_boys_3.txt, 1759, 3
039c_poor_mallory_11.txt, 1430, 11
080c_mallory_pike_no_1_fan_14.txt, 1808, 14
043c_staceys_emergency_15.txt, 1795, 15
045c_kristy_and_the_baby_parade_7.txt, 1358, 7
037c_dawn_and_the_older_boy_2.txt, 1999, 2
m05c_mary_anne_and_the_secret_in_the_attic_7.txt, 2044, 7
055c_jessis_gold_medal_15.txt, 1240, 15
024c_kristy_and_the_mothers_day_surprise_9.txt, 1716, 9
m23c_abby_and_the_secret_society_6.txt, 1903, 6
017c_mary_annes_bad_luck_mystery_1.txt, 2244, 1
032c_kristy_and_the_secret_of_susan_3.txt, 1988, 3
106c_claudia_queen_of_the_seventh_grade_12.txt, 1473, 12
059c_mallory_hates_boys_and_gym_15.txt, 1181, 15
m29c_stacey_and_the_fashion_victim_14.txt, 1980, 14
m08c_jessi_and_the_jewel_thieves_3.txt, 1981, 3
030c_mary_anne_and_the_great_romance_4.txt, 1441, 4
066c_maid_mary_anne_8.txt, 1455, 8
m12c_dawn_and_the_surfer_ghost_12.txt, 1725, 12
078c_claudia_and_crazy_peaches_4.txt, 1699, 4
078c_claudia_and_crazy_peaches_10.txt, 1650, 10
m33c_stacey_and_the_stolen_hearts_8.txt, 1648, 8
124c_stacey_mcgill_matchmaker_10.txt, 1104, 10
m21c_claudia_and_the_recipe_for_danger_13.txt, 1585, 13
105c_stacey_the_math_whiz_11.txt, 1503, 11
006c_kristys_big_day_3.txt, 2071, 3
013c_goodbye_stacey_goodbye_11.txt, 1418, 11
105c_stacey_the_math_whiz_1.txt, 1677, 1
080c_mallory_pike_no_1_fan_5.txt, 1450, 5
065c_staceys_big_crush_10.txt, 1248, 10
m25c_kristy_and_the_middle_school_vandal_6.txt, 2148, 6
m17c_dawn_and_the_halloween_mystery_12.txt, 1980, 12
022c_jessi_ramsey_petsitter_12.txt, 2007, 12
m04c_kristy_and_the_missing_child_5.txt, 1772, 5
082c_jessi_and_the_troublemaker_5.txt, 1504, 5
128c_claudia_and_the_little_liar_2.txt, 2899, 2
serr1c_logans_story_12.txt, 1315, 12
m03c_mallory_and_the_ghost_cat_12.txt, 2277, 12
067c_dawns_big_move_14.txt, 1649, 14
m03c_mallory_and_the_ghost_cat_5.txt, 2236, 5
059c_mallory_hates_boys_and_gym_8.txt, 1466, 8
026c_claudia_and_the_sad_goodbye_11.txt, 1746, 11
108c_dont_give_up_mallory_6.txt, 1914, 6
104c_abbys_twin_2.txt, 3233, 2
m28c_abby_and_the_mystery_baby_12.txt, 1832, 12
063c_claudias_freind_friend_8.txt, 1963, 8
m36c_kristy_and_the_cat_burglar_6.txt, 1787, 6
m21c_claudia_and_the_recipe_for_danger_3.txt, 2013, 3
126c_the_all_new_mallory_pike_11.txt, 1940, 11
077c_dwn_and_whitney_friends_forever_7.txt, 2102, 7
110c_abby_and_the_bad_sport_5.txt, 880, 5
077c_dwn_and_whitney_friends_forever_6.txt, 2376, 6
126c_the_all_new_mallory_pike_10.txt, 1739, 10
m36c_kristy_and_the_cat_burglar_7.txt, 2320, 7
m21c_claudia_and_the_recipe_for_danger_2.txt, 2537, 2
110c_abby_and_the_bad_sport_4.txt, 1291, 4
m28c_abby_and_the_mystery_baby_13.txt, 1788, 13
063c_claudias_freind_friend_9.txt, 1490, 9
108c_dont_give_up_mallory_7.txt, 1636, 7
026c_claudia_and_the_sad_goodbye_10.txt, 1568, 10
m03c_mallory_and_the_ghost_cat_4.txt, 2377, 4
059c_mallory_hates_boys_and_gym_9.txt, 1512, 9
089c_kristy_and_the_dirty_diapers_1.txt, 2839, 1
104c_abbys_twin_3.txt, 1057, 3
041c_mary_anne_vs_logan_1.txt, 1830, 1
067c_dawns_big_move_15.txt, 1366, 15
m03c_mallory_and_the_ghost_cat_13.txt, 2206, 13
serr1c_logans_story_13.txt, 1134, 13
128c_claudia_and_the_little_liar_3.txt, 1662, 3
082c_jessi_and_the_troublemaker_4.txt, 1326, 4
m04c_kristy_and_the_missing_child_4.txt, 1734, 4
022c_jessi_ramsey_petsitter_13.txt, 1981, 13
m06c_the_mystery_at_claudias_house_1.txt, 2039, 1
m17c_dawn_and_the_halloween_mystery_13.txt, 1906, 13
m25c_kristy_and_the_middle_school_vandal_7.txt, 1219, 7
080c_mallory_pike_no_1_fan_4.txt, 1892, 4
065c_staceys_big_crush_11.txt, 1225, 11
m21c_claudia_and_the_recipe_for_danger_12.txt, 1814, 12
124c_stacey_mcgill_matchmaker_11.txt, 1157, 11
013c_goodbye_stacey_goodbye_10.txt, 1820, 10
006c_kristys_big_day_2.txt, 2101, 2
105c_stacey_the_math_whiz_10.txt, 1873, 10
m33c_stacey_and_the_stolen_hearts_9.txt, 1850, 9
078c_claudia_and_crazy_peaches_11.txt, 1452, 11
030c_mary_anne_and_the_great_romance_5.txt, 1546, 5
m08c_jessi_and_the_jewel_thieves_2.txt, 2650, 2
078c_claudia_and_crazy_peaches_5.txt, 2169, 5
m12c_dawn_and_the_surfer_ghost_13.txt, 1921, 13
066c_maid_mary_anne_9.txt, 1714, 9
059c_mallory_hates_boys_and_gym_14.txt, 1924, 14
106c_claudia_queen_of_the_seventh_grade_13.txt, 1869, 13
m29c_stacey_and_the_fashion_victim_15.txt, 1452, 15
032c_kristy_and_the_secret_of_susan_2.txt, 2499, 2
101c_claudia_kishi_middle_school_dropout_1.txt, 1939, 1
055c_jessis_gold_medal_14.txt, 1926, 14
m23c_abby_and_the_secret_society_7.txt, 1826, 7
024c_kristy_and_the_mothers_day_surprise_8.txt, 1728, 8
045c_kristy_and_the_baby_parade_6.txt, 2008, 6
043c_staceys_emergency_14.txt, 1517, 14
m05c_mary_anne_and_the_secret_in_the_attic_6.txt, 2006, 6
037c_dawn_and_the_older_boy_3.txt, 2047, 3
m28c_abby_and_the_mystery_baby_3.txt, 2194, 3
m16c_claudia_and_the_clue_in_the_photograph_1.txt, 2400, 1
030c_mary_anne_and_the_great_romance_15.txt, 1876, 15
080c_mallory_pike_no_1_fan_15.txt, 1050, 15
039c_poor_mallory_10.txt, 1475, 10
034c_mary_anne_and_too_many_boys_2.txt, 2146, 2
095c_kristy_plus_bart_equals_questionmark_4.txt, 1494, 4
m24c_mary_anne_and_the_silent_witness_8.txt, 1784, 8
129c_kristy_at_bat_2.txt, 2453, 2
m31c_mary_anne_and_the_music_box_secret_10.txt, 1559, 10
129c_kristy_at_bat_11.txt, 1772, 11
070c_stacey_and_the_cheerleaders_9.txt, 976, 9
046c_mary_anne_misses_logan_1.txt, 1760, 1
m14c_stacey_and_the_mystery_at_the_mall_1.txt, 2254, 1
079c_mary_anne_breaks_the_rules_7.txt, 2041, 7
051c_staceys_ex_best_friend_4.txt, 1871, 4
054c_mallory_and_the_dream_horse_11.txt, 1638, 11
m34c_mary_anne_and_the_haunted_bookstore_5.txt, 2295, 5
104c_abbys_twin_12.txt, 1927, 12
121c_abby_in_wonderland_14.txt, 792, 14
035c_jessis_babysitter_2.txt, 2226, 2
082c_jessi_and_the_troublemaker_13.txt, 1571, 13
070c_stacey_and_the_cheerleaders_10.txt, 1616, 10
084c_dawn_and_the_school_spirit_war_1.txt, 1725, 1
123c_claudias_big_party_14.txt, 1677, 14
056c_keep_out_claudia_6.txt, 1232, 6
m02c_beware_dawn_13.txt, 2023, 13
111c_staceys_secret_friend_15.txt, 1630, 15
131c_the_fire_at_mary_annes_house_9.txt, 1830, 9
098c_dawn_and_too_many_sitters_2.txt, 2039, 2
112c_kristy_and_the_sister_war_9.txt, 2007, 9

After running the code on the full text of every series, and on the chapter lengths of the main and mystery series (remember, each of those books has 15 chapters), I put the output files into Tableau (Gantt visualization, configuring length as a dimension under “rows”).

The books range from around 12,600 words (California Diaries: Amalia 3, which is shorter than this DSC book!), to nearly 45,000 words (Super Mystery #1: Baby-Sitters’ Haunted House). On the chapter level, there’s not a ton of variation in word count between chapters, though chapter 15 tends to be a bit shorter, and chapter 2 tends to be longer – there’s a lot of tropes to pack in!

Gantt chart of book and chapter lengths

But if we’re using Euclidean distance to compare even chapter 2s, BSC #75: Jessi’s Horrible Prank is 1,266 words and BSC #99: Stacey’s Broken Heart is 4,293 words. That alone is going to lead to a big difference in the word-count values.
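To make the length problem concrete, here is a minimal sketch using made-up word counts for three imaginary chapters (the numbers and variable names are illustrative assumptions, not values from the corpus). It computes Euclidean distance by hand rather than through SciPy, just to show the arithmetic.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical counts for three words, e.g. ["club", "meeting", "phone"]
short_chapter = [10, 4, 6]
# The same proportions of each word, in a chapter three times as long
long_chapter = [30, 12, 18]
# A different mix of words, in a chapter the same length as the short one
other_chapter = [4, 10, 6]

same_mix = euclidean_distance(short_chapter, long_chapter)
different_mix = euclidean_distance(short_chapter, other_chapter)

print(f"same word mix, 3x the length: {same_mix:.2f}")
print(f"different word mix, same length: {different_mix:.2f}")
```

In this toy case the chapter with an identical mix of words ends up farther away than the one with a different mix, purely because it is longer.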

When I first started playing with these text-comparison metrics (before taking the care to properly clean the data and ensure there weren’t problems with my chapter-separating code), I tried Euclidean distance first, and was fascinated by the apparent similarity of chapter 2 in the first Baby-Sitters Club book and a chapter in a California Diaries book. “What,” I wondered, “does wholesome Kristy’s Great Idea have to do with salacious California Diaries?”

I laughed out loud when I opened the text files containing the text of those chapters, and immediately saw the answer: what they had in common was data cleaning problems that led to their truncation after a sentence or two. As a Choose Your Own Adventure book might put it, “You realize that your ‘findings’ are nothing more than your own mistakes in preparing your data set. You sigh wearily. The end.” Hopefully you, like childhood me, left a bookmark at that last decision point you were unsure of, and you can go back and make a different choice. But even if you have to start over from the beginning, you can almost always try again when doing DH.
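One cheap way to catch this kind of truncation before it turns into a “finding” is to scan the word counts for anything suspiciously short. The sketch below is only an illustration: the function name, the file names, and the 200-word threshold are all assumptions, and a sensible cutoff would depend on the corpus.

```python
def flag_short_texts(word_counts, min_words=200):
    """Return (filename, count) pairs below min_words, shortest first."""
    flagged = [(name, count) for name, count in word_counts.items()
               if count < min_words]
    return sorted(flagged, key=lambda pair: pair[1])

# Made-up counts for illustration; the second entry mimics a truncated file
example_counts = {
    "chapter_a.txt": 1639,
    "chapter_b.txt": 23,
    "chapter_c.txt": 2026,
}

suspects = flag_short_texts(example_counts)
for name, count in suspects:
    print(f"{name}: only {count} words, worth opening by hand")
```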

Cosine similarity

Cosine similarity offers a workaround for the text-scale problems we encountered with Euclidean distance. Instead of trying to measure the distance between two points (which can be thrown off due to issues of magnitude, when one point represents a text that’s much longer than the other), it measures the cosine of the angle between them and calls it similarity. You may have also filed “cosine” away under “high school math I hoped to never see again”, but don’t panic! As trigonometry starts to flood back at you, you might find yourself wondering, “Why cosine similarity, and not any of its little friends, like sine or tangent?” After all, wouldn’t it be fun to burst into the chorus of Ace of Base’s “I Saw the Sine” whenever you worked out the text similarity?

Mostly it works out to a matter of numerical convenience in setting up the framing for measuring similarity: If the angle between two points is 0, then that means any difference is just one of magnitude (which we don’t worry about with cosine similarity) and you can say the texts are extremely similar. If the angle is 90 degrees, which is as far as you can get while staying in all-positive numbers (we don’t have any negative word counts), then there’s a huge difference. Cos(0°) = 1, and cos(90°) = 0, so with cosine similarity, you want larger numbers for more similarity. Which is the opposite of Euclidean distance, where you want smaller numbers for more similarity (because using that measure, 0 means “there’s no distance between these things and they are the same”). I’ve screwed this up more than once, getting excited about large numbers when using an algorithm where you want smaller numbers, and vice versa. Always double-check the scale you’re using and what counts as “similar” if you’re not sure. Or, as you might find in a Choose Your Own Adventure book: “The tweet was written, delayed only by the search for the perfect celebratory emoji to decorate its conclusion, when a small voice echoes in the back of your head. ‘Should these be large numbers? What algorithm did you use?’ You pause and think for a moment… then sigh, delete the tweet, and return to your code to start over. The end.”

But before you start writing “EUCLIDEAN = SMALL, COSINE = BIG” in sharpie on a sticky note and putting it on your wall with extra tape for reinforcement, the people who write Python packages realized it’s going to be a problem if they write a package where you can easily swap in different metrics, but some of them use large numbers for similarity, while others use small numbers. So what you’ll see in the Jupyter notebook is that it’s calculating cosine distance – which is just (1 - cosine similarity). After that bit of subtraction, “exactly the same” has a value of 0, just like you’d get in Euclidean distance.
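As a rough illustration of that subtraction, here is a hand-rolled sketch with made-up word counts (the vectors and function names are assumptions for the example; the notebook itself relies on SciPy's `pdist` with `metric='cosine'`). It shows the two extremes: texts with the same proportions of words but different lengths, and texts with no words in common.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 - cosine similarity, so that 0 means 'pointing the same way'."""
    return 1 - cosine_similarity(a, b)

# Hypothetical counts: same word proportions, but one text is 3x longer
short_text = [10, 4, 6]
long_text = [30, 12, 18]
# Hypothetical counts: two texts that share no words at all
no_overlap_a = [5, 0, 0]
no_overlap_b = [0, 7, 0]

same_direction = cosine_distance(short_text, long_text)
orthogonal = cosine_distance(no_overlap_a, no_overlap_b)

print(f"same proportions, different length: {same_direction:.3f}")
print(f"no shared words: {orthogonal:.3f}")
```

The first pair comes out at (essentially) 0 despite the difference in length, and the second at 1, which is the largest value possible when no count is negative.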

We’re still not exactly comparing apples to apples here: you’re going to get much bigger numbers when calculating Euclidean distance than when calculating cosine distance, which makes sense. Euclidean distance is a kind of actual distance. Cosine distance is still just based on the angle between two vectors: it’s the cosine of that angle, which looks like a percentage, with a bit of manipulation to make “identical” work out to 0. The numbers are a lot smaller, and their range is a lot more compressed (from 0 to .99 for cosine distance, vs. 0 to 650 in our data set for Euclidean distance). The Euclidean distance score can be more nuanced, but this is a situation where nuance is a bad thing. I’m not doing this particular analysis to find precisely how different the texts are from each other – which is a good thing, because I know the variable length is a distorting factor that would prevent me from getting to that perfect number anyway. What I’m looking for is book pairings that stand out as noteworthy, either for their similarity or dissimilarity. And the compressed range of possible values for cosine distance makes those differences more visible.

Running the Euclidean distance calculation didn’t do anything to the results of our count vectorizer, so if you’re working through this book in order, you should be able to just run the cosine distance calculation below. If you have trouble, you can rerun the code cell with the CountVectorizer code in it – just make sure you’ve got it pointing to the right directory with the full text files.

cosine_distances = pd.DataFrame(squareform(pdist(wordcounts, metric='cosine')), index=filekeys, columns=filekeys)
cosine_distances
000c_the_summer_before 001c_kristys_great_idea 002c_claudia_and_the_phantom_phone_calls 003c_the_truth_about_stacey 004c_mary_anne_saves_the_day 005c_dawn_and_the_impossible_three 006c_kristys_big_day 007c_claudia_and_mean_jeanine 008c_boy_crazy_stacey 009c_the_ghost_at_dawns_house ... m36c_kristy_and_the_cat_burglar pc1c_staceys_book pc2c_claudias_book pc3c_dawns_book pc4c_mary_annes_book pc5c_kristys_book pc6c_abbys_book serr1c_logans_story serr2c_logan_bruno_boy_babysitter serr3c_shannons_story
000c_the_summer_before 0.000000 0.713682 0.619032 0.582249 0.573231 0.896098 0.832126 0.572193 0.755619 0.764822 ... 0.939076 0.507250 0.556577 0.913384 0.671589 0.758748 0.920182 0.880565 0.913763 0.885829
001c_kristys_great_idea 0.713682 0.000000 0.492413 0.681616 0.783353 0.911498 0.859378 0.789081 0.888678 0.731887 ... 0.967335 0.939854 0.872472 0.943598 0.904584 0.898065 0.980270 0.932188 0.952004 0.918028
002c_claudia_and_the_phantom_phone_calls 0.619032 0.492413 0.000000 0.438189 0.664592 0.928928 0.848969 0.470468 0.908284 0.777697 ... 0.808612 0.924519 0.645428 0.949983 0.641644 0.874637 0.980874 0.932040 0.896295 0.910428
003c_the_truth_about_stacey 0.582249 0.681616 0.438189 0.000000 0.731113 0.964981 0.866939 0.663324 0.942381 0.792115 ... 0.831450 0.706039 0.797737 0.952098 0.835715 0.936946 0.984321 0.935222 0.896311 0.887956
004c_mary_anne_saves_the_day 0.573231 0.783353 0.664592 0.731113 0.000000 0.892197 0.880304 0.431425 0.811704 0.816308 ... 0.962725 0.944486 0.721621 0.948991 0.709130 0.899979 0.977696 0.773900 0.919541 0.934259
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
pc5c_kristys_book 0.758748 0.898065 0.874637 0.936946 0.899979 0.927234 0.882085 0.836247 0.923280 0.951684 ... 0.961104 0.928659 0.805224 0.893659 0.823510 0.000000 0.958598 0.886807 0.936619 0.889512
pc6c_abbys_book 0.920182 0.980270 0.980874 0.984321 0.977696 0.986311 0.979436 0.989931 0.967925 0.986640 ... 0.937815 0.962131 0.940998 0.973152 0.938346 0.958598 0.000000 0.984834 0.984298 0.974587
serr1c_logans_story 0.880565 0.932188 0.932040 0.935222 0.773900 0.957424 0.949962 0.923127 0.877722 0.915415 ... 0.966749 0.957765 0.929770 0.951530 0.951290 0.886807 0.984834 0.000000 0.752683 0.948513
serr2c_logan_bruno_boy_babysitter 0.913763 0.952004 0.896295 0.896311 0.919541 0.934367 0.961223 0.972836 0.876257 0.888373 ... 0.907251 0.957883 0.941948 0.969854 0.910845 0.936619 0.984298 0.752683 0.000000 0.945843
serr3c_shannons_story 0.885829 0.918028 0.910428 0.887956 0.934259 0.910544 0.915238 0.952263 0.915999 0.948721 ... 0.961958 0.943121 0.869385 0.945358 0.875529 0.889512 0.974587 0.948513 0.945843 0.000000

192 rows × 192 columns

cosine_distances.to_csv('cosine_distances_count.csv')
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(cosine_distances)
#Displays the image
plt.show()

Cosine distance with count vectorizer

A sort of light salmon in the Euclidean distance visualization represented a value of 500, and the same color represents .8 in the cosine distance visualization. To my mind, the overall impression is less of Mary Anne’s classic plaid, and more like a dirty Kristy’s Krushers baseball jersey with flecks and blobs of spaghetti sauce here and there. (I’ll note that there’s some disagreement here within the DSC; Katia’s reaction was “Plaid in salmon and pink? Sickening, but still something Mary Anne’s dad would make her wear.”)

It’s not pretty, but it’s clarifying.

First, those super-light bands that are quite similar to one another (where they intersect in a box around the black diagonal line), but quite dissimilar from everything else? That’s the California Diaries series. And California Diaries: Dawn 1 is still a little lighter than all the rest of that sub-series, but not so much so. This visualization makes it easier to see that the California Diaries are much more similar to regular-series books set in California, like BSC #23: Dawn on the Coast and BSC #72: Dawn and the We ♥️ Kids Club. It’s not a groundbreaking discovery, but it immediately makes sense! And honestly, “boring” DH results are often a sign that you’ve done something right.

Abby’s Book is still fairly distinct, but this visualization makes it easier to see some of the points of overlap for the other Portrait Collection books, like the overlap between Claudia’s and Mary Anne’s autobiographies and BSC #7: Claudia and Mean Janine, which features Kishi family drama and a focus on Claudia’s grandmother Mimi, who was an important figure in both girls’ lives. There are also speckles of dark spots on the visualization, which mostly seem to correspond to books with the same narrator. It’s particularly prominent with distinctive narrators, like Jessi, whose interests and perspective are not shared by the other characters.

The phenomenon involving books #83-101 forming a cluster (including, we can see here, the mystery novels published around the same time period) is still visible here. I don’t have an explanation (though Anouk suspects possible editorial influence since the books are sequential), but this could be something worth exploring later.

But while this has been an interesting diversion, let’s get back to chapter 2! After running just the chapter 2s through the same cosine distance calculation, here’s what we get.

ch2dir = '/Users/qad/Documents/dsc_chapters/ch2'
os.chdir(ch2dir)
# Use the glob library to create a list of file names, sorted alphabetically
# Alphabetical sorting will get us the books in numerical order
filenames = sorted(glob.glob("*.txt"))
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a CountVectorizer instance with the parameters you need
vectorizer = CountVectorizer(input="filename", max_features=1000, max_df=0.7)
# Run the vectorizer on your list of filenames to create your wordcounts
# Use the toarray() function so that SciPy will accept the results
ch2 = vectorizer.fit_transform(filenames).toarray()
ch2_cosine = pd.DataFrame(squareform(pdist(ch2, metric='cosine')), index=filekeys, columns=filekeys)
ch2_cosine
001c_kristys_great_idea_2 002c_claudia_and_the_phantom_phone_calls_2 003c_the_truth_about_stacey_2 004c_mary_anne_saves_the_day_2 005c_dawn_and_the_impossible_three_2 006c_kristys_big_day_2 007c_claudia_and_mean_jeanine_2 008c_boy_crazy_stacey_2 009c_the_ghost_at_dawns_house_2 010c_logan_likes_mary_anne_2 ... m30c_kristy_and_the_mystery_train_2 m31c_mary_anne_and_the_music_box_secret_2 m32c_claudia_and_the_mystery_in_the_painting_2 m33c_stacey_and_the_stolen_hearts_2 m34c_mary_anne_and_the_haunted_bookstore_2 m35c_abby_and_the_notorius_neighbor_2 m36c_kristy_and_the_cat_burglar_2 serr1c_logans_story_2 serr2c_logan_bruno_boy_babysitter_2 serr3c_shannons_story_2
001c_kristys_great_idea_2 0.000000 0.766090 0.645483 0.531795 0.675465 0.656225 0.688112 0.764615 0.766119 0.761255 ... 0.786214 0.802706 0.737756 0.731130 0.723651 0.715903 0.886659 0.791623 0.780976 0.797041
002c_claudia_and_the_phantom_phone_calls_2 0.766090 0.000000 0.666760 0.739101 0.765984 0.691542 0.731188 0.737672 0.750627 0.664367 ... 0.798273 0.844518 0.790636 0.804286 0.799606 0.762009 0.889835 0.705636 0.738903 0.795357
003c_the_truth_about_stacey_2 0.645483 0.666760 0.000000 0.661717 0.707359 0.665165 0.663676 0.710884 0.760346 0.748608 ... 0.764393 0.807352 0.749600 0.758404 0.712565 0.653595 0.895618 0.692688 0.718258 0.814498
004c_mary_anne_saves_the_day_2 0.531795 0.739101 0.661717 0.000000 0.639950 0.625909 0.702424 0.740980 0.720565 0.753377 ... 0.760781 0.806796 0.758356 0.748060 0.756648 0.738210 0.903046 0.764599 0.748895 0.836030
005c_dawn_and_the_impossible_three_2 0.675465 0.765984 0.707359 0.639950 0.000000 0.721849 0.613405 0.763808 0.852449 0.786941 ... 0.769214 0.773123 0.740475 0.686315 0.706921 0.691431 0.911044 0.697064 0.668238 0.854810
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
m35c_abby_and_the_notorius_neighbor_2 0.715903 0.762009 0.653595 0.738210 0.691431 0.758548 0.731529 0.773304 0.762364 0.804598 ... 0.590592 0.555394 0.604087 0.547327 0.537541 0.000000 0.820411 0.645871 0.607143 0.804980
m36c_kristy_and_the_cat_burglar_2 0.886659 0.889835 0.895618 0.903046 0.911044 0.891551 0.914885 0.911541 0.911166 0.915891 ... 0.911671 0.921123 0.901557 0.903830 0.866497 0.820411 0.000000 0.913101 0.898953 0.925601
serr1c_logans_story_2 0.791623 0.705636 0.692688 0.764599 0.697064 0.768548 0.778784 0.819493 0.793814 0.818823 ... 0.712838 0.695023 0.723235 0.648015 0.692043 0.645871 0.913101 0.000000 0.543022 0.821760
serr2c_logan_bruno_boy_babysitter_2 0.780976 0.738903 0.718258 0.748895 0.668238 0.740903 0.745459 0.808759 0.832129 0.782887 ... 0.683610 0.655754 0.645678 0.609764 0.631424 0.607143 0.898953 0.543022 0.000000 0.842790
serr3c_shannons_story_2 0.797041 0.795357 0.814498 0.836030 0.854810 0.817909 0.833261 0.854468 0.886418 0.763493 ... 0.808888 0.864046 0.862112 0.870870 0.824117 0.804980 0.925601 0.821760 0.842790 0.000000

167 rows × 167 columns

ch2_cosine.to_csv('ch2_cosine_count.csv')
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(ch2_cosine)
#Displays the image
plt.show()

Cosine distance for chapter 2 with count vectorizer

I did a double-take when I saw it, and went back to check the code and make sure I hadn’t accidentally run Euclidean distance again. The chapter 2s are a lot closer than the books overall. Which makes sense – the reason we’re looking at chapter 2 is because we know it’s repetitive. This is a smaller data set than what we used for the full book comparison, including only chapter 2s from the main and mystery series (which follow the 15-chapter structure). Even the chapter 2s show the pattern of similarity for books #83-101 and temporally-similar mysteries, and there’s another cluster from books #30-48. The light-colored lines reflect another known phenomenon about chapter 2, where sometimes the typical “chapter 2” content actually appears in chapter 3.

To drive home the point that there’s something different going on here with chapter 2, I re-ran cosine distance on four other chapters: 1 and 5 (top row), and 9 and 15 (bottom row).

(I’m not going to repeat the code for calculating these here; it’s the same as the chapter 2 code above, with different source folders.)

Cosine distance for chapters 1, 5, 9, and 15

There are some interesting things that we could dig into here! It looks like there’s more overlap in how the books end (ch. 15, bottom right) than how the middle of the book goes, though there are lots of individual speckles of high similarity for the middle chapters. Chapter 1 starts similarly in the early books, but is pretty dispersed by the end. The cluster in books #83-101 isn’t really visible in these chapters. But the crucial thing we’re seeing is just that chapter 2s are much more similar to one another than other chapters.

Word counts or word frequencies?

I ran this part by Mark, pleased with myself for having worked through a tutorial, modified it to fit what I wanted to work on, and come up with a largely interpretable result that was brimming with possibilities for things to explore next.

His response caught me completely off-guard: “You scaled, or otherwise normalized, your word counts, right? RIGHT? RIGHT?!?!? I only ask because you don’t mention it anywhere, and if you don’t normalize your word counts by turning them into word frequencies, you are only really going to ever find out about what texts are longer than others.”

Uh-oh. That Programming Historian tutorial hadn’t said anything about word frequencies. In fact, it’d used the word count vectorizer in its code. I knew that would be a problem for Euclidean distance, but I’d hoped that cosine distance would… solve it?

“If you use frequencies instead of counts, then you can compare texts that are of somewhat different lengths (within an order of magnitude) pretty effectively,” suggested Mark. “The big problem with Euclidean distances are 0 values. When you use too many dimensions, especially when you use word frequencies, there are a lot of 0s, and these are overweighted by Euclidean distance so that similar texts of very different lengths look much more different than they should – because the longer text has a lot of words that the shorter text doesn’t have (and the reverse is not as true – the shorter text has far fewer words that the longer text doesn’t have). So, when you compare a novel to a short story (or a LONG novel to a normal novel), this becomes a real problem. Cosine is still probably a better metric for the kind of work that you are doing, but here too it is crucial to scale/normalize your counts – otherwise size just keeps becoming a factor. Normalizing word counts is such a crucial point in the process and you don’t actually mention it, that it has me worried.”

Now I was worried, too. I definitely had not normalized the word counts. I guess I could figure out how to create a table with each word and its word count and then generate a frequency by dividing by the sum of all the words, but how would I then feed those frequencies back into the vectorizer pipeline? In the peaceful, dark hours of Insomnia O’Clock, I curled up with the documentation for scikit-learn, the Python library I used for the vectorizer, to see if it offered any better options.

And to my delight, it did! The TF-IDF vectorizer was there to save the day. Now, TF-IDF (term frequency - inverse document frequency, which tries to get at distinctive words in each text) wasn’t what I wanted – not yet. (We’ll get to that soon enough; it’s a very different method for evaluating similarity.) But you can’t spell TF-IDF without TF, and since TF is “term frequency”, it’s exactly the thing I was looking for!
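For reference, the by-hand normalization described above (dividing each word's count by the total) is small enough to sketch on its own. This is not the route the notebook takes, and the counts and function name here are made up for illustration; it also assumes each text has at least one counted word, and that the total only covers the words kept in the vector.

```python
def to_frequencies(counts):
    """Turn raw word counts into proportions that sum to 1."""
    total = sum(counts)
    return [count / total for count in counts]

# Hypothetical counts: same mix of words, but one text is 3x longer
short_text = [10, 4, 6]     # 20 counted words
long_text = [30, 12, 18]    # 60 counted words

short_freqs = to_frequencies(short_text)
long_freqs = to_frequencies(long_text)

print(short_freqs)
print(long_freqs)
```

Once the counts are turned into proportions, the two toy texts become indistinguishable, which is the whole point of normalizing for length.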

If using term frequency helps account for differences in length, I expected that running Euclidean distance on a matrix of word frequencies should look something like the cosine distance on a matrix of word counts, right? Let’s compare the first version and the normalized version comparing the full books using Euclidean distance!

Euclidean distance with word frequencies

Because we were in the directory with the chapter 2s, we need to go back to the directory with the full text.

filedir = '/Users/qad/Documents/dsc_corpus_clean'
os.chdir(filedir)

This time we’re using the TF-IDF vectorizer, with the “IDF” part turned off:

from sklearn.feature_extraction.text import TfidfVectorizer

# Use the glob library to create a list of file names, sorted alphabetically
# Alphabetical sorting will get us the books in numerical order
filenames = sorted(glob.glob("*.txt"))
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a TfidfVectorizer instance with the parameters you need
vectorizer = TfidfVectorizer(input="filename", stop_words=None, use_idf=False, norm=None, max_features=1000)
# Run the vectorizer on your list of filenames to create your wordcounts
# Use the toarray() function so that SciPy will accept the results
wordfreqs = vectorizer.fit_transform(filenames).toarray()
Note: See what happened here? I had to figure out a method to do something, where there wasn't an out-of-the-box solution I could just pull from a tutorial I was following. As a result, I thought about all the parameters and picked better ones, and did not throw out words shared by 70% of the corpus.
(What I also didn't know yet was that, in the process, I'd made another consequential mistake with the vectorizer, but I wouldn't discover that until later still.)

So that was good. But the surprise that followed wasn’t enough to make me suspicious about the parameters from the first time I ran the vectorizer.

I guess I’ve managed to be a walking case study in the point Mark was making about the dangers of just reusing things you find online without being very critical about everything that goes into them. But at least I’m a self-aware walking case study… even if it takes until the 11th hour.

euclidean_distances_freq = pd.DataFrame(squareform(pdist(wordfreqs, metric='euclidean')), index=filekeys, columns=filekeys)
euclidean_distances_freq
000c_the_summer_before 001c_kristys_great_idea 002c_claudia_and_the_phantom_phone_calls 003c_the_truth_about_stacey 004c_mary_anne_saves_the_day 005c_dawn_and_the_impossible_three 006c_kristys_big_day 007c_claudia_and_mean_jeanine 008c_boy_crazy_stacey 009c_the_ghost_at_dawns_house ... m36c_kristy_and_the_cat_burglar pc1c_staceys_book pc2c_claudias_book pc3c_dawns_book pc4c_mary_annes_book pc5c_kristys_book pc6c_abbys_book serr1c_logans_story serr2c_logan_bruno_boy_babysitter serr3c_shannons_story
000c_the_summer_before 0.000000 1688.674924 1585.480684 1505.149826 1414.203663 1475.682215 1565.950829 1772.344775 1769.620016 1681.382170 ... 1766.014722 1540.024675 1497.271852 1804.691940 1627.442165 1926.672520 2106.107785 1826.547289 1973.011911 1588.887347
001c_kristys_great_idea 1688.674924 0.000000 617.234153 612.441834 655.706489 671.600328 550.912879 590.251641 609.888514 663.587975 ... 735.306059 818.899872 720.864758 766.713767 778.387436 710.456191 880.771821 565.996466 655.877275 640.468578
002c_claudia_and_the_phantom_phone_calls 1585.480684 617.234153 0.000000 584.428781 544.864203 578.813441 618.004045 672.802348 633.221920 502.952284 ... 703.283016 767.949868 678.567609 729.702679 726.657416 805.578053 1032.753117 606.353032 688.671910 635.008661
003c_the_truth_about_stacey 1505.149826 612.441834 584.428781 0.000000 533.235408 659.113040 657.785679 690.512853 735.329178 734.152573 ... 846.965170 737.004749 748.245281 814.447666 789.385837 885.817701 1016.655300 732.601529 836.303773 708.062850
004c_mary_anne_saves_the_day 1414.203663 655.706489 544.864203 533.235408 0.000000 608.815243 661.913892 660.938726 771.198418 718.149010 ... 857.289916 717.803594 676.698604 789.976582 693.593541 891.163846 1067.084814 725.124127 849.795269 635.373119
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
pc5c_kristys_book 1926.672520 710.456191 805.578053 885.817701 891.163846 956.911177 761.767025 755.255586 651.803651 718.507481 ... 739.081186 691.876434 665.522351 477.403393 559.309396 0.000000 599.125196 525.301818 522.293979 646.618899
pc6c_abbys_book 2106.107785 880.771821 1032.753117 1016.655300 1067.084814 1113.974865 928.422318 847.522271 821.579576 921.745084 ... 962.580906 891.654642 885.428145 677.029541 841.609173 599.125196 0.000000 745.013423 743.158126 787.810256
serr1c_logans_story 1826.547289 565.996466 606.353032 732.601529 725.124127 765.571029 629.125584 642.606411 475.892845 546.162064 ... 678.774631 755.434312 643.962732 587.903053 661.846659 525.301818 745.013423 0.000000 351.378713 573.656692
serr2c_logan_bruno_boy_babysitter 1973.011911 655.877275 688.671910 836.303773 849.795269 878.529453 754.179024 704.905667 580.464469 629.216974 ... 695.925283 834.915565 784.807620 641.763975 701.231773 522.293979 743.158126 351.378713 0.000000 695.635681
serr3c_shannons_story 1588.887347 640.468578 635.008661 708.062850 635.373119 676.266959 564.208295 651.099839 635.235389 574.436245 ... 803.830206 605.439510 475.243096 526.231888 644.810825 646.618899 787.810256 573.656692 695.635681 0.000000

192 rows × 192 columns

euclidean_distances_freq.to_csv('euclidean_distances_freq.csv')
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(euclidean_distances_freq)
#Displays the image
plt.show()

Euclidean distance using term frequency

Oh.

Once you normalize for length, all the Baby-Sitters Club books look… mostly the same. Even with Euclidean distance. So what am I even going to get for Cosine distance using term frequencies?

Cosine distance with word frequencies

We’ve already used the TF-IDF vectorizer, so now we just need to do a different distance calculation.

cosine_distances_freq = pd.DataFrame(squareform(pdist(wordfreqs, metric='cosine')), index=filekeys, columns=filekeys)
cosine_distances_freq
000c_the_summer_before 001c_kristys_great_idea 002c_claudia_and_the_phantom_phone_calls 003c_the_truth_about_stacey 004c_mary_anne_saves_the_day 005c_dawn_and_the_impossible_three 006c_kristys_big_day 007c_claudia_and_mean_jeanine 008c_boy_crazy_stacey 009c_the_ghost_at_dawns_house ... m36c_kristy_and_the_cat_burglar pc1c_staceys_book pc2c_claudias_book pc3c_dawns_book pc4c_mary_annes_book pc5c_kristys_book pc6c_abbys_book serr1c_logans_story serr2c_logan_bruno_boy_babysitter serr3c_shannons_story
000c_the_summer_before 0.000000 0.044943 0.049422 0.046609 0.035678 0.046631 0.040590 0.056732 0.047167 0.050052 ... 0.070188 0.041035 0.021208 0.042741 0.034981 0.034575 0.069217 0.035023 0.049642 0.029316
001c_kristys_great_idea 0.044943 0.000000 0.036930 0.030929 0.033798 0.035493 0.030381 0.039936 0.042907 0.048608 ... 0.059447 0.067061 0.055227 0.068699 0.067711 0.052707 0.084284 0.034375 0.042900 0.045442
002c_claudia_and_the_phantom_phone_calls 0.049422 0.036930 0.000000 0.031004 0.025655 0.028958 0.037465 0.043617 0.036552 0.024306 ... 0.049127 0.057083 0.045569 0.048338 0.052466 0.049403 0.094011 0.025650 0.028017 0.039682
003c_the_truth_about_stacey 0.046609 0.030929 0.031004 0.000000 0.024914 0.038010 0.039380 0.039519 0.043885 0.048400 ... 0.066079 0.049764 0.051218 0.053225 0.056318 0.051572 0.071363 0.034661 0.041751 0.044552
004c_mary_anne_saves_the_day 0.035678 0.033798 0.025655 0.024914 0.000000 0.032055 0.038245 0.032416 0.045906 0.043641 ... 0.065060 0.045676 0.039438 0.044694 0.039605 0.046447 0.076709 0.028795 0.038371 0.032554
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
pc5c_kristys_book 0.034575 0.052707 0.049403 0.051572 0.046447 0.059819 0.045754 0.066528 0.051282 0.049524 ... 0.052561 0.027891 0.033378 0.028447 0.024882 0.000000 0.058998 0.039247 0.043466 0.037280
pc6c_abbys_book 0.069217 0.084284 0.094011 0.071363 0.076709 0.088233 0.073567 0.082781 0.083394 0.088067 ... 0.097452 0.058017 0.069955 0.059107 0.071658 0.058998 0.000000 0.080784 0.091230 0.057229
serr1c_logans_story 0.035023 0.034375 0.025650 0.034661 0.028795 0.034767 0.032406 0.049176 0.027702 0.028677 ... 0.048107 0.049055 0.038349 0.046504 0.047611 0.039247 0.080784 0.000000 0.016501 0.032945
serr2c_logan_bruno_boy_babysitter 0.049642 0.042900 0.028017 0.041751 0.038371 0.043324 0.044590 0.056601 0.039091 0.033917 ... 0.044664 0.055314 0.055859 0.055186 0.048879 0.043466 0.091230 0.016501 0.000000 0.046193
serr3c_shannons_story 0.029316 0.045442 0.039682 0.044552 0.032554 0.037420 0.032265 0.047409 0.045042 0.035973 ... 0.070188 0.035800 0.023650 0.028882 0.045738 0.037280 0.057229 0.032945 0.046193 0.000000

192 rows × 192 columns

cosine_distances_freq.to_csv('cosine_distances_freq.csv')
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(cosine_distances_freq)
#Displays the image
plt.show()

Cosine distance using term frequency

We’ve gone from Mary Anne Plaid to a sort of Claudia Eggplant. Could that be right? Is most of the difference really attributable to length? Even the clear-as-day California Diaries cluster has mostly washed out, except for those shining lights of difference: Ducky, and to a lesser extent, Amalia. (I guess after normalizing for length, what really makes a difference in this corpus is East Coast people vs. West Coast people… and Dawn has assimilated to Connecticut more than she realizes.)

This is something that we can check pretty easily! We already wrote up some code to do word counts for all the books. Are the books that stood out before, and have now disappeared into the purple morass, particularly long or short? That does turn out to be the answer with the California Diaries cluster: all of them are shorter than your average BSC book. And it’s also the answer with Abby’s Portrait Collection looking different than the other Portrait Collection books, coming in at only 78% of the length of Stacey’s Portrait Collection book.

Note: Remember, I didn't realize it at the time, but there were two things that this variant was accounting for: text length, and also not throwing out words that 70% of the books have in common, which includes important things in this corpus like character names!
Or, at least, I thought there were two things this variant was accounting for...

So what happens when we look at cosine distance for the chapter 2’s?

ch2dir = '/Users/qad/Documents/dsc_chapters/ch2'
os.chdir(ch2dir)
# Use the glob library to create a list of file names, sorted alphabetically
# Alphabetical sorting will get us the books in numerical order
filenames = sorted(glob.glob("*.txt"))
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a TfidfVectorizer instance with the parameters you need
vectorizer = TfidfVectorizer(input="filename", stop_words=None, use_idf=False, norm=None, max_features=1000)
# Run the vectorizer on your list of filenames to create your wordcounts
# Use the toarray() function so that SciPy will accept the results
ch2freqs = vectorizer.fit_transform(filenames).toarray()
ch2_cosine_freq = pd.DataFrame(squareform(pdist(ch2freqs, metric='cosine')), index=filekeys, columns=filekeys)
ch2_cosine_freq
001c_kristys_great_idea_2 002c_claudia_and_the_phantom_phone_calls_2 003c_the_truth_about_stacey_2 004c_mary_anne_saves_the_day_2 005c_dawn_and_the_impossible_three_2 006c_kristys_big_day_2 007c_claudia_and_mean_jeanine_2 008c_boy_crazy_stacey_2 009c_the_ghost_at_dawns_house_2 010c_logan_likes_mary_anne_2 ... m30c_kristy_and_the_mystery_train_2 m31c_mary_anne_and_the_music_box_secret_2 m32c_claudia_and_the_mystery_in_the_painting_2 m33c_stacey_and_the_stolen_hearts_2 m34c_mary_anne_and_the_haunted_bookstore_2 m35c_abby_and_the_notorius_neighbor_2 m36c_kristy_and_the_cat_burglar_2 serr1c_logans_story_2 serr2c_logan_bruno_boy_babysitter_2 serr3c_shannons_story_2
001c_kristys_great_idea_2 0.000000 0.206412 0.113674 0.150443 0.125802 0.115092 0.178066 0.125936 0.190399 0.198563 ... 0.164629 0.197586 0.143086 0.180067 0.163321 0.125262 0.268084 0.156328 0.185042 0.189323
002c_claudia_and_the_phantom_phone_calls_2 0.206412 0.000000 0.163208 0.160833 0.114802 0.183622 0.139114 0.192687 0.173620 0.183036 ... 0.156327 0.185861 0.126021 0.147126 0.148118 0.128377 0.231286 0.142476 0.126149 0.187776
003c_the_truth_about_stacey_2 0.113674 0.163208 0.000000 0.114385 0.121479 0.112804 0.147225 0.132875 0.125324 0.127333 ... 0.147476 0.167143 0.114144 0.135722 0.126646 0.092395 0.242669 0.119585 0.146606 0.138360
004c_mary_anne_saves_the_day_2 0.150443 0.160833 0.114385 0.000000 0.155826 0.149403 0.182614 0.167282 0.134859 0.120849 ... 0.198136 0.212283 0.163036 0.188255 0.181516 0.154926 0.225048 0.163295 0.201479 0.161856
005c_dawn_and_the_impossible_three_2 0.125802 0.114802 0.121479 0.155826 0.000000 0.150850 0.126401 0.131118 0.193453 0.204176 ... 0.102185 0.143733 0.091190 0.107729 0.119510 0.085736 0.265262 0.107738 0.115631 0.176527
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
m35c_abby_and_the_notorius_neighbor_2 0.125262 0.128377 0.092395 0.154926 0.085736 0.138655 0.133336 0.147063 0.147175 0.161961 ... 0.082940 0.081991 0.058391 0.065696 0.057936 0.000000 0.268627 0.072937 0.067835 0.145117
m36c_kristy_and_the_cat_burglar_2 0.268084 0.231286 0.242669 0.225048 0.265262 0.218984 0.281148 0.268710 0.251897 0.240916 ... 0.327963 0.336943 0.295650 0.312225 0.284412 0.268627 0.000000 0.286544 0.303065 0.267370
serr1c_logans_story_2 0.156328 0.142476 0.119585 0.163295 0.107738 0.182890 0.193107 0.166469 0.180923 0.192775 ... 0.087625 0.108703 0.072981 0.086351 0.116175 0.072937 0.286544 0.000000 0.072080 0.159806
serr2c_logan_bruno_boy_babysitter_2 0.185042 0.126149 0.146606 0.201479 0.115631 0.177894 0.179056 0.187980 0.172765 0.208924 ... 0.081543 0.079132 0.062328 0.081043 0.072673 0.067835 0.303065 0.072080 0.000000 0.158507
serr3c_shannons_story_2 0.189323 0.187776 0.138360 0.161856 0.176527 0.138519 0.149076 0.185551 0.149572 0.160927 ... 0.153772 0.169683 0.146668 0.156993 0.151115 0.145117 0.267370 0.159806 0.158507 0.000000

167 rows × 167 columns

ch2_cosine_freq.to_csv('ch2_cosine_freq.csv')
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(ch2_cosine_freq)
#Displays the image
plt.show()

Cosine distance for chapter 2 using term frequency

Now wait a minute!! Why on earth do the full books look so much more similar than the chapter 2’s?! We know the chapter 2’s are more similar than the full books! WTF is going wrong?!

I was so irked at the direction this had gone that I entirely forgot about the typical mutual inquiry about well-being and all those social conventions at my next meeting with Mark. The first words out of my mouth, flying forth as soon as his audio connected on Zoom, were, “I tried to normalize the word counts and now the novels are more similar than the chapter 2’s WHAT IS EVEN GOING ON HERE?!?!”

And then I remembered– as Kristy’s teacher, Mr. Redmont, would put it– “decorum”, and managed to collect myself. “Also, hello! How are you?”

Mark was gracious and generous, as always. “I’m interested! Tell me more!”

So I showed him, grumbling and annoyed as I pulled up the code and data. Mark thought about it. “I think you’re really comparing apples to oranges here. Changing word counts to word frequencies helps when your texts are different lengths, but, say, within an order of magnitude.” I stared, quizzically, into my laptop’s video camera. “So what I think is happening with your chapter 2’s is that they’re short enough that the difference between 10 and 13 instances of the word ‘the’ is going to make them look more ‘different’. And the same thing for every other word. With the end result being that the chapter 2’s look more different. Across the entirety of a novel, though, small differences in word frequencies even out. So the novels end up looking more similar.”

“Wait, so, there’s no way to compare chapters vs. whole books?” I asked.

“You could do that,” said Mark. “What you’d need to do is sample a chapter-2’s length of text from the set of all the words in a whole book. And then use that sample as the point of comparison.”

“Wait, what? If you randomly grab, say, 2,500 words from a novel, you’d be comparing chapter 2 vs. a text that doesn’t make any sense!”

Mark shrugged. “I mean, you could generate a text of chapter 2 length using a Markov chain if that would make you feel better,” he said, referencing a text-generation model where the probability of each word occurring depends only on the previous word generated. It’d probably have basically the same effect overall, but would be likely to make more sense to the human reader.
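Here is a minimal sketch of the random-sampling idea Mark describes. It assumes the book is already loaded as a string and that splitting on whitespace is a good-enough way to get words. The function name `sample_words` and the toy text are made up for illustration; none of this comes from the notebook I actually ran.

```python
import random

def sample_words(text, n_words, seed=None):
    """Return n_words randomly chosen words from text, without replacement."""
    words = text.split()
    if n_words > len(words):
        raise ValueError("text is shorter than the requested sample")
    # A seeded generator makes the sample repeatable
    rng = random.Random(seed)
    return rng.sample(words, n_words)

# Hypothetical stand-in for the full text of a novel (4,500 words)
book_text = "the club met at claudia's house on monday afternoon " * 500

# Grab a chapter-2-sized sample of 2,500 words
pseudo_chapter = sample_words(book_text, 2500, seed=8)
print(len(pseudo_chapter))
```

The result is a jumble of words in no readable order, which is exactly what I was objecting to. It keeps roughly the same word proportions as the book, and that's the only thing the vectorizer looks at.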

But that seemed like a task for a future DSC book. For now, though, a better point of comparison would be comparing how similar the chapter 2’s were, vs. other chapters, just like what we’d done earlier for cosine distance using word counts:

Cosine distance using word frequency for chapters 1, 5, 9, and 15

And clearly, even though the chapters are less similar than the books overall using this metric, the chapter 2’s are much more similar than other sets of chapters. So we’ve found the same overall result, but we’ve also saved ourselves from chasing false leads – like the “difference” in Abby’s Portrait Collection book that only really has to do with text length. Not everything is as purple as everything else in this visualization, and there are still things we can follow up on. But we’ve leveled out the differences that are just about difference in length.

I think we’ve said all we can say about Euclidean and Cosine distance for this book, and how the results you get vary depending on how you count (or ratio) your words. It’s time to move on to a different method.

Slow down, Quinn: Before moving on to the next text comparison method, it's important to wrap up some loose ends. We wanted to differentiate the effect of the TF-IDF vectorizer from the effect of no longer using the `max_df` setting to drop terms that appear in 70% of texts. So let's compare three visualizations, all showing Euclidean distance, but with different vectorizer settings. On the left, the count vectorizer that we used when we first ran Euclidean distance, which drops the terms that appear in 70% of the texts. In the middle, the TF-IDF vectorizer that should get us term frequencies instead of counts, and thereby normalize for length. And on the right, the TF-IDF vectorizer without dropping any terms.
Now wait just a minute here.

Why do the count vectorizer and TF-IDF vectorizer results look identical? Are they actually identical? Shouldn't dropping common words make it even more important to use word frequencies?

This was bad news.

I was already up past midnight trying to get this Data-Sitters Club book ready for publication, and as an insomniac morning person, that was never a good thing. This was a huge roadblock. I couldn't publish this book without figuring out what was going on.

I re-ran the code again and again, ditching the visualization and comparing the numbers in the table. Every single time, the numbers were identical, regardless of which vectorizer I used or what max_df value I used.
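For what it's worth, this kind of check doesn't have to be done by eye. Below is a minimal sketch that assumes the two distance tables are pandas DataFrames or NumPy arrays of the same shape. The function name `tables_match` and the toy arrays are hypothetical, not something from my original notebook.

```python
import numpy as np

def tables_match(a, b, tol=1e-9):
    """Return (equal within tol?, largest absolute difference) for two tables."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        # Differently shaped tables can't be compared cell by cell
        return False, None
    max_gap = float(np.abs(a - b).max())
    return max_gap <= tol, max_gap

# Toy 2x2 distance tables standing in for two runs of the notebook
first_run = np.array([[0.0, 617.234153], [617.234153, 0.0]])
second_run = first_run.copy()

print(tables_match(first_run, second_run))
```

Seeing the largest gap as a number, and not a yes-or-no answer, helps separate "truly identical" from "differs somewhere past the sixth decimal place that the table display rounds off".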

I spent the early morning insomnia hours desperately Googling, and scouring the scikit-learn documentation. I couldn't find anyone else having this problem, and I was completely stumped.

It was time to throw myself on the mercy of DH Python Twitter.

DH Python Twitter is a thing.

I've been surprised at how often it's worked out that I complain about something involving coding (usually Python, but sometimes other tools) on Twitter and someone will show up and help me solve it. Sometimes it's someone I know, sometimes it's a random person who works on data science, machine learning, or just knows a lot of Python. It feels like a kind of positive, helpful inverse of mansplaining: instead of guys showing up to talk over me and explain things I already know, they show up, listen to the problem I'm having, and help me understand what's going on. (I mean, sometimes they show up and don't read the question and suggest something I and any other reasonable person would've already tried first, but I've gotten lucky with more helpful replies than not.)

Part of it is definitely the privilege of my weird job -- there's no professional risk for me in publicly not-knowing things. That's not the case for a lot of people. But since I can do this, I do, with the hope that other people who don't know can follow along and learn, too.

A lot of the Data-Sitters Club is active on Twitter, and if you're trying to do something from one of our books and you've got a question, please don't feel weird about tagging us and asking, if you're comfortable! People who write DH tutorials and stuff are generally really happy to see that people are using their work, and often don't mind helping you debug it. And that's what saved the day this time.

Closing the narrative loop

I was so relieved when Zoe LeBlanc offered to take a look at my code. She's my favorite non-English DH developer-turned-tenure-track faculty. As luck would have it, she was meeting with John R. Ladd that afternoon... the same John R. Ladd who'd written the Programming Historian tutorial from which I copied the code that triggered this whole subplot! And he also offered to help!

And that's how I found myself meeting with Zoe and John, which felt like an apt conclusion to this strange computational subplot.

As soon as he took a look at my code, John knew the answer.

"Everything here looks great-- the only problem is you told it not to normalize," he said.

I gaped. "Wait, what? I told it to use the TF-IDF vectorizer. I mean, I read all the scikit-learn documentation on normalization and I was pretty sure I didn't want it to do... whatever it was exactly that the normalization parameter did? I just wanted term frequencies."

John shook his head sympathetically. "Yeah, the scikit-learn documentation really doesn't help sometimes. This happened to me a couple years ago when I was teaching a workshop on text comparison using scikit-learn. People were concerned about normalization, and I couldn't figure out how to make it work with scikit-learn, and it made me wonder if it was the right package for the job. But here's how normalization works with the TF-IDF vectorizer: if you set it to 'l1', you get relative frequencies. What it does is make the sum (of absolute values, but we don't have any negative word counts here) of all the features (word counts) add up to 1. Now, l2 is the standard machine learning normalization for text analysis. L2 normalization makes it so that the sum of the *squares* of features is equal to 1. This better accounts for outliers. It basically uses the Pythagorean theorem to normalize the vectors."

So there you have it. If your middle-school-age kid ever complains about having to learn the Pythagorean theorem, and refuses to believe it has any real-world utility, you can tell them that it's really important for machine learning.

John wasn't kidding about the scikit-learn documentation not helping, though; I don't think I would have ever understood that "‘l1’: Sum of absolute values of vector elements is 1." would mean "turns counts into frequencies".
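To make John's explanation concrete, here is a small sketch of what the two kinds of normalization do to a single row of word counts. It uses plain Python instead of scikit-learn so the arithmetic is visible. The counts are made-up numbers, and this illustrates the math he described; it is not a copy of what the library does internally.

```python
import math

# Toy word counts for one text (made-up numbers)
counts = [3, 1, 0, 4]

# L1: divide each count by the sum of absolute values,
# which turns counts into relative frequencies
l1_total = sum(abs(c) for c in counts)
l1 = [c / l1_total for c in counts]

# L2: divide each count by the square root of the sum of squares,
# i.e. the length of the vector (hello, Pythagorean theorem)
l2_length = math.sqrt(sum(c * c for c in counts))
l2 = [c / l2_length for c in counts]

print(l1, sum(l1))                  # the values add up to 1
print(l2, sum(x * x for x in l2))   # the squares add up to 1 (give or take float rounding)
```

With `norm=None`, neither division happens, so every text keeps its raw counts and longer books still sit "further out" just because they have more words.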

Word frequencies... now with actual word frequencies!

Thanks to John and Zoe, I knew how to change my code to actually get what I was aiming for. Let's look at what real word frequencies look like, compared to just not throwing out common shared words, like it turns out we just did, above.

filedir = '/Users/qad/Documents/dsc_corpus_clean'
os.chdir(filedir)
# Use the glob library to create a list of file names, sorted alphabetically
# Alphabetical sorting will get us the books in numerical order
filenames = sorted(glob.glob("*.txt"))
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a TfidfVectorizer instance with the parameters you need
# Like, actually, the parameters you need, including not disabling normalization
vectorizer = TfidfVectorizer(input="filename", stop_words=None, use_idf=False, norm='l1', max_features=1000)
# Run the vectorizer on your list of filenames to create your word frequencies
# Use the toarray() function so that SciPy will accept the results
wordfreqs4real = vectorizer.fit_transform(filenames).toarray()

Euclidean distance with real word frequencies

euclidean_distances_freq = pd.DataFrame(squareform(pdist(wordfreqs4real, metric='euclidean')), index=filekeys, columns=filekeys)
euclidean_distances_freq
000c_the_summer_before 001c_kristys_great_idea 002c_claudia_and_the_phantom_phone_calls 003c_the_truth_about_stacey 004c_mary_anne_saves_the_day 005c_dawn_and_the_impossible_three 006c_kristys_big_day 007c_claudia_and_mean_jeanine 008c_boy_crazy_stacey 009c_the_ghost_at_dawns_house ... m36c_kristy_and_the_cat_burglar pc1c_staceys_book pc2c_claudias_book pc3c_dawns_book pc4c_mary_annes_book pc5c_kristys_book pc6c_abbys_book serr1c_logans_story serr2c_logan_bruno_boy_babysitter serr3c_shannons_story
000c_the_summer_before 0.000000 0.030808 0.031985 0.030986 0.027231 0.031000 0.029166 0.034007 0.031367 0.032735 ... 0.038119 0.030155 0.021441 0.030842 0.027550 0.026789 0.037886 0.026910 0.031911 0.025035
001c_kristys_great_idea 0.030808 0.000000 0.027510 0.023543 0.024813 0.026684 0.025889 0.026987 0.029894 0.032758 ... 0.034351 0.038860 0.034875 0.039416 0.038349 0.032537 0.040630 0.025338 0.027749 0.031696
002c_claudia_and_the_phantom_phone_calls 0.031985 0.027510 0.000000 0.025008 0.022770 0.024174 0.027839 0.029445 0.027406 0.022852 ... 0.031613 0.035421 0.031248 0.032757 0.033589 0.031729 0.043777 0.022701 0.023801 0.029014
003c_the_truth_about_stacey 0.030986 0.023543 0.025008 0.000000 0.021406 0.027317 0.028574 0.026989 0.029887 0.032247 ... 0.036090 0.033610 0.033296 0.034769 0.034934 0.032044 0.037542 0.025444 0.027613 0.030941
004c_mary_anne_saves_the_day 0.027231 0.024813 0.022770 0.021406 0.000000 0.025119 0.028106 0.024534 0.030501 0.030639 ... 0.035852 0.032204 0.029418 0.032017 0.029616 0.030447 0.038945 0.023252 0.026593 0.026670
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
pc5c_kristys_book 0.026789 0.032537 0.031729 0.032044 0.030447 0.034790 0.030786 0.036312 0.032493 0.032451 ... 0.032744 0.025048 0.026815 0.025376 0.023313 0.000000 0.034727 0.028040 0.029492 0.028142
pc6c_abbys_book 0.037886 0.040630 0.043777 0.037542 0.038945 0.042253 0.039032 0.040453 0.041438 0.043190 ... 0.044594 0.035709 0.038642 0.036123 0.039200 0.034727 0.000000 0.040054 0.042367 0.034806
serr1c_logans_story 0.026910 0.025338 0.022701 0.025444 0.023252 0.026149 0.025888 0.030399 0.023851 0.025080 ... 0.030947 0.033119 0.028898 0.032429 0.032124 0.028040 0.040054 0.000000 0.017608 0.026663
serr2c_logan_bruno_boy_babysitter 0.031911 0.027749 0.023801 0.027613 0.026593 0.029112 0.030279 0.032335 0.028262 0.027364 ... 0.029828 0.035202 0.034630 0.035292 0.032691 0.029492 0.042367 0.017608 0.000000 0.031427
serr3c_shannons_story 0.025035 0.031696 0.029014 0.030941 0.026670 0.028176 0.026253 0.031740 0.030955 0.027921 ... 0.038499 0.028237 0.022698 0.025445 0.031618 0.028142 0.034806 0.026663 0.031427 0.000000

192 rows × 192 columns

euclidean_distances_freq.to_csv('euclidean_distances_freq.csv')
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(euclidean_distances_freq)
#Displays the image
plt.show()

Interesting! Similar to what I had before, without the word frequency normalization, but a little lighter in color, meaning less similar. Which sounds better to me, knowing the corpus? Let’s see how cosine distance plays out.

Cosine distance with word frequencies

cosine_distances_freq = pd.DataFrame(squareform(pdist(wordfreqs4real, metric='cosine')), index=filekeys, columns=filekeys)
cosine_distances_freq
000c_the_summer_before 001c_kristys_great_idea 002c_claudia_and_the_phantom_phone_calls 003c_the_truth_about_stacey 004c_mary_anne_saves_the_day 005c_dawn_and_the_impossible_three 006c_kristys_big_day 007c_claudia_and_mean_jeanine 008c_boy_crazy_stacey 009c_the_ghost_at_dawns_house ... m36c_kristy_and_the_cat_burglar pc1c_staceys_book pc2c_claudias_book pc3c_dawns_book pc4c_mary_annes_book pc5c_kristys_book pc6c_abbys_book serr1c_logans_story serr2c_logan_bruno_boy_babysitter serr3c_shannons_story
000c_the_summer_before 0.000000 0.044943 0.049422 0.046609 0.035678 0.046631 0.040590 0.056732 0.047167 0.050052 ... 0.070188 0.041035 0.021208 0.042741 0.034981 0.034575 0.069217 0.035023 0.049642 0.029316
001c_kristys_great_idea 0.044943 0.000000 0.036930 0.030929 0.033798 0.035493 0.030381 0.039936 0.042907 0.048608 ... 0.059447 0.067061 0.055227 0.068699 0.067711 0.052707 0.084284 0.034375 0.042900 0.045442
002c_claudia_and_the_phantom_phone_calls 0.049422 0.036930 0.000000 0.031004 0.025655 0.028958 0.037465 0.043617 0.036552 0.024306 ... 0.049127 0.057083 0.045569 0.048338 0.052466 0.049403 0.094011 0.025650 0.028017 0.039682
003c_the_truth_about_stacey 0.046609 0.030929 0.031004 0.000000 0.024914 0.038010 0.039380 0.039519 0.043885 0.048400 ... 0.066079 0.049764 0.051218 0.053225 0.056318 0.051572 0.071363 0.034661 0.041751 0.044552
004c_mary_anne_saves_the_day 0.035678 0.033798 0.025655 0.024914 0.000000 0.032055 0.038245 0.032416 0.045906 0.043641 ... 0.065060 0.045676 0.039438 0.044694 0.039605 0.046447 0.076709 0.028795 0.038371 0.032554
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
pc5c_kristys_book 0.034575 0.052707 0.049403 0.051572 0.046447 0.059819 0.045754 0.066528 0.051282 0.049524 ... 0.052561 0.027891 0.033378 0.028447 0.024882 0.000000 0.058998 0.039247 0.043466 0.037280
pc6c_abbys_book 0.069217 0.084284 0.094011 0.071363 0.076709 0.088233 0.073567 0.082781 0.083394 0.088067 ... 0.097452 0.058017 0.069955 0.059107 0.071658 0.058998 0.000000 0.080784 0.091230 0.057229
serr1c_logans_story 0.035023 0.034375 0.025650 0.034661 0.028795 0.034767 0.032406 0.049176 0.027702 0.028677 ... 0.048107 0.049055 0.038349 0.046504 0.047611 0.039247 0.080784 0.000000 0.016501 0.032945
serr2c_logan_bruno_boy_babysitter 0.049642 0.042900 0.028017 0.041751 0.038371 0.043324 0.044590 0.056601 0.039091 0.033917 ... 0.044664 0.055314 0.055859 0.055186 0.048879 0.043466 0.091230 0.016501 0.000000 0.046193
serr3c_shannons_story 0.029316 0.045442 0.039682 0.044552 0.032554 0.037420 0.032265 0.047409 0.045042 0.035973 ... 0.070188 0.035800 0.023650 0.028882 0.045738 0.037280 0.057229 0.032945 0.046193 0.000000

192 rows × 192 columns

cosine_distances_freq.to_csv('cosine_distances_freq.csv')
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(cosine_distances_freq)
#Displays the image
plt.show()

Very similar! Honestly, there's less difference between cosine distance with word counts and cosine distance with word frequencies... which makes sense, because the cosine distance measure already helps account for different text lengths, at least up to a certain point. Let's try cosine distance on the chapter 2's.
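In fact, the cosine table above matches the earlier cosine table digit for digit, and there is a reason. Cosine distance only looks at the angle between two vectors, so dividing a text's counts by its total word count doesn't change it at all. Here is a small sketch with toy counts (the numbers and function names are made up for illustration):

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def to_frequencies(counts):
    """L1 normalization: counts divided by their total."""
    total = sum(counts)
    return [c / total for c in counts]

# Toy counts for a short text and a much longer one
short_text = [10, 4, 1]
long_text = [95, 45, 12]

# Cosine distance is the same before and after normalizing (up to float rounding)
print(cosine_distance(short_text, long_text))
print(cosine_distance(to_frequencies(short_text), to_frequencies(long_text)))

# Euclidean distance changes a lot
print(euclidean_distance(short_text, long_text))
print(euclidean_distance(to_frequencies(short_text), to_frequencies(long_text)))
```

So switching from counts to frequencies matters for Euclidean distance, but for cosine distance it is a no-op.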

Cosine distance with chapter 2’s

ch2dir = '/Users/qad/Documents/dsc_chapters/ch2'
os.chdir(ch2dir)
# Use the glob library to create a list of file names, sorted alphabetically
# Alphabetical sorting will get us the books in numerical order
filenames = sorted(glob.glob("*.txt"))
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a TfidfVectorizer instance with the parameters you need
# Like, actually, the parameters you need, including not disabling normalization
vectorizer = TfidfVectorizer(input="filename", stop_words=None, use_idf=False, norm='l1', max_features=1000)
# Run the vectorizer on your list of filenames to create your word frequencies
# Use the toarray() function so that SciPy will accept the results
ch2freqs4real = vectorizer.fit_transform(filenames).toarray()
ch2_cosine_freq = pd.DataFrame(squareform(pdist(ch2freqs4real, metric='cosine')), index=filekeys, columns=filekeys)
ch2_cosine_freq
001c_kristys_great_idea_2 002c_claudia_and_the_phantom_phone_calls_2 003c_the_truth_about_stacey_2 004c_mary_anne_saves_the_day_2 005c_dawn_and_the_impossible_three_2 006c_kristys_big_day_2 007c_claudia_and_mean_jeanine_2 008c_boy_crazy_stacey_2 009c_the_ghost_at_dawns_house_2 010c_logan_likes_mary_anne_2 ... m30c_kristy_and_the_mystery_train_2 m31c_mary_anne_and_the_music_box_secret_2 m32c_claudia_and_the_mystery_in_the_painting_2 m33c_stacey_and_the_stolen_hearts_2 m34c_mary_anne_and_the_haunted_bookstore_2 m35c_abby_and_the_notorius_neighbor_2 m36c_kristy_and_the_cat_burglar_2 serr1c_logans_story_2 serr2c_logan_bruno_boy_babysitter_2 serr3c_shannons_story_2
001c_kristys_great_idea_2 0.000000 0.206412 0.113674 0.150443 0.125802 0.115092 0.178066 0.125936 0.190399 0.198563 ... 0.164629 0.197586 0.143086 0.180067 0.163321 0.125262 0.268084 0.156328 0.185042 0.189323
002c_claudia_and_the_phantom_phone_calls_2 0.206412 0.000000 0.163208 0.160833 0.114802 0.183622 0.139114 0.192687 0.173620 0.183036 ... 0.156327 0.185861 0.126021 0.147126 0.148118 0.128377 0.231286 0.142476 0.126149 0.187776
003c_the_truth_about_stacey_2 0.113674 0.163208 0.000000 0.114385 0.121479 0.112804 0.147225 0.132875 0.125324 0.127333 ... 0.147476 0.167143 0.114144 0.135722 0.126646 0.092395 0.242669 0.119585 0.146606 0.138360
004c_mary_anne_saves_the_day_2 0.150443 0.160833 0.114385 0.000000 0.155826 0.149403 0.182614 0.167282 0.134859 0.120849 ... 0.198136 0.212283 0.163036 0.188255 0.181516 0.154926 0.225048 0.163295 0.201479 0.161856
005c_dawn_and_the_impossible_three_2 0.125802 0.114802 0.121479 0.155826 0.000000 0.150850 0.126401 0.131118 0.193453 0.204176 ... 0.102185 0.143733 0.091190 0.107729 0.119510 0.085736 0.265262 0.107738 0.115631 0.176527
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
m35c_abby_and_the_notorius_neighbor_2 0.125262 0.128377 0.092395 0.154926 0.085736 0.138655 0.133336 0.147063 0.147175 0.161961 ... 0.082940 0.081991 0.058391 0.065696 0.057936 0.000000 0.268627 0.072937 0.067835 0.145117
m36c_kristy_and_the_cat_burglar_2 0.268084 0.231286 0.242669 0.225048 0.265262 0.218984 0.281148 0.268710 0.251897 0.240916 ... 0.327963 0.336943 0.295650 0.312225 0.284412 0.268627 0.000000 0.286544 0.303065 0.267370
serr1c_logans_story_2 0.156328 0.142476 0.119585 0.163295 0.107738 0.182890 0.193107 0.166469 0.180923 0.192775 ... 0.087625 0.108703 0.072981 0.086351 0.116175 0.072937 0.286544 0.000000 0.072080 0.159806
serr2c_logan_bruno_boy_babysitter_2 0.185042 0.126149 0.146606 0.201479 0.115631 0.177894 0.179056 0.187980 0.172765 0.208924 ... 0.081543 0.079132 0.062328 0.081043 0.072673 0.067835 0.303065 0.072080 0.000000 0.158507
serr3c_shannons_story_2 0.189323 0.187776 0.138360 0.161856 0.176527 0.138519 0.149076 0.185551 0.149572 0.160927 ... 0.153772 0.169683 0.146668 0.156993 0.151115 0.145117 0.267370 0.159806 0.158507 0.000000

167 rows × 167 columns

ch2_cosine_freq.to_csv('ch2_cosine_freq.csv')
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(ch2_cosine_freq)
#Displays the image
plt.show()

It's largely the same as cosine distance using just word counts! With the same questions and disappointments with regard to the similarity of the chapter 2's, compared to the full books, when using cosine distance. We probably don't need to rerun this for chapters 1, 5, 9, and 15; you get the point.

But now we've found it using code that legitimately works, without any confusions or misunderstandings about what's happening (at least, I hope?). That's satisfying. A satisfying kind of dissatisfying.

Now we can move on to another method.

TF-IDF

As I mentioned before, TF-IDF stands for term frequency - inverse document frequency. TF-IDF tries to get at distinctive words. For each text, what are the words that set it apart from all the other texts you’re comparing against? To calculate TF-IDF, you don’t have to imagine 1000-dimensional space or anything like that. Term frequency is just how often the word occurs, divided by the total number of words in the text. Inverse document frequency is a way to reduce the importance of words that are high-frequency everywhere (like “the”) in order to surface the words that are high frequency in a particular text because they’re important. You calculate it using another concept from high school math: your old pal logarithm. The inverse document frequency for a word is: log_e(Total number of documents / Number of documents with term t in it).
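To make that arithmetic concrete, here's a minimal sketch in plain Python. The counts are made up for illustration (they aren't taken from the corpus), and the function names are just placeholders. One thing worth knowing: scikit-learn's TfidfVectorizer, with its default settings, uses a smoothed variant of the IDF formula, ln((1 + N) / (1 + df)) + 1, and with norm=None it multiplies that by the raw word count rather than the count divided by the text's length. So the scores it produces won't match the textbook formula exactly.

```python
import math

def term_frequency(count_in_text, total_words_in_text):
    """How often the word occurs, divided by the length of the text."""
    return count_in_text / total_words_in_text

def textbook_idf(n_docs, docs_with_term):
    """The formula from the paragraph above: natural log of (N / df)."""
    return math.log(n_docs / docs_with_term)

def smoothed_idf(n_docs, docs_with_term):
    """The smoothed variant scikit-learn uses by default: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + docs_with_term)) + 1

# Hypothetical numbers, for illustration only
n_docs = 200          # texts in the corpus
text_length = 25000   # words in one text
examples = {
    "everywhere_word": {"count": 1500, "docs_with_term": 200},  # behaves like "the"
    "rare_word": {"count": 60, "docs_with_term": 2},            # appears in only two texts
}

for word, info in examples.items():
    tf = term_frequency(info["count"], text_length)
    idf = textbook_idf(n_docs, info["docs_with_term"])
    print(word, "tf:", round(tf, 4), "idf:", round(idf, 4), "tf-idf:", round(tf * idf, 4))
```

A word that appears in every document gets a textbook IDF of log(1) = 0, so its TF-IDF score is zero no matter how often it occurs. The smoothed version bottoms out at 1 instead of 0, which is why very common words still get nonzero scores from scikit-learn.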

The TF-IDF calculation is inherently comparative: it doesn’t make sense to run it on just one text, if you’re looking for what’s unique about a text in relation to other texts. But the output we get from TF-IDF is a list of words and numerical values, which isn’t something we can use to visualize a comparison of the texts, the way we could with the output of the vectorizer we used to plot points in 1000-dimensional space. We can use the TF-IDF calculations for each word in our vectorizer instead of simple word counts, which will generate a different set of points for each text, and from there we can use Euclidean or Cosine distance. But before we go there, let’s take a look at what we get out of the TF-IDF calculation, using our full-text corpus (not just the chapter 2s).

The word “baby-sitters” is going to appear in most or all of the books (maybe not California Diaries). On the other hand, the word “Lowell” (the surname of the racist family in BSC #56: Keep Out, Claudia!) only occurs in two books: Keep Out, Claudia! and BSC #3: The Truth About Stacey (where “Lowell” actually refers to a different person, Lowell Johnston). Lowell Johnston is only mentioned twice in The Truth About Stacey, so it’s still not going to get a high TF-IDF score in that book (it comes in #103 with a score of 10.64). But in Keep Out, Claudia!, Lowell appears a lot, and that number isn’t scaled down much at all because it only occurs in two books. So it ends up getting the highest TF-IDF score for that book, 707.82. This is a large score, more similar to characters in “very special episodes” who appear in just one book, like Whitney (the girl with Down’s Syndrome who Dawn babysits in BSC #77: Dawn and Whitney, Friends Forever).

TF-IDF is one approach to getting at what a text is “about” – more straightforward to understand and faster to calculate than topic modeling. But especially working with a corpus of fiction, you’ll probably need to weed out the character names – either by pre-processing the text to remove them, or looking beyond the first few highest-scoring terms. (If anything, we’re getting fewer high-scoring character names than you’d expect in most fiction. The major characters occur frequently enough that they get weighted down, much like “the” and “is”.)

Let’s go back to the directory with the full texts:

filedir = '/Users/qad/Documents/dsc_corpus_clean'
os.chdir(filedir)
# Use the glob library to create a list of file names, sorted alphabetically
# Alphabetical sorting will get us the books in numerical order
filenames = sorted(glob.glob("*.txt"))
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a TfidfVectorizer instance with the parameters you need
vectorizer = TfidfVectorizer(input="filename", stop_words=None, use_idf=True, norm=None, max_features=1000)
# Run the vectorizer on your list of filenames to create your TF-IDF scores
# Use the toarray() function so that SciPy will accept the results
transformed_documents = vectorizer.fit_transform(filenames)
transformed_documents_as_array = transformed_documents.toarray()

The code from the Programming Historian tutorial generates a CSV file for each text, showing the TF-IDF value of each word. (You can find all these CSV files in the GitHub repo for this book.)

# construct a list of output file paths by swapping the .txt extension of each text file for .csv
output_filenames = [str(txt_file).replace(".txt", ".csv") for txt_file in filenames]

# loop each item in transformed_documents_as_array, using enumerate to keep track of the current position
for counter, doc in enumerate(transformed_documents_as_array):
    # construct a dataframe
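    # get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, use get_feature_names() instead
    tf_idf_tuples = list(zip(vectorizer.get_feature_names_out(), doc))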
    one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)

    # output to a csv using the enumerated value for the filename
    one_doc_as_df.to_csv(output_filenames[counter])

For BSC #54: Mallory and the Dream Horse, the top three terms are Nina (a little girl involved in the book’s babysitting sub-plot), Pax (the horse Mallory rides), and Lauren (Mallory’s equitation instructor), but by themselves they don’t help much with classifying this text. If you look in the top 10, though, you’ve got riding (#5), horse (#6), and horses (#8). In the top 25, there are lessons (#13), saddle (#15), riders (#17), reins (#18), stable (#19), canter (#21), and bridle (#25). It’s looking pretty horsey in here.

In BSC #57: Dawn Saves the Planet, we’ve got recycling (#2), planet (#5), ecology (#7), pollution (#10), garbage (#11), recycle (#12), styrofoam (#13), recycled (#20), and carton (#25).

BSC #110: Abby and the Bad Sport has coach (#3), soccer (#4), goal (#7), goalie (#8), players (#13), field (#15), referee (#17), defense (#18), cleats (#20), player (#21), kickers (#23), and benched (#24). You might not get the bad sportsmanship out of this, but there’s clearly some soccer afoot.

What about books with a less obvious theme? There are some other terms that might throw you off, but you could probably come to the conclusion that art plays a meaningful role in BSC #12: Claudia and the New Girl with sculpture (#4), sculpt (#5), portfolio (#12), gallery (#22), … despite hydrant (#15), vacuum (#19), and inanimate (#20). Indeed, the aforementioned new girl is into art, just like Claudia.

If I were thinking of some distinctive words for BSC #87: Stacey and the Bad Girls, what would come to mind would be “shoplifting”, “concert”, “alcohol”, and “wine”. But the top 25 terms are almost all names – including the band whose concert they go see (#7 U4Me) and the department store where the shoplifting takes place (#10 Bellair). There are also trains (#19) and escalator (#24). “Concert” does the best of my terms at #40. “Alcohol” is #80, between “camera” and “rosebud”. “Shoplift” is #118, between “bikes” and “creature”. And “wine” is down at #1002, in the company of “sniffle” and “bees”. So don’t get too comfortable with the assumption that TF-IDF will get you to basically the same set of terms that a human would think of. Plot salience and distinctive content aren’t the same as distinctive frequency distribution.

BSC #83: Stacey vs. the BSC features Stacey being duplicitous, along with the inter-babysitter drama that ultimately leads to the misbehavior described above for BSC #87, but you can’t see it in the top 25 terms, which feature a lot of names, various instances of onomatopoeia (“clack”, “clomp”, and “plink”), piano, fiesta, talent, twinkle, recital, cheese, and jukebox. There’s something to this: Dawn hides behind a jukebox spying on Stacey after she sneaks out on a date. And Charlotte plays the piano at the BSC talent show. Score three for TF-IDF! Even if it’s fixating on objects, at least they’re plot-significant objects. So what’s up with the cheese? I don’t have a good explanation, but it comes up a lot, between Jamie’s macaroni and cheese, extra pepperoni and cheese on a pizza, multiple references to cream cheese, cheese and crackers, a fiesta burger (there’s the “fiesta” from our TF-IDF results) with melted cheese… maybe ghostwriter Peter Lerangis had a cheese craving while writing it?

TF-IDF for text comparison

Close-reading a distant reading method as a proxy for looking at the “topic” of individual texts is one way you can use the TF-IDF output. But you can also use it to compare texts at scale, by substituting in the TF-IDF vectorizer (with the IDF turned on this time) as your vectorizer of choice when trying out Euclidean and cosine distance.

The TF-IDF vectorizer has some optional parameters for dropping words. You can drop words that appear in too many documents with max_df: max_df = 0.9 means “ignore all words that appear in more than 90% of the documents”, or you can give it a specific number of documents with max_df = 100, for “ignore all words that appear in more than 100 documents”. You can get rid of words that appear too infrequently with min_df (e.g. min_df = 0.1 means “ignore all words that appear in fewer than 10% of the documents”). In this case, we’ll keep everything by not using those parameters, but you can play with them on your own corpora to see how your results change when you remove super-high frequency words (which, in the Baby-Sitters Club corpus, would get rid of both words like “the” and “a”, and the main characters’ names) or super-low frequency words (like the names of characters in the “very special episode” books).
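To see what those two parameters do, here's a minimal sketch of the same filtering logic in plain Python, without scikit-learn. The four tiny "documents" and the function name are made up for illustration; with the real vectorizer you would pass max_df and min_df as arguments to TfidfVectorizer instead of writing anything like this yourself.

```python
def filter_by_document_frequency(documents, max_df=1.0, min_df=0.0):
    """Keep words whose share of documents falls within [min_df, max_df].

    Both thresholds are proportions here (0.0 to 1.0), mirroring the
    float form of scikit-learn's max_df and min_df.
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Use a set so each word counts once per document
        for word in set(doc.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return sorted(
        word for word, df in doc_freq.items()
        if min_df * n_docs <= df <= max_df * n_docs
    )

# Hypothetical mini-corpus
documents = [
    "the club met on monday",
    "the club met kristy",
    "the club called claudia",
    "the horse ate hay",
]

print(filter_by_document_frequency(documents, max_df=0.9))
print(filter_by_document_frequency(documents, min_df=0.5))
```

With max_df=0.9, “the” disappears because it shows up in all four documents. With min_df=0.5, only the words found in at least two of the four documents survive.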

Note: Remember, I wrote this before I had any idea at all about the problems with my code that triggered this book's subplot. If this were a horror-themed choose-your-own-adventure book, at this point you might read something like this: "If only you could hear the screaming voices of the readers as you write this description of max_df. 'CHECK YOUR CODE, YOU MADE THIS MISTAKE WITH YOUR FIRST EUCLIDEAN AND COSINE DISTANCE EXAMPLES!' But you cannot hear them. And so you remain ignorant of this fact for a few weeks longer. Turn the page..."

So let’s do Euclidean and cosine distance using the TF-IDF vectorizer with IDF set to true, and see how it compares to the other ways of comparing text that we’ve tried so far.

tfidf_comparison_output_euclidean = pd.DataFrame(squareform(pdist(transformed_documents_as_array, metric='euclidean')), index=filekeys, columns=filekeys)
tfidf_comparison_output_euclidean
000c_the_summer_before 001c_kristys_great_idea 002c_claudia_and_the_phantom_phone_calls 003c_the_truth_about_stacey 004c_mary_anne_saves_the_day 005c_dawn_and_the_impossible_three 006c_kristys_big_day 007c_claudia_and_mean_jeanine 008c_boy_crazy_stacey 009c_the_ghost_at_dawns_house ... m36c_kristy_and_the_cat_burglar pc1c_staceys_book pc2c_claudias_book pc3c_dawns_book pc4c_mary_annes_book pc5c_kristys_book pc6c_abbys_book serr1c_logans_story serr2c_logan_bruno_boy_babysitter serr3c_shannons_story
000c_the_summer_before 0.000000 1713.377124 1612.664014 1525.489173 1453.845926 1570.101303 1604.147402 1847.247725 1800.437885 1710.247038 ... 1807.055544 1576.812993 1524.863481 1841.737653 1655.444614 1951.887865 2226.032081 1852.239496 1998.037080 1618.645066
001c_kristys_great_idea 1713.377124 0.000000 656.876129 693.424693 731.237464 833.934799 592.260494 807.904332 666.805535 700.374581 ... 793.715920 970.020362 764.818214 823.158470 820.134900 742.132204 1121.810769 612.029801 692.573773 678.874266
002c_claudia_and_the_phantom_phone_calls 1612.664014 656.876129 0.000000 654.199535 624.772553 768.131077 684.948211 826.635694 701.590667 561.214764 ... 765.810528 933.606450 711.930147 794.356012 759.672601 845.750393 1247.749650 656.836566 731.084665 678.342019
003c_the_truth_about_stacey 1525.489173 693.424693 654.199535 0.000000 657.986068 868.969005 765.823335 896.167881 833.564880 805.355785 ... 929.845534 840.377792 825.819142 904.715197 863.168246 955.456784 1257.747669 811.001693 903.973220 781.667999
004c_mary_anne_saves_the_day 1453.845926 731.237464 624.772553 657.986068 0.000000 809.385573 755.198772 810.354784 848.343456 788.580283 ... 940.317742 912.258691 743.237317 870.767868 756.292327 948.797634 1290.469268 768.933180 907.305837 705.113058
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
pc5c_kristys_book 1951.887865 742.132204 845.750393 955.456784 948.797634 1075.300399 801.210096 937.559093 700.968345 756.699179 ... 788.909197 847.675280 698.988424 531.788097 593.998782 0.000000 897.519505 557.059034 550.384043 663.065114
pc6c_abbys_book 2226.032081 1121.810769 1247.749650 1257.747669 1290.469268 1380.230234 1163.994242 1211.355418 1075.755779 1150.326653 ... 1171.521844 1201.992446 1119.228644 960.208555 1076.537478 897.519505 0.000000 1004.879762 998.848888 1028.663383
serr1c_logans_story 1852.239496 612.029801 656.836566 811.001693 768.933180 905.439562 686.409319 849.635862 529.782108 590.123503 ... 732.222472 905.294036 685.926803 639.025930 698.016291 557.059034 1004.879762 0.000000 380.573170 595.515698
serr2c_logan_bruno_boy_babysitter 1998.037080 692.573773 731.084665 903.973220 907.305837 1000.733519 799.570264 907.885413 624.207563 662.355458 ... 737.885533 966.952976 817.489597 683.566913 731.316018 550.384043 998.848888 380.573170 0.000000 710.167240
serr3c_shannons_story 1618.645066 678.874266 678.342019 781.667999 705.113058 823.740518 618.676474 862.356924 679.167645 612.246382 ... 843.615081 771.620534 517.044586 567.419139 669.632308 663.065114 1028.663383 595.515698 710.167240 0.000000

192 rows × 192 columns

tfidf_comparison_output_euclidean.to_csv('tfidf_comparison_output_euclidean.csv')
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(tfidf_comparison_output_euclidean)
#Displays the image
plt.show()

Euclidean distance using TF-IDF

Okay. Now let’s try cosine distance with the TF-IDF vectorizer!

tfidf_comparison_output_cosine = pd.DataFrame(squareform(pdist(transformed_documents_as_array, metric='cosine')), index=filekeys, columns=filekeys)
tfidf_comparison_output_cosine
000c_the_summer_before 001c_kristys_great_idea 002c_claudia_and_the_phantom_phone_calls 003c_the_truth_about_stacey 004c_mary_anne_saves_the_day 005c_dawn_and_the_impossible_three 006c_kristys_big_day 007c_claudia_and_mean_jeanine 008c_boy_crazy_stacey 009c_the_ghost_at_dawns_house ... m36c_kristy_and_the_cat_burglar pc1c_staceys_book pc2c_claudias_book pc3c_dawns_book pc4c_mary_annes_book pc5c_kristys_book pc6c_abbys_book serr1c_logans_story serr2c_logan_bruno_boy_babysitter serr3c_shannons_story
000c_the_summer_before 0.000000 0.049702 0.054073 0.051052 0.042593 0.066249 0.047594 0.084053 0.054140 0.055126 ... 0.078758 0.052050 0.025529 0.050768 0.039348 0.039441 0.134936 0.039834 0.054325 0.033089
001c_kristys_great_idea 0.049702 0.000000 0.041635 0.040160 0.043012 0.055829 0.034892 0.071974 0.050777 0.053545 ... 0.068351 0.091338 0.061610 0.078550 0.074399 0.057600 0.151661 0.040424 0.048307 0.050714
002c_claudia_and_the_phantom_phone_calls 0.054073 0.041635 0.000000 0.038151 0.033520 0.049937 0.045442 0.067641 0.045910 0.030220 ... 0.057745 0.081921 0.049582 0.058456 0.056754 0.056046 0.160539 0.032136 0.034044 0.044850
003c_the_truth_about_stacey 0.051052 0.040160 0.038151 0.000000 0.037266 0.063910 0.052678 0.072323 0.058126 0.057566 ... 0.078612 0.063166 0.061440 0.067455 0.066535 0.063049 0.143516 0.045493 0.052056 0.053615
004c_mary_anne_saves_the_day 0.042593 0.043012 0.033520 0.037266 0.000000 0.054919 0.049548 0.055513 0.057334 0.052644 ... 0.077906 0.073005 0.047290 0.057083 0.047145 0.055903 0.146569 0.033976 0.047257 0.040245
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
pc5c_kristys_book 0.039441 0.057600 0.056046 0.063049 0.055903 0.079451 0.051951 0.098401 0.059321 0.055968 ... 0.061025 0.050849 0.038389 0.035875 0.029800 0.000000 0.125509 0.043983 0.047927 0.040212
pc6c_abbys_book 0.134936 0.151661 0.160539 0.143516 0.146569 0.166886 0.142048 0.175624 0.151161 0.154212 ... 0.158685 0.140245 0.136574 0.127076 0.136553 0.125509 0.000000 0.146407 0.155288 0.122145
serr1c_logans_story 0.039834 0.040424 0.032136 0.045493 0.033976 0.054189 0.040279 0.081729 0.034173 0.034312 ... 0.056283 0.072282 0.044419 0.054464 0.053285 0.043983 0.146407 0.000000 0.019467 0.036182
serr2c_logan_bruno_boy_babysitter 0.054325 0.048307 0.034044 0.052056 0.047257 0.062007 0.051839 0.091300 0.045317 0.038675 ... 0.051081 0.077285 0.061465 0.062424 0.053968 0.047927 0.155288 0.019467 0.000000 0.048875
serr3c_shannons_story 0.033089 0.050714 0.044850 0.053615 0.040245 0.055060 0.038415 0.081161 0.051596 0.040566 ... 0.076593 0.056528 0.027769 0.034361 0.049065 0.040212 0.122145 0.036182 0.048875 0.000000

192 rows × 192 columns

tfidf_comparison_output_cosine.to_csv('tfidf_comparison_output_cosine.csv')
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(tfidf_comparison_output_cosine)
#Displays the image
plt.show()

Cosine distance using TF-IDF

There’s less difference between the Euclidean and cosine distance when using a TF-IDF vectorizer (that actually uses the “-IDF” in “TF-IDF”) than the word count vectorizer. So what happens when we try to run cosine distance using TF-IDF on chapter 2’s?

ch2dir = '/Users/qad/Documents/dsc_chapters/ch2'
os.chdir(ch2dir)
# Use the glob library to create a list of file names, sorted alphabetically
# Alphabetical sorting will get us the books in numerical order
filenames = sorted(glob.glob("*.txt"))
# Parse those filenames to create a list of file keys (ID numbers)
# You'll use these later on.
filekeys = [f.split('/')[-1].split('.')[0] for f in filenames]

# Create a TfidfVectorizer instance with the parameters you need
vectorizer = TfidfVectorizer(input="filename", stop_words=None, use_idf=True, norm=None, max_features=1000)
# Run the vectorizer on your list of filenames to create your TF-IDF scores
# Use the toarray() function so that SciPy will accept the results
ch2_tfidf = vectorizer.fit_transform(filenames).toarray()
ch2_cosine_tfidf = pd.DataFrame(squareform(pdist(ch2_tfidf, metric='cosine')), index=filekeys, columns=filekeys)
ch2_cosine_tfidf
001c_kristys_great_idea_2 002c_claudia_and_the_phantom_phone_calls_2 003c_the_truth_about_stacey_2 004c_mary_anne_saves_the_day_2 005c_dawn_and_the_impossible_three_2 006c_kristys_big_day_2 007c_claudia_and_mean_jeanine_2 008c_boy_crazy_stacey_2 009c_the_ghost_at_dawns_house_2 010c_logan_likes_mary_anne_2 ... m30c_kristy_and_the_mystery_train_2 m31c_mary_anne_and_the_music_box_secret_2 m32c_claudia_and_the_mystery_in_the_painting_2 m33c_stacey_and_the_stolen_hearts_2 m34c_mary_anne_and_the_haunted_bookstore_2 m35c_abby_and_the_notorius_neighbor_2 m36c_kristy_and_the_cat_burglar_2 serr1c_logans_story_2 serr2c_logan_bruno_boy_babysitter_2 serr3c_shannons_story_2
001c_kristys_great_idea_2 0.000000 0.246074 0.151527 0.188421 0.173291 0.172800 0.224098 0.181280 0.236114 0.247249 ... 0.213941 0.254544 0.186659 0.224199 0.203184 0.165524 0.612455 0.200885 0.223696 0.296085
002c_claudia_and_the_phantom_phone_calls_2 0.246074 0.000000 0.199677 0.205898 0.160327 0.240095 0.176995 0.236811 0.224528 0.221839 ... 0.201332 0.243469 0.169802 0.193133 0.190436 0.163584 0.592890 0.178847 0.163631 0.295182
003c_the_truth_about_stacey_2 0.151527 0.199677 0.000000 0.158336 0.168554 0.171509 0.187947 0.182646 0.173739 0.174705 ... 0.191784 0.219820 0.158270 0.178785 0.163439 0.126376 0.598587 0.157084 0.179818 0.245696
004c_mary_anne_saves_the_day_2 0.188421 0.205898 0.158336 0.000000 0.202878 0.211559 0.230825 0.221566 0.183476 0.175910 ... 0.248089 0.268533 0.212441 0.236738 0.225911 0.198259 0.592314 0.206621 0.239792 0.276399
005c_dawn_and_the_impossible_three_2 0.173291 0.160327 0.168554 0.202878 0.000000 0.220103 0.168376 0.190539 0.252589 0.257517 ... 0.157010 0.206133 0.143678 0.158212 0.165593 0.126658 0.615353 0.152453 0.155432 0.294697
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
m35c_abby_and_the_notorius_neighbor_2 0.165524 0.163584 0.126376 0.198259 0.126658 0.197771 0.169933 0.198773 0.193097 0.205686 ... 0.112034 0.116003 0.089299 0.090349 0.082690 0.000000 0.606522 0.103207 0.091476 0.249500
m36c_kristy_and_the_cat_burglar_2 0.612455 0.592890 0.598587 0.592314 0.615353 0.594343 0.622589 0.618557 0.607806 0.601490 ... 0.645133 0.654663 0.628074 0.636853 0.617428 0.606522 0.000000 0.622800 0.629522 0.643610
serr1c_logans_story_2 0.200885 0.178847 0.157084 0.206621 0.152453 0.241002 0.234209 0.220610 0.228961 0.242113 ... 0.129060 0.154466 0.115005 0.121168 0.150209 0.103207 0.622800 0.000000 0.096860 0.266514
serr2c_logan_bruno_boy_babysitter_2 0.223696 0.163631 0.179818 0.239792 0.155432 0.234046 0.214840 0.235139 0.219557 0.249190 ... 0.116504 0.120264 0.097254 0.110505 0.101533 0.091476 0.629522 0.096860 0.000000 0.263784
serr3c_shannons_story_2 0.296085 0.295182 0.245696 0.276399 0.294697 0.265536 0.265658 0.301320 0.266622 0.261669 ... 0.257334 0.286673 0.262540 0.271069 0.258751 0.249500 0.643610 0.266514 0.263784 0.000000

167 rows × 167 columns

ch2_cosine_tfidf.to_csv('ch2_tfidf.csv')
#Defines the size of the image
plt.figure(figsize=(100, 100))
#Increases the label size so it's more legible
sns.set(font_scale=3)
#Generates the visualization using the data in the dataframe
ax = sns.heatmap(ch2_cosine_tfidf)
#Displays the image
plt.show()

Cosine distance for chapter 2 using TF-IDF

Differences in chapter length do still matter (longer chapters probably have more distinct words, each of which will have a value that will go into the calculation of the overall score for that chapter). That said, length matters a lot less than with the word count vectorizer… and a lot more than when you’re using word frequencies. What this gets us, mostly, is a fairly clear picture of when chapter 2s morph into the Home of the Tropes, and a very clear picture of when those tropes occur in chapter 3 rather than chapter 2 (the super-light lines in the otherwise-dark visualization).
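Here's a small sketch of why length still sneaks in, using plain Python and made-up word-score vectors (none of these numbers come from the corpus, and the function names are placeholders). Cosine distance ignores a text that is simply "more of the same", but it does notice when a longer text brings in words the shorter one doesn't have.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

# Hypothetical scores for four words in three chapters
short_chapter = [2.0, 1.0, 0.0, 0.0]
twice_as_long = [4.0, 2.0, 0.0, 0.0]     # same words, every score doubled
longer_new_words = [4.0, 2.0, 3.0, 3.0]  # doubled, plus words the short chapter lacks

print("Euclidean, same words doubled:", euclidean_distance(short_chapter, twice_as_long))
print("Cosine, same words doubled:", cosine_distance(short_chapter, twice_as_long))
print("Cosine, extra distinct words:", cosine_distance(short_chapter, longer_new_words))
```

The doubled chapter is far away by Euclidean distance but essentially at zero cosine distance, because it points in the same direction. The chapter with extra distinct words points somewhere else, so its cosine distance is clearly above zero.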

What does this tell us about chapter 2?

Text distance metrics and things like TF-IDF are at their most useful when you’ve got a very large corpus, and/or one that you don’t know well. Imagine you’ve never read a Baby-Sitters Club book (perhaps that’s not hard for some of you!) and someone hands you a corpus of 250 text files and tells you to go find something interesting about them. Without having to read them at all, you could discover the things we’ve talked about here. The California Diaries sub-series is really different! There’s something weird going on with these super-repetitive chapter 2s!

But before you get too excited, it’s worth checking with someone who does know the corpus well, if such a person exists. (With something like Twitter data, there probably isn’t anyone who’s read every tweet in the corpus you’ve collected, but if your data is collected based on a hashtag, you might be able to find someone to talk to from the community that uses that hashtag.) In the case of the Baby-Sitters Club, anyone who’s read the books can tell you that the chapter 2 phenomenon is well-known. So what new insights are these distance metrics providing?

To be honest, in the case of chapter 2s, I think the answer is “not much”. As much as I love trying new methods to see what will happen, I should’ve seen where this was going and been more confident in the choice I’d made to use Scott’s 6-gram tool as the way to tackle the chapter 2 question. That approach got us something new, surfacing (albeit with some noise) a set of tropes repeated through the corpus, and showing that the highest amounts of repetition tend to happen among the works of a single ghostwriter. It’s not a shock, but it feels like some kind of contribution – and more of a contribution than just quantifying how much more similar the chapter 2s are compared to the other chapters.

We might be able to do something with these text comparison methods in the future – like using them as a jumping-off point for exploring the clusters around books 30-48 and 83-101. But sometimes you discover that you’ve spent a lot of time trying something that’s the wrong tool for the job. Or, as this DH choose-your-own-adventure book might conclude, “You close the Jupyter notebook. You may not have any meaningful results, but you’ve written some code that works, and you can use it another day, for another project. To be continued…”

Read until the very end

You probably close the book at that point, feeling dissatisfied with your reading experience.

But sometimes, in frustration and annoyance, you keep flipping pages even after the book ends. And sometimes, in those very last pages of your choose-your-own-adventure book, there’s an advertisement for a forthcoming book that catches your eye with promises of future adventures. And the same thing can happen in DH, when a collaborator points out something you’ve missed.

I wasn’t happy with how this DSC book ended, but I was resigned to ending it with a shrug. Sometimes projects work out that way. But when I ran it by our Associate Data-Sitter to make sure I hadn’t missed anything, Mark managed to convince me to keep turning pages: “Look, this is really the point before things get interesting: you’ve established a baseline that shows computational methods, based only on relative word frequency, can replicate an important aspect of the book that is evident at the level of reading, the ‘chapter 2 phenomenon’.”

He had a point – I hadn’t really thought about the significance of what it meant to be able to computationally find things we already know. The response doesn’t need to be, “Yeah, we knew that”, but maybe instead it can be, “Cool – let’s add the ‘chapter 2 phenomenon’ to the list of things our current computational methods really can allow us to find in texts, unlike other things we’re still working out computationally.”

“The exciting stuff is what happens next,” Mark added. “What features, in particular, are responsible for these similarities? And do these features change over time or between different groups of books? Also, you could move the same way to visualizations: what about a network where each chapter 2 is a node, and you connect it to the most similar other chapter 2’s based on a similarity threshold? You could easily wind up showing again that this varies by ghostwriter. Or you could find something else, which would be super cool. I’d be most interested in the chapter 2’s that were definitely chapter 2’s (that is, they took part in the same convention, not where chapter 3 is actually “chapter 2”), but whose language was LEAST similar – what are they doing differently?”
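As a rough sketch of what Mark's network idea could look like in code: treat each chapter 2 as a node, and draw an edge between two chapters whenever their distance falls below a threshold. The six distances below are copied from the chapter 2 TF-IDF cosine distance table earlier in this book; the 0.15 threshold and the function name are assumptions chosen purely for illustration, not values from the analysis.

```python
# Pairwise cosine distances (TF-IDF, chapter 2), copied from the table earlier in this book
distances = {
    ("m35c_abby_and_the_notorius_neighbor_2", "serr1c_logans_story_2"): 0.103207,
    ("m35c_abby_and_the_notorius_neighbor_2", "serr2c_logan_bruno_boy_babysitter_2"): 0.091476,
    ("serr1c_logans_story_2", "serr2c_logan_bruno_boy_babysitter_2"): 0.096860,
    ("m35c_abby_and_the_notorius_neighbor_2", "m36c_kristy_and_the_cat_burglar_2"): 0.606522,
    ("m36c_kristy_and_the_cat_burglar_2", "serr1c_logans_story_2"): 0.622800,
    ("m36c_kristy_and_the_cat_burglar_2", "serr2c_logan_bruno_boy_babysitter_2"): 0.629522,
}

def edges_below_threshold(pairwise_distances, threshold):
    """Return (node_a, node_b, distance) for every pair closer than the threshold."""
    return sorted(
        (a, b, d) for (a, b), d in pairwise_distances.items() if d < threshold
    )

threshold = 0.15  # hypothetical cutoff; try different values on your own data
edges = edges_below_threshold(distances, threshold)

# Work out which chapters ended up with no connections at all
nodes = {name for pair in distances for name in pair}
connected = {name for a, b, _ in edges for name in (a, b)}

for a, b, d in edges:
    print(a, "<->", b, d)
print("Unconnected:", sorted(nodes - connected))
```

An edge list like this is the kind of thing you could hand to a network visualization tool, and the unconnected chapters are one possible starting point for the "what are they doing differently?" question.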

And so the very last page of this DH choose-your-own-adventure book reads: You add a handful of new research questions to your list of ideas for your corpus. They already number too many for you to get through in a decade, even with the help of six friends. But it doesn’t matter. Your brain is whirring away at trying to piece together the code for how to tackle this one. You know it’ll fuel your insomnia, but for this moment, you don’t mind. Your research question is fun again!

Acknowledgements

This book has been a journey spanning more than six months, and not one I could’ve done alone.

First, I’m grateful to Scott Enderle for sharing the code from the ill-fated project. It’s great code, for what it does, and it works. And his patience for all my questions and confusion was incredibly generous.

Similarly, a special thanks to our Associate Data-Sitter, Mark Algee-Hewitt, for answering random questions from me throughout the summer, and for his incredibly thorough and thoughtful read of the draft of this piece.

As usual, editing this Data-Sitters Club book was a collective effort. But thank you, in particular, to Katia Bowers for calling me on it when I got too far into my own head and started rambling incoherently. And to Anouk Lang for helping refine my improved ending into something that involved less magic, and more realistic 90’s children’s literature paratext.

And finally, thank you to Zoe LeBlanc and John R. Ladd for saving the day when I was stumped with vectorizer problems that brought the publication of this book to a screeching halt. I never would have figured it out alone.

Suggested Citation

Dombrowski, Quinn. “DSC #8: Text-Comparison-Algorithm-Crazy Quinn.” Jupyter Notebook version. The Data-Sitters Club, October 21, 2020. https://github.com/datasittersclub/dsc8.