Another kind of character

Another kind of character#

Hide code cell content
# HIDDEN
# The standard set of libraries we need
import numpy as np
import matplotlib.pyplot as plt

# Make plots look a little bit more fancy
plt.style.use('fivethirtyeight')

# The standard library for data in tables
import pandas as pd
pd.set_option('mode.copy_on_write', True)

# A tiny function to read a file directly from a URL
from urllib.request import urlopen

def read_url(url):
    return urlopen(url).read().decode()

This page is largely derived from Another_Kind_Of_Character of the UC Berkeley course - see the license file on the main website.

Hide code cell content
# HIDDEN
# Read the text of Pride and Prejudice, split into chapters.
book_url = 'http://www.gutenberg.org/cache/epub/42671/pg42671.txt'
book_text = read_url(book_url)
# Break the text into Chapters
book_chapters = book_text.split('CHAPTER ')
# Drop the first "Chapter" - it's the Project Gutenberg header
book_chapters = book_chapters[1:]

In some situations, the relationships between quantities allow us to make predictions. This text will explore how to make accurate predictions based on incomplete information and develop methods for combining multiple sources of uncertain information to make decisions.

As an example of visualizing information derived from multiple sources, let us first use the computer to get some information that would be tedious to acquire by hand. In the context of novels, the word “character” has a second meaning: a printed symbol such as a letter or number or punctuation symbol. Here, we ask the computer to count the number of characters and the number of periods in each chapter of Pride and Prejudice.

# In each chapter, count the number of all characters;
# Also count the number of periods.
chars_periods = pd.DataFrame.from_dict({
        'Number of chars in chapter': [len(s) for s in book_chapters],
        'Number of periods': np.char.count(book_chapters, '.')
    })

Here are the data. Each row of the table corresponds to one chapter of the novel and displays the number of characters as well as the number of periods in the chapter. Not surprisingly, chapters with fewer characters also tend to have fewer periods, in general – the shorter the chapter, the fewer sentences there tend to be, and vice versa. The relation is not entirely predictable, however, as sentences are of varying lengths and can involve other punctuation such as question marks.

chars_periods
Number of chars in chapter Number of periods
0 4613 59
1 4420 63
2 9746 106
3 6045 54
4 5390 61
... ... ...
56 9672 72
57 13976 131
58 13952 149
59 9020 87
60 26357 271

61 rows × 2 columns

In the plot below, there is a dot for each chapter in the book. The horizontal axis represents the number of periods and the vertical axis represents the number of characters.

plt.figure(figsize=(6, 6))
plt.scatter(chars_periods['Number of periods'],
            chars_periods['Number of chars in chapter'],
            color='darkblue')
<matplotlib.collections.PathCollection at 0x77a011159550>
../_images/ec62672c18eab51afd370dc35fb7769bae0ef755b62f42627c02f5ae30b97b53.png

Notice how the blue points are roughly clustered around a straight line.

Now look at all the chapters that contain about 100 periods. The plot shows that those chapters contain about 10,000 characters to about 15,000 characters, roughly. That’s about 90 to 150 characters per period.

Indeed, it appears from looking at the plot that the chapters tend to have somewhere between 100 and 150 characters between periods, as a very rough estimate. Perhaps Jane Austen was announcing something familiar to us now: the original 140-character limit of Twitter.

Note

This page has content from the Another_Kind_Of_Character notebook of an older version of the UC Berkeley data science course. See the Berkeley course section of the license file.