Plotting the classics#
Show code cell content
# The standard set of libraries we need
import numpy as np
import matplotlib.pyplot as plt
# Make plots look a little bit more fancy
plt.style.use('fivethirtyeight')
# The standard library for data in tables
import pandas as pd
pd.set_option('mode.copy_on_write', True)
# A tiny function to read a file directly from a URL
from urllib.request import urlopen
def read_url(url):
return urlopen(url).read().decode()
Note
This page has content from the Plotting_the_Classics notebook of an older version of the UC Berkeley data science course. See the Berkeley course section of the license file.
In this example, we will explore statistics for: Pride and Prejudice by Jane Austen. The text of any book can be read by a computer at great speed. Books published before 1923 are currently in the public domain, meaning that everyone has the right to copy or use the text in any way. Project Gutenberg is a website that publishes public domain books online. Using Python, we can load the text of these books directly from the web.
This example is meant to illustrate some of the broad themes of this text. Don’t worry if the details of the program don’t yet make sense. Instead, focus on interpreting the images generated below. Later sections of the text will describe most of the features of the Python programming language used below.
First, we read the text of of the book into the memory of the computer.
# Get the text for Pride and Prejudice
book_url = 'http://www.gutenberg.org/ebooks/42671.txt.utf-8'
book_text = read_url(book_url)
On the last line, Python gets the text of the book (read_url(book_url)
) and
gives it a name (book_text
). In Python, a name cannot contain any spaces,
and so we will often use an underscore _
to stand in for a space. The =
in
gives a name (on the left) to the result of some computation
described on the right.
A uniform resource locator or URL is an address on the Internet for some
content; in this case, the text of a book. The #
symbol starts a comment,
which is ignored by the computer but helpful for people reading the code.
Now we have the text attached to the name book_text
, we can ask Python to
show us how the text starts:
# Show the first 500 characters of the text
print(book_text[:500])
The Project Gutenberg eBook of Pride and Prejudice
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this
You might want to check this is the same as the text you see by opening the URL in your browser: http://www.gutenberg.org/ebooks/42671.txt.utf-8
Now we have the text in memory, we can start to analyze it. First we break the text into chapters. Don’t worry about the details of the code, we will cover these in the rest of the course.
# Break the text into Chapters
book_chapters = book_text.split('CHAPTER ')
# Drop the first "Chapter" - it's the Project Gutenberg header
book_chapters = book_chapters[1:]
We can show the first half-line or so for each chapter, by putting the chapters into a table. You will see these tables or data frames many times during this course.
# Show the first few words of each chapter in a table.
pd.DataFrame(book_chapters, columns=['Chapters'])
Chapters | |
---|---|
0 | I.\r\n\r\n\r\nIt is a truth universally acknow... |
1 | II.\r\n\r\n\r\nMr. Bennet was among the earlie... |
2 | III.\r\n\r\n\r\nNot all that Mrs. Bennet, howe... |
3 | IV.\r\n\r\n\r\nWhen Jane and Elizabeth were al... |
4 | V.\r\n\r\n\r\nWithin a short walk of Longbourn... |
... | ... |
56 | XV.\r\n\r\n\r\nThe discomposure of spirits, wh... |
57 | XVI.\r\n\r\n\r\nInstead of receiving any such ... |
58 | XVII.\r\n\r\n\r\n"My dear Lizzy, where can you... |
59 | XVIII.\r\n\r\n\r\nElizabeth's spirits soon ris... |
60 | XIX.\r\n\r\n\r\nHappy for all her maternal fee... |
61 rows × 1 columns
This is your first view of a data frame. Ignore the first column for now - it
is just a row number. The second column shows the first few characters of the
text in the chapter. The text starts with the chapter number in Roman
numerals. You might want to check the text from the link above to reassure
yourself that this comes from the text we downloaded. Next you see some odd
characters with backslashes, such as \r
and \n
. These are representations
of new lines, or paragraph marks. Last you will see the beginning of the
first sentence of the chapter.
Note
This page has content from the Plotting_the_Classics notebook of an older version of the UC Berkeley data science course. See the Berkeley course section of the license file.