Project Gutenberg e-Text Sample

Project Gutenberg Random e-Text Sample

We took a random sample of online e-texts at Project Gutenberg using their random text feature. Then, using tools available online, we analyzed the first chapter of each text and recorded the readability (grade level), lexical density, and semicolon^* frequency (measured as the number of semicolons per 100 sentences).

While it may be a stretch to say that this body of literature is representative of all English language texts, it is nonetheless fascinating to look at the data. Moreover, we believe the data can give us at least a very rough sense of some general patterns of written English (in the same manner that shadows cast on the wall of a cave impart a vague sense of the object itself).

Perhaps the main benefit of conducting this "research," if the reader will allow such a term to be applied to this effort, is the joy of discovery and the fruitful questions it brings forth.

The experiment below, which we have tried to describe in as much detail as possible, can be repeated by our readers using Project Gutenberg and other online tools.

Some details of the experiment are a bit technical and the reader should not feel that they must read and understand them all right away. What is important is to understand the main points. If the reader requires further details, they may re-read more carefully later.

^*Why semicolons? Because semicolons are the most feared punctuation on Earth.

Sample Methodology

Our sample was taken by repeatedly drawing random e-books from Project Gutenberg. For each draw, we refreshed the random-text page and chose the first English-language text to appear in the resulting list. We then analyzed the first chapter (or other natural text subdivision) for readability, lexical density, and semicolon usage.
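For readers who wish to repeat the measurement, semicolon frequency (semicolons per 100 sentences) can be sketched in a few lines. This is a minimal illustration, not the online tool we used: the function name and the naive sentence splitter are our own assumptions, and abbreviations such as "Mr." will be counted as sentence endings.

```python
import re


def semicolons_per_100_sentences(text: str) -> float:
    """Return the number of semicolons per 100 sentences in `text`."""
    # Naive assumption: any run of '.', '!' or '?' ends a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return 100.0 * text.count(";") / len(sentences)


# Made-up example: 3 semicolons in 3 sentences.
sample = "It was late; the rain fell. He waited. Nobody came; nobody called; he left."
print(semicolons_per_100_sentences(sample))
```

A more careful version would use a proper sentence tokenizer, since the result is only as good as the sentence count.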

We adhered to the following practices when taking the sample:

1) Publications such as lists, recipe books, and poetry were not considered.

2) Texts whose author could not be identified, such as folk tales of unknown origin and authorship, were also not considered.

3) Short stories (which we defined as texts of 20,000 words or fewer) were considered in their entirety, regardless of whether the text was subdivided into chapters.

4) Prefaces and other text preceding the main body of work were not considered.

5) If a novel or novella-length text had no clear subdividing structure, we did not consider the text.

6) Texts which were English translations were not considered.
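The practices above can be summarized as a small eligibility check. This is only a sketch of how a reader might encode the rules. The `Candidate` record, its field names, and the set of excluded kinds are hypothetical, and in practice we applied the rules by hand.

```python
from dataclasses import dataclass
from typing import Optional

SHORT_STORY_MAX_WORDS = 20_000  # threshold from rule 3
EXCLUDED_KINDS = {"list", "recipe book", "poetry"}  # examples named in rule 1


@dataclass
class Candidate:
    """Hypothetical description of one randomly drawn e-text."""
    kind: str                 # e.g. "novel", "short story", "poetry"
    author: Optional[str]     # None when no author can be attributed
    word_count: int
    has_subdivisions: bool    # chapters or another natural subdivision
    is_translation: bool


def is_eligible(c: Candidate) -> bool:
    """Apply rules 1, 2, 5 and 6; return True if the text may be sampled."""
    if c.kind in EXCLUDED_KINDS:      # rule 1
        return False
    if not c.author:                  # rule 2
        return False
    if c.is_translation:              # rule 6
        return False
    # Rule 5: longer texts need a clear subdividing structure.
    if c.word_count > SHORT_STORY_MAX_WORDS and not c.has_subdivisions:
        return False
    return True


def portion_to_analyze(c: Candidate) -> str:
    """Rule 3: short stories are read whole; otherwise the first chapter."""
    # Rule 4 (skipping prefaces) is applied when locating the first chapter.
    if c.word_count <= SHORT_STORY_MAX_WORDS:
        return "entire text"
    return "first chapter"


story = Candidate("short story", "A. Writer", 6_500, False, False)
print(is_eligible(story), portion_to_analyze(story))
```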

Readability was measured as the median of the Gunning fog, Flesch-Kincaid, SMOG, Coleman-Liau, and Automated readability indices.
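As a sketch of this step, suppose the five index scores for a chapter have already been obtained from a readability tool. The function name, the dictionary keys, and the scores below are made up for illustration.

```python
from statistics import median
from typing import Mapping


def combined_grade_level(scores: Mapping[str, float]) -> float:
    """Return the median of the individual readability index scores."""
    return median(scores.values())


# Hypothetical scores for a single first chapter.
chapter_scores = {
    "gunning_fog": 12.1,
    "flesch_kincaid": 10.4,
    "smog": 11.0,
    "coleman_liau": 9.8,
    "automated_readability": 10.9,
}
print(combined_grade_level(chapter_scores))
```

Taking the median rather than the mean keeps a single unusually high or low index from dominating the combined grade level.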

The year associated with each text is the earliest year of publication we could find (not necessarily the year it was written). If this information could not be found, we used the year the text was written. If neither year could be determined, we discarded the text from our sample.

Descriptive Sample Statistics

Our sample consisted of data points (first chapters). The descriptive statistics of our sample are:

	Average (Mean)	Median	Standard Deviation
Readability
Lexical Density
Semicolon Frequency

with the distribution of Readability, Lexical Density, and Semicolon Frequency plotted below.

[Figure: three histograms showing Percent of Sample against Readability (Grade Level), Lexical Density, and Semicolon Frequency.]
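Readers repeating the experiment can compute the same three summary statistics with Python's standard library. The sketch below uses made-up grade levels, not our data, and the `describe` helper is a hypothetical name.

```python
from statistics import mean, median, stdev


def describe(values):
    """Return the mean, median, and sample standard deviation of `values`."""
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation (divides by n - 1)
    }


# Made-up grade levels for five first chapters.
grade_levels = [8.0, 10.0, 11.0, 12.0, 14.0]
summary = describe(grade_levels)
print(summary)
```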

Are Readability, Lexical Density, or Semicolon Frequency Related?

Below, the reader will find a graph which compares readability, lexical density, and semicolon usage against one another and through time. The reader is encouraged to experiment with different combinations of these and to make conjectures based upon this data. The data may also be sorted by each author's gender and country of birth.


Inferences We Can Make

Using the sample statistics from the above table and the well-known Student's t-distribution, we can infer the following:

We can be 95% confident that the true average grade level (as measured by the Holt.Blue Text Analyzer) of first chapters of Project Gutenberg e-texts^* lies between 10.21 and 11.71.

We can be 95% confident that the true average lexical density (as measured by the Holt.Blue Text Analyzer) of first chapters of Project Gutenberg e-texts^* lies between 48.52% and 49.52%.

^*See the sample methodology for the kinds of texts that were excluded from consideration.
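Intervals of this kind follow the usual recipe: sample mean plus or minus a critical t value times the standard error. The sketch below shows the arithmetic on made-up data. The function name is hypothetical, and the critical value is passed in by hand (2.776 is the two-sided 95% value for 4 degrees of freedom, taken from a standard t table) because Python's standard library has no t-distribution.

```python
from math import sqrt
from statistics import mean, stdev


def t_confidence_interval(values, t_critical):
    """Return (low, high) for the mean: x_bar +/- t * s / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    center = mean(values)
    margin = t_critical * stdev(values) / sqrt(n)
    return center - margin, center + margin


# Made-up grade levels; n = 5, so there are 4 degrees of freedom.
grade_levels = [8.0, 10.0, 11.0, 12.0, 14.0]
low, high = t_confidence_interval(grade_levels, t_critical=2.776)
print(low, high)
```

With a larger sample the critical value shrinks toward 1.96 and the standard error falls, so the interval narrows.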

Concluding Remarks

The reader should be aware that any conclusion we make based upon the data we have collected holds, technically, only for Project Gutenberg e-texts, since our sample is drawn exclusively from them. If we want to make a general conclusion for all English language texts, we must either convince ourselves and our readers that the "corpus" of Project Gutenberg texts is itself representative of English language texts in general (which is certainly a point of debate), or find a way to randomly sample from "all" English language texts. Both approaches present their own problems and difficulties.

If we want to make a general conclusion, we must draw from a corpus of texts which is generally agreed upon or somehow proven to be representative of the body of literature we want to study.

This is not to say, however, that there is nothing that can be gleaned from the work we have done, or that we have toiled in vain. Our sample, as the reader is sure to have noticed, brings up many interesting questions for further research.

• Has the complexity (readability) of English texts been steadily declining through time?

• Has lexical density been steadily increasing through time?

• Is semicolon usage in general decline?

• Can we really predict lexical density from semicolon usage and vice versa? That is, do more semicolons generally mean lower lexical density?

• Are readability and lexical density related? That is, do lexically dense texts tend to be easier to read? Do our conclusions agree with other research?

• Do authors from the U.K. write less lexically dense texts than their North American counterparts? Do they use semicolons more often? Is the complexity of these texts higher than those authored by North Americans?

• Do female authors tend to write texts which are easier to read? Do they tend to use more lexical words than male authors?

As always, we encourage the reader to try these experiments themselves, and would be quite interested in any observations, questions, or hypotheses they might have. What relationships might exist between other variables such as word and sentence lengths?

Finally, we caution the reader with the well-known phrase that correlation does not imply causation. For example, although there may be a general relationship between semicolon frequency and lexical density, does using semicolons more frequently cause a writer to use fewer lexical words, or vice versa? We believe that anyone would be hard pressed to prove such a claim, provided that it is even true.
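For readers who want to put a number on such a relationship, the Pearson correlation coefficient is one common summary of how closely two variables follow a straight line. The sketch below uses made-up values, not our sample, and the function name is our own. A coefficient near -1 or 1 still says nothing about which variable, if either, causes the other.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return covariation / sqrt(spread_x * spread_y)


# Made-up data: semicolons per 100 sentences vs. lexical density (%).
semicolon_freq = [1, 2, 3, 4]
lexical_density = [8, 6, 4, 2]
print(pearson_r(semicolon_freq, lexical_density))
```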
