We took a random sample of online e-texts at
Project Gutenberg
using their random text feature. Then, using our own website, we analyzed the first chapter
of each text and recorded the readability (grade level),
lexical density, and semicolon frequency
(measured as the number of semicolons per 100 sentences).
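As a rough illustration of how the last two quantities might be computed, here is a minimal sketch in Python. The naive sentence splitter and the small function-word list are our own simplifications, not the website's actual implementation, which may classify words differently.

```python
import re

# Minimal sketch: semicolon frequency (per 100 sentences) and lexical density.
# The function-word list below is a deliberately small approximation.
FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "if", "of", "to", "in", "on",
    "at", "by", "for", "with", "is", "are", "was", "were", "be", "been",
    "he", "she", "it", "they", "we", "you", "i", "that", "this", "not",
}

def semicolon_frequency(text: str) -> float:
    """Semicolons per 100 sentences (naive sentence splitting)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return 100.0 * text.count(";") / max(len(sentences), 1)

def lexical_density(text: str) -> float:
    """Percentage of words that are lexical (content) words."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    lexical = [w for w in words if w not in FUNCTION_WORDS]
    return 100.0 * len(lexical) / max(len(words), 1)

sample = "It was the best of times; it was the worst of times."
print(semicolon_frequency(sample), lexical_density(sample))
```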
While it may be a stretch to say that this body of literature is representative of all English
language texts, it is nonetheless fascinating to look at the data.
Moreover, we believe the data can give us at least a very rough sense of some general
patterns of written English
(in the same manner that shadows cast on the wall of a cave impart a vague sense of the object itself).
Perhaps the main benefit of conducting this "research," if the reader will allow such a term to be applied
to this effort, is the joy of discovery and the fruitful questions it brings forth.
The experiment below, which we have tried to describe in as much detail as possible,
can be repeated by our readers using the tools provided by Project Gutenberg and this website.
And we encourage our readers to do so!
Some details of the experiment are a bit technical, and the reader should not feel that they must read
and understand them all right away. What is important is to understand the main points.
If the reader requires further details, they may re-read the relevant parts more carefully later.
Our sample was taken by drawing random e-books from
Project Gutenberg.
For each data point, the webpage was refreshed and the first English-language text to appear in the random list
was chosen. The first chapter (or other natural text subdivision) of that text was then analyzed
for readability, lexical density, and semicolon usage using our homepage.
We adhered to the following practices when taking the sample:
1) Publications such as lists, recipe books, and poetry were not considered.
2) Texts that could not be attributed to an author (for example, folk tales of unknown origin
and authorship) were also not considered.
3) Short stories (which we defined as 20,000 words or fewer) were considered in their entirety,
regardless of whether or not the text was subdivided into chapters or otherwise.
4) Prefaces and other text preceding the main body of work were not considered.
5) If a novel or novella-length text had no clear subdividing structure, we did not consider the text.
6) Texts which were English translations were not considered.
Readability was measured as the median
of the Gunning fog index, Flesch-Kincaid grade level, SMOG index, Coleman-Liau index,
and Automated Readability Index.
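For reference, the commonly published formulas for these five indices, and the median we take of them, look roughly as follows. This is only a sketch: the constants are the standard published ones, and the website's syllable counting and treatment of "complex" words may differ.

```python
from statistics import median

def readability_grade(words, sentences, syllables, polysyllables, letters):
    """Median of five standard readability indices, from raw text counts.

    polysyllables = words of three or more syllables (used here as a stand-in
    for Gunning fog's "complex words"); letters approximates the character
    counts used by ARI and Coleman-Liau.
    """
    wps = words / sentences                          # average sentence length
    gunning_fog = 0.4 * (wps + 100.0 * polysyllables / words)
    flesch_kincaid = 0.39 * wps + 11.8 * (syllables / words) - 15.59
    smog = 1.0430 * (polysyllables * 30.0 / sentences) ** 0.5 + 3.1291
    coleman_liau = (0.0588 * (100.0 * letters / words)
                    - 0.296 * (100.0 * sentences / words) - 15.8)
    ari = 4.71 * (letters / words) + 0.5 * wps - 21.43
    return median([gunning_fog, flesch_kincaid, smog, coleman_liau, ari])
```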
The year associated with each text is the earliest year of publication we could find
(not necessarily the year it was written). If this information could not be found,
we used the year the text was written.
If none of this information could be determined, we discarded the text from our sample.
Descriptive Sample Statistics
Our sample consisted of data points (first chapters).
The descriptive statistics of our sample are summarized below,
with the distributions of Readability, Lexical Density, and Semicolon Frequency plotted underneath.
[Histograms of Readability (Grade Level), Lexical Density, and Semicolon Frequency, each shown as a percent of the sample]
Are Readability, Lexical Density,
and Semicolon Frequency Related?
Below, the reader will find a graph that compares readability,
lexical density, and semicolon usage against one another
and through time.
The reader is encouraged to play and experiment with different combinations of these
and to make conjectures based upon the data. The data may also be sorted by each author's gender
and country of birth.
Inferences We Can Make
Using the sample statistics from the table above and the well-known
Student's t-distribution,
we can infer the following:
We are 95% confident that the true average grade level
(as measured by this website) of first chapters of Project Gutenberg e-texts*
lies between 10.21 and 11.71.
We are 95% confident that the true average lexical density
(as measured by this website) of first chapters of Project Gutenberg e-texts*
lies between 48.52% and 49.52%.
*See the sampling methodology above for which kinds of texts were excluded from consideration.
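For readers who wish to reproduce such intervals from their own samples, the calculation is the standard two-sided t-interval. The sketch below assumes SciPy is available; the numbers in the usage example are hypothetical placeholders, not our sample statistics.

```python
from math import sqrt
from scipy import stats  # assumed available

def t_confidence_interval(mean, std_dev, n, confidence=0.95):
    """Two-sided confidence interval for the population mean (Student's t)."""
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    margin = t_crit * std_dev / sqrt(n)
    return mean - margin, mean + margin

# Hypothetical values for illustration only; substitute your own sample's
# mean, standard deviation, and size.
low, high = t_confidence_interval(mean=11.0, std_dev=3.0, n=60)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```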
Concluding Remarks
The reader should be aware that any conclusion we draw from the
data we have collected technically holds only for Project Gutenberg e-texts, since
our sample was drawn exclusively from them.
If we want to make a general conclusion for all English language texts,
we must either convince ourselves and our readers that the "corpus" of
Project Gutenberg texts is itself representative of English language texts in general
(which is certainly a point of debate), or find a way to randomly sample from "all" English
language texts. Both approaches present their own problems and difficulties:
in either case, we must draw from a corpus of texts that is generally agreed upon,
or somehow proven, to be representative of the body of literature we want to study.
This is not to say, however, that there is nothing that can be gleaned from the work we have done,
or that we have toiled in vain. Our sample, as the reader is sure to have noticed, brings up
many interesting questions for further research.
• Has the complexity (readability) of English texts been steadily declining through time?
• Has lexical density been steadily increasing through time?
• Is semicolon usage in general decline?
• Can we really predict lexical density from semicolon usage and vice versa? That is,
do more semicolons generally mean lower lexical density?
• Are readability and lexical density related?
That is, do lexically dense texts tend to be easier to read?
Do our conclusions agree with
other research?
• Do authors from the U.K. write less lexically dense texts than their North American
counterparts? Do they use semicolons more often?
Is the complexity of these texts higher than those authored by North Americans?
• Do female authors tend to write texts which are easier to read?
Do they tend to use more lexical words than male authors?
As always, we encourage the reader to try these experiments themselves, and
would be quite interested in any observations, questions, or hypotheses they might have.
What relationships might exist between other variables such as word and sentence lengths?
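One simple way to begin probing such questions is to compute a correlation coefficient between any two of the recorded variables. The sketch below uses hypothetical numbers purely for illustration; the real values would come from the reader's own sample.

```python
from statistics import correlation  # Python 3.10+

# Hypothetical per-chapter measurements, for illustration only.
semicolon_freq = [45.0, 12.0, 80.0, 33.0, 5.0, 61.0]
lexical_density = [47.5, 50.1, 46.2, 48.9, 51.3, 47.0]

r = correlation(semicolon_freq, lexical_density)
print(f"Pearson r = {r:.2f}")  # measures association only, not causation
```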
Finally, we caution the reader with the well-known phrase that
correlation does not imply causation.
For example, although there may be a general relationship between semicolon frequency and lexical density,
does using semicolons more frequently cause a writer to use fewer lexical words or vice versa?
We believe that anyone would be hard pressed to prove such a claim, assuming it is even true.