Comparing Lexical Density

A Tale of Two Densities:
Wikpedia vs. The Short Story

In this experiment we compare the lexical density of informative writing versus fiction. To simplify matters, we chose to look only at Wikipedia articles and short stories. The reason for this choice is that there are tools available which allow us to sample randomly from large bodies of literature representative of different styles of writing.

Using Wikipedia's random article feature, we pasted a random sample of 30 consecutive summaries of Wikipedia articles (the text before the contents section with citation marks removed) which were large enough to warrant having a "contents" section. This yielded an average lexical density of 59.92% as estimated by the Holt.Blue Text Analyzer. The results are shown below.

Next, we pasted a random sample of 30 consecutive first paragraphs of random short stories from a short-story website which allowed us to randomly draw short stories from their online collection. using their random story feature. This yielded an average lexical density of 49.43% as estimated by the text-analysis website we used. The results are shown below.

Note: The reader will notice that informative writing tends to be quite lean in term of pronouns. Also, the reader will notice that lexical density is simply the sum of the percentages of nouns, adjectives, verbs, and adverbs as stated in the definition of lexical density.

Comparing The Lexical Density of
Whole Articles and Short Stories

Finding the above analysis interesting, we decided it would be worth our while to analyze complete articles and complete short stories.

We analyzed complete Wikipedia articles and complete short stories from the short-story website. Every Wikipedia article and every short story was again drawn randomly using each website's respective random draw features.

The text analyzed for each Wikipedia article was all the text between the beginning of the article, which was defined to be the first bold word (this excluded disambiguation and issues sections), to the end of the article, which was defined to be where either the "references," "external links," or "See also," sections began. Analyzed text also included text from tables, and photo and figure captions. Due to time constraints, citation marks and other bracketed notations were not removed.

The descriptive statistics of our sample are:

	Average (Mean) L.D.	Median L.D.	Standard Deviation of L.D.
Wikipedia Articles
Short Stories

The histograms of both samples are shown below.

Percent of Sample
	Lexical Density Wikipedia Articles

Percent of Sample
	Lexical Density Short Stories

It is plainly visible from our histograms that there is quite a bit more variation among Wikipedia articles than short stories on the website we visited. In line with this observation, the reader will notice that the standard deviation for Wikipedia articles considerably higher than for short stories. We pose the question to the reader: is this true in general for short stories?

Inferences

Using the sample statistics from the above tables and the well-known student's t-distribution we can infer the following:

There is a 95% chance that the true average lexical density^* of Wikipedia articles lies between 55.29% and 57.63%.

There is a 95% chance that the true average lexical density^* of short stories on the short-story-website we used lies between 48.93% and 50.59%.

^*As measured by the Holt.Blue Text Analyzer.

Concluding Remarks

Lexical density, as measured by the Holt.Blue Text Analyzer, tends to be higher in informative writing than in fiction on the order of about 7%. Our informal observations pointed to lexical densities around 60% for informative articles such as Wikipedia articles and news articles and 50% for fiction. A more complete analysis which considered whole articles and short stories, however, revealed that that lexical densities for Wikipedia articles is likely somewhere between 55% and 58%.

An informal (non-random) sample of New York Times articles that we took yields similar results in the range of 58% for all the articles we looked at. We encourage the reader to take this analysis further. Is this true for news articles in general, or only New York Times articles? Is there a difference according to political spectrum, or British, Australian, Canadian, or U.S. news outlets?

Lastly, it is well-documented that lexical density tends to be lower in spoken English than in written English (Johansson, Ure). It would be very interesting to analyze interview transcripts or some other transcription of spoken English to verify this observation. Again, there is a wealth of tantalizing questions one could look at: does lexical density vary between interviews with celebrities, politicians, musicians, or visual artists? Maybe yes. Maybe no. That's for you, dear reader, to find out^*.

^*Update: We couldn't resist the temptation of trying this snippet ourselves.

Links and References

Holt.Blue Article: Lexical Density.

Johansson, V. (2008), Lexical diversity and lexical density in speech and writing: a developmental perspective, Working Papers 53, 61-79.

Ure, J. (1971), Lexical density and register differentiation. In G. Perren and J.L.M. Trim (eds), Applications of Linguistics, London: Cambridge University Press. 443-452.

A Tale of Two Densities: Wikpedia vs. The Short Story

Comparing The Lexical Density of Whole Articles and Short Stories

Inferences

Concluding Remarks

Links and References

A Tale of Two Densities:
Wikpedia vs. The Short Story

Comparing The Lexical Density of
Whole Articles and Short Stories