Every summer, Major League Baseball inducts a small number of players into the Hall of Fame, with each inductee giving a speech. This post analyzes the language of speeches by three recent Hall of Famers: Ken Griffey Jr., John Smoltz, and Greg Maddux.
I wanted to analyze a broader set, but I could not locate an archive of speeches. The texts of these three, though, were easy to find with a Google search.
Lexical Dispersion
One of the more interesting elements of text mining is lexical dispersion (that is, where words appear in a corpus of text). The lexical dispersion plots below show where in each speech Griffey, Smoltz, and Maddux mentioned the names of their teams. Interestingly, Maddux never actually said “Braves”, instead always referring to the team as Atlanta.
Lexical Dispersion for Ken Griffey Jr:
Lexical Dispersion for John Smoltz:
Lexical Dispersion for Greg Maddux:
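For readers who want to try this on their own text: a dispersion plot comes down to recording the token offset of each target word. The sketch below is one way to do that in plain Python and is not the code behind the plots above. The sample sentence is a placeholder rather than a quote from any speech, and tokenize, dispersion, and text_plot are hypothetical helper names.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def dispersion(tokens, targets):
    """Map each (lowercased) target word to the token offsets where it occurs."""
    wanted = {t.lower() for t in targets}
    positions = {t: [] for t in wanted}
    for i, tok in enumerate(tokens):
        if tok in wanted:
            positions[tok].append(i)
    return positions

def text_plot(tokens, targets, width=60):
    """Render a crude text strip per word: '|' marks roughly where it appears."""
    pos = dispersion(tokens, targets)
    n = max(len(tokens), 1)
    lines = []
    for word in targets:
        strip = ["."] * width
        for i in pos[word.lower()]:
            # Scale the token offset onto the fixed-width strip.
            strip[min(i * width // n, width - 1)] = "|"
        lines.append(f"{word:>12} {''.join(strip)}")
    return "\n".join(lines)

# Placeholder text for illustration only.
sample = "I loved Seattle. Later I played in Cincinnati, then back to Seattle."
tokens = tokenize(sample)
print(text_plot(tokens, ["Seattle", "Cincinnati"], width=20))
```

With a real transcript loaded into sample, the same call with a wider strip gives a rough text version of a dispersion plot. NLTK's Text.dispersion_plot draws a graphical version if that library is available.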
Uni-Grams
Another staple of text mining is inspecting the most frequently used words, which can help us quickly understand the high-level themes of a corpus of text. Unsurprisingly, Griffey mentioned his dad a lot; Smoltz uttered a variant of “thanks” more than 40 times; Maddux leaned on the word “first.”
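Counting uni-grams is simple enough to sketch with the standard library. The example below is an illustration under assumptions, not the code that produced the counts above: the stop-word list is a tiny hand-picked stand-in for a real one, top_words is a hypothetical helper name, and the sample sentence is made up.

```python
import re
from collections import Counter

# A deliberately small stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "i", "my", "me",
             "for", "you", "was", "is", "it", "that"}

def top_words(text, n=10, stopwords=STOPWORDS):
    """Return the n most frequent non-stop-word tokens with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)

# Made-up sample text for illustration only.
sample = "Thank you to my dad. My dad taught me the game, and I thank my dad for that."
print(top_words(sample, n=2))  # [('dad', 3), ('thank', 2)]
```

Note that this counts exact tokens, so “thank” and “thanks” would be tallied separately unless the words are stemmed first.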
Bi-Grams
A bi-gram is a pair of words that appear next to each other. A HoF speech is a pretty small corpus of text, so few bi-grams repeat. In fact, Maddux didn’t have enough repeated bi-grams to make a plot. In the graphs below, we see bi-grams like “spring training” or “work hard” or “Tommy John.” Nothing too surprising. Note that words are “stemmed”, so items like “spring training” and “spring training’s” are counted as one term.
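To make the stemming point concrete, here is a minimal sketch of bi-gram counting in Python. It assumes a very crude suffix-stripping stemmer (crude_stem, a hypothetical stand-in for a proper algorithm such as Porter's), a tiny stop-word list, and a made-up sample; it is not the code behind the graphs in this post.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "i", "my", "was", "is"}

def crude_stem(word):
    """Strip a possessive and one common suffix. Far rougher than a real stemmer."""
    if word.endswith("'s"):
        word = word[:-2]
    for suffix in ("ing", "ed", "s"):
        # Keep at least four characters so short words like "spring" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def top_bigrams(text, n=10, stopwords=STOPWORDS):
    """Count adjacent stemmed word pairs, dropping pairs that contain a stop word."""
    stems = [crude_stem(t) for t in re.findall(r"[a-z']+", text.lower())]
    pairs = zip(stems, stems[1:])
    counts = Counter(p for p in pairs if p[0] not in stopwords and p[1] not in stopwords)
    return counts.most_common(n)

# Made-up sample text for illustration only.
sample = "Spring training was hard. Spring training's lessons: work hard."
print(top_bigrams(sample, n=1))  # [(('spring', 'train'), 2)]
```

Both “spring training” and “spring training's” collapse to the pair (spring, train), which is why they land in a single bar. This sketch lets pairs span sentence boundaries, which a more careful version would avoid.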
Latent Dirichlet Allocation
Lastly, Latent Dirichlet Allocation (LDA) is a method for identifying underlying topics in a body of text. Again, each HoF speech is fairly short, so the topics will not be especially robust in this setting. The tables below show the top words in the five identified topics for each speech. Some make sense, while others are scattered. For example, the words in Griffey’s Topic 2 go together fairly well, while those in Topic 3 really do not.
LDA results for Griffey:
LDA results for Smoltz:
LDA results for Maddux:
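For the curious, the idea behind LDA can be sketched in a few dozen lines of plain Python using collapsed Gibbs sampling. This is a toy illustration, not the implementation used for the tables above. It assumes the text has already been split into several small token lists (for example, one per paragraph), and the function names, hyperparameters (alpha, beta), and toy documents are all made up for the example. In practice a library implementation is the better choice.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA. docs is a list of token lists.
    Returns one {word: count} mapping per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]                # tokens per (doc, topic)
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                              # tokens per topic
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts...
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # ...then resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word

def top_terms(topic_word, n=3):
    """List the n highest-count words in each topic (ties broken alphabetically)."""
    result = []
    for counts in topic_word:
        ranked = sorted(((w, c) for w, c in counts.items() if c > 0),
                        key=lambda wc: (-wc[1], wc[0]))
        result.append([w for w, _ in ranked[:n]])
    return result

# Toy documents for illustration only.
docs = [
    ["dad", "family", "mom", "dad", "kids"],
    ["pitch", "mound", "strike", "pitch", "bullpen"],
    ["family", "kids", "dad", "mom"],
    ["strike", "pitch", "mound", "bullpen"],
]
topics = lda_gibbs(docs, n_topics=2, n_iter=100, seed=1)
for k, words in enumerate(top_terms(topics)):
    print(f"Topic {k + 1}: {', '.join(words)}")
```

With so little text the sampled topics can shift from seed to seed, which mirrors the point above about short speeches producing less robust topics.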